In this blog post, I compare the memory consumption of asynchronous and multi-threaded programming across popular languages: Rust, Go, Java, C#, Python, Node.js, and Elixir.
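To give a concrete feel for the setup, here is a minimal Rust sketch of my own (not the benchmark from the post): spawn a large number of concurrent workers either as OS threads or as Tokio tasks, and compare the resident memory of the process externally, e.g. with `ps`. The worker count and the use of Tokio are assumptions made only for this illustration.

```rust
// Minimal sketch: spawn N concurrent workers that just wait, either as
// OS threads or as Tokio tasks, then inspect the process RSS externally
// (e.g. `ps -o rss= -p <pid>`). Requires the `tokio` crate.
use std::time::Duration;

const N: usize = 10_000; // arbitrary worker count chosen for illustration

fn run_threads() {
    let handles: Vec<_> = (0..N)
        .map(|_| std::thread::spawn(|| std::thread::sleep(Duration::from_secs(10))))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}

fn run_tasks() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async {
        let handles: Vec<_> = (0..N)
            .map(|_| tokio::spawn(tokio::time::sleep(Duration::from_secs(10))))
            .collect();
        for h in handles {
            h.await.unwrap();
        }
    });
}

fn main() {
    match std::env::args().nth(1).as_deref() {
        Some("threads") => run_threads(),
        _ => run_tasks(),
    }
}
```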
A user opens an issue complaining that the server-side
application you developed for them frequently crashes with a “too many open files” error
under heavy load. What do you do?
Admin: “Just tell them to raise their file descriptor limits”.
Software developer: “No, no, hold my beer, I can fix it in the app”.
Imagine you wrote a program for a pleasingly parallel problem,
where each thread does its own independent piece of work,
and the threads don’t need to coordinate except to join their results at the end.
Obviously, you’d expect it to run faster the more cores it gets.
You benchmark it on a laptop first, and indeed you find it scales
nearly perfectly across all 4 available cores. Then you run it on a big, fancy multiprocessor
machine, expecting even better performance, only to see it actually run slower
than on the laptop, no matter how many cores you give it. Uh-oh. That happened to me just recently.
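For context, by a pleasingly parallel program I mean something with the shape of the following Rayon sketch (a stand-in example of mine, not the actual program from this story): every item is processed independently, and the results are only combined at the end.

```rust
use rayon::prelude::*;

// Stand-in for an expensive, fully independent piece of work.
fn work(i: u64) -> u64 {
    (0..1_000_000u64).fold(i, |acc, x| acc.wrapping_mul(31).wrapping_add(x))
}

fn main() {
    // Each item is processed independently; results are only joined at the end.
    let total: u64 = (0..1_000u64)
        .into_par_iter()
        .map(work)
        .reduce(|| 0, u64::wrapping_add);
    println!("{total}");
}
```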
Some programming languages like Java or Scala offer more than one way to express
the concept of “lack of value”. Traditionally, a special null value is used to denote
references that don’t point to any value at all. However, over time we
have learned that using nulls is very error-prone and can cause a lot of trouble, like
NullPointerException errors crashing a program at the most unexpected moment.
Therefore, modern programming style recommends avoiding nulls wherever possible
in favor of a much safer Option, Optional, or Maybe data type
(the name differs between languages, but the concept is the same).
Unfortunately, optional values in Java are believed to come with a
performance penalty. In this blog post, I’ll try to answer whether
that is true, and if the penalty really exists, how serious it is.
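As a reminder of the concept, here is what it looks like in Rust, where the type is called `Option` (the post itself deals with Java’s `java.util.Optional`, but the idea is identical): the possible absence of a value is part of the type, so the compiler forces callers to handle it.

```rust
// The absence of a value is encoded in the type, so the caller must
// handle the "no value" case explicitly.
fn find_user(id: u32) -> Option<String> {
    if id == 42 {
        Some("Alice".to_string())
    } else {
        None
    }
}

fn main() {
    match find_user(7) {
        Some(name) => println!("found {name}"),
        None => println!("no such user"), // no NullPointerException possible here
    }
}
```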
In the earlier post I showed how accessing data on
an SSD in parallel can greatly improve read performance. However, that technique
is not very effective for data stored on spinning drives. In some cases parallel access
can even degrade performance significantly. Fortunately, there is a class of optimizations
that helps a lot with HDDs: request ordering. By requesting data in the proper order,
disk seek latency can be reduced by an order of magnitude. Since I introduced that
optimization in fclones 0.9, it has become the
fastest duplicate file finder I know of.
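To illustrate the principle (a rough sketch of my own, not the actual implementation in fclones), one can sort pending reads by a key that approximates their physical position on disk before issuing them. Here I use the inode number, which on many filesystems correlates only loosely with data placement, but is already far better than random order:

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // Unix-only: exposes inode numbers
use std::path::PathBuf;

/// Read the given files in an order that reduces disk seeking.
/// Sorting by inode number is only a heuristic proxy for physical placement.
fn read_in_disk_order(paths: Vec<PathBuf>) -> io::Result<usize> {
    let mut keyed: Vec<(u64, PathBuf)> = paths
        .into_iter()
        .map(|p| -> io::Result<(u64, PathBuf)> {
            let ino = fs::metadata(&p)?.ino();
            Ok((ino, p))
        })
        .collect::<io::Result<_>>()?;
    keyed.sort_unstable_by_key(|(ino, _)| *ino);

    let mut total = 0;
    for (_, p) in keyed {
        total += fs::read(&p)?.len();
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let paths: Vec<PathBuf> = std::env::args().skip(1).map(PathBuf::from).collect();
    let bytes = read_in_disk_order(paths)?;
    println!("read {bytes} bytes");
    Ok(())
}
```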
Physicists say that a measurement result given without an error estimate is worthless. This applies
to benchmarking as well. We not only want to know how performant a computer program or a system is,
but we also want to know if we can trust the performance numbers. This article explains how to compute
uncertainty intervals and how to avoid some traps caused by applying commonly known
statistical methods without validating their assumptions first.
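As a simple example of such an uncertainty interval, the classic Student’s t confidence interval for the mean of $n$ independent measurements with sample mean $\bar{x}$ and sample standard deviation $s$ is:

$$ \bar{x} \;\pm\; t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} $$

The independence and approximate-normality assumptions behind this formula are exactly the kind of assumptions that need validating before such an interval can be trusted.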
In the previous post I showed how to use asynchronous
Rust to measure throughput and response times of a Cassandra cluster.
That approach works pretty well on a developer’s laptop, but it turned out it doesn’t scale to bigger machines.
I hit a hard limit at around 150k requests per
second, and it wouldn’t go any faster regardless of the performance of the server.
In this post I share a different approach that doesn’t have these scalability problems.
I was able to saturate a 24-core, single-node Cassandra server
at 800k read queries per second with a single client machine.
Performance of a database system depends on many factors: hardware, configuration,
database schema, amount of data, workload type, network latency, and many others.
first measuring it. Therefore, one typically can’t tell the actual performance of such a system without
first measuring it. In this blog post I’m describing how to build a benchmarking tool
for Apache Cassandra from scratch in Rust and how to avoid many pitfalls.
The techniques I show are applicable to any system with an async API.
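To give a flavor of the general technique (a generic sketch under my own assumptions, not the tool described in the post), the core of such a benchmark is an asynchronous loop that keeps a bounded number of requests in flight and records each request’s latency. `run_query` below is a hypothetical placeholder for a real async call to the system under test; the sketch needs the `tokio` and `futures` crates.

```rust
use std::time::{Duration, Instant};
use futures::stream::{self, StreamExt};

// Hypothetical placeholder for an async call to the benchmarked system,
// e.g. a query issued through an async database driver.
async fn run_query(_i: usize) -> Duration {
    let start = Instant::now();
    tokio::time::sleep(Duration::from_millis(1)).await; // pretend work
    start.elapsed()
}

#[tokio::main]
async fn main() {
    let requests = 10_000;
    let concurrency = 256; // how many requests are kept in flight at once

    let started = Instant::now();
    let latencies: Vec<Duration> = stream::iter(0..requests)
        .map(run_query)
        .buffer_unordered(concurrency) // bounded concurrency, not one-at-a-time
        .collect()
        .await;
    let elapsed = started.elapsed();

    let throughput = requests as f64 / elapsed.as_secs_f64();
    let mean_latency = latencies.iter().sum::<Duration>() / requests as u32;
    println!("throughput: {throughput:.0} req/s, mean latency: {mean_latency:?}");
}
```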
Recently I came across a blog post
whose author claims that, from the perspective of good coding practices, polymorphism is strictly superior to branching.
The post makes general statements about how branching statements lead to unreadable, unmaintainable, inflexible code and
how they are a sign of immaturity. However, in my opinion, the topic is much deeper, and in this post
I try to discuss the reasons for and against branching objectively.
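To make the two styles concrete, here is a minimal Rust sketch of mine (not an example from the post being discussed) expressing the same behavior first with branching and then with polymorphism:

```rust
// Branching version: one function, explicit match on the variant.
enum Shape {
    Circle { r: f64 },
    Rect { w: f64, h: f64 },
}

fn area(shape: &Shape) -> f64 {
    match shape {
        Shape::Circle { r } => std::f64::consts::PI * r * r,
        Shape::Rect { w, h } => w * h,
    }
}

// Polymorphic version: each variant carries its own behavior.
trait HasArea {
    fn area(&self) -> f64;
}

struct Circle { r: f64 }
struct Rect { w: f64, h: f64 }

impl HasArea for Circle {
    fn area(&self) -> f64 { std::f64::consts::PI * self.r * self.r }
}
impl HasArea for Rect {
    fn area(&self) -> f64 { self.w * self.h }
}

fn main() {
    let s = Shape::Circle { r: 1.0 };
    println!("{}", area(&s));

    let shapes: Vec<Box<dyn HasArea>> =
        vec![Box::new(Circle { r: 1.0 }), Box::new(Rect { w: 2.0, h: 3.0 })];
    for s in &shapes {
        println!("{}", s.area());
    }
}
```

Roughly speaking, the polymorphic version makes adding a new variant a local change, while the branching version makes adding a new operation a local change, which is one of the classic trade-offs between the two styles.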
In the previous post, I showed how processing
file data in parallel can either boost or hurt performance
depending on the workload and device capabilities. Therefore, in complex programs that mix tasks
of different types using different physical resources, such as CPU, storage (HDD/SSD),
or network I/O, a need may arise to configure the level of parallelism differently for each task type.
This is typically solved by scheduling tasks of different types on dedicated thread pools.
In this post I show how to implement such a solution in Rust with Rayon.
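As an illustration of the general idea (a minimal sketch with arbitrary pool sizes, not necessarily how the post structures its final solution), Rayon allows building several independently sized thread pools and running different kinds of tasks on each:

```rust
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

fn main() {
    // One wide pool for CPU-bound work, one narrow pool for disk-bound work.
    // The pool sizes are arbitrary assumptions for the sake of the example.
    let cpu_pool = ThreadPoolBuilder::new().num_threads(8).build().unwrap();
    let io_pool = ThreadPoolBuilder::new().num_threads(2).build().unwrap();

    // Parallel iterators used inside `install` run on that pool's threads.
    let checksum: u64 = cpu_pool.install(|| {
        (0..1_000_000u64)
            .into_par_iter()
            .map(|x| x.wrapping_mul(2_654_435_761))
            .sum()
    });

    let bytes = io_pool.install(|| {
        // I/O-heavy work is confined to the small pool, so it cannot
        // occupy every thread of the machine while blocked on the disk.
        std::fs::read("/etc/hostname").map(|d| d.len()).unwrap_or(0)
    });

    println!("checksum = {checksum}, read {bytes} bytes");
}
```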
One of the well-known ways of speeding up a data processing task is partitioning the data into smaller
chunks and processing the chunks in parallel. Let’s assume we can partition the task easily, or that the input data is already
partitioned into separate files, all residing on a single storage device. Let’s also assume the algorithm we run on that
data is simple enough that computation time is not a bottleneck. How much performance can we gain by reading the files in parallel?
Can we lose any?
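The workload in question has roughly the following shape (a simplified sketch of my own, not the benchmark code from the post): read a set of files in parallel and do only trivial computation on the bytes. The level of parallelism can then be varied, e.g. via the `RAYON_NUM_THREADS` environment variable, to see how throughput changes.

```rust
use rayon::prelude::*;
use std::path::PathBuf;
use std::time::Instant;

fn main() {
    // Paths of the already-partitioned input files, passed as arguments.
    let paths: Vec<PathBuf> = std::env::args().skip(1).map(PathBuf::from).collect();

    let start = Instant::now();
    // Read all files in parallel; the cheap "computation" is just a byte count.
    let total_bytes: usize = paths
        .par_iter()
        .map(|p| std::fs::read(p).map(|data| data.len()).unwrap_or(0))
        .sum();
    let elapsed = start.elapsed();

    println!("read {total_bytes} bytes in {elapsed:?}");
}
```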