Piotr Kołaczkowski

How a Single Line of Code Made a 24-core Server Slower Than a Laptop

Imagine you wrote a program for a pleasingly parallel problem, where each thread does its own independent piece of work, and the threads don’t need to coordinate except for joining the results at the end. Naturally, you’d expect it to run faster the more cores it gets. You benchmark it on a laptop first, and indeed it scales nearly perfectly across all 4 available cores. Then you run it on a big, fancy, multiprocessor machine, expecting even better performance, only to see it actually run slower than the laptop, no matter how many cores you give it. Uh. That is exactly what happened to me recently.

read more

Overhead of Returning Optional Values in Java and Rust

Some programming languages, like Java or Scala, offer more than one way to express the concept of a “lack of value”. Traditionally, a special null value is used to denote references that don’t reference any value at all. However, over time we have learned that using nulls is error-prone and causes problems like NullPointerException errors crashing a program at the most unexpected moment. Therefore, modern programming style recommends avoiding nulls wherever possible in favor of a much safer Option, Optional or Maybe data type (the name differs between languages, but the concept is the same). Unfortunately, optional values in Java are believed to come with a performance penalty. In this blog post, I try to answer whether that is true, and if the performance penalty really exists, how serious it is.
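One reason the answer differs between languages: in Rust, `Option` around a reference costs nothing at all, thanks to the compiler’s niche optimization. A minimal, verifiable sketch (not from the post itself):

```rust
use std::mem::size_of;

fn main() {
    // Niche optimization: references can never be null, so `None` is
    // stored in the forbidden all-zero bit pattern. `Option<&T>` is
    // therefore exactly the same size as `&T` — zero overhead.
    assert_eq!(size_of::<Option<&u64>>(), size_of::<&u64>());

    // A plain integer has no forbidden values, so `Option<u64>` needs
    // an extra discriminant, padded up to the integer's alignment.
    assert_eq!(size_of::<Option<u64>>(), 2 * size_of::<u64>());

    println!("Option<&u64>: {} bytes", size_of::<Option<&u64>>());
    println!("Option<u64>:  {} bytes", size_of::<Option<u64>>());
}
```

Java’s `Optional<T>`, by contrast, is an ordinary heap-allocated wrapper object, which is where a potential penalty could come from.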

read more

Ordering Requests to Accelerate Disk I/O

In an earlier post I showed how accessing data on an SSD in parallel can greatly improve read performance. However, that technique is not very effective for data stored on spinning drives; in some cases parallel access can even significantly deteriorate performance. Fortunately, there is a class of optimizations that helps a lot with HDDs: request ordering. By requesting data in the proper order, disk seek latency can be reduced by an order of magnitude. Since I introduced this optimization in fclones 0.9, fclones has become the fastest duplicate file finder I know of.
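The core idea can be sketched in a few lines: sort the pending requests by their physical position on the device before issuing them, so the disk head sweeps across the platter once instead of seeking back and forth. The `ReadRequest` type and offsets below are my own illustration (on Linux, real physical offsets can be queried with the FIEMAP ioctl), not the fclones implementation:

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::PathBuf;

// Hypothetical request type: a file path plus the starting physical
// offset of its data on the device.
struct ReadRequest {
    physical_offset: u64,
    path: PathBuf,
}

fn read_in_disk_order(mut requests: Vec<ReadRequest>) -> io::Result<Vec<Vec<u8>>> {
    // Sorting by physical offset turns scattered seeks into a single
    // sweep of the disk head across the platter.
    requests.sort_by_key(|r| r.physical_offset);
    requests
        .iter()
        .map(|r| {
            let mut buf = Vec::new();
            File::open(&r.path)?.read_to_end(&mut buf)?;
            Ok(buf)
        })
        .collect()
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    let a = dir.join("req_a.txt");
    let b = dir.join("req_b.txt");
    std::fs::write(&a, b"first on disk")?;
    std::fs::write(&b, b"second on disk")?;
    // Requests arrive in arbitrary order but are served disk-first.
    let results = read_in_disk_order(vec![
        ReadRequest { physical_offset: 4096, path: b.clone() },
        ReadRequest { physical_offset: 0, path: a.clone() },
    ])?;
    assert_eq!(results[0], b"first on disk");
    assert_eq!(results[1], b"second on disk");
    std::fs::remove_file(a)?;
    std::fs::remove_file(b)?;
    Ok(())
}
```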

read more

Estimating Benchmark Results Uncertainty

Physicists say that a measurement result given without an error estimate is worthless. This applies to benchmarking as well. We want to know not only how performant a computer program or system is, but also whether we can trust the performance numbers. This article explains how to compute uncertainty intervals and how to avoid some traps caused by applying commonly known statistical methods without validating their assumptions first.
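As a starting point, the most basic recipe is to report the mean together with its standard error. A minimal sketch (function name and sample data are my own; the post is precisely about when the assumptions behind these formulas, such as independent, identically distributed samples, actually hold):

```rust
// Returns (mean, standard error of the mean) for a set of samples.
fn mean_with_uncertainty(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Sample variance with Bessel's correction (divide by n - 1).
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    // Standard error of the mean: the expected deviation of the
    // sample mean from the true mean, shrinking as sqrt(n) grows.
    (mean, (var / n).sqrt())
}

fn main() {
    // Made-up benchmark timings, in milliseconds.
    let times = [10.1, 9.8, 10.4, 10.0, 9.7];
    let (m, err) = mean_with_uncertainty(&times);
    println!("{:.2} ± {:.2} ms", m, err);
}
```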

read more

Scalable Benchmarking with Rust Streams

In the previous post I showed how to use asynchronous Rust to measure throughput and response times of a Cassandra cluster. That approach works pretty well on a developer’s laptop, but it turned out not to scale to bigger machines. I hit a hard limit around 150k requests per second, and it wouldn’t go any faster regardless of the performance of the server. In this post I share a different approach that doesn’t have these scalability problems. With it, I was able to saturate a 24-core, single-node Cassandra server at 800k read queries per second from a single client machine.

read more

Benchmarking Apache Cassandra with Rust

Performance of a database system depends on many factors: hardware, configuration, database schema, amount of data, workload type, network latency, and many others. Therefore, one typically can’t tell the actual performance of such a system without measuring it first. In this blog post I describe how to build a benchmarking tool for Apache Cassandra from scratch in Rust, and how to avoid many pitfalls. The techniques I show are applicable to any system with an async API.

read more

In Defense of a Switch

Recently I came across a blog post whose author claims that, from the perspective of good coding practices, polymorphism is strictly superior to branching. The post makes general statements about how branching statements lead to unreadable, unmaintainable, inflexible code and how they are a sign of immaturity. In my opinion, however, the topic is much deeper, and in this post I try to discuss the reasons for and against branching objectively.
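To make the trade-off concrete, here is a toy example of the two styles (my own illustration, not taken from either post). The branching version keeps every variant visible in one place and makes adding a new *operation* trivial; the polymorphic version makes adding a new *variant* trivial instead:

```rust
// Branching style: one closed set of variants, matched exhaustively.
enum Shape {
    Circle { r: f64 },
    Square { s: f64 },
}

fn area(shape: &Shape) -> f64 {
    match shape {
        Shape::Circle { r } => std::f64::consts::PI * r * r,
        Shape::Square { s } => s * s,
    }
}

// Polymorphic style: the same behavior via dispatch on a trait.
trait HasArea {
    fn area(&self) -> f64;
}
struct Circle { r: f64 }
struct Square { s: f64 }
impl HasArea for Circle {
    fn area(&self) -> f64 { std::f64::consts::PI * self.r * self.r }
}
impl HasArea for Square {
    fn area(&self) -> f64 { self.s * self.s }
}

fn main() {
    let branched = area(&Shape::Square { s: 2.0 });
    let dispatched = Square { s: 2.0 }.area();
    assert_eq!(branched, dispatched); // both compute 4.0
}
```

Neither form is strictly superior; which one is more maintainable depends on which axis of the program changes more often.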

read more

Multiple Thread Pools in Rust

In the previous post, I showed how processing file data in parallel can either boost or hurt performance, depending on the workload and device capabilities. Therefore, in complex programs that mix tasks of different types using different physical resources, such as CPU, storage (HDD/SSD), or network I/O, a need may arise to configure the parallelism level of each task type separately. This is typically solved by scheduling tasks of different types on dedicated thread pools. In this post I show how to implement such a solution in Rust with Rayon.
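The post uses Rayon for this; as a dependency-free sketch of the underlying idea, here is a minimal channel-based pool (all names are mine), where each task type gets its own pool with its own thread count:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// A minimal fixed-size thread pool; Rayon's ThreadPoolBuilder gives
// you the production-quality version of this.
struct Pool {
    sender: mpsc::Sender<Box<dyn FnOnce() + Send>>,
    handles: Vec<thread::JoinHandle<()>>,
}

impl Pool {
    fn new(threads: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Box<dyn FnOnce() + Send>>();
        let receiver = Arc::new(Mutex::new(receiver));
        let handles = (0..threads)
            .map(|_| {
                let rx = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // Workers take jobs one at a time off a shared channel.
                    match rx.lock().unwrap().recv() {
                        Ok(job) => job(),
                        Err(_) => break, // channel closed: shut down
                    }
                })
            })
            .collect();
        Pool { sender, handles }
    }

    fn execute(&self, job: impl FnOnce() + Send + 'static) {
        self.sender.send(Box::new(job)).unwrap();
    }

    fn join(self) {
        drop(self.sender); // close the channel so workers exit
        for h in self.handles {
            h.join().unwrap();
        }
    }
}

fn main() {
    // Separate parallelism levels per resource type: a small pool for
    // CPU-bound work, a larger one for I/O-bound work.
    let cpu_pool = Pool::new(4);
    let io_pool = Pool::new(32);
    cpu_pool.execute(|| println!("crunching numbers"));
    io_pool.execute(|| println!("waiting on the disk"));
    cpu_pool.join();
    io_pool.join();
}
```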

read more

Performance Impact of Parallel Disk Access

One of the well-known ways of speeding up a data processing task is partitioning the data into smaller chunks and processing the chunks in parallel. Let’s assume we can partition the task easily, or the input data is already partitioned into separate files which all reside on a single storage device. Let’s also assume the algorithm we run on that data is simple enough that computation time is not a bottleneck. How much performance can we gain by reading the files in parallel? Can we lose any?
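The parallel variant of such an experiment can be sketched with scoped threads, one per file (my own minimal setup, assuming the file count is small):

```rust
use std::io;
use std::path::PathBuf;
use std::thread;

// Read a set of files concurrently, one thread per file.
fn read_files_parallel(paths: &[PathBuf]) -> Vec<io::Result<Vec<u8>>> {
    thread::scope(|s| {
        // Spawn all readers first so they run at the same time...
        let handles: Vec<_> = paths
            .iter()
            .map(|p| s.spawn(move || std::fs::read(p)))
            .collect();
        // ...then collect their results in the original order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let dir = std::env::temp_dir();
    let paths: Vec<PathBuf> = (0..4)
        .map(|i| dir.join(format!("chunk_{i}.bin")))
        .collect();
    for (i, p) in paths.iter().enumerate() {
        std::fs::write(p, vec![i as u8; 1024]).unwrap();
    }
    let results = read_files_parallel(&paths);
    assert!(results.iter().all(|r| r.as_ref().unwrap().len() == 1024));
    for p in &paths {
        std::fs::remove_file(p).unwrap();
    }
}
```

Whether this beats a sequential loop depends entirely on the storage device, which is what the post measures.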

read more