Piotr Kołaczkowski

Estimating Benchmark Results Uncertainty

Physicists say that a measurement result given without an error estimate is worthless. This applies to benchmarking as well. We not only want to know how performant a computer program or a system is, but we also want to know if we can trust the performance numbers. This article explains how to compute uncertainty intervals and how to avoid some traps caused by applying commonly known statistical methods without validating their assumptions first.

In my previous posts I focused on building a tool that can measure performance of a database system. That simple tool gave a single value as its output. If you ran it multiple times, you’d notice, that even when measuring the performance of the same server again and again, the results are slightly different each time.

Sources of Variance

There are several sources of variance of performance:

There are probably many other reasons that I missed. While it is worth spending some time on minimizing the biggest sources of noise, it may be uneconomical or even outright impossible to get rid of all of the noise. Therefore your results will always have some degree of variance.

Basic Statistics

Because the measured value behaves randomly, it can be modelled by a random variable. For further reference, let’s call it XX. We can characterize the performance of the system by the statistical distribution of XX. Of course, we never know the true distribution, but we can get some estimate of it by measuring (observing) the value NN times and applying some exciting maths to them to get a few useful numbers.

Programmers very frequently compute basic satictics such as:

Standard Error of the Mean

The standard error sXˉs_{\bar{X}} of the mean is a useful metric that describes how accurately we estimated the expected value. We can estimate it from the data by using the following formulas:

s^=1N1i=1N(xixˉ)2\hat{s} = \sqrt{ \frac{1}{N - 1} \sum_{i = 1}^{N} (x_i - \bar{x})^2 } s^Xˉ=s^N\hat{s}_{\bar{X}} = \frac{\hat{s}}{\sqrt{N}}

where xˉ\bar{x} denotes the mean of the observations and ss denotes the standard deviation from the mean. You don’t have to remember that formula nor code it manually, because probably every statistical library in most programming languages offers it out-of-the-box.

Why is the standard error such a useful metric? If we average a large enough number of observations, by the Central Limit Theorem, the sample average Xˉ\bar{X} will be distributed very closely to the normal distribution N(μ,σ2)\mathcal{N}(\mu, \sigma^2) where μxˉ\mu \approx \bar{x} and σs^Xˉ\sigma \approx \hat{s}_{\bar{X}}. The really good news is that typically N=10N = 10 is already “large enough” and also the distribution of XX doesn’t need to be normal – it is enough it has finite mean and variance. BTW: If your benchmark results have infinite variance, then you may have a bigger problem than calculating the error of the mean.

This conclusion allows us to obtain confidence intervals for xˉ\bar{x}:

lower bound upper bound probability EX\mathrm{E}X lies within the bounds
xˉs^Xˉ\bar{x} - \hat{s}_{\bar{X}} xˉ+s^Xˉ\bar{x} + \hat{s}_{\bar{X}} 0.6827
xˉ2s^Xˉ\bar{x} - 2 \hat{s}_{\bar{X}} xˉ+2s^Xˉ\bar{x} + 2 \hat{s}_{\bar{X}} 0.9545
xˉ3s^Xˉ\bar{x} - 3 \hat{s}_{\bar{X}} xˉ+3s^Xˉ\bar{x} + 3 \hat{s}_{\bar{X}} 0.9973

The normal distributuion probability function is another nice thing that’s available in most statistical libraries, so you could compute custom confidence intervals for any probability you wish easily.

There is a Trap

Computing has this nice property that, unlike in biology, geography, medical or social sciences, it makes it easy and cheap to obtain huge samples. Every quadrupling of the sample size makes the confidence interval twice narrower for the same probability, because there is N\sqrt{N} in the denominator of the formula for the standard error, and the enumerator depends only on the distribution of XX, but not the number of observations. So, theoretically, we could estimate the expected value with any accuracy we wish, and we’re only limited by the time we’re allowed to run the test.

Let’s take for example, that we want to compute the average response time of the server. We issue NN requests and calculate their mean. The local single node Casssandra server installed on my development laptop can easily handle ~180k requests per second. Whoa! This makes my N=180000N = 180000 for a benchmark that takes only a single second. I can easily collect 1 million data points in less than 10 seconds. 1 million data points should make the error of the average response time estimate very small.

Indeed, with 99.9% confidence interval and N=1N = 1 million, I obtained a result like this:

Avg response time: 2.45 ± 0.01 ms

What a lovely tiny error! If it takes only a few seconds, why not run it a few more times to make really sure the result is correct?

Avg response time: 2.51 ± 0.01 ms
Avg response time: 2.48 ± 0.01 ms
Avg response time: 2.57 ± 0.01 ms
Avg response time: 2.42 ± 0.01 ms
Avg response time: 2.47 ± 0.01 ms

Wait… did I say 99.9% confidence interval? Why are my numbers fluctuating so much?! The differences definitely exceed the error interval.

After double checking the implementation and running more tests to verify I’m not just getting those ugly results by an extreme unlikely bad luck, I found that…

Assumptions Matter

There is a nice joke about biologists and statisticians going to a conference by train.

3 biologists and 3 statisticians were going to a scientific conference in the same car compartment. The biologists talk about how lucky they were to get a 5% discount on their train tickets. The statisticians responded: “We’ve paid even less – we’ve bought only one ticket for all three of us!”
“How so?” – the biologists got confused. “You’ll see”. When the statisticians notice the ticket controller appear in the car, all of them quickly go to a lavatory and close the door. When the controller sees the lavatory is occupied, he knocks on the door, the statisticians give him the (single) ticket through a small gap below the door. Biologists are amazed seeing this.

After the conference, the biologists buy only one ticket for the return home and want to use the statisticians’ method. They meet with the same 3 statisticians in the compartment and tell them their plan. The statisticians respond: “Well, that’s a nice idea, but we’ve already improved our method. See, this time we’ve bought no tickets at all!”. “How so?!”. “You’ll see”. When the controller appears, the biologists rush to the lavatory. The statisticians follow them and knock on the door. The biologists give the ticket they’ve bought, the statisticians take it and lock themselves in another lavatory. The conclusion: don’t use statistical methods you don’t understand!

The Central Limit Theorem holds if results of each experiment XiX_i have the same distribution and are independent from each other. Unfortunately often neither of these assumption holds in the real world.

Imagine the benchmark code is executed on a computer with a CPU that has some kind of “turbo mode” i.e. the operating system boosts its frequency if the CPU is cool, but then lowers it once it heats up. When running a sequence of experiments on such a machine, the results obtained at the beginning of the sequence would likely have different expected value than the results obtained near the end. The longer you run the experiment and the more data points you collect, the lower the mean performance would be, because the initial data points that were collected when the CPU frequency was boosted would matter less relative to all the collected data. The best way of fixing that problem is to run the benchmarks in an environment that doesn’t change performance in response to the load generated by the benchmark. However, after turning off any turbo modes and locking the CPU frequency to a constant value, the variability of the results was still a lot larger than what the standard error predicted.

Let’s look closer at where the formula for sxˉs_{\bar{x}} is coming from. Assume we draw a sample {x1,x2,...,xN}\{x_1, x_2, ..., x_N\} from a sequence of random variables {X1,X2,...,XN}\{X_1, X_2, ..., X_N\}. A sequence of such variables is called a stochastic process. Let’s assume variables XiX_i are eqally distributed and have a finite expected value and variance:

EX1=EX2=...=EXN=EX(1)\mathrm{E}X_1 = \mathrm{E}X_2 = ... = \mathrm{E}X_N = \mathrm{E}X \tag{1} VarX1=VarX2=...=VarXN=VarX(2)\mathrm{Var}X_1 = \mathrm{Var}X_2 = ... = \mathrm{Var}X_N = \mathrm{Var}X \tag{2}

Let Xˉ\bar{X} be the random variable denoting the mean of all the observations:

Xˉ=1Ni=1NXi\bar{X} = \frac{1}{N}\sum_{i = 1}^{N}X_i

From (1) we can conclude that our mean will have the same expected value as the expected value of XX:

EXˉ=1Ni=1NEXi=1NNEX=EX\mathrm{E}\bar{X} = \frac{1}{N}\sum_{i = 1}^N\mathrm{E}X_i = \frac{1}{N}N\mathrm{E}X = \mathrm{E}X

What about variance? We can use (2) in a similar way, however one must be careful when taking constants out from under the variance, because they become squared:

VarXˉ=Var(1Ni=1NXi)=1N2Var(i=1NXi)\mathrm{Var}\bar{X} = \mathrm{Var}\left(\frac{1}{N}\sum_{i = 1}^{N}X_i\right) = \frac{1}{N^2}\mathrm{Var}\left(\sum_{i = 1}^{N}X_i\right)

If we additionally assume that XiX_i is independent from XjX_j for iji \neq j, the variance of a sum of variables is equal to the sum of their variances, hence we obtain:

VarXˉ=1N2i=1NVarXi=NN2VarX=1NVarX(3)\mathrm{Var}\bar{X} = \frac{1}{N^2}\sum_{i = 1}^{N}\mathrm{Var}X_i = \frac{N}{N^2}\mathrm{Var}X = \frac{1}{N} \mathrm{Var}X \tag{3}

Finally, standard deviation ss is the square root of the variance:

sXˉ=1NVarX=sNs_{\bar{X}} = \frac{1}{\sqrt{N}}\sqrt{\mathrm{Var}X} = \frac{s}{\sqrt{N}}

which is the formula that’s used to compute the standard error we mentioned earlier.

However, if Xi{X_i} is not independent from Xj{X_j} for any iji \neq j, then the formula (3) is incorrect, because the variance of a sum is not the sum of variances! The general formula for summing variances is a bit more complex and contains additional covariance terms:

Var(i=1NXi)=i=1Nj=1NCov(Xi,Xj)=i=1NVarXi+2i=2Nj<iCov(Xi,Xj)\mathrm{Var}\left(\sum_{i = 1}^{N}X_i\right) = \sum_{i = 1}^N\sum_{j = 1}^N \mathrm{Cov}(X_i, X_j) = \sum_{i = 1}^{N}\mathrm{Var}X_i + 2 \sum_{i = 2}^N\sum_{j < i} \mathrm{Cov(X_i, X_j)}

After dividing by N2N^2 we get:

VarXˉ=1N2i=1Nj=1NCov(Xi,Xj)=1N(VarX+2Ni=2Nj<iCov(Xi,Xj))(4)\mathrm{Var}\bar{X} = \frac{1}{N^2}\sum_{i = 1}^N\sum_{j = 1}^N \mathrm{Cov}(X_i, X_j) = \frac{1}{N}\left(\mathrm{Var}X + \frac{2}{N} \sum_{i = 2}^N\sum_{j < i} \mathrm{Cov(X_i, X_j)}\right) \tag{4}

We can see that any dependency between experiments can cause the second term to become non-zero. If variables XiX_i are positively correlated, the covariance term will be positive and the actual variance of the mean would be larger. In extreme case, if all data points were 100% positively correlated with each other, that is Cov(Xi,Xj)=Var(X)\mathrm{Cov}(X_i, X_j) = \mathrm{Var}(X) for all ii and jj, formula (4) would become:

VarXˉ=VarX\mathrm{Var}\bar{X} = \mathrm{Var}X

This means dependence between the results makes the effective size of the sample smaller than NN. In the extreme case it can make effective NN equal 1, regardless of the number of the observations we make.

There is a good thought experiment that can explain this phenomenon without formulas. Imagine you’re periodically checking the outside temperature to compute the yearly average. In experiment A you take a measurement once per second. In experiment B, you take a measurement every 10 ms, so you get 100 more data points than in experiment A. But can you really say the expected error of the yearly average is really 100\sqrt{100} times lower in experiment B than in A? You can’t conclude that, because there is an extremely strong correlation between measurements taken with such a high frequency and many of your data points are just exactly the same. By sampling every 10 ms, you artificially increased the number of data points, but due to strong autocorrelation, the amount of real information you collected didn’t really increase (much).

This problem exists in benchmarking as well, although it is probably not as strong as with the weather. Imagine that during the test, the operating system decides to put the benchmarking process on hold for a while and perform another unrealated task. Or a JVM decides to start GC. Or Cassandra starts flushing a memtable. If our data point is just a single database request, then several hundreds requests executed at that time can be affected. This makes response times not truly independent from each other.

Better Estimate Of Variance Of The Mean

Let’s put formula (4) into practice. We need to find a way to compute these Cov(Xi,Xj)\mathrm{Cov}(X_i, X_j) terms. The nested sum looks a bit scary though, and estimating covariance from data itself requires O(N)O(N) steps, so altogether it would be O(N3)O(N^3).

We can simplify it a bit by adding an assumption that our stochastic process XX is weakly stationary, i.e. its basic statistical properties like mean, variance and autocovariance do not change when shifted in time. In this case not only EXi=EXi+τ\mathrm{E}X_i = \mathrm{E}X_{i + \tau} and VarXi=VarXi+τ\mathrm{Var}X_i = \mathrm{Var}X_{i + \tau} but also Cov(Xi,Xj)=Cov(Xi+τ,Xj+τ)\mathrm{Cov}(X_{i}, X_{j}) = \mathrm{Cov}(X_{i + \tau}, X_{j + \tau}) for all values of ii and τ\tau.

For convenience, let’s define autocovariance with lag kk as:

γX(k)=Cov(X1,X1+k)=Cov(X2,X2+k)=...=Cov(XNk,XN)\gamma_X(k) = \mathrm{Cov}(X_{1}, X_{1 + k}) = \mathrm{Cov}(X_{2}, X_{2 + k}) = ... = \mathrm{Cov}(X_{N - k}, X_{N}) γX(0)=VarX\gamma_X(0) = \mathrm{Var}X

And now we can rewrite formula (4) into:

VarXˉ=1N(γX(0)+2Nk=1N1(Nk)γX(k))(5)\mathrm{Var}\bar{X} = \frac{1}{N}\left(\gamma_X(0) + \frac{2}{N} \sum_{k = 1}^{N-1} (N - k)\gamma_X(k)\right) \tag{5}

Note the term (Nk)(N - k) that replaces the inner loop in (4). There are (Nk)(N - k) pairs (Xi,Xi+k)(X_i, X_{i+k}) in our series XX.

We can estimate variance and autocovariance from the data:

γ^X(k)=1Ni=1Nk(xixˉ)(xi+kxˉ)\hat{\gamma}_X(k) = \frac{1}{N}\sum_{i = 1}^{N - k}(x_i - \bar{x})(x_{i+k} - \bar{x})

Finally, the autocovariance-corrected formula for empirical error of the mean is:

s^Xˉ=1N(γ^X(0)+2Nk=1N1(Nk)γ^X(k))(6)\hat{s}_{\bar{X}} = \sqrt{\frac{1}{N}\left(\hat{\gamma}_X(0) + \frac{2}{N} \sum_{k = 1}^{N-1} (N - k)\hat{\gamma}_X(k)\right)} \tag{6}

Another Trap

Recipe (6) has a few problems when transformed directly into code:

A method that works reasonably well in practice is limiting kk to a value much lower than NN, e.g. N\sqrt{N}:

s^Xˉ=1N(γ^X(0)+2Nk=1N(Nk)γ^X(k))\hat{s}_{\bar{X}} = \sqrt{\frac{1}{N}\left(\hat{\gamma}_X(0) + \frac{2}{N} \sum_{k = 1}^{\sqrt{N}} (N - k)\hat{\gamma}_X(k)\right)}

This is based on the assumption that data points that are distant in time are much less correlated with each other than points that are close to each other, therefore truncating these higher-lag autocovariances doesn’t remove much useful information. This assumption is not true in general case, however, many stochastic processes we can observe in practice in benchmarking have such property or are “close enough”. You will need a bit of experimentation to find out the best lag cut-off for your system. Additionally, a nice side-effect of limiting the number of autocovariance terms is reducing the asymptotic complexity.

Another, more general method is multiplying autocovariance terms by diminishing weights less than 1, approaching zero when kk reaches a hard limit. A list of various weighting schemes is given by [1]. You can also find a broader discussion about applying this formula in [2].


Putting it all together, we can write the following function for estimating error of the mean:

pub fn error_of_the_mean(v: &[f64]) -> f64 {
    if v.len() <= 1 {
        return f64::NAN;
    let len = v.len() as f64;

    let mut mean = 0.0;
    for &x in v.iter() {
        mean += x;
    mean /= len;

    let mut var = 0.0;
    for &x in v.iter() {
        let diff = x - mean;
        var += diff * diff;
    var /= len;

    let bandwidth = len.powf(0.5);
    let max_lag = min(v.len(), bandwidth.ceil() as usize);
    let mut cov = 0.0;
    for lag in 1..max_lag {
        let weight = 1.0 - lag / len;
        for i in lag..v.len() {
            let diff_1 = v[i] - mean;
            let diff_2 = v[i - lag] - mean;
            cov += 2.0 * diff_1 * diff_2 * weight;
    cov /= len;

    ((var + cov).max(0.0) / len).sqrt()


  1. [1] D. W. K. Andrews, “Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation,” Econometrica, vol. 59, no. 3, pp. 817–858, 1991.
  2. [2] R. Okui, “Asymptotically Unbiased Estimation Of Autocovariances And Autocorrelations With Long Panel Data,” Econometric Theory, vol. 26, no. 5, pp. 1263–1304, 2010.
Share on: