Piotr Kołaczkowski

Overhead of Returning Optional Values in Java and Rust

Some programming languages like Java or Scala offer more than one way to express a concept of “lack of value”. Traditionally, a special null value is used to denote references that don’t reference any value at all. However, over time we have learned that using nulls can be very error-prone and can cause many troubles like NullPointerException errors crashing a program in the most unexpected moment. Therefore, modern programming style recommends avoiding nulls wherever possible in favor of a much better Option, Optional or Maybe data type (called differently in many languages, but the concept is the same). Unfortunately, it is believed that optional values in Java may come with a performance penalty. In this blog post, I’ll try to answer whether it is true, and if the performance penalty really exists, how serious it is.

Before we start, let me point out that if you don’t care much about extreme runtime performance, you should never use null in Java nor Scala, except for in rare cases of working with legacy APIs. Just forget null ever existed in the language. And if you receive an occasional null from an API you can’t control, just wrap it immediately in Optional<T> / Option[T]. Don’t let nulls bite you. You won’t regret it. To learn the reasons why null pointers / null references are a bad idea, watch this talk by Tony Hoare.

Ok, but if we care about performance? How much do optionals actually cost? Some say optionals are expensive because they cause heap allocation and often force the underlying type to be boxed. Not only heap allocation is costly, but it also forces the garbage collector to run more frequently. Others say that it doesn’t really matter, because the JVM compiler will eliminate the overhead thanks to inlining and escape analysis, and in most cases it should transform the code to an equivalent of code using null pointers.

Who is right? Let’s make a quick benchmark.

Java Benchmark

How to measure Optional overhead? The time to create an Optional object instance, even assuming a pessimistic case of allocating it on the heap, is likely too small to measure it directly. We need to create many Optional objects in a tight loop and time the whole loop.

We also need to make sure the compiler doesn’t notice the code does nothing or does something trivial. We don’t want the loop to be eliminated. A proper way of dealing with this problem is using all of the Optional objects we create so that their values affect the final result computed in the benchmark. Then the final result must be consumed by a black hole. Fortunately a benchmarking tool like JMH makes it all easy.

I created three variants of code solving the same task. The task was to compute a sum of all the numbers, skipping the number whenever it is equal to a magic constant. The variants differ by the way how skipping is realized:

  1. We return primitive longs and check if we need to skip by performing a comparison with the magic value directly in the summing loop.
  2. We return boxed Longs and we return null whenever we need to skip a number.
  3. We return boxed Longs wrapped in Optional and we return Optional.empty() whenever we need to skip a number.
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 2)
@Measurement(iterations = 5)
public class OptionBenchmark {

    private final long MAGIC_NUMBER = 7;

    // Variant 1.
    // Probably the simplest way to sum numbers.
    // No boxing, no objects involved, just primitive long values everywhere.
    // This is probably what a C-programmer converted to Java would write ;)
    private long getNumber(long i) {
        return i & 0xFF;
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public long sumSimple() {
        long sum = 0;
        for (long i = 0; i < 1_000_000; ++i) {
            long n = getNumber(i);
            if (n != MAGIC_NUMBER)
                sum += n;
        }
        return sum;
    }

    // Variant 2.
    // Replace MAGIC_NUMBER with a null.
    // To be able to return null, we need to box long into a Long object.
    private Long getNumberOrNull(long i) {
        long n = i & 0xFF;
        return n == MAGIC_NUMBER ? null : n;
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public long sumNulls() {
        long sum = 0;
        for (long i = 0; i < 1_000_000; ++i) {
            Long n = getNumberOrNull(i);
            if (n != null) {
                sum += n;
            }
        }
        return sum;
    }


    // Variant 3.
    // Replace MAGIC_NUMBER with Optional.empty().
    // Now we not only need to box the value into a Long, but also create the Optionsl wrapper.
    private Optional<Long> getOptionalNumber(long i) {
        long n = i & 0xFF;
        return n == MAGIC_NUMBER ? Optional.empty() : Optional.of(n);
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public long sumOptional() {
        long sum = 0;
        for (long i = 0; i < 1_000_000; ++i) {
            Optional<Long> n = getOptionalNumber(i);
            if (n.isPresent()) {
                sum += n.get();
            }
        }
        return sum;
    }
}

Results

Obviously all variants compute the same sum, so a sufficiently smart compiler should be able to generate exactly the same machine code for all of them. The method returning the number is very short and trivial, so I was expecting the compiler to inline it, remove all of object overhead through escape analysis, unroll and maybe even vectorize the loop. The number of invocations is constant and known at compile time, and the computations are trivial. Real code in real projects is usually not as simple as this microbenchmark, so we’re essentially testing the best-case scenario here.

OpenJDK 8 (JDK 1.8.0_292, OpenJDK 64-Bit Server VM, 25.292-b10):

Benchmark                    Mode  Cnt     Score     Error  Units
OptionBenchmark.sumSimple    avgt    5  1141,557 ± 366,538  us/op
OptionBenchmark.sumNulls     avgt    5  2324,597 ±  79,033  us/op
OptionBenchmark.sumOptional  avgt    5  4576,383 ± 288,571  us/op

OpenJDK 11 (Java HotSpot(TM) 64-Bit Server VM, 11+28):

Benchmark                    Mode  Cnt     Score     Error  Units
OptionBenchmark.sumSimple    avgt    5   603,683 ±  13,246  us/op
OptionBenchmark.sumNulls     avgt    5  2444,626 ±  35,236  us/op
OptionBenchmark.sumOptional  avgt    5  4303,527 ± 109,900  us/op

Graal Community 21.2 (JDK 16.0.2, OpenJDK 64-Bit Server VM, 16.0.2+7-jvmci-21.2-b08):

OptionBenchmark.sumSimple    avgt    5   868,155 ±  42,103  us/op
OptionBenchmark.sumNulls     avgt    5  1937,866 ±  29,938  us/op
OptionBenchmark.sumOptional  avgt    5  4201,451 ± 613,179  us/op

OpenJDK 17 (JDK 17-ea, OpenJDK 64-Bit Server VM, 17-ea+19-Ubuntu-1ubuntu1):

Benchmark                    Mode  Cnt     Score     Error  Units
OptionBenchmark.sumSimple    avgt    5   449,811 ±  60,889  us/op
OptionBenchmark.sumNulls     avgt    5   952,622 ±  84,138  us/op
OptionBenchmark.sumOptional  avgt    5  4002,787 ± 264,937  us/op

Seriously, these outcomes are totally not what I expected. In all cases JVM did a poor job of eliminating both the boxing and the optionals. This led to about 8x worse timings for looping over optionals than looping over primitive longs on JDK 17. I even reran the benchmarks with explicit -XX:+DoEscapeAnalysis -XX:+Inline to make sure these are turned on, but that hasn’t changed anything (they should be enabled by default anyway).

At least, there is a steady progress in performance between different JDK versions.

Disabling Inlining

In real production code it can also happen that a method that returns an Optional doesn’t get inlined. How does it affect the overhead? I reran the benchmarks with -XX:-Inline option:

JDK 1.8.0_292, OpenJDK 64-Bit Server VM, 25.292-b10:

Benchmark                    Mode  Cnt      Score      Error  Units
OptionBenchmark.sumSimple    avgt    5   2244,281 ± 1271,115  us/op
OptionBenchmark.sumNulls     avgt    5   9024,197 ±  391,510  us/op
OptionBenchmark.sumOptional  avgt    5  20045,933 ±  777,194  us/op

JDK 11, Java HotSpot(TM) 64-Bit Server VM, 11+28

Benchmark                    Mode  Cnt      Score      Error  Units
OptionBenchmark.sumSimple    avgt    5   2385,597 ±  323,662  us/op
OptionBenchmark.sumNulls     avgt    5   7900,140 ± 2841,416  us/op
OptionBenchmark.sumOptional  avgt    5  20507,346 ± 5236,451  us/op

JDK 17-ea, OpenJDK 64-Bit Server VM, 17-ea+19-Ubuntu-1ubuntu1:

Benchmark                    Mode  Cnt      Score      Error  Units
OptionBenchmark.sumSimple    avgt    5   2382,109 ± 1346,017  us/op
OptionBenchmark.sumNulls     avgt    5   7664,463 ± 2040,463  us/op
OptionBenchmark.sumOptional  avgt    5  17961,942 ±  773,224  us/op

Everything got much slower, but optionals and boxing overhead are still clearly visible and the ratio is still more than 7x. The difference between JDKs is much smaller this time.

Rust Benchmark

I’ve rewritten everything to Rust because Rust also has optional values, and I was curious if a similar difference would be present. The code was trivial to port, except the variant that was using nulls, because Rust has no nulls (which is IMHO a good thing). Instead, I added two more variants:

use std::num::NonZeroU64;

const MAGIC: u64 = 7;

fn get_int(n: u64) -> u64 {
    n & 0xFF
}

pub fn sum_simple() -> u64 {
    let mut sum = 0;
    for i in 0..1_000_000 {
        let n = get_int(i);
        if n != MAGIC {
            sum += n;
        }
    }
    sum
}

fn get_optional_int(n: u64) -> Option<u64> {
    let i = n & 0xFF;
    if i == MAGIC { None } else { Some(i) }
}


pub fn sum_optional() -> u64 {
    let mut sum = 0;
    for i in 0..1_000_000 {
        let value = get_optional_int(i);
        if let Some(k) = value {
            sum += k;
        }
    }
    sum
}

fn get_optional_non_zero(n: u64) -> Option<NonZeroU64> {
    let i = n & 0xFF;
    if i == MAGIC { None } else { NonZeroU64::new(i) }
}

pub fn sum_optional_non_zero() -> u64 {
    let mut sum = 0;
    for i in 0..1_000_000 {
        let value = get_optional_non_zero(i);
        if let Some(k) = value {
            sum += k.get();
        }
    }
    sum
}


fn get_int_boxed(n: u64) -> Box<Option<u64>> {
    let opt = if n == MAGIC { None } else { Some(n) };
    Box::new(opt)
}

pub fn sum_boxed() -> u64 {
    let mut sum = 0;
    for i in 0..1_000_000 {
        let n = get_int_boxed(i);
        if let Some(k) = *n {
            sum += k;
        }
    }
    sum
}

Results

Let the numbers speak for themselves.

With inlining, compiled specifically for my CPU with RUSTFLAGS="-C target-cpu=native":

sum_simple              time:   [121.33 us 121.60 us 121.89 us]                        
sum_optional            time:   [121.65 us 121.97 us 122.29 us]                         
sum_optional_non_zero   time:   [123.22 us 123.71 us 124.26 us]                                  
sum_boxed               time:   [122.36 us 123.19 us 124.21 us]    

With default release options:

sum_simple              time:   [375.58 us 377.02 us 378.73 us]                        
sum_optional            time:   [373.19 us 374.20 us 375.31 us]                         
sum_optional_non_zero   time:   [373.09 us 374.24 us 375.60 us]                                  
sum_boxed               time:   [313.62 us 314.76 us 316.09 us]                      

With default release options, but with inlining of the inner functions returning the number blocked by #[inline(never)]:

sum_simple              time:   [1.3899 ms 1.3940 ms 1.3987 ms]                         
sum_optional            time:   [1.4066 ms 1.4117 ms 1.4177 ms]                          
sum_optional_non_zero   time:   [1.1041 ms 1.1089 ms 1.1145 ms]                                   
sum_boxed               time:   [14.882 ms 14.936 ms 14.994 ms]                      

Analysis

Why has Java scored so bad compared to Rust?

I added the -perfasm option to the JMH options of the sumOptional Java benchmark and it resulted in the following disassembly output (I’m actually showing only a relevant fragment of it):

  0,01%   ╭ │     0x00007fcc78d3a3eb:   jmp    0x00007fcc78d3a495           ;*goto {reexecute=0 rethrow=0 return_oop=0}
          │ │                                                               ; - pk.OptionBenchmark::sumOptional@45 (line 73)
  0,31%   │ │  ↗  0x00007fcc78d3a3f0:   mov    QWORD PTR [r15+0x108],rsi
  3,08%   │ │  │  0x00007fcc78d3a3f7:   prefetchw BYTE PTR [rsi+0xc0]
  3,34%   │ │  │  0x00007fcc78d3a3fe:   mov    QWORD PTR [r10],0x1
  0,24%   │ │  │  0x00007fcc78d3a405:   mov    DWORD PTR [r10+0x8],0x49770  ;   {metadata('java/lang/Long')}
  0,54%   │ │  │  0x00007fcc78d3a40d:   mov    DWORD PTR [r10+0xc],r12d     ;*new {reexecute=0 rethrow=0 return_oop=0}
          │ │  │                                                            ; - java.lang.Long::valueOf@31 (line 1211)
          │ │  │                                                            ; - pk.OptionBenchmark::getOptionalNumber@21 (line 65)
          │ │  │                                                            ; - pk.OptionBenchmark::sumOptional@14 (line 74)
  2,55%   │ │  │  0x00007fcc78d3a411:   mov    QWORD PTR [r10+0x10],r11
  1,58%   │ │ ↗│  0x00007fcc78d3a415:   mov    rax,QWORD PTR [r15+0x108]
  2,71%   │ │ ││  0x00007fcc78d3a41c:   mov    r11,rax
  0,72%   │ │ ││  0x00007fcc78d3a41f:   add    r11,0x10
  3,48%   │ │ ││  0x00007fcc78d3a423:   cmp    r11,QWORD PTR [r15+0x118]
          │ │ ││  0x00007fcc78d3a42a:   jae    0x00007fcc78d3a574           ;*goto {reexecute=0 rethrow=0 return_oop=0}
          │ │ ││                                                            ; - pk.OptionBenchmark::sumOptional@45 (line 73)
  1,90%   │ │ ││  0x00007fcc78d3a430:   mov    QWORD PTR [r15+0x108],r11
  1,80%   │ │ ││  0x00007fcc78d3a437:   prefetchw BYTE PTR [r11+0xc0]
  5,22%   │ │ ││  0x00007fcc78d3a43f:   mov    QWORD PTR [rax],0x1
  3,78%   │ │ ││  0x00007fcc78d3a446:   mov    DWORD PTR [rax+0x8],0x11af00 ;*new {reexecute=0 rethrow=0 return_oop=0}
          │ │ ││                                                            ; - java.util.Optional::of@0 (line 113)
          │ │ ││                                                            ; - pk.OptionBenchmark::getOptionalNumber@24 (line 65)
          │ │ ││                                                            ; - pk.OptionBenchmark::sumOptional@14 (line 74)
          │ │ ││                                                            ;   {metadata('java/util/Optional')}
  2,20%   │ │ ││  0x00007fcc78d3a44d:   mov    r11,r10
  1,54%   │ │ ││  0x00007fcc78d3a450:   shr    r11,0x3
  0,87%   │ │ ││  0x00007fcc78d3a454:   mov    DWORD PTR [rax+0xc],r11d     ;*synchronization entry
          │ │ ││                                                            ; - pk.OptionBenchmark::getOptionalNumber@-1 (line 64)
          │ │ ││                                                            ; - pk.OptionBenchmark::sumOptional@14 (line 74)
  3,23%   │ │ ││  0x00007fcc78d3a458:   mov    r11d,DWORD PTR [rax+0xc]     ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
          │ │ ││                                                            ; - java.util.Optional::isPresent@1 (line 154)
          │ │ ││                                                            ; - pk.OptionBenchmark::sumOptional@21 (line 75)
  2,24%   │ │ ││  0x00007fcc78d3a45c:   nop    DWORD PTR [rax+0x0]
  1,47%   │ │ ││  0x00007fcc78d3a460:   test   r11d,r11d
          │╭│ ││  0x00007fcc78d3a463:   je     0x00007fcc78d3a47f
  2,72%   │││ ││  0x00007fcc78d3a465:   mov    r10d,DWORD PTR [r12+r11*8+0x8]
 11,00%   │││ ││  0x00007fcc78d3a46a:   shl    r11,0x3
  0,92%   │││ ││  0x00007fcc78d3a46e:   cmp    r10d,0x49770                 ;*synchronization entry
          │││ ││                                                            ; - pk.OptionBenchmark::getOptionalNumber@-1 (line 64)
          │││ ││                                                            ; - pk.OptionBenchmark::sumOptional@14 (line 74)
          │││ ││                                                            ;   {metadata('java/lang/Long')}
          │││ ││  0x00007fcc78d3a475:   jne    0x00007fcc78d3a687           ;*checkcast {reexecute=0 rethrow=0 return_oop=0}
          │││ ││                                                            ; - pk.OptionBenchmark::sumOptional@33 (line 76)
  4,53%   │││ ││  0x00007fcc78d3a47b:   add    rcx,QWORD PTR [r11+0x10]     ;*lload_3 {reexecute=0 rethrow=0 return_oop=0}
          │││ ││                                                            ; - pk.OptionBenchmark::sumOptional@41 (line 73)
  5,89%   │↘│ ││  0x00007fcc78d3a47f:   add    r8,0x1
  0,92%   │ │ ││  0x00007fcc78d3a483:   mov    r11,r8                       ;*synchronization entry
          │ │ ││                                                            ; - pk.OptionBenchmark::getOptionalNumber@-1 (line 64)
          │ │ ││                                                            ; - pk.OptionBenchmark::sumOptional@14 (line 74)
  0,62%   │ │ ││  0x00007fcc78d3a486:   movzx  r11,r11b                     ;*land {reexecute=0 rethrow=0 return_oop=0}
          │ │ ││                                                            ; - pk.OptionBenchmark::getOptionalNumber@4 (line 64)
          │ │ ││                                                            ; - pk.OptionBenchmark::sumOptional@14 (line 74)
  1,14%   │ │ ││  0x00007fcc78d3a48a:   inc    edi
  4,42%   │ │ ││  0x00007fcc78d3a48c:   cmp    edi,r9d
          │ ╰ ││  0x00007fcc78d3a48f:   jge    0x00007fcc78d3a3b3

My knowledge of Intel assembly is quite limited, but I can clearly see a few things:

To confirm my interpretation of the disassembly, I also ran all the benchmarks with -perf gc:

Benchmark                                                     Mode  Cnt         Score        Error   Units
OptionBenchmark.sumSimple:·gc.alloc.rate                      avgt    5        ≈ 10⁻⁴               MB/sec
OptionBenchmark.sumSimple:·gc.count                           avgt    5           ≈ 0               counts

OptionBenchmark.sumNulls:·gc.alloc.rate                       avgt    5        ≈ 10⁻⁴               MB/sec
OptionBenchmark.sumNulls:·gc.count                            avgt    5           ≈ 0               counts

OptionBenchmark.sumOptional:·gc.alloc.rate                    avgt    5      6193,432 ±    424,465  MB/sec
OptionBenchmark.sumOptional:·gc.count                         avgt    5       546,000               counts

As you can see, the sumOptional benchmark was allocating at rate over 6 GB/s!

Rust output code

The code generated by Rust using default options looks much simpler:

            Disassembly of section .text:
               
            0000000000050f40 <benchmark::sum_optional>:
            benchmark::sum_optional:
              mov    $0x4,%ecx
              xor    %edx,%edx
              xor    %eax,%eax
              nop           
  2,67  10:   lea    -0x4(%rcx),%esi
  5,65        lea    -0x3(%rcx),%edi
  2,61        movzbl %sil,%esi
  4,29        cmp    $0x7,%rsi
  2,76        cmove  %rdx,%rsi
  5,71        add    %rax,%rsi
  2,47        lea    -0x2(%rcx),%eax
  4,41        movzbl %dil,%edi
  2,69        cmp    $0x7,%rdi
  6,05        cmove  %rdx,%rdi
  3,29        add    %rsi,%rdi
  3,98        lea    -0x1(%rcx),%esi
  2,53        movzbl %al,%eax
  5,09        cmp    $0x7,%rax
  3,37        cmove  %rdx,%rax
  4,76        add    %rdi,%rax
  2,75        movzbl %sil,%esi
  4,80        cmp    $0x7,%rsi
  3,15        cmove  %rdx,%rsi
  5,64        add    %rax,%rsi
  2,23        movzbl %cl,%eax
  4,18        cmp    $0x7,%rax
  2,78        cmove  %rdx,%rax
  5,31        add    %rsi,%rax
  2,44        add    $0x5,%rcx
              cmp    $0xf4244,%rcx
  4,40      ↑ jne    10     
  0,01      ← ret           

The code is much shorter, even despite the fact the loop has been unrolled. It is also way simpler and easier to understand. In the body of the loop there are no moves between memory and registers, and there are no branches. Skipping the number is performed with a conditional move instruction which looks like a very natural choice to me.

It is worth noting the code for all of four variants is identical and the compiler even merged copies into one. The compiler managed to get rid of all the overhead of Option and Box. These abstrations turned out to be truly zero-cost!

As a final check I compiled the benchmark with RUSTFLAGS="-C target-cpu=native":

             Disassembly of section .text:
               
             0000000000057d50 <benchmark::sum_classic>:
             btree_benchmark::sum_optional:
               sub          $0x38,%rsp
  0,00         vmovdqa      anon.b26f45dd7d0ea1d60480ae6c3c88753d.34.llvm.65588711349491208+0x4d0,%ymm0
               vpxor        %xmm11,%xmm11,%xmm11
               mov          $0xf4240,%eax
               vpbroadcastq 0x1d5b31(%rip),%ymm1        # 22d8a0 <anon.b26f45dd7d0ea1d60480ae6c3c88753d.34.llvm.65588711349491208+0x4f0>
               vmovdqu      %ymm1,(%rsp)
               vpbroadcastq 0x1d5b2b(%rip),%ymm2        # 22d8a8 <anon.b26f45dd7d0ea1d60480ae6c3c88753d.34.llvm.65588711349491208+0x4f8>
               vpbroadcastq 0x1d5b2a(%rip),%ymm3        # 22d8b0 <anon.b26f45dd7d0ea1d60480ae6c3c88753d.34.llvm.65588711349491208+0x500>
               vpbroadcastq 0x1d5b29(%rip),%ymm4        # 22d8b8 <anon.b26f45dd7d0ea1d60480ae6c3c88753d.34.llvm.65588711349491208+0x508>
               vpbroadcastq 0x1d5b28(%rip),%ymm5        # 22d8c0 <anon.b26f45dd7d0ea1d60480ae6c3c88753d.34.llvm.65588711349491208+0x510>
               vpbroadcastq 0x1d5b27(%rip),%ymm6        # 22d8c8 <anon.b26f45dd7d0ea1d60480ae6c3c88753d.34.llvm.65588711349491208+0x518>
               vpbroadcastq 0x1d5b26(%rip),%ymm7        # 22d8d0 <anon.b26f45dd7d0ea1d60480ae6c3c88753d.34.llvm.65588711349491208+0x520>
               vpbroadcastq 0x1d5b25(%rip),%ymm8        # 22d8d8 <anon.b26f45dd7d0ea1d60480ae6c3c88753d.34.llvm.65588711349491208+0x528>
               vpbroadcastq 0x1d5b24(%rip),%ymm9        # 22d8e0 <anon.b26f45dd7d0ea1d60480ae6c3c88753d.34.llvm.65588711349491208+0x530>
               vpbroadcastq 0x1d5b23(%rip),%ymm10        # 22d8e8 <anon.b26f45dd7d0ea1d60480ae6c3c88753d.34.llvm.65588711349491208+0x538>
               vpxor        %xmm12,%xmm12,%xmm12
               vpxor        %xmm13,%xmm13,%xmm13
               vpxor        %xmm14,%xmm14,%xmm14
               data16       data16 cs nopw 0x0(%rax,%rax,1)
  1,34   90:   vpand        %ymm4,%ymm0,%ymm15
  3,66         vpcmpeqq     %ymm5,%ymm15,%ymm1
  3,04         vpandn       %ymm15,%ymm1,%ymm1
  0,96         vpaddq       (%rsp),%ymm0,%ymm15
  1,25         vpand        %ymm4,%ymm15,%ymm15
  3,51         vpaddq       %ymm1,%ymm11,%ymm1
  2,94         vpcmpeqq     %ymm5,%ymm15,%ymm11
  2,25         vpandn       %ymm15,%ymm11,%ymm11
  0,80         vpaddq       %ymm2,%ymm0,%ymm15
  3,11         vpand        %ymm4,%ymm15,%ymm15
  3,38         vpaddq       %ymm12,%ymm11,%ymm12
  2,45         vpcmpeqq     %ymm5,%ymm15,%ymm11
  1,73         vpandn       %ymm15,%ymm11,%ymm11
  1,92         vpaddq       %ymm3,%ymm0,%ymm15
  2,93         vpand        %ymm4,%ymm15,%ymm15
  4,20         vpaddq       %ymm13,%ymm11,%ymm13
  1,64         vpcmpeqq     %ymm5,%ymm15,%ymm11
  2,02         vpandn       %ymm15,%ymm11,%ymm11
  4,99         vpaddq       %ymm14,%ymm11,%ymm14
  2,13         vpaddq       %ymm6,%ymm0,%ymm11
  0,66         vpand        %ymm4,%ymm11,%ymm11
  1,97         vpcmpeqq     %ymm5,%ymm11,%ymm15
  4,79         vpandn       %ymm11,%ymm15,%ymm11
  1,72         vpaddq       %ymm7,%ymm0,%ymm15
  0,67         vpand        %ymm4,%ymm15,%ymm15
  2,54         vpaddq       %ymm1,%ymm11,%ymm11
  3,66         vpcmpeqq     %ymm5,%ymm15,%ymm1
  2,39         vpandn       %ymm15,%ymm1,%ymm1
  0,42         vpaddq       %ymm0,%ymm8,%ymm15
  2,02         vpand        %ymm4,%ymm15,%ymm15
  4,75         vpaddq       %ymm1,%ymm12,%ymm12
  2,23         vpcmpeqq     %ymm5,%ymm15,%ymm1
  1,99         vpandn       %ymm15,%ymm1,%ymm1
  1,44         vpaddq       %ymm0,%ymm9,%ymm15
  3,94         vpand        %ymm4,%ymm15,%ymm15
  4,01         vpaddq       %ymm1,%ymm13,%ymm13
  1,84         vpcmpeqq     %ymm5,%ymm15,%ymm1
  1,90         vpandn       %ymm15,%ymm1,%ymm1
  3,97         vpaddq       %ymm1,%ymm14,%ymm14
  1,96         vpaddq       %ymm0,%ymm10,%ymm0
  0,00         add          $0xffffffffffffffe0,%rax
  0,84       ↑ jne          90
  0,00         vpaddq       %ymm11,%ymm12,%ymm0
  0,00         vpaddq       %ymm0,%ymm13,%ymm0
               vpaddq       %ymm0,%ymm14,%ymm0
               vextracti128 $0x1,%ymm0,%xmm1
               vpaddq       %xmm1,%xmm0,%xmm0
               vpshufd      $0xee,%xmm0,%xmm1
  0,00         vpaddq       %xmm1,%xmm0,%xmm0
               vmovq        %xmm0,%rax
               add          $0x38,%rsp
               vzeroupper     
  0,00       ← ret            

The loop got nicely vectorised and there are no branches inside the body of the loop either. It is a bit longer, but it does more per each cycle of the loop and needs fewer cycles.

Conclusions

Disclaimer: Do measure performance by yourself and don’t assume particular optimizations will / won’t take place because you saw someone mentioning them in the blog post. YMMV.

Share on: