I was playing with assembly today, trying to understand the performance impact of breaking data dependencies.

Optimizing compilers tend to rewrite the human-obvious sum implementation to use multiple registers, each tracking part of the total sum, adding them up at the end. For example, here’s an implementation of sum in Rust 1.75 compiled with -C opt-level=3, and the associated disassembly (shortened to keep only the relevant bits):

pub fn sum(arr: &[i32]) -> i32 {
    let mut result = 0;

    for n in arr.iter() {
        result += n
    }

    result
}
sum:
        ; Set xmm0 to 0.
        pxor    xmm0, xmm0
        xor     eax, eax
        ; Set xmm1 to 0.
        pxor    xmm1, xmm1
.loop:
        ; Move 4 i32 values from our array into xmm2
        movdqu  xmm2, xmmword ptr [rdi + 4*rax]
        ; Add the 4 i32 values in xmm2 to xmm0.
        paddd   xmm0, xmm2
        ; Move the next 4 i32 values from our array into xmm2.
        movdqu  xmm2, xmmword ptr [rdi + 4*rax + 16]
        ; Add the 4 i32 values in xmm2 to xmm1, not xmm0 this time
        paddd   xmm1, xmm2

        ; Move index by 8 and if we're not done jump back to loop.
        add     rax, 8
        cmp     r8, rax
        jne     .loop

        ; If we are done, add up what's in xmm0 and xmm1 and return.
        paddd   xmm1, xmm0
        ; 238 = 0b11101110: copy the high 64 bits of xmm1 into both halves.
        pshufd  xmm0, xmm1, 238
        paddd   xmm0, xmm1
        ; 85 = 0b01010101: broadcast lane 1 across the register.
        pshufd  xmm1, xmm0, 85
        paddd   xmm1, xmm0
        movd    eax, xmm1
        ret
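
At the source level, the transformation looks roughly like the sketch below. It’s only an illustration (sum_split is my name for it, and the real codegen accumulates in SIMD lanes rather than in two scalar variables), but the shape is the same: two independent partial sums, combined at the end.

pub fn sum_split(arr: &[i32]) -> i32 {
    let mut a = 0;
    let mut b = 0;

    let mut chunks = arr.chunks_exact(2);
    for pair in &mut chunks {
        a += pair[0]; // plays the role of xmm0
        b += pair[1]; // plays the role of xmm1
    }
    // An odd-length array leaves one element over.
    for n in chunks.remainder() {
        a += n;
    }

    a + b
}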


As expected, the compiler splits the sum between two registers (xmm0, xmm1). My hypothesis, therefore, was that the following hand-written assembly, in which every add to EAX has to wait for the previous one:


sum:
        ; rcx = one past the end of the array.
        lea     rcx, [rdi + 4*rsi]
        xor     eax, eax
.loop:
        ; Use EAX for both additions.
        add     eax, dword [rdi]
        add     eax, dword [rdi + 4]
        lea     rdi, [rdi + 8]
        cmp     rdi, rcx
        jb      .loop
        ret

would be measurably slower than this multi-register version, which splits the work into two independent dependency chains:

sum:
        lea     rcx, [rdi + 4*rsi]
        xor     eax, eax
        ; Second accumulator in EDX rather than EBX: RBX is callee-saved
        ; in the System V ABI, so clobbering it without saving it would
        ; be a bug, while RDX is free here.
        xor     edx, edx
.loop:
        ; Split the additions between EAX and EDX.
        add     eax, dword [rdi]
        add     edx, dword [rdi + 4]
        lea     rdi, [rdi + 8]
        cmp     rdi, rcx
        jb      .loop
        add     eax, edx
        ret

Except… it’s not? If anything, the single-accumulator version is slightly faster.

Is my measuring methodology horrible (hyperfine on a shared vCPU isn’t great, but I would expect it to be merely noisy, not biased)? Or is register renaming making things efficient behind the scenes?
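
One way to sanity-check the hyperfine numbers would be an in-process harness along these lines. It’s a sketch, not my actual setup: sum_one_acc and sum_two_acc are hypothetical names for the two routines above, assumed to be assembled separately and linked in with the C calling convention.

use std::hint::black_box;
use std::time::Instant;

// Hypothetical externs for the two hand-written routines above,
// taking the pointer in rdi and the length in rsi.
extern "C" {
    fn sum_one_acc(ptr: *const i32, len: usize) -> i32;
    fn sum_two_acc(ptr: *const i32, len: usize) -> i32;
}

type SumFn = unsafe extern "C" fn(*const i32, usize) -> i32;

fn bench(name: &str, f: SumFn, data: &[i32]) {
    let start = Instant::now();
    let mut checksum = 0i64;
    for _ in 0..1_000 {
        // black_box keeps the compiler from eliding or hoisting the call.
        checksum += unsafe { f(black_box(data.as_ptr()), black_box(data.len())) } as i64;
    }
    println!("{name}: {:?} (checksum {checksum})", start.elapsed());
}

fn main() {
    let data: Vec<i32> = (0..1 << 20).collect();
    bench("one accumulator", sum_one_acc, &data);
    bench("two accumulators", sum_two_acc, &data);
}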

Curious!