The C file to be attached is compiled to code where gratuitous memory references are inserted in the inner loop even though there are plenty of registers available. It even looks like gcc is actually allocating the variables in registers, but then forgets it in the inner loop. I am attaching both the source file and the generated assembly.

The relevant part of the assembly is this:

    .L11:
            movl    %ebx, -16(%ebp)       <- the variables are in registers here
            movl    %ecx, -36(%ebp)
    .L6:
    #APP
            # begin inner
    #NO_APP
            movl    -36(%ebp), %edi       <- memory ref - should use %ecx instead
            addl    $4, -36(%ebp)         <- same
            movl    (%edi), %eax
            movl    -16(%ebp), %edi       <- memory ref - should use %ebx instead
            orl     $-16777216, %eax
            movl    %eax, (%edi)
            addl    $4, %edi
            movl    %edi, -16(%ebp)       <- same
    #APP
            # end inner
    #NO_APP
            subl    $1, %edx
            cmpw    $-1, %dx
            je      .L4
            jmp     .L6

When compiled with -O3, the problem goes away except for one apparently gratuitous memory write in the loop, but I'd think that even -O2 should get this right.

    dhcp83-218:~% rpm -q gcc
    gcc-4.1.2-13

gcc commandline:

    gcc -O2 -S gcc-register.c
Created attachment 157224: The test case
Created attachment 157225: The generated assembly
Both gcc 4.1.x and 4.2.x behave this way. In *.lreg the inner loop is:

    (insn:HI 60 58 61 5 (parallel [
                (set (reg:SI 90)
                    (ior:SI (mem:SI (reg/v/f:SI 63 [ src ]) [3 S4 A32])
                        (const_int -16777216 [0xffffffffff000000])))
                (clobber (reg:CC 17 flags))
            ]) 318 {*iorsi_1} (nil)
        (expr_list:REG_EQUIV (mem:SI (reg/v/f:SI 65 [ dst ]) [3 S4 A32])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil))))

    (insn:HI 61 60 62 5 (set (mem:SI (reg/v/f:SI 65 [ dst ]) [3 S4 A32])
            (reg:SI 90)) 40 {*movsi_1} (insn_list:REG_DEP_TRUE 60 (nil))
        (expr_list:REG_DEAD (reg:SI 90)
            (expr_list:REG_EQUAL (ior:SI (mem:SI (reg/v/f:SI 63 [ src ]) [3 S4 A32])
                    (const_int -16777216 [0xffffffffff000000]))
                (nil))))

(plus the src/dst bumps and the w decrement), but after global alloc and reload the code is terrible. Both 3.4.x and the trunk happen to assign different hard registers to src and dst, so the loop looks nicer there, but I'm not sure that isn't just a coincidence.

Anyway, the register allocator is a known painful spot in gcc and Vlad is working on that area, but unless the fix turns out to be very obvious, the chances of backporting this to 4.1.x-RH are close to nil; it would be a terribly risky change.
Yeah, I wasn't really expecting any back porting. Feel free to close this bug if it isn't useful. Note though that this issue is a real problem for the cairo and X server rendering code.
cairo or X can work around this: s/uint16_t w;/uint32_t w;/

    while (height--)
      {
        dst = dstLine;
        dstLine += dstStride;
        src = srcLine;
        srcLine += srcStride;
        for (w = 0; w < width; w++)
          dst[w] = src[w] | 0xFF000000;
      }

which gives:

    .L6:
            movl    -4(%edx,%ebx,4), %eax
            orl     $-16777216, %eax
            movl    %eax, -4(%ecx,%ebx,4)
            addl    $1, %ebx
            cmpl    -24(%ebp), %ebx
            je      .L4
            jmp     .L6

The question is if it is only better code on register starved i?86 (which ought to die soon), or other arches too.
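For anyone who wants to compare the two loop shapes side by side, here is a minimal sketch. The attached test case is not reproduced in this thread, so the first function is only a guess at its shape, inferred from the 16-bit decrement (subl $1, %edx / cmpw $-1, %dx) and the pointer bumps in the assembly above. The function names, the parameter types, and the self_check helper are all made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: hypothetical reconstruction of the reported loop shape
 * (16-bit counter, pointer bumps). Not the actual attachment. */
static void fill_alpha_u16(uint32_t *dstLine, const uint32_t *srcLine,
                           int dstStride, int srcStride,
                           uint16_t width, uint16_t height)
{
    while (height--) {
        uint32_t *dst = dstLine;
        const uint32_t *src = srcLine;
        uint16_t w = width;

        dstLine += dstStride;
        srcLine += srcStride;
        while (w--)
            *dst++ = *src++ | 0xFF000000;   /* force alpha to 0xFF */
    }
}

/* The workaround from the comment above: 32-bit counter, indexed access. */
static void fill_alpha_u32(uint32_t *dstLine, const uint32_t *srcLine,
                           int dstStride, int srcStride,
                           uint16_t width, uint16_t height)
{
    uint32_t w;

    while (height--) {
        uint32_t *dst = dstLine;
        const uint32_t *src = srcLine;

        dstLine += dstStride;
        srcLine += srcStride;
        for (w = 0; w < width; w++)
            dst[w] = src[w] | 0xFF000000;
    }
}

/* Hypothetical helper: checks that both variants write the same pixels
 * and leave the stride padding untouched. */
static void self_check(void)
{
    enum { W = 5, H = 3, STRIDE = 8 };
    uint32_t src[H * STRIDE], a[H * STRIDE], b[H * STRIDE];
    int i, r;

    for (i = 0; i < H * STRIDE; i++)
        src[i] = (uint32_t)i * 0x01020304u;
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);

    fill_alpha_u16(a, src, STRIDE, STRIDE, W, H);
    fill_alpha_u32(b, src, STRIDE, STRIDE, W, H);

    assert(memcmp(a, b, sizeof a) == 0);
    for (r = 0; r < H; r++)
        assert(a[r * STRIDE + W] == 0);     /* first padding cell per row */
}
```

Both functions should produce identical output, so the only difference worth looking at is the generated inner loop, for instance by running each through the same gcc -O2 -S command line used in the original report. Whether the indexed form is also better on other arches is exactly the open question above; this sketch does not answer it.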
Tracking upstream.