Bug 244575

Summary: Problem with gcc i386 register allocation
Product: Fedora
Reporter: Søren Sandmann Pedersen <sandmann>
Component: gcc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED UPSTREAM
Severity: low
Priority: low
Version: rawhide
CC: kem, vmakarov
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-06-20 09:24:48 UTC
Attachments:
Description               Flags
The test case             none
The generated assembly    none

Description Søren Sandmann Pedersen 2007-06-17 15:27:03 UTC
The attached C file compiles to code with gratuitous memory references in
the inner loop, even though plenty of registers are available. It even looks
like gcc actually allocates the variables to registers, but then forgets
about them in the inner loop.

I am attaching both the source file and the generated assembly. The relevant
part of the assembly is this:

.L11:
        movl    %ebx, -16(%ebp)        <- the variables are in registers here
        movl    %ecx, -36(%ebp)
.L6:
#APP
        # begin inner
#NO_APP
        movl    -36(%ebp), %edi        <- memory ref - should use %ecx instead.
        addl    $4, -36(%ebp)          <- same
        movl    (%edi), %eax
        movl    -16(%ebp), %edi        <- memory ref - should use %ebx instead
        orl     $-16777216, %eax
        movl    %eax, (%edi)
        addl    $4, %edi
        movl    %edi, -16(%ebp)        <- same
#APP
        # end inner
#NO_APP
        subl    $1, %edx
        cmpw    $-1, %dx
        je      .L4
        jmp     .L6

When compiled with -O3, the problem goes away except for one apparently
gratuitous memory write in the loop, but I'd think that even -O2 should get
this right.

dhcp83-218:~% rpm -q gcc
gcc-4.1.2-13

gcc command line:
gcc -O2 -S gcc-register.c
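
For readers without the attachment, the loop is probably of roughly the
following shape. This is a reconstruction, not the attached gcc-register.c:
the pointer bumps and the 16-bit counter are inferred from the assembly above
(addl $4 on both pointers, cmpw $-1, %dx), the variable names are borrowed
from the workaround in comment 5, and the function names and signature are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reconstruction of the loop shape; strides are in pixels. */
void fill_alpha_ptr(uint32_t *dstLine, const uint32_t *srcLine,
                    int dstStride, int srcStride,
                    uint16_t width, uint16_t height)
{
    while (height--) {
        uint32_t *dst = dstLine;
        const uint32_t *src = srcLine;
        uint16_t w = width;          /* 16-bit counter, hence the cmpw */

        dstLine += dstStride;
        srcLine += srcStride;

        /* Inner loop: load from src, force alpha to 0xFF, store to dst. */
        while (w--)
            *dst++ = *src++ | 0xFF000000;
    }
}

/* Small self-check: two rows of two pixels with a stride of three,
 * so one padding pixel per row must stay untouched. */
void fill_alpha_ptr_selfcheck(void)
{
    const uint32_t src[6] = { 0x00112233u, 0x00445566u, 0xDEADBEEFu,
                              0x00778899u, 0x00AABBCCu, 0xDEADBEEFu };
    uint32_t dst[6] = { 0u, 0u, 0u, 0u, 0u, 0u };

    fill_alpha_ptr(dst, src, 3, 3, 2, 2);

    assert(dst[0] == 0xFF112233u && dst[1] == 0xFF445566u);
    assert(dst[3] == 0xFF778899u && dst[4] == 0xFFAABBCCu);
    assert(dst[2] == 0u && dst[5] == 0u);   /* padding not written */
}
```

Only dst, src and w are live in the inner loop, which is why the spills to
-16(%ebp) and -36(%ebp) look unnecessary.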

Comment 1 Søren Sandmann Pedersen 2007-06-17 15:27:03 UTC
Created attachment 157224 [details]
The test case

Comment 2 Søren Sandmann Pedersen 2007-06-17 15:28:16 UTC
Created attachment 157225 [details]
The generated assembly

Comment 3 Jakub Jelinek 2007-06-18 16:40:43 UTC
Both gcc 4.1.x and 4.2.x behave this way. In the *.lreg dump the loop body is
(insn:HI 60 58 61 5 (parallel [
            (set (reg:SI 90)
                (ior:SI (mem:SI (reg/v/f:SI 63 [ src ]) [3 S4 A32])
                    (const_int -16777216 [0xffffffffff000000])))
            (clobber (reg:CC 17 flags))
        ]) 318 {*iorsi_1} (nil)
    (expr_list:REG_EQUIV (mem:SI (reg/v/f:SI 65 [ dst ]) [3 S4 A32])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

(insn:HI 61 60 62 5 (set (mem:SI (reg/v/f:SI 65 [ dst ]) [3 S4 A32])
        (reg:SI 90)) 40 {*movsi_1} (insn_list:REG_DEP_TRUE 60 (nil))
    (expr_list:REG_DEAD (reg:SI 90)
        (expr_list:REG_EQUAL (ior:SI (mem:SI (reg/v/f:SI 63 [ src ]) [3 S4 A32])
                (const_int -16777216 [0xffffffffff000000]))
            (nil))))
(plus the src/dst bumps and the w decrement), but after global alloc and reload
the code is terrible.  Both 3.4.x and the trunk happen to assign different hard
registers to src and dst, so the loop looks nicer, but I'm not sure that
isn't just a coincidence.  Anyway, the register allocator is a known painful
spot in gcc.  Vlad is working on that area, but unless the fix turns out to be
very obvious, the chances of backporting it to 4.1.x-RH are close to nil; it
would be a terribly risky change.
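
A note on reading the dump: the const_int -16777216 is the same 32-bit mask
as 0xFF000000, and the bracketed hex value looks like that number
sign-extended to 64 bits by the dump printer (that last part is an
assumption about how the value is printed). A minimal check in standard C,
with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/* Checks that the RTL constant and the mask 0xFF000000 are the same bits. */
void check_const_int(void)
{
    int32_t rtl_value = -16777216;     /* const_int as printed in the RTL */

    /* Signed-to-unsigned conversion is modular, so this is well defined. */
    uint32_t as_u32 = (uint32_t)rtl_value;

    /* Sign-extend to 64 bits first, then reinterpret as unsigned. */
    uint64_t as_u64 = (uint64_t)(int64_t)rtl_value;

    assert(as_u32 == UINT32_C(0xFF000000));
    assert(as_u64 == UINT64_C(0xffffffffff000000));
}
```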

Comment 4 Søren Sandmann Pedersen 2007-06-18 20:25:35 UTC
Yeah, I wasn't really expecting any backporting. Feel free to close this bug if
it isn't useful.

Note though that this issue is a real problem for the cairo and X server
rendering code.


Comment 5 Jakub Jelinek 2007-06-18 23:48:52 UTC
cairo or X can work around this by widening the counter and indexing in the
inner loop:
s/uint16_t w;/uint32_t w;/
    while (height--)
    {
        dst = dstLine;
        dstLine += dstStride;
        src = srcLine;
        srcLine += srcStride;
        for (w = 0; w < width; w++)
          dst[w] = src[w] | 0xFF000000;
    }
With that change the inner loop becomes:
.L6:
        movl    -4(%edx,%ebx,4), %eax
        orl     $-16777216, %eax
        movl    %eax, -4(%ecx,%ebx,4)
        addl    $1, %ebx
        cmpl    -24(%ebp), %ebx
        je      .L4
        jmp     .L6

The question is whether this is better code only on register-starved i?86
(which ought to die soon), or on other arches too.
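
To try the workaround outside cairo or X, the fragment above can be wrapped
in a function and compared against the pointer-bumping form. This is only a
sketch: the function names, the signatures and the pointer-bumping variant
(assumed to be the shape of the original loop) are hypothetical, and strides
are taken to be in pixels.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Indexed form from the fragment above, with a 32-bit counter. */
void fill_alpha_indexed(uint32_t *dstLine, const uint32_t *srcLine,
                        int dstStride, int srcStride,
                        uint32_t width, uint32_t height)
{
    uint32_t *dst;
    const uint32_t *src;
    uint32_t w;                         /* was: uint16_t w; */

    while (height--) {
        dst = dstLine;
        dstLine += dstStride;
        src = srcLine;
        srcLine += srcStride;
        for (w = 0; w < width; w++)
            dst[w] = src[w] | 0xFF000000;
    }
}

/* Assumed original shape: pointer bumps and a 16-bit countdown. */
static void fill_alpha_ptr16(uint32_t *dstLine, const uint32_t *srcLine,
                             int dstStride, int srcStride,
                             uint16_t width, uint16_t height)
{
    while (height--) {
        uint32_t *dst = dstLine;
        const uint32_t *src = srcLine;
        uint16_t w = width;

        dstLine += dstStride;
        srcLine += srcStride;
        while (w--)
            *dst++ = *src++ | 0xFF000000;
    }
}

/* Returns 1 if both forms write identical pixels for a small image. */
int forms_agree(void)
{
    enum { W = 5, H = 3, STRIDE = 8 };
    uint32_t src[H * STRIDE], a[H * STRIDE], b[H * STRIDE];
    unsigned i;

    for (i = 0; i < H * STRIDE; i++)
        src[i] = i * 0x01020304u;       /* arbitrary pixel values */
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);

    fill_alpha_indexed(a, src, STRIDE, STRIDE, W, H);
    fill_alpha_ptr16(b, src, STRIDE, STRIDE, W, H);

    assert(a[0] == 0xFF000000u);        /* src[0] is 0, alpha forced on */
    return memcmp(a, b, sizeof a) == 0;
}
```

Compiling fill_alpha_indexed with gcc -O2 -S on other targets would be one
way to look into the question above.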

Comment 6 Jakub Jelinek 2007-06-20 09:24:48 UTC
Tracking upstream.