Bug 50651 - gcc 2.96-85 or 3.0 produces code that runs 1/2 as fast as 2.95 on an Athlon
Product: Red Hat Linux
Classification: Retired
Component: gcc
Hardware: i686 Linux
Priority: medium, Severity: medium
Assigned To: Jakub Jelinek
David Lawrence
Reported: 2001-08-01 17:40 EDT by R. Clint Whaley
Modified: 2007-04-18 12:35 EDT (History)
Doc Type: Bug Fix
Last Closed: 2001-08-01 17:40:39 EDT

Description R. Clint Whaley 2001-08-01 17:40:35 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.12-20 i686; en-US; rv:0.9)

Description of problem:
This is floating-point-intensive code that uses register blocking.
Apparently, 2.96-8x and gcc 3.0 use a load scheduling algorithm that is a
disaster for Athlons.  Once the load pattern is fixed manually, it appears
that the FPU stack is used less efficiently by 3.0, but 2.96-8x is adequate
once optimization is increased.
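
For readers unfamiliar with the term, register blocking can be sketched roughly as follows. This is only an illustration in Python, assuming a matrix multiply as the workload (the report does not show the routine itself); the function name `matmul_2x2_blocked` and the 2x2 block size are hypothetical. Local scalars stand in for FPU registers.

```python
def matmul_2x2_blocked(A, B):
    """Return A * B for square lists of lists whose size n is even.

    Hypothetical illustration only; the routine in the report is not shown.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(0, n, 2):
        for j in range(0, n, 2):
            # The 2x2 block of C lives in four scalars ("registers")
            # for the whole k loop instead of being re-read from memory.
            c00 = c01 = c10 = c11 = 0.0
            for k in range(n):
                # Four loads feed four multiply-adds, so each loaded
                # value is used twice.
                a0, a1 = A[i][k], A[i + 1][k]
                b0, b1 = B[k][j], B[k][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            C[i][j], C[i][j + 1] = c00, c01
            C[i + 1][j], C[i + 1][j + 1] = c10, c11
    return C
```

In compiled code, the order in which the compiler issues those loads relative to the arithmetic is presumably what the report means by the load pattern.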

How reproducible:

Steps to Reproduce:
1. Given on the URL above, here's the direct URL:

Actual Results:  You will see that gcc 2.95.2 gets 1350 Mflops for this
example routine, while the newer compilers get roughly 730 Mflops.  Routines
showing that the changed fetch operations are the cause are given

Expected Results:  New compiler should produce code in same performance
range as old.

Additional info:
Comment 1 Jakub Jelinek 2001-11-19 15:16:44 EST
The thing is that GCC's reg-stack doesn't map very well onto this Athlon misfeature.
According to Jan Hubicka, fixing this would need a complete reg-stack rewrite.
With newer AMD CPUs, which will come with SSE2, this will be a non-issue.
Certainly, this is not going to be fixed in gcc-2.96-RH or in gcc 3.0.x
(both are currently in bugfix-only mode), nor in gcc 3.1.x (which is going
to freeze on December 15th).
Comment 2 R. Clint Whaley 2001-11-19 15:27:00 EST
I note that SSE2, if the newer compilers support it well (vectorizing is not
always trivial), will lessen the need for x87 support on the Hammer and NetBurst
archs.  However, since this bug slows down the Athlon, PIII, PII, and PPro archs, are
you really sure that blowing it off is the right answer?  Since the present
scheme offers no speedup on any arch, what is the argument for keeping it?
Comment 3 Jakub Jelinek 2001-11-19 15:35:37 EST
I don't understand the question. reg-stack is the fixup code which changes
flat-numbered float registers to the crappy Intel stack-based FPU.
GCC has always used reg-stack; it is just that more and more aggressive
optimizations lead to RTL which needs to be fixed up for stack FPU regs
more and more cleverly.
So there is no question whether to keep reg-stack or not; GCC wouldn't work
at all for Intel FPUs without it. The question is whether somebody improves
the current reg-stack or whether somebody rewrites it from scratch, both of
which are time-consuming efforts.
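
To make the fixup step concrete, here is a toy sketch in Python. It is not GCC's reg-stack pass, and it ignores register deaths and pops. It only models the x87 constraint that one operand of an arithmetic instruction must be st(0), with fxch as the way to bring a value to the top. The function name `lower_to_stack` and the tuple encoding of instructions are assumptions made for illustration.

```python
def lower_to_stack(instrs, initial_stack):
    """Rewrite flat two-operand float ops into x87-style stack ops.

    instrs: list of (op, dst, src) over flat virtual register names,
            meaning dst = dst <op> src (toy encoding, not GCC RTL).
    initial_stack: virtual register names, index 0 being st(0).
    Returns (emitted_instructions, final_stack).
    """
    stack = list(initial_stack)
    out = []
    for op, dst, src in instrs:
        d, s = stack.index(dst), stack.index(src)
        if d != 0 and s != 0:
            # Neither operand is on top: swap dst into st(0).
            out.append(f"fxch st({d})")
            stack[0], stack[d] = stack[d], stack[0]
            d, s = 0, stack.index(src)
        if d == 0:
            out.append(f"f{op} st(0), st({s})")
        else:
            # src is on top, so the result can go straight to st(d).
            out.append(f"f{op} st({d}), st(0)")
    return out, stack


program = [("add", "c", "d"), ("mul", "a", "b"), ("add", "c", "a")]
asm, final = lower_to_stack(program, ["a", "b", "c", "d"])
print("\n".join(asm))
print("fxch count:", sum(line.startswith("fxch") for line in asm))
```

In this toy model the three-instruction program needs two fxch instructions, while issuing the same three instructions with the `mul` first needs only one. The number of swaps depends on the order in which the earlier passes emit the instructions, which is one way to read the remark that more aggressive optimization makes the fixup harder.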
Comment 4 R. Clint Whaley 2001-11-19 15:41:47 EST
The point is that something changed between 2.95.3 and 2.96/3.0, which results
in poorer register stack usage on all x86 architectures.  My question,
therefore, is if the change is bad for all architectures, why make it?  Why not
go back to the register handling of the 2.95.3 code, at a minimum?  As I say,
the new code is worse on all architectures.

The Athlon-specific problem is the memory fetch pattern, which I think ought to be
fixed, though I would certainly better understand blowing that one off.  This problem,
unlike the register stack, can be fixed by the programmer by hand-doing the
Comment 5 R. Clint Whaley 2001-11-19 23:39:54 EST
Just reviewed the above, and realized it is possible that it is not clear what I
am talking about.  I was assuming you took the link,

Where I discuss that there are two problems: one specific to the Athlon
(scheduling that is good for Pentium, bad for Athlon), and one that is bad for
all x86 archs in existence (excessive use of fxch in the generated assembly).  gcc 2.95.3 had
neither of these problems.  I can see going for Pentium fetch to the detriment
of the Athlon, though it seems funny you can't keep the old one around, but
since the register stack usage is inferior on all machines, I can't imagine the
argument for using it.  I'm sure it makes things cleaner at the compiler level,
or something, but a performance drop on all x86 archs seems a high price to pay
when a better algorithm already existed . . .
