Red Hat Bugzilla – Bug 50651
gcc 2.96-85 or 3.0 produces code that runs 1/2 as fast as 2.95 on an Athlon
Last modified: 2007-04-18 12:35:23 EDT
Description of problem:
This is a floating point intensive code, which uses register blocking.
Apparently, 2.96-8x and gcc 3.0 use a load scheduling algorithm which is a
disaster for athlons. Once the load pattern is fixed manually, it appears
that the fpu stack is used less efficiently by 3.0, but 2.96-8x is adequate
once optimization is increased.
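
For readers unfamiliar with the term, here is a minimal sketch of what register blocking can look like in C. It is not the routine from this report; the function name mm2x2 and the row-major layout are assumptions made for illustration. A 2x2 block of the result is held in local variables so the compiler can keep it in floating point registers across the whole inner loop, which is exactly where load scheduling and FPU stack usage decide the speed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical micro-kernel (not the code from this bug report):
 * C (2x2) += A (2xK) * B (Kx2), all matrices stored row-major. */
void mm2x2(size_t K, const double *A, const double *B, double *C)
{
    assert(A && B && C);

    /* The 2x2 block of C lives in locals for the whole loop, so it can
     * stay in registers instead of being reloaded each iteration. */
    double c00 = C[0], c01 = C[1], c10 = C[2], c11 = C[3];

    for (size_t k = 0; k < K; k++) {
        /* Four loads feed four multiply-adds. */
        double a0 = A[k], a1 = A[K + k];
        double b0 = B[2 * k], b1 = B[2 * k + 1];
        c00 += a0 * b0;
        c01 += a0 * b1;
        c10 += a1 * b0;
        c11 += a1 * b1;
    }

    /* Write the block back once, after the loop. */
    C[0] = c00; C[1] = c01; C[2] = c10; C[3] = c11;
}
```

With four accumulators and four loaded values live at once, a kernel like this uses all eight x87 registers, so the order of the loads and the number of exchanges the compiler inserts matter a great deal.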
Steps to Reproduce:
1. Given at the URL above; here's the direct URL:
Actual Results: You will see that gcc 2.95.2 gets 1350 Mflops for this
example routine, while the newer compilers get roughly 730 Mflops. Routines
showing that the changed fetch operations are the cause are given
Expected Results: The new compilers should produce code in the same performance
range as the old one.
The thing is that GCC's reg-stack doesn't map very well onto this Athlon misfeature.
According to Jan Hubicka, this would need complete reg-stack rewrite.
With newer AMD CPUs, which will come with SSE2, this will be a non-issue.
Certainly, this is not going to be fixed in gcc-2.96-RH or gcc 3.0.x
(both are currently in bugfix-only mode), nor in gcc 3.1.x (which is going
to freeze on December 15th).
I note that SSE2, if the newer compilers support it well (vectorizing is not
always trivial) will lessen the need for x87 support in hammer and Netburst
archs. However, since this bug slows down Athlon, PIII, PII, PPRO archs, are
you really sure that blowing it off is the right answer? Since the present
scheme offers no speedup on any arch, what is the argument for keeping it?
Don't understand the question. reg-stack is the fixup code which changes
flat numbered float registers to the crappy Intel stack based FPU.
GCC has always used reg-stack; it is just that increasingly aggressive
optimizations produce RTL which needs to be fixed up for the stack FPU regs
more and more cleverly.
So there is no question whether to keep reg-stack or not, GCC wouldn't work
at all for Intel FPUs without it. The question is whether somebody improves
the current reg-stack or whether somebody rewrites it from scratch, both of
which are time consuming efforts.
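
To make the fixup concrete, here is a toy model in C of mapping flat numbered registers onto an x87-style stack. It is a sketch under stated assumptions and not GCC's reg-stack code: the struct x87_stack and the x87_* functions are hypothetical names invented for this example. It relies only on how the hardware behaves: a load pushes onto the stack, most arithmetic wants an operand in st(0), and fxch swaps st(0) with st(i).

```c
#include <assert.h>

#define X87_DEPTH 8   /* the x87 has eight stack registers, st(0)..st(7) */

/* Toy model, not GCC's data structure: slot[0] is st(0). Each slot holds
 * the id of the flat register currently living there, or -1 if empty. */
struct x87_stack {
    int slot[X87_DEPTH];
    int depth;
    int fxch_count;   /* how many exchanges the fixup had to emit */
};

void x87_init(struct x87_stack *s)
{
    for (int i = 0; i < X87_DEPTH; i++)
        s->slot[i] = -1;
    s->depth = 0;
    s->fxch_count = 0;
}

/* Models a load (fld): the new value becomes st(0), the rest move down. */
void x87_push(struct x87_stack *s, int vreg)
{
    assert(s->depth < X87_DEPTH);
    for (int i = s->depth; i > 0; i--)
        s->slot[i] = s->slot[i - 1];
    s->slot[0] = vreg;
    s->depth++;
}

/* Returns i such that st(i) holds vreg, or -1 if it is not on the stack. */
int x87_find(const struct x87_stack *s, int vreg)
{
    for (int i = 0; i < s->depth; i++)
        if (s->slot[i] == vreg)
            return i;
    return -1;
}

/* Brings vreg to st(0); costs one fxch unless it is already on top. */
void x87_to_top(struct x87_stack *s, int vreg)
{
    int i = x87_find(s, vreg);
    assert(i >= 0);
    if (i != 0) {
        int tmp = s->slot[0];
        s->slot[0] = s->slot[i];
        s->slot[i] = tmp;
        s->fxch_count++;
    }
}
```

In this model, the order in which the flat registers are used determines how many exchanges are needed, which is why more aggressive scheduling upstream makes the fixup harder to do well.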
The point is that something changed between 2.95.3 and 2.96/3.0, which results
in poorer register stack usage on all x86 architectures. My question,
therefore, is if the change is bad for all architectures, why make it? Why not
go back to the register handling of the 2.95.3 code, at a minimum? As I say,
the new code is worse on all architectures.
The Athlon-specific problem is the memory fetch pattern, which I think ought to be
fixed, though I could certainly better understand blowing that one off. This problem,
unlike the register stack, can be fixed by the programmer by hand-doing the
Just reviewed the above, and realized it is possible that it is not clear what I
am talking about. I was assuming you took the link,
Where I discuss that there are two problems, one specific to the Athlon
(scheduling that is good for Pentium, bad for Athlon), and one that is bad for
all x86 archs in existence (excessive use of fxch in assembler). gcc 2.95.3 had
neither of these problems. I can see going for Pentium fetch to the detriment
of the Athlon, though it seems funny you can't keep the old one around, but
since the register stack usage is inferior on all machines, I can't imagine the
argument for using it. I'm sure it makes things cleaner at the compiler level,
or something, but a performance drop on all x86 archs seems a high price to pay,
when a better algorithm was already preexisting . . .