From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.12-20 i686; en-US; rv:0.9) Gecko/20010507

Description of problem:
This is floating-point-intensive code which uses register blocking. Apparently, 2.96-8x and gcc 3.0 use a load scheduling algorithm which is a disaster for Athlons. Once the load pattern is fixed manually, it appears that the FPU stack is used less efficiently by 3.0, but 2.96-8x is adequate once optimization is increased.

How reproducible:
Always

Steps to Reproduce:
1. Given on the URL above; here's the direct URL: http://www.cs.utk.edu/~rwhaley/ATLAS/gcc30.html#dup

Actual Results:
You will see that gcc 2.95.2 gets 1350 Mflops for this example routine, while the newer compilers get roughly 730 Mflops. Routines showing that the changed fetch operations are the cause are given.

Expected Results:
New compiler should produce code in the same performance range as the old.

Additional info:
The thing is that GCC's reg-stack doesn't map very well onto this Athlon misfeature. According to Jan Hubicka, this would need a complete reg-stack rewrite. With newer AMD CPUs, which will come with SSE2, this will be a non-issue. Certainly, this is not going to be fixed in gcc-2.96-RH or in gcc 3.0.x (both are currently in bugfix-only mode), nor in gcc 3.1.x (which is going to freeze on December 15th).
I note that SSE2, if the newer compilers support it well (vectorizing is not always trivial), will lessen the need for x87 support on the Hammer and Netburst archs. However, since this bug slows down the Athlon, PIII, PII and PPro archs, are you really sure that blowing it off is the right answer? Since the present scheme offers no speedup on any arch, what is the argument for keeping it?
Don't understand the question. reg-stack is the fixup code which changes flat numbered float registers to the crappy Intel stack-based FPU. GCC has always used reg-stack; it is just that more and more aggressive optimizations lead to RTL which needs to be fixed up for the stack FPU regs more and more cleverly. So there is no question of whether to keep reg-stack or not: GCC wouldn't work at all for Intel FPUs without it. The question is whether somebody improves the current reg-stack or rewrites it from scratch, both of which are time-consuming efforts.
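To make the fixup step above concrete, here is a toy model in C of mapping flat register references onto a stack FPU. It is only a sketch under loud assumptions: it is not GCC's reg-stack pass, the names (`x87_model`, `model_load`, `model_to_top`, `count_fxch`) are made up for illustration, and it pretends every use needs its operand in st(0), whereas real x87 arithmetic can take one operand from st(i).

```c
#include <assert.h>

#define X87_DEPTH 8   /* the x87 register stack has 8 slots */

/* Toy model (hypothetical): st[i] holds the id of the flat register in st(i). */
struct x87_model {
    int st[X87_DEPTH];
    int depth;        /* number of live stack slots */
    int fxch_count;   /* exchanges emitted so far */
};

/* Model of fld: push flat register vreg; older entries move one slot down. */
static void model_load(struct x87_model *m, int vreg)
{
    assert(m->depth < X87_DEPTH);
    for (int i = m->depth; i > 0; i--)
        m->st[i] = m->st[i - 1];
    m->st[0] = vreg;
    m->depth++;
}

/* Bring vreg to st(0); if it lives elsewhere, model an fxch st(i). */
static void model_to_top(struct x87_model *m, int vreg)
{
    for (int i = 0; i < m->depth; i++) {
        if (m->st[i] == vreg) {
            if (i != 0) {
                int tmp = m->st[0];
                m->st[0] = m->st[i];
                m->st[i] = tmp;
                m->fxch_count++;
            }
            return;
        }
    }
    assert(!"vreg is not on the modelled stack");
}

/* Load flat registers 0..n_live-1, then count the exchanges needed to
 * touch them in the order given by uses[] (simplified: use == at top). */
int count_fxch(const int *uses, int n_uses, int n_live)
{
    struct x87_model m = { .depth = 0, .fxch_count = 0 };
    for (int v = 0; v < n_live; v++)
        model_load(&m, v);
    for (int i = 0; i < n_uses; i++)
        model_to_top(&m, uses[i]);
    return m.fxch_count;
}
```

In this model, repeatedly using the register already on top costs no exchanges, while cycling round-robin through four live accumulators costs one exchange per use. That is the kind of access pattern register-blocked code produces, so the order in which the fixup places values on the stack matters.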
The point is that something changed between 2.95.3 and 2.96/3.0 which results in poorer register stack usage on all x86 architectures. My question, therefore, is: if the change is bad for all architectures, why make it? Why not go back to the register handling of the 2.95.3 code, at a minimum? As I say, the new code is worse on all architectures. The Athlon-specific problem is the memory fetch pattern, which I think ought to be fixed, though I would better understand that one being blown off. This problem, unlike the register stack, can be fixed by the programmer by scheduling the loads by hand.
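For readers who do not follow the link, a minimal C sketch of what register blocking with hand-ordered loads looks like. This is not the ATLAS routine from the URL: the function name `dot1x4`, the 1x4 blocking factor and the column layout are assumptions chosen for illustration, and a compiler remains free to reorder the loads, which is exactly the complaint.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 1x4 register-blocked kernel:
 *   c[j] += sum over k of a[k] * b[j*ldb + k],  j = 0..3
 * The four accumulators live in locals so they can stay in FPU registers
 * for the whole loop instead of going through memory each iteration. */
void dot1x4(size_t k_len, const double *a, const double *b, size_t ldb,
            double *c)
{
    assert(a && b && c);

    double c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];
    const double *b0 = b;
    const double *b1 = b + ldb;
    const double *b2 = b + 2 * ldb;
    const double *b3 = b + 3 * ldb;

    for (size_t k = 0; k < k_len; k++) {
        /* Hand scheduling: all loads are written first, in a fixed order,
         * ahead of the arithmetic that consumes them. */
        double ra  = a[k];
        double rb0 = b0[k];
        double rb1 = b1[k];
        double rb2 = b2[k];
        double rb3 = b3[k];

        c0 += ra * rb0;
        c1 += ra * rb1;
        c2 += ra * rb2;
        c3 += ra * rb3;
    }

    c[0] = c0;
    c[1] = c1;
    c[2] = c2;
    c[3] = c3;
}
```

Whether the source-level load order survives into the assembly depends on the compiler version and flags, so comparing the generated code (for example with `gcc -S`) is the only way to see which fetch pattern was actually emitted.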
Just reviewed the above, and realized it may not be clear what I am talking about. I was assuming you took the link, http://www.cs.utk.edu/~rwhaley/ATLAS/gcc30.html, where I discuss that there are two problems: one specific to the Athlon (scheduling that is good for Pentium, bad for Athlon), and one that is bad for all x86 archs in existence (excessive use of fxch in the assembler). gcc 2.95.3 had neither of these problems. I can see going for the Pentium fetch pattern to the detriment of the Athlon, though it seems funny you can't keep the old one around, but since the register stack usage is inferior on all machines, I can't imagine the argument for using it. I'm sure it makes things cleaner at the compiler level, or something, but a performance drop on all x86 archs seems a high price to pay when a better algorithm already existed . . .