50651 – gcc 2.96-85 or 3.0 produces code that runs 1/2 as fast as 2.95 on an Athlon

Summary: gcc 2.96-85 or 3.0 produces code that runs 1/2 as fast as 2.95 on an Athlon

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	gcc
Sub Component:
Version:	7.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:	David Lawrence
Docs Contact:
URL:	http://www.cs.utk.edu/~rwhaley/ATLAS/...
Whiteboard:
Depends On:
Blocks:

Reported:	2001-08-01 21:40 UTC by R. Clint Whaley
Modified:	2007-04-18 16:35 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2001-08-01 21:40:39 UTC
Embargoed:

Description R. Clint Whaley 2001-08-01 21:40:35 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux 2.2.12-20 i686; en-US; rv:0.9)
Gecko/20010507

Description of problem:
This is floating-point-intensive code that uses register blocking.
Apparently, 2.96-8x and gcc 3.0 use a load scheduling algorithm which is a
disaster for Athlons.  Once the load pattern is fixed manually, it appears
that the FPU stack is used less efficiently by 3.0, but 2.96-8x is adequate
once optimization is increased.
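
For readers unfamiliar with the term, the sketch below illustrates what "register blocking" means at the source level. It is a hypothetical example, not the ATLAS routine from the URL above: the function name `mm2x2_regblock`, the 2x2 block size, and the row-major layout are all assumptions made for illustration. The four partial sums live in local scalars, so the compiler can keep them in floating point registers for the whole loop, and how it schedules the four loads per iteration is exactly the kind of decision this report is about.

```c
#include <stddef.h>

/* Hypothetical illustration of register blocking (not the ATLAS kernel).
 * Computes C (2x2) += A (2xK, row-major) * B (Kx2, row-major).
 * The four accumulators are local scalars, so they can stay in
 * floating point registers across all K iterations. */
void mm2x2_regblock(size_t K, const double *A, const double *B, double *C)
{
    double c00 = C[0], c01 = C[1], c10 = C[2], c11 = C[3];

    for (size_t k = 0; k < K; k++) {
        /* Four loads feed four multiply-adds. */
        const double a0 = A[k];         /* A[0][k] */
        const double a1 = A[K + k];     /* A[1][k] */
        const double b0 = B[2 * k];     /* B[k][0] */
        const double b1 = B[2 * k + 1]; /* B[k][1] */

        c00 += a0 * b0;
        c01 += a0 * b1;
        c10 += a1 * b0;
        c11 += a1 * b1;
    }

    /* Write the block back to memory once, after the loop. */
    C[0] = c00; C[1] = c01; C[2] = c10; C[3] = c11;
}
```

On x87 the accumulators and operands all have to share the eight-entry FPU stack, which is why both the load pattern and the stack usage affect a kernel of this shape.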

How reproducible:
Always

Steps to Reproduce:
1. The test case is linked from the URL above; the direct URL is:
   http://www.cs.utk.edu/~rwhaley/ATLAS/gcc30.html#dup

Actual Results:  You will see that gcc 2.95.2 gets 1350 Mflops for this
example routine, while the newer compilers get roughly 730 Mflops.  Routines
showing that the changed fetch operations are the cause are also given.

Expected Results:  The new compilers should produce code in the same
performance range as the old one.

Additional info:

Comment 1 Jakub Jelinek 2001-11-19 20:16:44 UTC

The thing is that GCC's reg-stack doesn't map very well onto this Athlon misfeature.
According to Jan Hubicka, fixing this would need a complete reg-stack rewrite.
With newer AMD CPUs, which will come with SSE2, this will be a non-issue.
Certainly, this is not going to be fixed in gcc-2.96-RH or in gcc 3.0.x
(both are currently in bugfix-only mode), nor in gcc 3.1.x (which is going
to freeze on December 15th).

Comment 2 R. Clint Whaley 2001-11-19 20:27:00 UTC

I note that SSE2, if the newer compilers support it well (vectorizing is not
always trivial), will lessen the need for x87 support on the Hammer and NetBurst
archs.  However, since this bug slows down the Athlon, PIII, PII, and PPro archs,
are you really sure that blowing it off is the right answer?  Since the present
scheme offers no speedup on any arch, what is the argument for keeping it?

Comment 3 Jakub Jelinek 2001-11-19 20:35:37 UTC

I don't understand the question. reg-stack is the fixup code which changes
flat numbered float registers to the crappy Intel stack-based FPU.
GCC was always using reg-stack; it is just that increasingly aggressive
optimizations produce RTL which needs to be fixed up for stack FPU regs
more and more cleverly.
So there is no question of whether to keep reg-stack or not; GCC wouldn't work
at all for Intel FPUs without it. The question is whether somebody improves
the current reg-stack or whether somebody rewrites it from scratch, both of
which are time-consuming efforts.
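
To make the flat-register versus stack mismatch concrete, here is a toy model in C. It is only a sketch under loud assumptions and is not GCC's reg-stack pass: the function `count_fxch` is hypothetical, and the model pretends that every use of a value requires it to be at the top of the x87 stack, which real x87 arithmetic instructions do not always require. What it does capture is that `fxch` exchanges the top of the stack with another slot, so the order in which values are used determines how many exchanges are needed.

```c
#include <stddef.h>

#define X87_DEPTH 8 /* the x87 register stack has eight slots */

/* Toy model, NOT GCC's reg-stack algorithm.
 * stack[i] holds the flat (virtual) register number currently living in
 * st(i).  uses[] is the order in which flat registers are needed.
 * Simplifying assumption: a value must be in st(0) to be used, and the
 * only way to get it there is an fxch with the slot that holds it.
 * Returns the number of fxch operations the sequence would need. */
size_t count_fxch(int stack[X87_DEPTH], const int *uses, size_t n)
{
    size_t swaps = 0;

    for (size_t i = 0; i < n; i++) {
        size_t pos = 0;
        while (pos < X87_DEPTH && stack[pos] != uses[i])
            pos++;
        if (pos == X87_DEPTH)
            continue; /* not on the stack; the toy model ignores it */
        if (pos != 0) {
            /* fxch st(pos): exchange st(0) with st(pos) */
            int tmp = stack[0];
            stack[0] = stack[pos];
            stack[pos] = tmp;
            swaps++;
        }
    }
    return swaps;
}
```

In this model, alternating between two values (0, 1, 0, 1) costs three exchanges, while grouping the uses (0, 0, 1, 1) costs one, which hints at why the instruction order handed to the fixup pass matters.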

Comment 4 R. Clint Whaley 2001-11-19 20:41:47 UTC

The point is that something changed between 2.95.3 and 2.96/3.0, which results
in poorer register stack usage on all x86 architectures.  My question,
therefore, is if the change is bad for all architectures, why make it?  Why not
go back to the register handling of the 2.95.3 code, at a minimum?  As I say,
the new code is worse on all architectures.

The Athlon-specific problem is the memory fetch pattern, which I think ought to
be fixed, though I would certainly better understand blowing that one off.  This
problem, unlike the register stack usage, can be fixed by the programmer by
doing the scheduling by hand.
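
As a rough idea of what doing the load scheduling by hand can look like at the source level, the sketch below issues the loads for the next iteration before the arithmetic of the current one. This is an assumption-laden illustration, not the routine from the ATLAS page: the name `dot_presched` is hypothetical, a dot product stands in for the real kernel, and a compiler remains free to reorder these statements, so the generated assembly has to be checked.

```c
#include <stddef.h>

/* Hypothetical sketch of hand-scheduled loads (not the ATLAS routine).
 * Each iteration first loads the operands for the NEXT multiply, then
 * performs the multiply-add on values loaded one iteration earlier. */
double dot_presched(size_t n, const double *x, const double *y)
{
    if (n == 0)
        return 0.0;

    double sum = 0.0;
    double xv = x[0], yv = y[0];    /* loads for iteration 0 */

    for (size_t i = 1; i < n; i++) {
        const double xn = x[i];     /* issue the next loads first */
        const double yn = y[i];
        sum += xv * yv;             /* then use the earlier loads */
        xv = xn;
        yv = yn;
    }
    sum += xv * yv;                 /* drain the last loaded pair */
    return sum;
}
```

The result is the same as a plain dot product loop; only the position of the loads relative to the arithmetic changes.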

Comment 5 R. Clint Whaley 2001-11-20 04:39:54 UTC

Just reviewed the above, and realized it may not be clear what I am talking
about.  I was assuming you had followed the link,
   http://www.cs.utk.edu/~rwhaley/ATLAS/gcc30.html

where I discuss that there are two problems, one specific to the Athlon
(scheduling that is good for Pentium, bad for Athlon), and one that is bad for
all x86 archs in existence (excessive use of fxch in assembler).  gcc 2.95.3 had
neither of these problems.  I can see going for Pentium fetch to the detriment
of the Athlon, though it seems funny you can't keep the old one around, but
since the register stack usage is inferior on all machines, I can't imagine the
argument for using it.  I'm sure it makes things cleaner at the compiler level,
or something, but a performance drop on all x86 archs seems a high price to pay
when a better algorithm already existed . . .
