Bug 109358 - Huge performance loss when running linpack and commercial numerical apps
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 9
Hardware: athlon
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact:
URL: none
Whiteboard:
Depends On:
Blocks:
Reported: 2003-11-07 00:27 UTC by Ognjen Milic
Modified: 2005-10-31 22:00 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-09-30 15:41:42 UTC
Embargoed:


Attachments
Linpack bench converted to C for Athlon vs. P4 speed test (21.87 KB, text/plain)
2003-11-07 00:40 UTC, Ognjen Milic

Description Ognjen Milic 2003-11-07 00:27:51 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624

Description of problem:
A system with an AthlonMP 2800 runs the linpack benchmark, as well as
commercial numerical apps from Synopsys, 50% slower than a P4
2.4GHz. Both machines have single-channel DDR266 memory.
Under Win32, machines with matched CPU performance and
matched memory architecture show no difference between P4 and
Athlon linpack performance (e.g. P4 2.0GHz vs. Athlon 2000+, both
running DDR266).
Please see:
http://www.tech-report.com/reviews/2002q1/northwood-vs-2000/index.x?pg=3

(look at the rightmost part of the graph)
The numbers I got are:

P4 2.4GHz      - 180 Mflops (kernel 2.4.20-8)
AthlonMP 2800  - 100 Mflops (kernel 2.4.20-20.9smp)
AthlonMP 2200  -  78 Mflops (kernel 2.4.2-2smp)

I will be happy to provide you with the linpack source I used in this
test. 
I compiled it with gcc using the following command:
cc -DDP -DUNROLL -O1 clinpack1000.c -lm -o clinpack1000.exe



Version-Release number of selected component (if applicable):
kernel-2.4.20-20.9smp and earlier

How reproducible:
Always

Steps to Reproduce:
1. Compile the clinpack1000.c source for matrix order of 1000 (n=1000)
2. Run on P4
3. Run on Athlon
4. Compare the performance
    

Actual Results:  The Athlon delivers roughly 50% of the P4's performance.

Expected Results:  The Athlon 2800 should perform somewhat better than
the P4 2.4GHz.
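
For readers comparing the Mflops figures in this report, here is a minimal sketch of how such a figure is conventionally derived from a timed run. It assumes the attached source uses the standard LINPACK operation count of 2/3*n^3 + 2*n^2 for an n x n system; the function names linpack_flops and linpack_mflops are hypothetical and are not taken from clinpack1000.c.

```c
#include <assert.h>

/* Standard LINPACK operation count for factoring and solving an
   n x n dense system: 2/3*n^3 + 2*n^2 floating-point operations.
   (Assumption: the attached benchmark uses this conventional count.) */
static double linpack_flops(int n)
{
    assert(n > 0);
    double dn = (double)n;
    return (2.0 * dn * dn * dn) / 3.0 + 2.0 * dn * dn;
}

/* Convert a measured run time in seconds into Mflops.
   Returns 0.0 if the time is not positive (e.g. timer too coarse). */
static double linpack_mflops(int n, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return linpack_flops(n) / (seconds * 1.0e6);
}
```

Under that assumption, n=1000 is about 668.7 million operations, so 180 Mflops corresponds to a solve time of roughly 3.7 seconds and 100 Mflops to roughly 6.7 seconds.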

Additional info:

my email address is ognjen.milic
the person in Synopsys Inc. who can confirm these findings is
andrey.kucherov

Comment 1 Ognjen Milic 2003-11-07 00:40:39 UTC
Created attachment 95780 [details]
Linpack bench converted to C  for Athlon vs. P4 speed test

Please compile with the following command:

cc -DDP -DUNROLL -O1 clinpack1000.c -lm -o clinpack1000.exe

Comment 2 Kostas Georgiou 2004-05-12 11:44:58 UTC
On an AthlonMP 2000+ here (RH9, Tyan Tiger S2466, one CPU only) I get
120 Mflops with your options.

If I optimize with:
-O3 -march=athlon -msse -mfpmath=sse -malign-double
-mpreferred-stack-boundary=4 -falign-loops=4
it goes up to 130 Mflops.

For the record, on a P4 2GHz with DDR RAM I get 170 Mflops, and on a
dual 2GHz with RDRAM I get 250 Mflops, with a linpack binary compiled
with ifc (version 7, if I remember correctly).

Linpack performance is directly correlated to memory bandwidth. Since
you get bad performance, I suspect something is wrong with your
RAM/motherboard.
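
One way to check the memory bandwidth hypothesis independently of linpack is a simple streaming loop run on both machines. The following is only a rough sketch in the spirit of a "triad" bandwidth test, not the STREAM benchmark itself; the function name triad_mb_per_sec is hypothetical, and clock() measures CPU time, so the result is only a coarse indication.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: runs a[i] = b[i] + s * c[i] over n doubles,
   `reps` times, and returns an approximate rate in MB/s.
   Returns -1.0 on bad arguments or allocation failure, and 0.0 if the
   elapsed time was too short for clock() to measure. */
static double triad_mb_per_sec(size_t n, int reps)
{
    if (n == 0 || reps <= 0)
        return -1.0;

    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }

    /* Touch every element first so page faults are not timed. */
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0; b[i] = 1.0; c[i] = 2.0;
    }

    volatile double sink = 0.0;   /* keeps the results observable */
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) {
        double s = (double)(r + 1);
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
        sink += a[(size_t)r % n];
    }
    clock_t t1 = clock();
    (void)sink;

    free(a); free(b); free(c);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0)
        return 0.0;

    /* Each element update reads two doubles and writes one. */
    double bytes = 3.0 * (double)n * sizeof(double) * (double)reps;
    return bytes / secs / 1.0e6;
}
```

For a meaningful number the arrays should be much larger than the CPU caches, for example a few million doubles each, and the same binary should be run on each machine being compared.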


Comment 3 Ognjen Milic 2004-05-12 17:08:37 UTC
It is nice to see that tweaking compiler options yields better
performance, but that is beside the point here.

The issue is that there is a huge difference between P4 and 2P Athlon
performance when running the same executable, even though the
P4 and the 2P Athlon use the same type and speed of memory, in this
case DDR266. This is contrary to the Win32 linpack performance shown
in the tech-report review. I have tested this on about 12 machines,
all with the AMD760MPX chipset, and they all have the same problem.

The problem is that the memory controller driver for the AMD762 north
bridge does not work or does not exist in Linux distributions when
installed as-is. Did you do any kernel recompilation? One thing though:
I never tested it on an Athlon MP platform with only one CPU! It should
not matter, as linpack is a single-threaded app, but who knows. I am
100% sure that this is not an isolated case, as I tested it
on a number of other AMD760MPX platforms running various kernel
versions. All had 2 processors installed.

Also, the same machine with 2800+ processors was tested for memory
performance under Win32 and it passed with flying colors. That rules
out a single-machine hardware issue. Maybe plugging in a second
processor causes the memory controller to divide the bandwidth between
the processors by default, and the driver that would rectify this is
either missing or malfunctioning. In your case it is obvious that the
single CPU is getting all the bandwidth and thus performs well.

Comment 4 Bugzilla owner 2004-09-30 15:41:42 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/


