Bug 109358 - Huge performance loss when running linpack and commercial numerical apps
Product: Red Hat Linux
Classification: Retired
Component: kernel
Hardware: athlon Linux
Priority: high Severity: high
Assigned To: Arjan van de Ven
Reported: 2003-11-06 19:27 EST by Ognjen Milic
Modified: 2005-10-31 17:00 EST
Doc Type: Bug Fix
Last Closed: 2004-09-30 11:41:42 EDT

Attachments
Linpack bench converted to C for Athlon vs. P4 speed test (21.87 KB, text/plain)
2003-11-06 19:40 EST, Ognjen Milic

Description Ognjen Milic 2003-11-06 19:27:51 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624

Description of problem:
A system with an AthlonMP 2800 runs the linpack benchmark,
as well as commercial numerical apps from Synopsys, 50% slower than a P4
2.4GHz. Both machines have single-channel DDR266 memory.
Under Win32, machines with matched CPU performance and
matched memory architecture show no difference between P4 and
Athlon linpack performance (i.e. P4 2.0GHz vs. Athlon 2000+, both
running DDR266).
Please see:

(look at rightmost part of the graph)
The numbers I got are:

P4 2.4GHz      - 180 Mflops - 2.4.20-8
AthlonMP 2800  - 100 Mflops - 2.4.20-20.9smp
AthlonMP 2200  -  78 Mflops - 2.4.2-2smp

I will be happy to provide you with the linpack source I used in this
I used gcc with the following command:
cc -DDP -DUNROLL -O1 clinpack1000.c -lm -o clinpack1000.exe

Version-Release number of selected component (if applicable):
kernel-2.4.20-20.9smp and earlier

How reproducible:

Steps to Reproduce:
1. Compile the clinpack1000.c source for a matrix order of 1000 (n=1000)
2. Run on P4
3. Run on Athlon
4. Compare the performance

Actual Results:  The Athlon reaches roughly 50% of the P4's performance.

Expected Results:  Athlon 2800 performance somewhat better than the P4 2.4.

Additional info:

my email address is ognjen.milic@monolithicpower.com
the person in Synopsys Inc. who can confirm these findings is
Comment 1 Ognjen Milic 2003-11-06 19:40:39 EST
Created attachment 95780
Linpack bench converted to C for Athlon vs. P4 speed test

Please compile with the following command:

cc -DDP -DUNROLL -O1 clinpack1000.c -lm -o clinpack1000.exe
Comment 2 Kostas Georgiou 2004-05-12 07:44:58 EDT
On an AthlonMP 2000+ here (RH9, Tyan Tiger S2466, one CPU only) I get
120 Mflops with your options.

If I optimize with:
-O3 -march=athlon -msse -mfpmath=sse -malign-double
-mpreferred-stack-boundary=4 -falign-loops=4
it goes up to 130 Mflops.

For the record, on a P4 2GHz with DDR RAM I get 170 Mflops, and on a
dual 2GHz with RDRAM I get 250 Mflops with a linpack binary compiled
with ifc (version 7, if I remember).

Linpack performance is directly correlated with memory bandwidth. Since
you get bad performance, I suspect something is wrong with your
Comment 3 Ognjen Milic 2004-05-12 13:08:37 EDT
It is nice to see that tweaking compiler options yields better
performance, but that is beside the point here.
The issue is the huge difference between P4 and 2P Athlon
performance when running the same executable, even though the
P4 and the 2P Athlon use the same type and speed of memory, in this
case DDR266. This is contrary to the win32 linpack performance seen
in the tech-report review. I have tested the problem on about 12
machines, all with the AMD760MPX chipset, and they all have the same
problem.
The problem is that the memory controller driver for the AMD762
north bridge does not work or does not exist in Linux distributions
when installed as-is. Did you recompile the kernel? One thing though:
I never tested it on an Athlon MP platform with only one CPU! It should
not matter, as linpack is a single-threaded app, but who knows. I am
100% sure that this is not an isolated case, as I tested it
on a number of other AMD760MPX platforms running various kernel
versions. All had 2 processors installed. Also, the same machine with
2800+ processors was tested for memory performance under win32 and
passed with flying colors. That rules out a single-machine hardware
issue. Maybe plugging in a second processor causes the memory controller
to divide the bandwidth between the processors by default, and the
driver that would rectify this is missing or malfunctioning. In your
case it is obvious that the single CPU is getting all the bandwidth and
thus performs well.
Comment 4 Bugzilla owner 2004-09-30 11:41:42 EDT
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
