Bug 592502 - Performance regression 15% for single precision linpack benchmark
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: gcc
Version: 6.0
Hardware: x86_64 Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jakub Jelinek
QA Contact: Kamil Kolakowski
Depends On:
Blocks: 573755 599016
Reported: 2010-05-14 21:39 EDT by Jiri Hladky
Modified: 2011-05-19 09:57 EDT (History)

See Also:
Fixed In Version: gcc-4.4.5-5.el6
Doc Type: Bug Fix
Doc Text:
Previously, the optimizations performed when calculating induction variables during the induction variable optimization (ivopts) pass were less efficient than in previous releases. In these updated packages, the optimizations performed during the ivopts pass have been improved.
Last Closed: 2011-05-19 09:57:41 EDT

Attachments
linpack source code, run scripts, run results and statically linked binaries (4.09 MB, application/x-gzip)
2010-05-14 21:39 EDT, Jiri Hladky


External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2011:0663
Priority: normal
Status: SHIPPED_LIVE
Summary: gcc bug fix update
Last Updated: 2011-05-19 05:37:21 EDT

Description Jiri Hladky 2010-05-14 21:39:19 EDT
Created attachment 414198
linpack source code, run scripts, run results and statically linked binaries

Description of problem:

We have measured a 15% performance regression in the single-precision linpack benchmark on RHEL 6.0, using RHEL 5.5 as the baseline.

We have reproduced the problem on 
gs-bl460cg1-01.rhts.eng.bos.redhat.com  : Xeon L5420 @ 2.50GHz 
nec-em24-1.rhts.eng.bos.redhat.com      : Xeon E7340 @ 2.40GHz
candycane.rhts.eng.bos.redhat.com : Xeon 7400-series "Dunnington", probably  L7455

We don't see the problem on Nehalem. Also, the double-precision linpack shows no regression.

By using statically linked binaries I was able to show that the regression is caused by the difference in gcc versions. The problem is in the function daxpy, more precisely in the following piece of code:

===========================================================
	m = n % 4;
	if ( m != 0) {
		for (i = 0; i < m; i++) 
			dy[i] = dy[i] + da*dx[i];
		if (n < 4) return;
	}
	for (i = m; i < n; i = i + 4) {
		dy[i] = dy[i] + da*dx[i];
		dy[i+1] = dy[i+1] + da*dx[i+1];
		dy[i+2] = dy[i+2] + da*dx[i+2];
		dy[i+3] = dy[i+3] + da*dx[i+3];
	}
===========================================================
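
For readers without the attachment, the fragment above can be wrapped into a self-contained function. This is only a minimal sketch, assuming single-precision data and unit stride; the name saxpy_unrolled4 and its signature are illustrative and are not taken from the linpack sources.

```c
#include <assert.h>

/* Hypothetical wrapper around the fragment from the report:
 * dy[i] = dy[i] + da * dx[i] for 0 <= i < n, unit stride, float data.
 * Name and signature are assumptions, not the linpack originals. */
void saxpy_unrolled4(int n, float da, const float *dx, float *dy)
{
    int i, m;

    assert(n >= 0);

    /* Clean-up loop: handle the first n % 4 elements one at a time. */
    m = n % 4;
    if (m != 0) {
        for (i = 0; i < m; i++)
            dy[i] = dy[i] + da * dx[i];
        if (n < 4)
            return;
    }

    /* Main loop: four elements per iteration, as in the report. */
    for (i = m; i < n; i = i + 4) {
        dy[i]     = dy[i]     + da * dx[i];
        dy[i + 1] = dy[i + 1] + da * dx[i + 1];
        dy[i + 2] = dy[i + 2] + da * dx[i + 2];
        dy[i + 3] = dy[i + 3] + da * dx[i + 3];
    }
}
```

Calling it with n = 7 exercises both loops: three elements go through the clean-up loop and the remaining four through the unrolled loop.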

To further debug and localize the problem I have added cpuid and rdtsc instructions (rdtscp is not supported on those systems) around the critical code. The purpose is to:
1) Clearly identify the critical code in the assembler output
2) Measure the CPU cycles spent in the critical section of the code

On RHEL 5.5, however, it appears that gcc moves the rdtsc instruction to a different location than in the source code. (Yes, I'm using __asm__ __volatile__ to prevent it. Is this another bug or my mistake?)
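
One possible explanation, offered as an assumption because the attached sources are not reproduced here: the GCC documentation notes that the compiler may move even a volatile asm relative to other code, so __volatile__ alone does not pin rdtsc in place. A common idiom is to add a "memory" clobber so that loads and stores are not moved across the asm. The sketch below is x86-only, and the helper names serialized_rdtsc and time_saxpy are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: cpuid serializes execution, rdtsc then reads the
 * time-stamp counter into edx:eax. The "memory" clobber asks gcc not to
 * move memory accesses across this statement. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("cpuid\n\t"
                         "rdtsc"
                         : "=a"(lo), "=d"(hi)
                         : "a"(0)
                         : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Times one pass of a plain dy[i] += da * dx[i] loop in TSC ticks. */
uint64_t time_saxpy(int n, float da, const float *dx, float *dy)
{
    uint64_t start, stop;
    int i;

    assert(n >= 0);
    start = serialized_rdtsc();
    for (i = 0; i < n; i++)
        dy[i] = dy[i] + da * dx[i];
    stop = serialized_rdtsc();
    return stop - start;
}
```

Because the loop reads and writes memory, the clobber on both sides keeps its loads and stores between the two counter reads. Whether this matches what happened with gcc 4.1.2 on RHEL 5.5 has not been verified here.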



Version-Release number of selected component (if applicable):
RHEL 5.5 Server for baseline data with gcc (GCC) 4.1.2 20080704
RHEL 6.0 Snapshot-3 Server variant with gcc (GCC) 4.4.4 20100503


How reproducible:
I suggest using one of these systems:
gs-bl460cg1-01.rhts.eng.bos.redhat.com  : Xeon L5420 @ 2.50GHz 
nec-em24-1.rhts.eng.bos.redhat.com      : Xeon E7340 @ 2.40GHz
candycane.rhts.eng.bos.redhat.com : Xeon 7400-series "Dunnington", probably  L7455

or any other system with the same Intel CPU. 

Make sure that the CPU governor is set to performance.

On RHEL 6.0 you need to install glibc-static
yum install glibc-static
in order to statically link the program.

Steps to Reproduce:
1. Untar the testcase on RHEL 5.5
2. Run ./install.sh. It will compile linpack using single precision (linpacks) and double precision (linpackd) floating point.

We see regression only with linpacks.

linpack-normal : Original linpack program
linpack_RDTSC  : rdtsc added around the critical section of the daxpy function (if you can suggest why gcc on RHEL 5.5 places the rdtsc instruction at a different location than in the source code, please let me know!)


3. Go to 
linpack-normal
and execute
linpacks

4. Do the same on RHEL 6.0 and compare runtimes and Kflops reported by linpacks on RHEL 5.5 and RHEL 6.0

5. You may want to run 
./runtest.sh
which will run all executables 10 times and record the results in a comma-separated values file (extension .csv)

6. You may want to run statically linked binaries from
linpack_static_x86_binaries_normal
directory on any RHEL 5 or RHEL 6 system and compare the runtimes and Kflops reported. The linpacks binary created on RHEL 5.5 is 15% faster than the linpacks binary created on RHEL 6.0. These binaries have been built on
gs-bl460cg1-01.rhts.eng.bos.redhat.com

7. You may want to check the linpack_RDTSC and linpack_static_x86_binaries_RDTSC directories as well. As already mentioned, my intention was to debug the problem further by using the cpuid and rdtsc instructions. However, it seems that gcc moves these instructions to different locations than in the source code. Any hints as to why this is happening are welcome!
  
Actual results:
nec-em24-1.rhts.eng.bos.redhat.com:
Testcase/linpack_static_x86_binaries_normal directory
time ./static-linpacks-RHEL55
Unrolled Single  Precision 1956035 Kflops ; 10 Reps 

real    0m9.128s
user    0m9.119s
sys     0m0.007s

time ./static-linpacks-RHEL60
Unrolled Single  Precision 1685829 Kflops ; 10 Reps 

real    0m10.558s
user    0m10.545s


Expected results:

Runtime and Kflops reported by linpacks on RHEL 6.0 should be within 1% of the RHEL 5.5 values.
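
The 1% criterion can be checked mechanically against the numbers under "Actual results". The following is a small sketch; the function names relative_change and within_tolerance are made up for this illustration.

```c
#include <assert.h>

/* Relative change of measured versus baseline, e.g. -0.05 for a 5% drop. */
double relative_change(double baseline, double measured)
{
    assert(baseline > 0.0);
    return (measured - baseline) / baseline;
}

/* Returns 1 if measured is within tol (a fraction such as 0.01)
 * of baseline in either direction, 0 otherwise. */
int within_tolerance(double baseline, double measured, double tol)
{
    double r = relative_change(baseline, measured);

    return r >= -tol && r <= tol;
}
```

With the figures from nec-em24-1, Kflops drop from 1956035 to 1685829 (about 13.8% lower) and the real time grows from 9.128s to 10.558s (about 15.7% longer), so both miss the 1% target.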

Additional info:
Comment 2 Jakub Jelinek 2010-05-15 05:23:16 EDT
Seems gcc 4.3 and 4.4 use 8 induction variables in the loop instead of 2.

For 4.5 this got fixed by:
http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147983
( http://gcc.gnu.org/ml/gcc-patches/2009-05/msg01579.html )
and that patch applies cleanly to redhat/gcc-4_4-branch and seems to get the numbers back to 4.1 ranges.

I've briefly checked for any needed follow-ups, and couldn't find any, but that needs to be verified, and the change needs to be properly tested.
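
To picture what the induction-variable count means, here are two hand-written, source-level renderings of the unrolled main loop. This is an illustration only, not the code that gcc 4.4 or gcc 4.1 actually emitted: the first keeps eight separately incremented counters (one per array reference), the second keeps two pointers and reaches the other elements through constant offsets. The function names are hypothetical and n is assumed to be a multiple of 4.

```c
#include <assert.h>

/* Eight induction variables: every array reference has its own index,
 * and all eight are advanced on each iteration. */
void saxpy_8iv(int n, float da, const float *dx, float *dy)
{
    int x0 = 0, x1 = 1, x2 = 2, x3 = 3;
    int y0 = 0, y1 = 1, y2 = 2, y3 = 3;

    assert(n >= 0 && n % 4 == 0);
    while (x0 < n) {
        dy[y0] = dy[y0] + da * dx[x0];
        dy[y1] = dy[y1] + da * dx[x1];
        dy[y2] = dy[y2] + da * dx[x2];
        dy[y3] = dy[y3] + da * dx[x3];
        x0 += 4; x1 += 4; x2 += 4; x3 += 4;
        y0 += 4; y1 += 4; y2 += 4; y3 += 4;
    }
}

/* Two induction variables: one pointer per array, with the remaining
 * elements addressed as constant offsets from it. */
void saxpy_2iv(int n, float da, const float *dx, float *dy)
{
    const float *px = dx;
    const float *end = dx + n;
    float *py = dy;

    assert(n >= 0 && n % 4 == 0);
    for (; px != end; px += 4, py += 4) {
        py[0] = py[0] + da * px[0];
        py[1] = py[1] + da * px[1];
        py[2] = py[2] + da * px[2];
        py[3] = py[3] + da * px[3];
    }
}
```

Both functions compute the same result; the first simply spends more registers and more add instructions per iteration on loop bookkeeping.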
Comment 4 RHEL Product and Program Management 2010-06-07 11:55:31 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 5 Jiri Hladky 2010-06-08 03:52:30 EDT
Hi Jakub,

what is the current status of the fix? Will it be included in RHEL 6.0 Beta2 ? 

Thanks a lot!
Jirka
Comment 8 Jakub Jelinek 2010-11-22 03:37:12 EST
This is going to be shipped as a Fedora 13 update this week and, assuming it doesn't show any regressions in Fedora 13, is OK for RHEL 6.1 too.
Comment 9 Kamil Kolakowski 2010-11-22 04:34:59 EST
Hi Jakub,

This fix could be shipped in F13 as a gcc update, right?
Jiri is leaving RH tomorrow, so I will retest this bug.
Comment 12 Jakub Jelinek 2010-12-08 13:18:46 EST
This fix has been shipped in F13 updates already.
Comment 14 Kamil Kolakowski 2011-01-18 11:51:09 EST
Retested on F13. Fix works as expected.
Comment 16 Kamil Kolakowski 2011-02-07 13:59:19 EST
Now retested on 6.1 Beta
Linux ibm-x3650m3-01.lab.eng.brq.redhat.com 2.6.32-94.el6.x86_64 #1 SMP Tue Dec 28 21:55:53 EST 2010 x86_64 x86_64 x86_64 GNU/Linux

gcc --version
gcc (GCC) 4.4.5 20110116 (Red Hat 4.4.5-5)

Results:

5.5STATIC
real    0m8.433s
user    0m8.429s
sys     0m0.002s

6.0STATIC
real    0m8.416s
user    0m8.414s
sys     0m0.001s

Resolved; this can be closed.
Comment 17 Ryan Lerch 2011-04-20 00:29:32 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
cost computation of induction variables during ivopts pass has been improved.
Comment 19 Ryan Lerch 2011-05-05 22:40:18 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-cost computation of induction variables during ivopts pass has been improved.
+These updated packages provide optimizations when calculating induction variables during the induction variable optimization (ivopts) pass.
Comment 20 Jeff Law 2011-05-06 11:33:30 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,9 @@
-These updated packages provide optimizations when calculating induction variables during the induction variable optimization (ivopts) pass.
+These updated packages provide optimizations when calculating induction variables during the induction variable optimization (ivopts) pass.
+
+Cause: Cost analysis for induction variables in certain loops was suboptimal, leading to poor code generation.
+
+Consequence: Poor application performance if the application's performance is dominated by loops of this nature.
+
+Fix: The cost analysis for the loop optimizer was improved to handle these loops better.
+
+Result: We get the desired code generated for the reported testcases and application performance no longer regresses compared to the RHEL 5.5 compilers
Comment 21 Ryan Lerch 2011-05-15 22:56:41 EDT
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,9 +1 @@
-These updated packages provide optimizations when calculating induction variables during the induction variable optimization (ivopts) pass.
+Previously, the optimizations performed when calculating induction variables during the induction variable optimization (ivopts) pass were not as efficient as previous releases.  In these updated packages, the optimizations performed during the the induction variable optimization (ivopts) pass is improved.
-
-Cause: Cost analysis for induction variables in certain loops was suboptimal, leading to poor code generation.
-
-Consequence: Poor application performance if the application's performance is dominated by loops of this nature.
-
-Fix: The cost analysis for the loop optimizer was improved to handle these loops better.
-
-Result: We get the desired code generated for the reported testcases and application performance no longer regresses compared to the RHEL 5.5 compilers
Comment 22 errata-xmlrpc 2011-05-19 09:57:41 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0663.html
