Bug 592502
Summary: | Performance regression 15% for single precision linpack benchmark
---|---
Product: | Red Hat Enterprise Linux 6
Reporter: | Jiri Hladky <jhladky>
Component: | gcc
Assignee: | Jakub Jelinek <jakub>
Status: | CLOSED ERRATA
QA Contact: | Kamil Kolakowski <kkolakow>
Severity: | high
Priority: | low
Version: | 6.0
CC: | bmarson, dshaks, ebachalo, kkolakow, law, mnowak, rmusil, vmakarov
Target Milestone: | rc
Hardware: | x86_64
OS: | Linux
Fixed In Version: | gcc-4.4.5-5.el6
Doc Type: | Bug Fix
Doc Text: |
Previously, the optimizations performed when calculating induction variables during the induction variable optimization (ivopts) pass were less efficient than in previous releases. In these updated packages, the optimizations performed during the ivopts pass have been improved.
|
Last Closed: | 2011-05-19 13:57:41 UTC
Bug Blocks: | 573755, 599016
Seems gcc 4.3 and 4.4 use 8 induction variables in the loop instead of 2. For 4.5 this was fixed by http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=147983 ( http://gcc.gnu.org/ml/gcc-patches/2009-05/msg01579.html ), and that patch applies cleanly to redhat/gcc-4_4-branch and seems to bring the numbers back into the 4.1 range. I've briefly checked for any needed follow-ups and couldn't find any, but that needs to be verified, and the change needs to be properly tested.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux major release. This request is not yet committed for inclusion.

Hi Jakub, what is the current status of the fix? Will it be included in RHEL 6.0 Beta 2? Thanks a lot! Jirka

This is going to be shipped as a Fedora 13 update this week and, assuming it doesn't show up any regressions in Fedora 13, is OK for RHEL 6.1 too.

Hi Jakub, this fix could be shipped in F13 as a gcc update, right? Jiri is leaving RH tomorrow, so I will retest this bug.

This fix has already been shipped in F13 updates.

Retested on F13. The fix works as expected.

Now retested on 6.1 Beta:

Linux ibm-x3650m3-01.lab.eng.brq.redhat.com 2.6.32-94.el6.x86_64 #1 SMP Tue Dec 28 21:55:53 EST 2010 x86_64 x86_64 x86_64 GNU/Linux

gcc --version
gcc (GCC) 4.4.5 20110116 (Red Hat 4.4.5-5)

Results:
5.5STATIC: real 0m8.433s, user 0m8.429s, sys 0m0.002s
6.0STATIC: real 0m8.416s, user 0m8.414s, sys 0m0.001s

Resolved; can be closed.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cost computation of induction variables during the ivopts pass has been improved.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-cost computation of induction variables during ivopts pass has been improved.
+These updated packages provide optimizations when calculating induction variables during the induction variable optimization (ivopts) pass.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,9 @@
-These updated packages provide optimizations when calculating induction variables during the induction variable optimization (ivopts) pass.
+These updated packages provide optimizations when calculating induction variables during the induction variable optimization (ivopts) pass.
+
+Cause: Cost analysis for induction variables in certain loops was suboptimal, leading to poor code generation.
+
+Consequence: Poor application performance if the application's performance is dominated by loops of this nature.
+
+Fix: The cost analysis for the loop optimizer was improved to handle these loops better.
+
+Result: We get the desired code generated for the reported testcases, and application performance no longer regresses compared to the RHEL 5.5 compilers.

Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,9 +1 @@
-These updated packages provide optimizations when calculating induction variables during the induction variable optimization (ivopts) pass.
-
-Cause: Cost analysis for induction variables in certain loops was suboptimal, leading to poor code generation.
-
-Consequence: Poor application performance if the application's performance is dominated by loops of this nature.
-
-Fix: The cost analysis for the loop optimizer was improved to handle these loops better.
-
-Result: We get the desired code generated for the reported testcases, and application performance no longer regresses compared to the RHEL 5.5 compilers.
+Previously, the optimizations performed when calculating induction variables during the induction variable optimization (ivopts) pass were less efficient than in previous releases. In these updated packages, the optimizations performed during the ivopts pass have been improved.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0663.html
Created attachment 414198 [details]
linpack source code, run scripts, run results and statically linked binaries

Description of problem:
We have measured a regression of 15% in the performance of the single precision linpack benchmark on RHEL 6.0, using RHEL 5.5 as the baseline. We have reproduced the problem on:

gs-bl460cg1-01.rhts.eng.bos.redhat.com : Xeon L5420 @ 2.50GHz
nec-em24-1.rhts.eng.bos.redhat.com : Xeon E7340 @ 2.40GHz
candycane.rhts.eng.bos.redhat.com : Xeon 7400-series "Dunnington", probably L7455

We don't see the problem on Nehalem, and the double precision linpack shows no regression either. By using statically linked binaries I was able to prove that the regression is caused by the different gcc versions. The problem is in the function daxpy, more precisely in the following piece of code:

===========================================================
m = n % 4;
if (m != 0) {
    for (i = 0; i < m; i++)
        dy[i] = dy[i] + da*dx[i];
    if (n < 4)
        return;
}
for (i = m; i < n; i = i + 4) {
    dy[i] = dy[i] + da*dx[i];
    dy[i+1] = dy[i+1] + da*dx[i+1];
    dy[i+2] = dy[i+2] + da*dx[i+2];
    dy[i+3] = dy[i+3] + da*dx[i+3];
}
===========================================================

To further debug and localize the problem I added cpuid and rdtsc instructions (rdtscp is not supported on those systems) around the critical code. The purpose is to:
1) Clearly identify the critical code in the assembler output
2) Measure CPU cycles spent in the critical section of the code

On RHEL 5.5, however, it appears that the rdtsc instruction is moved by gcc to a location other than where it appears in the source code. (Yes, I'm using __asm__ __volatile__ to prevent this. Is it another bug, or my mistake???)
Version-Release number of selected component (if applicable):
RHEL 5.5 Server for baseline data, with gcc (GCC) 4.1.2 20080704
RHEL 6.0 Snapshot-3 Server variant, with gcc (GCC) 4.4.4 20100503

How reproducible:
I suggest using one of these systems:

gs-bl460cg1-01.rhts.eng.bos.redhat.com : Xeon L5420 @ 2.50GHz
nec-em24-1.rhts.eng.bos.redhat.com : Xeon E7340 @ 2.40GHz
candycane.rhts.eng.bos.redhat.com : Xeon 7400-series "Dunnington", probably L7455

or any other system with the same Intel CPU. Make sure that the CPU governor is set to performance. On RHEL 6.0 you need to install glibc-static

yum install glibc-static

in order to statically link the program.

Steps to Reproduce:
1. Untar the testcase on RHEL 5.5
2. Run ./install.sh. It will compile linpack using the single precision linpacks and the double precision floating point linpackd. We see the regression only with linpacks.
   linpack-normal : original linpack program
   linpack_RDTSC : rdtsc added around the critical section of the daxpy function (if you can suggest why gcc on RHEL 5.5 is placing the rdtsc instruction in a different location than in the source code, please let me know!)
3. Go to linpack-normal and execute linpacks
4. Do the same on RHEL 6.0 and compare the runtimes and Kflops reported by linpacks on RHEL 5.5 and RHEL 6.0
5. You may want to run ./runtest.sh, which will run all executables 10 times and record the results in a Comma Separated Values file (extension .csv)
6. You may want to run the statically linked binaries from the linpack_static_x86_binaries_normal directory on any RHEL 5 or RHEL 6 system and compare the runtimes and Kflops reported. The linpacks binary created on RHEL 5.5 is faster by 15% than the linpacks binary created on RHEL 6.0. These binaries have been built on gs-bl460cg1-01.rhts.eng.bos.redhat.com
7. You may want to check the linpack_RDTSC and linpack_static_x86_binaries_RDTSC directories as well. As already mentioned, my intention was to further debug the problem by using the cpuid and rdtsc instructions.
However, it seems that gcc is moving these instructions to locations other than in the source code. Any hints why this is happening are welcome!

Actual results:
nec-em24-1.rhts.eng.bos.redhat.com, Testcase/linpack_static_x86_binaries_normal directory:

time ./static-linpacks-RHEL55
Unrolled Single Precision 1956035 Kflops ; 10 Reps
real 0m9.128s
user 0m9.119s
sys 0m0.007s

time ./static-linpacks-RHEL60
Unrolled Single Precision 1685829 Kflops ; 10 Reps
real 0m10.558s
user 0m10.545s

Expected results:
Runtime and Kflops reported by linpacks to be within 1% of the RHEL 5.5 values.

Additional info:
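The step-6 comparison can be reduced to a small helper that turns two `time` measurements into the relative slowdown. This is a sketch; the binary names come from the attached tarball, and the sample numbers are the user times measured on nec-em24-1 above.

```shell
#!/bin/sh
# Compare two runtimes and report the candidate's slowdown relative to
# the baseline, in percent. Usage mirrors step 6:
#   time ./static-linpacks-RHEL55   ->  baseline seconds
#   time ./static-linpacks-RHEL60   ->  candidate seconds

# percent_slowdown BASELINE_SECONDS CANDIDATE_SECONDS
percent_slowdown() {
    awk -v base="$1" -v cand="$2" \
        'BEGIN { printf "%.1f\n", (cand - base) / base * 100 }'
}

# The user times from the "Actual results" section:
percent_slowdown 9.119 10.545   # prints 15.6 -- the reported ~15% regression
```

Anything above the expected-results threshold of 1% would count as a failure when verifying the gcc-4.4.5-5.el6 fix.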