Bug 1421121
| Summary: | Serious performance degradation of math functions in Fedora 24/25 due to known Glibc bug | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Oleg Strikov <oleg.strikov> |
| Component: | glibc | Assignee: | Carlos O'Donell <codonell> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 25 | CC: | arjun.is, codonell, dj, fweimer, law, mfabian, pfrankli, siddhesh, stimberg |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-10-09 11:45:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description  Oleg Strikov  2017-02-10 12:09:55 UTC
Carlos O'Donell

I don't see anywhere near the performance degradation you're seeing, so it must be heavily dependent on the family and stepping that you're using. For example:

[carlos@athas rhbz1421121]$ time LD_BIND_NOW=1 ./pow-test
154964150.331550

real    0m1.831s
user    0m1.820s
sys     0m0.003s

[carlos@athas rhbz1421121]$ time ./pow-test
154964150.331550

real    0m1.830s
user    0m1.820s
sys     0m0.001s

Verified that pow-test was built without DT_FLAGS BIND_NOW.

I agree that it is less than optimal to have processor state transitions like those you indicate every time the dynamic loader trampoline is called. We'll look into this.

Fedora 26 will not have this problem since it is based on glibc 2.25, which already contains the fix you indicate.

Oleg Strikov (comment #4)

Hi Carlos,

Many thanks for looking into this! Could you please confirm that you used the following command to compile the pow test with gcc:

$ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm

Passing -mavx is the key thing for this example to work as expected. You want to compile the pow() test WITH -mavx but the exp() test WITHOUT -mavx.

I'd also appreciate it if you could tell me which CPU you are testing on. It's impossible for me to run this test on every possible CPU (I have tried Sandy Bridge and Ivy Bridge machines so far) and this information would be really helpful. Thanks!

Carlos O'Donell (comment #5)

(In reply to Oleg Strikov from comment #4)
> Hi Carlos,
>
> Many thanks for looking into this! Could you please confirm that you used
> the following command to compile the pow test with gcc:
>
> $ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm

I can confirm that I used these options on an F25 system.

The dynamic loader trampoline is only called once in the loop to resolve the single math function call, and after that it's the same sequence over and over again without any explicit software save/restore (though the CPU might do something for the transition).

[carlos@athas rhbz1421121]$ gcc -O3 -march=x86-64 -mavx -o pow-test pow-test.c -lm
[carlos@athas rhbz1421121]$ time ./pow-test
154964150.331550

real    0m1.829s
user    0m1.819s
sys     0m0.002s

[carlos@athas rhbz1421121]$ time LD_BIND_NOW=1 ./pow-test
154964150.331550

real    0m1.833s
user    0m1.819s
sys     0m0.005s

gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC)

> Passing -mavx is the key thing for this example to work as expected. You
> want to compile the pow() test WITH -mavx but the exp() test WITHOUT -mavx.
>
> I'd also appreciate it if you could tell me which CPU you are testing on.
> It's impossible for me to run this test on every possible CPU (I have tried
> Sandy Bridge and Ivy Bridge machines so far) and this information would be
> really helpful.

I ran this on an i5-4690K, so a Haswell series CPU, but without AVX512.

Florian Weimer (comment #6)

(In reply to Carlos O'Donell from comment #5)
> I can confirm that I used these options on an F25 system.
>
> The dynamic loader trampoline is only called once in the loop to resolve the
> single math function call, and after that it's the same sequence over and
> over again without any explicit software save/restore (though the CPU might
> do something for the transition).

Right, that's why I found the claim about the substantial performance impact always a bit puzzling.

What happens if you use LD_BIND_NOT=1?
Carlos O'Donell

(In reply to Florian Weimer from comment #6)
> Right, that's why I found the claim about the substantial performance impact
> always a bit puzzling.

Agreed.

> What happens if you use LD_BIND_NOT=1?

[carlos@athas rhbz1421121]$ time LD_BIND_NOT=1 ./pow-test
154964150.331550

real    0m4.527s
user    0m4.505s
sys     0m0.003s

Terrible performance, as expected, though surprisingly in line with Oleg's numbers. However, LD_BIND_NOT performance is never the default; you'd have to be running with a preloaded audit library (LD_AUDIT) to trigger that kind of behaviour. Perhaps something is wrong with Oleg's system configuration?

Oleg Strikov

To my understanding, once the trampoline has touched the upper halves of the YMM registers, ALL future switches between AVX and SSE require a time-consuming save/restore operation (i.e. all future calls to pow() will suffer). Touching the upper halves sets something like a dirty flag (which forces the CPU to do the save/restore), and this flag never gets cleared during the whole program execution. That's why the impact is so serious.

I was able to reproduce the issue using the F25 live CD, so it looks like a CPU-model-dependent issue. We were able to reproduce it on an E5-1630 (Haswell) though.

Marcel Stimberg

Hi, I'm the one Oleg referred to who had this issue on an E5-1630 CPU. It turns out that I actually /cannot/ reproduce it with a Fedora 25 live CD (before and after an update of glibc)! I don't normally use Fedora on this machine; I originally encountered the problem with Ubuntu 16.04 (which has glibc 2.23, not 2.24 as Fedora 25 does) -- there it is perfectly reproducible with Oleg's code, with very similar timings to the ones that Oleg reported. This is very confusing. I can try with a Fedora 24 live CD as well, but Oleg seems to be able to reproduce it on Fedora 25, so...

Marcel Stimberg

Um, sorry for the noise, but it seems that the bug was fixed with Fedora's glibc 2.24-4 release:

* Fri Dec 23 2016 Carlos O'Donell <carlos@...> - 2.24-4
- Auto-sync with upstream release/2.24/master,
  commit e9e69e468039fcd57276f783a16aa771a8e4214e, fixing:
  - [...]
  - Fix runtime resolver routines in the presence of AVX512 (swbz#20508)
  - [...]

That would explain why Oleg saw it with the Fedora 25 live CD (which still has 2.24-3) while Carlos did not see it on his system. Now what I don't understand is why I myself could not reproduce it with the live CD, even though I tried compiling/running it before updating glibc...

Oleg Strikov

I just reran all the tests on F24 and F25. I can confirm that the performance issue disappears on F25 once the glibc package is updated to version 2.24-4. It is still observable on F24 because the fix has not been propagated there. I'm very sorry for such a stupid mistake (not updating the live CD packages before running the tests). Thanks to Marcel for pointing that out; it saved me a huge amount of time.
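For readers checking their own systems, one way to verify whether the installed glibc already carries the 2.24-4 fix quoted above is to query the package and its changelog. This is only a sketch of the verification steps; it assumes a stock Fedora glibc package and that the changelog wording matches the entry quoted in this report.

$ rpm -q glibc
$ rpm -q --changelog glibc | grep -i "runtime resolver"
$ sudo dnf upgrade glibc    # pulls in 2.24-4 or later if the fix is missing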
Oleg Strikov

We also did some investigation into which specific CPU models suffer from this kind of performance degradation. A quite reliable source [1] says that 'AMD processors and later Intel processors (Skylake and Knights Landing) do not have such a state switch'. It means that only Sandy Bridge, Ivy Bridge, Haswell, and Broadwell CPUs are affected.

Many thanks to Carlos and Florian for such a fast and straight-to-the-point response. I really appreciate it.

[1] http://www.agner.org/optimize/blog/read.php?i=761#761
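As a closing reference, below is a minimal sketch of the kind of microbenchmark discussed in this thread: a caller compiled with -mavx calling the lazily bound libm pow() in a tight loop. The actual pow-test.c is not attached to this report, so the loop body, iteration count, and printed total are illustrative assumptions, not Oleg's original code; only the build command and the AVX-caller/SSE-libm mix follow what was described above.

/* Hypothetical sketch (not the original pow-test.c).  Build with:
 *     gcc -O3 -march=x86-64 -mavx -o pow-test pow-test.c -lm
 * and compare `time ./pow-test` with `time LD_BIND_NOW=1 ./pow-test`.
 * On affected CPUs (Sandy Bridge through Broadwell) with an unfixed glibc,
 * the AVX state dirtied by the lazy-resolution trampoline makes subsequent
 * AVX/SSE transitions pay the save/restore penalty, per the discussion
 * above. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;

    /* Many calls into libm so that any per-call transition penalty
     * dominates the measured runtime. */
    for (int i = 0; i < 100000000; i++)
        sum += pow(1.0000001, (double)(i % 1000));

    /* Print the accumulated total so the loop cannot be optimized away. */
    printf("%f\n", sum);
    return 0;
}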