Bug 1480664
Summary: | gcc: Inline expansion of string functions very slow on most x86 CPUs when optimizing for size | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Paulo Andrade <pandrade> | ||||||||
Component: | gcc | Assignee: | Marek Polacek <mpolacek> | ||||||||
gcc sub component: | system-version | QA Contact: | Alexandra Petlanová Hájková <ahajkova> | ||||||||
Status: | CLOSED ERRATA | Docs Contact: | |||||||||
Severity: | medium | ||||||||||
Priority: | medium | CC: | ahajkova, ashankar, codonell, cww, fweimer, jakub, law, mcermak, mnewsome, mpetlan, msebor, ohudlick, pandrade, pfrankli | ||||||||
Version: | 8.0 | Keywords: | Bugfix | ||||||||
Target Milestone: | rc | ||||||||||
Target Release: | 8.0 | ||||||||||
Hardware: | x86_64 | ||||||||||
OS: | Linux | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | Doc Type: | No Doc Update | |||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2020-04-28 16:56:35 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 1477664, 1679810 | ||||||||||
Attachments: |
|
Description
Paulo Andrade
2017-08-11 16:02:00 UTC
Do we have a copy of the benchmark harness used here? It looks like it's multi-threaded and uses the same destination buffer. It wouldn't be a particularly interesting benchmark, then. (In reply to Florian Weimer from comment #4) > Do we have a copy of the benchmark harness used here? It looks like it's > multi-threaded and uses the same destination buffer. It wouldn't be a > particularly interesting benchmark, then. Ideally, we would have this information before proceeding. Created attachment 1507400 [details]
test,c
Quick and dirty rewrite of the test in plain C.
Note that on recent Fedora snprintf is very fast, while on rhel7
it indeed appears quite slow (snprintf is also being called, not
a gcc optimization).
Thanks for the C testcase. In glibc-2.17-260.el7.x86_64, this is the hotspot (identified after bumping the iteration count): │ mov %r12,%rdi 99.41 │ repnz scas %es:(%rdi),%al 0.24 │ movl $0x0,-0x538(%rbp) │ mov %rcx,%rsi I believe the scas instruction comes from this code in the non-wide case if process_string_arg: else if (!is_long && spec != L_('S')) \ { \ if (prec != -1) \ /* Search for the end of the string, but don't search past \ the length (in bytes) specified by the precision. */ \ len = __strnlen (string, prec); \ else \ len = strlen (string); \ } \ I don't think this comes from glibc's string inlines, based on the preprocessed vfprintf.i sources. There are also no #APP markers in the .s file commonly associated with inline assembly. Created attachment 1507419 [details]
Preprocessed sources of stdio-common/vfprintf.c
Compile with: gcc -O2 -std=gnu99
group_number is pretty small and has a strlen call near the start. This one is turned into repnz scasb by GCC as well.
Created attachment 1507420 [details]
Assembler input file generated from stdio-common/vfprintf.c
I believe this is a known deficiency in GCC on x86: it expands string functions using the string opcodes, which are known to be horribly slow in many x86 implementations because they have a very high startup overhead. GCC 8 in Red Hat Enterprise Linux 8 happens to produce a call to strlen (well, __GI_strlen due to the redirect), which is why the performance issue appears to be fixed there, but I'm not sure if this is just an accident due to different decisions by the GCC optimiziers. I would appreciate another look for the GCC maintainers. Perhaps we can finally fix this performance issue upstream. One can choose what kind of strop expansion one wants in great detail and it also heavily depends on the tuning. String opcodes aren't horribly slow on many x86 implementations, on some they are recommended, at least for not very long lengths, the call overhead is certainly higher in many cases. See documentation on -m{,no-}align-stringops, -minline-all-stringops, -minline-stringops-dynamically, -mstringop-strategy=, -mmemcpy-strategy=, -mmemset-strategy= . The default comes up from the tuning. (In reply to Jakub Jelinek from comment #12) > One can choose what kind of strop expansion one wants in great detail and it > also heavily depends on the tuning. String opcodes aren't horribly slow on > many x86 implementations, on some they are recommended, at least for not > very long lengths, the call overhead is certainly higher in many cases. > See documentation on -m{,no-}align-stringops, -minline-all-stringops, > -minline-stringops-dynamically, -mstringop-strategy=, -mmemcpy-strategy=, > -mmemset-strategy= . The default comes up from the tuning. I'm very much convinced that the generic tuning for strlen is objectively wrong. For other string functions, things are likely more murky. Agner Fo's optimization manual says that REPNE SCASB is to be avoided. The Intel optimization manual says this: “However, using a REP prefix with string scan instructions (SCASB, SCASW, SCASD, SCASQ) or compare instructions (CMPSB, CMPSW, SMPSD, SMPSQ) is not recommended for high performance. Consider using SIMD instructions instead.” The benchmark in this bug also clearly shows that the current GCC choice is wrong. It may be possible that REPNE SCASB is a win in artificial benchmarks on some CPUs *with very long strings*, but in practice, most strings are short, so it does not matter what these benchmarks say. So I think this is a GCC bug. Of course, it may be too late to fix this in Red Hat Enterprise Linux 7, but we should not carry forward this mistake indefinitely. Please fix this upstream and consider backporting it into Red Hat Enterprise Linux 8. If you disagree, please close this bug. We will not override GCC's wrong optimization choices in <string.h>; upstream has shifted away from that because it is not beneficial. FWIW, the quoted Intel recommendation is still included in what appears to be the most recent version of the Intel® 64 and IA-32 Architectures Optimization Reference Manual (April 2018). Jakub pointed out on IRC that GCC uses REPNE SCASB only under very specific conditions (as when optimizing for size). For some reason, GCC versions before 8 assume that key parts of vfprintf are cold, when they are not. GCC continues to do this even if I #define out __builtin_expect, so it is not related to glibc's often questionable use of that built-in. With GCC 7, I see this in vfprintf.c.227t.optimized: ;; Function group_number (group_number, funcdef_no=76, decl_uid=12164, cgraph_uid=76, symbol_order=86) (unlikely executed) With GCC 8, I get this instead in vfprintf.c.232t.optimized: ;; Function group_number (group_number, funcdef_no=76, decl_uid=13078, cgraph_uid=76, symbol_order=86) I have no idea whether this is just an accident, or the result of a deliberate change to the way GCC determines whether a function is cold (and should be optimized for size). Some improvement is possible by making GCC better at optimizing snprintf. I opened https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88813 for one of these enhancements (making use of array sizes to transform snprintf to strcpy). The other (integrating the sprintf and strlen passes) is already planned for GCC 10. It should make it possible to transform snprintf calls like the one in the test_snprintf() function in attachment 1507400 [details] into memcpy(dst, src, len + 1). Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:1864 |