1480664 – gcc: Inline expansion of string functions very slow on most x86 CPUs when optimizing for size

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1480664 - gcc: Inline expansion of string functions very slow on most x86 CPUs when optimizing for size

Summary: gcc: Inline expansion of string functions very slow on most x86 CPUs when opt...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 8
Classification:	Red Hat
Component:	gcc
Sub Component:
Version:	8.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	8.0
Assignee:	Marek Polacek
QA Contact:	Alexandra Petlanová Hájková
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1477664 1679810
TreeView+	depends on / blocked

Reported:	2017-08-11 16:02 UTC by Paulo Andrade
Modified:	2023-07-18 14:19 UTC (History)
CC List:	14 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-04-28 16:56:35 UTC
Type:	Bug
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
test,c (1.69 KB, text/plain) 2018-11-20 12:12 UTC, Paulo Andrade	no flags	Details
Preprocessed sources of stdio-common/vfprintf.c (489.87 KB, text/plain) 2018-11-20 14:20 UTC, Florian Weimer	no flags	Details
Assembler input file generated from stdio-common/vfprintf.c (543.55 KB, text/plain) 2018-11-20 14:21 UTC, Florian Weimer	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
GNU Compiler Collection	81772	'P3'	'NEW'	'Compile-time snprintf optimization'	2019-12-08 20:55:51 UTC
GNU Compiler Collection	88793	'P3'	'UNCONFIRMED'	'Document that __attribute__ ((cold)) is not equivalent to __builtin_expect because of optimization for size'	2019-12-08 20:55:52 UTC
GNU Compiler Collection	88809	'P3'	'RESOLVED'	'do not use rep-scasb for inline strlen/memchr'	2019-12-08 20:55:53 UTC
Red Hat Product Errata	RHSA-2020:1864	None	None	None	2020-04-28 16:56:44 UTC
Sourceware	21905	'P2'	'RESOLVED'	'snprintf is terribly slow when copying strings'	2019-12-08 20:55:50 UTC

Description Paulo Andrade 2017-08-11 16:02:00 UTC

There is way too much overhead for a simple:

snprintf("%s", size, string);

and the longer the string, the more time it takes
to copy the string, for example, at 10 bytes it
is around 17 times slower than strcpy, but at
10000 bytes it is around 220 times slower.

Comment 4 Florian Weimer 2017-11-27 18:08:26 UTC

Do we have a copy of the benchmark harness used here?  It looks like it's multi-threaded and uses the same destination buffer.  It wouldn't be a particularly interesting benchmark, then.

Comment 5 Florian Weimer 2018-11-20 09:26:15 UTC

(In reply to Florian Weimer from comment #4)
> Do we have a copy of the benchmark harness used here?  It looks like it's
> multi-threaded and uses the same destination buffer.  It wouldn't be a
> particularly interesting benchmark, then.

Ideally, we would have this information before proceeding.

Comment 7 Paulo Andrade 2018-11-20 12:12:17 UTC

Created attachment 1507400 [details]
test,c

Quick and dirty rewrite of the test in plain C.

Note that on recent Fedora snprintf is very fast, while on rhel7
it indeed appears quite slow (snprintf is also being called, not
a gcc optimization).

Comment 8 Florian Weimer 2018-11-20 14:17:36 UTC

Thanks for the C testcase.

In glibc-2.17-260.el7.x86_64, this is the hotspot (identified after bumping the iteration count):

       │        mov    %r12,%rdi                                                      
 99.41 │        repnz  scas %es:(%rdi),%al                                            
  0.24 │        movl   $0x0,-0x538(%rbp)                                              
       │        mov    %rcx,%rsi                                                      

I believe the scas instruction comes from this code in the non-wide case if process_string_arg:

	else if (!is_long && spec != L_('S'))				      \
	  {								      \
	    if (prec != -1)						      \
	      /* Search for the end of the string, but don't search past      \
		 the length (in bytes) specified by the precision.  */	      \
	      len = __strnlen (string, prec);				      \
	    else							      \
	      len = strlen (string);					      \
	  }								      \

I don't think this comes from glibc's string inlines, based on the preprocessed vfprintf.i sources.  There are also no #APP markers in the .s file commonly associated with inline assembly.

Comment 9 Florian Weimer 2018-11-20 14:20:55 UTC

Created attachment 1507419 [details]
Preprocessed sources of stdio-common/vfprintf.c

Compile with: gcc -O2 -std=gnu99

group_number is pretty small and has a strlen call near the start.  This one is turned into repnz scasb by GCC as well.

Comment 10 Florian Weimer 2018-11-20 14:21:44 UTC

Created attachment 1507420 [details]
Assembler input file generated from stdio-common/vfprintf.c

Comment 11 Florian Weimer 2018-11-20 14:27:35 UTC

I believe this is a known deficiency in GCC on x86: it expands string functions using the string opcodes, which are known to be horribly slow in many x86 implementations because they have a very high startup overhead.

GCC 8 in Red Hat Enterprise Linux 8 happens to produce a call to strlen (well, __GI_strlen due to the redirect), which is why the performance issue appears to be fixed there, but I'm not sure if this is just an accident due to different decisions by the GCC optimiziers.

I would appreciate another look for the GCC maintainers.  Perhaps we can finally fix this performance issue upstream.

Comment 12 Jakub Jelinek 2018-11-20 14:32:42 UTC

One can choose what kind of strop expansion one wants in great detail and it also heavily depends on the tuning.  String opcodes aren't horribly slow on many x86 implementations, on some they are recommended, at least for not very long lengths, the call overhead is certainly higher in many cases.
See documentation on -m{,no-}align-stringops, -minline-all-stringops, -minline-stringops-dynamically, -mstringop-strategy=, -mmemcpy-strategy=, -mmemset-strategy= .  The default comes up from the tuning.

Comment 13 Florian Weimer 2018-11-20 14:44:40 UTC

(In reply to Jakub Jelinek from comment #12)
> One can choose what kind of strop expansion one wants in great detail and it
> also heavily depends on the tuning.  String opcodes aren't horribly slow on
> many x86 implementations, on some they are recommended, at least for not
> very long lengths, the call overhead is certainly higher in many cases.
> See documentation on -m{,no-}align-stringops, -minline-all-stringops,
> -minline-stringops-dynamically, -mstringop-strategy=, -mmemcpy-strategy=,
> -mmemset-strategy= .  The default comes up from the tuning.

I'm very much convinced that the generic tuning for strlen is objectively wrong.  For other string functions, things are likely more murky.

Agner Fo's optimization manual says that REPNE SCASB is to be avoided.  The Intel optimization manual says this: “However, using a REP prefix with string scan instructions (SCASB, SCASW, SCASD, SCASQ) or compare instructions (CMPSB, CMPSW, SMPSD, SMPSQ) is not recommended for high performance.  Consider using SIMD instructions instead.”

The benchmark in this bug also clearly shows that the current GCC choice is wrong.  It may be possible that REPNE SCASB is a win in artificial benchmarks on some CPUs *with very long strings*, but in practice, most strings are short, so it does not matter what these benchmarks say.

So I think this is a GCC bug.  Of course, it may be too late to fix this in Red Hat Enterprise Linux 7, but we should not carry forward this mistake indefinitely.  Please fix this upstream and consider backporting it into Red Hat Enterprise Linux 8.

If you disagree, please close this bug.  We will not override GCC's wrong optimization choices in <string.h>; upstream has shifted away from that because it is not beneficial.

Comment 14 Florian Weimer 2018-11-20 14:47:11 UTC

FWIW, the quoted Intel recommendation is still included in what appears to be the most recent version of the Intel® 64 and IA-32 Architectures
Optimization Reference Manual (April 2018).

Comment 15 Florian Weimer 2018-11-20 15:50:40 UTC

Jakub pointed out on IRC that GCC uses REPNE SCASB only under very specific conditions (as when optimizing for size).  For some reason, GCC versions before 8 assume that key parts of vfprintf are cold, when they are not.  GCC continues to do this even if I #define out __builtin_expect, so it is not related to glibc's often questionable use of that built-in.

With GCC 7, I see this in vfprintf.c.227t.optimized:

;; Function group_number (group_number, funcdef_no=76, decl_uid=12164, cgraph_uid=76, symbol_order=86) (unlikely executed)

With GCC 8, I get this instead in vfprintf.c.232t.optimized:

;; Function group_number (group_number, funcdef_no=76, decl_uid=13078, cgraph_uid=76, symbol_order=86)

I have no idea whether this is just an accident, or the result of a deliberate change to the way GCC determines whether a function is cold (and should be optimized for size).

Comment 20 Martin Sebor 2019-01-11 23:25:18 UTC

Some improvement is possible by making GCC better at optimizing snprintf.  I opened https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88813 for one of these enhancements (making use of array sizes to transform snprintf to strcpy).  The other (integrating the sprintf and strlen passes) is already planned for GCC 10.  It should make it possible to transform snprintf calls like the one in the test_snprintf() function in attachment 1507400 [details] into memcpy(dst, src, len + 1).

Comment 25 errata-xmlrpc 2020-04-28 16:56:35 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1864

Note You need to log in before you can comment on or make changes to this bug.