Created attachment 1952419 [details]
Simple memcpy reproducer

Description of problem:
Customer reported a performance regression from RHEL 7 to RHEL 8 on Intel Skylake.

Version-Release number of selected component (if applicable):

How reproducible:
The customer used the following example to demonstrate the problem.

# perf bench mem memcpy -f default --nr_loops 500 --size 3MB

That test achieved 8.5 GB/sec on RHEL-7.5, and only 5.3 GB/sec on RHEL-8.4. This is easily reproducible.

Steps to Reproduce:
Run the above test on RHEL-7.5 and again on RHEL-8.4. The customer had a 2-socket Skylake server. I have been able to reproduce this on a 2-socket Cascade Lake server.

Additional info:
Thanks to great triaging help from Carlos O'Donell, the problem is understood. It turns out glibc is selecting a sub-optimal memcpy routine for that processor.

On RHEL-7.5, it used the "__memcpy_ssse3_back()" routine, which was the optimal choice then. On RHEL-8.4, the glibc memcpy routine used is "__memmove_avx_unaligned_erms()".

On RHEL-8.4, if the "Prefer_ERMS" attribute is given to glibc, then the faster "__memmove_erms()" is used. For example, slow and fast cases:

# perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
 5.468937 GB/sec
# GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS \
> perf bench mem memcpy -f default --nr_loops 500 --size 3MB |grep GB
 12.508272 GB/sec

I've also attached a simple memcpy reproducer to demonstrate the problem, as shown below:

# gcc -O memcpy.c -o memcpy
# ./memcpy --help
USAGE: ./memcpy size-in-MB loop-iterations
# ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 7.30 GB/sec
# GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 27.29 GB/sec

The customer's system did boot with mitigations=off, and with transparent_hugepages (THP) disabled.
Neither is needed to reproduce this problem, but disabling THP does enable the simple memcpy reproducer to achieve much higher performance.
Just to clarify, I was able to reproduce this on Intel Skylake and Cascade Lake servers. The problem appears less severe on an Ice Lake server. The default routine used for memcpy there is __memcpy_evex_unaligned_erms(), and the routine used with Prefer_ERMS is __memcpy_erms().

On the Ice Lake, I get the following using the attached memcpy reproducer:

# ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 27.42 GB/sec
# GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 30.17 GB/sec
Joe, could you please add the cpuinfo for the CPU you're testing with?
Here's the lscpu output:

# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              56
On-line CPU(s) list: 0-55
Thread(s) per core:  1
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
BIOS Model name:     Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
Stepping:            4
CPU MHz:             3800.000
CPU max MHz:         3800.0000
CPU min MHz:         1000.0000
BogoMIPS:            5000.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            39424K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
There are two things we need to review.

(a) The tunables listed via ld.so --list-tunables do not seem to be populated with the correct values.
- The tunable values are used to influence the behaviour of memcpy algorithms.
- DJ Delorie is reviewing the tunables values, and the cpuid results, for this Skylake system.

(b) The dynamic runtime selection of the memcpy routine for Skylake.
- I'm reviewing the selection, and possible alternatives that might provide the best geomean performance for the *system*.
- I am in discussions with Intel engineers to review microbenchmark results.

I'll update the bug as we make progress.
We are discussing with Intel whether we should be adjusting the dynamic threshold differently for these processors (Skylake, Cascade Lake, and Ice Lake).

To give a brief overview, the families include a different L3 topology, and to some degree we must balance single-thread performance vs. multi-threaded performance. RHEL 7 tunes preferentially for single-threaded performance, with multi-threaded performance showing severe degradation in benchmarks like STREAM. In RHEL 8 and RHEL 9, performance is balanced in the middle between a single core and the whole system, and the key parameter here is how much L3 cache to allow each in-flight memory copy to consume.

The change in RHEL 8.3 looks like this:
~~~
- /* The large memcpy micro benchmark in glibc shows that 6 times of
-    shared cache size is the approximate value above which non-temporal
-    store becomes faster on a 8-core processor. This is the 3/4 of the
-    total shared cache size. */
+ /* The default setting for the non_temporal threshold is 3/4 of one
+    thread's share of the chip's cache. For most Intel and AMD processors
+    with an initial release date between 2017 and 2020, a thread's typical
+    share of the cache is from 500 KBytes to 2 MBytes. Using the 3/4
+    threshold leaves 125 KBytes to 500 KBytes of the thread's data
+    in cache after a maximum temporal copy, which will maintain
+    in cache a reasonable portion of the thread's stack and other
+    active data. If the threshold is set higher than one thread's
+    share of the cache, it has a substantial risk of negatively
+    impacting the performance of other threads running on the chip. */
  __x86_shared_non_temporal_threshold
    = (cpu_features->non_temporal_threshold != 0
       ? cpu_features->non_temporal_threshold
-      : __x86_shared_cache_size * threads * 3 / 4);
+      : __x86_shared_cache_size * 3 / 4);
~~~
This is the current upstream glibc implementation, as discussed between Intel, AMD, Red Hat, Oracle, and others who contributed overall to the tuning of this parameter for whole-system performance. Before the change in RHEL 8.3 we were aggressively favouring single-threaded performance. There is probably room to improve the current tuning on systems that have a unified L3 like Skylake has, and that is something we are still investigating.

Options for the customer look like this:

(a) Use the ERMS routine directly:
export GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS

(b) Raise the threshold for non-temporal stores:
export GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=0x2ffff0

Next steps:
- Continue to review tuning for Skylake, Cascade Lake and Ice Lake.