Bug 2180462
| Field | Value |
|---|---|
| Summary | glibc: Memcpy throughput lower on RH8.4 compared to RH7.5 - same Skylake hardware |
| Product | Red Hat Enterprise Linux 8 |
| Reporter | Joe Mario <jmario> |
| Component | glibc |
| Assignee | DJ Delorie <dj> |
| Status | CLOSED ERRATA |
| QA Contact | Martin Coufal <mcoufal> |
| Severity | high |
| Docs Contact | Jacob Taylor Valdez <jvaldez> |
| Priority | unspecified |
| Version | 8.4 |
| CC | ashankar, barend.havenga, bgray, cdeardor, codonell, dj, fweimer, jvaldez, jyoung, mcermak, pandrade, pfrankli, sipoyare |
| Target Milestone | rc |
| Keywords | Regression, Triaged |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | glibc-2.28-236.el8 |
| Doc Type | Enhancement |
| Doc Text | Improved string and memory routine performance on Intel® Xeon® v5-based hardware in `glibc`. Previously, the default amount of cache used by `glibc` for string and memory routines resulted in lower than expected performance on Intel® Xeon® v5-based systems. With this update, the amount of cache to use has been tuned to improve performance. |
| Clones | 2213907 (view as bug list) |
| Bug Blocks | 2213907 |
| Last Closed | 2023-11-14 15:49:05 UTC |
| Type | Bug |
Description
Joe Mario
2023-03-21 14:33:19 UTC
Just to clarify, I was able to reproduce this on Intel Skylake and Cascade Lake servers. The problem appears less severe on an Ice Lake server. The default routine used for memcpy there is __memcpy_evex_unaligned_erms(), and the routine used with Prefer_ERMS is __memcpy_erms(). On the Ice Lake, I get the following using the attached memcpy reproducer:

~~~
# ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 27.42 GB/sec

# GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS ./memcpy 3 500
Rate for 500 3MB memcpy iterations: 30.17 GB/sec
~~~

Joe, could you please add the cpuinfo for the CPU you're testing with?

Here's the lscpu output:

~~~
# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              56
On-line CPU(s) list: 0-55
Thread(s) per core:  1
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
BIOS Model name:     Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
Stepping:            4
CPU MHz:             3800.000
CPU max MHz:         3800.0000
CPU min MHz:         1000.0000
BogoMIPS:            5000.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            39424K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
~~~

There are two things we need to review.

(a) The tunables listed via ld.so --list-tunables do not seem to be populated with the correct values.
- The tunable values are used to influence the behaviour of memcpy algorithms.
- DJ Delorie is reviewing the tunable values, and the cpuid results, for this Skylake system.

(b) The dynamic runtime selection of the memcpy routine for Skylake.
- I'm reviewing the selection, and possible alternatives that might provide the best geomean performance for the *system*.
- I am in discussions with Intel engineers to review microbenchmark results.

I'll update the bug as we make progress.

We are discussing with Intel whether we should be adjusting the dynamic threshold differently for these processors (Skylake, Cascade Lake, and Ice Lake). To give a brief overview, the families have different L3 topologies, and to some degree we must balance single-threaded performance against multi-threaded performance. RHEL 7 tunes preferentially for single-threaded performance, with multi-threaded performance showing severe degradation in benchmarks like STREAM. In RHEL 8 and RHEL 9, performance is balanced between a single core and the whole system, and the key parameter here is how much L3 cache to allow each in-flight memory copy to consume.
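For readers who do not have the attached reproducer handy, the measurement above can be approximated with a few lines of C. This is only a sketch, not the actual attachment: the argument handling (copy size in MB, iteration count) and the GB/sec output format are assumptions chosen to mirror the runs shown above.

~~~
/* Hypothetical sketch of a memcpy throughput test, similar in spirit to the
   attached reproducer (which is not reproduced here).
   Usage: ./memcpy <size_in_MB> <iterations>  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main (int argc, char **argv)
{
  size_t mb = argc > 1 ? strtoul (argv[1], NULL, 0) : 3;
  size_t iters = argc > 2 ? strtoul (argv[2], NULL, 0) : 500;
  size_t len = mb * 1024 * 1024;

  char *src = malloc (len);
  char *dst = malloc (len);
  if (src == NULL || dst == NULL)
    return 1;
  /* Touch all pages up front so the loop measures copying, not page faults.  */
  memset (src, 0x5a, len);
  memset (dst, 0, len);

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (size_t i = 0; i < iters; i++)
    memcpy (dst, src, len);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  double gbytes = (double) len * iters / (1024.0 * 1024.0 * 1024.0);
  printf ("Rate for %zu %zuMB memcpy iterations: %.2f GB/sec\n",
          iters, mb, gbytes / secs);
  return 0;
}
~~~

Built with something like gcc -O2 -o memcpy memcpy.c, running it with and without GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS (or with glibc.cpu.x86_non_temporal_threshold raised) is enough to see whether a given copy size is taking the non-temporal path.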
The change in RHEL 8.3 looks like this:

~~~
+- /* The large memcpy micro benchmark in glibc shows that 6 times of
+-    shared cache size is the approximate value above which non-temporal
+-    store becomes faster on a 8-core processor.  This is the 3/4 of the
+-    total shared cache size.  */
++ /* The default setting for the non_temporal threshold is 3/4 of one
++    thread's share of the chip's cache. For most Intel and AMD processors
++    with an initial release date between 2017 and 2020, a thread's typical
++    share of the cache is from 500 KBytes to 2 MBytes. Using the 3/4
++    threshold leaves 125 KBytes to 500 KBytes of the thread's data
++    in cache after a maximum temporal copy, which will maintain
++    in cache a reasonable portion of the thread's stack and other
++    active data. If the threshold is set higher than one thread's
++    share of the cache, it has a substantial risk of negatively
++    impacting the performance of other threads running on the chip. */
+  __x86_shared_non_temporal_threshold
+    = (cpu_features->non_temporal_threshold != 0
+       ? cpu_features->non_temporal_threshold
+-      : __x86_shared_cache_size * threads * 3 / 4);
++      : __x86_shared_cache_size * 3 / 4);
~~~

This is the current upstream glibc implementation as discussed between Intel, AMD, Red Hat, Oracle, and others who contributed to the overall tuning of this parameter for whole-system performance. Before the change in RHEL 8.3 we were aggressively favouring single-threaded performance. There is probably room to improve the current tuning on systems that have a unified L3, as Skylake does, and that is something we are still investigating.

Options for the customer look like this:

(a) Use the ERMS routine directly:

~~~
export GLIBC_TUNABLES=glibc.cpu.hwcaps=Prefer_ERMS
~~~

(b) Raise the threshold for non-temporal stores (0x2ffff0 is roughly 3 MB; see the worked numbers at the end of this report):

~~~
export GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=0x2ffff0
~~~

Next steps:
- Continue to review tuning for Skylake, Cascade Lake and Ice Lake.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (glibc bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7107
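To put rough numbers on that change for the Xeon Platinum 8180 reported above (39424K of L3 per socket, 28 cores per socket, one thread per core), here is an illustrative calculation. It assumes __x86_shared_cache_size works out to one thread's share of the L3, as the new comment in the patch describes; treat the figures as approximations rather than the exact values glibc computes on this machine.

~~~
/* Rough, illustrative arithmetic for the non-temporal threshold on the
   Xeon Platinum 8180 in this report: 39424 KB of L3 shared by 28 cores
   (one thread per core).  Assumes __x86_shared_cache_size is one thread's
   share of the L3, per the patch comment above.  */
#include <stdio.h>

int main (void)
{
  long l3_bytes = 39424L * 1024;        /* L3 cache: 39424K from lscpu */
  long threads  = 28;                   /* cores per socket, 1 thread/core */
  long share    = l3_bytes / threads;   /* ~1408 KB per thread */

  long old_threshold = share * threads * 3 / 4;  /* pre-RHEL 8.3 formula */
  long new_threshold = share * 3 / 4;            /* RHEL 8.3+ formula */

  printf ("per-thread L3 share: ~%ld KB\n", share / 1024);          /* ~1408 KB */
  printf ("old threshold:       ~%ld MB\n", old_threshold / (1024 * 1024)); /* ~28 MB */
  printf ("new threshold:       ~%ld KB\n", new_threshold / 1024);  /* ~1056 KB */

  /* A 3 MB copy is far below the old ~28 MB threshold (cached copy) but
     above the new ~1 MB threshold, so it switches to non-temporal stores,
     which is what the reproducer's lower GB/sec reflects.  Raising the
     tunable to 0x2ffff0 (~3 MB) keeps such copies on the cached path.  */
  return 0;
}
~~~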