Bug 1880670 - glibc: memcpy calls are slower for x86_64 processors on RHEL 8 than on RHEL 7
Summary: glibc: memcpy calls are slower for x86_64 processors on RHEL 8 than on RHEL 7
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: glibc
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.0
Assignee: Florian Weimer
QA Contact: Sergey Kolosov
Docs Contact: Zuzana Zoubkova
URL:
Whiteboard:
Depends On: 1893197
Blocks: 1913750
 
Reported: 2020-09-18 23:00 UTC by Brandon Clark
Modified: 2024-06-13 23:06 UTC
CC: 21 users

Fixed In Version: glibc-2.28-137.el8
Doc Type: Bug Fix
Doc Text:
.The `glibc` string functions now avoid negative impact on system cache on AMD64 and Intel 64 processors

Previously, the `glibc` implementation of string functions incorrectly estimated the amount of last-level cache available to a thread on 64-bit AMD and Intel processors. As a consequence, calling the `memcpy` function on large buffers either negatively impacted the overall cache performance of the system or slowed down the `memcpy` call itself. With this update, the last-level cache size is no longer scaled with the number of reported hardware threads in the system. As a result, the string functions now bypass caches for large buffers, avoiding a negative impact on the rest of the system cache.
Clone Of:
: 1913750 (view as bug list)
Environment:
Last Closed: 2021-05-18 14:36:39 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
test_memcpy.c for reproduction steps (1.02 KB, text/x-csrc)
2020-09-18 23:00 UTC, Brandon Clark


Links
Red Hat Knowledge Base (Solution) 5514711 (last updated 2021-01-14 01:32:36 UTC)

Internal Links: 1890830

Description Brandon Clark 2020-09-18 23:00:45 UTC
Created attachment 1715440 [details]
test_memcpy.c for reproduction steps.

Description of problem:
A customer has observed that a program using memcpy, compiled with generic tuning and run on an AMD EPYC processor, takes 1.32 s longer to complete 1000 memcpy calls on RHEL 8 than on RHEL 7.

Version-Release number of selected component (if applicable):
glibc-2.28-101.el8.x86_64

How reproducible:
The customer is able to reproduce this consistently.

Steps to Reproduce:
1. Have an AMD Zen 2 processor.
2. Compile the test file with 'gcc -mtune=generic -march=x86-64 -g -O3 test_memcpy.c -o test_memcpy64'.
3. Run the compiled application under the time command with '32' as the argument (e.g. time ./test_memcpy64 32).
4. Repeat test on RHEL 7.
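
(The attached test_memcpy.c is not reproduced inline. As an illustrative sketch only, not the customer's actual attachment, a minimal harness along these lines matches the steps above and the "32 MB = ... ms" output format seen in the comments below: it takes the buffer size in MB as its argument, performs 1000 memcpy calls, and prints the average time per call. The binary name test_memcpy64 is taken from step 2.)

/* Illustrative sketch of a memcpy timing harness; not the actual
   test_memcpy.c attachment.  Usage: ./test_memcpy64 <size in MB>  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <size in MB>\n", argv[0]);
        return 1;
    }
    size_t mb = (size_t) atoi(argv[1]);
    size_t len = mb * 1024 * 1024;
    char *src = malloc(len);
    char *dst = malloc(len);
    if (src == NULL || dst == NULL)
        return 1;
    memset(src, 'x', len);   /* fault the pages in before timing */
    memset(dst, 'y', len);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 1000; i++)
        memcpy(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* Keep dst live so the copies are not optimized away at -O3.  */
    volatile char sink = dst[len - 1];
    (void) sink;

    double total_ms = (end.tv_sec - start.tv_sec) * 1000.0
                      + (end.tv_nsec - start.tv_nsec) / 1e6;
    printf("%zu MB = %f ms\n", mb, total_ms / 1000.0);   /* average per call */

    free(src);
    free(dst);
    return 0;
}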

Actual results:
Execution of the compiled application is slower on RHEL 8 than on RHEL 7.

Expected results:
Execution is as fast as (or faster than) on RHEL 7.

Additional Information:
The customer is currently using the following GLIBC_TUNABLES setting as a workaround:

GLIBC_TUNABLES=glibc.tune.hwcaps=-AVX_Usable,-AVX2_Usable,-Prefer_ERMS,-Prefer_FSRM,Prefer_No_AVX512,Prefer_No_VZEROUPPER,-AVX_Fast_Unaligned_Load,-ERMS

Comment 1 Florian Weimer 2020-09-21 16:29:39 UTC
The sosreport shows this:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 49
model name	: AMD EPYC 7542 32-Core Processor
stepping	: 0
microcode	: 0x8301038
cpu MHz		: 2428.418
cache size	: 512 KB

We have a couple of those in Beaker, so we can check fairly quickly if it's fixed upstream.

Comment 7 Carlos O'Donell 2020-10-02 12:00:21 UTC
I configured an AMD EPYC 7452 32-Core Processor system with RHEL 8.3.

I built glibc upstream master as of commit 2deb7793907c7995b094b3778017c0ef0bd432d5

Testsuite results look clean.

Baseline: glibc-2.17-317.el7.x86_64.rpm:
./rpm/lib64/ld-linux-x86-64.so.2 --library-path /root/rpm/lib64/ ./test_memcpy 32
32 MB = 1.763765 ms

RHEL 8: glibc-2.28-127.el8.x86_64:
for i in a b c d e f g h i j; do ./test_memcpy 32; done
32 MB = 3.400531 ms
32 MB = 3.219160 ms
32 MB = 3.211369 ms
32 MB = 3.207968 ms
32 MB = 3.216026 ms
32 MB = 3.215049 ms
32 MB = 3.207945 ms
32 MB = 3.208351 ms
32 MB = 3.203935 ms
32 MB = 3.208126 ms

Picked routine is __memmove_avx_unaligned.

This performance is worse than on RHEL 7.

Upstream: glibc-2.32.9000
for i in a b c d e f g h i j; do ./build/elf/ld-linux-x86-64.so.2 --library-path /root/build/ ./test_memcpy 32; done
32 MB = 1.766122 ms
32 MB = 1.761784 ms
32 MB = 1.760405 ms
32 MB = 1.780899 ms
32 MB = 1.760634 ms
32 MB = 1.760430 ms
32 MB = 1.762390 ms
32 MB = 1.762698 ms
32 MB = 1.763199 ms
32 MB = 1.761510 ms

Picked routine is __memmove_avx_unaligned.

This upstream performance matches RHEL 7 again, so something upstream has improved the performance.

The regression was fixed by:

commit d3c57027470b78dba79c6d931e4e409b1fecfc80
Author: Patrick McGehearty <patrick.mcgehearty>
Date:   Mon Sep 28 20:11:28 2020 +0000

    Reversing calculation of __x86_shared_non_temporal_threshold

This makes some sense here; upstream we discussed how that L3 cache should be balanced among threads.

I experimented a bit on the EPYC machine:
Reducing the cache usage from 75% to 50% worsens performance (~1.91 ms).
Reducing the cache usage from 75% to 66% gives about the same performance (~1.76 ms).
So there is some room for fine-tuning this for the device.

There are definitely some architectural details that matter here, and AMD is going to have to review this.

- In RHEL 8, __x86_shared_non_temporal_threshold is "cache * threads * 3 / 4":
  - Value of cache is 1572864.
  - Value of threads is 128.
  - Thus __x86_shared_non_temporal_threshold is 150,994,944 bytes, or ~144 MiB (too high).

- In upstream master, __x86_shared_non_temporal_threshold is "cache * 3 / 4":
  - Value of cache is 1572864.
  - Value of threads is 128, but it no longer enters the calculation.
  - Thus __x86_shared_non_temporal_threshold is 1,179,648 bytes, or ~1.1 MiB.
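
To make the arithmetic concrete, here is a small standalone sketch that computes both thresholds from the values quoted above (illustrative only, not the actual glibc implementation; the variable names are just labels for the "cache" and "threads" values in the bullets):

/* Sketch of the two non-temporal threshold formulas, using the values
   quoted in the bullets above.  Not glibc code.  */
#include <stdio.h>

int main(void)
{
    long cache = 1572864;    /* "cache" value from the bullets above */
    long threads = 128;      /* "threads" value from the bullets above */

    /* RHEL 8: scale by the thread count, then take 3/4.  */
    long rhel8 = cache * threads * 3 / 4;    /* 150994944 bytes, ~144 MiB */

    /* Upstream master: no thread scaling.  */
    long upstream = cache * 3 / 4;           /* 1179648 bytes, ~1.1 MiB */

    printf("RHEL 8 threshold:   %ld bytes\n", rhel8);
    printf("upstream threshold: %ld bytes\n", upstream);
    return 0;
}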

It is not clear to me whether the relevant code for AMD correctly models the number of threads actually sharing the cache.

  /* Figure out the number of logical threads that share L3.  */
  if (max_cpuid_ex >= 0x80000008)
    {
      /* Get width of APIC ID.  */
      __cpuid (0x80000008, max_cpuid_ex, ebx, ecx, edx);
      threads = 1 << ((ecx >> 12) & 0x0f);
    }

This is fairly rudimentary and could probably be improved.

We divide the whole of the L3 cache size by the reported "threads" count to balance the cache usage.

If the threads value is too high, we'll be giving each thread less cache than it could actually use.

Comment 10 Florian Weimer 2020-10-02 20:32:06 UTC
I have placed an unsupported test build with a backport of the relevant upstream patch here:

  https://people.redhat.com/~fweimer/RbMvxmRwQE1x/glibc-2.28-130.el8.bz1880670.1/

I would appreciate it if those affected could test this build and report whether it addresses the issue. Thanks.

Comment 13 Sajan 2020-10-22 05:37:32 UTC
In August I posted a patch that brings performance gains for memcpy/memmove on AMD machines by tuning the non-temporal threshold parameter '__x86_shared_non_temporal_threshold' to 2/3 of the shareable cache per thread.

Though this patch brought performance gains in walk-bench results, I did see a regression for memory ranges of 1 MB to 8 MB in large-bench results.
I mentioned this in the cover note, looking for answers to the discrepancies in the results:
https://sourceware.org/pipermail/libc-alpha/2020-August/117080.html
As I did not get any response to my queries, I started working on a solution for AMD Zen architectures to fix this regression.

Just as I was about to push my patch, I saw that Patrick's patch had already been committed for 2.32. Patrick's patch tunes the non-temporal threshold along similar lines and brought in the regression on AMD Zen machines mentioned earlier.

I have rebased my patch and re-run the benchmark tests to come up with a solution that handles this regression on the master branch.
This solution brings performance gains of ~44% for memory sizes greater than 16 MB, with no regression for 1 MB to 8 MB memory sizes in large-bench results.

The patch "Optimizing memcpy for AMD Zen architecture" has been posted for review.
Patch: https://sourceware.org/pipermail/libc-alpha/2020-October/118895.html
More details on the patch and the performance numbers are in the cover note: https://sourceware.org/pipermail/libc-alpha/2020-October/118894.html

Comment 35 errata-xmlrpc 2021-05-18 14:36:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: glibc security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1585

Comment 36 Red Hat Bugzilla 2023-11-16 04:25:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

