Bug 1880670
| Field | Value |
|---|---|
| Summary | glibc: memcpy calls are slower for x86_64 processors on RHEL 8 than on RHEL 7 |
| Product | Red Hat Enterprise Linux 8 |
| Component | glibc |
| Version | 8.2 |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | unspecified |
| Reporter | Brandon Clark <brclark> |
| Assignee | Florian Weimer <fweimer> |
| QA Contact | Sergey Kolosov <skolosov> |
| Docs Contact | Zuzana Zoubkova <zzoubkov> |
| CC | alanm, amike, ashankar, brclark, casantos, chorn, codonell, dj, fweimer, jaeshin, jwright, kdudka, mbliss, mcermak, mkolbas, mnewsome, pfrankli, sajan.karumanchi, sipoyare, tnagata, vmukhame |
| Target Milestone | rc |
| Target Release | 8.0 |
| Keywords | Bugfix, Patch, Triaged, ZStream |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | glibc-2.28-137.el8 |
| Doc Type | Bug Fix |
| Clones | 1913750 (view as bug list) |
| Bug Depends On | 1893197 |
| Bug Blocks | 1913750 |
| Last Closed | 2021-05-18 14:36:39 UTC |
| Type | Bug |

Doc Text:

.The `glibc` string functions now avoid negative impact on the system cache on AMD64 and Intel 64 processors

Previously, the `glibc` implementation of string functions incorrectly estimated the amount of last-level cache available to a thread on 64-bit AMD and Intel processors. As a consequence, calling the `memcpy` function on large buffers either negatively impacted the overall cache performance of the system or slowed down the `memcpy` call itself.

With this update, the last-level cache size is no longer scaled with the number of reported hardware threads in the system. As a result, the string functions now bypass caches for large buffers, avoiding negative impact on the rest of the system cache.
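The "bypass caches for large buffers" behavior in the doc text refers to non-temporal stores. The sketch below is purely illustrative and is not glibc's implementation; the helper name `copy_nontemporal`, the 16-byte alignment assumption, and the buffer sizes are assumptions made for the example, which shows the mechanism with SSE2 intrinsics.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_loadu_si128, _mm_stream_si128 */
#include <stdlib.h>
#include <string.h>

/* Copy SIZE bytes from SRC to DST using non-temporal stores, so the
   copied data does not displace other data from the CPU caches.
   Illustrative only -- not glibc's implementation.  Assumes DST is
   16-byte aligned and SIZE is a multiple of 16.  */
static void
copy_nontemporal (void *dst, const void *src, size_t size)
{
  const __m128i *s = (const __m128i *) src;
  __m128i *d = (__m128i *) dst;
  for (size_t i = 0; i < size / sizeof (__m128i); i++)
    _mm_stream_si128 (d + i, _mm_loadu_si128 (s + i));
  _mm_sfence ();  /* order the non-temporal stores before later stores */
}

int
main (void)
{
  size_t size = 32 << 20;                 /* 32 MiB, as in the test below */
  void *src = aligned_alloc (16, size);
  void *dst = aligned_alloc (16, size);
  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 0x5a, size);
  copy_nontemporal (dst, src, size);
  return memcmp (dst, src, size) != 0;
}
```

Whether such stores help or hurt depends on the buffer size relative to the cache, which is exactly what the non-temporal threshold discussed in the comments below controls.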
The sosreport shows this:

    processor   : 0
    vendor_id   : AuthenticAMD
    cpu family  : 23
    model       : 49
    model name  : AMD EPYC 7542 32-Core Processor
    stepping    : 0
    microcode   : 0x8301038
    cpu MHz     : 2428.418
    cache size  : 512 KB

We have a couple of those in Beaker, so we can check fairly quickly whether it's fixed upstream.

I configured an AMD EPYC 7452 32-Core Processor with RHEL 8.3. I built glibc upstream master as of commit 2deb7793907c7995b094b3778017c0ef0bd432d5. Testsuite results look clean.

Baseline: glibc-2.17-317.el7.x86_64.rpm:

    ./rpm/lib64/ld-linux-x86-64.so.2 --library-path /root/rpm/lib64/ ./test_memcpy 32
    32 MB = 1.763765 ms

RHEL 8: glibc-2.28-127.el8.x86_64:

    for i in a b c d e f g h i j; do ./test_memcpy 32; done
    32 MB = 3.400531 ms
    32 MB = 3.219160 ms
    32 MB = 3.211369 ms
    32 MB = 3.207968 ms
    32 MB = 3.216026 ms
    32 MB = 3.215049 ms
    32 MB = 3.207945 ms
    32 MB = 3.208351 ms
    32 MB = 3.203935 ms
    32 MB = 3.208126 ms

Picked routine is __memmove_avx_unaligned. This is slower than RHEL 7.

Upstream: glibc-2.32.9000:

    for i in a b c d e f g h i j; do ./build/elf/ld-linux-x86-64.so.2 --library-path /root/build/ ./test_memcpy 32; done
    32 MB = 1.766122 ms
    32 MB = 1.761784 ms
    32 MB = 1.760405 ms
    32 MB = 1.780899 ms
    32 MB = 1.760634 ms
    32 MB = 1.760430 ms
    32 MB = 1.762390 ms
    32 MB = 1.762698 ms
    32 MB = 1.763199 ms
    32 MB = 1.761510 ms

Picked routine is __memmove_avx_unaligned. This upstream performance matches RHEL 7 again, so something upstream has improved the performance.

The regression was fixed by:

    commit d3c57027470b78dba79c6d931e4e409b1fecfc80
    Author: Patrick McGehearty <patrick.mcgehearty>
    Date:   Mon Sep 28 20:11:28 2020 +0000

        Reversing calculation of __x86_shared_non_temporal_threshold

This makes some sense here, and upstream we discussed balancing the L3 cache among threads. I experimented a bit on the EPYC device:

- Reducing the cache usage from 75% to 50% worsens performance, e.g. ~1.91 ms.
- Reducing the cache usage from 75% to 66% gives about the same performance, e.g. ~1.76 ms.

So there is some room for fine-tuning this for the device. There are definitely some architectural details that matter here, and AMD is going to have to review this.

- In RHEL 8, __x86_shared_non_temporal_threshold is "cache * threads * 3 / 4".
  - The value of cache is 1572864.
  - The value of threads is 128.
  - Thus __x86_shared_non_temporal_threshold is 150,994,944 bytes, or 144 MiB (too high).
- In upstream master, __x86_shared_non_temporal_threshold is "cache * 3 / 4".
  - The value of cache is 1572864.
  - The value of threads is 128 (not used in this formula).
  - Thus __x86_shared_non_temporal_threshold is 1,179,648 bytes, or 1.125 MiB.

It is not clear to me whether the relevant code for AMD is correctly modelling the topology of the shared thread count for the cache:

    /* Figure out the number of logical threads that share L3.  */
    if (max_cpuid_ex >= 0x80000008)
      {
        /* Get width of APIC ID.  */
        __cpuid (0x80000008, max_cpuid_ex, ebx, ecx, edx);
        threads = 1 << ((ecx >> 12) & 0x0f);
      }

This is fairly rudimentary and could probably be improved. We divide the whole of the L3 cache size by the reported "threads" count to balance the cache usage. If the threads value is too high, we'll be giving threads less cache than they could use.

I have placed an unsupported test build with a backport of the relevant upstream patch here:

https://people.redhat.com/~fweimer/RbMvxmRwQE1x/glibc-2.28-130.el8.bz1880670.1/

I would appreciate it if those affected by this issue could test this build and report whether it addresses the issue. Thanks.
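To make the difference between the two formulas concrete, here is a small worked example in C using the values quoted above (cache = 1572864, threads = 128). Only the formulas, the values, and the name __x86_shared_non_temporal_threshold come from the comment; the variable names and the program itself are illustrative.

```c
#include <stdio.h>

int
main (void)
{
  unsigned long cache = 1572864;   /* shared-cache figure quoted above, in bytes */
  unsigned long threads = 128;     /* reported hardware threads sharing L3 */

  /* RHEL 8 (glibc 2.28): the threshold scales with the thread count.  */
  unsigned long rhel8 = cache * threads * 3 / 4;

  /* Upstream after commit d3c5702747: no scaling by the thread count.  */
  unsigned long upstream = cache * 3 / 4;

  printf ("RHEL 8 threshold:   %lu bytes (~%lu MiB)\n", rhel8, rhel8 >> 20);
  printf ("upstream threshold: %lu bytes (~%lu MiB)\n", upstream, upstream >> 20);

  /* A 32 MiB memcpy falls below the RHEL 8 threshold (cached stores,
     competing for L3) but above the upstream one (non-temporal stores),
     which is consistent with the timings quoted above.  */
  return 0;
}
```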
In August I posted a patch to bring in performance gains for memcpy/memmove on AMD machines, which tunes the non-temporal threshold parameter '__x86_shared_non_temporal_threshold' to 2/3 of the shareable cache per thread. Although this patch brought performance gains in walk-bench results, I did see a regression for memory ranges of 1 MB to 8 MB in large-bench results. I mentioned this in the cover note, looking for answers to the discrepancies in the results.

https://sourceware.org/pipermail/libc-alpha/2020-August/117080.html

As I did not get any response to my queries, I started working on a solution for AMD Zen architectures to fix this regression. On the verge of pushing my patch, I saw that Patrick's patch had already been committed to 2.32. Patrick's patch follows similar lines of tuning the non-temporal threshold and introduced the regression on AMD Zen machines mentioned earlier. I have rebased my patch and re-run the benchmark tests to come up with a solution that handles this regression on the master branch. This solution brings performance gains of ~44% for memory sizes greater than 16 MB, with no regression for 1 MB to 8 MB memory sizes in large-bench results. The patch "Optimizing memcpy for AMD Zen architecture" is posted for review.

Patch: https://sourceware.org/pipermail/libc-alpha/2020-October/118895.html

More details on the patch and performance numbers (cover note): https://sourceware.org/pipermail/libc-alpha/2020-October/118894.html

Since the problem described in this bug report should be resolved by a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: glibc security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1585

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
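As a rough illustration of the "2/3 of the shareable cache per thread" heuristic described above, the sketch below applies it to assumed Zen 2 topology figures (16 MiB of L3 per CCX, shared by 8 hardware threads). Both the topology numbers and the code are assumptions made for illustration; this is not the contents of the posted patch.

```c
#include <stdio.h>

int
main (void)
{
  unsigned long l3_per_ccx = 16UL << 20;  /* assumed: 16 MiB L3 per Zen 2 CCX */
  unsigned long threads_per_ccx = 8;      /* assumed: 4 cores x 2 SMT threads */

  unsigned long per_thread_share = l3_per_ccx / threads_per_ccx;
  unsigned long threshold = per_thread_share * 2 / 3;

  printf ("per-thread L3 share:          %lu KiB\n", per_thread_share >> 10);
  printf ("non-temporal threshold (2/3): %lu KiB\n", threshold >> 10);
  return 0;
}
```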
Created attachment 1715440: test_memcpy.c for reproduction steps.

Description of problem:

A customer has observed that a program that uses memcpy, compiled with generic tuning and run on an AMD EPYC processor, takes 1.32 s longer to process 1000 memcpy calls than on RHEL 7.

Version-Release number of selected component (if applicable):

glibc-2.28-101.el8.x86_64

How reproducible:

They were able to make this occur consistently.

Steps to Reproduce:

1. Use an AMD Zen 2 processor.
2. Compile the test file with the arguments '-mtune=generic -march=x86-64 -g -O3 test_memcpy.c -o test_memcpy64'.
3. Run the compiled application under the time command with '32' as the argument to test_memcpy.
4. Repeat the test on RHEL 7.

Actual results:

Execution of the compiled application is slower than on RHEL 7.

Expected results:

Execution is as fast as (or faster than) on RHEL 7.

Additional information:

The customer is currently using the following GLIBC_TUNABLES setting as a workaround:

GLIBC_TUNABLES=glibc.tune.hwcaps=-AVX_Usable,-AVX2_Usable,-Prefer_ERMS,-Prefer_FSRM,Prefer_No_AVX512,Prefer_No_VZEROUPPER,-AVX_Fast_Unaligned_Load,-ERMS
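The attached test_memcpy.c is not reproduced in this report. A minimal stand-in along the lines described (copy an N MiB buffer 1000 times and report the average per-copy time, matching the "32 MB = ... ms" output quoted in the comments) might look like the following; it is an illustrative sketch under those assumptions, not the actual attachment.

```c
/* Illustrative stand-in for attachment 1715440 (test_memcpy.c).
   Build, for example:
     gcc -mtune=generic -march=x86-64 -g -O3 test_memcpy.c -o test_memcpy64  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main (int argc, char **argv)
{
  size_t mb = (argc > 1) ? strtoul (argv[1], NULL, 10) : 32;
  size_t size = mb << 20;
  int iterations = 1000;   /* the report mentions 1000 memcpy calls */

  char *src = malloc (size);
  char *dst = malloc (size);
  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 0x5a, size);

  struct timespec start, end;
  clock_gettime (CLOCK_MONOTONIC, &start);
  for (int i = 0; i < iterations; i++)
    memcpy (dst, src, size);
  clock_gettime (CLOCK_MONOTONIC, &end);

  double elapsed_ms = (end.tv_sec - start.tv_sec) * 1e3
                      + (end.tv_nsec - start.tv_nsec) / 1e6;
  printf ("%zu MB = %f ms\n", mb, elapsed_ms / iterations);

  free (src);
  free (dst);
  return 0;
}
```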