Bug 2213907

Summary: glibc: Memcpy throughput lower on RHEL 9.3 compared to RHEL 8.3/RHEL 7.5 - same Skylake hardware
Product: Red Hat Enterprise Linux 9 Reporter: Carlos O'Donell <codonell>
Component: glibc Assignee: DJ Delorie <dj>
Status: MODIFIED --- QA Contact: Sergey Kolosov <skolosov>
Severity: high Docs Contact:
Priority: unspecified    
Version: 9.3 CC: ashankar, barend.havenga, bgray, codonell, dj, fweimer, jmario, pfrankli, qe-baseos-tools-bugs, sipoyare, skolosov
Target Milestone: rc Keywords: Regression, Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glibc-2.34-82.el9 Doc Type: Enhancement
Doc Text:
Feature: Improved string and memory routine performance on Intel Skylake-based hardware.
Reason: The default amount of cache to use for string and memory routine performance is a balance between single process and whole system performance. It was found that on Intel Skylake-based systems the tuning could result in lower than expected performance. The default amount of cache to use for string and memory routines was reviewed against industry standard benchmarks.
Result: The default amount of cache to use has been increased to improve performance.
Story Points: ---
Clone Of: 2180462 Environment:
Last Closed: Type: Bug
Bug Depends On: 2180462    
Bug Blocks: 2166710    

Comment 1 Carlos O'Donell 2023-06-09 20:43:13 UTC
In RHEL 9 we should review the amount of L3 cache used for in-flight memory copies and adjust it based on upstream discussions with Intel.

The same issue for RHEL 8 is this one:
https://bugzilla.redhat.com/show_bug.cgi?id=2180462

Comment 2 Florian Weimer 2023-06-13 09:03:44 UTC
In particular, this should include a backport of this commit to benefit TDX environments as they exist today:

commit ed2f9dc9420c4c61436328778a70459d0a35556a
Author: Noah Goldstein <goldstein.w.n>
Date:   Mon May 8 22:10:20 2023 -0500

    x86: Use 64MB as nt-store threshold if no cacheinfo [BZ #30429]
    
    If `non_temporal_threshold` is below `minimum_non_temporal_threshold`,
    it almost certainly means we failed to read the system's cache info.
    
    In this case, rather than defaulting to the minimum correct value, we
    should default to a value that gets at least reasonable
    performance. 64MB is chosen conservatively to be at the very high
    end. This should never cause non-temporal stores when, if we had read
    cache info, we wouldn't have otherwise.
    Reviewed-by: Florian Weimer <fweimer>
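
The fallback rule described in the commit message above can be sketched roughly as follows. This is an illustrative sketch only, not the glibc source: the function name `choose_nt_threshold` and the macros `NT_MINIMUM` and `NT_FALLBACK` are hypothetical, and the minimum value used here is an assumption chosen for the example. Only the 64MB fallback comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed lower bound for a plausible non-temporal threshold.
   Hypothetical value for illustration; not taken from this bug.  */
#define NT_MINIMUM ((size_t) 0x4040)

/* 64MB fallback, as named in the commit message.  */
#define NT_FALLBACK ((size_t) 64 * 1024 * 1024)

/* Return the non-temporal store threshold to use.  A computed value
   below the minimum is treated as a sign that cache info could not
   be read, so the conservative 64MB fallback is used instead of
   clamping up to the minimum.  */
static size_t
choose_nt_threshold (size_t computed)
{
  assert (NT_FALLBACK >= NT_MINIMUM);

  if (computed < NT_MINIMUM)
    return NT_FALLBACK;

  return computed;
}
```

With these assumed values, `choose_nt_threshold (0)` yields the 64MB fallback, while any computed value at or above the minimum is returned unchanged. Choosing a large fallback means a system with unreadable cache info (for example, a TDX guest) only switches to non-temporal stores for very large copies.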