Created attachment 1844409
TLS test case
Description of problem: When two separate threads load TLS library functions sequentially, one thread becomes very slow at accessing TLS variables because of a generation counter mismatch, which leads glibc to conclude that it needs to reallocate memory for that thread.
Version-Release number of selected component (if applicable):
2.28-164.el8.x86_64
How reproducible: 100%
Steps to Reproduce:
1. yum install gcc gcc-c++ make glibc-devel openssl-devel
2. Unzip shell archive with test case.
3. Run "make".
4. Execute program "tls-test".
Actual results:
One thread is slower than the other to access TLS variables:
none loaded
  main normal variable : 554.770ms
  main thread-local variable : 578.829ms
lib1 loaded
  main normal variable : 536.941ms
  main thread-local variable : 504.300ms
  lib1 variable : 2079.362ms
lib2 loaded
  main normal variable : 451.575ms
  main thread-local variable : 434.603ms
  lib1 variable : 5567.543ms
lib2 accessed
  main normal variable : 424.644ms
  main thread-local variable : 429.140ms
  lib1 variable : 1911.933ms
Expected results: lib1 variable access time is consistent.
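One way to picture the generation counter mismatch is the toy model below. It is a sketch under stated assumptions, not glibc's implementation: `global_generation`, `struct module`, `struct thread_dtv`, `load_module` and `tls_access` are hypothetical names. The rule that a thread only catches up to the generation of the module it is accessing is an assumption, chosen because it is consistent with the timings above (lib1 access slows down once lib2 is loaded and recovers after lib2 is accessed).

```c
#include <stddef.h>

/* Toy model only: all names here are hypothetical, not glibc internals. */

/* Bumped every time a library with TLS is loaded. */
static size_t global_generation;

struct module {
    size_t generation;            /* value of global_generation at load time */
};

struct thread_dtv {
    size_t generation;            /* generation this thread has caught up to */
    unsigned long slow_path_hits; /* number of accesses that took the slow path */
};

/* Model loading a shared object that defines TLS variables. */
static void load_module(struct module *m)
{
    m->generation = ++global_generation;
}

/* Model one access by thread t to a TLS variable defined in module m. */
static void tls_access(struct thread_dtv *t, const struct module *m)
{
    if (t->generation == global_generation)
        return;                   /* fast path: thread is up to date */

    /* Slow path: the thread's generation does not match the global one. */
    t->slow_path_hits++;

    /* ASSUMPTION: the thread only advances to the generation of the module
       being accessed, so it can stay behind the global generation. */
    if (t->generation < m->generation)
        t->generation = m->generation;
}
```

In this model, loading lib1 and lib2 and then accessing only lib1 takes the slow path on every access. A single access to lib2 brings the thread up to date, after which lib1 accesses return to the fast path.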
Additional info:
- Issue was first noted in 2016 (https://patchwork.ozlabs.org/project/glibc/patch/1465309688.1188.19.camel@mailbox.tu-dresden.de/), and a patch was proposed.
Under bug 1991001, we are considering backporting changes to the DTV TLS management so that it aligns with upstream. That backport is a prerequisite for backporting an eventual upstream fix for this bug; no such fix exists at this time.
We backported the glibc.rtld.optional_static_tls upstream tunable as part of bug 1817513. With the tunable, it is possible to get initial-exec TLS in dlopen'ed shared objects working in more cases. Initial-exec TLS does not suffer from the performance degradation, so it might be an alternative approach. For instance, glibc malloc uses initial-exec TLS to access its thread-local data, so it is not affected by this.
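As a minimal sketch of the initial-exec alternative, a library could mark its thread-local data with the GCC/Clang `tls_model` attribute, as shown below. The variable `lib_counter` and the function `lib_counter_increment` are hypothetical names for illustration and do not come from the attached test case. The same effect can be requested for a whole translation unit with the GCC option `-ftls-model=initial-exec`. Whether a dlopen of such a library succeeds depends on how much static TLS space is available, which is what the `glibc.rtld.optional_static_tls` tunable (set through the `GLIBC_TUNABLES` environment variable) influences.

```c
/* Hypothetical library-side code; assumes GCC or Clang on an ELF target. */

/* Request the initial-exec TLS model for this variable, so accesses use a
   fixed offset from the thread pointer instead of calling __tls_get_addr. */
static __thread unsigned long lib_counter
    __attribute__((tls_model("initial-exec")));

/* Increment the calling thread's counter and return the new value. */
unsigned long lib_counter_increment(void)
{
    return ++lib_counter;
}
```

The trade-off is that initial-exec TLS in a dlopen'ed object relies on spare static TLS space, so this is a workaround for specific libraries rather than a general fix.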