Bug 1740039

Summary: glibc: wrong handling of dlopen() of a nonexistent/broken library, dl_tls_max_dtv_idx incremented too early
Product: Red Hat Enterprise Linux 7 Reporter: Antonio Di Monaco <antonio.di.monaco>
Component: glibc Assignee: Florian Weimer <fweimer>
Status: CLOSED ERRATA QA Contact: qe-baseos-tools-bugs
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.7 CC: aarnold, ashankar, bfinger, bgollahe, cbrune, codonell, dj, fweimer, gcase, kim-thomas.rehmann, mcermak, mnewsome, pfrankli, skolosov, tgummels, woodard
Target Milestone: rc Keywords: Patch
Target Release: 7.8   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glibc-2.17-307.el7 Doc Type: Bug Fix
Doc Text:
Cause: An attempt to call dlopen on an ET_EXEC executable fails as expected, but also leaves the dynamic loader in an inconsistent state.
Consequence: A later call to pthread_create crashes with a segmentation fault, due to inconsistent TLS data structures.
Fix: A check has been added to dlopen to reject ET_EXEC executables earlier during execution, before the TLS data structures become inconsistent.
Result: The dlopen failure is reported, and subsequent pthread_create calls behave as expected.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-31 19:08:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1599298, 1754591    
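
The Doc Text above turns on whether the file passed to dlopen is an ET_EXEC executable. As a minimal application-level sketch (this is not the check that was added inside glibc), a caller could inspect the ELF header before calling dlopen. The function names elf_type_of and is_et_exec are hypothetical, and the code assumes the file uses the host byte order.

```c
#include <elf.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return the ELF e_type of the file at path (for example ET_EXEC or ET_DYN),
   or -1 if the file cannot be read or does not start with the ELF magic.
   Assumption: the file uses the same byte order as the host. */
int elf_type_of(const char *path)
{
    unsigned char header[EI_NIDENT + sizeof(uint16_t)];
    uint16_t e_type;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    size_t got = fread(header, 1, sizeof header, fp);
    fclose(fp);
    if (got != sizeof header || memcmp(header, ELFMAG, SELFMAG) != 0)
        return -1;

    /* e_type immediately follows e_ident in both ELF32 and ELF64 headers. */
    memcpy(&e_type, header + EI_NIDENT, sizeof e_type);
    return e_type;
}

/* Hypothetical helper: 1 if the file is a fixed-address executable. */
int is_et_exec(const char *path)
{
    return elf_type_of(path) == ET_EXEC;
}
```

A caller could skip dlopen when is_et_exec(path) returns 1. Position-independent executables report ET_DYN, the same type as shared libraries, so this check alone does not tell the two apart.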

Description Antonio Di Monaco 2019-08-12 07:27:29 UTC
Description of problem:
A program that repeatedly calls dlopen() to load shared libraries created dynamically on demand can crash after a certain number of dlopen() calls.

Version-Release number of selected component (if applicable):
glibc 2.17

How reproducible:
See test program and instructions in https://sourceware.org/bugzilla/show_bug.cgi?id=16634

Steps to Reproduce:
1. Create and compile the test program as per https://sourceware.org/bugzilla/show_bug.cgi?id=16634
2. Run the program:
# ./a.out
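
For readers without the upstream bug at hand, the following is a rough sketch in the spirit of that reproducer, not a copy of it. It assumes the pattern described in this report: the program tries to dlopen its own executable, then starts a thread that touches a thread-local variable. The function name run_self_dlopen_loop, the REPRO_MAIN guard, and the use of /proc/self/exe as the path are assumptions made for illustration.

```c
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>

/* Thread-local variable: every new thread needs its TLS block set up. */
static __thread int x;

static void *print_tls_address(void *arg)
{
    printf("%d: &x = %p\n", *(int *)arg, (void *)&x);
    return NULL;
}

/* Try to dlopen self_path, then create and join one thread, repeatedly.
   Returns the number of iterations completed without pthread_create
   failing. */
int run_self_dlopen_loop(const char *self_path, int iterations)
{
    for (int i = 0; i < iterations; i++) {
        /* Loading an executable is expected to fail; the interesting part
           is the state the loader is left in afterwards. */
        void *handle = dlopen(self_path, RTLD_LAZY);
        if (handle != NULL)
            dlclose(handle);

        pthread_t thread;
        if (pthread_create(&thread, NULL, print_tls_address, &i) != 0)
            return i;
        pthread_join(thread, NULL);
    }
    return iterations;
}

#ifdef REPRO_MAIN
int main(void)
{
    /* /proc/self/exe is a Linux-specific way to name the running binary. */
    return run_self_dlopen_loop("/proc/self/exe", 100) == 100 ? 0 : 1;
}
#endif
```

One way to build it as a standalone program is gcc -DREPRO_MAIN -pthread repro.c -ldl, where repro.c is a placeholder file name. On an affected glibc the process is expected to crash partway through rather than return a short count.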


Actual results:
The test program aborts with:
...
61: &x = 0x7feaa003873c
62: &x = 0x7feaa003873c
63: &x = 0x7feaa003873c
Segmentation fault (core dumped)

Expected results:
The program should keep running for as long as it makes dlopen() calls: it should either end after a fixed number of calls (100 in the test program) or run indefinitely, at most reporting dlopen() failures once resources are exhausted.
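
A failed dlopen() call should be something the application can detect and report. The following minimal sketch shows one conventional way to do that with dlerror(); the wrapper name load_or_report is hypothetical.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Try to load a shared library. On failure, print the dynamic loader's
   message to stderr and return NULL so the caller can carry on. */
void *load_or_report(const char *path)
{
    dlerror();  /* clear any stale error state */
    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        const char *msg = dlerror();
        fprintf(stderr, "dlopen(%s) failed: %s\n",
                path, msg != NULL ? msg : "unknown error");
    }
    return handle;
}
```

With this bug present, reporting the error is not enough: the process can still crash in a later pthread_create call even though every dlopen() failure was handled.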

Additional info:
This issue is likely related to https://sourceware.org/bugzilla/show_bug.cgi?id=16634, "Application calling dlopen("./a.out",...) may run into _dl_allocate_tls_init: Assertion `listp != ((void *)0)' failed!", which is fixed in glibc 2.20.

RHEL 7 comes with glibc 2.17, and there is currently no open Red Hat bug for this upstream bug.

The upstream bug https://sourceware.org/bugzilla/show_bug.cgi?id=16634 mentions that the application fails after 64 iterations with "Assertion ... failed!" but the same program on RHEL 7 just dumps core after 64 iterations.
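
The summary of this bug says that dl_tls_max_dtv_idx is incremented too early. The following toy model is only meant to show why that ordering matters. It is not glibc code: struct toy_loader and all function names are hypothetical, and the count of 64 in the usage note simply echoes the iteration count mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for loader bookkeeping (not the real glibc structures). */
struct toy_loader {
    size_t max_idx;  /* highest module index handed out so far */
    size_t live;     /* modules that were actually registered */
};

/* Problematic ordering: claim an index before the load is known to work. */
size_t toy_load_early(struct toy_loader *l, bool load_succeeds)
{
    size_t idx = ++l->max_idx;
    if (!load_succeeds)
        return 0;  /* the failure path never gives idx back */
    l->live++;
    return idx;
}

/* Safer ordering: reject the load first, claim an index afterwards. */
size_t toy_load_late(struct toy_loader *l, bool load_succeeds)
{
    if (!load_succeeds)
        return 0;
    l->live++;
    return ++l->max_idx;
}

/* Number of indices claimed but never backed by a registered module
   after the given number of failed loads. */
size_t leaked_indices(int failed_loads, bool early)
{
    struct toy_loader l = {0, 0};
    for (int i = 0; i < failed_loads; i++) {
        if (early)
            toy_load_early(&l, false);
        else
            toy_load_late(&l, false);
    }
    return l.max_idx - l.live;
}
```

In this model, leaked_indices(64, true) is 64 while leaked_indices(64, false) is 0: with the early increment, every failed load leaves the counter one step further ahead of the modules that actually exist.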

The upstream bug notes that the underlying problem is an application erroneously calling
dlopen("a.out", ...) repeatedly. In other words, a flawless application will never encounter
this bug. SAP HANA, however, performs on-demand code generation, which effectively produces
shared libraries that are then loaded and unloaded, potentially quite often. This behavior can
trigger the bug, leading to potentially lengthy error analysis and unnecessary downtime in production systems.
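
The load/unload pattern described here can be sketched as a loop that opens a library, calls into it, and closes it again. This is only an illustration of the call pattern, not SAP HANA code: libm.so.6 and its cos symbol stand in for a generated library, and the function name load_call_unload is hypothetical.

```c
#include <dlfcn.h>
#include <stdio.h>

typedef double (*unary_fn)(double);

/* Repeatedly load a library, call one function from it, and unload it.
   Returns the number of rounds that completed successfully.
   Assumption: libm.so.6 with its cos symbol stands in for a library
   generated on demand. */
int load_call_unload(int rounds)
{
    for (int i = 0; i < rounds; i++) {
        void *handle = dlopen("libm.so.6", RTLD_NOW | RTLD_LOCAL);
        if (handle == NULL) {
            const char *msg = dlerror();
            fprintf(stderr, "round %d: %s\n", i, msg != NULL ? msg : "unknown error");
            return i;
        }

        unary_fn fn = (unary_fn)dlsym(handle, "cos");
        int ok = fn != NULL && fn(0.0) == 1.0;  /* cos(0) is exactly 1 */

        dlclose(handle);
        if (!ok)
            return i;
    }
    return rounds;
}
```

Because libm.so.6 may already be mapped into the process, this mostly shows the sequence of calls. A freshly generated library with its own thread-local data would exercise the loader's TLS bookkeeping much harder.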

I could not reproduce the bug on RHEL 8 GA (glibc 2.28, release 42.el8_0.1): the program exits normally even after 10,000 iterations.

Comment 5 Florian Weimer 2019-08-12 09:43:24 UTC
A repository with a test build is available here:

https://people.redhat.com/~fweimer/IQlrkw5SoVmo/

The repository file for /etc/yum.repos.d is here:

https://people.redhat.com/~fweimer/IQlrkw5SoVmo/glibc-2.17-306.el7.fweimer.bz1740039.1.repo

Would you please verify that this build fixes the original problem?  Thanks.

Comment 6 Florian Weimer 2019-08-12 14:36:03 UTC
Building the upstream test requires some changes.  See bug 1740088.  I submitted a test generalization upstream:

https://sourceware.org/ml/libc-alpha/2019-08/msg00229.html

Comment 8 Gary Case 2019-08-22 14:25:08 UTC
FYI, this bug is being moved to be a 7.9 item as our development work on glibc for RHEL 7.8 has concluded. That being said, depending on when the feedback from SAP arrives we may still be able to include this as a 7.8 item.

Comment 10 Antonio Di Monaco 2019-09-19 08:11:58 UTC
Hi,

I confirm that the patched glibc fixes the issue.

Thanks,

BR,
Antonio

Comment 12 Carlos O'Donell 2019-10-07 05:07:19 UTC
Antonio,

Testing by Red Hat has revealed that the upstream patch to fix this issue is incomplete.

Further changes to the dynamic loader were required to fix the thread-local storage issues seen in your test case scenario.

We want to have the upstream change go through enough operational hours to show that it doesn't have any further impact on dlopen, arriving signals during dlopen, etc. This means that we will not immediately be backporting these changes into a z-stream release.

As we understand it these issues impact only SAP validation, but not customer deployments. Given the limited impact on customers we want to ensure that the riskier but correct fix does not impact our joint customers. Again, we will be doing upstream testing before we do downstream deployment of the fix in RHEL.

We will work to provide a new test fix for SAP that includes what we believe is a more complete set of fixes. Please be patient while we get this ready for testing. Florian will be working on delivering the test fix for SAP.

Comment 13 Florian Weimer 2019-10-16 15:14:11 UTC
I have posted yet another upstream test fix:

  https://sourceware.org/ml/libc-alpha/2019-10/msg00491.html

This splits the TLS modid tests (which we want to backport) from the self-dlopen tests (which are not needed).

Comment 17 Sergey Kolosov 2019-11-08 13:46:34 UTC
Verified with the reproducer and the glibc testsuite test elf/tst-dlopen-tlsmodid.

Comment 18 Carlos O'Donell 2020-01-10 14:53:07 UTC
*** Bug 1670620 has been marked as a duplicate of this bug. ***

Comment 20 errata-xmlrpc 2020-03-31 19:08:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0989