Bug 2119304

Summary: glibc: Upgrading to glibc-2.28-209.el8.x86_64 causes segfaults during concurrent process launch
Product: Red Hat Enterprise Linux 8 Reporter: Ben Morrice <ben.morrice>
Component: glibc Assignee: Florian Weimer <fweimer>
Status: CLOSED ERRATA QA Contact: Martin Coufal <mcoufal>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: CentOS Stream CC: alex.iribarren, ashankar, bstinson, codonell, daniel.vanderster, davide, dj, fweimer, jwboyer, kpfleming, mcoufal, mnewsome, pfrankli, sipoyare, skolosov
Target Milestone: rc Keywords: Bugfix, Patch, Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glibc-2.28-211.el8 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2121536 (view as bug list) Environment:
Last Closed: 2022-11-08 10:43:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2121536    
Deadline: 2022-08-29   

Description Ben Morrice 2022-08-18 09:14:33 UTC
Description of problem:

Version-Release number of selected component (if applicable):

glibc-2.28-209.el8.x86_64

How reproducible:

Easily / every time

Steps to Reproduce:

Have glibc-2.28-208.el8.x86_64 (or lower) installed
Run a simple script such as

#!/bin/bash
while true; do
  /bin/true
  sleep 0.05
done

Upgrade glibc to glibc-2.28-209.el8.x86_64

While the upgrade is running, the commands launched by the script ('/bin/true' and 'sleep') will seg fault
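A bounded variant of the reproducer makes it easier to count failed launches during the upgrade window. This is only a sketch: the helper names describe_status and probe_launches are made up for illustration. It relies on the usual shell convention that an exit status above 128 means the child was killed by signal (status - 128), so a segfault shows up as 139.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original report.

# Translate a shell exit status into a short description.
# Statuses above 128 mean the child was killed by signal (status - 128).
describe_status() {
    if [ "$1" -eq 0 ]; then
        echo "ok"
    elif [ "$1" -gt 128 ]; then
        echo "killed by signal $(($1 - 128))"
    else
        echo "exit status $1"
    fi
}

# Launch a command COUNT times, log each failure to stderr,
# and print the number of failed launches on stdout.
probe_launches() {
    count=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$count" ]; do
        status=0
        "$@" >/dev/null 2>&1 || status=$?
        if [ "$status" -ne 0 ]; then
            failures=$((failures + 1))
            echo "launch $i: $(describe_status "$status")" >&2
        fi
        i=$((i + 1))
        sleep 0.05
    done
    echo "$failures"
}
```

For example, running `probe_launches 200 /bin/true` in one terminal while the glibc upgrade runs in another would print how many of the 200 launches failed.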

Example dmesg output:

[ 2264.221694] show_signal: 10 callbacks suppressed
[ 2264.221709] traps: true[36840] general protection fault ip:7f36ffe01d83 sp:7ffe42ce05e0 error:0 in libc-2.28.so[7f36ffdc7000+1bc000]
[ 2264.230469] traps: systemd-coredum[36841] general protection fault ip:7f2b6dabfd83 sp:7fffdd63f360 error:0 in libc-2.28.so[7f2b6da85000+1bc000]
[ 2264.230501] Process 36841(systemd-coredum) has RLIMIT_CORE set to 1
[ 2264.230503] Aborting core
[ 2264.232424] traps: sleep[36842] general protection fault ip:7fcc20367d83 sp:7fff0edec260 error:0 in libc-2.28.so[7fcc2032d000+1bc000]
[ 2264.238065] traps: systemd-coredum[36843] general protection fault ip:7f3235f57d83 sp:7ffcb67e90c0 error:0 in libc-2.28.so[7f3235f1d000+1bc000]
[ 2264.238095] Process 36843(systemd-coredum) has RLIMIT_CORE set to 1
[ 2264.238097] Aborting core
[ 2264.239801] traps: true[36844] general protection fault ip:7f95eec16d83 sp:7ffc2267d190 error:0 in libc-2.28.so[7f95eebdc000+1bc000]
[ 2264.244854] traps: systemd-coredum[36845] general protection fault ip:7f1517fe3d83 sp:7ffd75dbf3d0 error:0 in libc-2.28.so[7f1517fa9000+1bc000]
[ 2264.244876] Process 36845(systemd-coredum) has RLIMIT_CORE set to 1
[ 2264.244878] Aborting core
[ 2264.246514] traps: sleep[36846] general protection fault ip:7fa2cf43bd83 sp:7fff41aee840 error:0 in libc-2.28.so[7fa2cf401000+1bc000]
[ 2264.251660] traps: systemd-coredum[36847] general protection fault ip:7f36951c4d83 sp:7ffefb45ebf0 error:0 in libc-2.28.so[7f369518a000+1bc000]
[ 2264.251684] Process 36847(systemd-coredum) has RLIMIT_CORE set to 1
[ 2264.251685] Aborting core
[ 2264.253329] traps: true[36848] general protection fault ip:7fdd93ae0d83 sp:7fffeab980d0 error:0 in libc-2.28.so[7fdd93aa6000+1bc000]
[ 2264.258431] traps: systemd-coredum[36849] general protection fault ip:7f950145ed83 sp:7ffef6c5b960 error:0 in libc-2.28.so[7f9501424000+1bc000]
[ 2264.258455] Process 36849(systemd-coredum) has RLIMIT_CORE set to 1
[ 2264.258456] Aborting core
[ 2264.265079] Process 36851(systemd-coredum) has RLIMIT_CORE set to 1
[ 2264.265082] Aborting core
[ 2264.271712] Process 36853(systemd-coredum) has RLIMIT_CORE set to 1
[ 2264.271715] Aborting core
[ 2264.278745] Process 36855(systemd-coredum) has RLIMIT_CORE set to 1

Actual results:

processes seg fault

Expected results:

processes should not seg fault

Additional info:

The script referred to above is just an example.
We are seeing this behaviour across a wide range of processes whilst glibc is upgraded on many systems.

Comment 1 Florian Weimer 2022-08-18 11:43:50 UTC
The new ld.so cannot load the old libc. There is a brief time window when RPM has already renamed the new ld.so in place, but the file system still has the old libc.so (actual file names differ). This is not a new issue; we have tickled the same bug during the lifetime of Red Hat Enterprise Linux 8 as we implemented other dynamic loader enhancements.
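The window can be illustrated with a toy model in a scratch directory. This is a sketch, not what RPM actually does: the file names ld.so and libc.so and the "build-208"/"build-209" tags are placeholders (the real file names differ), and the helper names are made up. It only shows that two individually atomic renames still leave a moment where the pair is inconsistent.

```shell
# Toy model: two files that must match, replaced one rename at a time.
# All names and tags here are hypothetical placeholders.

# Print "consistent" if both files carry the same build tag.
check_pair() {
    if [ "$(cat "$1/ld.so")" = "$(cat "$1/libc.so")" ]; then
        echo "consistent"
    else
        echo "mismatch"
    fi
}

simulate_update() {
    dir=$(mktemp -d) || return 1
    echo "build-208" > "$dir/ld.so"
    echo "build-208" > "$dir/libc.so"
    echo "before: $(check_pair "$dir")"

    # Step 1: the new loader is renamed into place.
    echo "build-209" > "$dir/ld.so.new"
    mv "$dir/ld.so.new" "$dir/ld.so"
    # A process starting now sees the new ld.so with the old libc.so.
    echo "during: $(check_pair "$dir")"

    # Step 2: the new libc is renamed into place.
    echo "build-209" > "$dir/libc.so.new"
    mv "$dir/libc.so.new" "$dir/libc.so"
    echo "after: $(check_pair "$dir")"

    rm -rf "$dir"
}
```

Calling `simulate_update` prints the state before, during, and after the two renames; the middle line corresponds to the window in which new processes crash.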

I'm not sure what we can do about this.

Comment 2 Florian Weimer 2022-08-19 11:26:06 UTC
I posted some upstream patches to detect this situation and avoid the coredump and print a clear error message (“Fatal glibc error: ld.so/libc.so mismatch detected”):

[PATCH 0/2] Check ld.so/libc.so consistency during startup
<https://sourceware.org/pipermail/libc-alpha/2022-August/141525.html>

During the update, processes will still fail to start (that's going to be much harder to fix because upstream needs to be sufficiently distribution-agnostic), but the failures do not result in coredumps, so the secondary crashes from systemd-coredump are gone.
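If a build prints that message, callers that launch short-lived helpers during an update could retry on it instead of treating it as a hard failure. The following is a minimal sketch under that assumption. The function name retry_on_mismatch and the RETRY_DELAY variable are hypothetical, and the only thing taken from this report is the message text itself. Whether a given glibc build actually emits this exact text is not confirmed here.

```shell
# Message text as quoted in this comment; everything else is illustrative.
MISMATCH_MSG='Fatal glibc error: ld.so/libc.so mismatch detected'
RETRY_DELAY=${RETRY_DELAY:-0.2}

# retry_on_mismatch ATTEMPTS COMMAND [ARGS...]
# Re-run COMMAND only when it fails and its stderr contains MISMATCH_MSG.
# Any other failure is returned to the caller unchanged.
retry_on_mismatch() {
    attempts=$1
    shift
    errfile=$(mktemp) || return 1
    attempt=1
    while :; do
        status=0
        "$@" 2>"$errfile" || status=$?
        if [ "$status" -ne 0 ] && [ "$attempt" -lt "$attempts" ] \
            && grep -qF "$MISMATCH_MSG" "$errfile"; then
            attempt=$((attempt + 1))
            sleep "$RETRY_DELAY"
            continue
        fi
        # Final attempt (or unrelated failure): pass stderr through.
        cat "$errfile" >&2
        rm -f "$errfile"
        return "$status"
    done
}
```

Matching on the message rather than on the exit status keeps unrelated failures from being retried, at the cost of depending on the exact wording.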

Comment 3 Florian Weimer 2022-08-24 11:36:29 UTC
The upstream change works to mask the segmentation fault in the reproducer (and a couple of related ones) with our 2.28-based glibc, but we have additional ABI exposure due to our still-separate libpthread and libdl downstream (merged into libc upstream in 2.34). I'm inclined to move forward with the coredump suppression logic we can implement today, although it is incomplete.

Comment 10 Florian Weimer 2022-08-26 12:21:48 UTC
Here is what we are changing:

Updates from -208 and earlier to -211 on aarch64 and x86_64 are expected not to cause any disruptions (even in the presence of concurrent process creation) because we have taken steps to minimize the internal ABI impact of the LD_AUDIT changes (internal bug 2047981). On ppc64le POWER9, an error message “undefined symbol: _dl_audit_symbind_alt, version GLIBC_PRIVATE” might be printed due to a different shared object upgrade order on this platform.

For s390x, crash-free upgrades are possible from version -202 and earlier. This is because bug 2077835 already introduced a private ABI change.

Because we have reverted the internal ABI changes, upgrades from -209 and later to -211 (-203 for s390x) will likely crash once more if processes are created concurrently with an update. Downgrades are negatively impacted as well. There is really no good way to avoid either set of issues. We investigated version fingerprinting, but the impact on static dlopen was eventually deemed too high.
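The upgrade matrix above can be summarized as a small lookup. This is a sketch of one reading of the three preceding paragraphs, not an official tool: the function name upgrade_risk is hypothetical, the release argument is the numeric part of the glibc release (for example 208 for glibc-2.28-208.el8), and treating ppc64le like x86_64 apart from the possible error message is an interpretation.

```shell
# upgrade_risk ARCH RELEASE
# Prints the expected outcome of upgrading glibc-2.28-RELEASE.el8 to -211
# while processes are being created concurrently.
upgrade_risk() {
    arch=$1
    release=$2
    if [ "$release" -ge 211 ]; then
        echo "already at or past -211"
        return 0
    fi
    case $arch in
        s390x) last_safe=202 ;;                     # private ABI changed in -203
        aarch64|x86_64|ppc64le) last_safe=208 ;;    # private ABI changed in -209
        *) echo "unknown architecture"; return 0 ;;
    esac
    if [ "$release" -gt "$last_safe" ]; then
        echo "likely to crash"
    elif [ "$arch" = ppc64le ]; then
        echo "undefined symbol error may be printed"
    else
        echo "no disruption expected"
    fi
}
```

On an installed system the release number could be obtained with something like `rpm -q --qf '%{RELEASE}\n' glibc | head -n 1 | cut -d. -f1`, though that pipeline is only a suggestion and has not been checked against multilib installs.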

Please note that this ABI inconsistency is restricted to the internal glibc ABI, not the public ABI. The issue materializes if a process loads different components of glibc at different times. (This can happen because glibc is split across several files internally.) Typically, this is triggered if a process launches concurrently with a glibc update or downgrade, but it is also possible that issues arise if a long-running component loads parts of glibc later (say indirectly via dlopen).
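For the long-running case, one way to find candidate processes after an update is to look for memory maps that still reference a replaced glibc file, which Linux marks with a trailing "(deleted)" in /proc/PID/maps. The sketch below assumes that convention and the glibc 2.28 component names (libc, ld, libpthread, libdl). The function name stale_glibc_pids is hypothetical, and the optional argument exists only so the function can be pointed at a test directory instead of /proc.

```shell
# stale_glibc_pids [PROC_ROOT]
# Print the PIDs whose maps file references a deleted glibc component.
# Processes whose maps cannot be read (other users, exited) are skipped.
stale_glibc_pids() {
    proc_root=${1:-/proc}
    for maps in "$proc_root"/[0-9]*/maps; do
        [ -r "$maps" ] || continue
        if grep -Eq '/(libc|ld|libpthread|libdl)[-.][^/]* \(deleted\)$' \
            "$maps" 2>/dev/null; then
            pid=${maps%/maps}
            echo "${pid##*/}"
        fi
    done
}
```

A process listed here is still running the old glibc; if it later loads another glibc component from disk, it would mix old and new internal ABIs, so restarting it after the update is the conservative choice.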

Comment 16 Alex Iribarren 2022-09-13 11:31:13 UTC
Any idea when this update will be released?

Comment 18 Alex Iribarren 2022-09-14 08:25:33 UTC
And of course it was released just after my post... :) Sorry for the noise.

Comment 19 Florian Weimer 2022-09-14 08:31:46 UTC
(In reply to Alex Iribarren from comment #18)
> And of course it was released just after my post... :) Sorry for the noise.

No worries, I should have said here that the update was under way when I received word that it was.

Please let us know if there are remaining issues with updates on live systems. I think we addressed the main source of crashes, but delayed loading of libpthread with glibc 2.28 remains tricky.

Comment 21 errata-xmlrpc 2022-11-08 10:43:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (glibc bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:7684