Bug 1889892

Summary: glibc: pthread_cond_wait missed wakeup (swbz#25847)
Product: [Fedora] Fedora
Reporter: Michael Bacarella <michael.bacarella>
Component: glibc
Assignee: Carlos O'Donell <codonell>
Status: ASSIGNED
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 39
CC: aoliva, ashankar, codonell, dj, edwin+bugs, fweimer, law, mfabian, pfrankli, rth, schwerin, sipoyare
Target Milestone: ---
Keywords: Bugfix
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1979990
Environment:
Last Closed:
Type: Bug
Bug Depends On:    
Bug Blocks: 1979990    
Attachments:
Description                                        Flags
test case repro from sourceware entry              none
one-line patch to glibc that fixes the deadlock    none
testcase with abort() on stuck                     none

Description Michael Bacarella 2020-10-20 20:30:15 UTC
Description of problem:

This bug was submitted by Qin Li to glibc bugzilla earlier this year, with a one-line patch, though it hasn't been merged into glibc yet:

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

Version-Release number of selected component: glibc-2.27 onwards

How reproducible: reliably, try the repro from the sourceware url above

Actual results: deadlocks after 30-120 minutes on a 4-core Fedora 32 box

Expected results: should never deadlock

Additional info:

This bug in pthread conditions will deadlock the OCaml runtime, as well as Python and .NET applications.

The bug was introduced in glibc 2.27 and is still present in glibc 2.31.
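For anyone trying to tell whether a given machine falls in the version range described here, the following is a minimal sketch. It uses the glibc-specific gnu_get_libc_version() call; the helper names glibc_in_reported_range and report_glibc_version are made up for illustration. It only compares against 2.27, the first affected release according to this report, and cannot tell whether a distribution build carries a backported fix.

```c
#include <gnu/libc-version.h>  /* glibc-specific header */
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: true if a "major.minor[...]" version string is at
   or above 2.27, the first affected release according to this report. */
bool glibc_in_reported_range(const char *version)
{
    int major = 0, minor = 0;

    if (sscanf(version, "%d.%d", &major, &minor) != 2)
        return false;  /* unparseable string: make no claim */
    return major > 2 || (major == 2 && minor >= 27);
}

/* Hypothetical helper: prints the running glibc version and whether it
   falls in that range. A patched distribution build looks the same here. */
void report_glibc_version(void)
{
    const char *v = gnu_get_libc_version();

    printf("glibc %s: %s\n", v,
           glibc_in_reported_range(v) ? "2.27 or later (reported affected)"
                                      : "older than 2.27");
}
```

Calling report_glibc_version() from main() prints one line for the C library the process is actually running against, which may differ from the headers it was built with.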

I confirm that the repro from the above URL deadlocks on Fedora 32. It takes about 30-180 minutes on a 4-core server.

I further confirm that the one-line fix to glibc from the above URL applies cleanly to Fedora 32's glibc source rpm, and that the patched glibc does not deadlock after running the repro for more than 30 hours.

Please kindly consider merging the one-line fix into Fedora glibc.

More background about this bug, for the sake of future internet searchers:
* https://discuss.ocaml.org/t/is-there-a-known-recent-linux-locking-bug-that-affects-the-ocaml-runtime

Comment 1 Michael Bacarella 2020-10-20 20:34:52 UTC
Created attachment 1722977
test case repro from sourceware entry

will deadlock

Comment 2 Michael Bacarella 2020-10-20 20:35:47 UTC
Created attachment 1722978
one-line patch to glibc that fixes the deadlock

Comment 3 Carlos O'Donell 2020-10-27 13:21:56 UTC
We are looking to fix this for Fedora and Red Hat Enterprise Linux 8, as it affects users on both platforms.

Comment 4 Török Edwin 2020-11-01 17:59:31 UTC
Created attachment 1725573
testcase with abort() on stuck

Small modification to upstream testcase that abort()s when the loop is stuck for several iterations.
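The idea of aborting once the loop has been stuck for several iterations can be sketched roughly as below. This is not the attached test case: the names stall_detector, stall_check, stall_check_or_abort and the STALL_LIMIT threshold are all assumptions made for illustration. The sketch assumes the test loop bumps a progress counter on every successful round, and that a watchdog samples that counter at a fixed interval.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed threshold: number of consecutive samples without progress
   before the run is declared stuck. */
#define STALL_LIMIT 3

struct stall_detector {
    uint64_t last_seen;  /* progress counter value at the previous sample */
    unsigned stalled;    /* consecutive samples with no change */
};

/* Returns true once the counter has stayed unchanged for STALL_LIMIT
   consecutive samples; any change resets the count. */
bool stall_check(struct stall_detector *d, uint64_t progress)
{
    if (progress != d->last_seen) {
        d->last_seen = progress;
        d->stalled = 0;
        return false;
    }
    return ++d->stalled >= STALL_LIMIT;
}

/* abort() raises SIGABRT, so a core dump (if enabled) captures every
   thread while the process is still in the stuck state. */
void stall_check_or_abort(struct stall_detector *d, uint64_t progress)
{
    if (stall_check(d, progress))
        abort();
}
```

A watchdog thread would call stall_check_or_abort() once per sampling interval with the current counter value. Aborting instead of hanging turns a silent deadlock into a failure that a script can detect and a debugger can inspect.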

Comment 6 Fedora Program Management 2021-04-29 17:06:51 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 7 Carlos O'Donell 2021-04-29 20:14:38 UTC
Still a bug, and still in Rawhide.

Comment 8 Ben Cotton 2021-08-10 13:42:25 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 35 development cycle.
Changing version to 35.

Comment 9 Ben Cotton 2022-02-08 21:45:17 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 36 development cycle.
Changing version to 36.

Comment 10 Ben Cotton 2023-04-25 16:40:44 UTC
This message is a reminder that Fedora Linux 36 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 36 on 2023-05-16.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '36'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 36 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 11 Fedora Release Engineering 2023-08-16 08:08:28 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 39 development cycle.
Changing version to 39.