Bug 1889892 - glibc: pthread_cond_wait missed wakeup (swbz#25847)
Summary: glibc: pthread_cond_wait missed wakeup (swbz#25847)
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: glibc
Version: 32
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Carlos O'Donell
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-20 20:30 UTC by Michael Bacarella
Modified: 2020-11-10 14:25 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug


Attachments
test case repro from sourceware entry (7.67 KB, text/x-csrc)
2020-10-20 20:34 UTC, Michael Bacarella
one-line patch to glibc that fixes the deadlock (863 bytes, patch)
2020-10-20 20:35 UTC, Michael Bacarella
testcase with abort() on stuck (7.84 KB, text/x-csrc)
2020-11-01 17:59 UTC, Török Edwin


Links
System ID Priority Status Summary Last Updated
Sourceware 25847 P2 UNCONFIRMED pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing 2020-11-17 14:16:57 UTC

Description Michael Bacarella 2020-10-20 20:30:15 UTC
Description of problem:

This bug was submitted by Qin Li to glibc bugzilla earlier this year, with a one-line patch, though it hasn't been merged into glibc yet:

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

Version-Release number of selected component: glibc-2.27 onwards

How reproducible: Reliably; try the repro from the Sourceware URL above.

Actual results: deadlocks after 30-120 minutes on a 4-core Fedora 32 box

Expected results: should never deadlock

Additional info:

This bug in pthread conditions will deadlock the OCaml runtime, as well as Python and .NET applications.

The bug was introduced in glibc 2.27 and is still present in glibc 2.31.

I confirm that the repro from the above URL deadlocks on Fedora 32. It takes about 30-180 minutes on a 4-core server.

I further confirm that the one-line fix to glibc from the above URL applies cleanly to Fedora 32's glibc source RPM, and that the patched glibc does not deadlock after running the repro for more than 30 hours.

Please kindly consider merging the one-line fix into Fedora glibc.

More background about this bug, for the sake of future internet searchers:
* https://discuss.ocaml.org/t/is-there-a-known-recent-linux-locking-bug-that-affects-the-ocaml-runtime
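
For readers arriving from a search: the pattern at risk is the ordinary predicate loop around pthread_cond_wait. The sketch below is not the attached repro and will not trigger the bug on its own; it only illustrates the kind of handoff that hangs if a pthread_cond_signal wakeup is lost, because the waiter then stays blocked even though the predicate is already true. The names handoff, handoff_wait, handoff_signal, producer and run_handoff_once are made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-slot handoff guarded by a mutex and a condition variable. */
struct handoff {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            ready;   /* predicate protected by lock */
};

/* Block until ready is set, then consume it. The while loop handles
 * spurious wakeups, but it cannot help if the wakeup never arrives. */
static void handoff_wait(struct handoff *h)
{
    pthread_mutex_lock(&h->lock);
    while (!h->ready)
        pthread_cond_wait(&h->cond, &h->lock);
    h->ready = false;
    pthread_mutex_unlock(&h->lock);
}

/* Set the predicate under the lock and wake one waiter. */
static void handoff_signal(struct handoff *h)
{
    pthread_mutex_lock(&h->lock);
    h->ready = true;
    pthread_cond_signal(&h->cond);
    pthread_mutex_unlock(&h->lock);
}

/* Thread entry point: performs a single signal. */
static void *producer(void *arg)
{
    handoff_signal(arg);
    return NULL;
}

/* Run one producer/consumer handoff; returns 0 on success, -1 on setup failure. */
static int run_handoff_once(void)
{
    struct handoff h;
    pthread_t tid;

    if (pthread_mutex_init(&h.lock, NULL) != 0)
        return -1;
    if (pthread_cond_init(&h.cond, NULL) != 0) {
        pthread_mutex_destroy(&h.lock);
        return -1;
    }
    h.ready = false;

    if (pthread_create(&tid, NULL, producer, &h) != 0) {
        pthread_cond_destroy(&h.cond);
        pthread_mutex_destroy(&h.lock);
        return -1;
    }

    handoff_wait(&h);          /* would block forever if the wakeup were lost */
    pthread_join(tid, NULL);
    assert(!h.ready);          /* the waiter consumed the handoff */

    pthread_cond_destroy(&h.cond);
    pthread_mutex_destroy(&h.lock);
    return 0;
}
```

Calling run_handoff_once() from a program's main (built with -pthread) performs a single handoff. The repro attached to the Sourceware entry is what actually exercises the race; this sketch does not.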

Comment 1 Michael Bacarella 2020-10-20 20:34:52 UTC
Created attachment 1722977
test case repro from sourceware entry

will deadlock

Comment 2 Michael Bacarella 2020-10-20 20:35:47 UTC
Created attachment 1722978
one-line patch to glibc that fixes the deadlock

Comment 3 Carlos O'Donell 2020-10-27 13:21:56 UTC
We are looking to fix this for Fedora and Red Hat Enterprise Linux 8, as it affects users on both platforms.

Comment 4 Török Edwin 2020-11-01 17:59:31 UTC
Created attachment 1725573
testcase with abort() on stuck

Small modification to upstream testcase that abort()s when the loop is stuck for several iterations.
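
The idea of aborting when the loop is stuck can be sketched roughly as follows. This is not the code from the attachment; it is a minimal illustration that assumes the test loop exposes a progress counter which a checker samples periodically. The names stall_detector, stall_detector_init, stall_detector_check and stall_detector_check_or_abort are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stall detector: remembers the last progress value seen and
 * counts how many consecutive checks observed no change. */
struct stall_detector {
    unsigned long last_seen;      /* progress value at the previous check */
    unsigned      stalled_checks; /* consecutive checks without progress */
    unsigned      limit;          /* checks without progress before reporting */
};

static void stall_detector_init(struct stall_detector *d, unsigned limit)
{
    assert(limit > 0);
    d->last_seen = 0;
    d->stalled_checks = 0;
    d->limit = limit;
}

/* Returns true once progress has been unchanged for limit consecutive
 * checks; any change in progress resets the count. */
static bool stall_detector_check(struct stall_detector *d, unsigned long progress)
{
    if (progress != d->last_seen) {
        d->last_seen = progress;
        d->stalled_checks = 0;
        return false;
    }
    d->stalled_checks++;
    return d->stalled_checks >= d->limit;
}

/* Same check, but abort() so that a core dump captures the stuck threads. */
static void stall_detector_check_or_abort(struct stall_detector *d,
                                          unsigned long progress)
{
    if (stall_detector_check(d, progress))
        abort();
}
```

Aborting rather than hanging leaves a core file that can be inspected to see where the waiting threads are blocked.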

