Bug 759341

Summary: strange pthread/fork deadlock
Product: Red Hat Enterprise Linux 5 Reporter: Shane Carr <scarr6>
Component: glibc Assignee: Jeff Law <law>
Status: CLOSED ERRATA QA Contact: qe-baseos-tools-bugs
Severity: high Docs Contact:
Priority: high    
Version: 5.2 CC: fweimer, law, mfranc, mnewsome, pmuller, scarr6
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Same as RHEL 6.2 BZ 738665.
Story Points: ---
Clone Of: 738665 Environment:
Last Closed: 2013-01-08 03:45:35 UTC Type: ---

Description Shane Carr 2011-12-02 02:44:14 UTC
+++ This bug was initially created as a clone of Bug #738665 +++

+++ This bug was initially created as a clone of Bug #737387 +++

Created attachment 522617
test case demonstrating the issue

There appears to be a strange bug in glibc that causes deadlocks when calling fork() from threads. We had a testcase in GLib failing from time to time because of this.

I've attached a minimal testcase that uses only pure pthreads + libc. Compile it with -pthread and run it. It should fill your screen with dots for a while, then hang when it hits the bug (which happens randomly anywhere between 1 dot and hundreds). I've already received independent verification that this testcase hangs on several people's computers.

I believe this to be an upstream issue since this bug is visible on Ubuntu as well, but the glibc website says I should file bugs against distributions first. I also believe the issue to be a regression since older Fedora and RHEL releases are unaffected. The problem appears to affect both 32-bit and 64-bit systems.

Some notes:

 - compiling the testcase with -static has the side-effect of causing the
   bug to go away

 - compiling the testcase with -DFORK_DIRECTLY also appears to solve the
   problem

 - replacing the execv() with a direct exit(0) doesn't solve the problem
   but causes the frequency to change

The fact that both static linking and making the fork() syscall directly cause the problem to disappear leads me to believe that this is a libc bug rather than a kernel bug (which is the only other possibility). I'm not 100% sure of that, though, since libc actually uses the clone() syscall to implement fork(), so there could be a difference inside the kernel because of that.

--- Additional comment from scarr6 on 2011-12-01 21:25:13 EST ---

We are hitting this bug regularly on RHEL5 (glibc 2.5). I have confirmed that the attached test case hangs on multiple machines.

I have installed the latest glibc RPM, and the test case still hangs.

Is there any chance of having the fix integrated into RHEL5?

Comment 2 RHEL Program Management 2012-04-02 13:09:43 UTC
This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux release.  Product Management has
requested further review of this request by Red Hat Engineering, for
potential inclusion in a Red Hat Enterprise Linux release for currently
deployed products.  This request is not yet committed for inclusion in
a release.

Comment 5 Jeff Law 2012-06-12 17:07:08 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Same as RHEL 6.2 BZ 738665.

Comment 7 errata-xmlrpc 2013-01-08 03:45:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0022.html