Bug 552960
| Summary: | Possible deadlock in pthread_mutex_lock/pthread_cond_wait | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Steve Holland <sh1> |
| Component: | glibc | Assignee: | Siddhesh Poyarekar <spoyarek> |
| Status: | CLOSED ERRATA | QA Contact: | Arjun Shankar <ashankar> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.0 | CC: | ashankar, bhu, bugproxy, codonell, dbasant, fhrbata, fweimer, jakub, jburke, jkachuck, jwest, kk_konrad, law, mfranc, mnewsome, mvpel, pmuller, pparsons, scottt.tw, self, spoyarek, steve.mcgovern, tobias, williams |
| Target Milestone: | rc | Keywords: | Patch, Reopened |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Prior to this update, there were multiple synchronization bugs in pthread_cond_wait and pthread_cond_timedwait on x86_64 and i686, such that when a multithreaded program used a priority-inherited mutex to synchronize access to a condition variable, some threads could deadlock when woken using pthread_cond_signal or when cancelled. This update fixes all such known problems related to condition variable synchronization. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-11-21 10:38:17 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 782183 | | |
| Attachments: | | | |
Description
Steve Holland
2010-01-06 16:43:12 UTC
Please provide a complete test case.

Working on it... As this is a heisenbug, a simple test case may not be possible. The hardware on which this failed isn't mine... it belongs to one of my research partners.
Do you know if a nonzero __lock field with a zero owner is a legitimate mutex state?
That looked very suspicious to me.
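For context on what that state looks like: glibc's pthread_mutex_t keeps the futex word and the owner TID in internal fields (__data.__lock and __data.__owner). The following is a minimal, non-portable debugging sketch (written for this report, relying on glibc's internal layout rather than any public API) of how one would dump them:

```c
#include <pthread.h>
#include <stdio.h>

/* Debug aid only: peeks at glibc-internal fields of pthread_mutex_t.
 * Not portable; the __data layout is an implementation detail. */
static void dump_mutex_state(pthread_mutex_t *m)
{
    printf("__lock = %d, __owner (TID) = %d\n",
           m->__data.__lock, m->__data.__owner);
}
```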
I'm pasting the relevant code below, which I will be using to assemble the test case.
Code snippets
-------------
Mutex creation:

```c
pthread_mutexattr_init(&md->MutexAttr);
#ifdef WFMMATH_DEBUG /* optional... this seems to make the problem happen less frequently */
pthread_mutexattr_settype(&md->MutexAttr, PTHREAD_MUTEX_ERRORCHECK);
#endif
pthread_mutexattr_setprotocol(&md->MutexAttr, PTHREAD_PRIO_INHERIT);
pthread_mutex_init(&md->WorkNotifyMutex, &md->MutexAttr);
pthread_cond_init(&md->WorkNotify, NULL);
```
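(It is the PTHREAD_PRIO_INHERIT protocol here that makes glibc implement pthread_cond_wait/pthread_cond_signal on this condition variable via the kernel's requeue-PI futex operations; the patch discussion later in this report centers on exactly that code path.)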
Thread creation:

```c
pthread_attr_init(&tattr);
pthread_attr_setscope(&tattr, PTHREAD_SCOPE_SYSTEM);
pthread_attr_setinheritsched(&tattr, PTHREAD_EXPLICIT_SCHED);
pthread_attr_setschedpolicy(&tattr, SCHED_OTHER);

/* SCHED_OTHER requires a sched_priority of 0 */
memset(&schedparam, 0, sizeof(schedparam));
schedparam.sched_priority = 0;
pthread_attr_setschedparam(&tattr, &schedparam);

for (Cnt = 0; Cnt < md->actual_threads; Cnt++) {
    err = pthread_create(&thr->Thread, &tattr, &CalcThreadCode, thr);
}
```
Queuing work:

```c
pthread_mutex_lock(&md->WorkNotifyMutex);
dgl_AddTail((struct dgl_List *)&md->PendingComputation,
            (struct dgl_Node *)calcfcn);  /* just a simple linked-list add */
pthread_cond_signal(&md->WorkNotify);
pthread_mutex_unlock(&md->WorkNotifyMutex);
```
Worker thread loop:

```c
pthread_mutex_lock(&md->WorkNotifyMutex);
for (;;) {
    todo = (struct MathFcn *)dgl_RemHead((struct dgl_List *)&md->PendingComputation); /* simple linked-list remove */
    if (todo) {
        pthread_mutex_unlock(&md->WorkNotifyMutex);
        todo->CalcFcn(md, todo);                  /* actual work done here */
        pthread_mutex_lock(&md->WorkNotifyMutex);
        dgl_AddHead((struct dgl_List *)&md->CompletedComputation,
                    (struct dgl_Node *)todo);     /* simple linked-list add */
        pthread_mutex_unlock(&md->WorkNotifyMutex);
        write(md->parentnotifypipe[1], " ", 1);   /* notify parent process of completion */
        pthread_mutex_lock(&md->WorkNotifyMutex); /* must be locked before returning to main loop */
    }
    else {
        /* Wait for something to do */
        pthread_cond_wait(&md->WorkNotify, &md->WorkNotifyMutex);
    }
}
```
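For context: pthread_cond_wait atomically releases WorkNotifyMutex while blocking and reacquires it before returning, so the loop invariant (mutex held at the top of the loop) is preserved. With a PI mutex, that release/reacquire handoff is what the requeue-PI futex path discussed below implements.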
Oops. Add this code for dequeuing the completed computation:

```c
pthread_mutex_lock(&md->WorkNotifyMutex);
fcn = (struct MathFcn *)dgl_RemTail((struct dgl_List *)&md->CompletedComputation); /* simple linked-list remove */
pthread_mutex_unlock(&md->WorkNotifyMutex);
```

Other than these few snippets, NOTHING touches this particular mutex.

Please provide a _complete_ test case.

Created attachment 385307 [details]
Deadlock test case. Note the suggested compile parameters.
Here is a test case. On a Core i7 motherboard (64-bit OS), three trials led to:
1st trial: the program stopped at 500 iterations
2nd trial: stopped at 1000 iterations
3rd trial: stopped at 1200 iterations

I am unable to reproduce (i.e., it runs forever) on my Core Duo laptop (32-bit) or a dual quad-core Opteron (64-bit).
Cannot reproduce.

It seems quite repeatable on this end. Ending up in the deadlock state might be very dependent on timing details of context switches, CPU core assignments, etc. It could be motherboard-dependent. Would a core dump of the test case from kill -SEGV be helpful?

New information on this bug:

* The problem seems to be related to priority inheritance. Removing pthread_mutexattr_setprotocol(&md->MutexAttr,PTHREAD_PRIO_INHERIT) seems to work around the problem.
* It has been observed on at least two different ASUS P6TD Deluxe motherboards with Core i7 920 CPUs.
* The problem has also been observed (using the previously attached test case) under Red Hat Enterprise Linux 6 public beta 2.

This message is a reminder that Fedora 12 is nearing its end of life. Approximately thirty days from now, Fedora will stop maintaining and issuing updates for Fedora 12, and open bugs still versioned against it will be closed as WONTFIX; if the bug can be reproduced against a later release, change the 'version' field accordingly. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

As noted above, the problem has also been observed (using the previously attached test case) under Red Hat Enterprise Linux 6 public beta 2, so I'm changing the product to "Red Hat Enterprise Linux 6".

Looks like a futex bug.

Confirmed on the RHEL 6 release version on a dual quad-core Opteron (64-bit), using the test case above (attachment 385307 [details]).
Confirmed on Fedora 13 on x86_64 on an Intel(R) Core(TM)2 CPU, Kontron board with ICH8 chipset (Linux version 2.6.34.7-63.fc13.x86_64, glibc-2.12.2-1.x86_64). When running "taskset -c 0 ./deadlockbug" it hangs immediately (iter=0); without PTHREAD_PRIO_INHERIT on the mutex it does not hang. Using both CPUs, I see numbers from 200..500 or similar. This could explain some random hangs I observed in one of my programs (I had attributed them to an oversight of mine, but now...) -- Konrad

Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

Created attachment 518495 [details]
simplified reproducer
Created attachment 518496 [details]
makefile for simplified reproducer
Created attachment 518497 [details]
proposed patch
This is a bug in glibc, not the kernel. I have attached a patch with a proposed solution. It seems to me that requeue_pi in libc could never have worked, but I could be wrong, of course. I do not know what has changed over time. In any case, I think the requeue_pi code deserves a review.
The description from the attached patch follows:
---------------------------------------8<--------------------------------------
The current implementation of wait_requeue_pi in libc does not handle the situation where the kernel returns -EAGAIN. The kernel tests whether the actual futex (cond_futex) value is as expected (passed as the 3rd ($rdx) parameter to the futex syscall). If it is not, that means some other thread has increased the cond_futex value, and we need to call wait_requeue_pi again. Not handling this situation means:
1) Incorrect locking

Even a thread that does not own the mutex is woken. In the current implementation, if wait_requeue_pi fails, the code path continues with a non-PI futex_wait, which is obviously wrong. This leads to a situation where pthread_cond_wait returns but the mutex is held by a different thread, so the mutex protection does not work at all.
2) Deadlock

This is a consequence of 1). If a thread is woken when it should not be, because it does not hold the mutex, woken_seq is increased anyway. This leads to a deadlock: when the correct thread, the one that actually owns the mutex, is woken, it fails the woken_seq < wakeup_seq test. That means pthread_cond_wait restarts, but the thread already owns the mutex:
| pthread_cond_signal | pthread_cond_wait |
|---|---|
| | wait_requeue_pi is OK (this thread owns the mutex) |
| | woken_seq >= wakeup_seq |
| | wait_requeue_pi again |
| | wait on cond_futex |
| lock mutex | |
| signal cond_futex | |
| unlock mutex | |
pthread_cond_signal is now waiting on the mutex held by the pthread_cond_wait thread, while the pthread_cond_wait thread, which owns the mutex, waits on the cond_futex that only pthread_cond_signal would signal.
---------------------------------------8<--------------------------------------
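For readers without access to the attachments, here is a minimal sketch of the kind of program that exercises this code path. It is an illustration written for this report, not the attached reproducer; the thread count, counter layout, and names are assumptions:

```c
/*
 * Hedged sketch of a PI-mutex + condition-variable stress loop of the
 * kind that triggers this bug.  Build: gcc -O2 -pthread pi_cond_stress.c
 * On an affected glibc the iteration counter stops advancing at some point.
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8

static pthread_mutex_t lock;                      /* PI mutex */
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static long pending;                              /* work counter, guarded by lock */

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    for (;;) {
        while (pending == 0)
            pthread_cond_wait(&cond, &lock);      /* requeue-PI path in glibc */
        pending--;
    }
    return NULL;                                  /* not reached */
}

int main(void)
{
    pthread_mutexattr_t ma;
    pthread_t thr[NTHREADS];
    long iter;
    int i;

    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setprotocol(&ma, PTHREAD_PRIO_INHERIT); /* the trigger */
    pthread_mutex_init(&lock, &ma);

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&thr[i], NULL, worker, NULL);

    for (iter = 0;; iter++) {
        pthread_mutex_lock(&lock);
        pending++;
        pthread_cond_signal(&cond);               /* FUTEX_CMP_REQUEUE_PI path */
        pthread_mutex_unlock(&lock);
        if (iter % 100 == 0)
            printf("iter=%ld\n", iter);           /* output stalls on deadlock */
    }
}
```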
Oops, looking at the patch again: it will need to hold cond_lock while increasing cond_futex and setting %edx; as written, there is a race. I'll wait to see what you think about it, and I can fix that if desired. But I'm expecting that some libc guru will come up with something better.

Created attachment 518541 [details]
proposed patch V2 with cond_lock
second version of the proposed patch with cond_lock
Just one more quick note: the patch is for x86_64 only. I haven't checked other architectures.

The patch is causing problems in Fedora and Debian. It has been disabled while the issues are resolved.

For future reference, to reproduce the problems, install glibc with the 552960 patch on F16, then:
```
AUDIODRIVER=pulseaudio play -n -c1 synth whitenoise band -n 100 20 \
    band -n 50 20 gain +25 fade h 1 864000 1
```
It fails maybe 1 in 20 times, typically within the first 2-3 seconds. Unfortunately, all this code interacts poorly with gdb, so it's been quite difficult to determine what's going on on the user side. On the kernel side, printks are my best friend.
It looks like the EAGAIN path in the upstream fix is failing to bump total_seq. With that fixed, the simplified test referenced in c#14 and c#15 runs millions of times, and a test utilizing "play" from the sox package runs forever as well. I'm going to need to sit down and look more closely at how the total_seq counter is used, but we may have this nailed down.

Unfortunately, after making my patch to fix the total_seq counter available for wider testing, additional issues have been reported. At this point I do not believe we can safely address this bug without a fairly high chance of introducing new regressions, which I consider unacceptable for RHEL 6.3. Thus I'm regretfully going to have to change this to a dev_nak and queue it for RHEL 6.4.

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

This request was erroneously removed from consideration in Red Hat Enterprise Linux 6.4, which is currently under development. This request will be evaluated for inclusion in Red Hat Enterprise Linux 6.4.

I'm trying to look at why the repeated call to wait_requeue_pi would be needed.

(In reply to comment #19)
> Current implementation of wait_requeue_pi in libc does not handle situation
> when kernel returns -EAGAIN. Kernel tests if the actual futex(cond_futex)
> value is as expected(handled as 3rd($rdx) parameter to futex syscall). If
> it's not, it means some other thread increased the cond_futex value and we
> need to call wait_requeue_pi again. Not handling this situation means:

The trouble with this justification is that EAGAIN means the above only in the case of FUTEX_CMP_REQUEUE, at least according to the man page for futex.

Anyway, I picked this up from the point of Andreas' patch, which is the following:

http://sourceware.org/git/?p=glibc.git;a=commitdiff;h=c5a0802a682dba23f92d47f0f99775aebfbe2539

and used the reproducer in the upstream bug report:

http://sourceware.org/bugzilla/show_bug.cgi?id=14417

to try and get the reason for the EAGAIN, using systemtap to figure out where the EAGAIN is coming from. I have narrowed it down to futex_wait_setup so far. The probe to see this is:

```
probe kernel.function("futex_wait_setup").return {
    if (execname() == "ld-linux-x86-64") {
        printf ("futex_wait_setup returned %ld\n", $return);
        print_backtrace ();
    }
}
```

where the next-to-last output is seen as:

```
futex_wait_setup returned -11
Returning from: 0xffffffff810aed90 : futex_wait_setup+0x0/0xf0 [kernel]
Returning to  : 0xffffffff810af751 : futex_wait_requeue_pi+0x161/0x410 [kernel]
0xffffffff810b0970 : do_futex+0x2f0/0xa50 [kernel]
0xffffffff810b11da : sys_futex+0x10a/0x1a0 [kernel]
0xffffffff81604729 : system_call_fastpath+0x16/0x1b [kernel]
```

the last being the one that is hung. This is intriguing, because I don't think futex_wait_setup is supposed to return an EAGAIN at all; it is documented in the code as only being able to return EWOULDBLOCK and EFAULT. I'm still trying to figure out what this means, since there must be some case I have missed.

I feel stupid: EWOULDBLOCK is EAGAIN, so that is how we get the EAGAIN.

*** Bug 854725 has been marked as a duplicate of this bug. ***

Created attachment 628788 [details]
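For anyone retracing this analysis: a probe like the one above can be saved to a file (say, futex_eagain.stp; the name is chosen here just for illustration) and run as root with `stap futex_eagain.stp` while the reproducer is hanging; stap compiles and loads the probe as a kernel module and prints the trace as the events fire.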
Consolidated patch backported from upstream
Created attachment 628789 [details]
output of custom ps command during deadlock
Created attachment 628790 [details]
'ps -ALo pid,tid,pri,rtprio,stat,status,wchan:30,cmd' taken while the threads were "frozen"
Created attachment 628791 [details]
'ps -ALo pid,tid,pri,rtprio,stat,status,wchan:30,ucomm' taken while the threads were "frozen"
Created attachment 628792 [details]
'ps -ALo pid,tid,pri,rtprio,stat,status,wchan:30,cmd' taken while the threads were "frozen"
Created attachment 628796 [details]
ftrace events trace captured during the test
------- Comment From mkravetz.com 2012-10-24 17:14 EDT -------
Will this patch apply to the latest published version of glibc for RHEL 6? I believe the version is glibc-2.12-1.80.el6_3.5.x86_64. I wanted to ask before I attempt to build a test RPM.

------- Comment From mkravetz.com 2012-10-25 00:37 EDT -------
Hmmmm? The simplified reproducer hangs for me (always at a different count). Perhaps I built the test RPMs incorrectly. Does anyone from Red Hat have test RPMs available (i686 and x86_64)?

I have resubmitted a test build, so I should be able to get you test packages soon. The fix is scheduled for inclusion in RHEL 6.5, so you won't see the fix in any of the published builds.

I have uploaded the test packages here: http://people.redhat.com/spoyarek/bz552960/

------- Comment From mkravetz.com 2012-10-26 00:23 EDT -------
Thank you for the test images. They appear to work well. The IBM Java group should now perform some validation with these images in their test environment.

------- Comment From tpnoonan.com 2012-11-05 22:39 EDT -------
Hi Red Hat, IBM's WebSphere Real Time will not work on MRG 2.x due to this defect. Can the fix for this defect be considered for RHEL 6.4 instead of RHEL 6.5? Thanks.

Please get in touch with your support contacts if you need the fix expedited.

Hello, per comment 56 I am requesting this as an exception for RHEL 6.4. Thank you, Joe Kachuck.

------- Comment From tpnoonan.com 2012-11-16 16:41 EDT -------
(In reply to comment #46)
> hi red hat, ibm's WebSphere RealTime will not work on MRG 2.x due to this
> defect, can the fix for this defect be considered for rhel6.4 instead of
> rhel6.5? thanks

Our product has only been certified on the MRG 1.3 release, which is no longer supported by Red Hat.

This is not suitable for RHEL 6.4; it needs considerably more upstream and Fedora testing. Getting this wrong has serious consequences for our customer base. The upstream exposure is still relatively small at the moment, as it's limited to upstream developer builds and Fedora rawhide. We have one report which might be related to installing Siddhesh's patches into rawhide (we're still waiting on a core file from the reporter for analysis). This is really a 6.5 issue.

------- Comment From mstoodle.com 2012-11-23 19:32 EDT -------
OK, let me make sure I've got this right (I'm going to take some of your points out of order, after starting off with my own point):

1) it's broken right now (hence this bug)
2) "The upstream exposure is still relatively small at the moment"

but from there, you went to:

3) "Getting this wrong has serious consequences for our customer base"

so

4) "This is not suitable for RHEL 6.4"

and capping off with:

5) "This is really a 6.5 issue"

I'm having trouble understanding the 3->4->5 sequence given 1 and 2, but maybe that's because I'm not appreciating the scope of the patch. Does the patch affect non-MRG customers as well, whereas this bug only concerns MRG customers? Is that the issue? How far away from the RHEL 6.5 release are we?

In the meantime: if we officially request a patch on RHEL 6.3 for this bug, will you officially support any customer using MRG 2.x on RHEL 6.3 with that patch? (Don't worry, we would send them to you directly to get the patch.)

The problem is that there have been several attempts to fix this problem over the last few years, each of which has caused regressions of one form or another.
In every case, I would consider the regressions caused by attempts to fix this bug to be worse than the original bug being fixed here (in terms of the scope of the problem, the number of users affected, etc.). Those regressions were not caught by the existing test suites, but by wide-scale deployments of the patches by way of Debian, Fedora, and Ubuntu.

Given the history of "fixes" for this problem causing regressions, the lack of widespread testing of the current proposed fix, the fact that this potentially affects every program using pthread condition variables, and the fact that we're very late in the RHEL 6.4 release cycle, I can't in good conscience propose this fix be included in RHEL 6.4. RHEL 6.4 hasn't even been released yet, so it's probably safe to assume it'll be several months before RHEL 6.5 is released. The support implications for MRG 2.x are something you'd need to discuss with your support contacts.

Is there any way to detect the fix for this bug from userland, so that we can tell whether it's safe to use priority-inherited mutexes?

Comment 44 has test cases that can be adapted to test independently. Just replace 'do_test' with 'main' and compile as you normally would (a minimal sketch of this adaptation follows at the end of this report).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1605.html

------- Comment From mstoodle.com 2013-11-22 14:48 EDT -------
Is there a minimum MRG 2 level needed to use RHEL 6.5?

The most recent Realtime release is the supported version, and we use the most recent RHEL available at the time for our testing. RHEL 6.5 was used for testing our MRG-2.4 release, which features the 3.8 kernel.
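As a concrete illustration of the adaptation mentioned above (the file and function names here are placeholders; glibc's nptl tests conventionally define a do_test function that the test harness invokes):

```c
/* Hypothetical standalone wrapper for a glibc-style test.  Assumes the
 * test source's "static int do_test (void)" has had its static removed,
 * so it can be linked against this file:
 *   gcc -O2 -pthread tst-cond-pi.c main.c -o tst-cond-pi   (names assumed) */
extern int do_test(void);

int main(void)
{
    /* 0 = condvar/PI-mutex behavior looks correct; nonzero = failure */
    return do_test();
}
```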