552960 – Possible deadlock in pthread_mutex_lock/pthread_cond_wait

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	glibc
Sub Component:
Version:	6.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Siddhesh Poyarekar
QA Contact:	Arjun Shankar
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	854725
Depends On:
Blocks:	782183

Reported:	2010-01-06 16:43 UTC by Steve Holland
Modified:	2020-06-11 12:33 UTC
CC List:	24 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Prior to this update, there were multiple synchronization bugs in pthread_cond_wait and pthread_cond_timedwait on x86_64 and i686, such that when a multithreaded program uses a priority-inherited mutex to synchronize access to a condition variable, some threads may deadlock when woken using pthread_cond_signal or when cancelled. This update fixes all such known problems related to condition variable synchronization.
Clone Of:
Environment:
Last Closed:	2013-11-21 10:38:17 UTC
Target Upstream Version:
Embargoed:

Attachments
Deadlock test case. Note the suggested compile parameters (4.45 KB, text/x-csrc) 2010-01-19 03:44 UTC, Steve Holland
simplified reproducer (1.88 KB, text/plain) 2011-08-16 14:10 UTC, Frantisek Hrbata
makefile for simplified reproducer (844 bytes, text/plain) 2011-08-16 14:11 UTC, Frantisek Hrbata
proposed patch (3.57 KB, text/plain) 2011-08-16 14:12 UTC, Frantisek Hrbata
proposed patch V2 with cond_lock (4.05 KB, text/plain) 2011-08-16 17:34 UTC, Frantisek Hrbata
Consolidated patch backported from upstream (34.06 KB, patch) 2012-10-17 13:04 UTC, Siddhesh Poyarekar
output of custom ps command during deadlock (8.66 KB, application/octet-stream) 2012-10-17 13:08 UTC, IBM Bug Proxy
'ps -ALo pid,tid,pri,rtprio,stat,status,wchan:30,cmd' taken while the threads were "frozen" (43.02 KB, text/plain) 2012-10-17 13:08 UTC, IBM Bug Proxy
'ps -ALo pid,tid,pri,rtprio,stat,status,wchan:30,ucomm' taken while the threads were "frozen" (25.76 KB, text/plain) 2012-10-17 13:08 UTC, IBM Bug Proxy
'ps -ALo pid,tid,pri,rtprio,stat,status,wchan:30,cmd' taken while the threads were "frozen" (41.40 KB, application/octet-stream) 2012-10-17 13:08 UTC, IBM Bug Proxy
ftrace events trace captured during the test (3.38 MB, application/x-bzip2) 2012-10-17 13:09 UTC, IBM Bug Proxy

Links
System	ID	Private	Priority	Status	Summary	Last Updated
IBM Linux Technology Center	84844	0	None	None	None	2019-04-27 20:57:39 UTC
Red Hat Product Errata	RHSA-2013:1605	0	normal	SHIPPED_LIVE	Moderate: glibc security, bug fix, and enhancement update	2013-11-20 21:54:09 UTC

Description Steve Holland 2010-01-06 16:43:12 UTC

I have a possible deadlock condition in the pthreads library. It is a very rare and random occurrence in very simple code that dispatches jobs from a main thread to a pool of worker threads.

This code uses pthread_mutex_lock(), pthread_cond_wait(), and pthread_cond_signal() to control access to the list of jobs. The mutex is never held for more than a few lines of code and there are no code paths that could allow the mutex to be left held. pthread_cond_signal() and pthread_cond_wait() are always called with the mutex locked. 

The symptom is that a random thread gets stuck in pthread_mutex_lock(). In the deadlocked state, the mutex __lock field contains the thread ID of one of the other threads that is sitting in pthread_cond_wait(), OR'd with 0x80000000. I believe count is 1; owner is 0 and nusers is 9. All of the other threads are waiting in pthread_cond_wait().
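
The lock word reported above can be read against the priority-inheritance futex bit layout from <linux/futex.h>: bit 31 is FUTEX_WAITERS, bit 30 is FUTEX_OWNER_DIED, and the low 30 bits hold the owner's thread ID. A minimal sketch of such a decoder follows; the struct and function names are hypothetical and only the three mask values come from the kernel header.

```c
#include <stdint.h>

/* Bit layout of a PI futex word; values match <linux/futex.h>. */
#define PI_FUTEX_WAITERS    0x80000000u  /* FUTEX_WAITERS: other threads are queued */
#define PI_FUTEX_OWNER_DIED 0x40000000u  /* FUTEX_OWNER_DIED: owner exited while holding it */
#define PI_FUTEX_TID_MASK   0x3fffffffu  /* FUTEX_TID_MASK: thread ID of the owner */

/* Hypothetical container for the decoded fields. */
struct pi_lock_word {
    uint32_t owner_tid;   /* 0 means the kernel sees no owner */
    int      has_waiters;
    int      owner_died;
};

/* Split a raw __lock value into its three components. */
struct pi_lock_word decode_pi_lock_word(uint32_t word)
{
    struct pi_lock_word out;
    out.owner_tid   = word & PI_FUTEX_TID_MASK;
    out.has_waiters = (word & PI_FUTEX_WAITERS) != 0;
    out.owner_died  = (word & PI_FUTEX_OWNER_DIED) != 0;
    return out;
}
```

Read this way, a value of "TID OR'd with 0x80000000" says the kernel regards that thread as the owner with waiters queued, while the owner field kept by glibc says nobody holds the mutex.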

Attempting at the same point to call pthread_mutex_lock() from another thread locks the other thread as well.

The problem is observed on Fedora 12 on a 4-core Core I7 with hyperthreading, so 8 cpus. 

The mutex has PTHREAD_PRIO_INHERIT set. Setting PTHREAD_MUTEX_ERRORCHECK does not prevent the problem (and none of the pthread_mutex_*() calls return an error), but setting ERRORCHECK does seem to make the problem occur less frequently.

In this situation there isn't a whole lot of work for the worker threads to do. I suspect that there might be a race condition in the response of multiple threads in pthread_cond_wait() that leads to a problem in the mutex. 

glibc version: glibc-2.11-2.x86_64

Comment 1 Andreas Schwab 2010-01-11 13:30:44 UTC

Please provide a complete test case.

Comment 2 Steve Holland 2010-01-11 16:30:12 UTC

Working on it... As this is a heisenbug, a simple test case may not be possible. The hardware on which this failed isn't mine; it belongs to one of my research partners.

Do you know if a nonzero __lock field with a zero owner is a legitimate mutex state? 

That looked very suspicious to me. 


I'm pasting the relevant code below, which I will be using to assemble the test case. 



Code snippets
-------------

Mutex Creation: 
	pthread_mutexattr_init(&md->MutexAttr);
#ifdef WFMMATH_DEBUG // optional... this seems to make the problem happen less frequently 
	pthread_mutexattr_settype(&md->MutexAttr,PTHREAD_MUTEX_ERRORCHECK);
#endif	
	pthread_mutexattr_setprotocol(&md->MutexAttr,PTHREAD_PRIO_INHERIT);
	pthread_mutex_init(&md->WorkNotifyMutex,&md->MutexAttr);
	pthread_cond_init(&md->WorkNotify,NULL);

Thread creation:

	pthread_attr_init(&tattr);
	pthread_attr_setscope(&tattr,PTHREAD_SCOPE_SYSTEM);
	pthread_attr_setinheritsched(&tattr,PTHREAD_EXPLICIT_SCHED);
	pthread_attr_setschedpolicy(&tattr,SCHED_OTHER);

	/* SCHED_OTHER requires a sched_priority of 0 */
	memset(&schedparam,0,sizeof(schedparam));
	schedparam.sched_priority=0;
	pthread_attr_setschedparam(&tattr,&schedparam);


	for (Cnt=0;Cnt < md->actual_threads;Cnt++) {
		err=pthread_create(&thr->Thread,&tattr,&CalcThreadCode,thr);
	}


Queuing work: 
	pthread_mutex_lock(&md->WorkNotifyMutex);
	dgl_AddTail((struct dgl_List *)&md->PendingComputation,(struct dgl_Node *)calcfcn); // This is just a simple linked-list add. 
	pthread_cond_signal(&md->WorkNotify);
	pthread_mutex_unlock(&md->WorkNotifyMutex);


Worker thread loop: 
	pthread_mutex_lock(&md->WorkNotifyMutex);
	for (;;) {
		
		todo=(struct MathFcn *)dgl_RemHead((struct dgl_List *)&md->PendingComputation); // this is a simple linked-list remove
		if (todo) {
			pthread_mutex_unlock(&md->WorkNotifyMutex);
			todo->CalcFcn(md,todo); // Actual work done here.

			pthread_mutex_lock(&md->WorkNotifyMutex); 
			dgl_AddHead((struct dgl_List *)&md->CompletedComputation,(struct dgl_Node *)todo); // simple linked list add
			pthread_mutex_unlock(&md->WorkNotifyMutex);

			write(md->parentnotifypipe[1]," ",1); //# Notify parent process of completion
			pthread_mutex_lock(&md->WorkNotifyMutex); //# Must be locked before returning to main loop

		}
		else {
			/* Wait for something to do */
			pthread_cond_wait(&md->WorkNotify,&md->WorkNotifyMutex);
		}
	}

Comment 3 Steve Holland 2010-01-11 16:35:28 UTC

Oops. Add this code for dequeuing the completed computation:


	pthread_mutex_lock(&md->WorkNotifyMutex);
	fcn=(struct MathFcn *)dgl_RemTail((struct dgl_List *)&md->CompletedComputation); // Simple linked-list remove
	pthread_mutex_unlock(&md->WorkNotifyMutex);


Other than these few snippets NOTHING touches this particular mutex.

Comment 4 Andreas Schwab 2010-01-11 16:45:10 UTC

Please provide a _complete_ test case.

Comment 5 Steve Holland 2010-01-19 03:44:59 UTC

Created attachment 385307 [details]
Deadlock test case. Note the suggested compile parameters

Here is a test case. On a Core I7 motherboard (64 bit OS), three trials lead to: 

1st trial: the program stops at 500 iterations
2nd trial: stops at 1000 iterations
3rd trial: stops at 1200 iterations

I am unable to reproduce (i.e. runs forever) on my core duo laptop (32 bit) or a dual quad core Opteron (64 bit).

Comment 6 Andreas Schwab 2010-02-15 13:31:34 UTC

Cannot reproduce.

Comment 7 Steve Holland 2010-02-15 17:47:37 UTC

It seems quite repeatable on this end. 

Ending up in the deadlock state might be very dependent on timing details of context switches, CPU core assignments, etc. Could be motherboard dependent.

Would a coredump of the testcase from kill -SEGV be helpful?

Comment 8 Steve Holland 2010-08-16 15:14:01 UTC

New information on this bug: 
  * The problem seems to be related to priority inheritance. Removing pthread_mutexattr_setprotocol(&md->MutexAttr,PTHREAD_PRIO_INHERIT) seems to work around the problem.
  * It has been observed on at least two different ASUS P6TD Deluxe motherboards with Core I7 920 CPUs. 
  * The problem has also been observed (using the previously attached test-case) under Red Hat Enterprise 6 public beta 2.

Comment 9 Bug Zapper 2010-11-04 01:42:25 UTC

This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 10 Steve Holland 2010-11-04 04:14:15 UTC

As noted above
 * The problem has also been observed (using the previously attached
test-case) under Red Hat Enterprise 6 public beta 2.

So I'm changing product to "Red Hat Enterprise 6"

Comment 12 Andreas Schwab 2010-11-19 16:46:06 UTC

Looks like a futex bug.

Comment 13 Steve Holland 2010-12-09 16:15:48 UTC

Confirmed on RHEL6 release version on a dual quad-core Opteron (64 bit), using the testcase above (attachment 385307 [details]).

Comment 14 Konrad Karl 2011-01-11 17:14:08 UTC

confirmed on Fedora 13 on x86_64 on an Intel(R) Core(TM)2 CPU 
Kontron Board ICH8 chipset.

(Linux version 2.6.34.7-63.fc13.x86_64,glibc-2.12.2-1.x86_64)

when running "taskset -c 0 ./deadlockbug" it hangs immediately (iter=0)
(without PTHREAD_PRIO_INHERIT on the mutex it does not hang)

using both cpu's I see numbers from 200..500 or similar.

this could explain some random hangs I observed in one of my programs
(had attributed them to an oversight of mine but now...)

Konrad

Comment 15 RHEL Program Management 2011-04-04 01:44:31 UTC

Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 16 Frantisek Hrbata 2011-08-16 14:10:41 UTC

Created attachment 518495 [details]
simplified reproducer

Comment 17 Frantisek Hrbata 2011-08-16 14:11:26 UTC

Created attachment 518496 [details]
makefile for simplified reproducer

Comment 18 Frantisek Hrbata 2011-08-16 14:12:46 UTC

Created attachment 518497 [details]
proposed patch

Comment 19 Frantisek Hrbata 2011-08-16 14:23:28 UTC

This is a bug in glibc, not the kernel. I attached a patch with a proposed solution. It seems to me that requeue_pi in libc could never have worked, but I could be wrong of course. I do not know what has changed over time. Anyway, I think that the requeue_pi code deserves a review.

Here follows description from the attached patch:

---------------------------------------8<--------------------------------------
Current implementation of wait_requeue_pi in libc does not handle situation when
kernel returns -EAGAIN. Kernel tests if the actual futex(cond_futex) value is as
expected(handled as 3rd($rdx) parameter to futex syscall). If it's not, it means
some other thread increased the cond_futex value and we need to call
wait_requeue_pi again. Not handling this situation means:

1) incorrect locking
Even a thread that does not own the mutex is woken. In the current
implementation, if wait_requeue_pi fails, the code path continues with a non-pi
futex_wait, which is obviously wrong. This leads to the situation where
pthread_cond_wait returns but the mutex is held by a different thread, so the
mutex protection does not work at all.

2) deadlock
This is a consequence of 1). If a thread is woken when it should not be, because
it does not hold the mutex, woken_seq is increased anyway. This leads to a
deadlock, because when the correct thread is woken, the one that actually owns
the mutex, it fails the woken_seq < wakeup_seq test. This means pthread_cond_wait
restarts its wait, but the thread already owns the mutex.

pthread_cond_signal          pthread_cond_wait

                             wait_requeue_pi is OK(this thread owns mutex)
                             woken_seq >= wakeup_seq
                             wait_requeue_pi again
                             wait on cond_futex
lock mutex
signal cond_futex
unlock mutex

pthread_cond_signal is waiting for the mutex, which is held by the
pthread_cond_wait thread, and the pthread_cond_wait thread that owns the mutex
is waiting on cond_futex, which is only signaled from pthread_cond_signal.
---------------------------------------8<--------------------------------------
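
The retry that the description asks for can be sketched in plain C. This is not the attached patch and not glibc code: the wait call is abstracted behind a hypothetical function pointer so that the loop can be exercised without a real futex, and fake_wait is a test double invented for this example. In real code the value would have to be re-read under the condition variable's internal lock, as Comment 20 points out.

```c
#include <errno.h>

/* Hypothetical stand-in for a futex wait. Returns 0 on a normal wake-up or
 * a negative errno; -EAGAIN means *futex_word no longer equals `expected`. */
typedef int (*wait_fn)(const unsigned *futex_word, unsigned expected, void *ctx);

/* Keep calling `wait` until it stops reporting -EAGAIN. The expected value
 * is refreshed before every attempt; *attempts counts the calls made. */
int wait_retrying_on_eagain(const unsigned *futex_word, wait_fn wait,
                            void *ctx, int *attempts)
{
    int ret;
    *attempts = 0;
    do {
        unsigned expected = *futex_word;  /* value the wait should compare against */
        ret = wait(futex_word, expected, ctx);
        (*attempts)++;
    } while (ret == -EAGAIN);
    return ret;                           /* 0, or an error other than -EAGAIN */
}

/* Test double: fails with -EAGAIN as many times as the int that ctx points
 * to says, then succeeds. It never sleeps and never touches a real futex. */
int fake_wait(const unsigned *futex_word, unsigned expected, void *ctx)
{
    int *remaining_failures = ctx;
    (void)futex_word;
    (void)expected;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return -EAGAIN;
    }
    return 0;
}
```

The point of the loop is that -EAGAIN is treated as "try the same PI wait again" rather than as a reason to fall through to a different kind of wait, which is the fallback the description identifies as the source of the incorrect locking.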

Comment 20 Frantisek Hrbata 2011-08-16 15:49:32 UTC

Oops, I'm looking at the patch again and it will need to hold the cond_lock while increasing cond_futex and setting %edx. As it stands there is a race. I'll wait to see what you think about it, and I can fix that if desired. But I'm expecting that some libc guru will come up with something better.

Comment 21 Frantisek Hrbata 2011-08-16 17:34:15 UTC

Created attachment 518541 [details]
proposed patch V2 with cond_lock

second version of the proposed patch with cond_lock

Comment 22 Frantisek Hrbata 2011-08-17 10:26:48 UTC

Just one more quick note. The patch is for x86_64 only. I haven't checked other archs.

Comment 30 Jeff Law 2011-12-22 16:00:13 UTC

Patch is causing problems in Fedora & Debian.  Disabled while issues are resolved.

Comment 33 Jeff Law 2012-01-09 20:40:18 UTC

For future reference, to reproduce the problems, install glibc with the 552960 patch on F16.  Then :

AUDIODRIVER=pulseaudio play -n -c1  synth whitenoise band -n 100 20 \
        band -n 50 20 gain +25 fade h 1 864000 1

Fails maybe 1 in 20 times, typically within the first 2-3 seconds.  Unfortunately all this code interacts poorly with gdb, so it's been quite difficult to determine what's going on on the user side.  Kernel side, printks are my best friend.

Comment 34 Jeff Law 2012-01-13 04:46:42 UTC

It looks like the EAGAIN path in the upstream fix is failing to bump total_seq.  With that fixed, the simplified test referenced in c#14 and c#15 runs millions of times.  And a test utilizing "play" from the sox package runs forever as well.

I'm going to need to sit down and look more closely at how the total_seq counter is used, but we may have this nailed down.

Comment 36 Jeff Law 2012-02-29 18:05:10 UTC

Unfortunately after making my patch to fix the total_seq counter available for wider testing, additional issues have been reported.

At this point I do not believe we can safely address this bug without a fairly high chance of introducing new regressions, which I consider unacceptable for RHEL 6.3.  Thus I'm going to regretfully have to change this to a dev_nak and queue it for RHEL 6.4.

Comment 37 RHEL Program Management 2012-07-10 07:23:43 UTC

This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 38 RHEL Program Management 2012-07-10 23:17:24 UTC

This request was erroneously removed from consideration in Red Hat Enterprise Linux 6.4, which is currently under development.  This request will be evaluated for inclusion in Red Hat Enterprise Linux 6.4.

Comment 40 Siddhesh Poyarekar 2012-09-10 14:12:48 UTC

I'm trying to look at why the repeated call for the (In reply to comment #19)
> Current implementation of wait_requeue_pi in libc does not handle situation when
> kernel returns -EAGAIN. Kernel tests if the actual futex(cond_futex) value is as
> expected(handled as 3rd($rdx) parameter to futex syscall). If it's not, it means
> some other thread increased the cond_futex value and we need to call
> wait_requeue_pi again. Not handling this situation means:

The trouble with this justification is that EAGAIN means the above only in case of FUTEX_CMP_REQUEUE, at least according to the man page for futex.

Anyway, I picked this up from the point of Andreas' patch, which is the following:

http://sourceware.org/git/?p=glibc.git;a=commitdiff;h=c5a0802a682dba23f92d47f0f99775aebfbe2539

and used the reproducer in the upstream bug report to try and get the reason for the EAGAIN:

http://sourceware.org/bugzilla/show_bug.cgi?id=14417

by using systemtap to figure out where the EAGAIN is coming from and I have narrowed it down to futex_wait_setup so far. The probe to see this is:

probe kernel.function("futex_wait_setup").return {
        if (execname() == "ld-linux-x86-64") {
                printf ("futex_wait_setup returned %ld\n", $return);
                print_backtrace ();
        }
}

where the next-to-last output is seen as:

futex_wait_setup returned -11
Returning from:  0xffffffff810aed90 : futex_wait_setup+0x0/0xf0 [kernel]
Returning to  :  0xffffffff810af751 : futex_wait_requeue_pi+0x161/0x410 [kernel]
 0xffffffff810b0970 : do_futex+0x2f0/0xa50 [kernel]
 0xffffffff810b11da : sys_futex+0x10a/0x1a0 [kernel]
 0xffffffff81604729 : system_call_fastpath+0x16/0x1b [kernel]

the last being the one that is hung.

This is intriguing because I don't think futex_wait_setup is supposed to return an EAGAIN at all - it is documented in the code as only being able to return EWOULDBLOCK and EFAULT. I'm still trying to figure out what this means since there must be some case I may have missed.

Comment 41 Siddhesh Poyarekar 2012-09-11 11:42:41 UTC

I feel stupid - EWOULDBLOCK is EAGAIN, so that is how we get the EAGAIN.
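
The aliasing is easy to confirm from userland. A small sketch, assuming a Linux target (POSIX allows the two constants to differ elsewhere); the function names are hypothetical:

```c
#include <errno.h>

/* Returns 1 when this platform defines EWOULDBLOCK and EAGAIN as the same
 * number, which is the case on Linux. */
int ewouldblock_aliases_eagain(void)
{
    return EWOULDBLOCK == EAGAIN;
}

/* Returns 1 when a kernel-style negative return value means
 * "would block / try again", 0 otherwise. */
int is_would_block(long ret)
{
    return ret < 0 && (-ret == EAGAIN || -ret == EWOULDBLOCK);
}
```

On x86_64 Linux both constants are 11, which matches the -11 printed by the probe above.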

Comment 43 Siddhesh Poyarekar 2012-10-17 12:56:54 UTC

*** Bug 854725 has been marked as a duplicate of this bug. ***

Comment 44 Siddhesh Poyarekar 2012-10-17 13:04:42 UTC

Created attachment 628788 [details]
Consolidated patch backported from upstream

Comment 45 IBM Bug Proxy 2012-10-17 13:08:29 UTC

Created attachment 628789 [details]
output of custom ps command during deadlock

Comment 46 IBM Bug Proxy 2012-10-17 13:08:38 UTC

Created attachment 628790 [details]
'ps -ALo pid,tid,pri,rtprio,stat,status,wchan:30,cmd' taken while the threads were "frozen"

Comment 47 IBM Bug Proxy 2012-10-17 13:08:46 UTC

Created attachment 628791 [details]
'ps -ALo pid,tid,pri,rtprio,stat,status,wchan:30,ucomm' taken while the threads were "frozen"

Comment 48 IBM Bug Proxy 2012-10-17 13:08:53 UTC

Created attachment 628792 [details]
'ps -ALo pid,tid,pri,rtprio,stat,status,wchan:30,cmd' taken while the threads were "frozen"

Comment 49 IBM Bug Proxy 2012-10-17 13:09:04 UTC

Created attachment 628796 [details]
ftrace events trace captured during the test

Comment 50 IBM Bug Proxy 2012-10-24 17:23:10 UTC

------- Comment From mkravetz.com 2012-10-24 17:14 EDT-------
Will this patch apply to the latest published version of glibc for RHEL 6?  I believe the version is:

glibc-2.12-1.80.el6_3.5.x86_64

Wanted to ask before I attempt to build a test RPM.

Comment 51 IBM Bug Proxy 2012-10-25 00:42:44 UTC

------- Comment From mkravetz.com 2012-10-25 00:37 EDT-------
Hmmmm?

The simplified reproducer hangs for me (always at a different count).  Perhaps I built the test RPMs incorrectly.

Does anyone from Red Hat have test RPMS available (i686 and x86_64)?

Comment 52 Siddhesh Poyarekar 2012-10-25 01:35:13 UTC

I have resubmitted a test build, so I should be able to get you test packages soon. The fix is scheduled for inclusion in rhel-6.5, so you won't see the fix in any of the published builds.

Comment 54 Siddhesh Poyarekar 2012-10-25 03:19:25 UTC

I have uploaded the test packages here:

http://people.redhat.com/spoyarek/bz552960/

Comment 55 IBM Bug Proxy 2012-10-26 00:33:08 UTC

------- Comment From mkravetz.com 2012-10-26 00:23 EDT-------
Thank you for the test images.  They appear to work well.

The IBM Java group should now perform some validation with these images in their test environment.

Comment 56 IBM Bug Proxy 2012-11-05 22:42:40 UTC

------- Comment From tpnoonan.com 2012-11-05 22:39 EDT-------
hi red hat, ibm's WebSphere RealTime will not work on MRG 2.x  due to this defect, can the fix for this defect be considered for rhel6.4 instead of rhel6.5? thanks

Comment 58 Siddhesh Poyarekar 2012-11-09 14:12:31 UTC

Please get in touch with your support contacts if you need the fix expedited.

Comment 59 Joseph Kachuck 2012-11-16 14:57:06 UTC

Hello,
Per Comment 56 I am requesting this for exception for RHEL 6.4.

Thank You
Joe Kachuck

Comment 60 IBM Bug Proxy 2012-11-16 17:08:23 UTC

------- Comment From tpnoonan.com 2012-11-16 16:41 EDT-------
(In reply to comment #46)
> hi red hat, ibm's WebSphere RealTime will not work on MRG 2.x  due to this
> defect, can the fix for this defect be considered for rhel6.4 instead of
> rhel6.5? thanks

Our product has only been certified on the MRG 1.3 release which is
no longer supported by Red Hat.

Comment 64 Jeff Law 2012-11-23 15:41:30 UTC

This is not suitable for RHEL 6.4; it needs considerably more upstream and Fedora testing.  Getting this wrong has serious consequences for our customer base.

The upstream exposure is still relatively small at the moment as it's limited to upstream developer builds and Fedora rawhide.  We have one report which might be related to installing Siddhesh's patches into rawhide (we're still waiting for a core file from the reporter for analysis).

This is really a 6.5 issue.

Comment 66 IBM Bug Proxy 2012-11-23 19:43:38 UTC

------- Comment From mstoodle.com 2012-11-23 19:32 EDT-------
Ok, let me make sure I've got this right (I'm going to take some of your points out of order after starting off with my own point):
1) it's broken right now (hence this bug)
2) "The upstream exposure is still relatively small at the moment"

but from there, you went to :
3) "Getting this wrong has serious consequences for our customer base"
so
4) "This is not suitable for RHEL 6.4"
and capping off with:
5) "This is really a 6.5 issue"

I'm having trouble understanding the 3->4->5 sequence given 1 and 2, but maybe that's because I'm not appreciating the scope of the patch.

Does the patch affect non-MRG customers as well whereas this bug only concerns MRG customers?  Is that the issue?

How far away from RHEL 6.5 release are we? For the meantime...if we officially request a patch on RHEL 6.3 for this bug, will you officially support any customer using MRG 2.x on RHEL 6.3 with that patch (don't worry, we would send them to you directly to get the patch :) ).

Comment 67 Jeff Law 2012-11-27 19:34:34 UTC

The problem is there have been several attempts to fix this problem over the last few years, each of which has caused regressions of one form or another.  In every case I would consider the regressions caused by attempts to fix this bug to actually be worse than the original bug that's being fixed here (in terms of scope of the problem, number of users affected, etc).

Those regressions were not caught by the existing test suites, but by wide scale deployments of the patches by way of Debian, Fedora & Ubuntu.

Given the history of "fixes" for this problem causing regressions, the lack of widespread testing of the current proposed fix, the fact that this potentially affects every program using pthread condition variables and the fact that we're very late in the RHEL 6.4 release cycle, I can't in good conscience propose this fix be included in RHEL 6.4.

RHEL 6.4 hasn't even been released yet, so it's probably safe to assume it'll be several months before RHEL 6.5 would be released.

The support implications for MRG 2.x are something you'd need to discuss with your support contacts.

Comment 71 Steve Holland 2013-08-29 13:06:47 UTC

Is there any way to detect the fix for this bug from userland, so that we can tell whether it's safe to use priority-inherited mutexes?

Comment 72 Siddhesh Poyarekar 2013-08-30 02:22:22 UTC

Comment 44 has test cases that can be adapted to test independently.  Just replace 'do_test' with 'main' and compile as you normally would.
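
Running an adapted test case is the only approach suggested in this thread. As a coarse complement, a program can at least log which glibc it is running against. The sketch below only parses a version string; the helper name is hypothetical, and a version number cannot prove that the fix is present, because distributions backport patches without changing the upstream version (the RHEL 6 packages named in Comment 50 are in the 2.12 series).

```c
#include <stdio.h>

/* Hypothetical helper: split a "major.minor[...]" version string.
 * Returns 0 on success, -1 if the string is missing or malformed. */
int parse_libc_version(const char *s, int *major, int *minor)
{
    if (s == NULL || sscanf(s, "%d.%d", major, minor) != 2)
        return -1;
    return 0;
}
```

On glibc, the string to parse can be obtained at run time from gnu_get_libc_version(), declared in <gnu/libc-version.h>.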

Comment 73 errata-xmlrpc 2013-11-21 10:38:17 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1605.html

Comment 74 IBM Bug Proxy 2013-11-22 14:52:21 UTC

------- Comment From mstoodle.com 2013-11-22 14:48 EDT-------
Is there a minimum MRG 2 level needed to use RHEL 6.5?

Comment 75 Beth Uptagrafft 2013-11-22 16:32:25 UTC

The most recent Realtime release is the supported version and we use the most recent RHEL available at the time for our testing. RHEL6.5 was used for testing our MRG-2.4 release, which features the 3.8 kernel.
