Bug 548989

Summary: PI mutexes are broken (again)
Product: Fedora
Reporter: Bruno Wolff III <bruno>
Component: glibc
Assignee: Andreas Schwab <schwab>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: medium
Version: rawhide
CC: anton, ascii79, awilliam, davidsen, dodji, dougsland, erik-fedora, gansalmon, geoff+fedora, itamar, jakub, kasal, kernel-maint, lkundrak, lpoetter, M8R-qx9aop, mschmidt, n12367, pascal, paul, phuang, redhat, schnell, schwab, scorporat, tomek, wtogami
Keywords: Triaged
Hardware: i386
OS: Linux
Fixed In Version: glibc-2.11.90-10
Doc Type: Bug Fix
Last Closed: 2010-01-21 09:46:55 UTC
Bug Blocks: 538274    
Attachments:
Description                                     Flags
simple testcase                                 none
possible fix                                    none
Fix pthread_cond_wait with requeue-PI on i386   none

Description Bruno Wolff III 2009-12-19 20:58:05 UTC
Description of problem:
Recently (about a week ago) applications trying to use PulseAudio started crashing. PulseAudio has not been upgraded since before the problem started, so something else triggered it. A sample error message from gmplayer is:

Assertion 'pthread_mutex_unlock(&m->mutex) == 0' failed at pulsecore/mutex-posix.c:108, function pa_mutex_unlock(). Aborting.
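For context on where a message like this comes from: PulseAudio wraps pthread mutexes and aborts when pthread_mutex_unlock() returns non-zero. The sketch below is a hypothetical, simplified wrapper in that spirit, not PulseAudio's actual code; the demo_* names are invented, and the ENOTSUP fallback is an assumption. It shows how a priority-inheritance (PI) mutex is requested with pthread_mutexattr_setprotocol() and PTHREAD_PRIO_INHERIT, and where such an assertion would fire.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper modelled loosely on PulseAudio's pa_mutex; not its real code. */
typedef struct {
    pthread_mutex_t mutex;
} demo_mutex;

/* Print a PulseAudio-style message and abort when `expr` is false. */
#define demo_assert(expr)                                                   \
    do {                                                                    \
        if (!(expr)) {                                                      \
            fprintf(stderr,                                                 \
                    "Assertion '%s' failed at %s:%d, function %s(). Aborting.\n", \
                    #expr, __FILE__, __LINE__, __func__);                   \
            abort();                                                        \
        }                                                                   \
    } while (0)

/* Initialise a priority-inheritance mutex; fall back to a plain one if PI is unsupported. */
static int demo_mutex_init(demo_mutex *m) {
    pthread_mutexattr_t attr;
    int r;

    assert(m != NULL);
    if ((r = pthread_mutexattr_init(&attr)) != 0)
        return r;
    r = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (r == 0)
        r = pthread_mutex_init(&m->mutex, &attr);
    if (r == ENOTSUP) {
        /* Assumption: degrade to an ordinary mutex instead of failing outright. */
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_NONE);
        r = pthread_mutex_init(&m->mutex, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return r;
}

static void demo_mutex_lock(demo_mutex *m) {
    demo_assert(pthread_mutex_lock(&m->mutex) == 0);
}

/* Same kind of check as the one in pa_mutex_unlock() that aborted in this report. */
static void demo_mutex_unlock(demo_mutex *m) {
    demo_assert(pthread_mutex_unlock(&m->mutex) == 0);
}
```

On a correct C library, a lock/unlock pair by the owning thread always returns 0, so the assertion can only fire if the library (or kernel) loses track of mutex ownership.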

Version-Release number of selected component (if applicable):
pulseaudio-0.9.21-3.fc13.i686

How reproducible:
I am seeing this with gmplayer, xmms (with both the alsa and pulse plugins), and the games Tremulous and Battle for Wesnoth. One odd thing: if I started Wesnoth with the sound sliders set to no sound and then raised them, sound worked. Restarting the game once the sliders were set so that music would play failed with essentially the same error message as for gmplayer.

Comment 1 Lennart Poettering 2009-12-20 12:11:25 UTC
I am pretty sure this is again the fault of the PI mutex code in the kernel. Will verify.

Comment 2 Lennart Poettering 2009-12-20 13:31:23 UTC
Yes, verified. Either the kernel or glibc are at fault here. Reassigning.

(RT folks, why do you break this every second month? Broken PI mutex unlocking is a recurring story...)

Comment 3 Lennart Poettering 2009-12-20 13:41:15 UTC
*** Bug 538638 has been marked as a duplicate of this bug. ***

Comment 4 Lennart Poettering 2009-12-20 13:44:27 UTC
*** Bug 548259 has been marked as a duplicate of this bug. ***

Comment 5 Lennart Poettering 2009-12-20 14:57:56 UTC
*** Bug 541420 has been marked as a duplicate of this bug. ***

Comment 6 Lennart Poettering 2009-12-20 14:59:07 UTC
*** Bug 541359 has been marked as a duplicate of this bug. ***

Comment 7 Owen Taylor 2009-12-20 15:14:32 UTC
*** Bug 549134 has been marked as a duplicate of this bug. ***

Comment 8 Michal Schmidt 2009-12-20 17:37:30 UTC
(In reply to comment #7)
> *** Bug 549134 has been marked as a duplicate of this bug. ***  

That one has a different backtrace. The assertion failure was in pa_pstream_send_tagstruct_with_creds, not in pa_mutex_unlock like it was in all the others. Not sure if it is really duplicate.

And all the other duplicates are from i686.

Comment 9 Michal Schmidt 2009-12-20 17:38:36 UTC
(In reply to comment #2)
> Yes, verified. Either the kernel or glibc are at fault here.

Lennart,
how do you verify it? Do you have a simple test case?

Comment 10 Bill Davidsen 2009-12-21 02:57:25 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > *** Bug 549134 has been marked as a duplicate of this bug. ***  
> 
> That one has a different backtrace. The assertion failure was in
> pa_pstream_send_tagstruct_with_creds, not in pa_mutex_unlock like it was in all
> the others. Not sure if it is really duplicate.
> 
> And all the other duplicates are from i686.  

I looked at these since my report was classified as a duplicate as well, and I'm uncertain whether this is just damage in another place from a common problem, or whether all metacity bugs were classified as duplicates. On initial reading I'm with you; I don't think 549134 is the same thing.

I do think that what we are all seeing on i686 is more common and should be addressed first; then 549134 can be retested.

Comment 11 Lennart Poettering 2009-12-21 11:10:28 UTC
(In reply to comment #9)
> (In reply to comment #2)
> > Yes, verified. Either the kernel or glibc are at fault here.
> 
> Lennart,
> how do you verify it? Do you have a simple test case?  

The PA dev tree contains a relatively simple example, which I used, but it's not exactly trivial to compile because the dev tree pulls in quite a few dependencies.

Comment 12 Lennart Poettering 2009-12-21 11:12:50 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > *** Bug 549134 has been marked as a duplicate of this bug. ***  
> 
> That one has a different backtrace. The assertion failure was in
> pa_pstream_send_tagstruct_with_creds, not in pa_mutex_unlock like it was in all
> the others. Not sure if it is really duplicate.

Oops. Good catch. I have split this up again now. I got confused by the backtrace since it also included an _unlock() call...

Comment 13 Lennart Poettering 2009-12-21 12:18:19 UTC
*** Bug 549247 has been marked as a duplicate of this bug. ***

Comment 14 Lennart Poettering 2009-12-21 12:20:18 UTC
bug 549247 suggests glibc needs fixing, not the kernel. Tentatively reassigning.

Comment 15 Bruno Wolff III 2009-12-21 16:23:45 UTC
I also tried going back to an older glibc. Downgrading to glibc-2.11.90-3.i686 (and the other packages from that src rpm) got sound working again.

Comment 16 Michal Schmidt 2009-12-21 19:31:57 UTC
Created attachment 379689 [details]
simple testcase

I was able to reproduce in an i686 KVM guest with F12 + glibc-2.11.90-4 from Rawhide. Here's a minimal testcase, loosely based on thread-mainloop-test.c from Pulseaudio.
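The attachment itself is not reproduced here. As a rough idea of what this kind of testcase exercises, here is a hypothetical ping-pong between two threads sharing a PI mutex and one condition variable; it is not Michal's attachment, and whether this exact loop triggers the i686 regression is untested. After every pthread_cond_wait() the caller must own the mutex again, so each pthread_mutex_unlock() should return 0; run_pingpong() counts the calls that do not.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical sketch; NOT the attached testcase (attachment 379689). */
static pthread_mutex_t lock;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int turn;            /* 1: worker may proceed, 0: main thread may proceed */
static int iterations;
static atomic_int failures; /* number of non-zero pthread_mutex_unlock() results */

/* Request a priority-inheritance mutex, falling back to a plain one if unsupported. */
static int init_pi_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int r = pthread_mutexattr_init(&attr);
    if (r != 0)
        return r;
    r = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (r == 0)
        r = pthread_mutex_init(m, &attr);
    if (r == ENOTSUP) {
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_NONE);
        r = pthread_mutex_init(m, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return r;
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < iterations; i++) {
        pthread_mutex_lock(&lock);
        while (turn != 1)
            pthread_cond_wait(&cond, &lock); /* must return with `lock` owned */
        turn = 0;
        pthread_cond_signal(&cond);
        if (pthread_mutex_unlock(&lock) != 0)
            atomic_fetch_add(&failures, 1);
    }
    return NULL;
}

/* Returns the number of failed unlocks (0 is the correct result), or -1 on setup errors. */
static int run_pingpong(int n) {
    pthread_t t;

    assert(n > 0);
    iterations = n;
    turn = 0;
    atomic_store(&failures, 0);
    if (init_pi_mutex(&lock) != 0)
        return -1;
    if (pthread_create(&t, NULL, worker, NULL) != 0) {
        pthread_mutex_destroy(&lock);
        return -1;
    }
    for (int i = 0; i < n; i++) {
        pthread_mutex_lock(&lock);
        turn = 1;
        pthread_cond_signal(&cond);
        while (turn != 0)
            pthread_cond_wait(&cond, &lock); /* must return with `lock` owned */
        if (pthread_mutex_unlock(&lock) != 0)
            atomic_fetch_add(&failures, 1);
    }
    pthread_join(t, NULL);
    pthread_mutex_destroy(&lock);
    return atomic_load(&failures);
}
```

Usage: call it from main, e.g. `int main(void) { return run_pingpong(100000) == 0 ? 0 : 1; }`, and build with `gcc -pthread`. Because the two threads wait for opposite values of `turn`, at most one of them is ever blocked on the condvar, so the loop cannot deadlock.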

Comment 17 Adam Williamson 2009-12-22 15:48:00 UTC
This should block beta as basic sound functionality is a beta criterion.

Comment 18 Rahul Sundaram 2009-12-29 14:34:36 UTC
*** Bug 514060 has been marked as a duplicate of this bug. ***

Comment 19 Michal Schmidt 2009-12-30 16:45:15 UTC
Created attachment 380966 [details]
possible fix

glibc-2.11.90-4 introduced requeue-PI support on i386 (x86_64 already had it). It seems the problem is in the new code.
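Loose background: the Linux kernel provides the futex operations FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI, which let a waker move condition-variable waiters directly onto the futex of a PI mutex instead of waking them all to compete for it. The hypothetical sketch below parks several threads on one condvar guarded by a PI mutex and wakes them with a single pthread_cond_broadcast(); each woken thread must end up owning the mutex, so every unlock should succeed. Whether glibc takes the requeue path for this exact pattern is an assumption, and NWAITERS and the function names are invented for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>

#define NWAITERS 8 /* arbitrary number of waiting threads */

static pthread_mutex_t mtx;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int ready;     /* waiters that have reached the wait loop */
static int gate_open; /* predicate the waiters sleep on */
static atomic_int bad_unlocks;

/* Request a priority-inheritance mutex, falling back to a plain one if unsupported. */
static int init_pi_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int r = pthread_mutexattr_init(&attr);
    if (r != 0)
        return r;
    r = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (r == 0)
        r = pthread_mutex_init(m, &attr);
    if (r == ENOTSUP) {
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_NONE);
        r = pthread_mutex_init(m, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return r;
}

static void *waiter(void *arg) {
    (void)arg;
    pthread_mutex_lock(&mtx);
    ready++;
    pthread_cond_broadcast(&cv);      /* tell the main thread one more waiter is here */
    while (!gate_open)
        pthread_cond_wait(&cv, &mtx); /* must return with `mtx` owned */
    if (pthread_mutex_unlock(&mtx) != 0)
        atomic_fetch_add(&bad_unlocks, 1);
    return NULL;
}

/* Returns the number of failed unlocks (0 is correct), or -1 on setup errors. */
static int run_broadcast(void) {
    pthread_t t[NWAITERS];
    int created = 0;

    ready = 0;
    gate_open = 0;
    atomic_store(&bad_unlocks, 0);
    if (init_pi_mutex(&mtx) != 0)
        return -1;
    while (created < NWAITERS && pthread_create(&t[created], NULL, waiter, NULL) == 0)
        created++;

    pthread_mutex_lock(&mtx);
    while (ready < created)           /* wait until every waiter is blocked on cv */
        pthread_cond_wait(&cv, &mtx);
    assert(ready == created);
    gate_open = 1;
    pthread_cond_broadcast(&cv);      /* wake all waiters at once */
    if (pthread_mutex_unlock(&mtx) != 0)
        atomic_fetch_add(&bad_unlocks, 1);

    for (int i = 0; i < created; i++)
        pthread_join(t[i], NULL);
    pthread_mutex_destroy(&mtx);
    return created == NWAITERS ? atomic_load(&bad_unlocks) : -1;
}
```

The main thread only opens the gate while holding the mutex and after all waiters have checked in, so every waiter is inside pthread_cond_wait() when the broadcast happens.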

This is the upstream commit:
http://sourceware.org/git/?p=glibc.git;a=commit;h=75956694f3f80a1c32389c95069641f52c236c8b

I reviewed it and I believe I found the bug in pthread_cond_wait. I am attaching a possible fix. So far it's completely untested; I'm building glibc with it now.

Comment 20 Michal Schmidt 2009-12-30 21:56:19 UTC
Created attachment 381009 [details]
Fix pthread_cond_wait with requeue-PI on i386

The previous patch caused a segfault. Here's a new one. It works for me. It fixes pthread_cond_timedwait too, though I tested only pthread_cond_wait so far.
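Since only pthread_cond_wait had been tested at this point, a companion check for pthread_cond_timedwait might look like the hypothetical sketch below. It covers a wait that times out and a wait that is signalled before its deadline; in both cases the PI mutex must be owned again when the call returns, so the following unlock must succeed. The helper names and the 50 ms / 5 s deadlines are arbitrary choices, not taken from the patch or its tests.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t tmtx;
static pthread_cond_t tcond = PTHREAD_COND_INITIALIZER;
static int flag;

/* Request a priority-inheritance mutex, falling back to a plain one if unsupported. */
static int init_pi_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int r = pthread_mutexattr_init(&attr);
    if (r != 0)
        return r;
    r = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (r == 0)
        r = pthread_mutex_init(m, &attr);
    if (r == ENOTSUP) {
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_NONE);
        r = pthread_mutex_init(m, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return r;
}

/* Absolute CLOCK_REALTIME deadline `ms` ms from now (the clock of a default condvar). */
static struct timespec deadline_in_ms(long ms) {
    struct timespec ts;
    assert(ms >= 0);
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += ms / 1000;
    ts.tv_nsec += (ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec += 1;
        ts.tv_nsec -= 1000000000L;
    }
    return ts;
}

static void *signaller(void *arg) {
    (void)arg;
    pthread_mutex_lock(&tmtx);
    flag = 1;
    pthread_cond_signal(&tcond);
    pthread_mutex_unlock(&tmtx);
    return NULL;
}

/* Returns 0 if both cases behave, 1 or 2 for the failing case, -1 on setup errors. */
static int run_timedwait_checks(void) {
    struct timespec ts;
    pthread_t t;
    int rc = 0, unlock_rc;

    if (init_pi_mutex(&tmtx) != 0)
        return -1;

    /* Case 1: nobody signals, so the wait times out; tmtx must still be owned. */
    flag = 0;
    ts = deadline_in_ms(50);
    pthread_mutex_lock(&tmtx);
    while (!flag && rc == 0)
        rc = pthread_cond_timedwait(&tcond, &tmtx, &ts);
    unlock_rc = pthread_mutex_unlock(&tmtx);
    if (rc != ETIMEDOUT || unlock_rc != 0)
        return 1;

    /* Case 2: another thread signals long before the deadline. */
    rc = 0;
    ts = deadline_in_ms(5000);
    pthread_mutex_lock(&tmtx);
    if (pthread_create(&t, NULL, signaller, NULL) != 0) {
        pthread_mutex_unlock(&tmtx);
        return -1;
    }
    while (!flag && rc == 0)
        rc = pthread_cond_timedwait(&tcond, &tmtx, &ts);
    unlock_rc = pthread_mutex_unlock(&tmtx);
    pthread_join(t, NULL);
    pthread_mutex_destroy(&tmtx);
    return (rc == 0 && flag && unlock_rc == 0) ? 0 : 2;
}
```

The main thread holds the mutex while creating the signaller, so the signaller cannot set the flag until pthread_cond_timedwait() has released the mutex. That ordering means case 2 always goes through a real wakeup rather than skipping the wait.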

Comment 21 Michal Schmidt 2009-12-30 23:03:44 UTC
Scratch build of glibc with the patch in Koji:
http://kojipkgs.fedoraproject.org/scratch/michich/task_1896278/

Looks like I broke something else though. From the build.log:
tst-robustpi8: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.
Didn't expect signal from child: got `Aborted'

Comment 22 Owen Taylor 2010-01-05 18:48:18 UTC
*** Bug 552512 has been marked as a duplicate of this bug. ***

Comment 23 Owen Taylor 2010-01-05 18:48:24 UTC
*** Bug 552553 has been marked as a duplicate of this bug. ***

Comment 24 Owen Taylor 2010-01-05 18:48:32 UTC
*** Bug 552590 has been marked as a duplicate of this bug. ***

Comment 25 Owen Taylor 2010-01-05 18:48:37 UTC
*** Bug 552595 has been marked as a duplicate of this bug. ***

Comment 26 Michal Schmidt 2010-01-08 16:10:45 UTC
(In reply to comment #21)
> Scratch build of glibc with the patch in Koji:
> http://kojipkgs.fedoraproject.org/scratch/michich/task_1896278/
> 
> Looks like I broke something else though. From the build.log:
> tst-robustpi8: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion
> `(-(e)) != 3 || !robust' failed.
> Didn't expect signal from child: got `Aborted'  

I could neither reproduce this testsuite failure locally nor did it occur in another try in Koji. Dinakar Guniguntala (the author of the requeue-PI patch for i386) found nothing wrong with my fix.

Did anyone test the Koji build? Did it fix the bug? Were there any problems?
It seems to have expired already, but here's the repeated one:
http://kojipkgs.fedoraproject.org/scratch/michich/task_1904693/

Comment 27 Ola Thoresen 2010-01-08 20:06:47 UTC
Just tested the latest build, and at least YouTube videos work fine in Firefox again.

glibc-devel-2.11.90-4.m3.i686
glibc-common-2.11.90-4.m3.i686
glibc-debuginfo-2.11.90-4.i686
glibc-headers-2.11.90-4.m3.i686
glibc-2.11.90-4.m3.i686

Comment 28 Lennart Poettering 2010-01-08 20:55:22 UTC
*** Bug 553756 has been marked as a duplicate of this bug. ***

Comment 29 Scorporat 2010-01-08 22:04:54 UTC
Recompiled glibc-2.11.90-4 with the last patch; it works fine for me.

Comment 30 Stepan Kasal 2010-01-08 22:15:23 UTC
(In reply to comment #26)
> Did anyone test the Koji build? Did it fix the bug? Were there any problems?
> It seems to have expired already, but here's the repeated one:
> http://kojipkgs.fedoraproject.org/scratch/michich/task_1904693/  

I witness that this build of glibc* fixes the problem for me.

Comment 31 Dodji Seketeli 2010-01-10 12:56:05 UTC
(In reply to comment #30)
> > Did anyone test the Koji build? Did it fix the bug? Were there any problems?
> > It seems to have expired already, but here's the repeated one:
> > http://kojipkgs.fedoraproject.org/scratch/michich/task_1904693/  
> 
> I witness that this build of glibc* fixes the problem for me.    

Yes, it fixed it for me too. I am currently running Rawhide with those RPMs on i686 just fine.

Comment 32 Lennart Poettering 2010-01-11 22:39:46 UTC
*** Bug 552544 has been marked as a duplicate of this bug. ***

Comment 33 Owen Taylor 2010-01-14 21:12:12 UTC
*** Bug 555262 has been marked as a duplicate of this bug. ***

Comment 34 Michal Schmidt 2010-01-16 13:37:14 UTC
The patch is now applied upstream:
http://sourceware.org/git/?p=glibc.git;a=commit;h=893549c5a06956d2559391a3ffdeb6ded53b65c0