Bug 513854

Summary:	Assertion 'pthread_mutex_unlock(&m->mutex) == 0' failed at pulsecore/mutex-posix.c:108, function pa_mutex_unlock(). Aborting.
Product:	[Fedora] Fedora	Reporter:	Tom London <selinux>
Component:	glibc	Assignee:	Andreas Schwab <schwab>
Status:	CLOSED RAWHIDE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	low
Version:	rawhide	CC:	ben.lewis, drepper, jakub, jmccann, lkundrak, lpoetter, mclasen, schwab, valdis.kletnieks, walters, wtogami, yaneti, zkabelac
Target Milestone:	---	Keywords:	Reopened
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-07-30 21:25:43 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	507676

Description Tom London 2009-07-26 16:20:16 UTC

Description of problem:
After updating to newest gcc/glibc, I am getting this message when I start applications that use sound:

Assertion 'pthread_mutex_unlock(&m->mutex) == 0' failed at pulsecore/mutex-posix.c:108, function pa_mutex_unlock(). Aborting.

So far, I've seen this with rhythmbox, audacity, qemu-kvm, totem, ...

This a side effect of the "f12 rebuild"?  Will this be fixed when the pulseaudio packages are rebuilt?

Version-Release number of selected component (if applicable):
pulseaudio-libs-glib2-0.9.16-2.test2.fc12.x86_64
glibc-headers-2.10.90-7.1.x86_64
glibc-devel-2.10.90-7.1.x86_64
pulseaudio-libs-0.9.16-2.test2.fc12.x86_64
glibc-common-2.10.90-7.1.x86_64
pulseaudio-debuginfo-0.9.16-2.test2.fc12.x86_64
pulseaudio-utils-0.9.16-2.test2.fc12.x86_64
pulseaudio-module-gconf-0.9.16-2.test2.fc12.x86_64
pulseaudio-0.9.16-2.test2.fc12.x86_64
glibc-2.10.90-7.1.x86_64
pulseaudio-module-bluetooth-0.9.16-2.test2.fc12.x86_64
gcc-4.4.1-3.x86_64
pulseaudio-module-x11-0.9.16-2.test2.fc12.x86_64


How reproducible:
Every time, all the time.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Tom London 2009-07-26 16:27:44 UTC

Didn't provide versions of failing apps:

rhythmbox-0.12.3-1.fc12.x86_64
totem-2.27.2-1.fc12.x86_64
audacity-1.3.8-0.2.betamp3.fc12.x86_64
qemu-kvm-0.10.50-13.kvm88.fc12.x86_64

Comment 2 Tom London 2009-07-26 18:17:53 UTC

I rebuilt audacity locally, and still get the same error:

[tbl@tlondon audacity-src-1.3.8]$ ./audacity
Assertion 'pthread_mutex_unlock(&m->mutex) == 0' failed at pulsecore/mutex-posix.c:108, function pa_mutex_unlock(). Aborting.

Also, I notice 2 "run away" processes: 

[root@tlondon ~]# top

top - 11:14:57 up  2:08,  4 users,  load average: 2.07, 2.35, 2.67
Tasks: 168 total,   3 running, 165 sleeping,   0 stopped,   0 zombie
Cpu(s): 98.0%us,  1.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   3884452k total,  2298684k used,  1585768k free,    90604k buffers
Swap:  5963768k total,        0k used,  5963768k free,  1367748k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 1628 tbl       20   0  519m 6676 5332 R 97.9  0.2 116:44.18 canberra-gtk-pl    
 1654 tbl       20   0  441m  11m 9276 R 97.2  0.3  97:06.87 gpk-update-icon    
 1292 root      20   0  292m  75m  12m S  1.0  2.0   3:22.12 Xorg               
 2007 tbl       20   0  457m  18m 9.9m S  1.0  0.5   1:13.39 npviewer.bin       
29782 root      20   0 13020 1192  864 R  1.0  0.0   0:00.22 top   

The first is canberra-gtk-play, the second is gpk-update-icon.  Not sure the second is related to this....

Comment 3 Zdenek Kabelac 2009-07-27 11:47:11 UTC

I've tried downgrade to version 0.9.15 which used to work somehow before...

With this version 0.9.15 my desktop doesn't crash for now (like with 0.9.16) - but i.e. mplayer must be used with  -ao oss (and changing volume slider via aumix) to get some sound out of my machine.

This pulseaudio thing becomes really horrible nightmare.....

It could not be easily purged from my system...

Comment 4 Tom London 2009-07-27 13:26:51 UTC

I've recovered "working pulseaudio and sound apps" by downgrading glibc and gcc packages:

cpp-4.4.1-1.x86_64.rpm              libgcc-4.4.1-1.x86_64.rpm
gcc-4.4.1-1.x86_64.rpm              libgcj-4.4.1-1.x86_64.rpm
gcc-c++-4.4.1-1.x86_64.rpm          libgfortran-4.4.1-1.x86_64.rpm
gcc-gfortran-4.4.1-1.x86_64.rpm     libgomp-4.4.1-1.x86_64.rpm
glibc-2.10.90-3.x86_64.rpm          libstdc++-4.4.1-1.x86_64.rpm
glibc-common-2.10.90-3.x86_64.rpm   libstdc++-devel-4.4.1-1.x86_64.rpm
glibc-devel-2.10.90-3.x86_64.rpm    nscd-2.10.90-3.x86_64.rpm
glibc-headers-2.10.90-3.x86_64.rpm

With this, pulseaudio functions great for me....

Comment 5 Lennart Poettering 2009-07-27 14:04:48 UTC

Assigning to glibc hence.

Hey you, glibc/gcc folks, you broke mutexes!

Comment 6 Lennart Poettering 2009-07-27 14:07:12 UTC

Also see bug 497650

Comment 7 Jakub Jelinek 2009-07-27 16:51:51 UTC

Have you verified that you are never calling pthread_mutex_unlock in a different thread than the one that actually acquired the mutex?  For PI mutexes EPERM in pthread_mutex_unlock is then returned (similarly for error checking mutex).
There haven't been any changes in the PI mutex glibc code for months btw., but most of the work is done in the kernel, not glibc.

Comment 8 Valdis Kletnieks 2009-07-28 00:31:58 UTC

Am seeing hangs in firefox (3.6b1 nightly from last week), top of the stack looks like this:

(gdb) where
#0  0x00007fbfee69a314 in pa_tls_get () from /usr/lib64/libpulsecommon-0.9.15.so
#1  0x00007fbfee699b52 in ?? () from /usr/lib64/libpulsecommon-0.9.15.so
#2  0x00007fbfee699fe5 in pa_thread_self () from /usr/lib64/libpulsecommon-0.9.15.so
#3  0x00007fbfee8e473b in ?? () from /usr/lib64/libpulse.so.0
#4  0x00007fbfee8e5133 in pa_threaded_mainloop_wait () from /usr/lib64/libpulse.so.0
#5  0x00007fbfeeafe832 in pulse_driver_open () from /usr/lib64/libcanberra-0.14/libcanberra-pulse.so
#6  0x0000003a5400bc52 in ?? () from /usr/lib64/libcanberra.so.0
#7  0x0000003a540034a8 in ?? () from /usr/lib64/libcanberra.so.0
#8  0x000000000001183f in ?? ()
#9  0x00007fbff8b8a000 in ?? ()
#10 0x00007fbff0d09c40 in ?? ()
#11 0x0000003a54003cb5 in ca_context_play_full () from /usr/lib64/libcanberra.so.0
#12 0x0000003a54004060 in ca_context_play () from /usr/lib64/libcanberra.so.0

And we just spin our wheels at 100% user-mode CPU after that. Was working with glibc-2.10.90-3.x86_64,
but broke today with  glibc-2.10.90-7.1.x86_64.

For what it's worth, I'm pretty sure it's not the kernel (am running a self-compiled 2.6.31-rc3-mmotm0716 here), because firefox was working this morning, and broke after 'yum update', and there's been no reboots in the meantime.

Comment 9 Valdis Kletnieks 2009-07-28 03:55:17 UTC

Quick-n-dirty *partial* workaround:

% export CANBERRA_DRIVER="null"
% your_busted_pulseaudio_program_here

Not a total fix, but will at least keep canberra from triggering the problem behind the program's back.  Programs invoking pulseaudio directly will still have issues, of course.

(Ended up having to grovel through the libcanberra sources to find that...)

Comment 10 Jakub Jelinek 2009-07-28 08:26:23 UTC

As shown in #512121, this is nothing new in F12.

*** This bug has been marked as a duplicate of bug 512121 ***

Comment 11 Jakub Jelinek 2009-07-28 15:36:50 UTC

Reopening, as this is a different problem.
Should happen on x86-64 only ATM, the problem is that with 2.6.31ish kernels where FUTEX_WAIT_REQUEUE_PI/FUTEX_CMP_REQUEUE_PI futex ops are newly supported the kernel locks mutex->__data.__locks, but nothing already updates mutex->__data.__owner to the current thread id and mutex->__data.__count to 1,
so the next pthread_mutex_unlock fails when using recursive PI mutexes.

Why does PA do that btw?  You need to be extremely careful with those only to call pthread_cond_*wait with the mutex lock nesting exactly 1, otherwise really bad things happen.

Comment 12 Lennart Poettering 2009-07-28 16:16:42 UTC

(In reply to comment #11)

> Why does PA do that btw?  You need to be extremely careful with those only to
> call pthread_cond_*wait with the mutex lock nesting exactly 1, otherwise really
> bad things happen.  

Hmm? PA does what exactly?

Is that a question about recursive use of mutexes? I am pretty sure that libcanberra only takes the lock once.

The lock is recursive only because we export it pretty much directly to the clients, and in some situations it did make sense to have it recursive for that.

Comment 13 Lennart Poettering 2009-07-28 16:39:39 UTC

So let me clarify this: PA creates that mutex as PI and as recursive since it is exported to the caller almost directly. And if the caller uses RT then he wants PI and in some other cases he might want a recursive lock. However, in most cases it's completely safe to go without either.

For testing I have now disabled PI for the mutex in question and things work correctly again.

I see some other issue here, which is probably related. In some situations pthread_cond_wait() returns with nothing having happened, and then causes us to enter a spin loop since we call cond_wait in a loop.

Comment 14 Lennart Poettering 2009-07-28 16:45:11 UTC

(In reply to comment #13)

> I see some other issue here, which is probably related. In some situations
> pthread_cond_wait() returns with nothing having happened, and then causes us to
> enter a spin loop since we call cond_wait in a loop.  

Tracking that in bug 513629.

Comment 15 Colin Walters 2009-07-30 14:29:14 UTC

(In reply to comment #14)
> (In reply to comment #13)
> 
> > I see some other issue here, which is probably related. In some situations
> > pthread_cond_wait() returns with nothing having happened, and then causes us to
> > enter a spin loop since we call cond_wait in a loop.  
> 
> Tracking that in bug 513629.  

pthread_cond_wait is defined to allow spurious returns.  See:

http://www.opengroup.org/onlinepubs/007908775/xsh/pthread_cond_timedwait.html

"When using condition variables there is always a boolean predicate involving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_wait() or pthread_cond_timedwait() functions may occur. Since the return from pthread_cond_wait() or pthread_cond_timedwait() does not imply anything about the value of this predicate, the predicate should be re-evaluated upon such return."

Comment 16 Colin Walters 2009-07-30 15:01:08 UTC

Basically any time you call pthread_cond_wait, it must be in a loop, and you have to pair a call to pthread_mutex_lock with it.  The pattern looks like this:

pthread_mutex_lock (&lock);
while (!my_condition)
  pthread_cond_wait (&cv, &lock);

So it looks like the code in thread-mainloop.c:pa_threaded_mainloop_signal

    if (wait_for_accept && m->n_waiting > 0)
        pa_cond_wait(m->accept_cond, m->mutex);

needs to be a while loop.  But the next one in pa_threaded_mainloop_wait I'm less sure how it's supposed to work.

I'm not able to immediately wrap my head around the design of the whole file, but it looks like pulse needs to take this into account.

Comment 17 Lennart Poettering 2009-07-30 20:51:21 UTC

(In reply to comment #15)
> (In reply to comment #14)
> > (In reply to comment #13)
> > 
> > > I see some other issue here, which is probably related. In some situations
> > > pthread_cond_wait() returns with nothing having happened, and then causes us to
> > > enter a spin loop since we call cond_wait in a loop.  
> > 
> > Tracking that in bug 513629.  
> 
> pthread_cond_wait is defined to allow spurious returns.  See:
> 
> http://www.opengroup.org/onlinepubs/007908775/xsh/pthread_cond_timedwait.html
> 
> "When using condition variables there is always a boolean predicate involving
> shared variables associated with each condition wait that is true if the thread
> should proceed. Spurious wakeups from the pthread_cond_wait() or
> pthread_cond_timedwait() functions may occur. Since the return from
> pthread_cond_wait() or pthread_cond_timedwait() does not imply anything about
> the value of this predicate, the predicate should be re-evaluated upon such
> return."  

The _wait() is only called from with the mutex locked. The wait that matters here is the one in pa_threaded_mainloop_wait(), not the one in pa_threaded_mainloop_signal() which almost never used (but probably should be fixed)

Also, let's keep track of that in 513629.

Comment 18 Jakub Jelinek 2009-07-30 21:04:27 UTC

Of course _wait() has to be always called with the mutex locked, that's the requirement.  The spurious wakeups may happen even then, so if pa doesn't loop checking the condition, it is buggy.

Comment 19 Lennart Poettering 2009-07-30 21:16:57 UTC

PA doesn't loop. The _wait() is exported 1:1 to the applications as part of the API. And the software I wrote does all the loops. Including libcanberra which is the software in question here.

You can easily verify this in the sources:

http://git.0pointer.de/?p=libcanberra.git;a=blob;f=src/pulse.c;h=2a61c8fbc5d3a418f7cfd71d45bf1dab17985645;hb=HEAD#l375

Look for the _wait()'s there and you'll see that they are all inside of loops and _lock() called before that loop and _unlock() after. Really, that's not the problem here. The problem is apparently that _wait() does not actually wait for some reasons and we enter a spin loop.

Comment 20 Ulrich Drepper 2009-07-30 21:25:43 UTC

Should be fixed in rawhide (2.10.90-11).  If not, reopen.

Comment 21 Tom London 2009-07-30 22:32:11 UTC

Just for the record, I updated to 2.10.90-11 and rebooted.

No longer get the error, and sound/pulseaudio apps now work.

Comment 22 Valdis Kletnieks 2009-07-31 15:46:08 UTC

Confirming - 90-11 eliminates the assert failures in Pulseaudio library calls, and seems to have fixed the Firefox hang (which was almost certainly bug #513629).