497650 – mutexes are broken on PAE kernels?

Bug 497650 - mutexes are broken on PAE kernels?

Summary: mutexes are broken on PAE kernels?

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	11
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2009-04-25 16:13 UTC by Craig Robson
Modified:	2010-06-28 12:12 UTC (History)
CC List:	13 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2010-06-28 12:12:45 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
lspci -v (6.11 KB, text/plain) 2009-04-25 16:14 UTC, Craig Robson	no flags	Details
lsmod (1.41 KB, text/plain) 2009-04-25 16:14 UTC, Craig Robson	no flags	Details
play - Assertion 'c->defer_event == e' failed at pulsecore/socket-client.c:185, backtrace (28.22 KB, text/plain) 2009-05-01 01:54 UTC, Craig Robson	no flags	Details
ogg123 - Assertion 'pthread_mutex_unlock(&m->mutex) == 0' failed at pulsecore/mutex-posix.c:108, function pa_mutex_unlock(). Aborting. Full Backtrace (18.48 KB, text/plain) 2009-05-01 02:07 UTC, Craig Robson	no flags	Details
lspci -vvv (12.25 KB, text/plain) 2009-05-18 19:03 UTC, Benjamin Lewis	no flags	Details
Output of alsa-info (17.43 KB, text/plain) 2009-05-18 19:04 UTC, Benjamin Lewis	no flags	Details
View All

Description Craig Robson 2009-04-25 16:13:29 UTC

Description of problem:
Audio is not working on my Via device.  It worked fine in Fedora 10.

80:01.0 Audio device: VIA Technologies, Inc. VT1708/A [Azalia HDAC] (VIA High Definition Audio Controller) (rev 10)
        Subsystem: VIA Technologies, Inc. VT1708/A [Azalia HDAC] (VIA High Definition Audio Controller)
        Flags: bus master, fast devsel, latency 0, IRQ 17
        Memory at 9fffc000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: HDA Intel
        Kernel modules: snd-hda-intel

How reproducible:
Always fails.

Steps to Reproduce:
1. Attempt to play audio file.
  
Actual results:
play 'x.ogg' 
Assertion 'c->defer_event == e' failed at pulsecore/socket-client.c:185, function connect_defer_cb(). Aborting.
Aborted


Expected results:
Beautiful music comes from speakers.

Comment 1 Craig Robson 2009-04-25 16:14:24 UTC

Created attachment 341321 [details]
lspci -v

Comment 2 Craig Robson 2009-04-25 16:14:44 UTC

Created attachment 341322 [details]
lsmod

Comment 3 Lennart Poettering 2009-04-27 01:36:23 UTC

Hmm, this is weird.

Which application triggers this? Does it happen with all applications?

Comment 4 Craig Robson 2009-04-27 02:21:31 UTC

Everything I've tried seems to trigger this.

totem and play, aplay,  die with the assert above.

ogg123 dies.  I just realized it is a different assert.
Assertion 'pthread_mutex_unlock(&m->mutex) == 0' failed at pulsecore/mutex-posix.c:108, function pa_mutex_unlock(). Aborting.
Aborted

Comment 5 Lennart Poettering 2009-04-27 19:35:25 UTC

Uh, so this crashes the client side, not the server side, right?

could you run those apps with the env var PULSE_LOG set 5?

This should give us a hint what exactly happens since it will log what the client does to stderr.

Is there anything unusual on your setup? Have you mounted some non-standard file system to /home? Changed any PA config options?

For the mutex issue, could you get me a stack trace?

Comment 6 Craig Robson 2009-04-27 20:45:23 UTC

Yes.  The client crashes.

/home is a local ext4 filesystem.

Here is the output with PULSE_LOG set.

[dj@musicbox ~]$ PULSE_LOG=5 totem 01\ -\ Fanfare\ for\ the\ Common\ Man.ogg 
Using shared memory pool with 1024 slots of size 64.0 KiB each, total size is 64.0 MiB, maximum usable slot size is 65496
Trying to connect to {441b376c66d66bf00116251149e80395}unix:/home/dj/.pulse/441b376c66d66bf00116251149e80395:runtime/native...
Assertion 'c->defer_event == e' failed at pulsecore/socket-client.c:185, function connect_defer_cb(). Aborting.
058ebc1e-f63d-200a-0fbbab31-24466c20 is dumped

[dj@musicbox ~]$ PULSE_LOG=5 play 01\ -\ Fanfare\ for\ the\ Common\ Man.ogg 
Using shared memory pool with 1024 slots of size 64.0 KiB each, total size is 64.0 MiB, maximum usable slot size is 65496
Trying to connect to {441b376c66d66bf00116251149e80395}unix:/home/dj/.pulse/441b376c66d66bf00116251149e80395:runtime/native...
Assertion 'c->defer_event == e' failed at pulsecore/socket-client.c:185, function connect_defer_cb(). Aborting.
Aborted

Comment 7 Craig Robson 2009-04-27 20:48:11 UTC

Mutex issue stack trace.

(gdb) bt
#0  0x009a4424 in __kernel_vsyscall ()
#1  0x0017d9d1 in raise () from /lib/libc.so.6
#2  0x0017f2d4 in abort () from /lib/libc.so.6
#3  0x00553ebc in pa_mutex_unlock () from /usr/lib/libpulsecommon-0.9.15.so
#4  0x0040381f in pa_threaded_mainloop_unlock () from /usr/lib/libpulse.so.0
#5  0x001147e8 in pa_simple_new () from /usr/lib/libpulse-simple.so.0
#6  0x001110b2 in ao_plugin_test () from /usr/lib/ao/plugins-2/libpulse.so
#7  0x0034b182 in ao_default_driver_id () from /usr/lib/libao.so.2
#8  0x0804dd3c in ogg_stream_pagein ()
#9  0x0804f552 in ogg_stream_pagein ()
#10 0x001688e6 in __libc_start_main () from /lib/libc.so.6
#11 0x0804a641 in ogg_stream_pagein ()

Comment 8 Lennart Poettering 2009-04-30 20:32:00 UTC

Hmm, this is really strange.

Any chance you can get me full backtraces over all threads of both issues?

i.e. "thread apply all bt full" in gdb, with debug symbols installed.

Thanks!

Comment 9 Craig Robson 2009-05-01 01:54:56 UTC

Created attachment 342030 [details]
play - Assertion 'c->defer_event == e' failed at pulsecore/socket-client.c:185,  backtrace

Attached backtrace for play.
 
Assertion 'c->defer_event == e' failed at pulsecore/socket-client.c:185,

Comment 10 Craig Robson 2009-05-01 02:07:15 UTC

Created attachment 342031 [details]
ogg123 - Assertion 'pthread_mutex_unlock(&m->mutex) == 0' failed at pulsecore/mutex-posix.c:108, function pa_mutex_unlock(). Aborting.  Full Backtrace

Comment 11 Benjamin Lewis 2009-05-18 19:03:24 UTC

Created attachment 344503 [details]
lspci -vvv

This issue is also occurring for me. As it stands, any application that is linked against pulseaudio-libs crashes the moment it tries to do anything related to pulseaudio. This occurs even the if the dæmon is uninstalled. 

I have attached the output of lspci -vvv and alsa-info from this machine in case it helps.

Some cursory probing suggests that this issues exists at least as far back as pulseaudio-libs-0.9.14-2.fc11.i386.rpm.

Comment 12 Benjamin Lewis 2009-05-18 19:04:15 UTC

Created attachment 344504 [details]
Output of alsa-info

Comment 13 Craig Robson 2009-05-18 19:25:55 UTC

I an able to play audio using the paplay command so not everything is completely broken.

Comment 14 Craig Robson 2009-05-24 01:41:16 UTC

I dug into the Assertion 'c->defer_event == e' failed at pulsecore/socket-client.c:185 assert a bit more.

It looks like there is a race between setting and using c->defer_event.

There are two threads will the following calls:

Thread 2 main() thread
socket-client.c: do_connect() 
        c->defer_event = c->mainloop->defer_new(c->mainloop, connect_defer_cb, c) 
                mainloop.c: mainloop_defer_new()
                        pa_mainloop_wakeup(e->mainloop);

Thread 1
socket-client.c: connect_defer_cb()
        pa_assert(c->defer_event == e);


The pa_mainloop_wakeup(e->mainloop) appears to be waking of the second thread which uses c->defer_event before it gets set.

Comment 15 Julien Puydt 2009-05-29 13:03:44 UTC

I confirm it's most probably a race condition : I just had ekiga hit that assert while running in valgrind -- and I had never seen it before.

Notice that I run ekiga pretty regularly in valgrind, so the fact that I'm hitting it now means it must be a recent problem.

(running debian here)

Comment 16 Julien Puydt 2009-05-29 13:46:54 UTC

I did more tests, and the problem shows in ekiga+valgrind only if I compile in the experimental gstreamer code.

That means two things :
1) the problem may not be that recent after all, since I hadn't touched that gstreamer code since long ;
2) the problem shows up when ekiga has threads slown down by valgrind and several pieces of code trying to access the device simultaneously (the usual device code and the gstreamer one).

So basically I can still confirm it's a race, but don't know how old the problem is.

Comment 17 Bug Zapper 2009-06-09 14:33:57 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 18 Craig Robson 2009-06-26 23:34:59 UTC

I added some print statements to the mutex lock and unlock code in pulsecore/mutex-posix.c.  It looks like a thread is able to acquire the mainloop lock even when the other thread had it locked.

I booted the non-PAE kernel and found that my problems only occur when using the PAE kernel.

Comment 19 Lennart Poettering 2009-06-29 18:48:14 UTC

This should probably be reassigned to the kernel then.

Comment 20 Lennart Poettering 2009-07-27 14:07:09 UTC

Also see bug 513854.

Comment 21 Alexandr Kara 2009-07-29 10:40:06 UTC

I am having this issue on a x86-64 kernel (2.6.31-0.94.rc4.fc12.x86_64).

Comment 22 Lennart Poettering 2009-07-31 15:24:57 UTC

(In reply to comment #21)
> I am having this issue on a x86-64 kernel (2.6.31-0.94.rc4.fc12.x86_64).  

This is about x86 PAE, not 64bit kernels.

Comment 23 Lennart Poettering 2009-07-31 15:28:00 UTC

dropping alpha blocker, since this is probably limited to f11.

Is anyoneone able to reproduce this on rawhide?

Comment 24 Dave Ludlow 2009-09-24 15:38:16 UTC

I am seeing the same error:

Assertion 'pthread_mutex_unlock(&m->mutex) == 0' failed at pulsecore/mutex-posix.c:108, function pa_mutex_unlock(). Aborting.
145	m_syswrap/syscall-amd64-linux.S: No such file or directory.
Could not find the frame base for "vgPlain_client_syscall".

I see this when I try to run an app that I am developing in valgrind (vifir).  The app does not use sound at all and is not linked to anything with "pulse" in the name.  I'm also using x86-64.

Here is my (not very useful) "thread apply all bt full" dump:

(gdb) thread apply all bt full

Thread 1 (process 28527):
#0  0x00000004076deb55 in ?? ()
No symbol table info available.
#1  0x0000000000005e4d in ?? ()
No symbol table info available.
#2  0x0000000038d6de60 in vgPlain_threads ()
No symbol table info available.
#3  0x000000040338bf40 in ?? ()
No symbol table info available.
#4  0x0000000000001080 in ?? ()
No symbol table info available.
#5  0x0000000038d6de50 in vgPlain_threads ()
No symbol table info available.
#6  0x000000321681c166 in ?? ()
No symbol table info available.
#7  0x0000000408f21000 in ?? ()
No symbol table info available.
#8  0x00000004093e9b20 in ?? ()
No symbol table info available.
#9  0x0000000000000001 in ?? ()
No symbol table info available.
#10 0x0000000000000002 in ?? ()
No symbol table info available.
#11 0x0000000000001070 in ?? ()
No symbol table info available.
#12 0x0000000000000000 in ?? ()
No symbol table info available.

My app is compiled "-O0 -g3" and pulseaudio-debuginfo is installed.  What can I do to make this dump more informative?

Comment 25 Luke Hutchison 2010-01-05 13:38:46 UTC

Re. comment 22: I also have the identical same problem on x86_64 so no, it's not just about PAE -- though on my system, some programs work (Flash plugin works; Skype does not).

Re. comment #23: Yes, I have this problem on rawhide, latest RPMs. (PA version is pulseaudio-0.9.21-3.fc13.x86_64)

Comment 26 vsingleton 2010-01-08 19:56:16 UTC

I have this same issue with rawhide:

# aplay -Dpulse /usr/share/sounds/purple/alert.wav
Assertion 'pthread_mutex_unlock(&m->mutex) == 0' failed at pulsecore/mutex-posix.c:108, function pa_mutex_unlock(). Aborting.
Aborted

kernel-PAE-2.6.32.2-15.fc13.i686
pulseaudio-0.9.21-3.fc13.i686

This causes gnome to lock up whenever a sound is played.
KDE is fine.  Only the app playing the sound is affected under KDE.

Any ideas how to work around this?

Comment 27 Michal Schmidt 2010-01-11 07:39:16 UTC

In current Rawhide on i686 what you're seeing is probably bug 548989.

Comment 28 Bug Zapper 2010-04-27 13:56:21 UTC

This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 29 Bug Zapper 2010-06-28 12:12:45 UTC

Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Note You need to log in before you can comment on or make changes to this bug.