Bug 813548 - alsa bug makes kernel kill pulseaudio sometimes
alsa bug makes kernel kill pulseaudio sometimes
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: alsa-lib (Show other bugs)
17
Unspecified Unspecified
unspecified Severity unspecified
: ---
: ---
Assigned To: Jaroslav Kysela
Fedora Extras Quality Assurance
RejectedBlocker
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-04-17 16:31 EDT by Kamil Páral
Modified: 2012-06-04 16:10 EDT (History)
13 users (show)

See Also:
Fixed In Version: kernel-3.3.7-1.fc17
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-04 16:10:55 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
pulseaudio -vvvv output (158.03 KB, text/plain)
2012-04-17 16:31 EDT, Kamil Páral
no flags Details
alsa-info output (30.14 KB, text/plain)
2012-04-27 04:05 EDT, Kamil Páral
no flags Details
dmesg (80.24 KB, text/plain)
2012-05-04 13:28 EDT, Kamil Páral
no flags Details
messages (1.14 MB, text/plain)
2012-05-04 13:29 EDT, Kamil Páral
no flags Details
dmesg (73.28 KB, text/plain)
2012-05-08 04:52 EDT, Kamil Páral
no flags Details
messages (134.83 KB, text/plain)
2012-05-08 04:52 EDT, Kamil Páral
no flags Details
messages with PA debug logging (681.13 KB, text/plain)
2012-05-08 15:14 EDT, Kamil Páral
no flags Details
dmesg with PA debug logging (78.34 KB, text/plain)
2012-05-08 15:15 EDT, Kamil Páral
no flags Details
tail of messages with PA debug level (10.92 KB, text/plain)
2012-05-10 18:21 EDT, Kamil Páral
no flags Details
Log from running pulseaudio (106.41 KB, text/plain)
2012-05-11 00:50 EDT, Braden McDaniel
no flags Details
Log from running pulseaudio (second failure mode) (134.61 KB, text/plain)
2012-05-11 00:51 EDT, Braden McDaniel
no flags Details

  None (edit)
Description Kamil Páral 2012-04-17 16:31:18 EDT
Description of problem:
During my desktop session pulseaudio dies every few minutes. I can recognize that easily because the volume notification icon disappears from gnome-shell top bar (and obviously sounds stop working). If I grep for pulseaudio, it is not running (i.e. it doesn't respawn automatically). When I run 'pulseaudio' manually, the gnome-shell icon is not restored, but sounds play again for a short while (10 minutes tops). Then pulseaudio quits again.

I have removed all ~/.pulse* files to have a clean profile and then started 'pulseaudio -vvvv'. It seems to start correctly and wait. After I play a single sound

$ paplay /usr/share/sounds/gnome/default/alerts/drip.ogg

pulseaudio terminates in 15-20 seconds. It doesn't seems to terminate _until_ I play that sound.

See attached output from pulseaudio -vvvv. I have marked the stop where it waited peacefully until I played the sound. Search for '################ HERE I PLAY A SOUND #########################'.

I don't see any crash messages in /var/log/messages, I don't see any weird messages there at all.

My hardware:
http://www.smolts.org/client/show/pub_6528457b-60ef-4c7b-b6de-629feeab3d53

00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 03)
	Subsystem: Lenovo ThinkPad T61
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin B routed to IRQ 47
	Region 0: Memory at fe220000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: snd_hda_intel


Version-Release number of selected component (if applicable):
pulseaudio-1.1-8.fc17.x86_64
pulseaudio-gdm-hooks-1.1-8.fc17.x86_64
pulseaudio-libs-1.1-8.fc17.x86_64
pulseaudio-libs-glib2-1.1-8.fc17.x86_64
pulseaudio-module-bluetooth-1.1-8.fc17.x86_64
pulseaudio-module-gconf-1.1-8.fc17.x86_64
pulseaudio-module-x11-1.1-8.fc17.x86_64
pulseaudio-utils-1.1-8.fc17.x86_64


How reproducible:
easily

Steps to Reproduce:
1. work with desktop normally, pulseaudio dies very often
Comment 1 Kamil Páral 2012-04-17 16:31:41 EDT
Created attachment 578138 [details]
pulseaudio -vvvv output
Comment 2 Kamil Páral 2012-04-17 16:35:30 EDT
I don't know whether this is related to just my sound card or whether it hits more machines, but I nominate this for F17 Blocker, because it makes the desktop quite unusable.
Comment 3 Brendan Jones 2012-04-17 16:37:08 EDT
Can you look at bug 813407 and see if the suggestion there helps?
Comment 4 Kamil Páral 2012-04-17 17:09:04 EDT
Bug 813407 seems similar, but I'm not sure whether it's the same. Workaround mentioned in bug 813407 comment 7 didn't work for me. Pulseaudio still crashes after I reboot, play a sound and wait a few minutes.
Comment 5 Ankur Sinha (FranciscoD) 2012-04-17 17:23:28 EDT
Hey Kamil,

Can you please run pulseaudio -vvvv and attach a gdb instance to it to try and generate a trace? (gdb --pid=PID, "c" to continue). You'll need to install the debuginfos before doing this: debuginfo-install pulseaudio

(I ran around for a few days with my bug, and these are the steps I was suggested to follow. The other bug could be a kind of reference.)

There are also quite a few tips here that solve most issues:
http://fedoraunity.org/Members/fenris02/pulseaudio-fixes-and-workarounds

Even the work around I've used is from that page. You could try out the other stuff and see if they help at all.

Thanks,
Ankur
Comment 6 Kamil Páral 2012-04-18 16:42:23 EDT
Ankur, thanks for pointers. First of all, I found a workaround that works for me. If I run rhythmbox and play music, pulseaudio doesn't terminate. So I have to play music all the time now :) When I quit rhythmbox, pulseaudio usually terminates in a matter of minutes.

But sometimes it just doesn't terminate, I don't know the reason. Anyway, I killed it and started manually pulseaudio -vvvv, then connected with gdb, and reproduced pulseaudio termination. gdb prints out just this:

(gdb) c
Continuing.
[New Thread 0x7f3d89b73700 (LWP 5763)]
[Thread 0x7f3d9114e700 (LWP 5678) exited]
[Thread 0x7f3d908f4700 (LWP 5679) exited]
[Thread 0x7f3d9c94d780 (LWP 5677) exited]
[Inferior 1 (process 5677) exited normally]
(gdb) thread apply all bt full
(gdb) bt
No stack.
(gdb) quit

I believe that means pulseaudio doesn't crash at all (program output also indicates that), therefore there is no backtrace. It terminates correctly. In your bugreport I see "Program terminated with signal SIGKILL, Killed.", so it's probably a different issue.
Comment 7 Adam Williamson 2012-04-20 16:47:30 EDT
Discussed at 2012-04-20 blocker review meeting: http://meetbot.fedoraproject.org/fedora-bugzappers/2012-04-20/fedora-bugzappers.2012-04-20-17.01.log.txt . This is a worrying issue, but we need more certainty on the causes and how wide the impact is likely to be in order to judge whether this is a blocker issue. We agreed to delay the decision by a week and see if more people start seeing this issue, and if Kamil and the PA team are able to diagnose it more precisely.
Comment 8 Kamil Páral 2012-04-27 04:05:47 EDT
Created attachment 580671 [details]
alsa-info output
Comment 9 Kamil Páral 2012-04-27 04:31:59 EDT
I have learned that when I ran pulseaudio manually, it is expected to terminate after playing a sound. I have to put -1 to exit-idle-time in /etc/pulse/daemon.conf to stop doing that. So that's not the issue. I'm trying to reproduce the core issue (PA quitting/crashing after system boot) to get some useful information.
Comment 10 Kamil Páral 2012-04-30 17:06:57 EDT
It took my quite a long time (two full days) to reproduce PA crash again. Every time I logged in, I attached gdb to pulseaudio process. Once it finally crashed, it didn't print out any stacktrace. Maybe I just don't know how to control gdb. Suggestions?

$ pgrep -fl pulse
1151 /usr/bin/pulseaudio --start
1168 /usr/libexec/pulse/gconf-helper
$ gdb -p 1151
<snip>
(gdb) c
Continuing.
[Thread 0x7f85c9d14700 (LWP 1166) exited]
[Thread 0x7f85bfffe700 (LWP 1167) exited]

Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb) thread apply all bt full
(gdb) bt
No stack.

It seems like it didn't crash, but something sent it a KILL signal.
Comment 11 Kamil Páral 2012-04-30 17:28:08 EDT
ABRT doesn't detect any crash either.
Comment 12 Kamil Páral 2012-05-03 08:16:37 EDT
There is definitely some problem with PA. But now that I know that some of my previous attempts were invalid, the real frequency of these problems dropped substantially. It seems to occur just once or twice a week for me. I have a hard time to get some interesting information out of it, because I have to wait a few days every time to get another chance. This low frequency might also be a reason for considering having it removed from the blocker list.
Comment 13 Lennart Poettering 2012-05-03 13:35:31 EDT
Problems like these are usually caused by broken device drivers (or drivers which lack the appropriate quirks to work around broken devices), where the timing doesn't work. Usually this results in PA creating too much load which in order to avoid a locked up system is then terminated by the kernel with SIGKILL, due to RLIMIT_RTTIME.

Please attach syslog/kmsg when this happens.

Also, please watch CPU load closely when this happens. Usually it should spike immediately before this happens.

I am quite sure that this is specific to a specific driver or device or combination, I don't think this should be a blocker.
Comment 14 Adam Williamson 2012-05-03 15:26:45 EDT
Discussed at 2012-05-03 blocker review meeting. We are happy to go with Lennart's comment #13 and agree this is not a blocker on the basis that it's too hardware-specific.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
Comment 15 Adam Williamson 2012-05-03 15:59:54 EDT
Correction to the above: we actually agreed to keep it on the proposed list for a week for observation.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
Comment 16 Kamil Páral 2012-05-04 13:27:43 EDT
Today I saw the problem again. PA terminated right after I plugged in my network cable.

There is a weird bug probably in Network Manager. If I have my wifi on, and plug in a network cable, half of the requests from 'ping' time out. Also any running graphical applications (like Firefox) can't resolve any host name. After I disable wifi everything starts to work smoothly.

So PA terminated right after I plugged in the network cable. "/usr/bin/pulseaudio --start" received SIGKILL, "/usr/libexec/pulse/gconf-helper" received SIGTERM. It seems it was directly related to that network event. Can inability to resolve some hostname terminate PA?

I am not sure whether I was switching cable and wifi in all the other times PA terminated for me. I don't really remember. But I don't think it was this immediate (plug in cable -> PA gone), I would have noticed that.

I tried to reproduce it several times, but I did not succeed. PA is now still running, no matter what I do.
Comment 17 Kamil Páral 2012-05-04 13:28:36 EDT
Created attachment 582168 [details]
dmesg

dmesg output right after I pluged in network cable and PA terminated
Comment 18 Kamil Páral 2012-05-04 13:29:09 EDT
Created attachment 582169 [details]
messages

messages output right after I pluged in network cable and PA terminated
Comment 19 Kamil Páral 2012-05-07 11:11:54 EDT
I have a hard time to reproduce this. It happens so scarcely. And because it seems a lot of other people are not affected by this, therefore it might be a specific hardware issue (or my specific use case of swapping cable and wireless sometimes), I propose to drop this from the blocker list.
Comment 20 Kamil Páral 2012-05-08 04:52:08 EDT
Created attachment 582922 [details]
dmesg

It happened again, very shortly after boot. I'm absolutely positive I didn't change any networking connections. Attaching dmesg and messages, I don't see any printout from pulseaudio except for starting messages.
Comment 21 Kamil Páral 2012-05-08 04:52:30 EDT
Created attachment 582924 [details]
messages
Comment 22 Kamil Páral 2012-05-08 15:14:46 EDT
Created attachment 583062 [details]
messages with PA debug logging

It happened again, but this time I had PA logging level set to debug. Again, no network changes, so I believe it is not related. Messages end like this:

May  8 13:33:35 localhost pulseaudio[1161]: [alsa-sink] alsa-sink.c: Device suspended...
May  8 13:33:47 localhost pulseaudio[1161]: [pulseaudio] client.c: Created 128 "Native client (UNIX socket client)"
May  8 13:33:47 localhost pulseaudio[1161]: [pulseaudio] protocol-native.c: Got credentials: uid=1000 gid=1000 success=1
May  8 13:33:47 localhost pulseaudio[1161]: [pulseaudio] module-stream-restore.c: Restoring volume for sink input sink-input-by-application-id:org.VideoLAN.VLC.
May  8 13:33:47 localhost pulseaudio[1161]: [pulseaudio] module-stream-restore.c: Restoring mute state for sink input sink-input-by-application-id:org.VideoLAN.VLC.
May  8 13:33:47 localhost pulseaudio[1161]: [alsa-sink] alsa-sink.c: Trying resume...
May  8 13:33:47 localhost pulseaudio[1161]: [alsa-sink] alsa-util.c: Trying to disable ALSA period wakeups, using timers only

And after that PA was gone. I don't see any error or warning there.
Comment 23 Kamil Páral 2012-05-08 15:15:08 EDT
Created attachment 583063 [details]
dmesg with PA debug logging
Comment 24 Kamil Páral 2012-05-10 16:35:52 EDT
Ford_Prefect on #pulseaudio IRC advised me that this bug could be caused by https://bugs.freedesktop.org/show_bug.cgi?id=47830

I tried both reproducers mentioned there (in comments 1 and 2), but they don't terminate my PA. Therefore I'm not sure whether my issue is the same or different.
Comment 25 Kamil Páral 2012-05-10 18:21:32 EDT
Created attachment 583688 [details]
tail of messages with PA debug level

Another "crash", another log (just tail of messages) in hope it will be useful. Again, I don't see anything wrong in the log.
Comment 26 Adam Williamson 2012-05-10 23:08:18 EDT
As things stand I'm -1 blocker on this. It still seems like too much of a corner case. One person has shown up on test@ who's possibly suffering from it, but that's all. And of course for installed systems it could be fixed with an update.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
Comment 27 Braden McDaniel 2012-05-11 00:50:19 EDT
Created attachment 583717 [details]
Log from running pulseaudio

I'm seeing something like this as well; but it's got nothing to do with cables being plugged in (or unplugged).

In fact, I'm not even sure I'm seeing a single bug, as I seem to have observed two failure modes (so far). In my case, pulseaudio tends to die when the totem plug-in is being used in a Web browser. But there's no particular video that seems to trigger it: sometimes it dies and sometimes it doesn't.

Here's a log from one of the failure modes.
Comment 28 Braden McDaniel 2012-05-11 00:51:55 EDT
Created attachment 583719 [details]
Log from running pulseaudio (second failure mode)

And here's a log showing what appears to be a second failure mode.
Comment 29 Arun Raghavan 2012-05-11 07:14:49 EDT
The first of those two failures looks similar to Kamil's -- the kernel is coming and killing of PA, which, as Lennart mentions, is likely because of an ALSA driver bug. The second is just PA going away after some idle time. It will be autospawned the next time an application needs it.

Since the problem is so infrequent, it would be really helpful to have a concrete test case. The I wrote up on the upstream bug (which just starts up audacity in a loop) is one example.

If that's hard, please try to get a kernel with the patch referenced in that upstream bug. There's a pretty decent chance that it will solve your problem.
Comment 30 Kamil Páral 2012-05-11 07:44:07 EDT
Reassigning to alsa then. Alsa developers, can you please direct us to a kernel version in Koji that would have this patch included?

http://mailman.alsa-project.org/pipermail/alsa-devel/2012-May/051678.html
Comment 31 Arun Raghavan 2012-05-11 10:27:13 EDT
Note: reproducing the bug with the audacity loop I suggested will also need you to change the line in /etc/pulse/default.pa from:

load-module module-suspend-on-idle

to:

load-module module-suspend-on-idle timeout=0

This will force the device to suspend as soon as a client goes away and thus makes the bug easier to trigger in the load-audacity-loop.
Comment 32 Adam Williamson 2012-05-11 20:39:25 EDT
Discussed at the 2012-05-11 blocker review meeting: http://meetbot.fedoraproject.org/fedora-bugzappers/2012-05-11/f17-final-blocker-review-meeting-5.2012-05-11-17.04.html . This was rejected as a blocker on the basis that it only appears to be hitting a few users, and can be fixed well enough with an update.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers
Comment 33 Kamil Páral 2012-05-13 15:36:28 EDT
(In reply to comment #31)
> Note: reproducing the bug with the audacity loop I suggested will also need you
> to change the line in /etc/pulse/default.pa from:
> 
> load-module module-suspend-on-idle
> 
> to:
> 
> load-module module-suspend-on-idle timeout=0
> 
> This will force the device to suspend as soon as a client goes away and thus
> makes the bug easier to trigger in the load-audacity-loop.

After following the above I was able to reproduce PA problem several times in a row using the audacity loop. Unfortunately not consistently, on next reboot I couldn't repeat it. But it is highly possible that the referenced alsa bug is really the cause of my issue.

Now we need to make alsa developers post a fixed kernel to fedora. Maybe I should change component to kernel, does anybody know?
Comment 34 Pavel Šimerda (pavlix) 2012-05-16 03:31:54 EDT
Same problem on Lenovo x61:

00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 03)
Comment 35 Stephen 2012-05-24 16:21:32 EDT
This affects my x61 thinkpad. I noticed it on F16 x64 with multiple kernels. Just swapped in a new hard drive so jumped straight to F17, and problem is still present.

I'd link to smolt but it's throwing 500's.

Here some spew from the log mentioning pulse:

May 24 08:57:06 yo rtkit-daemon[1046]: Successfully made thread 6140 of process 6140 (/usr/bin/pulseaudio) owned by '1000' high priority at nice level -11.
May 24 08:57:06 yo pulseaudio[6140]: [pulseaudio] pid.c: Stale PID file, overwriting.
May 24 08:57:06 yo rtkit-daemon[1046]: Successfully made thread 6150 of process 6140 (/usr/bin/pulseaudio) owned by '1000' RT at priority 5.
May 24 08:57:06 yo rtkit-daemon[1046]: Successfully made thread 6151 of process 6140 (/usr/bin/pulseaudio) owned by '1000' RT at priority 5.
May 24 08:57:06 yo pulseaudio[6140]: [pulseaudio] module-alsa-card.c: Failed to find a working profile.
May 24 08:57:06 yo pulseaudio[6140]: [pulseaudio] module.c: Failed to load module "module-alsa-card" (argument: "device_id="29" name="platform-thinkpad_acpi" card_name="alsa_card.platf
orm-thinkpad_acpi" namereg_fail=false tsched=yes fixed_latency_range=no ignore_dB=no deferred_volume=yes card_properties="module-udev-detect.discovered=1""): initialization failed.
Comment 36 Arun Raghavan 2012-05-24 22:16:40 EDT
The fix has landed in 3.4, 3.3.7, 3.2.18 and I believe 3.0.32 as well.
Comment 37 Pavel Šimerda (pavlix) 2012-05-27 10:00:22 EDT
Thanks for fixing this before F17 release.
Comment 38 Kamil Páral 2012-05-27 10:45:50 EDT
This hasn't been fixed on F17 final media, but it will be part of 0-day updates (kernel-3.3.7-1.fc17 is in updates repo now). I've been running on that kernel for two days now, I haven't seen the problem yet. Doesn't mean it's fixed, we will see after a week or two.
Comment 39 Kamil Páral 2012-06-04 16:10:55 EDT
I haven't seen the problem since I updated to kernel-3.3.7-1.fc17. I believe it is fixed now. Thanks everyone for helpful comments.

Note You need to log in before you can comment on or make changes to this bug.