Bug 425763

Summary: Stale files and/or directories prevent the pulseaudio daemon from loading.
Product: [Fedora] Fedora Reporter: Russ <admin>
Component: pulseaudioAssignee: Lennart Poettering <lpoetter>
Status: CLOSED DUPLICATE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: rawhideCC: bnocera, jorton, pierre-bugzilla, selinux
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-02-15 16:50:33 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Russ 2007-12-15 04:53:17 UTC
Description of problem:
An X11 crash leaves temporary files which prevent the pulseaudio daemon from
loading.

Version-Release number of selected component (if applicable):
pulseaudio-0.9.7-0.17.svn20071017.fc8

How reproducible:
always

Steps to Reproduce:
1. Start X 
2. CTRL-BACKSPACE
3. Restart X
  
Actual results:
Sound server informational message:
Could not start the sound server

or:

pulseaudio -D
E: main.c: could not find file


Expected results:
Sound :(


Additional info:
Is it really that difficult to check for stale files/directories/lockfiles.
Shouldn't the daemon at least report that stale files/directories exist?
Do lockfiles or normal daemon type checks even exist for this daemon? 
If not, well maybe thats the problem.

How did this kind of stuff make it into a final product release, replacing a
functional audio system that had no bugs? I hate to be critical, and realize
that putting into a production release is the only way to spur adoption, but
something as simple as failure to detect stale files should not cause users
countless hours of aggravation trying to find the cause of no sound. Things like
this simple problem, which should not exist in a production release, will cause
users to simply switch back to alsa, or even worse, switch to another distro or
back to Windows. . .

Comment 1 Lubomir Kundrak 2007-12-20 12:49:48 UTC
Russ: Thanks for your report. I can not reproduce this. Could you please explain
why do you think this is related to a stale lockfile? Apart from that, could you
please try to launch the sound server with "pulseaudio -vvv" and eventually
strace it to track down which file is in question?

(In reply to comment #0)
> How did this kind of stuff make it into a final product release, replacing a
> functional audio system that had no bugs?

It's questionable if there exists software without bugs. Answer to your question
is that it (the whole release) went throught the cycle of testing releases until
users were satisfied with it. Everyone has a chance to report and have fixed
what bothers him and if he cares about having it fixed in release he reports
before the release os frozen. Complaining is cheap -- patches and good reports
aren't -- you can make the next release better.

Comment 2 Russ 2008-01-04 01:41:42 UTC
Lubomir,
Thanks for your resposnse. Please excuse my delayed response due to the holiday
season.

As a developer myself if familiar with the development & release procedures. You
will have to excuse my comments above a bit. I was, at the time, quite dismayed
that such a simple and easily reproduced and correctable problem could find its
way into a final release. I took a significant amount of time for me to locate
the offending problem, as I assumed it was a configuration problem since I had
just installed pulseaudio Once I found the cause I was flabbergasted that the
problem existed in a final release.

Part of testing for a release should consist of creating a system crash, and
then evaluating the outcome of it so that the programs and daemons currently
running can recover from it without intervention of the end user or a sysadmin.
Obviously nobody did enough of that for pulseaudio. Or, if they did, then either
they missed something or my system is not configured properly. If for some
reason it is a configuration problem with my system I apologize for an erroneous
bug report, but I still need the problem resolved and I'm fairly certain that it
exists for other users.

I never said it was a lockfile problem. If you read my comments above it says:
"leaves temporary files," not "leaves stale lockfiles." It is a tempfile problem
(the pid and the socket, which are obviously needed if pulse is to run). When
tempfiles exist the daemon won't start. Maybe the reason you can't reproduce it
is that your machine is runlevel 5 default. The problem exists when you start X
from runlevel 3. Since you are still the same user when you start it again,
pulse wants to use the same directory. Since it can't, the process can't create
the socket and the pid. 

I don't need to do a trace. Since I had to delete the files manually I know
which ones are the offending ones. I apologize for not including them
previously. I assumed the person responding to the report (developer/maintainer)
would know which ones they were, since it's a relatively simple problem which
was so easy for me to reproduce.

The directory and files are:
/tmp/pulse-[username of previously crashed pulseaudio daemon]
/tmp/pulse-[username of previously crashed pulseaudio daemon]/native
/tmp/pulse-[username of previously crashed pulseaudio daemon]/pid

It seems like the problem should be fairly simple to resolve.
Thanks again.

Comment 3 Tom London 2008-01-13 01:05:14 UTC
I can confirm that I see this problem too:

[tbl@localhost ~]$ pulseaudio -v
I: main.c: PolicyKit grants us acquire-high-priority privilige.
I: main.c: We're in the group 'pulse-rt', allowing real-time and high-priority
scheduling.
I: core-util.c: Successfully gained nice level -11.
E: pid.c: Daemon already running.
E: main.c: pa_pid_file_create() failed.
process 3600: dbus_shutdown() called but connections were still live. This
probably means the application did not drop all its references to bus connections.
  D-Bus not built with -rdynamic so unable to print a backtrace
Aborted


I believe the issue is that the pid file in /tmp-USER/pid can be "punned", that
is, it seems to only check for a process with the process id in the pid file,
and not if that process is actually a 'copy' of pulseaudio.

So I'm guessing that the scenario is:
  1. pulseaudio starts successfully and creates /tmp/pulse-USER/pid (say, pid ==
1234).
  2. pulseaudio daemon dies (either due to reboot, crash, etc.).
  3. "some time later" (say, after reboot), pulseaudio tries to start again,
sees the "stale" contents of /tmp/pulse-USER/pid, and checks to see if a process
is running with that PID.
  4. Assume that something else has started with that PID.
  5. No joy.

Wouldn't checking /proc/1234/exe against /usr/bin/pulseaudio be appropriate?

Anyway, doing a "rm -rf /tmp/pulse-USER; pulseaudio -d" always works.

Comment 4 Tom London 2008-01-15 23:26:12 UTC
Here is some specific data:

After a reboot, pulseaudio failed to startup.  Running 'pulseaudio -vvv' produces:

[tbl@localhost pulse-tbl]$ pulseaudio -vvv
I: main.c: PolicyKit grants us acquire-high-priority privilige.
I: main.c: We're in the group 'pulse-rt', allowing real-time and high-priority
scheduling.
I: core-util.c: Successfully gained nice level -11.
E: pid.c: Daemon already running.
E: main.c: pa_pid_file_create() failed.
process 3270: dbus_shutdown() called but connections were still live. This
probably means the application did not drop all its references to bus connections.
  D-Bus not built with -rdynamic so unable to print a backtrace
Aborted
[tbl@localhost pulse-tbl]$ 

Checking out /tmp/pulse-tbl/pid:
[tbl@localhost ~]$ cd /tmp/pulse-tbl
[tbl@localhost pulse-tbl]$ ls -l
total 12
srwxrwxrwx 1 tbl tbl 0 2008-01-15 12:51 native
-rw------- 1 tbl tbl 5 2008-01-15 12:51 pid
[tbl@localhost pulse-tbl]$ cat pid
2833
[tbl@localhost pulse-tbl]$ ps agx | grep 2833
 3253 pts/1    S+     0:00 grep 2833
[tbl@localhost pulse-tbl]$ 

So, 2833 (left over from previous boot) is in the PID file, but there is no 2833
when I checked.

Checking /var/log/messages:
Jan 15 14:10:01 localhost pulseaudio[2835]: pid.c: Daemon already running.
Jan 15 14:10:01 localhost pulseaudio[2835]: main.c: pa_pid_file_create() failed.

[Notice that the process's pid is 2835.]

From the previous boot:
Jan 15 12:51:08 localhost pulseaudio[2833]: pid.c: Stale PID file, overwriting.
Jan 15 12:51:08 localhost pulseaudio[2833]: main.c: setrlimit(RLIMIT_NICE, (31,
31)) failed: Operation not permitted
Jan 15 12:51:10 localhost pulseaudio[2833]: module.c: Failed to load  module
"module-rtp-recv" (argument: ""): initialization failed.
Jan 15 12:51:10 localhost pulseaudio[2833]: module-gconf.c: pa_module_load() failed

[System was turned off for 8 hours before this reboot.]

Finally, removing /tmp/pulse-tbl/pid and retrying:
[tbl@localhost pulse-tbl]$ rm pid
[tbl@localhost pulse-tbl]$ pulseaudio -vvv
I: main.c: PolicyKit grants us acquire-high-priority privilige.
I: main.c: We're in the group 'pulse-rt', allowing real-time and high-priority
scheduling.
I: core-util.c: Successfully gained nice level -11.
W: main.c: setrlimit(RLIMIT_NICE, (31, 31)) failed: Operation not permitted
I: main.c: Page size is 4096 bytes
I: main.c: Fresh high-resolution timers available! Bon appetit!
D: cli-command.c: Checking for existance of
'/usr/lib/pulse-0.9/modules//module-hal-detect.so': success
I: module-hal-detect.c: Trying capability alsa
<<<<< SNIP >>>>>

So it starts up fine after removing pid file, even though there appears no
process 2833......




Comment 5 Tom London 2008-01-16 05:12:44 UTC
I looked at the code in pid.c.

It appears that the code that checks the pid file and then checks if another
version of pulseaudio is running may be weak.

Here is what I think is the pertinent code sequence:

    if ((pid = read_pid(fn, fd)) == (pid_t) -1)
        pa_log_warn("Corrupt PID file, overwriting.");
    else if (pid > 0) {
#ifdef OS_IS_WIN32
        if ((process = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid)) !=
NULL) {
            CloseHandle(process);
#else
        if (kill(pid, 0) >= 0 || errno != ESRCH) {
#endif
            pa_log("Daemon already running.");
            goto fail;
        }

        pa_log_warn("Stale PID file, overwriting.");
    }

So the code tries to send a "zero kill" (really no kill is sent) to determine if
the pid is "live" (i.e., there is a running process with than number).

Here is the test:
        if (kill(pid, 0) >= 0 || errno != ESRCH) {

I think it has 2 problems:
1. It interprets the ability to send the kill as meaning pulseaudio is running.
(What if it can send the kill to a non-pulseaudio process?)
2. I think it ignores the case of kill() returning EPERM; if EPERM is returned,
it concludes that pulseaudio is running).

Shouldn't it check this differently, say something like:

	if( pid_file exists && pid_file contains valid pid && pid is live &&
pid_process is pulseaudio )

Does the "kill(pid,0)" works here?


Comment 6 Lennart Poettering 2008-02-15 16:46:34 UTC
*** Bug 426965 has been marked as a duplicate of this bug. ***

Comment 7 Lennart Poettering 2008-02-15 16:50:33 UTC

*** This bug has been marked as a duplicate of 405351 ***

Comment 8 Joe Orton 2008-03-13 08:57:53 UTC
I see this failure fairly regularly; I boot up, and randomly I get no sound. 
Usually some other process happens to be running with the pid that the pa daemon
used on a previous boot.  

Tom's analysis for how to fix this looks exactly correct.

Comment 9 Tom London 2008-03-13 13:33:50 UTC
I've been lazy. I added the following to /etc/rc.d/rc.local, and the problem is
masked:

echo "Clearing out /tmp/pulse-*"
rm -rf /tmp/pulse-*

Anyway, something like a call to "pidof" would seem to verify the pid in the PID
file.