Bug 384271

Summary: Hal occasionally does not restore ACLs
Product: [Fedora] Fedora
Reporter: Nick Lamb <redhat>
Component: hal
Assignee: David Zeuthen <davidz>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: medium
Version: 8
CC: bche, belegdol, bojan, chris, clasohm, cra, dbaron, jhutar, lakshminaras2002, lkundrak, luis, mads, mcepl, mclasen, pierre-bugzilla, sankarshan.mukhopadhyay, steve
Hardware: i686
OS: Linux
Fixed In Version: 0.5.10-1.fc8.2
Doc Type: Bug Fix
Last Closed: 2008-03-13 03:42:52 EDT
Bug Blocks: 397601, 434909

Description Nick Lamb 2007-11-15 05:04:22 EST
Description of problem:

This is for a Thinkpad Z60m upgraded from Fedora 7, though I suspect the
hardware doesn't matter for this bug.

When this laptop has been hibernated and then restored, sound stops working.
On closer inspection I see that in fact the ACLs that permit me to use the
sound device have been removed.

I don't know which component is responsible, so udev is my first guess. Please
pass this on to another component if you know better.

Version-Release number of selected component (if applicable):

udev-116-3.fc8

How reproducible:

Absolutely reliable so far

Steps to Reproduce:
1. Hibernate the laptop
2. Restore the laptop
3. Check access to sound device with e.g. getfacl /dev/snd/pcmC0D0c
  
Actual results:

No ACL for my user

# file: dev/snd/pcmC0D0c
# owner: root
# group: root
user::rw-
user:gdm:rw-
group::rw-
mask::rw-
other::---

Expected results:

Permissive ACL for my (logged in) user

# file: dev/snd/pcmC0D0c
# owner: root
# group: root
user::rw-
user:gdm:rw-
user:njl:rw-
group::rw-
mask::rw-
other::---

Additional info:

Sound worked after restoring in Fedora 7, but I don't know whether ACLs were
used in that release.
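
The check in step 3 can also be scripted. The following is a minimal sketch, assuming the getfacl text format shown in the two listings above; the function name acl_users is hypothetical and not part of any tool mentioned in this report.

```python
def acl_users(getfacl_output):
    """Return the set of named users that have their own ACL entry."""
    users = set()
    for line in getfacl_output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# file/owner/group" header
        fields = line.split(":")
        # "user:njl:rw-" is a named-user entry; "user::rw-" is the file owner.
        if len(fields) == 3 and fields[0] == "user" and fields[1]:
            users.add(fields[1])
    return users


# The "actual" listing from this report (after resume).
ACTUAL = """\
# file: dev/snd/pcmC0D0c
# owner: root
# group: root
user::rw-
user:gdm:rw-
group::rw-
mask::rw-
other::---
"""

# The "expected" listing: same as above plus the logged-in user's entry.
EXPECTED = ACTUAL.replace("user:gdm:rw-\n", "user:gdm:rw-\nuser:njl:rw-\n")

print(sorted(acl_users(ACTUAL)))    # ['gdm']
print(sorted(acl_users(EXPECTED)))  # ['gdm', 'njl']
```

In practice the text would come from running getfacl on the device node; here the listings from the report are embedded so the sketch runs without the hardware.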
Comment 1 Harald Hoyer 2007-11-15 05:19:38 EST
udev does not set the ACLs
Comment 2 Jack Spaar 2007-11-17 19:47:10 EST
I have the same problem, but sound works OK for root.

This is almost certainly a dupe of
https://bugzilla.redhat.com/show_bug.cgi?id=376011

When the problem occurs, gnome-volume-control reports "No volume control
GStreamer plugins and/or devices found." as in the above bug.

But sound works for root.

E.g.
luser@system$ aplay /usr/share/sounds/startup3.wav 
ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:3510:(_snd_config_evaluate) function snd_func_card_driver
returned error: No such device
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:3510:(_snd_config_evaluate) function snd_func_concat returned
error: No such device
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:3510:(_snd_config_evaluate) function snd_func_refer returned
error: No such device
ALSA lib conf.c:3982:(snd_config_expand) Evaluate error: No such device
ALSA lib pcm.c:2145:(snd_pcm_open_noupdate) Unknown PCM default
aplay: main:546: audio open error: No such device


luser@system$ sudo aplay /usr/share/sounds/startup3.wav 
Playing WAVE '/usr/share/sounds/startup3.wav' : Signed 16 bit Little Endian,
Rate 44100 Hz, Stereo

Comment 3 Luis Villa 2007-11-17 20:04:56 EST
*** Bug 376011 has been marked as a duplicate of this bug. ***
Comment 4 David Zeuthen 2007-11-17 20:09:19 EST
Does this problem go away if you switch to VT1 and then back? 

(from a root shell in the session you can do 'chvt 1; sleep 2; chvt 7')
Comment 5 Jack Spaar 2007-11-17 20:20:48 EST
(In reply to comment #4)
> Does this problem go away if you switch to VT1 and then back? 
> 

Yes!
Comment 6 Luis Villa 2007-11-17 22:10:18 EST
yup, that fixes it here too.
Comment 7 Nick Lamb 2007-11-19 04:45:33 EST
Yes, the ACLs are put back when I switch away and back again, thanks David.

Also I discovered that this isn't 100% reproducible after all. Sometimes it
doesn't happen. I've yet to pin down what makes the difference.
Comment 8 David Zeuthen 2007-11-19 11:02:06 EST
Glad to hear the VT switching "fixes" it. I've been working on and off on a fix,
but it's pretty difficult to reproduce this one...
Comment 9 thu992 2007-11-20 06:09:32 EST
Same problem and fix here too. Maybe this bug qualifies as a "known issue"
(https://fedoraproject.org/wiki/Bugs/F8Common)? I'm sure many other laptop users
are facing the same problem.
Comment 10 Will Woods 2007-12-03 15:39:25 EST
*** Bug 395581 has been marked as a duplicate of this bug. ***
Comment 11 Lubomir Kundrak 2008-01-25 06:30:28 EST
*** Bug 314411 has been marked as a duplicate of this bug. ***
Comment 12 Lubomir Kundrak 2008-01-25 06:34:16 EST
David: I can reproduce it in roughly 30% of cases. It doesn't only happen after
resumes, but also on console switches (though less frequently); with fast user
switching your chances of reproducing it are higher. Also, I observed that it
happens more frequently on some machines and less frequently on others.

As I seem to be able to reproduce this often, is there a way I can be helpful?
Would the output of hald running verbosely be helpful?
Comment 13 Nick Lamb 2008-02-25 19:40:11 EST
I had the idea to capture the DBus system bus messages to see what's actually
going on when this happens.

For a while after I did this, I did not see the problem. However I notice that
today, the ACLs were not restored.

Disappointingly, there is not much further clue beyond the absence of the expected
ACLAdded messages following the ACLRemoved messages in the log. So I suppose we
at least know that the software doesn't think it has restored the ACLs, and the
question remains why it either isn't trying or doesn't succeed; most likely the
former.

However, I do notice that in every case during a Hibernation, nothing happens
until after the subsequent restore. That is, the ACLs aren't even removed, let
alone restored, until after the machine is defrosted from its hibernation.

What's the next step? Add some diagnostics to hal-acl-tool to see whether it's
being called at all when this goes wrong? Which program actually runs
hal-acl-tool, the hal daemon?
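
The check described in this comment (an ACLRemoved with no later ACLAdded) could be automated over a captured log. This is a rough sketch, assuming the log has already been reduced to (signal name, device) pairs in arrival order. That tuple layout, the sample devices, and the function name are assumptions made for illustration, not the real signal arguments.

```python
def devices_left_without_acl(events):
    """Return the devices whose most recent ACL signal was ACLRemoved."""
    last_signal = {}
    for signal, device in events:
        if signal in ("ACLAdded", "ACLRemoved"):
            last_signal[device] = signal  # later signals overwrite earlier ones
    return sorted(d for d, s in last_signal.items() if s == "ACLRemoved")


# Hypothetical capture: the first device is restored, the second is not.
log = [
    ("ACLRemoved", "/dev/snd/controlC0"),
    ("ACLRemoved", "/dev/snd/pcmC0D0c"),
    ("ACLAdded", "/dev/snd/controlC0"),
]
print(devices_left_without_acl(log))  # ['/dev/snd/pcmC0D0c']
```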
Comment 14 Lubomir Kundrak 2008-02-26 01:01:15 EST
tialaramex: Actually, what happens is that after two VT switches in a short time
two instances of hal-acl-tool get spawned, and the later-created one can get
scheduled sooner, so they run in reverse order.

Currently hal passes the session information to hal-acl-tool in environment
variables (an optimization), which is incorrect, since the information can be
invalid by the time hal-acl-tool applies it. My idea is to get the session
information from ConsoleKit via D-Bus calls after locking the ACL list.

Unless davidz does it, it can take some time, as I'm not very experienced in this area.
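
The difference between the two approaches in this comment can be illustrated with a toy simulation. It is a sketch only: the helper functions, the session dictionary and the user name are stand-ins invented for the example, and no real hal, ConsoleKit or D-Bus call is made.

```python
def run_helper_with_snapshot(acl, user, active_at_spawn):
    """Apply the session state that was captured when the helper was spawned."""
    if active_at_spawn:
        acl.add(user)
    else:
        acl.discard(user)


def run_helper_with_query(acl, user, is_active_now):
    """Ask for the session state only at the moment the helper runs."""
    if is_active_now():
        acl.add(user)
    else:
        acl.discard(user)


def simulate(use_snapshot):
    """Return True if the user still has the ACL after two quick switches."""
    session = {"active": True}
    acl = {"njl"}
    spawned = []
    # Two quick switches: away (inactive), then back (active).
    for new_state in (False, True):
        session["active"] = new_state
        if use_snapshot:
            spawned.append(
                lambda s=new_state: run_helper_with_snapshot(acl, "njl", s))
        else:
            spawned.append(
                lambda: run_helper_with_query(
                    acl, "njl", lambda: session["active"]))
    # The scheduler happens to run the later-created helper first.
    for helper in reversed(spawned):
        helper()
    return "njl" in acl


print(simulate(use_snapshot=True))   # False: the stale "inactive" wins
print(simulate(use_snapshot=False))  # True: both helpers see "active"
```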
Comment 15 David Zeuthen 2008-02-26 01:14:31 EST
My plan is to work on hal this week. The info in comment 14 is useful; I'll
review the locking code and torture test the whole thing.
Comment 16 Lubomir Kundrak 2008-02-26 03:36:29 EST
David: I committed the fix for F-8 to CVS [1]. Please have a short look at it; it
fixed the issue for me. I tried several suspend/resumes, VT switches, and fast user
switching. Mostly stolen from William Jon McCann.

[1]
https://www.redhat.com/archives/fedora-extras-commits/2008-February/msg10485.html

If you don't object, we may push this for F-8, as it's pretty simple and the issue it
fixes is serious enough, no matter what the more elegant fix for a future
upstream version turns out to be :)
Comment 17 Lubomir Kundrak 2008-02-26 03:39:51 EST
Note that the patch also needs changes to the SELinux policy, as currently
hal-acl-tool is not allowed to talk to D-Bus.
Comment 18 David Zeuthen 2008-02-26 10:47:47 EST
(In reply to comment #16)
> David: I committed the fix for F-8 to CVS [1]. Please have a short look at it, it
> fixed the issue for me, I tried several suspend/resumes, VT switches, fast user
> switching. Mostly stolen from William Jon McCann.

No, please avoid committing this for F-8. It fixes only the symptom, not the
real bug. But thanks for the patch, testing and data points; might be useful to
get to the bottom of this bug.
Comment 19 Nick Lamb 2008-02-26 11:26:28 EST
David, so far as I can tell Lubomir has identified the bug, and his fix is
unavoidable. The current hal-acl-tool design can't work because it assumes that
sub-process execution is synchronous, which isn't true on any remotely modern
computer. So it needs to be replaced, as is done in Lubomir's patch.

What is the "real bug" that you think is the problem here, and how will your fix
avoid the case where hal-acl-tool acts at the wrong moment?
Comment 20 David Zeuthen 2008-02-26 11:43:29 EST
(In reply to comment #19)
> David, so far as I can tell Lubomir has identified the bug, and his fix is
> unavoidable. The current hal-acl-tool design can't work because it assumes that
> sub-process execution is synchronous, which isn't true on any remotely modern
> computer. So it needs to be replaced, as is done in Lubomir's patch.

That's an interesting claim to make. FWIW, Lubomir's patch creates a ton of
extra work because it gets information from CK via D-Bus instead of using the
information passed from hald who is already watching CK asynchronously. This
information was specifically added to avoid doing all this work.

I think the bug is just that one or more hal-acl-tool processes get in the way
of each other, e.g. that the locking is somehow broken.

> 
> What is the "real bug" that you think is the problem here, and how will your fix
> avoid the case where the hal-acl-tool acts at the wrong moment ?

I won't have time to debug this until tomorrow.
Comment 21 Nick Lamb 2008-02-26 15:08:15 EST
(In reply to comment #20)
> That's an interesting claim to make. FWIW, Lubomir's patch creates a ton of
> extra work because it gets information from CK via D-Bus instead of using the
> information passed from hald who is already watching CK asynchronously. This
> information was specifically added to avoid doing all this work.

Yes, but I think the work is unavoidable with the current design.

> I think the bug is just that one or more hal-acl-tool processes gets in the
> way of each other e.g. that the locking is somehow broken.

Let me spell out a race condition, which I believe is commonplace.

1. Laptop lid closed, hibernate begins, console kit changes to 'false'
2. hal-acl-tool pid #846 is created to remove ACLs, but it doesn't run yet
because the kernel is trying to hibernate.
3. Hibernation complete, power off
4. Restore initiates thawing, console kit back to 'true'
5. hal-acl-tool pid #851 is created to restore ACLs
6. hal-acl-tool pid #851 runs, ACLs are already present, nothing to do, exits
7. hal-acl-tool pid #846 is thawed, runs, removes ACLs

Now, this is contrary to what you might /expect/ to happen, since you started
#846 first, but there we are, this is a pre-emptive multitasking operating
system and things don't necessarily happen in the order you expected.

Any reliable fix for my bug report will need to address this race condition.
Lubomir's fix addresses this race condition. Poking around in hal-acl-tool's own
locking won't address the race condition. Maybe you'll find another bug, maybe
you won't, but I think Lubomir's found the real cause of my trouble.

If you're sure that D-Bus messages are too expensive, your other option is to
arrange for HAL to only run one hal-acl-tool at a time (always waiting for the
previous one to complete before starting another) and queue up ACL changes until
a new sub-process can be started. If you don't already have a utility
sub-routine to do this sort of thing correctly it will undoubtedly take some
serious debugging to make this robust.
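
The second option (one helper at a time, with queued requests) could look roughly like this. It is only a sketch: SerialLauncher, request and child_exited are hypothetical names, the spawn callable stands in for starting hal-acl-tool, and this is not the code of any actual HAL patch.

```python
from collections import deque


class SerialLauncher:
    """Start at most one helper at a time, in the order requests arrived."""

    def __init__(self, spawn):
        self._spawn = spawn        # starts a helper and returns immediately
        self._pending = deque()    # requests waiting for the current helper
        self._busy = False         # True while a helper is still running

    def request(self, action):
        """Queue an ACL change; start it right away if nothing is running."""
        self._pending.append(action)
        self._maybe_start()

    def child_exited(self):
        """Call this when the running helper has exited."""
        self._busy = False
        self._maybe_start()

    def _maybe_start(self):
        if not self._busy and self._pending:
            self._busy = True
            self._spawn(self._pending.popleft())


# Record what gets started instead of launching real processes.
started = []
launcher = SerialLauncher(spawn=started.append)
launcher.request("remove ACLs")   # started immediately
launcher.request("restore ACLs")  # queued behind the first helper
print(started)                    # ['remove ACLs']
launcher.child_exited()           # first helper done, second one starts
print(started)                    # ['remove ACLs', 'restore ACLs']
```

Because the second helper cannot start until the first has exited, the removal in step 7 of the race above can no longer land after the restore in step 6.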
Comment 22 Lubomir Kundrak 2008-02-28 12:54:08 EST
*** Bug 422751 has been marked as a duplicate of this bug. ***
Comment 23 Lubomir Kundrak 2008-02-28 13:16:34 EST
*** Bug 431349 has been marked as a duplicate of this bug. ***
Comment 24 Nick Lamb 2008-02-28 13:54:24 EST
Did you find anything in your investigation, David?

Lubomir, perhaps you can create a (Fedora 8) RPM with your change that
interested parties can test while we wait a little while to see if David finds a
more elegant solution? Also, you mentioned SELinux policy. Is the current
situation that your patch does not function on systems where SELinux policy is
enforced? Or did I misunderstand?
Comment 25 Lubomir Kundrak 2008-02-28 17:31:02 EST
tialaramex: the package has been built in koji for some time already. [1]

[1] http://koji.fedoraproject.org/koji/buildinfo?buildID=39956

I plan to push it to testing tomorrow unless davidz comes up with a better
solution by then. It's been three months since this was reported, and it has
caused a lot of trouble for laptop users. I'm always in favour of a fix from David!

When it comes to SELinux, you're right. In enforcing mode it won't allow
hal-acl-tool to communicate via D-Bus's socket, and therefore it can't get
information on seats and sessions from ConsoleKit, so effectively it won't work
at all. Modifying the SELinux policy should be trivial though, so if we agree on
the fix (depending on whether a solution comes from davidz), I'd do that.
Comment 26 David Zeuthen 2008-02-28 18:20:56 EST
> I plan pushing it to testing tomorrow

No, please don't do this. Thanks.
Comment 27 Lubomir Kundrak 2008-02-29 04:35:46 EST
*** Bug 397601 has been marked as a duplicate of this bug. ***
Comment 28 David Zeuthen 2008-03-04 00:20:35 EST
(In reply to comment #21)
> Let me spell out a race condition, which I believe is commonplace.
> 
> 1. Laptop lid closed, hibernate begins, console kit changes to 'false'
> 2. hal-acl-tool pid #846 is created to remove ACLs, but it doesn't run yet
> because the kernel is trying to hibernate.
> 3. Hibernation complete, power off
> 4. Restore initiates thawing, console kit back to 'true'
> 5. hal-acl-tool pid #851 is created to restore ACLs
> 6. hal-acl-tool pid #851 runs, ACLs are already present, nothing to do, exits
> 7. hal-acl-tool pid #846 is thawed, runs, removes ACLs
> 
> Now, this is contrary to what you might /expect/ to happen, since you started
> #846 first, but there we are, this is a pre-emptive multitasking operating
> system and things don't necessarily happen in the order you expected.

Right. The bug is really that we don't serialize the hal-acl-tool calls. I've
done this now

http://gitweb.freedesktop.org/?p=hal.git;a=commitdiff;h=f047f03869b2f5d20de1eafdae02d4ebc6eddc06

and this fixes it for me (will land in Rawhide tomorrow with a ton of other
fixes). Any chance anyone can check if this patch applies to the F8 srpm and if
it fixes the problem? Thanks.
Comment 29 Nick Lamb 2008-03-04 05:53:18 EST
From reading the patch this fix looks correct assuming that the callback is the
only way the affected code can be entered asynchronously -- which I can't verify
without reading the rest of the HAL code. Also I assumed that the callback isn't
actually running in signal handler context (for SIGCHLD) but just afterwards,
since it does far too much and too dangerous work for a signal handler.

I have been running Lubomir's RPMs for a few days now without problems, but this
fix is more in the spirit of the original design. If no-one else does it then I
will try to find time this month to look at getting my RPM build environment
working again and test your patch on my F8 laptop.
Comment 30 Fedora Update System 2008-03-04 07:35:35 EST
hal-0.5.10-1.fc8.2 has been submitted as an update for Fedora 8
Comment 31 Fedora Update System 2008-03-06 11:33:50 EST
hal-0.5.10-1.fc8.2 has been pushed to the Fedora 8 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update hal'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-2246
Comment 32 Chris Green 2008-03-07 07:02:57 EST
I'm interested in this.  I think I'm seeing another symptom of the same problem.

When I start X from run level 3 using startx, sound doesn't work;
gnome-volume-control reports the "No volume control GStreamer plugins and/or
devices found." message.

However if I start X from run level 5 sound works fine.

I suspect I may be seeing other symptoms of the same issue from VMWare.
Comment 33 Lubomir Kundrak 2008-03-07 07:08:03 EST
(In reply to comment #32)
> I'm interested in this.  I think I'm seeing another symptom of the same problem.
> 
> When I start X from run level 3 using startx sound doesn't work, I get the
> gnome-volume-control reports "No volume control GStreamer plugins and/or devices
> found." message.
> 
> However if I start X from run level 5 sound works fine.
> 
> I suspect I may be seeing other symptoms of the same issue from VMWare.

What makes you believe it is a symptom of this problem?
Run "ck-list-sessions" to see whether ConsoleKit knows about your session.
I suspect it won't. Do not use startx; use gdm. Or maybe reusing the same VT
would help?
Comment 34 Fedora Update System 2008-03-13 03:42:49 EDT
hal-0.5.10-1.fc8.2 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 35 Narasimhan 2009-11-06 00:19:57 EST
Hello,
I am seeing the issue of "no sound after resume" in Fedora 11 with a T60P. Killing and restarting PulseAudio solves the issue. The desktop is KDE 4.3.2.
Please let me know the list of output files that I could attach here.

Thanks.