Bug 135202

Summary: HAL hangs - stops processing when a Netgear WG511 (Prism54) card inserted
Product: [Fedora] Fedora Reporter: Andrew Duggan <lists>
Component: halAssignee: David Zeuthen <davidz>
Status: CLOSED RAWHIDE QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: rawhideCC: davej, dcbw, mclasen
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2004-10-19 01:27:51 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 130887    
Attachments:
Description Flags
Output of hald --daemon=no --verbose=yes
none
lspci -vvv just in case
none
Output of lshal
none
hald --daemon=no --verbose=yes
none
lshal output 20041011
none
lspci -vvv 20041011
none
dbus-monitor --system output 20041011
none
snip of var/log/messages showing the udev removing device node /dev/0000:03:00.0
none
dmesg from the same time as the messages-20041015
none
output of a front end to /sbin/hotplug
none
hald --daemon=no --verbose=yes (while dbug-hotplug was being collected)
none
hoplug sequence number patch in lib/kobject.c that greatly reduces gaps in hotplug seqnums
none
Updated lib/kobject.patch based on feed back up stream none

Description Andrew Duggan 2004-10-10 04:15:01 UTC
Description of problem:
Insert a cardbus Prism54 card (Netgear WG511) PCI ID 1260:3890 Hal
stops processing events -- things get queued but don't move, eg.
inserting a USB Flash Drive doesn't produce the normal VFS entries and
automount - g-v-m does not mount the drive. /dev/sda1 is created
though. Invoking g-v-m-properties complains that hald is not running
even though it is a 
ps aux | grep hald | grep -v grep 
shows hal running.  Everything is OK before inserting the WG511.  

Version-Release number of selected component (if applicable):
hal-0.2.98.cvs20040929-3

How reproducible:
Always

Steps to Reproduce:
1. Insert WG511
2. Insert previously OK usb flash drive 
3. hald is all clogged up
  
Actual results:
hald is all clogged up - hw changes are not processed by hald

Expected results:
hald is not clogged up, and normal hald hw processing e.g
hardware just works


Additional info:
Attaching complete output of a 
hald --daemon=no --verbose=yes 
(the good stuff is all at the bottom, but the entire run is included) 
Attaching an lshal and and lspci -vvv (with the WG511 in)
fedora kernel 2.6.8-1.603 (PLUS NTFS RO only change!)
I tried with with both udev-034-2 and I also rolled back to 
udev-032-10 as the changes in udev 034 with the wait_for_sysfs changes
were complaining about needing an update for the a ../firmware...
class entry. related to the WG511 - so I backed down to udev-032-10
just to make sure it wasn't udev 034.)

Comment 1 Andrew Duggan 2004-10-10 04:26:24 UTC
Created attachment 104979 [details]
Output of hald --daemon=no --verbose=yes 

Sorry about the 60-printer_remove.hal stuff - after I started seeing this
problem, I notice that the printe_removal.hal didn't start with a number so I
just guessed at one and renamed it; I also change the -x to a -r since the
--help didn't say anything about -x (looking at the output now I see that's not
right either, but I don't think its involved as the hald hanging was their
before I changed that. If you require, I will change it back and regenerate the
output.

Comment 2 Andrew Duggan 2004-10-10 04:27:56 UTC
Created attachment 104980 [details]
lspci -vvv just in case

Sorry for all the mispellings

Comment 3 Andrew Duggan 2004-10-10 04:31:55 UTC
Created attachment 104981 [details]
Output of lshal

One last thing, unplugging the WG511 does not unhang hald

Comment 4 Andrew Duggan 2004-10-11 12:27:27 UTC
Created attachment 105002 [details]
hald --daemon=no --verbose=yes

I redid the tests after reinstall hal*0.2.98.cvs20040929-3 + hal-cups-utils
here is the hald output.  Plus I did this a little more scientifically.  I run
this from vc3 then logged into a gnome desktop.  I did 2 normal plugin, use,
unmount and unplug of my usb flash disk, then I plugged in the WG511 then the
usb flash disk again.

Comment 5 Andrew Duggan 2004-10-11 12:29:16 UTC
Created attachment 105003 [details]
lshal output 20041011

20041011 output of lshal at the end WG511 and flash disk inserted

Comment 6 Andrew Duggan 2004-10-11 12:29:59 UTC
Created attachment 105004 [details]
lspci -vvv 20041011

Comment 7 Andrew Duggan 2004-10-11 12:33:08 UTC
Created attachment 105006 [details]
dbus-monitor --system output 20041011

Here is the output of the dbus-monitor --system during the test

Comment 8 Andrew Duggan 2004-10-11 12:38:45 UTC
Sorry to deluge you with changes on this -- just thought it was better
to have the test based on reference.  Also FWIW I took the latest hal
CVS as of 20041010, and I got the same net results, which is not
surprising as there were not a lot of changes.  But I put the fc3t2+
hal-*.cvs20040929-3 back before running the tests results I uploaded
today. 


Comment 9 David Zeuthen 2004-10-14 22:30:03 UTC
Hi, sorry for the lag and thanks for the detailed logs. 

According to the hald trace in comment 4, hald never receives the
hotplug event with the sequence number 330 and that is what is causing
the processing of hotplug events in hald to freeze. I'm wondering why
that is - does (temporarily) removing the /etc/dev.d/net/hotplug.dev
file make things work again?

Thanks,
David

Comment 10 Andrew Duggan 2004-10-15 06:08:12 UTC
David,

No problem -- I actually wasn't expecting such a quick response, so cool.
However, I tried that tonight, and it did not make any difference (I
even rebooted just to make sure)  I did notice (and I should have seen
it sooner) the following in my /var/log/messages
... udev: removing device node '/dev/0000:03:00.0'  ... that just
happens to match the end of the sysfs path for the card when plugged
into the bottom slot, and is contemporary with other messages relating
to the insertion of the card.  I am attaching that section
[messages.20041015] plus a dmesg [dmesg.20041015] output. 

Pardon the speculation, but since there is nothing in the log about
creating /dev/0000:03:00.0 could that be involved? esp considering we
having a MIA hotplug seqnum.  

After collecting that I ginned up a front end to  /sbin/hotplug to log
every invocation to debug-hotplug.log, then call /sbin/hotplug.  Sorry
this is still all new to me, there is probably an easier way to do
that, but I would be much more comfortable looking at a VMS crash dump
:-)   I am attaching that [debug-hotplug.log] as well as the hald
[hald-20041015] output.  They are both missing the same (appearently
critical) sequence number seqnum=351 so what ever that event is, at
least to me, it does not look like its comming from the kernel. FWIW
the messages and dmesg are from before I started down the hotplug
script path.

Is this maybe a hotplug{udev,kernel?} bug that hal is a downstream
victim of?

Also just so you have the complete picture, I had already built a hal
with Martin Pitts patch to pci_bus_device.c, and that's what I am
running right now. (I had to try)

I would be happy to run anything in debug but I would need some
guidance on where to set break or watch points.

Thanks,

Andrew

Comment 11 Andrew Duggan 2004-10-15 06:12:09 UTC
Created attachment 105264 [details]
snip of var/log/messages showing the udev removing device node /dev/0000:03:00.0

Comment 12 Andrew Duggan 2004-10-15 06:13:22 UTC
Created attachment 105265 [details]
dmesg from the same time as the messages-20041015

Comment 13 Andrew Duggan 2004-10-15 06:14:51 UTC
Created attachment 105266 [details]
output of a front end to /sbin/hotplug

Comment 14 Andrew Duggan 2004-10-15 06:17:50 UTC
Created attachment 105267 [details]
hald --daemon=no --verbose=yes (while dbug-hotplug was being collected)

Comment 15 Andrew Duggan 2004-10-17 04:01:50 UTC
Hi David.

I am sure this is either a kernel or Prism54 driver bug, so I guess
you can close it. I spent the afternoon playing with the
/lib/kobject.c from the kernel.  After the sequence_number has been
incremented there are two paths that allow an exit without actually
calling call_usermodehelper() to invoke /sbin/hotplug. This is the
missing hotplug event.  I hacked a trival patch to kobject.c and it
fixes this problem exactly. I am attaching the patch just as an fyi as
this is really a kernel "bug".

The other solution I am working on is a similar patch to kobject.c but
to emit a "void" hotplug event if the call to
kset->hotplug_ops->hotplug() fails.  That is still building so I've
not tested that one yet.  Plus I need to verify my understanding of
the env block that is being created so that any garbage created by the
failed call to kset->hotplug_ops->hotplug() in the env block is
cleaned up and the original argv [1] (name) gets put as an env var
e.g. OLDNAME. 

Of course this does not address the immediate root cause of the
behaviour I was seeing as I am guessing its something in the prism54
driver that is making the kset->hotplug_ops->hotplug() fail. However,
if (as it seems) that HAL really prefers there be no "gaps" in the
hotplug sequence numbers, then it is better that bugs in drivers 
(that don't even cause an Oops) should make hald virtually hang.

Thanks,

Andrew


Comment 16 Andrew Duggan 2004-10-17 04:06:25 UTC
Created attachment 105341 [details]
hoplug sequence number patch in lib/kobject.c that greatly reduces gaps in hotplug seqnums

Comment 17 Dan Williams 2004-10-18 16:40:53 UTC
Dave, any comments on the patch?

Comment 18 David Zeuthen 2004-10-18 21:50:56 UTC
*** Bug 135677 has been marked as a duplicate of this bug. ***

Comment 19 Andrew Duggan 2004-10-19 01:07:47 UTC
Created attachment 105412 [details]
Updated lib/kobject.patch based on feed back up stream 

If you are still interested, here is an updated patch based on some feedback
from Greg KH. The big change is that sequence_number never needs to be
decremented.  This one puts the seqnum in a different buffer, so it can put off
getting a sequence number until after all the conditional exits. A spot in the
envp array had to reserved for the sequence number as the call to
kset->hotplug_ops->hotplug() does not return a new value for i. It can't update
the scratch pointer either so there is no way to know where the end of the used
portion of buffer was. 

Thx Andrew

Comment 20 David Zeuthen 2004-10-19 01:27:51 UTC
Well, hald should really be able to cope with missing hotplug events
so I did this. It's a sign of a driver bug for sure, but you'll see
something along the lines of this in your syslog

 Oct 18 18:17:11 davidz hald[1145]: Timed out waiting for hotplug
event 784

This fix is in hal-0.4.0-4 which is available in Rawhide tomorrow. I
was also able to test this with a Cardbus card with using this driver
so I'm closing as RAWHIDE. Andrew: Feel free to reopen this bug if it
doesn't work for you.

Thanks,
David