Description of problem: Insert a cardbus Prism54 card (Netgear WG511) PCI ID 1260:3890 Hal stops processing events -- things get queued but don't move, eg. inserting a USB Flash Drive doesn't produce the normal VFS entries and automount - g-v-m does not mount the drive. /dev/sda1 is created though. Invoking g-v-m-properties complains that hald is not running even though it is a ps aux | grep hald | grep -v grep shows hal running. Everything is OK before inserting the WG511. Version-Release number of selected component (if applicable): hal-0.2.98.cvs20040929-3 How reproducible: Always Steps to Reproduce: 1. Insert WG511 2. Insert previously OK usb flash drive 3. hald is all clogged up Actual results: hald is all clogged up - hw changes are not processed by hald Expected results: hald is not clogged up, and normal hald hw processing e.g hardware just works Additional info: Attaching complete output of a hald --daemon=no --verbose=yes (the good stuff is all at the bottom, but the entire run is included) Attaching an lshal and and lspci -vvv (with the WG511 in) fedora kernel 2.6.8-1.603 (PLUS NTFS RO only change!) I tried with with both udev-034-2 and I also rolled back to udev-032-10 as the changes in udev 034 with the wait_for_sysfs changes were complaining about needing an update for the a ../firmware... class entry. related to the WG511 - so I backed down to udev-032-10 just to make sure it wasn't udev 034.)
Created attachment 104979 [details] Output of hald --daemon=no --verbose=yes Sorry about the 60-printer_remove.hal stuff - after I started seeing this problem, I notice that the printe_removal.hal didn't start with a number so I just guessed at one and renamed it; I also change the -x to a -r since the --help didn't say anything about -x (looking at the output now I see that's not right either, but I don't think its involved as the hald hanging was their before I changed that. If you require, I will change it back and regenerate the output.
Created attachment 104980 [details] lspci -vvv just in case Sorry for all the mispellings
Created attachment 104981 [details] Output of lshal One last thing, unplugging the WG511 does not unhang hald
Created attachment 105002 [details] hald --daemon=no --verbose=yes I redid the tests after reinstall hal*0.2.98.cvs20040929-3 + hal-cups-utils here is the hald output. Plus I did this a little more scientifically. I run this from vc3 then logged into a gnome desktop. I did 2 normal plugin, use, unmount and unplug of my usb flash disk, then I plugged in the WG511 then the usb flash disk again.
Created attachment 105003 [details] lshal output 20041011 20041011 output of lshal at the end WG511 and flash disk inserted
Created attachment 105004 [details] lspci -vvv 20041011
Created attachment 105006 [details] dbus-monitor --system output 20041011 Here is the output of the dbus-monitor --system during the test
Sorry to deluge you with changes on this -- just thought it was better to have the test based on reference. Also FWIW I took the latest hal CVS as of 20041010, and I got the same net results, which is not surprising as there were not a lot of changes. But I put the fc3t2+ hal-*.cvs20040929-3 back before running the tests results I uploaded today.
Hi, sorry for the lag and thanks for the detailed logs. According to the hald trace in comment 4, hald never receives the hotplug event with the sequence number 330 and that is what is causing the processing of hotplug events in hald to freeze. I'm wondering why that is - does (temporarily) removing the /etc/dev.d/net/hotplug.dev file make things work again? Thanks, David
David, No problem -- I actually wasn't expecting such a quick response, so cool. However, I tried that tonight, and it did not make any difference (I even rebooted just to make sure) I did notice (and I should have seen it sooner) the following in my /var/log/messages ... udev: removing device node '/dev/0000:03:00.0' ... that just happens to match the end of the sysfs path for the card when plugged into the bottom slot, and is contemporary with other messages relating to the insertion of the card. I am attaching that section [messages.20041015] plus a dmesg [dmesg.20041015] output. Pardon the speculation, but since there is nothing in the log about creating /dev/0000:03:00.0 could that be involved? esp considering we having a MIA hotplug seqnum. After collecting that I ginned up a front end to /sbin/hotplug to log every invocation to debug-hotplug.log, then call /sbin/hotplug. Sorry this is still all new to me, there is probably an easier way to do that, but I would be much more comfortable looking at a VMS crash dump :-) I am attaching that [debug-hotplug.log] as well as the hald [hald-20041015] output. They are both missing the same (appearently critical) sequence number seqnum=351 so what ever that event is, at least to me, it does not look like its comming from the kernel. FWIW the messages and dmesg are from before I started down the hotplug script path. Is this maybe a hotplug{udev,kernel?} bug that hal is a downstream victim of? Also just so you have the complete picture, I had already built a hal with Martin Pitts patch to pci_bus_device.c, and that's what I am running right now. (I had to try) I would be happy to run anything in debug but I would need some guidance on where to set break or watch points. Thanks, Andrew
Created attachment 105264 [details] snip of var/log/messages showing the udev removing device node /dev/0000:03:00.0
Created attachment 105265 [details] dmesg from the same time as the messages-20041015
Created attachment 105266 [details] output of a front end to /sbin/hotplug
Created attachment 105267 [details] hald --daemon=no --verbose=yes (while dbug-hotplug was being collected)
Hi David. I am sure this is either a kernel or Prism54 driver bug, so I guess you can close it. I spent the afternoon playing with the /lib/kobject.c from the kernel. After the sequence_number has been incremented there are two paths that allow an exit without actually calling call_usermodehelper() to invoke /sbin/hotplug. This is the missing hotplug event. I hacked a trival patch to kobject.c and it fixes this problem exactly. I am attaching the patch just as an fyi as this is really a kernel "bug". The other solution I am working on is a similar patch to kobject.c but to emit a "void" hotplug event if the call to kset->hotplug_ops->hotplug() fails. That is still building so I've not tested that one yet. Plus I need to verify my understanding of the env block that is being created so that any garbage created by the failed call to kset->hotplug_ops->hotplug() in the env block is cleaned up and the original argv [1] (name) gets put as an env var e.g. OLDNAME. Of course this does not address the immediate root cause of the behaviour I was seeing as I am guessing its something in the prism54 driver that is making the kset->hotplug_ops->hotplug() fail. However, if (as it seems) that HAL really prefers there be no "gaps" in the hotplug sequence numbers, then it is better that bugs in drivers (that don't even cause an Oops) should make hald virtually hang. Thanks, Andrew
Created attachment 105341 [details] hoplug sequence number patch in lib/kobject.c that greatly reduces gaps in hotplug seqnums
Dave, any comments on the patch?
*** Bug 135677 has been marked as a duplicate of this bug. ***
Created attachment 105412 [details] Updated lib/kobject.patch based on feed back up stream If you are still interested, here is an updated patch based on some feedback from Greg KH. The big change is that sequence_number never needs to be decremented. This one puts the seqnum in a different buffer, so it can put off getting a sequence number until after all the conditional exits. A spot in the envp array had to reserved for the sequence number as the call to kset->hotplug_ops->hotplug() does not return a new value for i. It can't update the scratch pointer either so there is no way to know where the end of the used portion of buffer was. Thx Andrew
Well, hald should really be able to cope with missing hotplug events so I did this. It's a sign of a driver bug for sure, but you'll see something along the lines of this in your syslog Oct 18 18:17:11 davidz hald[1145]: Timed out waiting for hotplug event 784 This fix is in hal-0.4.0-4 which is available in Rawhide tomorrow. I was also able to test this with a Cardbus card with using this driver so I'm closing as RAWHIDE. Andrew: Feel free to reopen this bug if it doesn't work for you. Thanks, David