Description of problem:
If I allow haldaemon to start, the system locks hard after a couple of minutes. If haldaemon is disabled, no trouble. kernel-smp-2.6.16-1.2080_FC5, ancient dual PIII Supermicro system (pre-APIC).
Version-Release number of selected component (if applicable):
0.5.7-3
How reproducible:
100%
Steps to Reproduce:
1. Boot system with haldaemon enabled.
Actual results:
After a few minutes, the system hangs and needs power cycling.
Expected results:
System does not hang.
Additional info:
Looks like one of your old components might have a bug that hal is triggering. Can you attach an lshal output? Thanks.
Created attachment 127911 [details] lshal output Here it is.
I rebooted with kernel 2.6.16-1.2096_FC5smp with the external firewire enclosure switched off, then enabled haldaemon. No problem. I turned on the firewire enclosure and the system locked. I rebooted with the firewire device on and haldaemon off. No problem again. So, is there some way to blacklist the firewire so hal does not see it?
David, What is the best way to handle this?
Obviously the right answer is to fix the Firewire kernel drivers, but I doubt that is going to happen anytime soon. I don't really have any other answer. My position is that 1) it's certainly not a hal bug, it's a firewire bug in the kernel; and 2) having blacklists anywhere other than the kernel for this kind of stuff is not the right answer. I suggest reassigning to the kernel component.
Reassigning as per comment #7
Problem still present with kernel-2.6.16-1.2111_FC5smp. The system was stable overnight with haldaemon running, but within less than five minutes of switching on the firewire enclosure, another hard lockup. /var/log/messages shows udev finding the firewire device OK, however the kernel does not like the look of it:
May 17 11:35:26 malbec udevd-event[14450]: udev_event_run: seq 700 finished
May 17 11:35:26 malbec udevd[421]: udev_done: seq 700, pid [14450] exit with 0, 1 seconds old
May 17 11:35:26 malbec gconfd (mpope-13551): Received signal 15, shutting down cleanly
May 17 11:35:27 malbec gconfd (mpope-13551): Exiting
May 17 11:35:32 malbec ainit:
May 17 11:35:32 malbec ainit:
May 17 11:35:52 malbec kernel: ieee1394: Error parsing configrom for node 0-00:1023
May 17 11:36:03 malbec udevd[421]: udev_event_run: seq 701 forked, pid [14494], 'add' 'ieee1394', 0 seconds old
May 17 11:36:03 malbec udevd-event[14494]: wait_for_sysfs: file '/sys/devices/pci0000:00/0000:00:0f.0/fw-host0/0050770e00071002/bus' appeared after 0 loops
I was able to mount and unmount the enclosed disk before the lockup struck.
Created attachment 129296 [details] kernel crash traceback
It's getting worse. This time the firewire enclosure was completely disconnected, yet as soon as I started haldaemon, the kernel crashed, as shown in the attachment. This is with kernel-2.6.16-1.2111_FC5smp.
Still present at kernel-smp-2.6.16-1.2133_FC5.
Still present at kernel-smp-2.6.17-1.2139_FC5.
A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem. This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed. Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug. In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details. If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613. If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem. Thank you.
Still present with 2.6.18-1.2200.fc5. Some improvement though--- the device could be hotplugged successfully, and it took more than 24 hours for the bug to hit.
Mike, thanks for the log in comment #10. This may be a bug in the kernel's ieee1394 base driver or in the kernel's sysfs code. I will ask around about sysfs_readdir.
On sysfs_readdir: http://lkml.org/lkml/2006/10/21/179
On ieee1394: http://lkml.org/lkml/2006/10/22/64
Created attachment 139112 [details] lshal output with offending device in USB mode
I had a look at the links above. Thanks, but I am unclear how to proceed at present, as the usual failure mode is a hard lock, not a crash. I have another piece of information though: the enclosure that is causing the trouble is dual firewire/USB, so I have switched it to USB mode. It's still locking. Here is the lshal output in the new configuration.
OK, so the bug is probably located in the kernel's sysfs and driver core code, and probably not in the FireWire and USB subsystems.
It may be possible to get a few last kernel log messages at the lock-up by switching to a text console (Ctrl Alt F1), logging in as root, and restarting klogd there:
# killall klogd; klogd
(This should run klogd in the background on this console, i.e. you can continue to enter commands in the console but will see the klogd output. But you may even get a panic message without diverting klogd.) Then plug in the device and wait for bad things to happen. Keep a camera to photograph the screen, or paper and pencil, at hand.
If this doesn't get you messages when the box locks up, you would need a serial console or netconsole. But this requires a second box and some amount of work to set up.
There is one other potential failure cause: the device could send bad data. If attached to FireWire, it could even send data to bad addresses. I don't know if this is possible on USB to the same extent. The host OS has only limited possibilities to check data and addresses. If you have a Windows PC to run a firmware upload utility on, you can check whether one of these updates applies: http://www.prolific.com.tw/eng/downloads.asp?ID=44
But please try to get a kernel panic message first.
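For the netconsole route mentioned above, a minimal sketch of what the setup could look like. Every address, port, interface name and MAC below is a placeholder I made up, and the helper function is only for illustration; the parameter layout follows the kernel's netconsole documentation (source port@source IP/interface, then target port@target IP/target MAC). The script only prints the modprobe command rather than running it.

```shell
#!/bin/sh
# Sketch only: assemble the netconsole= module parameter.
# All values passed in below are hypothetical placeholders.
netconsole_param() {
    # $1 local port, $2 local IP, $3 local interface,
    # $4 remote port, $5 remote IP, $6 remote MAC address
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

PARAM=$(netconsole_param 6665 192.168.0.10 eth0 6666 192.168.0.20 00:11:22:33:44:55)

# Dry run: show the command to execute as root on the machine that locks up.
echo "modprobe netconsole netconsole=$PARAM"

# Raising the console log level first (dmesg -n 8) makes it more likely that
# low-priority kernel messages are sent out as well.
echo "dmesg -n 8"
```

The second box then has to listen for UDP on the chosen target port, for example with netcat; the exact netcat flags differ between variants, so check its man page.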
> It may be possible to get a few last kernel log messages at the lock-up by
> switching to a text console (Ctrl Alt F1)
Alas no. With the current kernel, once the bug hits the machine is locked hard. No keyboard, no mouse, no response to ping(1). I will try leaving it running in console mode and see if it is spitting a crash message, but you are probably right that a serial console is more likely to be helpful.
The bad data idea is worth a look too, albeit less likely to be the root cause, as the enclosure works fine otherwise--- its disk has my music collection on it, which is in constant use when I am using the machine. That will take a while to organize... we are Windows-free.
(Of course I meant to put the console in front before a crash.)
Someone suggested dumping a kernel image for post mortem analysis:
http://lkml.org/lkml/2006/10/30/106
http://fedoraproject.org/wiki/FC6KdumpKexecHowTo
(However _I_ don't know how to proceed once you have such an image. I don't even know if you can get an image this way in the first place if the kernel freezes.)
(In reply to comment #22)
> Someone suggested to dump a kernel image for post mortem analysis:
> http://lkml.org/lkml/2006/10/30/106
> http://fedoraproject.org/wiki/FC6KdumpKexecHowTo
> (However _I_ don't know how to proceed once you had such an image. I don't even
> know if you can get an image this way in the first place if the kernel freezes.)
Yes, you *should* be able to get a vmcore this way (and if not, we'd like to figure out why). Once you have a vmcore, install the matching kernel-debuginfo packages, then you can use crash for analysis. Ex:
$ crash /usr/lib/debug/lib/modules/2.6.18-1.2798.fc6/vmlinux /var/crash/2006-11-01-07:43/vmcore
...
crash> bt
If you're at all familiar with gdb, you should be right at home, as crash is based around gdb.
Back from a week of management training:-(. I have followed the KdumpKexec Howto up to the point of starting the kdump service, but am stuck there:
[root@malbec ~]# service kdump start
Base address: 125f880 is not page aligned
/boot/grub/grub.conf contains the line:
kernel /vmlinuz-2.6.18-1.2200.fc5smp ro root=/dev/VolGroup01/LogVol00 acpi=off crashkernel=128M@16M
No time to chase this up right now.
Still no luck. I tried to reboot into the kernel provided in the kernel-kdump package. Oddly though, this does not seem to be in bzImage format as implied by the Howto.
43 Malbec> rpm -qil kernel-kdump | grep vmlin
/boot/vmlinux-2.6.18-1.2200.fc5kdump
I have added an entry for this kernel in grub.conf, but my machine refuses to boot it. The grub doco states that linux zImage or bzImage is required. I suspect the Howto is leaving out detail needed for FC5. At this point I am getting tempted to upgrade to FC6 before pushing this further.
You can't boot directly into the kdump kernel, it gets loaded via a 'kexec -p' command, which is what 'service kdump start' should be doing (among other things). The kdump kernel gets loaded from the context of your normal kernel, and boots directly, rather than going through the bios and grub. The problem here is the failure when you try to start up the kdump service. I have a suspicion there are some 2.6.18-isms that broke kexec-tools in FC5, which I don't think has been updated in a while. Rather than do a full FC6 upgrade, you might try just upgrading kexec-tools and your kernel to the FC6 versions. I'm reasonably certain everything still works with that combo... :) Just to clarify... You *did* reboot after adding the crashkernel parameter to grub, yes? Oh yeah, I finally dug out a firewire burner, so I'll be trying to reproduce the problem here as well...
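To make the "did you reboot with crashkernel=" check concrete, here is a small sketch. The has_kernel_arg helper is my own illustration, not part of kexec-tools. It only inspects the running kernel's command line; the kexec line at the end is left as a comment with placeholders, in the general form I recall from the kernel's kdump.txt, so check that file and kexec(8) for the exact options on your version.

```shell
#!/bin/sh
# Sketch only: succeed if a kernel command line contains "<name>=".
has_kernel_arg() {
    # $1 = argument name (e.g. crashkernel), $2 = full command line string
    case " $2 " in
        *" $1="*) return 0 ;;
        *) return 1 ;;
    esac
}

# The memory reservation only exists if the *running* kernel was booted
# with crashkernel=...; editing grub.conf without a reboot is not enough.
if [ -r /proc/cmdline ]; then
    if has_kernel_arg crashkernel "$(cat /proc/cmdline)"; then
        echo "crashkernel= is present on the running kernel's command line"
    else
        echo "crashkernel= is missing: edit grub.conf and reboot first"
    fi
fi

# Roughly what 'service kdump start' does under the hood (placeholders, not
# run here; needs root and a reserved crashkernel region):
# kexec -p <kdump-kernel-image> --initrd=<kdump-initrd> \
#     --args-linux --append="root=<root-dev> irqpoll"
```

If the check passes and the kexec load still fails, the problem is more likely in kexec-tools than in the boot configuration.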
Hrm, no problems at all w/a firewire burner connected for the past 3 hours, but then this is on an FC6 system, so I wouldn't be surprised if the newer hal helped the situation...
Re #27: Which wouldn't fix the underlying kernel bug, alas. Note that I never heard of a similar report. But then, (a) not everybody who experiences hard lockups of his Linux machine reports them at proper places and (b) I only watch FireWire related bugs, not USB related bugs. And (c) maybe the bug strikes only on older SMP hardware.
Indeed, could be papering over some holes, and certainly could be far more hardware-specific than "firewire burner". FWIW, the burner I've got is an oldish (~2yr) TEAC 40x CD-RW in an ADS Tech Pyro case, hooked to an HP xw9400 workstation (dual dual-core opteron procs), running an x86_64 FC6 install.
OK, I have discovered kdump.txt in the kernel doco and think I understand a lot better now (clarifying #26, yes I did reboot). However in parallel an unrelated showstopper has forced me to start the FC6 upgrade anyway, which was still going as I left for work. We can certainly hope a hal update will help. Just recapping: the attention on firewire is probably misplaced--- the device that causes trouble is a Prolific PL-3507 external firewire/USB disk enclosure (http://www.prolific.com.tw/eng/Products.asp?ID=9), currently containing a 300GB disk half full of .flac files. The lock occurs if it is connected in *either USB or firewire mode* AND haldaemon is running. The last few weeks of locks have been all USB.
It's certainly an issue with sysfs/ driver core.
Or does hald access the device and is issuing SCSI requests?
I cannot speak for what haldaemon does, but the device certainly provides a filesystem that is mounted through /dev/sd?1 for both USB and firewire. It's looking hopeful though that the FC6 upgrade has cured the problem (actually a fresh full install, as the upgrade stuffed up due to rpm corruption). The machine has now been up for two days solid, with the offending device in regular use and haldaemon up. Not quite back to my standard operating environment though, so claiming victory is still premature.
OK, the hang is back. I ran kernel-xen-2.6.18-1.2849.fc6 for a week with the device on USB without trouble, then another week on firewire without trouble. When I switched back to vanilla kernel-2.6.18-1.2849.fc6 and USB, the hang reappeared after two days. So, back to chasing that serial console...
Finally. I have a serial console. However I am not clear on how this was supposed to help. When the bug hits, the serial console is just as unresponsive as the main system--- no crash messages, no response. The bug is present but hard to trigger under kernel-xen-2.6.18-1.2868.fc6, but happens within minutes of boot under kernel-2.6.18-1.2868.fc6.
(In reply to comment #35)
> Finally. I have a serial console. However I am not clear on how this was
> supposed to help. When the bug hits, the serial console is just as
> unresponsive as the main system--- no crash messages, no response.
Assuming you've booted with something along the lines of 'console=ttyS0' in your kernel args, there's usually spew that will wind up on the serial console that could help to track down the problem.
Shouldn't this bug be renamed? Since USB and FireWire are both affected, it is more likely a bug in sysfs.
Yeah, given that both USB and FireWire devices cause the lockup, it seems more likely a generic sysfs bug than one USB bug and one FireWire bug that both result in the same lockup. Changing summary accordingly.
It's still a kernel bug though, not really a hal bug. Michael, what if you give the latest kernel from kernel.org a try? The latest stable one is 2.6.19.1 (soon 2.6.19.2), but you could as well try 2.6.20-rcX (-rc4 at the moment). There are driver core fixes and sysfs fixes going in now and then. I didn't watch closely, therefore I don't know if there is a hot candidate among the updates that might fix the issue. I'm merely poking in the dark here.
Reply to #39. I am planning to try the recently released 2.6.19/fc7 rawhide kernel rpm when time permits (source kernel builds are pretty tedious at 700MHz).
Reply to #36. Yes, I have console=ttyS1. It shows a fine collection of boot and `normal operation' messages, but no spew to coincide with the machine locking--- the last message is usually minutes old and from a random unrelated daemon.
Reply on renaming: The name is still a bit misleading. I have not seen the crash in sysfs_readdir since upgrading to FC6. It's just locking hard now.
Reply to #19(!). I finally got the disk enclosure firmware updated last week. It has not helped.
Whoops, meant to leave that assigned to kernel, not hal... I'll adjust the bug name again too...
Tried kernel-2.6.19-1.2888.fc6. It took a day to trigger the hang, and left no crash messages on the serial console.
Could this be related? "[PATCH 2.6.19.2] SCSI sd: udev accessing an uninitialized scsi_disk results in a crash", LSML/LKML on 2007-02-02, http://lkml.org/lkml/2007/2/2/94
The patch in comment #43 was released with kernel.org's linux-2.6.20.
Working on it... overnight compiles are a pain. What I need is a 2.6.20.fc6 rpm (not currently keen to grab rawhide fc7 rpms as they need a mkinitrd upgrade). For the record I still see the hang *rarely* on kernel-xen-2.6.19-1.2895.fc6, but reasonably commonly with the non-xen version, which is alas unusable due to a different problem with a dvb-capture card.
Created attachment 151425 [details] kernel crash traceback
Created attachment 154599 [details] syslog record of crash of 2.6.20-1.2944.fc6xen Unusual to crash a xen kernel.
Hope I have the right bug here.... After plugging in a firewire drive, the whole machine sometimes locks up shortly after mounting the drive. Nautilus pops up and the disk starts to be read, during which it completely locks up. Upon reboot, the last line in the log before the crash always says mounted drive on behalf of user, and that's it, which to me is normal output. No errors at all. It seems random in occurrence; it has happened 3 times in the last few days. If Nautilus displays the root of the drive, it is fine thereafter. Via the command line, I can 'ls' with no problem until the lockup (if there is one). To check the drive, I tried mounting it on OSX and it comes up with no problem. No log as such to add.
Darren, can you prevent the lockups if you disable haldaemon? (Mike's bug is caused by haldaemon accessing sysfs, not by disk I/O.)
Can this bug be closed?
Close it. The dodgy hardware in question has been retired.
Closed.