Red Hat Bugzilla – Bug 188140
Hard lockup using both USB and/or FireWire external HDs w/haldaemon on
Last modified: 2008-03-04 19:08:43 EST
Description of problem:
If I allow haldaemon to start, the system locks hard after a couple of
minutes. If haldaemon is disabled, no trouble.
kernel-smp-2.6.16-1.2080_FC5, ancient dual PIII Supermicro system (pre-APIC).
Version-Release number of selected component (if applicable): 0.5.7-3
How reproducible: 100%
Steps to Reproduce:
1. Boot system with haldaemon enabled.
Actual results:
After a few minutes, the system hangs and needs power cycling.
Expected results:
System does not hang.
Looks like one of your old components might have a bug that hal is triggering.
Can you attach an lshal output? Thanks.
Created attachment 127911 [details]
Here it is.
I rebooted with kernel 2.6.16-1.2096_FC5smp with the external firewire
enclosure switched off, then enabled haldaemon. No problem. I turned on the
firewire enclosure and the system locked. I rebooted with the firewire device
on and haldaemon off. No problem again. So, is there some way to blacklist
the firewire so hal does not see it?
What is the best way to handle this?
Obviously the right answer is to fix the Firewire kernel drivers but I doubt
that is going to happen anytime soon. I don't really have any other answer, my
position is that
1) it's certainly not a hal bug, it's a firewire bug in the kernel; and
2) having blacklists anywhere else than the kernel for this kind
of stuff is not the right answer
Suggest to reassign to the kernel component.
Reassigning as per comment #7
Problem still present with kernel-2.6.16-1.2111_FC5smp. The system was stable
overnight with haldaemon running, but within less than five minutes of
switching on the firewire enclosure, another hard lockup. /var/log/messages
shows udev finding the firewire device OK, however the kernel does not like
the look of it:
May 17 11:35:26 malbec udevd-event: udev_event_run: seq 700 finished
May 17 11:35:26 malbec udevd: udev_done: seq 700, pid  exit with
0, 1 seconds old
May 17 11:35:26 malbec gconfd (mpope-13551): Received signal 15, shutting down
May 17 11:35:27 malbec gconfd (mpope-13551): Exiting
May 17 11:35:32 malbec ainit:
May 17 11:35:32 malbec ainit:
May 17 11:35:52 malbec kernel: ieee1394: Error parsing configrom for node
May 17 11:36:03 malbec udevd: udev_event_run: seq 701 forked, pid
, 'add' 'ieee1394', 0 seconds old
May 17 11:36:03 malbec udevd-event: wait_for_sysfs: file
after 0 loops
I was able to mount and unmount the enclosed disk before the lockup struck.
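To pull the relevant lines out of /var/log/messages after a reboot, a small filter along these lines may help. It is only a sketch: the function name fw_events is made up for illustration, and the patterns are taken from the excerpt above rather than from any udev or kernel documentation.

```shell
# Hypothetical helper (not part of udev or hal): keep only kernel ieee1394
# messages and udevd / udevd-event lines from syslog text on standard input.
fw_events() {
    grep -E 'kernel: ieee1394:|udevd(-event)?:'
}
```

Running `fw_events < /var/log/messages | tail -n 20` after the forced reboot should then show the last FireWire and udev activity logged before the lockup.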
Created attachment 129296 [details]
kernel crash traceback
It's getting worse. This time the firewire enclosure was completely
disconnected, yet as soon as I started haldaemon, the kernel crashed, as shown
in the attachment. This is with kernel-2.6.16-1.2111_FC5smp.
Still present at kernel-smp-2.6.16-1.2133_FC5.
Still present at kernel-smp-2.6.17-1.2139_FC5.
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.
Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.
This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.
Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.
In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed. See bug 207474 for further details.
If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.
If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.
Still present with 2.6.18-1.2200.fc5. Some improvement though--- the device
could be hotplugged successfully, and it took more than 24 hours for the bug
Mike, thanks for the log in comment #10. This may be a bug in the kernel's
ieee1394 base driver or in the kernel's sysfs code. I will ask around about
On sysfs_readdir: http://lkml.org/lkml/2006/10/21/179
On ieee1394: http://lkml.org/lkml/2006/10/22/64
Created attachment 139112 [details]
lshal output with offending device in USB mode
I had a look at the links above. Thanks, but I am unclear how to proceed at
present, as the usual failure mode is a hard lock, not a crash.
I have another piece of information though: the enclosure that is causing the
trouble is dual firewire/USB, so I have switched it to USB mode. It's still
locking. Here is the lshal output in the new configuration.
OK, so the bug is probably located in the kernel's sysfs and driver core code,
and probably not in the FireWire and USB subsystems.
It may be possible to get a few last kernel log messages at the lock-up by
switching to a text console (Ctrl+Alt+F1), logging in as root, and restarting klogd there:
# killall klogd; klogd
(This should run klogd in background on this console, i.e. you can continue to
enter commands in the console but will see the klogd output. But you may even
get a panic message without diverting klogd.) Then plug in the device and wait
for bad things to happen. Keep a camera to photograph the screen or paper+pencil
If this doesn't get you messages when the box locks up, you would need a serial
console or netconsole. But this requires a second box and some amount of work to
There is one other potential failure cause: The device could send bad data. If
attached to FireWire, it could even send data to bad addresses. I don't know if
this is possible on USB to the same extent. The host OS has only limited
possibilities to check data and addresses. If you have a Windows PC to run a
firmware upload utility on, you can check whether one of these updates applies:
But please try to get a kernel panic message first.
> It may be possible to get a few last kernel log messages at the lock-up by
> switching to a text console (Ctrl+Alt+F1)
Alas no. With the current kernel, once the bug hits the machine is locked
hard. No keyboard, no mouse, no response to ping(1). I will try leaving it
running in console mode and see if it is spitting a crash message, but you are
probably right that a serial console is more likely to be helpful.
The bad data idea is worth a look too, albeit less likely to be the root
cause, as the enclosure works fine otherwise--- its disk has my music
collection on it, which is in constant use when I am using the machine. That
will take a while to organize... we are Windows-free.
(Of course I meant to put the console in front before a crash.)
Someone suggested to dump a kernel image for post mortem analysis:
(However _I_ don't know how to proceed once you had such an image. I don't even
know if you can get an image this way in the first place if the kernel freezes.)
(In reply to comment #22)
> Someone suggested to dump a kernel image for post mortem analysis:
> (However _I_ don't know how to proceed once you had such an image. I don't even
> know if you can get an image this way in the first place if the kernel freezes.)
Yes, you *should* be able to get a vmcore this way (and if not, we'd like to
figure out why). Once you have a vmcore, install the matching kernel-debuginfo
packages, then you can use crash for analysis. Ex:
$ crash /usr/lib/debug/lib/modules/2.6.18-1.2798.fc6/vmlinux
If you're at all familiar with gdb, you should be right at home, as crash is
based around gdb.
Back from a week of management training:-(. I have followed the KdumpKexec
Howto up to the point of starting the kdump service, but am stuck there:
[root@malbec ~]# service kdump start
Base address: 125f880 is not page aligned
/boot/grub/grub.conf contains the line:
kernel /vmlinuz-2.6.18-1.2200.fc5smp ro root=/dev/VolGroup01/LogVol00
No time to chase this up right now.
Still no luck. I tried to reboot into the kernel provided in the kernel-kdump
package. Oddly though, this does not seem to be in bzImage format as implied
by the Howto.
43 Malbec> rpm -qil kernel-kdump | grep vmlin
I have added an entry for this kernel in grub.conf, but my machine refuses to
boot it. The grub doco states that linux zImage or bzImage is required. I
suspect the Howto is leaving out detail needed for FC5. At this point I am
getting tempted to upgrade to FC6 before pushing this further.
You can't boot directly into the kdump kernel, it gets loaded via a 'kexec -p'
command, which is what 'service kdump start' should be doing (among other
things). The kdump kernel gets loaded from the context of your normal kernel,
and boots directly, rather than going through the bios and grub.
The problem here is the failure when you try to start up the kdump service. I
have a suspicion there are some 2.6.18-isms that broke kexec-tools in FC5, which
I don't think has been updated in a while. Rather than do a full FC6 upgrade,
you might try just upgrading kexec-tools and your kernel to the FC6 versions.
I'm reasonably certain everything still works with that combo... :)
Just to clarify... You *did* reboot after adding the crashkernel parameter to
Oh yeah, I finally dug out a firewire burner, so I'll be trying to reproduce the
problem here as well...
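The reboot question matters because the memory for the kdump kernel is reserved at boot, so the crashkernel parameter has to be on the command line of the kernel that is actually running. A minimal check could look like the sketch below. The function name has_crashkernel is hypothetical, and the value 64M@16M in the usage note is only the example commonly given in the kernel's kdump.txt, not a recommendation for this machine.

```shell
# Hypothetical helper: succeed if the kernel command line passed as $1
# contains a crashkernel= parameter, fail otherwise.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On the affected box, `has_crashkernel "$(cat /proc/cmdline)" && echo reserved` would print "reserved" only if the running kernel was booted with something like crashkernel=64M@16M.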
Hrm, no problems at all w/a firewire burner connected for the past 3 hours, but
then this is on an FC6 system, so I wouldn't be surprised if the newer hal
helped the situation...
Re #27: Which wouldn't fix the underlying kernel bug, alas.
Note that I never heard of a similar report. But then, (a) not everybody who
experiences hard lockups of his Linux machine reports them at proper places and
(b) I only watch FireWire related bugs, not USB related bugs. And (c) maybe the
bug strikes only on older SMP hardware.
Indeed, could be papering over some holes, and certainly could be far more
hardware-specific than "firewire burner". FWIW, the burner I've got is an oldish
(~2yr) TEAC 40x CD-RW in an ADS Tech Pyro case, hooked to an HP xw9400
workstation (dual dual-core opteron procs), running an x86_64 FC6 install.
OK, I have discovered kdump.txt in the kernel doco and think I understand a
lot better now (clarifying #26, yes I did reboot). However in parallel an
unrelated showstopper has forced me to start the FC6 upgrade anyway, which was
still going as I left for work. We can certainly hope a hal update will help.
Just recapping: the attention on firewire is probably misplaced--- the device
that causes trouble is a Prolific PL-3507 external firewire/USB disk enclosure
(http://www.prolific.com.tw/eng/Products.asp?ID=9), currently containing a
300GB disk half full of .flac files. The lock occurs if it is connected in
*either USB or firewire mode* AND haldaemon is running. The last few weeks of
locks have been all USB.
It's certainly an issue with sysfs/ driver core.
Or does hald access the device and issue SCSI requests?
I can not speak for what haldaemon does, but the device certainly provides a
filesystem that is mounted through /dev/sd?1 for both USB and firewire.
It's looking hopeful though that the FC6 upgrade has cured the problem
(actually a fresh full install, as the upgrade stuffed up due to rpm
corruption). The machine has now been up for two days solid, with the
offending device in regular use and haldaemon up. Not quite back to my
standard operating environment though, so claiming victory is still premature.
OK, the hang is back. I ran kernel-xen-2.6.18-1.2849.fc6 for a week with the
device on USB without trouble, then another week on firewire without trouble.
When I switched back to vanilla kernel-2.6.18-1.2849.fc6 and USB, the hang
reappeared after two days. So, back to chasing that serial console...
Finally. I have a serial console. However I am not clear on how this was
supposed to help. When the bug hits, the serial console is just as
unresponsive as the main system--- no crash messages, no response.
The bug is present but hard to trigger under kernel-xen-2.6.18-1.2868.fc6,
but happens within minutes of boot under kernel-2.6.18-1.2868.fc6.
(In reply to comment #35)
> Finally. I have a serial console. However I am not clear on how this was
> supposed to help. When the bug hits, the serial console is just as
> unresponsive as the main system--- no crash messages, no response.
Assuming you've booted with something along the lines of 'console=ttyS0' in your
kernel args, there's usually spew that will wind up on the serial console that
could help to track down the problem.
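For reference, a grub.conf kernel line with a serial console might look like the fragment below. This is a sketch only: the kernel image and root device are copied from the grub.conf line quoted earlier in this bug, and the port and the 115200n8 speed setting are assumptions that must match the actual serial setup. The kernel prints its messages to every console listed, and the last one named becomes /dev/console.

```
kernel /vmlinuz-2.6.18-1.2200.fc5smp ro root=/dev/VolGroup01/LogVol00 console=tty0 console=ttyS0,115200n8
```

With both entries present, messages still appear on the local screen while a second machine captures the same output over the serial line.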
Shouldn't this bug be renamed? Since USB and FireWire are both affected, it is
more likely a bug in sysfs.
Yeah, given that both USB and FireWire devices cause the lockup, it seems more
likely a generic sysfs bug than one USB bug and one FireWire bug that both
result in the same lockup. Changing summary accordingly.
It's still a kernel bug though, not really a hal bug.
Michael, what if you give the latest kernel from kernel.org a try? The latest
stable one is 126.96.36.199 (soon 188.8.131.52), but you could as well try 2.6.20-rcX
(-rc4 at the moment). There are driver core fixes and sysfs fixes going in now
and then. I didn't watch closely, therefore I don't know if there is a hot
candidate among the updates that might fix the issue. I'm merely poking in the
Reply to #39. I am planning to try the recently released 2.6.19/fc7 rawhide
kernel rpm when time permits (source kernel builds are pretty tedious at
Reply to #36. Yes, I have console=ttyS1. It shows a fine collection of boot
and `normal operation' messages, but no spew to coincide with the machine
locking--- the last message is usually minutes old and from a random unrelated
Reply on renaming: The name is still a bit misleading. I have not seen the
crash in sysfs_readdir since upgrading to FC6. It's just locking hard now.
Reply to #19(!). I finally got the disk enclosure firmware updated last week.
It has not helped.
Whoops, meant to leave that assigned to kernel, not hal... I'll adjust the bug name again too...
Tried kernel-2.6.19-1.2888.fc6. It took a day to trigger the hang, and left
no crash messages on the serial console.
Could this be related?
"[PATCH 184.108.40.206] SCSI sd: udev accessing an uninitialized scsi_disk results in
a crash", LSML/LKML on 2007-02-02, http://lkml.org/lkml/2007/2/2/94
The patch in comment #43 was released with kernel.org's linux-2.6.20.
Working on it... overnight compiles are a pain. What I need is a 2.6.20.fc6
rpm (not currently keen to grab rawhide fc7 rpms as they need a mkinitrd
upgrade). For the record I still see the hang *rarely* on
kernel-xen-2.6.19-1.2895.fc6, but reasonably commonly with the non-xen
version, which is alas unusable due to a different problem with a dvb-capture
Created attachment 151425 [details]
kernel crash traceback
Created attachment 154599 [details]
syslog record of crash of 2.6.20-1.2944.fc6xen
Unusual to crash a xen kernel.
Hope I have the right bug here....
After plugging in a firewire drive, the whole machine sometimes locks up
shortly after mounting the drive. Nautilus pops up and the disk starts to be
read during which it completely locks up.
Upon reboot, the last line in the log before the crash always says the drive was mounted on
behalf of the user, and that's it, which to me is normal output. No errors at all.
It seems random in occurrence; it has happened 3 times in the last few days. If
Nautilus displays the root of the drive it is fine thereafter.
From the command line, I can 'ls' without problems until the lockup (if there is one).
To check the drive, I tried mounting it on OS X, and it came up with no problem. No log
as such to add.
Darren, can you prevent the lockups if you disable haldaemon? (Mike's bug is
caused by haldaemon accessing sysfs, not by disk I/O.)
Can this bug be closed?
Close it. The dodgy hardware in question has been retired.