188140 – Hard lockup using both USB and/or FireWire external HDs w/haldaemon on

Summary: Hard lockup using both USB and/or FireWire external HDs w/haldaemon on

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-04-06 12:06 UTC by Mike Pope
Modified:	2008-03-05 00:08 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-03-05 00:08:43 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
lshal outpu (80.33 KB, text/plain) 2006-04-18 09:16 UTC, Radek Vokál
lshal output (119.27 KB, text/plain) 2006-04-18 10:57 UTC, Mike Pope
kernel crash traceback (5.03 KB, application/mbox) 2006-05-17 03:57 UTC, Mike Pope
lshal output with offending device in USB mode (121.29 KB, text/plain) 2006-10-23 10:34 UTC, Mike Pope
kernel crash traceback (3.53 KB, text/plain) 2007-04-02 13:56 UTC, Mike Pope
syslog record of crash of 2.6.20-1.2944.fc6xen (6.00 KB, text/plain) 2007-05-13 03:47 UTC, Mike Pope

Description Mike Pope 2006-04-06 12:06:23 UTC

Description of problem: 
If I allow haldaemon to start, the system locks hard after a couple of 
minutes.  If haldaemon is disabled, no trouble. 
kernel-smp-2.6.16-1.2080_FC5, ancient dual PIII Supermicro system (pre-APIC). 
 
 
Version-Release number of selected component (if applicable): 0.5.7-3 
 
 
How reproducible: 100% 
 
 
Steps to Reproduce: 
1. Boot system with haldaemon enabled.
 
   
Actual results: 
After a few minutes, the system hangs and needs power cycling. 
 
Expected results: 
System does not hang. 
 
Additional info:

Comment 1 John (J5) Palmieri 2006-04-17 21:17:11 UTC

Looks like one of your old components might have a bug that hal is triggering. 
Can you attach an lshal output?  Thanks.

Comment 4 Mike Pope 2006-04-18 10:57:11 UTC

Created attachment 127911
lshal output

Here it is.

Comment 5 Mike Pope 2006-05-02 10:55:20 UTC

I rebooted with kernel 2.6.16-1.2096_FC5smp with the external firewire 
enclosure switched off, then enabled haldaemon.  No problem.  I turned on the 
firewire enclosure and the system locked.  I rebooted with the firewire device 
on and haldaemon off.  No problem again.  So, is there some way to blacklist 
the firewire so hal does not see it?

Comment 6 John (J5) Palmieri 2006-05-02 14:27:33 UTC

David,

What is the best way to handle this?

Comment 7 David Zeuthen 2006-05-02 15:03:50 UTC

Obviously the right answer is to fix the Firewire kernel drivers but I doubt
that is going to happen anytime soon. I don't really have any other answer, my
position is that 

 1) it's certainly not a hal bug, it's a firewire bug in the kernel; and 

 2) having blacklists anywhere else than the kernel for this kind
    of stuff is not the right answer

Suggest to reassign to the kernel component.

Comment 8 John (J5) Palmieri 2006-05-02 15:51:20 UTC

Reassigning as per comment #7

Comment 9 Mike Pope 2006-05-17 03:45:57 UTC

Problem still present with kernel-2.6.16-1.2111_FC5smp.  The system was stable 
overnight with haldaemon running, but within less than five minutes of 
switching on the firewire enclosure, another hard lockup.  /var/log/messages 
shows udev finding the firewire device OK, however the kernel does not like 
the look of it:

May 17 11:35:26 malbec udevd-event[14450]: udev_event_run: seq 700 finished
May 17 11:35:26 malbec udevd[421]: udev_done: seq 700, pid [14450] exit with 0, 1 seconds old
May 17 11:35:26 malbec gconfd (mpope-13551): Received signal 15, shutting down cleanly
May 17 11:35:27 malbec gconfd (mpope-13551): Exiting
May 17 11:35:32 malbec ainit: 
May 17 11:35:32 malbec ainit: 
May 17 11:35:52 malbec kernel: ieee1394: Error parsing configrom for node 0-00:1023
May 17 11:36:03 malbec udevd[421]: udev_event_run: seq 701 forked, pid [14494], 'add' 'ieee1394', 0 seconds old
May 17 11:36:03 malbec udevd-event[14494]: wait_for_sysfs: file '/sys/devices/pci0000:00/0000:00:0f.0/fw-host0/0050770e00071002/bus' appeared after 0 loops

I was able to mount and unmount the enclosed disk before the lockup struck.
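
For triage, a log excerpt like the one above can be narrowed down to the lines the kernel itself logged. The following is a minimal Python sketch, not a tool used in this report. It assumes the classic "Mon DD HH:MM:SS host tag[pid]: message" syslog layout and that wrapped lines have been rejoined; the names SYSLOG_RE and kernel_messages are hypothetical.

```python
import re

# Assumed layout: "May 17 11:35:52 malbec kernel: ieee1394: Error ..."
# The optional "[pid]" part covers tags such as "udevd[421]".
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+"
    r"(?P<tag>[^:\[]+)(?:\[(?P<pid>\d+)\])?:\s*(?P<msg>.*)$"
)


def kernel_messages(lines):
    """Return (timestamp, message) pairs for lines whose tag is 'kernel'."""
    found = []
    for line in lines:
        match = SYSLOG_RE.match(line)
        if match and match.group("tag") == "kernel":
            found.append((match.group("stamp"), match.group("msg")))
    return found


# Three lines taken from the excerpt above, rejoined where they were wrapped.
sample = [
    "May 17 11:35:27 malbec gconfd (mpope-13551): Exiting",
    "May 17 11:35:52 malbec kernel: ieee1394: Error parsing configrom "
    "for node 0-00:1023",
    "May 17 11:36:03 malbec udevd[421]: udev_event_run: seq 701 forked, "
    "pid [14494], 'add' 'ieee1394', 0 seconds old",
]

for stamp, msg in kernel_messages(sample):
    print(stamp, msg)
```

Only the ieee1394 configrom error survives the filter, which makes it easier to line up kernel complaints against the udev activity around them.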

Comment 10 Mike Pope 2006-05-17 03:57:55 UTC

Created attachment 129296
kernel crash traceback

It's getting worse.  This time the firewire enclosure was completely
disconnected, yet as soon as I started haldaemon, the kernel crashed, as shown
in the attachment.  This is with kernel-2.6.16-1.2111_FC5smp.

Comment 11 Mike Pope 2006-06-21 12:57:15 UTC

Still present at kernel-smp-2.6.16-1.2133_FC5.

Comment 12 Mike Pope 2006-07-06 11:21:12 UTC

Still present at kernel-smp-2.6.17-1.2139_FC5.

Comment 13 Dave Jones 2006-10-16 19:31:50 UTC

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 14 Mike Pope 2006-10-18 12:01:27 UTC

Still present with 2.6.18-1.2200.fc5.  Some improvement though: the device 
could be hotplugged successfully, and it took more than 24 hours for the bug 
to hit.

Comment 15 Stefan Richter 2006-10-21 08:26:52 UTC

Mike, thanks for the log in comment #10. This may be a bug in the kernel's
ieee1394 base driver or in the kernel's sysfs code. I will ask around about
sysfs_readdir.

Comment 16 Stefan Richter 2006-10-22 15:14:55 UTC

On sysfs_readdir: http://lkml.org/lkml/2006/10/21/179
On ieee1394: http://lkml.org/lkml/2006/10/22/64

Comment 17 Mike Pope 2006-10-23 10:34:48 UTC

Created attachment 139112
lshal output with offending device in USB mode

Comment 18 Mike Pope 2006-10-23 10:36:46 UTC

I had a look at the links above.  Thanks, but I am unclear how to proceed at 
present, as the usual failure mode is a hard lock, not a crash.

I have another piece of information though: the enclosure that is causing the 
trouble is dual firewire/USB, so I have switched it to USB mode.  It's still 
locking.  Here is the lshal output in the new configuration.

Comment 19 Stefan Richter 2006-10-23 11:34:29 UTC

OK, so the bug is probably located in the kernel's sysfs and driver core code,
and probably not in the FireWire and USB subsystems.

It may be possible to get a few last kernel log messages at the lock-up by
switching to a text console (Ctrl Alt F1), logging in as root, and restarting
klogd there:
# killall klogd; klogd
(This should run klogd in the background on this console, i.e. you can continue
to enter commands in the console but will see the klogd output. But you may even
get a panic message without diverting klogd.) Then plug in the device and wait
for bad things to happen. Keep a camera to photograph the screen, or paper and
pencil, at hand.

If this doesn't get you messages when the box locks up, you would need a serial
console or netconsole. But this requires a second box and some amount of work to
set up.

There is one other potential failure cause: The device could send bad data. If
attached to FireWire, it could even send data to bad addresses. I don't know if
this is possible on USB to the same extent. The host OS has only limited
possibilities to check data and addresses. If you have a Windows PC to run a
firmware upload utility on, you can check whether one of these updates applies:
http://www.prolific.com.tw/eng/downloads.asp?ID=44
But please try to get a kernel panic message first.
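
The netconsole route mentioned above needs something on the second box to receive the messages, since netconsole sends kernel log lines as UDP datagrams. The following is a minimal Python sketch of such a receiver, under stated assumptions: port 6666 is the documented netconsole default target port but may differ in a given setup, and the names format_datagram and listen are hypothetical. It is not the tool anyone in this report used.

```python
import socket


def format_datagram(payload: bytes) -> str:
    """Decode one received datagram, replacing any undecodable bytes."""
    return payload.decode("utf-8", errors="replace").rstrip("\n")


def listen(port: int = 6666, bind_addr: str = "0.0.0.0") -> None:
    """Print every UDP datagram received on `port` until interrupted.

    The default port is an assumption; use whatever target port the
    sending machine's netconsole configuration names.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, port))
        while True:
            payload, (sender, _sender_port) = sock.recvfrom(65535)
            # flush=True so the last lines before a lock-up are not lost
            # in an output buffer on the receiving side.
            print(f"[{sender}] {format_datagram(payload)}", flush=True)
```

Calling listen() on the second box and redirecting its output to a file would keep whatever the failing machine manages to send before it stops responding.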

Comment 20 Mike Pope 2006-10-24 00:35:26 UTC

> It may be possible to get a few last kernel log messages at the lock-up by
> switching to a text console (Ctrl Alt F1)

Alas no.  With the current kernel, once the bug hits the machine is locked 
hard.  No keyboard, no mouse, no response to ping(1).  I will try leaving it 
running in console mode and see if it is spitting a crash message, but you are 
probably right that a serial console is more likely to be helpful.

The bad data idea is worth a look too, albeit less likely to be the root 
cause, as the enclosure works fine otherwise: its disk has my music 
collection on it, which is in constant use when I am using the machine.  The 
firmware update will take a while to organize, as we are Windows-free.

Comment 21 Stefan Richter 2006-10-24 05:25:52 UTC

(Of course I meant to put the console in front before a crash.)

Comment 22 Stefan Richter 2006-10-30 15:42:15 UTC

Someone suggested dumping a kernel image for post-mortem analysis:
http://lkml.org/lkml/2006/10/30/106
http://fedoraproject.org/wiki/FC6KdumpKexecHowTo?highlight=%28FC6KdumpKexecHowTo%29
(However _I_ don't know how to proceed once you have such an image. I don't even
know if you can get an image this way in the first place if the kernel freezes.)

Comment 23 Jarod Wilson 2006-11-01 18:08:23 UTC

(In reply to comment #22)
> Someone suggested to dump a kernel image for post mortem analysis:
> http://lkml.org/lkml/2006/10/30/106
> http://fedoraproject.org/wiki/FC6KdumpKexecHowTo
> (However _I_ don't know how to proceed once you had such an image. I don't even
> know if you can get an image this way in the first place if the kernel freezes.)

Yes, you *should* be able to get a vmcore this way (and if not, we'd like to
figure out why). Once you have a vmcore, install the matching kernel-debuginfo
packages, then you can use crash for analysis. Ex:

$ crash /usr/lib/debug/lib/modules/2.6.18-1.2798.fc6/vmlinux \
    /var/crash/2006-11-01-07:43/vmcore
...
crash> bt

If you're at all familiar with gdb, you should be right at home, as crash is
based around gdb.

Comment 24 Mike Pope 2006-11-06 21:42:21 UTC

Back from a week of management training:-(.  I have followed the KdumpKexec 
Howto up to the point of starting the kdump service, but am stuck there:

[root@malbec ~]# service kdump start
Base address: 125f880 is not page aligned

/boot/grub/grub.conf contains the line:
        kernel /vmlinuz-2.6.18-1.2200.fc5smp ro root=/dev/VolGroup01/LogVol00 acpi=off crashkernel=128M@16M

No time to chase this up right now.
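
The arithmetic behind the "not page aligned" complaint can be checked by hand. The following is a minimal Python sketch, assuming 4 KiB pages (the usual size on i386) and assuming the reported address is hexadecimal; the names is_page_aligned, parse_size and parse_crashkernel are hypothetical. It only illustrates the check and says nothing about where kexec-tools obtained that base address.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual size on i386


def is_page_aligned(address: int, page_size: int = PAGE_SIZE) -> bool:
    """True if `address` is an exact multiple of the page size."""
    return address % page_size == 0


def parse_size(text: str) -> int:
    """Convert a size such as '128M' or '16M' into bytes."""
    units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


def parse_crashkernel(value: str):
    """Split a 'size@offset' value into (size_bytes, offset_bytes)."""
    size, _, offset = value.partition("@")
    return parse_size(size), parse_size(offset)


reported = 0x125F880  # the address from the kdump error, read as hex
size, offset = parse_crashkernel("128M@16M")

print(hex(reported % PAGE_SIZE))  # bytes past the previous page boundary
print(is_page_aligned(reported), is_page_aligned(offset))
```

Under these assumptions the 16M offset requested on the kernel command line is itself page aligned, while the address in the error message is not, so the misaligned value is not simply the crashkernel offset.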

Comment 25 Mike Pope 2006-11-08 10:13:44 UTC

Still no luck.  I tried to reboot into the kernel provided in the kernel-kdump 
package.  Oddly though, this does not seem to be in bzImage format as implied 
by the Howto.

  43 Malbec> rpm -qil kernel-kdump | grep vmlin
  /boot/vmlinux-2.6.18-1.2200.fc5kdump

I have added an entry for this kernel in grub.conf, but my machine refuses to 
boot it.  The grub doco states that linux zImage or bzImage is required.  I 
suspect the Howto is leaving out detail needed for FC5.  At this point I am 
getting tempted to upgrade to FC6 before pushing this further.

Comment 26 Jarod Wilson 2006-11-08 16:05:30 UTC

You can't boot directly into the kdump kernel; it gets loaded via a 'kexec -p'
command, which is what 'service kdump start' should be doing (among other
things). The kdump kernel gets loaded from the context of your normal kernel,
and boots directly, rather than going through the bios and grub.

The problem here is the failure when you try to start up the kdump service. I
have a suspicion there are some 2.6.18-isms that broke kexec-tools in FC5, which
I don't think has been updated in a while. Rather than do a full FC6 upgrade,
you might try just upgrading kexec-tools and your kernel to the FC6 versions.
I'm reasonably certain everything still works with that combo... :)

Just to clarify... You *did* reboot after adding the crashkernel parameter to
grub, yes?

Oh yeah, I finally dug out a firewire burner, so I'll be trying to reproduce the
problem here as well...

Comment 27 Jarod Wilson 2006-11-08 19:05:18 UTC

Hrm, no problems at all w/a firewire burner connected for the past 3 hours, but
then this is on an FC6 system, so I wouldn't be surprised if the newer hal
helped the situation...

Comment 28 Stefan Richter 2006-11-08 19:45:03 UTC

Re #27: Which wouldn't fix the underlying kernel bug, alas.

Note that I never heard of a similar report. But then, (a) not everybody who
experiences hard lockups of his Linux machine reports them at proper places and
(b) I only watch FireWire related bugs, not USB related bugs. And (c) maybe the
bug strikes only on older SMP hardware.

Comment 29 Jarod Wilson 2006-11-08 20:22:33 UTC

Indeed, could be papering over some holes, and certainly could be far more
hardware-specific than "firewire burner". FWIW, the burner I've got is an oldish
(~2yr) TEAC 40x CD-RW in an ADS Tech Pyro case, hooked to an HP xw9400
workstation (dual dual-core opteron procs), running an x86_64 FC6 install.

Comment 30 Mike Pope 2006-11-08 23:44:12 UTC

OK, I have discovered kdump.txt in the kernel doco and think I understand a 
lot better now (clarifying #26, yes I did reboot).  However in parallel an 
unrelated showstopper has forced me to start the FC6 upgrade anyway, which was 
still going as I left for work.  We can certainly hope a hal update will help.

Just recapping: the attention on firewire is probably misplaced. The device 
that causes trouble is a Prolific PL-3507 external firewire/USB disk enclosure 
(http://www.prolific.com.tw/eng/Products.asp?ID=9), currently containing a 
300GB disk half full of .flac files.  The lock occurs if it is connected in 
*either USB or firewire mode* AND haldaemon is running.  The last few weeks of 
locks have been all USB.

Comment 31 Stefan Richter 2006-11-09 00:19:14 UTC

It's certainly an issue with sysfs/driver core.

Comment 32 Stefan Richter 2006-11-09 16:40:03 UTC

Or does hald access the device and issue SCSI requests?

Comment 33 Mike Pope 2006-11-15 23:29:34 UTC

I cannot speak for what haldaemon does, but the device certainly provides a 
filesystem that is mounted through /dev/sd?1 for both USB and firewire.

It's looking hopeful though that the FC6 upgrade has cured the problem 
(actually a fresh full install, as the upgrade stuffed up due to rpm 
corruption).  The machine has now been up for two days solid, with the 
offending device in regular use and haldaemon up.  Not quite back to my 
standard operating environment though, so claiming victory is still premature.

Comment 34 Mike Pope 2006-12-01 10:48:10 UTC

OK, the hang is back.  I ran kernel-xen-2.6.18-1.2849.fc6 for a week with the 
device on USB without trouble, then another week on firewire without trouble.  
When I switched back to vanilla kernel-2.6.18-1.2849.fc6 and USB, the hang 
reappeared after two days.  So, back to chasing that serial console...

Comment 35 Mike Pope 2006-12-30 11:22:10 UTC

Finally.  I have a serial console.  However I am not clear on how this was 
supposed to help.  When the bug hits, the serial console is just as 
unresponsive as the main system: no crash messages, no response.

The bug is present but hard to trigger under kernel-xen-2.6.18-1.2868.fc6,
but happens within minutes of boot under kernel-2.6.18-1.2868.fc6.

Comment 36 Jarod Wilson 2007-01-02 15:44:48 UTC

(In reply to comment #35)
> Finally.  I have a serial console.  However I am not clear on how this was 
> supposed to help.  When the bug hits, the serial console is just as 
> unresponsive as the main system--- no crash messages, no response.

Assuming you've booted with something along the lines of 'console=ttyS0' in your
kernel args, there's usually spew that will wind up on the serial console that
could help to track down the problem.

Comment 37 Stefan Richter 2007-01-05 13:09:43 UTC

Shouldn't this bug be renamed? Since USB and FireWire are both affected, it is
more likely a bug in sysfs.

Comment 38 Jarod Wilson 2007-01-08 22:22:25 UTC

Yeah, given that both USB and FireWire devices cause the lockup, it seems more
likely a generic sysfs bug than one USB bug and one FireWire bug that both
result in the same lockup. Changing summary accordingly.

Comment 39 Stefan Richter 2007-01-08 23:02:06 UTC

It's still a kernel bug though, not really a hal bug.

Michael, what if you give the latest kernel from kernel.org a try? The latest
stable one is 2.6.19.1 (soon 2.6.19.2), but you could as well try 2.6.20-rcX
(-rc4 at the moment). There are driver core fixes and sysfs fixes going in now
and then. I didn't watch closely, therefore I don't know if there is a hot
candidate among the updates that might fix the issue. I'm merely poking in the
dark here.

Comment 40 Mike Pope 2007-01-09 00:07:39 UTC

Reply to #39.  I am planning to try the recently released 2.6.19/fc7 rawhide 
kernel rpm when time permits (source kernel builds are pretty tedious at 
700MHz).

Reply to #36.  Yes, I have console=ttyS1.  It shows a fine collection of boot 
and `normal operation' messages, but no spew to coincide with the machine 
locking: the last message is usually minutes old and from a random unrelated 
daemon.

Reply on renaming: The name is still a bit misleading.  I have not seen the 
crash in sysfs_readdir since upgrading to FC6.  It's just locking hard now.

Reply to #19(!).  I finally got the disk enclosure firmware updated last week.  
It has not helped.

Comment 41 Jarod Wilson 2007-01-09 17:01:58 UTC

Whoops, meant to leave that assigned to kernel, not hal... I'll adjust the bug name again too...

Comment 42 Mike Pope 2007-01-12 10:08:24 UTC

Tried kernel-2.6.19-1.2888.fc6.  It took a day to trigger the hang, and left 
no crash messages on the serial console.

Comment 43 Stefan Richter 2007-02-02 12:57:56 UTC

Could this be related?
"[PATCH 2.6.19.2] SCSI sd:  udev accessing an uninitialized scsi_disk results in
a crash", LSML/LKML on 2007-02-02, http://lkml.org/lkml/2007/2/2/94

Comment 44 Stefan Richter 2007-02-04 20:38:16 UTC

The patch in comment #43 was released with kernel.org's linux-2.6.20.

Comment 45 Mike Pope 2007-02-06 11:44:28 UTC

Working on it... overnight compiles are a pain.  What I need is a 2.6.20.fc6 
rpm (not currently keen to grab rawhide fc7 rpms as they need a mkinitrd 
upgrade).  For the record I still see the hang *rarely* on 
kernel-xen-2.6.19-1.2895.fc6, but reasonably commonly with the non-xen 
version, which is alas unusable due to a different problem with a dvb-capture 
card.

Comment 46 Mike Pope 2007-04-02 13:56:19 UTC

Created attachment 151425
kernel crash traceback

Comment 47 Mike Pope 2007-05-13 03:47:55 UTC

Created attachment 154599
syslog record of crash of 2.6.20-1.2944.fc6xen

Unusual to crash a xen kernel.

Comment 48 Darren Naessens 2007-10-18 21:02:34 UTC

Hope I have the right bug here....

After plugging in a firewire drive, the whole machine sometimes locks up
shortly after mounting the drive. Nautilus pops up and the disk starts to be
read, during which it completely locks up. 
Upon reboot, the last line in the log before the crash always says the drive
was mounted on behalf of the user, and that's it, which to me is normal output.
No errors at all.

It seems random in occurrence; it has happened 3 times in the last few days. If
Nautilus displays the root of the drive, it is fine thereafter.

Via the command line, I can 'ls' with no problem until the lockup (if there is
a lockup).

To check the drive, I tried mounting it on OSX and it came up with no problem.
No log as such to add.

Comment 49 Stefan Richter 2007-10-18 21:15:04 UTC

Darren, can you prevent the lockups if you disable haldaemon?  (Mike's bug is
caused by haldaemon accessing sysfs, not by disk I/O.)

Comment 50 Stefan Richter 2008-03-04 16:32:36 UTC

Can this bug be closed?

Comment 51 Mike Pope 2008-03-04 23:49:06 UTC

Close it.  The dodgy hardware in question has been retired.

Comment 52 Jarod Wilson 2008-03-05 00:08:43 UTC

Closed.
