Red Hat Bugzilla – Bug 103821
loading firewire drivers while external hard disk is connected causes several problems
Last modified: 2007-04-18 12:57:24 EDT
If I boot the system without a firewire disk plugged in, then, after the boot
completes, I plug the disk in (a Maxtor 5000DV) and run the rescan-scsi-bus.sh
script, everything seems to work just fine.
However, I'd like to be able to use the external disk for devices used at boot
time, so I arrange for initrd to load scsi_mod, sd_mod, ieee1394, ohci1394 and
sbp2, in this order. Sometimes sbp2 simply fails to log in to the disk, and the
boot proceeds to attempt to mount the root filesystem and fails at that because
the external disk isn't available (or drops the raid1 replicas on it and
requires me to resync them). However, most often it does manage to log in,
but then, every time it does, it hangs before running the next command in
(This is similar in behavior to what I get when I boot with the disk unplugged,
rmmod all of ieee1394, ohci1394 and sbp2, plug the disk in, then modprobe
ohci1394: this for some reason loads sbp2, which in turn gets stuck in
`initializing' state. From then on, attempting to read from
/proc/bus/ieee1394/devices blocks forever. This is what causes `initializing
firewire' and `probing for new hardware' to hang.)
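
A hang like this can be probed without wedging the shell that does the probing. The following is a minimal sketch, not something taken from this report: `probe_read` is a hypothetical helper name and the 5-second default is an arbitrary assumption. It runs the read in the background and polls for it, rather than waiting on it directly, because a reader blocked inside the kernel may not respond to signals.

```shell
#!/bin/sh
# probe_read FILE [SECONDS]: hypothetical helper, not part of any package.
# Prints "ok" if reading FILE finishes (successfully or not) within SECONDS
# (default 5), and "hung" if the reader is still around after that.
probe_read() {
    file=$1
    limit=${2:-5}
    cat "$file" >/dev/null 2>&1 &
    reader=$!
    waited=0
    while [ "$waited" -lt "$limit" ]; do
        # kill -0 sends no signal; it only tests whether the reader exists.
        if ! kill -0 "$reader" 2>/dev/null; then
            echo ok
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    if ! kill -0 "$reader" 2>/dev/null; then
        echo ok
        return 0
    fi
    # Best effort only: a reader stuck in the kernel may not go away.
    kill "$reader" 2>/dev/null
    echo hung
    return 1
}
```

For example, `probe_read /proc/bus/ieee1394/devices 10` would report whether the read described above completes within ten seconds.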
I've managed to arrange for the boot to complete by reordering the modules in
initrd's linuxrc, by loading sbp2 after ieee1394 but *before* ohci1394. Then,
by adding some `echo "scsi add-single-device ..." > /proc/scsi/scsi' commands
after we mount /proc, I even managed to get partitions in the external hard disk
to be usable as members of raid devices that are physical volumes of the
volume group that contains the logical volume holding my root filesystem! Whee!
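
The reordering above can be sketched as a small shell function. This is an illustration only, under stated assumptions: `load_firewire_modules` and the `DRY_RUN` variable are hypothetical names, and `modprobe` is assumed to be available (an actual linuxrc may need insmod with explicit module paths instead). The `scsi add-single-device` step is left out because its arguments are not given above.

```shell
#!/bin/sh
# Sketch of the reordered module loading. Hypothetical helper; it only
# prints the commands unless DRY_RUN=0 is set in the environment.
load_firewire_modules() {
    # sbp2 goes after ieee1394 but *before* ohci1394.
    for module in scsi_mod sd_mod ieee1394 sbp2 ohci1394; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "modprobe $module"
        else
            # Assumes modprobe exists in this environment.
            modprobe "$module" || return 1
        fi
    done
}
```

Calling `load_firewire_modules` with no environment changes just prints the five commands in order, which makes it easy to check the sequence before running it for real.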
Unfortunately, loading ohci1394 from initrd seems to be a perfect recipe to get
/proc/bus/ieee1394/devices to get stuck, so I had to add nofirewire to the
kernel boot command line, such that rc.sysinit wouldn't attempt to read from it
(it doesn't actually disable firewire, since the modules are already loaded),
and to chkconfig kudzu off. Then boot completes, and the system is usable, to a
Unfortunately, there seems to still be a hidden problem somewhere. The system
load never goes below 1.00, but there's no user process consuming CPU, and the
external disk is totally inactive. If I raidhotadd the raid1 partitions in the
external disk to the corresponding raid devices, raid syncing from internal
disks to this external disk doesn't go faster than 6MB/s, whereas if I boot
Shrike's kernel on Severn+updates, I get up to 18MB/s. Investigating the
possible causes of this poor performance (even though Shrike's kernel claims to
be using S400, whereas the Severn update kernel says the max SBP-2 speed is
S800), I ended up with these two suspicious processes that a sysrq keystroke was
kind enough to dump to syslog for me:
knodemgrd_0 D C03B5594 0 29 1 31 8 (L-TLB)
Call Trace: [<c01198b8>] schedule [kernel] 0x118 (0xdf957e2c)
[<c010885a>] __down [kernel] 0x6a (0xdf957e4c)
[<c01089b4>] __down_failed [kernel] 0x8 (0xdf957e70)
[<e089c20f>] .text.lock.sbp2 [sbp2] 0x5 (0xdf957e80)
[<e0899bd4>] sbp2_start_device [sbp2] 0x2b4 (0xdf957ea8)
[<e08998e0>] sbp2_start_ud [sbp2] 0xa0 (0xdf957ec8)
[<e08851c0>] .rodata.str1.32 [ieee1394] 0xb00 (0xdf957ed4)
[<e0899566>] sbp2_probe [sbp2] 0x46 (0xdf957ef4)
[<e089e0e0>] sbp2_driver [sbp2] 0x0 (0xdf957f00)
[<e086cf9e>] nodemgr_bind_drivers [ieee1394] 0x3e (0xdf957f04)
[<e086c2c6>] nodemgr_create_node [ieee1394] 0xa6 (0xdf957f20)
[<e086d5a1>] nodemgr_node_probe_one [ieee1394] 0xf1 (0xdf957f54)
[<e086d691>] nodemgr_node_probe [ieee1394] 0x91 (0xdf957fa0)
[<e086d968>] nodemgr_host_thread [ieee1394] 0xf8 (0xdf957fc8)
[<e086d870>] nodemgr_host_thread [ieee1394] 0x0 (0xdf957fe0)
[<c010745d>] kernel_thread_helper [kernel] 0x5 (0xdf957ff0)
scsi_eh_0 S C03B5594 0 36 1 41 31 (L-TLB)
Call Trace: [<c01198b8>] schedule [kernel] 0x118 (0xdffc7f70)
[<c0108911>] __down_interruptible [kernel] 0x71 (0xdffc7f90)
[<c01089bf>] __down_failed_interruptible [kernel] 0x7 (0xdffc7fb8)
[<e0853b17>] .rodata.str1.1 [scsi_mod] 0x2053 (0xdffc7fc0)
[<e084c61f>] .text.lock.scsi_error [scsi_mod] 0x55 (0xdffc7fc4)
[<e0853b0d>] .rodata.str1.1 [scsi_mod] 0x2049 (0xdffc7fcc)
[<e084c1e0>] scsi_error_handler [scsi_mod] 0x0 (0xdffc7fe4)
[<c010745d>] kernel_thread_helper [kernel] 0x5 (0xdffc7ff0)
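
knodemgrd_0 is shown in state D (uninterruptible sleep) in the dump above, and Linux counts tasks in that state toward the load average, which would be consistent with the load never dropping below 1.00 while nothing uses the CPU. A minimal sketch for spotting such tasks without sysrq follows; `list_blocked` is a hypothetical name, and the `ps` invocation in the usage note assumes the procps version of `ps`.

```shell
#!/bin/sh
# list_blocked: hypothetical filter, not part of any package.
# Reads "STATE PID COMMAND" lines on stdin and prints "PID COMMAND" for
# every task whose state starts with D (uninterruptible sleep).
# Only the first word of COMMAND is printed.
list_blocked() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}
```

Typical use would be `ps -e -o state= -o pid= -o comm= | list_blocked`; a task that shows up on every run while the disk is idle is a likely candidate for whatever keeps the load up.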
Version-Release number of selected component (if applicable):
Ditto what aoliva said.
Created attachment 94792 [details]
Message to 1394 list containing patch
A patch was recently posted to the 1394-devel mailing list for this bug. I
applied this patch to my machine and here are the results:
- machine boots cleanly with 1394 drive turned on
- drive detected with rescan-scsi-bus.sh
- reads and writes on drive do not show any warnings or lockups.
To give proper credit to the author (Sergey Vlasov), I am attaching the
Created attachment 94793 [details]
Should be fixed in beta2?
$ cat /home/davej/firewire.diff | patch -p1
patching file drivers/ieee1394/nodemgr.c
Reversed (or previously applied) patch detected! Assume -R? [n] n
Apply anyway? [n] n
1 out of 1 hunk ignored -- saving rejects to file drivers/ieee1394/nodemgr.c.rej
I'm sorry, but I would have to beg to differ. I was able to reproduce this with
the default kernel in test2, and I applied the patch manually (with vi and not
patch) so I know that piece of code did not have the patch
2061 certainly fails in just the same way as Severn1.
This was fixed in 22.214.171.124.2064, which unfortunately was too late for beta2
(for the record, after the bugzilla database was restored without this transaction)
Just tried 2075, the problem is still there.
Reading from /proc/bus/ieee1394/devices still hangs, kudzu still hangs, loading
sbp2 after ohci1394 still gets stuck in initializing state, and load is still
stuck at >=1.0 if sbp2 is loaded before ohci1394.
Curiouser and curiouser.
I saw this bug with the binary 2.4.22-1.2082 kernel, but when I installed the
kernel-source rpm and recompiled with the default i686 config file (with SMP
turned on to work around another bug) I then was able to boot with my 1394 drive
Created attachment 95129 [details]
mkinitrd patch that enables boot-time loading of firewire drivers without hanging sbp2
Kernel 2088 still has the same problem. This patch to mkinitrd is the hack
I've been using to work around part of the problem (the hang at boot time), but
the following problems are still present:
- throughput is limited to 5MB/s, instead of 17MB/s as in Shrike
- it's probably operating without DMA (how do I tell?)
- there's still some kernel thread that gets the system load stuck at >= 1.0
- reading from /proc/bus/firewire still hangs. This requires kudzu to be
disabled *and* nofirewire to be added to the kernel command line. The latter
doesn't effectively disable firewire, since the modules have already been
As for the mkinitrd patch itself, the changes for insmod -k and the reformatting
of usb-storage can probably be taken out, but the rest of the patch would be
very nice to have in the next mkinitrd build, at least until the kernel sbp2
module is fixed so as to not hang if loaded after ohci1394 when there are
firewire devices connected.
Still no improvements in 2097.
Created attachment 95302 [details]
fix problems in sbp2 module
This patch fixes all of the problems I'd run into when sbp2.o is loaded when
there are firewire devices already connected to the bus. It no longer hangs if
loaded after ohci1394, throughput is back to the expected range, reading from
/proc/bus/firewire displays the correct information and kudzu no longer hangs.
I won't pretend to understand why the semaphore was down()ed twice before, and
why it's ok to down() it only once now, but this is certainly an improvement.
The mkinitrd patch was broken up into smaller pieces in bug 103665. The only one
that is really needed now in order for firewire devices to be visible when raid
devices needed for the root filesystem are started is mkinitrd.01-sbp2-rescan.patch.
Ben Collins gives me reasons to consider this patch wrong, so I take it back.
I'm investigating further.
Created attachment 95330 [details]
proper fix for the sbp2-hangs-on-load problem
We were freeing the packet data structure before the thread we woke up had a
chance to look at the semaphore. This patch fixes the problem properly, and I
guess Ben Collins actually likes it, because he sent me a very similar patch
just as I finished testing mine :-)
*** Bug 101901 has been marked as a duplicate of this bug. ***
Works for me now on detecting my lone BUSLink firewire device. No hang on kernel
boot with the device plugged in. Works very nicely on kernel 2.4.22-1.2108.nptl.i686!
Well folks 2115 worked since rawhide push but...
2.4.22-1.2115.nptl.i686 broke today! Can't leave CDRW hot plugged on
boot, I can use rescan-scsi-bus after booting then plugging.
Can't figure it out. Nothing has changed on my system at all. No
packages were changed since the last push to rawhide.
`Can't leave' meaning what? Does it hang? Does it just fail to be
I had some problems with sbp2 failing to log into devices with sbp2
loaded after ohci1394. Arranging for sbp2 to be loaded before
ohci1394 fixes it, but I won't even pretend to understand why. This
problem has always happened to me with these modules loaded from
initrd.img, where hotplug just doesn't work, but I can't tell whether
I believe I was very clear in my comment. With 2105.nptl all the way
through and including 2115.nptl installed a few days back I could
leave my BUSLink Firewire CDRW drive plugged. Now all of a sudden the
boot process hangs just as before. If I leave my drive unplugged and
wait till after that point in the kernel where ohci loads, then I can
use the Bourne shell script rescan-scsi-bus to successfully
detect the drive and use it... go figure. The only thing I have
added recently was BitTorrent-3.3-1.
BitTorrent 3.3-1 hoses Firewire detection in kernel 2115.nptl
I did an rpm -e on BitTorrent-3.3-1 that I built last night from
source and voila... FIXED
This bug is back in kernel-2.6.0-0.test11.1.13, and is only detected
reliably with slab poisoning enabled. Attachment 95330 [details] applies
cleanly and fixes the problem. I wish Ben Collins would merge the fix
into 2.6 before it's too late...
I just tested 2.6.1-rc1 vanilla and it seems to work.
Confirmed fixed in FC2test1.