If I boot the system without a firewire disk plugged in, then, after the boot completes, plug the disk in (a Maxtor 5000DV) and run the rescan-scsi-bus.sh script, everything seems to work just fine.

However, I'd like to be able to use the external disk for devices used at boot time, so I arrange for initrd to load scsi_mod, sd_mod, ieee1394, ohci1394 and sbp2, in this order. Sometimes sbp2 will simply fail to log in to the disk, and the boot proceeds to attempt to mount the root filesystem and fails because the external disk isn't available (or it drops the raid1 replicas on that disk and requires me to resync them). Most often, however, the login succeeds, but every time it does, the boot hangs before running the next command in initrd's linuxrc.

(This is similar in behavior to what I get when I boot with the disk unplugged, rmmod all of ieee1394, ohci1394 and sbp2, plug the disk in, then modprobe ohci1394: this for some reason loads sbp2, which in turn gets stuck in `initializing' state. From then on, attempting to read from /proc/bus/ieee1394/devices blocks forever. This is what causes `initializing firewire' and `probing for new hardware' to hang.)

I've managed to arrange for the boot to complete by reordering the modules in initrd's linuxrc, loading sbp2 after ieee1394 but *before* ohci1394. Then, by adding some `echo "scsi add-single-device ..." > /proc/scsi/scsi' commands after we mount /proc, I even managed to get partitions on the external hard disk to be usable as members of raid devices that are physical volumes of the volume group that contains the logical volume holding my root filesystem! Whee!

Unfortunately, loading ohci1394 from initrd seems to be a perfect recipe for getting /proc/bus/ieee1394/devices stuck, so I had to add nofirewire to the kernel boot command line, so that rc.sysinit wouldn't attempt to read from it (it doesn't actually disable firewire, since the modules are already loaded), and to chkconfig kudzu off.
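The reordering described above can be sketched as a small linuxrc fragment. This is only an illustration: it assumes scsi_mod and sd_mod stay at the front of the list as in the original order, and the LOADER variable and both function names are made up here. By default it only prints the commands instead of loading anything.

```shell
#!/bin/sh
# Sketch of the reordered module loading from the report:
# sbp2 goes after ieee1394 but *before* ohci1394.
# LOADER is a hypothetical knob, not part of mkinitrd. It defaults to
# "echo insmod" so the script only prints what it would run; set
# LOADER=insmod to actually load the modules.
LOADER=${LOADER:-echo insmod}

firewire_module_order() {
    # One module per line, in load order (scsi_mod/sd_mod first is assumed).
    printf '%s\n' scsi_mod sd_mod ieee1394 sbp2 ohci1394
}

load_firewire_modules() {
    for mod in $(firewire_module_order); do
        # $LOADER is left unquoted on purpose so "echo insmod" splits in two.
        $LOADER "$mod" || return 1
    done
}

load_firewire_modules
```

The `echo "scsi add-single-device ..." > /proc/scsi/scsi' step is left out on purpose, since the report elides its arguments; it would follow the module loading, after /proc is mounted.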
Then boot completes, and the system is usable, to a point. Unfortunately, there still seems to be a hidden problem somewhere. The system load never goes below 1.00, but there's no user process consuming CPU, and the external disk is totally inactive. If I raidhotadd the raid1 partitions on the external disk to the corresponding raid devices, raid syncing from the internal disks to this external disk doesn't go faster than 6MB/s, whereas if I boot Shrike's kernel on Severn+updates, I get up to 18MB/s.

Investigating the possible causes for this poor performance (even though Shrike's kernel claims to be using S400, whereas the Severn update kernel says the max SBP-2 speed is S800), I ended up with these two suspicious processes that a sysrq keystroke was kind enough to dump to syslog for me:

knodemgrd_0 D C03B5594 0 29 1 31 8 (L-TLB)
Call Trace:
[<c01198b8>] schedule [kernel] 0x118 (0xdf957e2c)
[<c010885a>] __down [kernel] 0x6a (0xdf957e4c)
[<c01089b4>] __down_failed [kernel] 0x8 (0xdf957e70)
[<e089c20f>] .text.lock.sbp2 [sbp2] 0x5 (0xdf957e80)
[<e0899bd4>] sbp2_start_device [sbp2] 0x2b4 (0xdf957ea8)
[<e08998e0>] sbp2_start_ud [sbp2] 0xa0 (0xdf957ec8)
[<e08851c0>] .rodata.str1.32 [ieee1394] 0xb00 (0xdf957ed4)
[<e0899566>] sbp2_probe [sbp2] 0x46 (0xdf957ef4)
[<e089e0e0>] sbp2_driver [sbp2] 0x0 (0xdf957f00)
[<e086cf9e>] nodemgr_bind_drivers [ieee1394] 0x3e (0xdf957f04)
[<e086c2c6>] nodemgr_create_node [ieee1394] 0xa6 (0xdf957f20)
[<e086d5a1>] nodemgr_node_probe_one [ieee1394] 0xf1 (0xdf957f54)
[<e086d691>] nodemgr_node_probe [ieee1394] 0x91 (0xdf957fa0)
[<e086d968>] nodemgr_host_thread [ieee1394] 0xf8 (0xdf957fc8)
[<e086d870>] nodemgr_host_thread [ieee1394] 0x0 (0xdf957fe0)
[<c010745d>] kernel_thread_helper [kernel] 0x5 (0xdf957ff0)

scsi_eh_0 S C03B5594 0 36 1 41 31 (L-TLB)
Call Trace:
[<c01198b8>] schedule [kernel] 0x118 (0xdffc7f70)
[<c0108911>] __down_interruptible [kernel] 0x71 (0xdffc7f90)
[<c01089bf>] __down_failed_interruptible [kernel] 0x7 (0xdffc7fb8)
[<e0853b17>] .rodata.str1.1 [scsi_mod] 0x2053 (0xdffc7fc0)
[<e084c61f>] .text.lock.scsi_error [scsi_mod] 0x55 (0xdffc7fc4)
[<e0853b0d>] .rodata.str1.1 [scsi_mod] 0x2049 (0xdffc7fcc)
[<e084c1e0>] scsi_error_handler [scsi_mod] 0x0 (0xdffc7fe4)
[<c010745d>] kernel_thread_helper [kernel] 0x5 (0xdffc7ff0)

Version-Release number of selected component (if applicable):
kernel-2.4.22-20.1.2024.2.36.nptl
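A possible reading of the load figure: Linux counts tasks in uninterruptible sleep (state D) toward the load average, and knodemgrd_0 in the dump above is in state D, so a single permanently blocked kernel thread would be enough to hold the load at 1.00. Below is a minimal sketch for listing such tasks without a sysrq dump. It assumes a procps-style ps that accepts `-eo state,pid,comm`; the function name d_state_tasks is made up for illustration.

```shell
#!/bin/sh
# Read "state pid command" lines on stdin (the layout printed by
# `ps -eo state,pid,comm`) and print "pid command" for every task whose
# state starts with D (uninterruptible sleep).
# The header line from ps starts with "S", so it is skipped automatically.
d_state_tasks() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Typical use on a live system:
#   ps -eo state,pid,comm | d_state_tasks
```

If the list stays non-empty while the machine is idle, the tasks it names are the first place to look for a stuck semaphore or lock.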
Ditto what aoliva said.
Created attachment 94792 [details] Message to 1394 list containing patch

A patch was recently posted to the 1394-devel mailing list for this bug. I applied this patch on my machine and here are the results:
- machine boots cleanly with 1394 drive turned on
- drive detected with rescan-scsi-bus.sh
- reads and writes on drive do not show any warnings or lockups.

To give proper credit to the author (Sergey Vlasov), I am attaching the original email.
Created attachment 94793 [details] patch itself
Should be fixed in beta2?

$ cat /home/davej/firewire.diff | patch -p1
patching file drivers/ieee1394/nodemgr.c
Reversed (or previously applied) patch detected! Assume -R? [n] n
Apply anyway? [n] n
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file drivers/ieee1394/nodemgr.c.rej
I'm sorry, but I beg to differ. I was able to reproduce this with the default kernel in test2, and I applied the patch manually (with vi, not patch), so I know that piece of code did not already have the patch.
2061 certainly fails in just the same way as Severn1.
This was fixed in 2.4.22.1.2064, which unfortunately was too late for beta2.
(for the record, after the bugzilla database was restored without this transaction) Just tried 2075, the problem is still there. Reading from /proc/bus/ieee1394/devices still hangs, kudzu still hangs, loading sbp2 after ohci1394 still gets stuck in initializing state, and load is still stuck at >=1.0 if sbp2 is loaded before ohci1394.
Curiouser and curiouser. I saw this bug with the binary 2.4.22-1.2082 kernel, but when I installed the kernel-source rpm and recompiled with the default i686 config file (with SMP turned on to work around another bug), I was then able to boot with my 1394 drive turned on.
Created attachment 95129 [details] mkinitrd patch that enables boot-time loading of firewire drivers without hanging sbp2

Kernel 2088 still has the same problem. This patch to mkinitrd is the hack I've been using to work around part of the problem (the hang at boot time), but the following problems are still present:
- throughput is limited to 5MB/s, instead of 17MB/s as in Shrike
- it's probably operating without DMA (how do I tell?)
- there's still some kernel thread that gets the system load stuck at >= 1.0
- reading from /proc/bus/firewire still hangs. This requires kudzu to be disabled *and* nofirewire to be added to the kernel command line. The latter doesn't effectively disable firewire, since the modules have already been loaded.

As for the mkinitrd patch itself, the changes for insmod -k and the reformatting of usb-storage can probably be taken out, but the rest of the patch would be very nice to have in the next mkinitrd build, at least until the kernel sbp2 module is fixed so that it doesn't hang if loaded after ohci1394 when there are firewire devices connected.
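For anyone comparing throughput across kernels, one way to read the raid resync rate is to pull the speed= field out of /proc/mdstat while a sync is running. This is a sketch under the assumption that the figures quoted in this report were taken from that field (the report doesn't say how they were measured); both function names are made up for illustration.

```shell
#!/bin/sh
# Print the resync rate in KB/s from /proc/mdstat-style text on stdin.
# Only lines carrying a "speed=<n>K/sec" field produce output.
md_sync_speed_kb() {
    sed -n 's/.*speed=\([0-9][0-9]*\)K\/sec.*/\1/p'
}

# Convert KB/s to whole MB/s (integer division, 1 MB = 1024 KB).
kb_to_mb() {
    echo $(( $1 / 1024 ))
}

# Typical use while a resync is in progress:
#   for kb in $(md_sync_speed_kb < /proc/mdstat); do kb_to_mb "$kb"; done
```

Note that the md driver throttles resync when other I/O is competing for the disks, so the comparison is only meaningful on an otherwise idle system.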
Still no improvements in 2097.
Created attachment 95302 [details] fix problems in sbp2 module This patch fixes all of the problems I'd run into when sbp2.o is loaded when there are firewire devices already connected to the bus. It no longer hangs if loaded after ohci1394, throughput is back to the expected range, reading from /proc/bus/firewire displays the correct information and kudzu no longer hangs. I won't pretend to understand why the semaphore was down()ed twice before, and why it's ok to down() it only once now, but this is certainly an improvement.
The mkinitrd patch was broken up into smaller pieces in bug 103665. The only one that is really needed now in order for firewire devices to be visible when raid devices needed for the root filesystem are started is mkinitrd.01-sbp2-rescan.patch.
Ben Collins gives me reasons to consider this patch wrong, so I take it back. I'm investigating further.
Created attachment 95330 [details] proper fix for the sbp2-hangs-on-load problem

We were freeing the packet data structure before the thread we woke up had a chance to look at the semaphore. This patch fixes the problem properly, and I guess Ben Collins actually likes it, because he sent me a very similar patch just as I finished testing mine :-)
*** Bug 101901 has been marked as a duplicate of this bug. ***
Works for me now, detecting my lone BUSLink firewire device. No hang on kernel boot with the device plugged in. Works very nicely on kernel 2.4.22-1.2108.nptl.i686! Good work!
Well folks 2115 worked since rawhide push but... 2.4.22-1.2115.nptl.i686 broke today! Can't leave CDRW hot plugged on boot, I can use rescan-scsi-bus after booting then plugging. Can't figure it out. Nothing has changed on my system at all. No packages were changed since the last push to rawhide.
`Can't leave' meaning what? Does it hang? Does it just fail to be brought up? I had some problems with sbp2 failing to log into devices when sbp2 was loaded after ohci1394. Arranging for sbp2 to be loaded before ohci1394 fixes it, but I won't even pretend to understand why. This problem has always happened to me with these modules loaded from initrd.img, where hotplug just doesn't work, but I can't tell whether that's related.
I believe I was very clear in my comment. With every kernel from 2105.nptl up to and including 2115.nptl, installed a few days back, I could leave my BUSLink Firewire CDRW drive plugged in. Now all of a sudden the boot process hangs just as before. If I leave my drive unplugged and wait until after the point in the boot where ohci loads, then I can use the Bourne shell script rescan-scsi-bus to successfully detect the drive and use it... go figure. The only thing I have added recently was BitTorrent-3.3-1.
BitTorrent 3.3-1 hoses Firewire detection in kernel 2115.nptl

I did an rpm -e on the BitTorrent-3.3-1 that I built last night from source and voila... FIXED
This bug is back in kernel-2.6.0-0.test11.1.13, and is only detected reliably with slab poisoning enabled. Attachment 95330 applies cleanly and fixes the problem. I wish Ben Collins would merge the fix into 2.6 before it's too late...
I just tested 2.6.1-rc1 vanilla and it seems to work
Confirmed fixed in FC2test1.