103821 – loading firewire drivers while external hard disk is connected causes several problems


Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Raw Hide
Classification:	Retired
Component:	kernel
Sub Component:
Version:	1.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Matthew Galgoci
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	101901
Depends On:
Blocks:	CambridgeBlocker 106926

Reported:	2003-09-05 12:42 UTC by Alexandre Oliva
Modified:	2007-04-18 16:57 UTC
CC List:	4 users
Fixed In Version:	kernel-2.4.22-1.2105.nptl
Clone Of:
Environment:
Last Closed:	2004-02-16 22:33:49 UTC
Embargoed:

Attachments
Message to 1394 list containing patch (6.65 KB, text/plain) 2003-09-28 03:42 UTC, Ben Hsu
patch itself (572 bytes, patch) 2003-09-28 03:43 UTC, Ben Hsu
mkinitrd patch that enables boot-time loading of firewire drivers without hanging sbp2 (2.58 KB, patch) 2003-10-13 15:36 UTC, Alexandre Oliva
fix problems in sbp2 module (448 bytes, patch) 2003-10-20 01:00 UTC, Alexandre Oliva
proper fix for the sbp2-hangs-on-load problem (2.23 KB, patch) 2003-10-21 00:30 UTC, Alexandre Oliva

Description Alexandre Oliva 2003-09-05 12:42:09 UTC

If I boot the system without a firewire disk plugged in, then, after the boot
completes, I plug the disk in (a Maxtor 5000DV) and run the rescan-scsi-bus.sh
script, everything seems to work just fine.  

However, I'd like to be able to use the external disk for devices used at boot
time, so I arrange for initrd to load scsi_mod, sd_mod, ieee1394, ohci1394 and
sbp2, in this order.  Sometimes, sbp2 will simply fail to log in to
the disk, and boot proceeds to attempting to mount the root filesystem, failing at
that because the external disk wasn't available (or dropping the raid1 replicas on
it and requiring me to resync them).  However, most often it succeeds in logging in,
but then, every time it does, it hangs before running the next command in
initrd's linuxrc.

(This is similar in behavior to what I get when I boot with the disk unplugged,
rmmod all of ieee1394, ohci1394 and sbp2, plug the disk in, then modprobe
ohci1394: this for some reason loads sbp2, which in turn gets stuck in
`initializing' state.  From then on, attempting to read from
/proc/bus/ieee1394/devices blocks forever.  This is what causes `initializing
firewire' and `probing for new hardware' to hang.)

I've managed to arrange for the boot to complete by reordering the modules in
initrd's linuxrc, by loading sbp2 after ieee1394 but *before* ohci1394.  Then,
by adding some `echo "scsi add-single-device ..." > /proc/scsi/scsi' commands
after we mount /proc, I even managed to get partitions in the external hard disk
to be usable as members of raid devices that are physical volumes of the
volume group that contains the logical volume holding my root filesystem!  Whee!
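
The reordering described above can be sketched as a linuxrc fragment.  This is
only an illustration under assumptions: the /lib/<module>.o paths, the INSMOD
and PROC_SCSI override variables and the function names are hypothetical, and
the host/channel/id/lun values for add-single-device are left as parameters
because they are not given here.

```shell
#!/bin/sh
# Hypothetical linuxrc fragment; names and paths are assumptions.
# INSMOD and PROC_SCSI can be overridden, e.g. INSMOD=echo for a dry run.
INSMOD="${INSMOD:-insmod}"
PROC_SCSI="${PROC_SCSI:-/proc/scsi/scsi}"

load_firewire_modules() {
    # sbp2 is loaded after ieee1394 but *before* ohci1394.
    for mod in scsi_mod sd_mod ieee1394 sbp2 ohci1394; do
        $INSMOD "/lib/$mod.o" || return 1
    done
}

add_scsi_device() {
    # $1..$4 = host channel id lun of the disk to register
    echo "scsi add-single-device $1 $2 $3 $4" > "$PROC_SCSI"
}
```

In a real linuxrc, load_firewire_modules would run first, and add_scsi_device
would be called once per disk after /proc is mounted.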

Unfortunately, loading ohci1394 from initrd seems to be a perfect recipe to get
/proc/bus/ieee1394/devices to get stuck, so I had to add nofirewire to the
kernel boot command line, such that rc.sysinit wouldn't attempt to read from it
(it doesn't actually disable firewire, since the modules are already loaded),
and to chkconfig kudzu off.  Then boot completes, and the system is usable, to a
point.

Unfortunately, there seems to still be a hidden problem somewhere.  The system
load never goes below 1.00, but there's no user process consuming CPU, and the
external disk is totally inactive.  If I raidhotadd the raid1 partitions in the
external disk to the corresponding raid devices, raid syncing from internal
disks to this external disk doesn't go faster than 6MB/s, whereas if I boot
Shrike's kernel on Severn+updates, I get up to 18MB/s.  Investigating the
possible causes for this poor performance (even though Shrike's kernel claims to
be using S400, whereas the Severn update kernel says the max SBP-2 speed is
S800), I ended up with these two suspicious processes that a sysrq keystroke was
kind enough to dump to syslog for me:

knodemgrd_0   D C03B5594     0    29      1            31     8 (L-TLB)
Call Trace:   [<c01198b8>] schedule [kernel] 0x118 (0xdf957e2c)
[<c010885a>] __down [kernel] 0x6a (0xdf957e4c)
[<c01089b4>] __down_failed [kernel] 0x8 (0xdf957e70)
[<e089c20f>] .text.lock.sbp2 [sbp2] 0x5 (0xdf957e80)
[<e0899bd4>] sbp2_start_device [sbp2] 0x2b4 (0xdf957ea8)
[<e08998e0>] sbp2_start_ud [sbp2] 0xa0 (0xdf957ec8)
[<e08851c0>] .rodata.str1.32 [ieee1394] 0xb00 (0xdf957ed4)
[<e0899566>] sbp2_probe [sbp2] 0x46 (0xdf957ef4)
[<e089e0e0>] sbp2_driver [sbp2] 0x0 (0xdf957f00)
[<e086cf9e>] nodemgr_bind_drivers [ieee1394] 0x3e (0xdf957f04)
[<e086c2c6>] nodemgr_create_node [ieee1394] 0xa6 (0xdf957f20)
[<e086d5a1>] nodemgr_node_probe_one [ieee1394] 0xf1 (0xdf957f54)
[<e086d691>] nodemgr_node_probe [ieee1394] 0x91 (0xdf957fa0)
[<e086d968>] nodemgr_host_thread [ieee1394] 0xf8 (0xdf957fc8)
[<e086d870>] nodemgr_host_thread [ieee1394] 0x0 (0xdf957fe0)
[<c010745d>] kernel_thread_helper [kernel] 0x5 (0xdf957ff0)

scsi_eh_0     S C03B5594     0    36      1      41    31 (L-TLB)
Call Trace:   [<c01198b8>] schedule [kernel] 0x118 (0xdffc7f70)
[<c0108911>] __down_interruptible [kernel] 0x71 (0xdffc7f90)
[<c01089bf>] __down_failed_interruptible [kernel] 0x7 (0xdffc7fb8)
[<e0853b17>] .rodata.str1.1 [scsi_mod] 0x2053 (0xdffc7fc0)
[<e084c61f>] .text.lock.scsi_error [scsi_mod] 0x55 (0xdffc7fc4)
[<e0853b0d>] .rodata.str1.1 [scsi_mod] 0x2049 (0xdffc7fcc)
[<e084c1e0>] scsi_error_handler [scsi_mod] 0x0 (0xdffc7fe4)
[<c010745d>] kernel_thread_helper [kernel] 0x5 (0xdffc7ff0)
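
Both traces show kernel threads sleeping on a semaphore, and knodemgrd_0 is in
state D (uninterruptible sleep).  On Linux such tasks count toward the load
average, which would be consistent with the load never dropping below 1.00.
A small sketch for spotting such tasks follows; list_d_state is a hypothetical
helper name, and the ps invocation in the comment assumes a procps-style ps.

```shell
# Print "PID COMMAND" for every task whose state starts with D.
# Input: lines of "STAT PID COMMAND" on stdin.
list_d_state() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Typical use (assumes a procps-style ps):
#   ps -e -o stat= -o pid= -o comm= | list_d_state
```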



Version-Release number of selected component (if applicable):
kernel-2.4.22-20.1.2024.2.36.nptl

Comment 1 raxet 2003-09-07 13:53:11 UTC

Ditto what aoliva said.

Comment 2 Ben Hsu 2003-09-28 03:42:57 UTC

Created attachment 94792
Message to 1394 list containing patch

A patch was recently posted to the 1394-devel mailing list for this bug. I
applied this patch to my machine and here are the results:
 - machine boots cleanly with 1394 drive turned on
 - drive detected with rescan-scsi-bus.sh
 - reads and writes on drive do not show any warnings or lockups.

To give proper credit to the author (Sergey Vlasov), I am attaching the
original email

Comment 3 Ben Hsu 2003-09-28 03:43:43 UTC

Created attachment 94793
patch itself

Comment 4 Dave Jones 2003-09-29 01:40:58 UTC

Should be fixed in beta2?

$ cat /home/davej/firewire.diff | patch -p1
patching file drivers/ieee1394/nodemgr.c
Reversed (or previously applied) patch detected!  Assume -R? [n] n
Apply anyway? [n] n
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file drivers/ieee1394/nodemgr.c.rej
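
The same check can be made without the interactive prompts by doing dry runs in
both directions.  This is a sketch assuming GNU patch (--dry-run, -R, -f, -s);
patch_state is a hypothetical helper name, to be run from the top of the
source tree.

```shell
# Report whether the diff in $1 is already present in the current tree.
patch_state() {
    if patch -p1 -R --dry-run -s -f < "$1" >/dev/null 2>&1; then
        echo "applied"        # reverse applies cleanly: patch is in the tree
    elif patch -p1 --dry-run -s -f < "$1" >/dev/null 2>&1; then
        echo "not applied"    # forward applies cleanly: patch is missing
    else
        echo "conflict"       # neither direction applies cleanly
    fi
}
```

Because both invocations use --dry-run, no files are modified and no .rej
files are written.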

Comment 5 Ben Hsu 2003-09-29 03:36:50 UTC

I'm sorry, but I have to differ. I was able to reproduce this with
the default kernel in test2, and I applied the patch manually (with vi and not
patch), so I know that piece of code did not have the patch.

Comment 6 Alexandre Oliva 2003-09-29 08:20:22 UTC

2061 certainly fails in just the same way as Severn1.

Comment 7 Dave Jones 2003-09-29 11:38:08 UTC

This was fixed in 2.4.22.1.2064, which unfortunately was too late for beta2

Comment 8 Alexandre Oliva 2003-10-01 22:51:33 UTC

(for the record, after the bugzilla database was restored without this transaction)

Just tried 2075, the problem is still there.

Reading from /proc/bus/ieee1394/devices still hangs, kudzu still hangs, loading
sbp2 after ohci1394 still gets stuck in initializing state, and load is still
stuck at >=1.0 if sbp2 is loaded before ohci1394.

Comment 9 Ben Hsu 2003-10-02 04:13:25 UTC

Curiouser and curiouser.

I saw this bug with the binary 2.4.22-1.2082 kernel, but when I installed the
kernel-source rpm and recompiled with the default i686 config file (with SMP
turned on to work around another bug) I then was able to boot with my 1394 drive
turned on.

Comment 10 Alexandre Oliva 2003-10-13 15:36:46 UTC

Created attachment 95129
mkinitrd patch that enables boot-time loading of firewire drivers without hanging sbp2

Kernel 2088 still has the same problem.  This patch to mkinitrd is the hack
I've been using to work around part of the problem (the hang at boot time), but
the following problems are still present:

- throughput is limited to 5MB/s, instead of 17MB/s as in Shrike
- it's probably operating without DMA (how do I tell?)
- there's still some kernel thread that gets the system load stuck at >= 1.0
- reading from /proc/bus/firewire still hangs.  This requires kudzu to be
disabled *and* nofirewire to be added to the kernel command line.  The latter
doesn't effectively disable firewire, since the modules have already been
loaded.

As for the mkinitrd patch itself, the changes for insmod -k and the reformatting
of usb-storage can probably be taken out, but the rest of the patch would be
very nice to have in the next mkinitrd build, at least until the kernel sbp2
module is fixed so as to not hang if loaded after ohci1394 when there are
firewire devices connected.

Comment 11 Alexandre Oliva 2003-10-18 20:49:25 UTC

Still no improvements in 2097.

Comment 12 Alexandre Oliva 2003-10-20 01:00:02 UTC

Created attachment 95302
fix problems in sbp2 module

This patch fixes all of the problems I'd run into when sbp2.o is loaded when
there are firewire devices already connected to the bus.  It no longer hangs if
loaded after ohci1394, throughput is back to the expected range, reading from
/proc/bus/firewire displays the correct information and kudzu no longer hangs.

I won't pretend to understand why the semaphore was down()ed twice before, and
why it's ok to down() it only once now, but this is certainly an improvement.

Comment 13 Alexandre Oliva 2003-10-20 01:06:30 UTC

The mkinitrd patch was broken up in smaller pieces in bug 103665.  The only one
that is really needed now in order for firewire devices to be visible when raid
devices needed for the root filesystem are started is mkinitrd.01-sbp2-rescan.patch.

Comment 14 Alexandre Oliva 2003-10-20 21:04:51 UTC

Ben Collins gives me reasons to consider this patch wrong, so I take it back. 
I'm investigating further.

Comment 15 Alexandre Oliva 2003-10-21 00:30:28 UTC

Created attachment 95330
proper fix for the sbp2-hangs-on-load problem

We were freeing the packet data structure before the thread we woke up had a
chance to look at the semaphore.  This patch fixes the problem properly, and I
guess Ben Collins actually likes it, because he sent me a very similar patch
just as I finished testing mine :-)

Comment 16 Alexandre Oliva 2003-10-21 20:47:38 UTC

*** Bug 101901 has been marked as a duplicate of this bug. ***

Comment 17 raxet 2003-10-26 21:25:13 UTC

Works for me now on detecting my lone BUSLink firewire device. No hang on kernel
boot with the device plugged in. Works very nicely on kernel 2.4.22-1.2108.nptl.i686!

Good work!

Comment 18 raxet 2003-11-02 15:23:34 UTC

Well folks 2115 worked since rawhide push but...

2.4.22-1.2115.nptl.i686 broke today! Can't leave CDRW hot plugged on
boot, I can use rescan-scsi-bus after booting then plugging.
Can't figure it out. Nothing has changed on my system at all. No
packages were changed since the last push to rawhide.

Comment 19 Alexandre Oliva 2003-11-02 18:57:54 UTC

`Can't leave' meaning what?  Does it hang?  Does it just fail to be
brought up?

I had some problems with sbp2 failing to log into devices with sbp2
loaded after ohci1394.  Arranging for sbp2 to be loaded before
ohci1394 fixes it, but I won't even pretend to understand why.  This
problem has always happened to me with these modules loaded from
initrd.img, where hotplug just doesn't work, but I can't tell whether
that's related.

Comment 20 raxet 2003-11-02 19:47:37 UTC

I believe I was very clear on my comment. With 2105.nptl all the way
through and including 2115.nptl installed a few days back I could
leave my BUSLink Firewire CDRW drive plugged. Now all of a sudden the
boot process hangs just as before. If I leave my drive unplugged and
wait till after that point in the kernel where ohci loads, then I can
successfully use the bourne script rescan-scsi-bus to successfully
detect the drive and use it... go figure. The only addition I have
added recently was BitTorrent-3.3-1.

Comment 21 raxet 2003-11-02 20:06:54 UTC

BitTorrent 3.3-1 hoses Firewire detection in kernel 2115.nptl

I did an rpm -e on BitTorrent-3.3-1 that I built last night from
source and voila... FIXED

Comment 22 Alexandre Oliva 2003-12-15 16:22:31 UTC

This bug is back in kernel-2.6.0-0.test11.1.13, and is only detected
reliably with slab poisoning enabled.  Attachment 95330 applies
cleanly and fixes the problem.  I wish Ben Collins would merge the fix
into 2.6 before it's too late...

Comment 23 Ben Hsu 2004-01-03 19:25:40 UTC

I just tested 2.6.1-rc1 vanilla and it seems to work

Comment 24 Alexandre Oliva 2004-02-16 22:33:49 UTC

Confirmed fixed in FC2test1.
