Bug 428554

Summary:     kernel fails to boot with firewire_sbp2 external drive
Product:     Fedora
Version:     8
Component:   kernel
Hardware:    i686
OS:          Linux
Status:      CLOSED RAWHIDE
Severity:    medium
Priority:    low
Reporter:    Andrew Farris <lordmorgul>
Assignee:    Kernel Maintainer List <kernel-maint>
QA Contact:  Fedora Extras Quality Assurance <extras-qa>
CC:          jarod, lordmorgul, stefan-r-rhbz
Doc Type:    Bug Fix
Last Closed: 2008-02-20 16:22:21 UTC
Attachments:
- screen image of kernel boot failure
- screen image of kernel boot failure
- screen image of kernel boot failure
- dmesg_2.6.23.13-104.i686.txt
- dmesg_2.6.23.13-105.i686.txt
- dmesg_2.6.23.13-106.i686.txt

Description Andrew Farris 2008-01-13 07:36:35 UTC
+++ This bug was initially created as a clone of Bug #349921 +++ (closed original)

I have had problems booting kernels with my external firewire drive (a Western
Digital ATA133 Caviar drive in an I-Rocks external usb2/firewire enclosure).

Every Fedora 8 kernel after 2.6.23-6.fc8.i686 has had this problem, including
seven or eight assorted rawhide and updates-testing kernels, up to
2.6.23.9-85.fc8.i686, which also fails to boot.

Some of the above-mentioned kernels would panic and some would not, but all
would stop booting after repeated messages of:
firewire_sbp2: orb reply timed out, rcode=0x11

Prior to upgrading beyond 2.6.23-6 (which works fine) I had never seen this.
This is also when I had to change the modprobe.conf alias (original bug) from
sbp2 to firewire-sbp2.
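
For illustration, the alias change in /etc/modprobe.conf amounts to this (the
scsi_hostadapter index is just an example; it varies per machine):

# old ieee1394 stack module:
#alias scsi_hostadapter2 sbp2
# new juju stack module:
alias scsi_hostadapter2 firewire-sbp2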

Now I've cloned this bug and re-explained it because the problem seems to be
fixed in kernel-2.6.23.13-106.fc8.i686 out of koji today (20080112).  Using
this kernel, the machine boots fine with the drive connected and turned on, and
mounts the partitions correctly.

Since the drive also behaves fairly 'normally' on my macbook, I think the
kernel changes have been a significant improvement in firewire support (it used
to work, got broken, and now works again!).

The original comments below still apply; some were removed for clarity.

-- Additional comment from lordmorgul on 2007-10-25 05:05 EST --
The message indeed goes away when that change is made; however, it doesn't make
the kernel work once it's installed.  Both kernel 2.6.23.1-30 and 2.6.23.1-31
have kernel panics very early in the boot process, with messages about
firewire_sbp2.  I'll get back to this in a couple of days and perhaps get the
messages posted (not quite the same thing happens in both kernels; the messages
differ, but both still panic).

-- Additional comment from lordmorgul on 2007-11-04 06:35 EST --
I'm still having issues with these kernels and the external firewire drive, but
I'm beginning to think it's due to my hardware and its odd drive sleep behavior.
I have had trouble with the kernel letting the drive power down and losing the
connection to it while still thinking the drive is mounted... this has been
going on for months, but the booting issues are new within the last few weeks.

All rawhide kernels after 2.6.23-6 still have the same or a similar failure to
boot if the drive enclosure power is ON at boot time, but they all boot fine and
can access the drive if it is not turned on until after booting (I have only
really tried waiting until I log in to X, so I'm assuming that once the kernel
modules are loaded and disk checks are done it would probably be fine to turn it
on).  If the drive has power when the kernel starts booting, it fails every time.

To clarify, it fails if:
- power down machine, leave external drive power on (drive just parked)
- power machine on, external drive seems not to wake up

The kernel boots fine if:
- power down machine, turn off external drive power
- power machine on, leave drive power off until after login
- drive is auto mounted and works fine for hours


The drive is a Western Digital Caviar internal 120GB ATA133 drive in an I-Rocks
usb2/firewire external enclosure.

There are repeated messages of: firewire_sbp2: orb reply timed out, rcode=0x11
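
(For reference: 0x11 is not one of the IEEE 1394-defined response codes but one
of the juju stack's internal extended rcodes.  Assuming the constants as they
appear in the kernel's firewire transaction header of this era, 0x11 is
RCODE_CANCELLED, i.e. the driver gave up waiting for the target's status write:)

/* extended rcodes, juju specific */
#define RCODE_SEND_ERROR	0x10
#define RCODE_CANCELLED		0x11
#define RCODE_BUSY		0x12
#define RCODE_GENERATION	0x13
#define RCODE_NO_ACK		0x14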

Comment 1 Andrew Farris 2008-01-13 07:36:35 UTC
Created attachment 291489 [details]
screen image of kernel boot failure

Comment 2 Andrew Farris 2008-01-13 07:37:07 UTC
Created attachment 291490 [details]
screen image of kernel boot failure

Comment 3 Andrew Farris 2008-01-13 07:37:25 UTC
Created attachment 291491 [details]
screen image of kernel boot failure

Comment 4 Andrew Farris 2008-01-13 07:38:18 UTC
*** Bug 349921 has been marked as a duplicate of this bug. ***

Comment 5 Andrew Farris 2008-01-14 06:59:00 UTC
The screen images attached above were from several kernels in the 2.6.23.1-* to
2.6.23.9-85 range (sorry, I have forgotten exactly which they were).  The
kernels that improve this issue (2.6.23.13-104, -105, and -106) all display some
different error messages during boot, but none of them panic or fail to continue
booting like the earlier kernels do.  So the problem isn't fully fixed.


See the dmesg outputs from kernels -104 to -106 attached below.

Kernel 2.6.23.13-104 has repeated messages: firewire_sbp2: orb reply timed out,
rcode=0x11

Kernel 2.6.23.13-105 has the following: firewire_sbp2: error status: 0:12
Then later: firewire_sbp2: logged in to fw1.0 LUN 000 (5 retries)

The -105 kernel has also shown a message saying it failed to log in after x retries.

Kernel 2.6.23.13-106 has:
firewire_ohci: Added fw-ohci device 0000:02:09.2, OHCI version 1.10
firewire_core: created new fw device fw0 (0 config rom retries, S400)
scsi2 : SBP-2 IEEE-1394
firewire_core: created new fw device fw1 (0 config rom retries, S400)
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
firewire_sbp2: orb reply timed out, rcode=0x11
firewire_sbp2: orb reply timed out, rcode=0x11
firewire_sbp2: status write for unknown orb
firewire_sbp2: orb reply timed out, rcode=0x11
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip: f8f60234 *pde = 3f458067 
Oops: 0000 [#1] SMP 
Modules linked in: firewire_sbp2 firewire_ohci firewire_core crc_itu_t
ata_generic ata_piix libata sd_mod scsi_mod xfs uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0060:[<f8f60234>]    Not tainted VLI
EFLAGS: 00010292   (2.6.23.13-106.fc8 #1)
EIP is at sbp2_cancel_orbs+0x12/0xb2 [firewire_sbp2]
eax: 00000000   ebx: f7c5f0ec   ecx: c1bb36b4   edx: c1bb36b4
esi: f7c5f0c0   edi: c1981efc   ebp: c1bb3678   esp: c1981ef8
ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
Process firewire_sbp2 (pid: 429, ti=c1981000 task=c1bc1840 task.ti=c1981000)
Stack: c0427c6f 00100100 00200200 c1bb36ac c1bb3600 f7c5f0c0 c1bb3678 f8f60434 
       00000002 f0010000 0000ffff 00000002 0000ffc0 c1bb3658 f7c5f10c f7c5f0c0 
       c1bb4400 0000ffc0 f8f6089f 00000000 00000000 c1981f70 c18f4fb0 c1bbeabc 
Call Trace:
 [<c0427c6f>] default_wake_function+0x0/0xc
 [<f8f60434>] sbp2_send_management_orb+0x135/0x1f0 [firewire_sbp2]
 [<f8f6089f>] sbp2_login+0x59/0x19d [firewire_sbp2]
 [<f8f60846>] sbp2_login+0x0/0x19d [firewire_sbp2]
 [<c043a4c3>] run_workqueue+0x7d/0x109
 [<c043d5db>] prepare_to_wait+0x24/0x3f
 [<c043ad1c>] worker_thread+0x0/0xc4
 [<c043add6>] worker_thread+0xba/0xc4
 [<c043d491>] autoremove_wake_function+0x0/0x35
 [<c043d3ca>] kthread+0x38/0x5e
 [<c043d392>] kthread+0x0/0x5e
 [<c0405dbb>] kernel_thread_helper+0x7/0x10
 =======================
Code: 24 08 8b 54 24 40 89 54 24 04 89 f2 e8 a2 75 f9 ff 83 c4 2c 5b 5e 5f 5d c3
55 57 56 89 c6 53 83 ec 0c 8b 00 8d 7c 24 04 8d 5e 2c <8b> 40 04 8b 68 74 89 7c
24 04 89 7c 24 08 83 ed 18 8b 45 14 05 
EIP: [<f8f60234>] sbp2_cancel_orbs+0x12/0xb2 [firewire_sbp2] SS:ESP 0068:c1981ef8

That trace is very similar to what is seen in some of the 2.6.23.1-* kernels
when they fail to boot.  However, -106 continues to boot after this error; the
earlier kernels do not.

Comment 6 Andrew Farris 2008-01-14 06:59:34 UTC
Created attachment 291568 [details]
dmesg_2.6.23.13-104.i686.txt

Comment 7 Andrew Farris 2008-01-14 06:59:49 UTC
Created attachment 291569 [details]
dmesg_2.6.23.13-105.i686.txt

Comment 8 Andrew Farris 2008-01-14 07:00:07 UTC
Created attachment 291570 [details]
dmesg_2.6.23.13-106.i686.txt

Comment 9 Stefan Richter 2008-01-14 12:27:15 UTC
So the status with the latest 2.6.23.x-y is:
  - boots,
  - sometimes fails to log in (i.e. the disk may be inaccessible),
  - sometimes OOPSes in sbp2_cancel_orbs.
Correct me if I got you wrong.

Side notes:
  - Continuing to operate after an OOPS is dangerous in principle.
  - Transient error messages may happen; what counts is that login and I/O
finally succeed.

Question to Jarod et al:  Does the firewire code in Fedora's 2.6.23.13-10*
differ from 2.6.24-rcX in notable ways?  (Regarding transactions, fw-sbp2...)



Comment 10 Jarod Wilson 2008-01-14 14:19:19 UTC
The 2.6.23.13-106.fc8 kernel and the most recent rawhide kernels
(2.6.24-0.147.rc7.git2.fc9 and later, iirc) should have more or less identical
firewire code, updated with your (Stefan's) patches to bring them up to the
latest linux1394 git code.  I'm less certain about kernels prior to
2.6.23.13-106.fc8, but they did carry a few additional linux1394 git patches
that were dropped when I put in the full firewire update, one of which was
definitely sbp2-related...

I just got a drive late last week from someone in the office that crashes some
systems and simply fails to log in (orb timeouts) on others, so I'll be playing
with that some this week.  The OOPS is a new one to me though.  I've not really
touched the storage code a whole lot thus far, but I guess it's probably time I
do...

Comment 11 Stefan Richter 2008-01-14 17:39:54 UTC
> I just got a drive late last week from someone in the office that
> crashes some systems, and simply fails to log in (orb timeouts) on others,

Drives which have frequent login failures should be cross-checked with the old
stack and/or alternative OSs.  If the drives make trouble with the other stack/
OSs too, then that's something to be kept in mind.  Of course we shall make Juju
come out best in interoperability with junk devices in the long term...

I say so because I've got a junk PL3507-based device (bad firmware) and a junk
INIC-2430-based device (bad PHY, besides substandard firmware) which don't fully
work under Juju, Linux1394, or OS X.

In any case, we don't want Juju to crash the kernel.

> The OOPs is a new one to me though.

I haven't seen it either... or maybe I have seen it very infrequently.  I can't
be entirely sure, because wild hotplugging sessions combined with working on
driver updates sometimes come with kernel panics of different sorts.

Comment 12 Andrew Farris 2008-01-14 20:42:53 UTC
Stefan, minor correction on status:
2.6.23.13-y: you have it correct; boots, sometimes messages, sometimes an oops.
2.6.23.9 or anything before: cannot boot at all with the drive turned on.

My drive in particular had no problem working with the 2.6.23-6 kernel, even
after I noticed the issues with newer kernels (I kept 2.6.23-6 for several
months and checked that it still never had any of these symptoms, not even the
occasional warning messages).  The drive also has no problems at all with OSX
10.5 on a recent macbook.  I commented earlier that my external drive sometimes
disconnects with the kernels after 2.6.23-6; this does not happen on OSX.

I want to test how the latest kernel firewire stack handles my macbook in
target disk mode, but I am waiting until I get it better backed up, maybe at the
end of this week.  I have an ext3 and an ntfs partition on the macbook's HD and
will see how the juju stack handles mounting those at kernel boot time.  I'll
then have the same kernel and firewire card but a different drive to test.

Comment 13 Jarod Wilson 2008-01-15 22:49:57 UTC
(In reply to comment #11)
> > I just got a drive late last week from someone in the office that
> > crashes some systems, and simply fails to log in (orb timeouts) on others,
> 
> Drives which have frequent login failures should be cross-checked with the old
> stack or/and alternative OSs.  If the drives make trouble with the other stack/
> OSs too, than that's something to be kept in mind.

I checked last night, and the disk in question (a 1TB WD My Book) works
perfectly with Mac OS X (10.4 and 10.5); I haven't tried it w/anything else yet.
I've also got a LaCie drive that behaves similarly -- works w/OS X, but not
w/juju (I never could log in w/a few different controllers, then it hard-locked
my laptop while plugging/unplugging/replugging).


> Of course we shall make Juju
> to come out best in interoperability with junk devices in the long term...

Definitely.

On tap for me tonight is to read through the SBP-2 spec and code, to get a
better idea of where the failures I'm seeing are actually happening. Oh, and
I've got a nice stack of drives to work with now...

Comment 14 Stefan Richter 2008-01-16 00:04:13 UTC
https://bugzilla.redhat.com/attachment.cgi?id=291491 looks like bug 391701.


Comment 15 Andrew Farris 2008-01-16 00:17:36 UTC
Stefan, I agree.  I do remember that kernel (2.6.23.1-42) failing to boot, and
it may have been the kernel that my image is from.

Comment 16 Andrew Farris 2008-01-16 10:40:32 UTC
Some improvement with 2.6.23.14-107.fc8.i686: no oops (good!), and all 3
partitions did mount at boot, with no firewire timeout messages:

firewire_ohci: Added fw-ohci device 0000:02:09.2, OHCI version 1.10
firewire_core: created new fw device fw0 (0 config rom retries, S400)
scsi2 : SBP-2 IEEE-1394
firewire_core: created new fw device fw1 (0 config rom retries, S400)
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
firewire_sbp2: logged in to fw1.0 LUN 0000 (0 retries)
firewire_sbp2: sbp2_scsi_abort
scsi 2:0:0:0: scsi: Device offlined - not ready after error recovery
firewire_sbp2: logged in to fw1.0 LUN 0000 (0 retries)
scsi 2:0:0:0: Direct-Access-RBC Maxtor 6 L120P0                PQ: 0 ANSI: 4

And uhm... /sheepish: the drive in the enclosure is a Maxtor, not a WD as I
commented above more than once.

Comment 17 Jarod Wilson 2008-01-18 20:17:13 UTC
Nb: see the tail end of bug 238606; I think we'll have the orb reply timeout
thing fixed shortly. :)

It doesn't fix the panic seen in sbp2_cancel_orbs, but it should make it much
harder to trigger in normal use.

Comment 18 Andrew Farris 2008-01-19 02:28:20 UTC
New messages in dmesg from kernel-2.6.23.14-111.fc8.i686:
- login to the drive succeeds, but the filesystems do not mount; later these
messages appear:

firewire_sbp2: sbp2_scsi_abort
sd 2:0:0:0: scsi: Device offlined - not ready after error recovery
sd 2:0:0:0: rejecting I/O to offline device
sd 2:0:0:0: rejecting I/O to offline device
sd 2:0:0:0: rejecting I/O to offline device
sd 2:0:0:0: [sdc] READ CAPACITY failed
sd 2:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
sd 2:0:0:0: [sdc] Sense not available.
sd 2:0:0:0: rejecting I/O to offline device
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 00 00 00
sd 2:0:0:0: rejecting I/O to offline device
sd 2:0:0:0: [sdc] Asking for cache data failed
sd 2:0:0:0: [sdc] Assuming drive cache: write through
sd 2:0:0:0: [sdc] Attached SCSI disk
sd 2:0:0:0: Attached scsi generic sg4 type 14
firewire_sbp2: reconnected to fw1.0 LUN 0000 (0 retries)

I haven't mentioned it before (maybe irrelevant), but this drive is connected to
this machine through the external firewire port on an SB Audigy Gamer sound
card.  See:
02:09.2 FireWire (IEEE 1394): Creative Labs SB Audigy FireWire Port


Comment 19 Stefan Richter 2008-01-19 12:41:54 UTC
The dreaded bus reset during SCSI INQUIRY or SCSI READ CAPACITY.  This
irritates the Linux SCSI core a lot, and perhaps firmwares too.  It is also not
handled too well by the sbp2 driver of the linux1394 stack, but perhaps in a
manner which is more palatable to the Linux SCSI core.  I am wondering whether
we should switch back to sbp2's scheme of blocking the Scsi_Host (alternatively:
the scsi_device) between the bus reset event and the reconnect.
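
For what it's worth, a minimal sketch of that blocking scheme, using the real
SCSI core helpers scsi_block_requests()/scsi_unblock_requests(); the driver
hook names are hypothetical:

#include <scsi/scsi_host.h>

/* hypothetical hook: a bus reset has invalidated the connection */
static void sbp2_on_bus_reset(struct Scsi_Host *shost)
{
	/* stall the SCSI midlayer so no new commands are queued */
	scsi_block_requests(shost);
}

/* hypothetical hook: the reconnect ORB has succeeded */
static void sbp2_on_reconnect(struct Scsi_Host *shost)
{
	/* let the queued commands flow again */
	scsi_unblock_requests(shost);
}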

Comment 20 Andrew Farris 2008-01-19 15:16:19 UTC
The -111 kernel does that consistently on every boot; the drive is offlined
almost immediately after the login.

Comment 21 Stefan Richter 2008-01-19 19:09:37 UTC
> The dreaded bus reset during SCSI inquiry or during SCSI read
> capacity.  This irritates the Linux SCSI core a lot,

In comment #18, the SCSI core already succeeded with INQUIRY and handed
everything over to sd-mod.  sd-mod tried READ CAPACITY, which timed out and was
aborted (almost certainly because of a bus generation change, as indicated by
the later reconnect).  This failure motivated the SCSI core to take the device
offline, and the subsequent MODE SENSE command from sd-mod failed too.  All of
this went on within __scsi_add_device(), called by fw-sbp2's login job.

The bug in fw-sbp2 is that reconnect jobs cannot run in parallel with the
__scsi_add_device() job (which is currently tied to the login job).  It wouldn't
hurt as much if the SCSI core failed __scsi_add_device() completely, because
then fw-sbp2 would retry the login + __scsi_add_device() job.
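
To sketch the shape of the problem (structure assumed from the description
above, not the actual fw-sbp2 source; both jobs share one single-threaded work
context, and the names below are hypothetical):

#include <linux/workqueue.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

static void sbp2_login_job(struct work_struct *work)
{
	struct sbp2_logical_unit *lu = lu_from_work(work);  /* hypothetical */
	struct scsi_device *sdev;

	/* ... after the login management ORB succeeds: */
	sdev = __scsi_add_device(lu->shost, 0, 0, lu->lun, lu);
	/* blocks here while the SCSI core issues INQUIRY, READ CAPACITY, ... */
}

static void sbp2_reconnect_job(struct work_struct *work)
{
	/* queued on a bus reset; because it shares the work context with
	 * sbp2_login_job(), it cannot start until __scsi_add_device()
	 * returns, so the commands above time out and the sdev is offlined */
}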

Comment 22 Stefan Richter 2008-01-23 01:02:45 UTC
Potential quick and dirty solution:
In fw-sbp2.c::sbp2_login(), check whether sdev.sdev_state == SDEV_OFFLINE right
after __scsi_add_device() succeeded.  If so, remove the sdev again, log out, and
reschedule the login job.

Real solution:
Perform __scsi_add_device() in a different process context.  (Not in the shared
workqueue, though.)  Maybe replace all of fw-sbp2's workqueue jobs with
short-lived kthreads.
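
A sketch of how that quick check could slot into sbp2_login() right after the
scan; __scsi_add_device(), scsi_remove_device(), scsi_device_put(), sdev_state,
and SDEV_OFFLINE are the real SCSI core pieces, while the logout/reschedule
helpers are hypothetical stand-ins:

sdev = __scsi_add_device(shost, 0, 0, lu->lun, lu);
if (!IS_ERR(sdev)) {
	if (sdev->sdev_state == SDEV_OFFLINE) {
		/* a bus reset raced with the SCSI probe; undo and retry */
		scsi_remove_device(sdev);
		scsi_device_put(sdev);
		sbp2_logout(lu);            /* hypothetical */
		sbp2_reschedule_login(lu);  /* hypothetical */
	}
}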

Comment 23 Stefan Richter 2008-01-26 17:15:43 UTC
quick and dirty solution posted at http://lkml.org/lkml/2008/1/26/161

also uploaded at http://me.in-berlin.de/~s5r6/linux1394/updates/
(not yet in linux1394-2.6.git, pending review)

Comment 24 Chuck Ebbert 2008-01-29 01:10:36 UTC
Jarod, do we want the patches from comment #23 in rawhide?

Comment 25 Jarod Wilson 2008-01-29 02:00:07 UTC
We do. In fact, I added 'em earlier today. :)

Comment 26 Andrew Farris 2008-02-19 06:00:48 UTC
I haven't tried the f8 kernels since I moved the machines to rawhide, but I've
seen no problems with this drive since those patches were added.

Is this bug needed to track that quick solution, or should it be closed?  Do you
need me to go back and see how the drive behaves with the latest f8 kernel build?

Comment 27 Jarod Wilson 2008-02-20 16:22:21 UTC
Nah, I think we can go ahead and close this one; the same fixes are already in
the 2.6.24.x kernels being built for f8, which should be in updates-testing
shortly.