Bug 428554
Summary: kernel fails to boot with firewire_sbp2 external drive

| | | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Andrew Farris <lordmorgul> |
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Priority: | low |
| Version: | 8 | CC: | jarod, lordmorgul, stefan-r-rhbz |
| Hardware: | i686 | OS: | Linux |
| Doc Type: | Bug Fix | Last Closed: | 2008-02-20 16:22:21 UTC |
Description
Andrew Farris
2008-01-13 07:36:35 UTC
Created attachment 291489 [details]
screen image of kernel boot failure
Created attachment 291490 [details]
screen image of kernel boot failure
Created attachment 291491 [details]
screen image of kernel boot failure
*** Bug 349921 has been marked as a duplicate of this bug. ***

The screen images attached above were from several kernels in the 2.6.23.1-* to 2.6.23.9-85 range (sorry, I have forgotten exactly which they were). The kernels that improve this issue, 2.6.23.13-104, -105, and -106, all display some different error messages during boot, but none of them panic or fail to continue booting like the earlier kernels do. So the problem isn't really fixed. See the dmesg outputs from kernels -104 to -106 attached below.

Kernel 2.6.23.13-104 has repeated messages:

```
firewire_sbp2: orb reply timed out, rcode=0x11
```

Kernel 2.6.23.13-105 has the following:

```
firewire_sbp2: error status: 0:12
```

Then later:

```
firewire_sbp2: logged in to fw1.0 LUN 000 (5 retries)
```

The -105 kernel has also had a message that said "failed to login after x retries".

Kernel 2.6.23.13-106 has:

```
firewire_ohci: Added fw-ohci device 0000:02:09.2, OHCI version 1.10
firewire_core: created new fw device fw0 (0 config rom retries, S400)
scsi2 : SBP-2 IEEE-1394
firewire_core: created new fw device fw1 (0 config rom retries, S400)
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
firewire_sbp2: orb reply timed out, rcode=0x11
firewire_sbp2: orb reply timed out, rcode=0x11
firewire_sbp2: status write for unknown orb
firewire_sbp2: orb reply timed out, rcode=0x11
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip: f8f60234
*pde = 3f458067
Oops: 0000 [#1] SMP
Modules linked in: firewire_sbp2 firewire_ohci firewire_core crc_itu_t ata_generic ata_piix libata sd_mod scsi_mod xfs uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0060:[<f8f60234>]    Not tainted VLI
EFLAGS: 00010292   (2.6.23.13-106.fc8 #1)
EIP is at sbp2_cancel_orbs+0x12/0xb2 [firewire_sbp2]
eax: 00000000   ebx: f7c5f0ec   ecx: c1bb36b4   edx: c1bb36b4
esi: f7c5f0c0   edi: c1981efc   ebp: c1bb3678   esp: c1981ef8
ds: 007b   es: 007b   fs: 00d8  gs: 0000  ss: 0068
Process firewire_sbp2 (pid: 429, ti=c1981000 task=c1bc1840 task.ti=c1981000)
Stack: c0427c6f 00100100 00200200 c1bb36ac c1bb3600 f7c5f0c0 c1bb3678 f8f60434
       00000002 f0010000 0000ffff 00000002 0000ffc0 c1bb3658 f7c5f10c f7c5f0c0
       c1bb4400 0000ffc0 f8f6089f 00000000 00000000 c1981f70 c18f4fb0 c1bbeabc
Call Trace:
 [<c0427c6f>] default_wake_function+0x0/0xc
 [<f8f60434>] sbp2_send_management_orb+0x135/0x1f0 [firewire_sbp2]
 [<f8f6089f>] sbp2_login+0x59/0x19d [firewire_sbp2]
 [<f8f60846>] sbp2_login+0x0/0x19d [firewire_sbp2]
 [<c043a4c3>] run_workqueue+0x7d/0x109
 [<c043d5db>] prepare_to_wait+0x24/0x3f
 [<c043ad1c>] worker_thread+0x0/0xc4
 [<c043add6>] worker_thread+0xba/0xc4
 [<c043d491>] autoremove_wake_function+0x0/0x35
 [<c043d3ca>] kthread+0x38/0x5e
 [<c043d392>] kthread+0x0/0x5e
 [<c0405dbb>] kernel_thread_helper+0x7/0x10
 =======================
Code: 24 08 8b 54 24 40 89 54 24 04 89 f2 e8 a2 75 f9 ff 83 c4 2c 5b 5e 5f 5d c3 55 57 56 89 c6 53 83 ec 0c 8b 00 8d 7c 24 04 8d 5e 2c <8b> 40 04 8b 68 74 89 7c 24 04 89 7c 24 08 83 ed 18 8b 45 14 05
EIP: [<f8f60234>] sbp2_cancel_orbs+0x12/0xb2 [firewire_sbp2] SS:ESP 0068:c1981ef8
```

That trace is very similar to what is seen in some of the 2.6.23.1-* kernels when they fail to boot. However, -106 continues to boot after this error; the earlier kernels do not.

Created attachment 291568 [details]
dmesg_2.6.23.13-104.i686.txt
Created attachment 291569 [details]
dmesg_2.6.23.13-105.i686.txt
Created attachment 291570 [details]
dmesg_2.6.23.13-106.i686.txt
So the status with the latest 2.6.23.x-y is:
- boots,
- sometimes fails to log in (i.e. the disk may be inaccessible),
- sometimes oopses in sbp2_cancel_orbs.

Correct me if I got you wrong. Side notes:
- Further operation after an oops is principally dangerous.
- Transient error messages may happen; what counts is that login and I/O finally succeed.

Question to Jarod et al: Does the firewire code in Fedora's 2.6.23.13-10* differ from 2.6.24-rcX in notable ways? (Regarding transactions, fw-sbp2...)

The 2.6.23.13-106.fc8 kernel and the most recent rawhide kernels (2.6.24-0.147.rc7.git2.fc9 and later, iirc) should have more or less identical firewire code, updated with your (Stefan's) patches to bring them up to the latest linux1394 git code. I'm less certain about kernels prior to 2.6.23.13-106.fc8, but they did carry a few additional linux1394 git patches that were dropped when I put in the full firewire update, one of which was definitely sbp2-related...

I just got a drive late last week from someone in the office that crashes some systems and simply fails to log in (orb timeouts) on others, so I'll be playing with that some this week. The oops is a new one to me, though. I've not really touched the storage code a whole lot thus far, but I guess it's probably time I do...

> I just got a drive late last week from someone in the office that
> crashes some systems, and simply fails to log in (orb timeouts) on others,

Drives which have frequent login failures should be cross-checked with the old stack and/or alternative OSs. If the drives make trouble with the other stack/OSs too, then that's something to be kept in mind. Of course we shall make Juju come out best in interoperability with junk devices in the long term... I say so because I've got a junk PL3507-based device (bad firmware) and a junk INIC-2430-based device (bad PHY, besides substandard firmware) which don't fully work under Juju, Linux1394, or OS X. In any case, we don't want Juju to crash the kernel.
> The oops is a new one to me though.

I haven't seen it either... or maybe I have seen it very infrequently. I can't be entirely sure, because wild hotplugging sessions combined with working on driver updates sometimes come with kernel panics of different sorts.

Stefan, a minor correction on status:
- 2.6.23.13-y you have correct: boots, sometimes messages, sometimes an oops.
- 2.6.23.9 or anything before: cannot boot at all with the drive turned on.

My drive in particular had no problem working with the 2.6.23-6 kernel, even after I noticed the issues with newer kernels (I had kept 2.6.23-6 for several months and checked that it still never had any of these symptoms, not even the occasional warning messages). The drive also has no problems at all with OSX 10.5 on a recent MacBook. I commented earlier that sometimes my external drive disconnects with the kernels after 2.6.23-6; this does not happen on OSX.

I want to test the latest kernel firewire stack handling my MacBook in target disk mode, but I am waiting until I get it better backed up, maybe the end of this week. I have an ext3 and an ntfs partition on the MacBook's HD and will see how the juju stack handles mounting those at kernel boot time. I'll then have the same kernel and firewire card but a different drive to test.

(In reply to comment #11)
> > I just got a drive late last week from someone in the office that
> > crashes some systems, and simply fails to log in (orb timeouts) on others,
>
> Drives which have frequent login failures should be cross-checked with the old
> stack and/or alternative OSs. If the drives make trouble with the other stack/
> OSs too, then that's something to be kept in mind.

I checked last night, and the disk in question (a 1TB WD My Book) works perfectly with Mac OS X (10.4 and 10.5); I haven't tried it with anything else yet.
I've also got a LaCie drive that behaves similarly: it works with OS X, but not with juju (I never could log in with a few different controllers, and then it hard-locked my laptop while plugging/unplugging/replugging).

> Of course we shall make Juju
> come out best in interoperability with junk devices in the long term...

Definitely. On tap for me tonight is to read through the SBP-2 spec and code, to get a better idea of where the failures I'm seeing are actually happening. Oh, and I've got a nice stack of drives to work with now...

Stefan, I agree; I do remember that kernel (2.6.23.1-42) failing to boot, and it may have been the kernel that my image is from.

Some improvement with 2.6.23.14-107.fc8.i686: no oops (good!), and all 3 partitions did mount at boot, with no firewire timeout messages:

```
firewire_ohci: Added fw-ohci device 0000:02:09.2, OHCI version 1.10
firewire_core: created new fw device fw0 (0 config rom retries, S400)
scsi2 : SBP-2 IEEE-1394
firewire_core: created new fw device fw1 (0 config rom retries, S400)
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
firewire_sbp2: logged in to fw1.0 LUN 0000 (0 retries)
firewire_sbp2: sbp2_scsi_abort
scsi 2:0:0:0: scsi: Device offlined - not ready after error recovery
firewire_sbp2: logged in to fw1.0 LUN 0000 (0 retries)
scsi 2:0:0:0: Direct-Access-RBC Maxtor 6 L120P0 PQ: 0 ANSI: 4
```

And, uhm... /sheepish: the drive in the enclosure is a Maxtor, not WD as I commented above more than once.

Nb: see the tail end of bug 238606; I think we'll have the orb reply timeout thing fixed shortly. :) It doesn't fix the panic seen in sbp2_cancel_orbs, but it should make it much harder to trigger in normal use.
New messages in dmesg from kernel-2.6.23.14-111.fc8.i686: login to the drive succeeds, but filesystems do not mount; later these messages appear:

```
firewire_sbp2: sbp2_scsi_abort
sd 2:0:0:0: scsi: Device offlined - not ready after error recovery
sd 2:0:0:0: rejecting I/O to offline device
sd 2:0:0:0: rejecting I/O to offline device
sd 2:0:0:0: rejecting I/O to offline device
sd 2:0:0:0: [sdc] READ CAPACITY failed
sd 2:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
sd 2:0:0:0: [sdc] Sense not available.
sd 2:0:0:0: rejecting I/O to offline device
sd 2:0:0:0: [sdc] Write Protect is off
sd 2:0:0:0: [sdc] Mode Sense: 00 00 00 00
sd 2:0:0:0: rejecting I/O to offline device
sd 2:0:0:0: [sdc] Asking for cache data failed
sd 2:0:0:0: [sdc] Assuming drive cache: write through
sd 2:0:0:0: [sdc] Attached SCSI disk
sd 2:0:0:0: Attached scsi generic sg4 type 14
firewire_sbp2: reconnected to fw1.0 LUN 0000 (0 retries)
```

I haven't mentioned this before (maybe it's irrelevant), but this drive is connected to this machine through an SB Audigy Gamer sound card's external firewire port. See:

```
02:09.2 FireWire (IEEE 1394): Creative Labs SB Audigy FireWire Port
```

The dreaded bus reset during SCSI inquiry or during SCSI read capacity. This irritates the Linux SCSI core a lot, and perhaps firmwares too. It is also not handled too well by the sbp2 driver of the linux1394 stack, but perhaps in a manner which is more palatable to the Linux SCSI core. I am wondering whether we should switch back to sbp2's scheme of blocking the Scsi_Host (alternatively: the scsi_device) between the bus reset event and the reconnect.

The -111 kernel does that consistently on every boot; the drive is offlined almost immediately after the login.

> The dreaded bus reset during SCSI inquiry or during SCSI read
> capacity. This irritates the Linux SCSI core a lot,

In comment #18, the SCSI core already succeeded with INQUIRY and handed everything over to sd-mod.
sd-mod tried READ CAPACITY, which timed out and was aborted (most certainly because of a bus generation change, as indicated by the later reconnect). This failure motivated the SCSI core to take the device offline, and the subsequent MODE SENSE command from sd-mod failed too. All this went on within __scsi_add_device(), called by fw-sbp2's login job.

The bug in fw-sbp2 is that reconnect jobs cannot run in parallel to the __scsi_add_device() job (which is currently tied to the login job). It wouldn't hurt as much if the SCSI core completely failed __scsi_add_device(), because then fw-sbp2 would retry the login + __scsi_add_device() job.

Potential quick and dirty solution: In fw-sbp2.c::sbp2_login(), check whether sdev.sdev_state == SDEV_OFFLINE right after __scsi_add_device() succeeded. If so, remove the sdev again, log out, and reschedule the login job.

Real solution: Perform __scsi_add_device() in a different process context. (Not in the shared workqueue, though.) Maybe replace all of fw-sbp2's workqueue jobs with short-lived kthreads.

The quick and dirty solution is posted at http://lkml.org/lkml/2008/1/26/161 and also uploaded at http://me.in-berlin.de/~s5r6/linux1394/updates/ (not yet in linux1394-2.6.git, pending review).

Jarod, do we want the patches from comment #23 in rawhide?

We do. In fact, I added 'em earlier today. :)

I haven't tried the f8 kernels since I've moved the machines to rawhide, but I've seen no problems with this drive since this was added in. Is this bug needed to track that quick solution, or should it be closed? Do you need me to go back and see how the drive behaves with the latest f8 kernel build?

Nah, I think we can go ahead and close this one; the same fixes are already in the 2.6.24.x kernels being built for f8, which should be in updates-testing shortly.