Bug 466679 - firewire stack fails to initalize a harddrive on boot: possibly timing problem between firewire driver modules?
firewire stack fails to initalize a harddrive on boot: possibly timing proble...
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: kernel (Show other bugs)
rawhide
All Linux
medium Severity medium
: ---
: ---
Assigned To: Jarod Wilson
Fedora Extras Quality Assurance
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2008-10-12 14:18 EDT by Alex Kanavin
Modified: 2008-10-29 11:35 EDT (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-10-29 11:35:06 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
screenshot of a failed boot sequence due to firewire errors (962.49 KB, image/jpeg)
2008-10-12 14:18 EDT, Alex Kanavin
no flags Details
example of the same drive properly initialized (when e.g. booting from a F10 DVD) (171.78 KB, text/plain)
2008-10-12 14:19 EDT, Alex Kanavin
no flags Details
firewire: fw-sbp2: extend failure check after SCSI probe (1.13 KB, patch)
2008-10-14 18:12 EDT, Stefan Richter
no flags Details | Diff
ohci (debug=7) and sbp2 are loaded at the same time (931.95 KB, image/jpeg)
2008-10-21 08:30 EDT, Alex Kanavin
no flags Details
only ohci (debug=7) is loaded (883.91 KB, image/jpeg)
2008-10-21 08:30 EDT, Alex Kanavin
no flags Details
failed boot sequence, continued (1.00 MB, image/jpeg)
2008-10-21 17:03 EDT, Alex Kanavin
no flags Details
firewire: fw-sbp2: delay first login to avoid retries (1.61 KB, patch)
2008-10-21 18:38 EDT, Stefan Richter
no flags Details | Diff
different outcome - orb reply timed out/failed to login (1.02 MB, image/jpeg)
2008-10-22 02:07 EDT, Alex Kanavin
no flags Details

  None (edit)
Description Alex Kanavin 2008-10-12 14:18:30 EDT
Created attachment 320148 [details]
screenshot of a failed boot sequence due to firewire errors

Description of problem:
In Fedora 10 Beta, during boot there are firewire errors and hard drive fails to initalize (see attached screensshot, note that messages got overlayed on top of ones from the previous screen for some reason).

Version-Release number of selected component (if applicable):
Fedora 10 Beta, no updates applied.

How reproducible:
Always on bootup.

Additional info:
I have also attached messages from booting a F10 DVD, where the same firewire drive gets initialized correctly. Note the slightly different order of messages:

(when things work)
<6>firewire_ohci 0002:20:0e.0: enabling device (0000 -> 0002)
<5>firewire_ohci: Added fw-ohci device 0002:20:0e.0, OHCI version 1.10
<5>firewire_core: created device fw0: GUID 001124fffe871160, S400
<5>firewire_core: created device fw1: GUID 0050770e00071002, S400
<5>firewire_core: phy config: card 0, new root=ffc1, gap_count=5 
<6>scsi0 : SBP-2 IEEE-1394
<5>firewire_sbp2: Workarounds for fw1.0: 0x20 (firmware_revision 0x012804, model_id 0x000001)
<5>firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
<5>scsi 0:0:0:0: Direct-Access-RBC WDC WD80 0JB-00JJA0            PQ: 0 ANSI: 4
<5>sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
<5>sd 0:0:0:0: [sda] Write Protect is off
<7>sd 0:0:0:0: [sda] Mode Sense: 11 00 00 00
<5>sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
<5>sd 0:0:0:0: [sda] 156301488 512-byte hardware sectors (80026 MB)
<5>sd 0:0:0:0: [sda] Write Protect is off
<7>sd 0:0:0:0: [sda] Mode Sense: 11 00 00 00
<5>sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
<6> sda: [mac] sda1 sda2 sda3 sda4
<5>sd 0:0:0:0: [sda] Attached SCSI disk
<5>sd 0:0:0:0: Attached scsi generic sg0 type 14

(when things don't work)
firewire_ohci 0002:20:0e.0: enabling device (0000 -> 0002)
firewire_ohci: Added fw-ohci device 0002:20:0e.0, OHCI version 1.10
Loading firewire-sbp2 module
Waiting for driver initialization
firewire_core: created device fw0: GUID 001124fffe871160, S400
scsi0 : SBP-2 IEEE-1394
firewire_sbp2: Workarounds for fw1.0: 0x20 (firmware_revision 0x012804, model_id 0x000001)
firewire_core: created device fw1: GUID 0050770e00071002, S400
firewire_core: phy config: card 0, new root=ffc1, gap_count=5 
firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
scsi 0:0:0:0: Direct-Access-RBC WDC WD80 0JB-00JJA0            PQ: 0 ANSI: 4
(many errors follow, see screenshot)

So it seems that firewire_sbp2 proceeds when firewire_core hasn't finished yet, and this causes trouble. I just did a quick hack of a fix by adding a 'stabilized /sys/bus/firewire/devices' to initrd image, so that there is a pause between loading core/ohci and sbp2, and indeed, that fixed the problem. However, a proper fix should probably be in the drivers - let me know.
Comment 1 Alex Kanavin 2008-10-12 14:19:46 EDT
Created attachment 320149 [details]
example of the same drive properly initialized (when e.g. booting from a F10 DVD)
Comment 2 Stefan Richter 2008-10-14 18:12:44 EDT
Created attachment 320368 [details]
firewire: fw-sbp2: extend failure check after SCSI probe

> note that messages got overlayed on top of ones from the
> previous screen for some reason).

Good that there are Beta tests.  It's another kernel bug if I may say so.

> firmware_revision 0x012804
> GUID 005077...

A Prolific chip.  This tells us that its firmware is either /moderately/ buggy or /extremely/ buggy.

Re attachment 320148 [details]:
The very first SCSI request (INQUIRY from SCSI core) succeeds.  The second (TEST_UNIT_READY from sd) perhaps too.  The third one (READ_CAPACITY from sd) times out for no apparent reason, and the SCSI core takes the freshly created device offline.

We have a little check in fw-sbp2's probe for a FireWire bus generation update between login and SCSI probe.  Such a generation change would lead to failure of the SCSI probe, but fw-sbp2 then destroys the SCSI core's device instance, logs out and back in and tries the SCSI probe again.

This means that the only thing in which way the fw-core's initialization could interfere with fw-sbp2's initialization (a bus reset for bus management reasons) *did not happen* in your case.

It was something else.

You can perhaps arrange for firewire-ohci being booted with the parameter debug=7 to gather more information about what is going on on the FireWire bus, can you?

And I could extend firewire-sbp2's check after the SCSI probe a little bit:  Do the destroy--logout--login--SCSI-probe game also if the SCSI core set the new device offline.  I actually used such a check at some point but this alone covered less variants of failure than the generation check.  However, it's cheap to check FireWire generation _and_ SCSI device offline status, hence we should probably do both.
Comment 3 Alex Kanavin 2008-10-21 08:28:18 EDT
Just to be sure, I know Prolific is buggy, but at least I've updated it with the very latest firmware available from http://www.prolific.com.tw/eng/downloads.asp?ID=44

I have added debug=7 to ohci loading parameters and it seems to me that a bus reset does happen - see attached screenshots. If sbp2 is being loaded at the same time, there is a long pause after the last line on the screenshot, and then the sbp2 initialization fails. If sbp2 is loaded a little bit later, everything seems to be fine. Sorry for the blurriness of the second pic, hope it's not a big problem.
Comment 4 Alex Kanavin 2008-10-21 08:30:11 EDT
Created attachment 321008 [details]
ohci (debug=7) and sbp2 are loaded at the same time
Comment 5 Alex Kanavin 2008-10-21 08:30:59 EDT
Created attachment 321009 [details]
only ohci (debug=7) is loaded
Comment 6 Stefan Richter 2008-10-21 16:20:42 EDT
Re attachment 321008 [details]:
Yes, there was a bus reset, though in this example it happened after successful completion of a few more SCSI requests (Read Capacity, Mode Sense to get cache type information).  There were still a few requests missing to complete the SCSI probe (reading the partition table).

So I am surprised that the current check in firewire-sbp2 did not catch this one.  However, the patch in attachment 320368 [details] will catch this if the failed session of comment #3 ended with an offline device.

...

OK, I now reproduced a failure with a Prolific based 2.5" HDD.  I needed to modprobe firewire-sbp2 before modprobe firewire-ohci get the failure.  A kernel _with_ the mentioned patch applied, fw0, fw1, fw4 are other controllers:

firewire_core: created device fw2: GUID 0010dc5600fed2d4, S400
scsi38 : SBP-2 IEEE-1394
firewire_sbp2: Workarounds for fw3.0: 0x20 (firmware_revision 0x012804, model_id 0x000001)
firewire_core: created device fw3: GUID 0050770e01305f6a, S400
firewire_core: phy config: card 2, new root=ffc1, gap_count=5
firewire_sbp2: fw3.0: logged in to LUN 0000 (0 retries)
scsi 38:0:0:0: Direct-Access-RBC WDC WD25 00BEVE-00WZT0         PQ: 0 ANSI: 4
sd 38:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
sd 38:0:0:0: [sdd] Write Protect is off
sd 38:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 38:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 38:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
sd 38:0:0:0: [sdd] Write Protect is off
sd 38:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 38:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdd:<5>firewire_core: created device fw4: GUID 00110600000041cc, S400
firewire_sbp2: fw3.0: sbp2_scsi_abort
sd 38:0:0:0: Device offlined - not ready after error recovery
sd 38:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdd, sector 0
__ratelimit: 135 callbacks suppressed
Buffer I/O error on device sdd, logical block 0
sd 38:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd, logical block 0
ldm_validate_partition_table(): Disk read failed.
sd 38:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd, logical block 0
sd 38:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd, logical block 0
 unable to read partition table
sd 38:0:0:0: [sdd] Attached SCSI disk
sd 38:0:0:0: Attached scsi generic sg3 type 14
sd 38:0:0:0: [sdd] Synchronizing SCSI cache
sd 38:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
sd 38:0:0:0: [sdd] Stopping disk
sd 38:0:0:0: [sdd] START_STOP FAILED
sd 38:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
firewire_sbp2: fw3.0: error status: 0:12
firewire_sbp2: fw3.0: logged in to LUN 0000 (0 retries)
scsi 38:0:0:0: Direct-Access-RBC WDC WD25 00BEVE-00WZT0         PQ: 0 ANSI: 4
sd 38:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
sd 38:0:0:0: [sdd] Write Protect is off
sd 38:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 38:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 38:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
sd 38:0:0:0: [sdd] Write Protect is off
sd 38:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 38:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdd: [mac] sdd1 sdd2 sdd3 sdd4
sd 38:0:0:0: [sdd] Attached SCSI disk
sd 38:0:0:0: Attached scsi generic sg3 type 14

You can see how the SCSI probe fails, the firewire-sbp2 starts over, and the second SCSI probe succeeds.


And here is the same procedure but without the patch:

firewire_core: created device fw2: GUID 0010dc5600fed2d4, S400
scsi39 : SBP-2 IEEE-1394
firewire_sbp2: Workarounds for fw3.0: 0x20 (firmware_revision 0x012804, model_id 0x000001)
firewire_core: created device fw3: GUID 0050770e01305f6a, S400
firewire_core: phy config: card 2, new root=ffc1, gap_count=5
firewire_sbp2: fw3.0: logged in to LUN 0000 (0 retries)
scsi 39:0:0:0: Direct-Access-RBC WDC WD25 00BEVE-00WZT0         PQ: 0 ANSI: 4
sd 39:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
sd 39:0:0:0: [sdd] Write Protect is off
sd 39:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 39:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 39:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
sd 39:0:0:0: [sdd] Write Protect is off
sd 39:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 39:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdd:<5>firewire_core: created device fw4: GUID 00110600000041cc, S400
firewire_sbp2: fw3.0: sbp2_scsi_abort
sd 39:0:0:0: Device offlined - not ready after error recovery
sd 39:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdd, sector 0
Buffer I/O error on device sdd, logical block 0
sd 39:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd, logical block 0
ldm_validate_partition_table(): Disk read failed.
sd 39:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd, logical block 0
sd 39:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd, logical block 0
 unable to read partition table
sd 39:0:0:0: [sdd] Attached SCSI disk
sd 39:0:0:0: Attached scsi generic sg3 type 14
sd 39:0:0:0: [sdd] Synchronizing SCSI cache
sd 39:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
sd 39:0:0:0: [sdd] Stopping disk
sd 39:0:0:0: [sdd] START_STOP FAILED
sd 39:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
firewire_sbp2: fw3.0: error status: 0:12
firewire_sbp2: fw3.0: logged in to LUN 0000 (0 retries)
scsi 39:0:0:0: Direct-Access-RBC WDC WD25 00BEVE-00WZT0         PQ: 0 ANSI: 4
sd 39:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
sd 39:0:0:0: [sdd] Write Protect is off
sd 39:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 39:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 39:0:0:0: [sdd] 488397168 512-byte hardware sectors (250059 MB)
sd 39:0:0:0: [sdd] Write Protect is off
sd 39:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 39:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdd: [mac] sdd1 sdd2 sdd3 sdd4
sd 39:0:0:0: [sdd] Attached SCSI disk
sd 39:0:0:0: Attached scsi generic sg3 type 14

So again, even without the patch there is a retry of the SCSI probe because firewire-sbp2 observed that a FireWire bus generation change (bus reset) happened during the first SCSI probe attempt.

Alex, do the screenshots in attachment 320148 [details] and attachment 321008 [details] show everything of the failure, or came more after it?  Especially, was there a "failed to login" message eventually?

PS:
firewire-sbp2's strategy to re-enter login and SCSI probe if there was a bus reset during SCSI probe is actually just a stop-gap solution.  It works around the fact that firewire-sbp2 cannot perform reconnect while calling the SCSI probe.  In the long term I want to change that by running the SCSI probe in a thread parallel to SBP-2 reconnect or relogin.
Comment 7 Stefan Richter 2008-10-21 16:33:51 EDT
Re comment #6:
Weird.  The message "firewire_core: phy config: card 2, new root=ffc1, gap_count=5" which means that the core is going to reset the bus is logged before the firewire-sbp2 login in my dmesg.  Same in your attachment 320148 [details].  Well, I suppose one shouldn't rely too much on the order of these messages.
Comment 8 Alex Kanavin 2008-10-21 17:03:45 EDT
Created attachment 321088 [details]
failed boot sequence, continued

Yes, there is more - I scrolled up before making the attachment 320148 [details] to have everything related to the initial failure. What follows that is attached - it seems to vary though, sometimes the system manages to initialize the drive after a reset, but boot can't proceed because the drive init happened too late (like on the screenshot), sometimes there are just lots of i/o errors as a boot is attempted and it fails. I'll try to collect samples of other cases tomorrow - exactly what happens is unpredictable.
Comment 9 Stefan Richter 2008-10-21 17:35:34 EDT
Comment on attachment 320368 [details]
firewire: fw-sbp2: extend failure check after SCSI probe

Ah.

So the driver does actually work in the way as it is intended.  At least in attachment 321088 [details], the second SCSI probe /is/ run and it succeeds.

Your only problem is that this suboptimal procedure works far too slow for your initrd's needs.
Comment 10 Stefan Richter 2008-10-21 18:38:07 EDT
Created attachment 321102 [details]
firewire: fw-sbp2: delay first login to avoid retries

This should do the trick:  The login and SCSI probe should now happen after the gap count optimizing bus reset, thus that the very first attempt (of login inquiry, read capacity etc. and reading the partition table) proceeds smoothly.

Works for me in the "modprobe firewire-sbp2; modprobe firewire-ohci" test with the 2.5" PL3507 disk.  The time-consuming failure and retry as in comment #6 does not happen anymore here.
Comment 11 Alex Kanavin 2008-10-22 02:07:45 EDT
Created attachment 321114 [details]
different outcome - orb reply timed out/failed to login

Here's an example of a different possible outcome - after a long series of "orb reply timed out" messages, there is a "failure to login message", and the drive power needs to be switched off/on for it to become visible again. Just to be sure - should your last patch fix this as well?
Comment 12 Stefan Richter 2008-10-22 08:30:11 EDT
Re comment 11:
Interesting log, thanks.  Yes, the patch in attachment 321102 [details] will fix this too, similarly to how your initrd workaround from comment 0 fixes it --- by changing the timing so that the bus reset is 99.9% certain to happen before firewire-sbp2 springs into action.

The "ORB reply timeout" actually looks familiar.  PL3507 and some other bridge chips tend to behave that way in general if a bus reset happens during ongoing I/O.  I have not investigated this problem in detail yet; at the moment we simply try to avoid it.
Comment 13 Alex Kanavin 2008-10-22 10:18:39 EDT
Thanks, I can confirm that the last patch fixes the problem as far as I could test - booted the patched kernel several times with and without another Firewire device, and it was always perfect.

I guess this fix will show up in mainline kernel soon - will it be in 2.6.28 and onwards only, or is it possible to get it to Fedora sooner?
Comment 14 Stefan Richter 2008-10-25 07:54:34 EDT
Thanks for testing.  I expect that I will get the patch into mainline before 2.6.28-rc2.  After it showed up in there, I will submit it for kernel.org's 2.6.27.y series too.  It will thus find its way into F10 eventually, but maybe the Fedora kernel maintainers take the patch in earlier than that.
Comment 15 Jarod Wilson 2008-10-29 11:35:06 EDT
Patch is now in kernel-2.6.27.4-66.fc10 and later.

Note You need to log in before you can comment on or make changes to this bug.