Red Hat Bugzilla – Full Text Bug Listing
|Summary:||[firewire] unable to use disk (fw_sbp2: failed to login to ...)|
|Product:||[Fedora] Fedora||Reporter:||Will Woods <wwoods>|
|Component:||kernel||Assignee:||Jarod Wilson <jarod>|
|Status:||CLOSED CURRENTRELEASE||QA Contact:||Brian Brock <bbrock>|
|Version:||rawhide||CC:||ak, dpnaessens, fenlason, krh, lordmorgul, stefan-r-rhbz|
|Fixed In Version:||184.108.40.206-107.fc8||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2008-02-05 16:20:54 EST||Type:||---|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Bug Depends On:|
|Bug Blocks:||235704, 246496|
Description Will Woods 2007-05-01 15:15:39 EDT
I've got a Western Digital FireWire/USB 2.0 External Hard Drive - WDXC2500JBRNN is the model number. None of my rawhide machines can seem to connect to it. It's not using bus power. Western Digital says that it uses either the Oxford 922 or Initio 1430 chipset; I haven't opened it up to find out which. WD reference: http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=1469 Here's some lspci / kernel log info. On my x86_64 machine: 05:02.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 61) fw_core: phy config: card 0, new root=ffc1, gap_count=5 fw_core: giving up on config rom for node id ffc0 On my G5 Power Mac it gets a little further: 0001:03:0e.0 FireWire (IEEE 1394): Apple Computer Inc. K2 FireWire I'll attach the log from that machine.
Comment 1 Will Woods 2007-05-01 15:15:40 EDT
Created attachment 153885 [details] log from Power Mac G5
Comment 2 Stefan Richter 2007-05-06 11:37:06 EDT
Oxford 922 is for FireWire 800 + USB 2.0, Initio 1430 for FireWire 400. I take it from "gap_count=5" that it's FireWire 400 (1394a) with an additional USB 2.0 link and PHY. Or it is an Initio 1530 = FireWire 400 + USB 2.0. Would you have more luck with the old FireWire stack or with a different OS?
Comment 3 Will Woods 2007-05-06 17:18:15 EDT
Good to know! It's FireWire 400 + USB 2.0, so I guess that makes it an Initio 1530. The drive works fine with another OS (Mac OS X) and worked with the old firewire stack (in FC6).
Comment 4 Darren Naessens 2007-05-17 07:56:05 EDT
I have a Lacie BigDisk on various IBM Thinkcentre's and have a similar problem with the new firewire stack on FC7 for some time, output usually is: fw_core: created new fw device fw1 (4 config rom retries) fw_core: phy config: card 0, new root=ffc0, gap_count=63 fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: failed to login to fw1.0 FC6 gave me no trouble whatsoever. Currently on FC7 updated today to kernel-2.6.21-1.3116.fc7 and results in above. If it does connect on FC7 it can be fine for a while, but have had a few times where the whole machine locks up whilst copying files to the external drive - requiring turning the whole machine off to recover. Am using the older stack when I have to use the external drive, but mentioning my issues on here due to the locking up.
Comment 5 Stefan Richter 2007-05-17 13:17:33 EDT
Re comment #4: Are you using the old stack on the same kernel on which the new stack fails?
Comment 7 Darren Naessens 2007-05-18 10:53:35 EDT
Re comment #5: Yep. Was changing the config in /boot around each time. Just tried 3163 and the new stack works each time on rebooting, but if its unmounted it wont remount: May 18 15:48:19 bird kernel: fw_core: created new fw device fw1 (4 config rom retries) May 18 15:48:21 bird kernel: fw_sbp2: orb reply timed out, rcode=0x11 May 18 15:48:21 bird kernel: fw_core: phy config: card 0, new root=ffc0, gap_count=63 May 18 15:48:23 bird kernel: fw_sbp2: orb reply timed out, rcode=0x11 May 18 15:48:34 bird last message repeated 5 times May 18 15:48:34 bird kernel: fw_sbp2: failed to login to fw1.0 3149 worked as previously described with new stack but did with old stack.
Comment 8 Stefan Richter 2007-05-18 12:53:28 EDT
I may be wrong, but it appears as if the sequence of event is - - fw-core scanned device's ROM, hands it over to fw-sbp2 - fw-sbp2 starts login attempt - bus reset occurs (caused by the device or by fw-core, I don't know) - fw-sbp2's repeated login attempts go to a stale host-side device representation or stale device-side context The old stack definitely has problems if a bus reset occurs in the middle of sbp2's login attempt. /Maybe/ the new stack has similar problems. At least its timings are different from the old one's. Paradoxically, the new stack should be more successful in the device detection business AFAIU from looking at the sources.
Comment 9 Kristian Høgsberg 2007-05-22 22:02:44 EDT
I'm pretty sure I've tracked this down to a missing byteswap in the DMA program setup code. The .3190 kernel is building now with a patch to add the missing byteswaps. I've reproduced the problem on a mac ppc laptop and the fix makes it work for me. The build can be tracked here: http://koji.fedoraproject.org/koji/taskinfo?taskID=14801 and the RPM for ppc can be found by following the ppc descendent task link once build finishes. Please test, thanks :)
Comment 10 Kristian Høgsberg 2007-05-23 01:43:14 EDT
RPMs are here: http://koji.fedoraproject.org/koji/buildinfo?buildID=7077
Comment 11 Darren Naessens 2007-05-25 12:01:26 EDT
Created attachment 155461 [details] firewire on startup log
Comment 12 Darren Naessens 2007-05-25 12:02:06 EDT
Created attachment 155462 [details] firewire hotplug log
Comment 13 Stefan Richter 2007-05-25 13:20:42 EDT
Darren, the "firewire on startup log" looks good --- /if/ you can read the partions afterwards and do actual I/O. The line "fw_core: phy config: card 0, new root=ffc0, gap_count=63" in "firewire hotplug log" is what I meant by "bus reset" in comment #8. Do you still have a log around from a session where you were able to do some I/O before subsequent failure? If so, does it contain such a bus reset message (or several of them) near the failure too?
Comment 14 Alex Kanavin 2007-05-26 16:35:17 EDT
Just tried the .3194 with my drive and no joy (dmesg output): <5>fw_ohci: Added fw-ohci device 0002:20:0e.0, OHCI version 1.10 <5>fw_core: created new fw device fw0 (0 config rom retries) <5>fw_core: created new fw device fw1 (0 config rom retries) <5>fw_core: phy config: card 0, new root=ffc1, gap_count=5 <7>PHY ID: 4061e4, addr: 0 <3>fw_sbp2: orb reply timed out, rcode=0x11 <3>fw_sbp2: management write failed, rcode 0x10 <3>fw_sbp2: management write failed, rcode 0x10 <3>fw_sbp2: management write failed, rcode 0x10 <3>fw_sbp2: management write failed, rcode 0x10 <3>fw_sbp2: management write failed, rcode 0x10 <3>fw_sbp2: failed to login to fw1.0 More info about my drive and controller in bug #231708. Tell me if you need further testing.
Comment 15 Alex Kanavin 2007-05-26 17:47:32 EDT
Hmm, discard that. After a reboot and another attempt, everything worked! Hardware quirk maybe?
Comment 16 Stefan Richter 2007-05-26 17:58:44 EDT
Many PL-3507 are sold with buggy firmware. If you have a Windows PC around, have a look at http://www.prolific.com.tw/eng/downloads.asp?ID=44 for current firmware and an EEPROM reader/writer utility. This utility shows you which ROM version is installed. (See readme file in the reader/writer zip archive.) These PL-3507 are in my experience a gamble with the old FireWire stack too.
Comment 17 Alex Kanavin 2007-05-26 18:08:05 EDT
When I got the PL3507 enclosure, it had some old firmware, and the experience was very frustrating: the drive would randomly stop responding after an hour of work, sometimes more, sometimes less. Things were made more complicated by actually installing Fedora Core 6 on it (rather than just using it as a data partition) - I'm not sure how it managed to survive through the installation. Things improved a lot after a firmware update - the only remaining trouble is that occasionally the drive doesn't respond on boot (the kernel would load, but wouldn't initialize the drive) - seems like this problem exists with both stacks. I'll check if I have the most latest firmware - it seems so, but double checking won't hurt.
Comment 18 Darren Naessens 2007-05-29 03:23:38 EDT
After reboot (tried a few times), im not able to mount the drive etc last reboot: On startup: fw_ohci: Added fw-ohci device 0000:0a:09.4, OHCI version 1.10 fw_core: created new fw device fw0 (0 config rom retries) fw_core: created new fw device fw1 (0 config rom retries) fw_core: phy config: card 0, new root=ffc0, gap_count=63 <--bus reset again fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: orb reply timed out, rcode=0x11 fw_sbp2: failed to login to fw1.0 Regarding the failure - that machine completely packed up - unable to login or anything after an update so started over (sslkeylib or something like that wouldnt work). I should note that if I was only reading from the drive, it was fine, writing to the drive caused the lockup. Out of curiosity, as I only have the one firewire device connected, why are 2 devices created? fw0 and fw1? Is one the firewire card itself?
Comment 19 Alex Kanavin 2007-06-01 17:05:51 EDT
I've updated to F7 and it has rendered my system unbootable :( I'm booting it from an external firewire drive, and what seems to be happening is that LVM scan happens before fw-spb2 finishes initializing and therefore no volumes are found. If I boot from a rescue CD, the drive gets recognized and mounted fine, because the procedure is different there. Is there a way to serialize module loading or otherwise force LVM to wait a little bit? Or could it be a totally different problem?
Comment 20 Alex Kanavin 2007-06-01 18:06:42 EDT
Created attachment 155949 [details] fix mkinitrd to force waiting for fw-sbp2 initialization (same way as old sbp2 driver) This patch adds the waiting for fw-sbp2 initialization, just like the older sbp2 driver. This is necessary when e.g. booting from a firewire drive, so that DM doesn't get loaded too soon.
Comment 21 Alex Kanavin 2007-06-01 18:10:33 EDT
Ok, I've figured out what's going on: please check and apply the patch to mkinitrd. However, not all is well: the drive still fails to initialize on boot (with the "orb reply timed out" error like in comment #14). I can check and write down the exact message sequence if you want. The strange thing is that if I boot from a rescue cd, then everything works! Something with timings maybe?
Comment 22 Alex Kanavin 2007-06-02 11:51:28 EDT
Created attachment 155985 [details] fix mkinitrd to force waiting for fw-ohci and fw-sbp2 initialization This patch adds the waiting for fw-ohci and fw-sbp2 initialization. This is necessary when e.g. booting from a firewire drive, so that DM/LVM doesn't get loaded too soon and fw-sbp2 doesn't try to initialize before fw-ohci finishes loading.
Comment 23 Alex Kanavin 2007-06-02 11:57:32 EDT
I've added a similar wait for fw-ohci to finish initializing before loading fw-sbp2 and it worked! So apparently fw-sbp2 driver tries to do things before fw-ohci sets everything up. Shouldn't fw-sbp2 do some internal checks, like wait on a mutex or something? I've attached a patch to mkinitrd that includes both waits, but you should see if fw-sbp2 needs to be fixed too, because there might be other situations where rapid "insmod fw-ohci; insmod fw-sbp2" can happen - e.g. anaconda or someone's script.
Comment 24 Stefan Richter 2007-06-02 12:40:37 EDT
> Shouldn't fw-sbp2 do some internal checks, like wait on a mutex > or something? There is nothing it could sensibly wait for. How should fw-sbp2 know beforehand how many FireWire buses are present and how many SBP-2 capable nodes will sit on these buses? The maximum that could be done is to add parameters to fw-sbp2 (a) for how long it should wait for SBP-2 nodes to become available and perhaps (b) after how many nodes to quit waiting, and then fw-sbp2 could be integrated with the Linux SCSI layer's asynchronous scanning facility. Then, the initrd should - load fw-ohci - load fw-sbp2 if there is no hotplug facility in the initrd, - have those module parameters for fw-sbp2 configured (timeout, optionally number of SBP-2 targets to wait for) - load the scsi_wait_scan module. When "modprobe scsi_wait_scan" exits, then all SCSI drivers which are integrated with this generic asynchronous SCSI scanning facility, will have finished (or given up on) discovering all known SCSI targets. Lack of integration of sbp2 (or now rather fw-sbp2) with the asynchronous SCSI scanning facility is tracked as http://bugzilla.kernel.org/show_bug.cgi?id=3225 . While I am at it, let me explain the roles of the firewire drivers: fw-ohci implements interfaces to FireWire buses (for interfaces of the OHCI type). fw-core, among else, looks what kinds of nodes are visible on present buses and notifies protocol drivers (in kernel or in userspace) of detected nodes. fw-sbp2 handles whatever SBP-2 nodes are given to it. fw-ohci knows nothing of fw-sbp2, and vice versa. There could be other controller drivers besides fw-ohci beneath fw-core, and there could and probably will be other protocol drivers besides fw-sbp2 above fw-core. The old IEEE 1394 stack actually has two mainline controller drivers and two (incomplete) out-of-tree controller drivers beneath ieee1394; and it has several protocol drivers above ieee1394 in mainline and out-of-tree.
Comment 25 Kristian Høgsberg 2007-06-02 13:14:35 EDT
(In reply to comment #24) > > Shouldn't fw-sbp2 do some internal checks, like wait on a mutex > > or something? > > There is nothing it could sensibly wait for. How should fw-sbp2 know beforehand > how many FireWire buses are present and how many SBP-2 capable nodes will sit on > these buses? That said, it does sound like there's a timing problem here. It could be that the disk in question requires several config rom read retries and that there is some race between those retries and loading fw-sbp2.
Comment 26 Alex Kanavin 2007-06-02 13:22:04 EDT
The log says Jun 2 19:52:44 localhost kernel: fw_core: created new fw device fw0 (0 config rom retries) Jun 2 19:52:45 localhost kernel: fw_core: created new fw device fw1 (0 config rom retries) so it's probably some other kind of race?
Comment 27 Stefan Richter 2007-06-02 15:13:24 EDT
The "modprobe fw-sbp2" process will execute the SBP-2 login, SCSI inquiry, and SCSI device creation *only* if fw-core already finished probing of the remote node before fw-sbp2 is inserted into the kernel. Otherwise, "modprobe fw-sbp2" will only do some minimal driver initialization and will exit quickly, not knowing that fw-core is just (or will soon be) discovering a remote node.
Comment 28 Alex Kanavin 2007-07-23 12:57:59 EDT
Things have gotten worse in 220.127.116.11. For some reason, firewire-sbp2 gets loaded before firewire-ohci, so there's no way to make a pause between their initialization. Now I almost always get "orb reply timed out" error. Very occasionally the initialization goes through fine, and twice I also got a kernel crash. I'll attach the screen images so you can see how they look like.
Comment 29 Alex Kanavin 2007-07-23 13:23:31 EDT
Created attachment 159795 [details] kernel crash during firewire-sbp2 init
Comment 30 Alex Kanavin 2007-07-23 13:24:32 EDT
Created attachment 159796 [details] firewire errors during sbp2 init
Comment 31 Alex Kanavin 2007-09-01 06:39:52 EDT
Comment on attachment 155985 [details] fix mkinitrd to force waiting for fw-ohci and fw-sbp2 initialization Looks like the problem went away with the latest rawhide kernel. So this bug can be closed.
Comment 32 Stefan Richter 2007-09-01 07:36:56 EDT
Is it fixed for all reporters? I guess not.
Comment 33 Will Woods 2007-09-04 19:05:54 EDT
Works fine for me with kernel-2.6.23-0.153.rc4.fc8, the drive from the original report, and a Mac Mini; I'll test on the machines from the original report as soon as I can.
Comment 34 Will Woods 2007-09-05 14:41:32 EDT
Nope, still doesn't work on the original machine (dell precision 380, x86_64).
Comment 35 Darren Naessens 2007-10-03 10:06:05 EDT
Haven't been able to test for a while, but mine works with 2.6.23-0.214.rc8.git2.fc8. Comes up when plugged in (with no errors in log) and have successfully transferred over a 100gb to it with no problem.
Comment 36 Will Woods 2007-10-03 17:35:06 EDT
Situation is unchanged here with kernel .214.
Comment 37 Darren Naessens 2007-10-17 16:16:07 EDT
I have started having problems again, the whole machine sometimes locks up shortly after mounting the drive. Nautilus pops up and the disk starts to be read during which its locks up. Upon reboot the last line in the log before crash always says mounted drive on behalf of user, and thats it which to me is normal. No errors at all. It seems random in occurance, it has happened 3 times in the last few days. If Nautilus displays lists the root of the drive it is fine thereafter. Via the command line, can 'ls' no problem until lock up (if there is a lock up). To check the drive tried mounting it on OSX and it comes up no problem.
Comment 38 Will Woods 2007-10-18 12:27:54 EDT
Re comment #37: Darren, this bug is not about system lockups with firewire, but failure to mount firewire drives. Unless you see the "Cannot connect to external firewire drive (fw_sbp2: failed to login to fw1.0)" message or something similar in your system logs, you're talking about a different problem. If you're still getting lockups, please file a separate bug to avoid further confusion. As far as this bug goes: the "failed to login" problem is gone, and firewire drives work fine on all my test machines, except one. The x86_64 machine with an Agere Systems FW323 controller still fails to mount any firewire drive, with kernel messages as in the original bug report: [plug drive in] Oct 18 12:21:18 brinstar kernel: firewire_core: phy config: card 0, new root=ffc1, gap_count=5 [wait 30 sec] Oct 18 12:21:49 brinstar kernel: firewire_core: giving up on config rom for node id ffc0 [remove drive] Oct 18 12:22:09 brinstar kernel: firewire_core: phy config: card 0, new root=ffc0, gap_count=63 Should I split this problem into a different bug, since the sbp2 problem named in the subject of the bug is fixed?
Comment 39 Darren Naessens 2007-10-18 16:53:31 EDT
Will, Slowly getting the hang of this bugzilla malarkey. Yep you are quite right I am talking about a different problem there and will file accordingly. Won't happen again. So for me the login problem is gone, when first plugging the drive in. So I don't waste anyone's further time, would the fact that If I successfully get the drive mounted without problems, do I/O, unmount it, unplug it and later in that same session, plug it back in and then always get the failed to login message fall under this bug or is that another separate issue?
Comment 40 Stefan Richter 2007-10-18 17:24:07 EDT
Splitting and shifting the bug entries back and forth is all nice and well. But what is needed to get these bugs actually fixed is either detailed debug info¹ or a driver hacker with hardware that lets him reproduce the problem² (so that he can produce debug info by himself). ¹) Will, if I find time at the weekend to spare, I'll post a patch to get more info during the config rom probing. ²) My secondary PC is an i945GM based x64_64 with Agere FW323, and it works fine. I don't know though if or when I last tested all of my SBP-2 devices with it.
Comment 41 Stefan Richter 2007-10-18 17:33:25 EDT
Re comment #39: > [...] I successfully get the drive mounted without problems, > do I/O, unmount it, unplug it and later in that same session, > plug it back in and then always get the failed to login message What is logged in dmesg during the unsuccessful replugging?
Comment 42 Will Woods 2007-10-22 17:41:34 EDT
This problem is confined to just one of all the machines I've tried. I think the number of systems affected is pretty low. Moving to F8Target. If you've got any suggestions for things to try with this machine (Dell Precision 380), let me know. Until then I'll wait for the patch for extra debug info during config rom probing.
Comment 43 Stefan Richter 2007-11-07 16:54:58 EST
> Until then I'll wait for the patch for extra debug info during config > rom probing. Someone working on an out-of-tree driver posted a draft patch for this but hasn't resubmitted it yet. And I have been distracted by other stuff recently and am unable to do this during the next ~14 days due to traveling. I hope to get back to this thereafter.
Comment 44 Stefan Richter 2008-01-03 18:20:20 EST
Re comment #43: I haven't forgotten about it, but there are always so many other more fascinating things in life... Will try to get to it RSN.
Comment 45 Stefan Richter 2008-01-17 16:48:14 EST
Re comment #38: > Oct 18 12:21:49 brinstar kernel: firewire_core: giving up on config rom > for node id ffc0 There is also an upstream bug entry now which is probably the same issue: http://bugzilla.kernel.org/show_bug.cgi?id=9760 And there are a few other mentions of the 'giving up on config rom' failure mode on mailinglists or via personal mail.
Comment 46 Jarod Wilson 2008-01-17 17:16:13 EST
So... I was getting very similar login failures with orb reply timeout messages with a certain drive, and was going to do some packet captures with nosy and a pcilynx card... Lo and behold, the drive is accessed and mounted just fine with the pcilynx card essentially serving as a hub between the drive and controller. I've verified that the drive also works with a plain firewire hub in between the drive and controller. I've got a few other drives to check out to see if they behave similarly, but its definitely an interesting data point... Still getting up to speed on specs and code, no idea what the root cause might be yet.
Comment 47 Stefan Richter 2008-01-17 17:27:14 EST
What if you replace the '5' by '7' in fw-card.c::gap_count_table?
Comment 48 Jarod Wilson 2008-01-18 12:08:36 EST
(In reply to comment #47) > What if you replace the '5' by '7' in fw-card.c::gap_count_table? No change, still fails in the same way.
Comment 49 Jarod Wilson 2008-01-18 13:37:54 EST
Got it. Disk login is successful if I simply raise SBP2_ORB_TIMEOUT in fw-sbp2.c. First test was to go from 2000 to 5000, and bingo. Not sure how much increase is really needed yet. firewire_core: created new fw device fw5 (0 config rom retries, S400) firewire_core: phy config: card 2, new root=ffc1, gap_count=7 firewire_sbp2: orb reply timed out, rcode=0x11 firewire_sbp2: logged in to fw5.1 LUN 0001 (0 retries) scsi 11:0:0:1: Enclosure WD My Book Device PQ: 0 ANSI: 4 scsi 11:0:0:1: Attached scsi generic sg2 type 13 firewire_sbp2: logged in to fw5.0 LUN 0000 (1 retries) scsi 10:0:0:0: Direct-Access WD My Book 1016 PQ: 0 ANSI: 4 sd 10:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB) sd 10:0:0:0: [sdb] Write Protect is off sd 10:0:0:0: [sdb] Mode Sense: 10 00 00 00 sd 10:0:0:0: [sdb] Cache data unavailable sd 10:0:0:0: [sdb] Assuming drive cache: write through sd 10:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB) sd 10:0:0:0: [sdb] Write Protect is off sd 10:0:0:0: [sdb] Mode Sense: 10 00 00 00 sd 10:0:0:0: [sdb] Cache data unavailable sd 10:0:0:0: [sdb] Assuming drive cache: write through sdb: sdb1 sd 10:0:0:0: [sdb] Attached SCSI disk sd 10:0:0:0: Attached scsi generic sg3 type 0
Comment 50 Stefan Richter 2008-01-18 14:01:21 EST
Argh... sbp2 has a 20s timeout for login. Why didn't it occur to us earlier that there is this important difference between sbp2 and fw-sbp2? Maybe we should use different timeouts in fw-sbp2.c::sbp2_send_management_orb() depending on the kind of ORB. (Command ORB vs. management ORBs.) Now I'm off to try a bigger SBP2_ORB_TIMEOUT with the only device of mine to which fw-sbp2 consistently fails to log in (SmartDisk VST portable CD-R/W) thusly: Jan 17 22:29:10 stein scsi48 : SBP-2 IEEE-1394 Jan 17 22:29:12 stein firewire_sbp2: orb reply timed out, rcode=0x11 Jan 17 22:29:14 stein firewire_sbp2: orb reply timed out, rcode=0x11 Jan 17 22:29:16 stein firewire_sbp2: status write for unknown orb Jan 17 22:29:16 stein firewire_sbp2: orb reply timed out, rcode=0x11 Jan 17 22:29:18 stein firewire_sbp2: orb reply timed out, rcode=0x11 Jan 17 22:29:21 stein firewire_sbp2: orb reply timed out, rcode=0x11 Jan 17 22:29:23 stein firewire_sbp2: orb reply timed out, rcode=0x11 Jan 17 22:29:23 stein firewire_sbp2: failed to login to fw3.0 LUN 0000 However, "firewire_core: giving up on config rom" is another story.
Comment 51 Jarod Wilson 2008-01-18 14:08:49 EST
In reading over the sbp2 specs, it looks like the management ORB timeout is actually something each device's config rom is supposed to provide for itself. From Section 7.4.8: The mgt_ORB_timeout field shall specify, in units of 500 milliseconds, the maximum time an initiator shall allow for a target to store a status block in response to a management ORB. The time-out commences when the initiator receives either ack_complete or resp_complete from the target in response to the block write of the management ORB address to the MANAGEMENT_AGENT register.
Comment 52 Jarod Wilson 2008-01-18 14:21:02 EST
Doesn't actually seem to be clear whether or not a device MUST provide a management ORB timeout value or not, so perhaps the idea of separate command orb and management orb timeouts is the way to go for now. And yeah, 'giving up on config rom' is yet another separate issue we still need to track down... :\
Comment 53 Stefan Richter 2008-01-18 14:22:24 EST
> Now I'm off to try a bigger SBP2_ORB_TIMEOUT with the only device of > mine to which fw-sbp2 consistently fails to log in (SmartDisk VST > portable CD-R/W) Yess! According to a few quick tests, this thing wants 6...10 seconds grace time to send login status when powering up (it's bus-powered) and about 1 second when already powered up. > Maybe we should use different timeouts in fw-sbp2.c:: > sbp2_send_management_orb() depending on the kind of ORB. (Command > ORB vs. management ORBs.) Ahem, it's only for management ORBs. However, we don't want such long timeouts for logout ORBs. I'm not sure about reconnect ORBs. Devices should be able to respond to them quicker than to login ORBs because they are already powered up.
Comment 54 Stefan Richter 2008-01-18 14:36:36 EST
> From Section 7.4.8: > > The mgt_ORB_timeout field shall specify, in units of 500 milliseconds, > the maximum time an initiator shall allow for a target to store a > status block in response to a management ORB. Right. And in November 2006 I noted in sbp2.c "TODO:... handle Unit_Characteristics.mgt_ORB_timeout and .ORB_size" and then forgot about it. (Actually we probably don't need to handle .ORB_size.) We should indeed use mgt_ORB_timeout, but limit the value we get from there to some reasonable limits, i.e. a sane lower limit and a sane upper limit.
Comment 55 Jarod Wilson 2008-01-18 14:48:19 EST
Sounds like a plan to me. Do you want to work on it, or shall I try to whip something up?
Comment 56 Stefan Richter 2008-01-18 15:05:16 EST
I'm not gonna put something together today anymore. Maybe tomorrow unless you do. :-)
Comment 57 Kristian Høgsberg 2008-01-18 15:06:09 EST
I don't think it's worth it to try to do something clever here. A casual look through some of the devices I have here show that none of them have the unit characteristics entry. And some devices may specific a lower timeout than they actually need etc. Just going for 20s for at login time like the old driver did seems like the simplest and most robust approach.
Comment 58 Jarod Wilson 2008-01-18 15:32:45 EST
I'll just ship a tiny diff that increases the timeout and adds a note about why we're not currently bothering with the config rom provided value. Those seeking instant gratification in testing can grab a firewire driver tarball and run the little new-mods.sh script inside it, which will compile the driver against your running kernel (requires matching kernel-devel installed), unload the old firewire-* drivers and load up the ones just built (they just get loaded from where they're built, not installed). http://people.redhat.com/jwilson/misc/firewire-driver-increased-timeout.tar.bz2
Comment 59 Stefan Richter 2008-01-23 10:40:54 EST
Re comment #38: > Oct 18 12:21:49 brinstar kernel: firewire_core: giving up on config rom > for node id ffc0 ... > Should I split this problem into a different bug, since the sbp2 problem > named in the subject of the bug is fixed? Bug 429598 covers this.
Comment 60 Jarod Wilson 2008-01-23 17:11:44 EST
This is fixed in rawhide and the latest F8 koji kernels, and the patch has been accepted upstream. Will hold off on closing until we push a kernel package for F8 that carries the fix.