Bug 238606

Summary:

[firewire] unable to use disk (fw_sbp2: failed to login to ...)

Product:

[Fedora] Fedora

Reporter:

Will Woods <wwoods>

Component:

kernel

Assignee:

Jarod Wilson <jarod>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Brian Brock <bbrock>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

rawhide

CC:

ak, dpnaessens, fenlason, krh, lordmorgul, stefan-r-rhbz

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

2.6.23.14-107.fc8

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2008-02-05 21:20:54 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

235704, 246496

Attachments:

Description	Flags
log from Power Mac G5	none
firewire on startup log	none
firewire hotplug log	none
fix mkinitrd to force waiting for fw-sbp2 initialization (same way as old sbp2 driver)	none
fix mkinitrd to force waiting for fw-ohci and fw-sbp2 initialization	none
kernel crash during firewire-sbp2 init	none
firewire errors during sbp2 init	none

Description Will Woods 2007-05-01 19:15:39 UTC

I've got a Western Digital FireWire/USB 2.0 External Hard Drive - WDXC2500JBRNN
is the model number. None of my rawhide machines can seem to connect to it. It's
not using bus power.

Western Digital says that it uses either the Oxford 922 or Initio 1430 chipset;
I haven't opened it up to find out which. WD reference:
http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=1469

Here's some lspci / kernel log info.

On my x86_64 machine:
05:02.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 61)

fw_core: phy config: card 0, new root=ffc1, gap_count=5
fw_core: giving up on config rom for node id ffc0

On my G5 Power Mac it gets a little further:
0001:03:0e.0 FireWire (IEEE 1394): Apple Computer Inc. K2 FireWire

I'll attach the log from that machine.

Comment 1 Will Woods 2007-05-01 19:15:40 UTC

Created attachment 153885 [details]
log from Power Mac G5

Comment 2 Stefan Richter 2007-05-06 15:37:06 UTC

Oxford 922 is for FireWire 800 + USB 2.0, Initio 1430 for FireWire 400.  I take
it from "gap_count=5" that it's FireWire 400 (1394a) with an additional USB 2.0
link and PHY.  Or it is an Initio 1530 = FireWire 400 + USB 2.0.

Would you have more luck with the old FireWire stack or with a different OS?

Comment 3 Will Woods 2007-05-06 21:18:15 UTC

Good to know! It's FireWire 400 + USB 2.0, so I guess that makes it an Initio
1530. The drive works fine with another OS (Mac OS X) and worked with the old
firewire stack (in FC6).

Comment 4 Darren Naessens 2007-05-17 11:56:05 UTC

I have a Lacie BigDisk on various IBM Thinkcentre's and have a similar problem
with the new firewire stack on FC7 for some time, output usually is:

fw_core: created new fw device fw1 (4 config rom retries)
fw_core: phy config: card 0, new root=ffc0, gap_count=63
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: failed to login to fw1.0

FC6 gave me no trouble whatsoever. 

Currently on FC7 updated today to kernel-2.6.21-1.3116.fc7 and results in above.
 
If it does connect on FC7 it can be fine for a while, but have had a few times
where the whole machine locks up whilst copying files to the external drive -
requiring turning the whole machine off to recover.

Am using the older stack when I have to use the external drive, but mentioning
my issues on here due to the locking up.

Comment 5 Stefan Richter 2007-05-17 17:17:33 UTC

Re comment #4: Are you using the old stack on the same kernel on which the new
stack fails?

Comment 6 Will Woods 2007-05-17 21:18:31 UTC

More info can be found in bug #231708

Comment 7 Darren Naessens 2007-05-18 14:53:35 UTC

Re comment #5: Yep. Was changing the config in /boot around each time. 

Just tried 3163 and the new stack works each time on rebooting, but if its
unmounted it wont remount:

May 18 15:48:19 bird kernel: fw_core: created new fw device fw1 (4 config rom
retries)
May 18 15:48:21 bird kernel: fw_sbp2: orb reply timed out, rcode=0x11
May 18 15:48:21 bird kernel: fw_core: phy config: card 0, new root=ffc0,
gap_count=63
May 18 15:48:23 bird kernel: fw_sbp2: orb reply timed out, rcode=0x11
May 18 15:48:34 bird last message repeated 5 times
May 18 15:48:34 bird kernel: fw_sbp2: failed to login to fw1.0

3149 worked as previously described with new stack but did with old stack.

Comment 8 Stefan Richter 2007-05-18 16:53:28 UTC

I may be wrong, but it appears as if the sequence of event is -
  - fw-core scanned device's ROM, hands it over to fw-sbp2
  - fw-sbp2 starts login attempt
  - bus reset occurs (caused by the device or by fw-core, I don't know)
  - fw-sbp2's repeated login attempts go to a stale host-side device
    representation or stale device-side context
The old stack definitely has problems if a bus reset occurs in the middle of
sbp2's login attempt.  /Maybe/ the new stack has similar problems.  At least its
timings are different from the old one's.  Paradoxically, the new stack should
be more successful in the device detection business AFAIU from looking at the
sources.

Comment 9 Kristian Høgsberg 2007-05-23 02:02:44 UTC

I'm pretty sure I've tracked this down to a missing byteswap in the DMA program
setup code.  The .3190 kernel is building now with a patch to add the missing
byteswaps.  I've reproduced the problem on a mac ppc laptop and the fix makes it
work for me.  The build can be tracked here:

  http://koji.fedoraproject.org/koji/taskinfo?taskID=14801

and the RPM for ppc can be found by following the ppc descendent task link once
build finishes.  Please test, thanks :)

Comment 10 Kristian Høgsberg 2007-05-23 05:43:14 UTC

RPMs are here:

  http://koji.fedoraproject.org/koji/buildinfo?buildID=7077

Comment 11 Darren Naessens 2007-05-25 16:01:26 UTC

Created attachment 155461 [details]
firewire on startup log

Comment 12 Darren Naessens 2007-05-25 16:02:06 UTC

Created attachment 155462 [details]
firewire hotplug log

Comment 13 Stefan Richter 2007-05-25 17:20:42 UTC

Darren, the "firewire on startup log" looks good --- /if/ you can read the
partions afterwards and do actual I/O.

The line "fw_core: phy config: card 0, new root=ffc0, gap_count=63" in "firewire
hotplug log" is what I meant by "bus reset" in comment #8.  Do you still have a
log around from a session where you were able to do some I/O before subsequent
failure?  If so, does it contain such a bus reset message (or several of them)
near the failure too?

Comment 14 Alex Kanavin 2007-05-26 20:35:17 UTC

Just tried the .3194 with my drive and no joy (dmesg output):

<5>fw_ohci: Added fw-ohci device 0002:20:0e.0, OHCI version 1.10
<5>fw_core: created new fw device fw0 (0 config rom retries)
<5>fw_core: created new fw device fw1 (0 config rom retries)
<5>fw_core: phy config: card 0, new root=ffc1, gap_count=5
<7>PHY ID: 4061e4, addr: 0
<3>fw_sbp2: orb reply timed out, rcode=0x11
<3>fw_sbp2: management write failed, rcode 0x10
<3>fw_sbp2: management write failed, rcode 0x10
<3>fw_sbp2: management write failed, rcode 0x10
<3>fw_sbp2: management write failed, rcode 0x10
<3>fw_sbp2: management write failed, rcode 0x10
<3>fw_sbp2: failed to login to fw1.0

More info about my drive and controller in bug #231708. Tell me if you need further testing.

Comment 15 Alex Kanavin 2007-05-26 21:47:32 UTC

Hmm, discard that. After a reboot and another attempt, everything worked! Hardware quirk maybe?

Comment 16 Stefan Richter 2007-05-26 21:58:44 UTC

Many PL-3507 are sold with buggy firmware.  If you have a Windows PC around,
have a look at http://www.prolific.com.tw/eng/downloads.asp?ID=44 for current
firmware and an EEPROM reader/writer utility.  This utility shows you which ROM
version is installed.  (See readme file in the reader/writer zip archive.)

These PL-3507 are in my experience a gamble with the old FireWire stack too.

Comment 17 Alex Kanavin 2007-05-26 22:08:05 UTC

When I got the PL3507 enclosure, it had some old firmware, and the experience was very frustrating: the 
drive would randomly stop responding after an hour of work, sometimes more, sometimes less. Things 
were made more complicated by actually installing Fedora Core 6 on it (rather than just using it as a data 
partition) - I'm not sure how it managed to survive through the installation. Things improved a lot after a 
firmware update - the only remaining trouble is that occasionally the drive doesn't respond on boot (the 
kernel would load, but wouldn't initialize the drive) - seems like this problem exists with both stacks. I'll 
check if I have the most latest firmware - it seems so, but double checking won't hurt.

Comment 18 Darren Naessens 2007-05-29 07:23:38 UTC

After reboot (tried a few times), im not able to mount the drive etc last reboot:

On startup:
fw_ohci: Added fw-ohci device 0000:0a:09.4, OHCI version 1.10
fw_core: created new fw device fw0 (0 config rom retries)
fw_core: created new fw device fw1 (0 config rom retries)
fw_core: phy config: card 0, new root=ffc0, gap_count=63 <--bus reset again
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: failed to login to fw1.0

Regarding the failure - that machine completely packed up - unable to login or
anything after an update so started over (sslkeylib or something like that
wouldnt work). I should note that if I was only reading from the drive, it was
fine, writing to the drive caused the lockup.

Out of curiosity, as I only have the one firewire device connected, why are 2
devices created? fw0 and fw1? Is one the firewire card itself?

Comment 19 Alex Kanavin 2007-06-01 21:05:51 UTC

I've updated to F7 and it has rendered my system unbootable :( I'm booting it from an external firewire 
drive, and what seems to be happening is that LVM scan happens before fw-spb2 finishes initializing and 
therefore no volumes are found. If I boot from a rescue CD, the drive gets recognized and mounted fine, 
because the procedure is different there. Is there a way to serialize module loading or otherwise force LVM 
to wait a little bit? Or could it be a totally different problem?

Comment 20 Alex Kanavin 2007-06-01 22:06:42 UTC

Created attachment 155949 [details]
fix mkinitrd to force waiting for fw-sbp2 initialization (same way as old sbp2 driver)

This patch adds the waiting for fw-sbp2 initialization, just like the older
sbp2 driver. This is necessary when e.g. booting from a firewire drive, so that
DM doesn't get loaded too soon.

Comment 21 Alex Kanavin 2007-06-01 22:10:33 UTC

Ok, I've figured out what's going on: please check and apply the patch to mkinitrd. However, not all is well: 
the drive still fails to initialize on boot (with the "orb reply timed out" error like in comment #14). I can 
check and write down the exact message sequence if you want. The strange thing is that if I boot from a 
rescue cd, then everything works! Something with timings maybe?

Comment 22 Alex Kanavin 2007-06-02 15:51:28 UTC

Created attachment 155985 [details]
fix mkinitrd to force waiting for fw-ohci and fw-sbp2 initialization

This patch adds the waiting for fw-ohci and fw-sbp2 initialization. This is
necessary when e.g. booting from a firewire drive, so that DM/LVM doesn't get
loaded too soon and fw-sbp2 doesn't try to initialize before fw-ohci finishes
loading.

Comment 23 Alex Kanavin 2007-06-02 15:57:32 UTC

I've added a similar wait for fw-ohci to finish initializing before loading
fw-sbp2 and it worked! So apparently fw-sbp2 driver tries to do things before
fw-ohci sets everything up. Shouldn't fw-sbp2 do some internal checks, like wait
on a mutex or something? I've attached a patch to mkinitrd that includes both
waits, but you should see if fw-sbp2 needs to be fixed too, because there might
be other situations where rapid "insmod fw-ohci; insmod fw-sbp2" can happen -
e.g. anaconda or someone's script.

Comment 24 Stefan Richter 2007-06-02 16:40:37 UTC

> Shouldn't fw-sbp2 do some internal checks, like wait on a mutex
> or something?

There is nothing it could sensibly wait for.  How should fw-sbp2 know beforehand
how many FireWire buses are present and how many SBP-2 capable nodes will sit on
these buses?

The maximum that could be done is to add parameters to fw-sbp2 (a) for how long
it should wait for SBP-2 nodes to become available and perhaps (b) after how
many nodes to quit waiting, and then fw-sbp2 could be integrated with the Linux
SCSI layer's asynchronous scanning facility.  Then, the initrd should
  - load fw-ohci
  - load fw-sbp2 if there is no hotplug facility in the initrd,
  - have those module parameters for fw-sbp2 configured (timeout, optionally
    number of SBP-2 targets to wait for)
  - load the scsi_wait_scan module.
When "modprobe scsi_wait_scan" exits, then all SCSI drivers which are integrated
with this generic asynchronous SCSI scanning facility, will have finished (or
given up on) discovering all known SCSI targets.

Lack of integration of sbp2 (or now rather fw-sbp2) with the asynchronous SCSI
scanning facility is tracked as http://bugzilla.kernel.org/show_bug.cgi?id=3225 .

While I am at it, let me explain the roles of the firewire drivers:

fw-ohci implements interfaces to FireWire buses (for interfaces of the OHCI type).

fw-core, among else, looks what kinds of nodes are visible on present buses and
notifies protocol drivers (in kernel or in userspace) of detected nodes.

fw-sbp2 handles whatever SBP-2 nodes are given to it.

fw-ohci knows nothing of fw-sbp2, and vice versa.

There could be other controller drivers besides fw-ohci beneath fw-core, and
there could and probably will be other protocol drivers besides fw-sbp2 above
fw-core.  The old IEEE 1394 stack actually has two mainline controller drivers
and two (incomplete) out-of-tree controller drivers beneath ieee1394; and it has
several protocol drivers above ieee1394 in mainline and out-of-tree.

Comment 25 Kristian Høgsberg 2007-06-02 17:14:35 UTC

(In reply to comment #24)
> > Shouldn't fw-sbp2 do some internal checks, like wait on a mutex
> > or something?
> 
> There is nothing it could sensibly wait for.  How should fw-sbp2 know beforehand
> how many FireWire buses are present and how many SBP-2 capable nodes will sit on
> these buses?

That said, it does sound like there's a timing problem here.  It could be that
the disk in question requires several config rom read retries and that there is
some race between those retries and loading fw-sbp2.

Comment 26 Alex Kanavin 2007-06-02 17:22:04 UTC

The log says
Jun  2 19:52:44 localhost kernel: fw_core: created new fw device fw0 (0 config
rom retries)
Jun  2 19:52:45 localhost kernel: fw_core: created new fw device fw1 (0 config
rom retries)

so it's probably some other kind of race?

Comment 27 Stefan Richter 2007-06-02 19:13:24 UTC

The "modprobe fw-sbp2" process will execute the SBP-2 login, SCSI inquiry, and
SCSI device creation *only* if fw-core already finished probing of the remote
node before fw-sbp2 is inserted into the kernel.

Otherwise, "modprobe fw-sbp2" will only do some minimal driver initialization
and will exit quickly, not knowing that fw-core is just (or will soon be)
discovering a remote node.

Comment 28 Alex Kanavin 2007-07-23 16:57:59 UTC

Things have gotten worse in 2.6.22.1. For some reason, firewire-sbp2 gets loaded
before firewire-ohci, so there's no way to make a pause between their
initialization. Now I almost always get "orb reply timed out" error. Very
occasionally the initialization goes through fine, and twice I also got a kernel
crash. I'll attach the screen images so you can see how they look like.

Comment 29 Alex Kanavin 2007-07-23 17:23:31 UTC

Created attachment 159795 [details]
kernel crash during firewire-sbp2 init

Comment 30 Alex Kanavin 2007-07-23 17:24:32 UTC

Created attachment 159796 [details]
firewire errors during sbp2 init

Comment 31 Alex Kanavin 2007-09-01 10:39:52 UTC

Comment on attachment 155985 [details]
fix mkinitrd to force waiting for fw-ohci and fw-sbp2 initialization

Looks like the problem went away with the latest rawhide kernel. So this bug
can be closed.

Comment 32 Stefan Richter 2007-09-01 11:36:56 UTC

Is it fixed for all reporters?  I guess not.

Comment 33 Will Woods 2007-09-04 23:05:54 UTC

Works fine for me with kernel-2.6.23-0.153.rc4.fc8, the drive from the original
report, and a Mac Mini; I'll test on the machines from the original report as
soon as I can.

Comment 34 Will Woods 2007-09-05 18:41:32 UTC

Nope, still doesn't work on the original machine (dell precision 380, x86_64).

Comment 35 Darren Naessens 2007-10-03 14:06:05 UTC

Haven't been able to test for a while, but mine works with
2.6.23-0.214.rc8.git2.fc8. Comes up when plugged in (with no errors in log) and
have successfully transferred over a 100gb to it with no problem.

Comment 36 Will Woods 2007-10-03 21:35:06 UTC

Situation is unchanged here with kernel .214.

Comment 37 Darren Naessens 2007-10-17 20:16:07 UTC

I have started having problems again, the whole machine sometimes locks up
shortly after mounting the drive. Nautilus pops up and the disk starts to be
read during which its locks up. 
Upon reboot the last line in the log before crash always says mounted drive on
behalf of user, and thats it which to me is normal. No errors at all.

It seems random in occurance, it has happened 3 times in the last few days. If
Nautilus displays lists the root of the drive it is fine thereafter.

Via the command line, can 'ls' no problem until lock up (if there is a lock up).

To check the drive tried mounting it on OSX and it comes up no problem.

Comment 38 Will Woods 2007-10-18 16:27:54 UTC

Re comment #37: Darren, this bug is not about system lockups with firewire, but
failure to mount firewire drives. Unless you see the "Cannot connect to external
firewire drive (fw_sbp2: failed to login to fw1.0)" message or something similar
in your system logs, you're talking about a different problem. If you're still
getting lockups, please file a separate bug to avoid further confusion.

As far as this bug goes: the "failed to login" problem is gone, and firewire
drives work fine on all my test machines, except one.

The x86_64 machine with an Agere Systems FW323 controller still fails to mount
any firewire drive, with kernel messages as in the original bug report:

[plug drive in]
Oct 18 12:21:18 brinstar kernel: firewire_core: phy config: card 0, new
root=ffc1, gap_count=5
[wait 30 sec]
Oct 18 12:21:49 brinstar kernel: firewire_core: giving up on config rom for node
id ffc0
[remove drive]
Oct 18 12:22:09 brinstar kernel: firewire_core: phy config: card 0, new
root=ffc0, gap_count=63

Should I split this problem into a different bug, since the sbp2 problem named
in the subject of the bug is fixed?

Comment 39 Darren Naessens 2007-10-18 20:53:31 UTC

Will, Slowly getting the hang of this bugzilla malarkey. Yep you are quite right
I am talking about a different problem there and will file accordingly. Won't
happen again. 

So for me the login problem is gone, when first plugging the drive in.

So I don't waste anyone's further time, would the fact that If I successfully
get the drive mounted without problems, do I/O, unmount it, unplug it and later
in that same session, plug it back in and then always get the failed to login
message fall under this bug or is that another separate issue?

Comment 40 Stefan Richter 2007-10-18 21:24:07 UTC

Splitting and shifting the bug entries back and forth is all nice and well.  But
what is needed to get these bugs actually fixed is either detailed debug info¹
or a driver hacker with hardware that lets him reproduce the problem² (so that
he can produce debug info by himself).

¹) Will, if I find time at the weekend to spare, I'll post a patch to get more
info during the config rom probing.

²) My secondary PC is an i945GM based x64_64 with Agere FW323, and it works
fine.  I don't know though if or when I last tested all of my SBP-2 devices with it.

Comment 41 Stefan Richter 2007-10-18 21:33:25 UTC

Re comment #39:
> [...] I successfully get the drive mounted without problems,
> do I/O, unmount it, unplug it and later in that same session,
> plug it back in and then always get the failed to login message

What is logged in dmesg during the unsuccessful replugging?

Comment 42 Will Woods 2007-10-22 21:41:34 UTC

This problem is confined to just one of all the machines I've tried. I think the
number of systems affected is pretty low. Moving to F8Target.

If you've got any suggestions for things to try with this machine (Dell
Precision 380), let me know. Until then I'll wait for the patch for extra debug
info during config rom probing.

Comment 43 Stefan Richter 2007-11-07 21:54:58 UTC

> Until then I'll wait for the patch for extra debug info during config
> rom probing.

Someone working on an out-of-tree driver posted a draft patch for this but
hasn't resubmitted it yet.  And I have been distracted by other stuff recently
and am unable to do this during the next ~14 days due to traveling.  I hope to
get back to this thereafter.

Comment 44 Stefan Richter 2008-01-03 23:20:20 UTC

Re comment #43:
I haven't forgotten about it, but there are always so many other more
fascinating things in life...  Will try to get to it RSN.

Comment 45 Stefan Richter 2008-01-17 21:48:14 UTC

Re comment #38:
> Oct 18 12:21:49 brinstar kernel: firewire_core: giving up on config rom
> for node id ffc0

There is also an upstream bug entry now which is probably the same issue:
http://bugzilla.kernel.org/show_bug.cgi?id=9760
And there are a few other mentions of the 'giving up on config rom' failure mode
on mailinglists or via personal mail.

Comment 46 Jarod Wilson 2008-01-17 22:16:13 UTC

So... I was getting very similar login failures with orb reply timeout messages
with a certain drive, and was going to do some packet captures with nosy and a
pcilynx card... Lo and behold, the drive is accessed and mounted just fine with
the pcilynx card essentially serving as a hub between the drive and controller.
I've verified that the drive also works with a plain firewire hub in between the
drive and controller. I've got a few other drives to check out to see if they
behave similarly, but its definitely an interesting data point... Still getting
up to speed on specs and code, no idea what the root cause might be yet.

Comment 47 Stefan Richter 2008-01-17 22:27:14 UTC

What if you replace the '5' by '7' in fw-card.c::gap_count_table?

Comment 48 Jarod Wilson 2008-01-18 17:08:36 UTC

(In reply to comment #47)
> What if you replace the '5' by '7' in fw-card.c::gap_count_table?

No change, still fails in the same way.

Comment 49 Jarod Wilson 2008-01-18 18:37:54 UTC

Got it. Disk login is successful if I simply raise SBP2_ORB_TIMEOUT in
fw-sbp2.c. First test was to go from 2000 to 5000, and bingo. Not sure how much
increase is really needed yet.

firewire_core: created new fw device fw5 (0 config rom retries, S400)
firewire_core: phy config: card 2, new root=ffc1, gap_count=7
firewire_sbp2: orb reply timed out, rcode=0x11
firewire_sbp2: logged in to fw5.1 LUN 0001 (0 retries)
scsi 11:0:0:1: Enclosure         WD       My Book Device        PQ: 0 ANSI: 4
scsi 11:0:0:1: Attached scsi generic sg2 type 13
firewire_sbp2: logged in to fw5.0 LUN 0000 (1 retries)
scsi 10:0:0:0: Direct-Access     WD       My Book          1016 PQ: 0 ANSI: 4
sd 10:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
sd 10:0:0:0: [sdb] Write Protect is off
sd 10:0:0:0: [sdb] Mode Sense: 10 00 00 00
sd 10:0:0:0: [sdb] Cache data unavailable
sd 10:0:0:0: [sdb] Assuming drive cache: write through
sd 10:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
sd 10:0:0:0: [sdb] Write Protect is off
sd 10:0:0:0: [sdb] Mode Sense: 10 00 00 00
sd 10:0:0:0: [sdb] Cache data unavailable
sd 10:0:0:0: [sdb] Assuming drive cache: write through
 sdb: sdb1
sd 10:0:0:0: [sdb] Attached SCSI disk
sd 10:0:0:0: Attached scsi generic sg3 type 0

Comment 50 Stefan Richter 2008-01-18 19:01:21 UTC

Argh...  sbp2 has a 20s timeout for login.  Why didn't it occur to us earlier
that there is this important difference between sbp2 and fw-sbp2?

Maybe we should use different timeouts in fw-sbp2.c::sbp2_send_management_orb()
depending on the kind of ORB.  (Command ORB vs. management ORBs.)

Now I'm off to try a bigger SBP2_ORB_TIMEOUT with the only device of mine to
which fw-sbp2 consistently fails to log in (SmartDisk VST portable CD-R/W) thusly:
Jan 17 22:29:10 stein scsi48 : SBP-2 IEEE-1394
Jan 17 22:29:12 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:14 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:16 stein firewire_sbp2: status write for unknown orb
Jan 17 22:29:16 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:18 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:21 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:23 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:23 stein firewire_sbp2: failed to login to fw3.0 LUN 0000

However, "firewire_core: giving up on config rom" is another story.

Comment 51 Jarod Wilson 2008-01-18 19:08:49 UTC

In reading over the sbp2 specs, it looks like the management ORB timeout is
actually something each device's config rom is supposed to provide for itself.

From Section 7.4.8:

The mgt_ORB_timeout field shall specify, in units of 500 milliseconds, the
maximum time an initiator shall allow for a target to store a status block in
response to a management ORB. The time-out commences when the initiator receives
either ack_complete or resp_complete from the target in response to the block
write of the management ORB address to the MANAGEMENT_AGENT register.

Comment 52 Jarod Wilson 2008-01-18 19:21:02 UTC

Doesn't actually seem to be clear whether or not a device MUST provide a
management ORB timeout value or not, so perhaps the idea of separate command orb
and management orb timeouts is the way to go for now.

And yeah, 'giving up on config rom' is yet another separate issue we still need
to track down... :\

Comment 53 Stefan Richter 2008-01-18 19:22:24 UTC

> Now I'm off to try a bigger SBP2_ORB_TIMEOUT with the only device of
> mine to which fw-sbp2 consistently fails to log in (SmartDisk VST
> portable CD-R/W)

Yess!  According to a few quick tests, this thing wants 6...10 seconds grace
time to send login status when powering up (it's bus-powered) and about 1 second
when already powered up.

> Maybe we should use different timeouts in fw-sbp2.c::
> sbp2_send_management_orb() depending on the kind of ORB.  (Command
> ORB vs. management ORBs.)

Ahem, it's only for management ORBs.  However, we don't want such long timeouts
for logout ORBs.

I'm not sure about reconnect ORBs.  Devices should be able to respond to them
quicker than to login ORBs because they are already powered up.

Comment 54 Stefan Richter 2008-01-18 19:36:36 UTC

> From Section 7.4.8:
> 
> The mgt_ORB_timeout field shall specify, in units of 500 milliseconds,
> the maximum time an initiator shall allow for a target to store a
> status block in response to a management ORB.

Right.  And in November 2006 I noted in sbp2.c "TODO:... handle
Unit_Characteristics.mgt_ORB_timeout and .ORB_size" and then forgot about it. 
(Actually we probably don't need to handle .ORB_size.)

We should indeed use mgt_ORB_timeout, but limit the value we get from there to
some reasonable limits, i.e. a sane lower limit and a sane upper limit.

Comment 55 Jarod Wilson 2008-01-18 19:48:19 UTC

Sounds like a plan to me. Do you want to work on it, or shall I try to whip
something up?

Comment 56 Stefan Richter 2008-01-18 20:05:16 UTC

I'm not gonna put something together today anymore.  Maybe tomorrow unless you
do.  :-)

Comment 57 Kristian Høgsberg 2008-01-18 20:06:09 UTC

I don't think it's worth it to try to do something clever here.  A casual look
through some of the devices I have here show that none of them have the unit
characteristics entry.  And some devices may specific a lower timeout than they
actually need etc.  Just going for 20s for at login time like the old driver did
seems like the simplest and most robust approach.

Comment 58 Jarod Wilson 2008-01-18 20:32:45 UTC

I'll just ship a tiny diff that increases the timeout and adds a note about why
we're not currently bothering with the config rom provided value.

Those seeking instant gratification in testing can grab a firewire driver
tarball and run the little new-mods.sh script inside it, which will compile the
driver against your running kernel (requires matching kernel-devel installed),
unload the old firewire-* drivers and load up the ones just built (they just get
loaded from where they're built, not installed).

http://people.redhat.com/jwilson/misc/firewire-driver-increased-timeout.tar.bz2

Comment 59 Stefan Richter 2008-01-23 15:40:54 UTC

Re comment #38:
> Oct 18 12:21:49 brinstar kernel: firewire_core: giving up on config rom
> for node id ffc0
...
> Should I split this problem into a different bug, since the sbp2 problem
> named in the subject of the bug is fixed?

Bug 429598 covers this.

Comment 60 Jarod Wilson 2008-01-23 22:11:44 UTC

This is fixed in rawhide and the latest F8 koji kernels, and the patch has been
accepted upstream. Will hold off on closing until we push a kernel package for
F8 that carries the fix.