429951 – [firewire] unable to use disk (fw_sbp2: failed to login to ...)

Bug 429951 - [firewire] unable to use disk (fw_sbp2: failed to login to ...)

Summary: [firewire] unable to use disk (fw_sbp2: failed to login to ...)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Jay Fenlason
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2008-01-23 22:15 UTC by Jarod Wilson
Modified:	2014-08-31 23:28 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-01-20 20:11:10 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
RHEL5 backport of sbp2 login failure fixes (8.78 KB, patch) 2008-01-25 21:00 UTC, Jarod Wilson	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:0225	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update	2009-01-20 16:06:24 UTC

Description Jarod Wilson 2008-01-23 22:15:00 UTC

+++ This bug was initially created as a clone of Bug #238606 +++

I've got a Western Digital FireWire/USB 2.0 External Hard Drive - WDXC2500JBRNN
is the model number. None of my rawhide machines can seem to connect to it. It's
not using bus power.

Western Digital says that it uses either the Oxford 922 or Initio 1430 chipset;
I haven't opened it up to find out which. WD reference:
http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=1469

Here's some lspci / kernel log info.

On my x86_64 machine:
05:02.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 61)

fw_core: phy config: card 0, new root=ffc1, gap_count=5
fw_core: giving up on config rom for node id ffc0

On my G5 Power Mac it gets a little further:
0001:03:0e.0 FireWire (IEEE 1394): Apple Computer Inc. K2 FireWire

I'll attach the log from that machine.

-- Additional comment from wwoods on 2007-05-01 15:15 EST --
Created an attachment (id=153885)
log from Power Mac G5


-- Additional comment from stefan-r-rhbz.de on 2007-05-06 11:37
EST --
Oxford 922 is for FireWire 800 + USB 2.0, Initio 1430 for FireWire 400.  I take
it from "gap_count=5" that it's FireWire 400 (1394a) with an additional USB 2.0
link and PHY.  Or it is an Initio 1530 = FireWire 400 + USB 2.0.

Would you have more luck with the old FireWire stack or with a different OS?

-- Additional comment from wwoods on 2007-05-06 17:18 EST --
Good to know! It's FireWire 400 + USB 2.0, so I guess that makes it an Initio
1530. The drive works fine with another OS (Mac OS X) and worked with the old
firewire stack (in FC6).

-- Additional comment from dpnaessens on 2007-05-17 07:56 EST --
I have a Lacie BigDisk on various IBM Thinkcentre's and have a similar problem
with the new firewire stack on FC7 for some time, output usually is:

fw_core: created new fw device fw1 (4 config rom retries)
fw_core: phy config: card 0, new root=ffc0, gap_count=63
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: failed to login to fw1.0

FC6 gave me no trouble whatsoever. 

Currently on FC7 updated today to kernel-2.6.21-1.3116.fc7 and results in above.
 
If it does connect on FC7 it can be fine for a while, but have had a few times
where the whole machine locks up whilst copying files to the external drive -
requiring turning the whole machine off to recover.

Am using the older stack when I have to use the external drive, but mentioning
my issues on here due to the locking up.



-- Additional comment from stefan-r-rhbz.de on 2007-05-17 13:17
EST --
Re comment #4: Are you using the old stack on the same kernel on which the new
stack fails?

-- Additional comment from wwoods on 2007-05-17 17:18 EST --
More info can be found in bug #231708

-- Additional comment from dpnaessens on 2007-05-18 10:53 EST --
Re comment #5: Yep. Was changing the config in /boot around each time. 

Just tried 3163 and the new stack works each time on rebooting, but if its
unmounted it wont remount:

May 18 15:48:19 bird kernel: fw_core: created new fw device fw1 (4 config rom
retries)
May 18 15:48:21 bird kernel: fw_sbp2: orb reply timed out, rcode=0x11
May 18 15:48:21 bird kernel: fw_core: phy config: card 0, new root=ffc0,
gap_count=63
May 18 15:48:23 bird kernel: fw_sbp2: orb reply timed out, rcode=0x11
May 18 15:48:34 bird last message repeated 5 times
May 18 15:48:34 bird kernel: fw_sbp2: failed to login to fw1.0

3149 worked as previously described with new stack but did with old stack.



-- Additional comment from stefan-r-rhbz.de on 2007-05-18 12:53
EST --
I may be wrong, but it appears as if the sequence of event is -
  - fw-core scanned device's ROM, hands it over to fw-sbp2
  - fw-sbp2 starts login attempt
  - bus reset occurs (caused by the device or by fw-core, I don't know)
  - fw-sbp2's repeated login attempts go to a stale host-side device
    representation or stale device-side context
The old stack definitely has problems if a bus reset occurs in the middle of
sbp2's login attempt.  /Maybe/ the new stack has similar problems.  At least its
timings are different from the old one's.  Paradoxically, the new stack should
be more successful in the device detection business AFAIU from looking at the
sources.

-- Additional comment from krh on 2007-05-22 22:02 EST --
I'm pretty sure I've tracked this down to a missing byteswap in the DMA program
setup code.  The .3190 kernel is building now with a patch to add the missing
byteswaps.  I've reproduced the problem on a mac ppc laptop and the fix makes it
work for me.  The build can be tracked here:

  http://koji.fedoraproject.org/koji/taskinfo?taskID=14801

and the RPM for ppc can be found by following the ppc descendent task link once
build finishes.  Please test, thanks :)

-- Additional comment from krh on 2007-05-23 01:43 EST --
RPMs are here:

  http://koji.fedoraproject.org/koji/buildinfo?buildID=7077

-- Additional comment from dpnaessens on 2007-05-25 12:01 EST --
Created an attachment (id=155461)
firewire on startup log


-- Additional comment from dpnaessens on 2007-05-25 12:02 EST --
Created an attachment (id=155462)
firewire hotplug log


-- Additional comment from stefan-r-rhbz.de on 2007-05-25 13:20
EST --
Darren, the "firewire on startup log" looks good --- /if/ you can read the
partions afterwards and do actual I/O.

The line "fw_core: phy config: card 0, new root=ffc0, gap_count=63" in "firewire
hotplug log" is what I meant by "bus reset" in comment #8.  Do you still have a
log around from a session where you were able to do some I/O before subsequent
failure?  If so, does it contain such a bus reset message (or several of them)
near the failure too?

-- Additional comment from ak on 2007-05-26 16:35 EST --
Just tried the .3194 with my drive and no joy (dmesg output):

<5>fw_ohci: Added fw-ohci device 0002:20:0e.0, OHCI version 1.10
<5>fw_core: created new fw device fw0 (0 config rom retries)
<5>fw_core: created new fw device fw1 (0 config rom retries)
<5>fw_core: phy config: card 0, new root=ffc1, gap_count=5
<7>PHY ID: 4061e4, addr: 0
<3>fw_sbp2: orb reply timed out, rcode=0x11
<3>fw_sbp2: management write failed, rcode 0x10
<3>fw_sbp2: management write failed, rcode 0x10
<3>fw_sbp2: management write failed, rcode 0x10
<3>fw_sbp2: management write failed, rcode 0x10
<3>fw_sbp2: management write failed, rcode 0x10
<3>fw_sbp2: failed to login to fw1.0

More info about my drive and controller in bug #231708. Tell me if you need
further testing.

-- Additional comment from ak on 2007-05-26 17:47 EST --
Hmm, discard that. After a reboot and another attempt, everything worked!
Hardware quirk maybe?

-- Additional comment from stefan-r-rhbz.de on 2007-05-26 17:58
EST --
Many PL-3507 are sold with buggy firmware.  If you have a Windows PC around,
have a look at http://www.prolific.com.tw/eng/downloads.asp?ID=44 for current
firmware and an EEPROM reader/writer utility.  This utility shows you which ROM
version is installed.  (See readme file in the reader/writer zip archive.)

These PL-3507 are in my experience a gamble with the old FireWire stack too.

-- Additional comment from ak on 2007-05-26 18:08 EST --
When I got the PL3507 enclosure, it had some old firmware, and the experience
was very frustrating: the 
drive would randomly stop responding after an hour of work, sometimes more,
sometimes less. Things 
were made more complicated by actually installing Fedora Core 6 on it (rather
than just using it as a data 
partition) - I'm not sure how it managed to survive through the installation.
Things improved a lot after a 
firmware update - the only remaining trouble is that occasionally the drive
doesn't respond on boot (the 
kernel would load, but wouldn't initialize the drive) - seems like this problem
exists with both stacks. I'll 
check if I have the most latest firmware - it seems so, but double checking
won't hurt.

-- Additional comment from dpnaessens on 2007-05-29 03:23 EST --
After reboot (tried a few times), im not able to mount the drive etc last reboot:

On startup:
fw_ohci: Added fw-ohci device 0000:0a:09.4, OHCI version 1.10
fw_core: created new fw device fw0 (0 config rom retries)
fw_core: created new fw device fw1 (0 config rom retries)
fw_core: phy config: card 0, new root=ffc0, gap_count=63 <--bus reset again
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: orb reply timed out, rcode=0x11
fw_sbp2: failed to login to fw1.0

Regarding the failure - that machine completely packed up - unable to login or
anything after an update so started over (sslkeylib or something like that
wouldnt work). I should note that if I was only reading from the drive, it was
fine, writing to the drive caused the lockup.

Out of curiosity, as I only have the one firewire device connected, why are 2
devices created? fw0 and fw1? Is one the firewire card itself?


-- Additional comment from ak on 2007-06-01 17:05 EST --
I've updated to F7 and it has rendered my system unbootable :( I'm booting it
from an external firewire 
drive, and what seems to be happening is that LVM scan happens before fw-spb2
finishes initializing and 
therefore no volumes are found. If I boot from a rescue CD, the drive gets
recognized and mounted fine, 
because the procedure is different there. Is there a way to serialize module
loading or otherwise force LVM 
to wait a little bit? Or could it be a totally different problem?

-- Additional comment from ak on 2007-06-01 18:06 EST --
Created an attachment (id=155949)
fix mkinitrd to force waiting for fw-sbp2 initialization (same way as old sbp2
driver)

This patch adds the waiting for fw-sbp2 initialization, just like the older
sbp2 driver. This is necessary when e.g. booting from a firewire drive, so that
DM doesn't get loaded too soon.

-- Additional comment from ak on 2007-06-01 18:10 EST --
Ok, I've figured out what's going on: please check and apply the patch to
mkinitrd. However, not all is well: 
the drive still fails to initialize on boot (with the "orb reply timed out"
error like in comment #14). I can 
check and write down the exact message sequence if you want. The strange thing
is that if I boot from a 
rescue cd, then everything works! Something with timings maybe?

-- Additional comment from ak on 2007-06-02 11:51 EST --
Created an attachment (id=155985)
fix mkinitrd to force waiting for fw-ohci and fw-sbp2 initialization

This patch adds the waiting for fw-ohci and fw-sbp2 initialization. This is
necessary when e.g. booting from a firewire drive, so that DM/LVM doesn't get
loaded too soon and fw-sbp2 doesn't try to initialize before fw-ohci finishes
loading.

-- Additional comment from ak on 2007-06-02 11:57 EST --
I've added a similar wait for fw-ohci to finish initializing before loading
fw-sbp2 and it worked! So apparently fw-sbp2 driver tries to do things before
fw-ohci sets everything up. Shouldn't fw-sbp2 do some internal checks, like wait
on a mutex or something? I've attached a patch to mkinitrd that includes both
waits, but you should see if fw-sbp2 needs to be fixed too, because there might
be other situations where rapid "insmod fw-ohci; insmod fw-sbp2" can happen -
e.g. anaconda or someone's script.

-- Additional comment from stefan-r-rhbz.de on 2007-06-02 12:40
EST --
> Shouldn't fw-sbp2 do some internal checks, like wait on a mutex
> or something?

There is nothing it could sensibly wait for.  How should fw-sbp2 know beforehand
how many FireWire buses are present and how many SBP-2 capable nodes will sit on
these buses?

The maximum that could be done is to add parameters to fw-sbp2 (a) for how long
it should wait for SBP-2 nodes to become available and perhaps (b) after how
many nodes to quit waiting, and then fw-sbp2 could be integrated with the Linux
SCSI layer's asynchronous scanning facility.  Then, the initrd should
  - load fw-ohci
  - load fw-sbp2 if there is no hotplug facility in the initrd,
  - have those module parameters for fw-sbp2 configured (timeout, optionally
    number of SBP-2 targets to wait for)
  - load the scsi_wait_scan module.
When "modprobe scsi_wait_scan" exits, then all SCSI drivers which are integrated
with this generic asynchronous SCSI scanning facility, will have finished (or
given up on) discovering all known SCSI targets.

Lack of integration of sbp2 (or now rather fw-sbp2) with the asynchronous SCSI
scanning facility is tracked as http://bugzilla.kernel.org/show_bug.cgi?id=3225 .

While I am at it, let me explain the roles of the firewire drivers:

fw-ohci implements interfaces to FireWire buses (for interfaces of the OHCI type).

fw-core, among else, looks what kinds of nodes are visible on present buses and
notifies protocol drivers (in kernel or in userspace) of detected nodes.

fw-sbp2 handles whatever SBP-2 nodes are given to it.

fw-ohci knows nothing of fw-sbp2, and vice versa.

There could be other controller drivers besides fw-ohci beneath fw-core, and
there could and probably will be other protocol drivers besides fw-sbp2 above
fw-core.  The old IEEE 1394 stack actually has two mainline controller drivers
and two (incomplete) out-of-tree controller drivers beneath ieee1394; and it has
several protocol drivers above ieee1394 in mainline and out-of-tree.

-- Additional comment from krh on 2007-06-02 13:14 EST --
(In reply to comment #24)
> > Shouldn't fw-sbp2 do some internal checks, like wait on a mutex
> > or something?
> 
> There is nothing it could sensibly wait for.  How should fw-sbp2 know beforehand
> how many FireWire buses are present and how many SBP-2 capable nodes will sit on
> these buses?

That said, it does sound like there's a timing problem here.  It could be that
the disk in question requires several config rom read retries and that there is
some race between those retries and loading fw-sbp2.

-- Additional comment from ak on 2007-06-02 13:22 EST --
The log says
Jun  2 19:52:44 localhost kernel: fw_core: created new fw device fw0 (0 config
rom retries)
Jun  2 19:52:45 localhost kernel: fw_core: created new fw device fw1 (0 config
rom retries)

so it's probably some other kind of race?

-- Additional comment from stefan-r-rhbz.de on 2007-06-02 15:13
EST --
The "modprobe fw-sbp2" process will execute the SBP-2 login, SCSI inquiry, and
SCSI device creation *only* if fw-core already finished probing of the remote
node before fw-sbp2 is inserted into the kernel.

Otherwise, "modprobe fw-sbp2" will only do some minimal driver initialization
and will exit quickly, not knowing that fw-core is just (or will soon be)
discovering a remote node.

-- Additional comment from ak on 2007-07-23 12:57 EST --
Things have gotten worse in 2.6.22.1. For some reason, firewire-sbp2 gets loaded
before firewire-ohci, so there's no way to make a pause between their
initialization. Now I almost always get "orb reply timed out" error. Very
occasionally the initialization goes through fine, and twice I also got a kernel
crash. I'll attach the screen images so you can see how they look like.

-- Additional comment from ak on 2007-07-23 13:23 EST --
Created an attachment (id=159795)
kernel crash during firewire-sbp2 init


-- Additional comment from ak on 2007-07-23 13:24 EST --
Created an attachment (id=159796)
firewire errors during sbp2 init


-- Additional comment from ak on 2007-09-01 06:39 EST --
(From update of attachment 155985 [details])
Looks like the problem went away with the latest rawhide kernel. So this bug
can be closed.


-- Additional comment from stefan-r-rhbz.de on 2007-09-01 07:36
EST --
Is it fixed for all reporters?  I guess not.

-- Additional comment from wwoods on 2007-09-04 19:05 EST --
Works fine for me with kernel-2.6.23-0.153.rc4.fc8, the drive from the original
report, and a Mac Mini; I'll test on the machines from the original report as
soon as I can.

-- Additional comment from wwoods on 2007-09-05 14:41 EST --
Nope, still doesn't work on the original machine (dell precision 380, x86_64).

-- Additional comment from dpnaessens on 2007-10-03 10:06 EST --
Haven't been able to test for a while, but mine works with
2.6.23-0.214.rc8.git2.fc8. Comes up when plugged in (with no errors in log) and
have successfully transferred over a 100gb to it with no problem.

-- Additional comment from wwoods on 2007-10-03 17:35 EST --
Situation is unchanged here with kernel .214.

-- Additional comment from dpnaessens on 2007-10-17 16:16 EST --
I have started having problems again, the whole machine sometimes locks up
shortly after mounting the drive. Nautilus pops up and the disk starts to be
read during which its locks up. 
Upon reboot the last line in the log before crash always says mounted drive on
behalf of user, and thats it which to me is normal. No errors at all.

It seems random in occurance, it has happened 3 times in the last few days. If
Nautilus displays lists the root of the drive it is fine thereafter.

Via the command line, can 'ls' no problem until lock up (if there is a lock up).

To check the drive tried mounting it on OSX and it comes up no problem.

-- Additional comment from wwoods on 2007-10-18 12:27 EST --
Re comment #37: Darren, this bug is not about system lockups with firewire, but
failure to mount firewire drives. Unless you see the "Cannot connect to external
firewire drive (fw_sbp2: failed to login to fw1.0)" message or something similar
in your system logs, you're talking about a different problem. If you're still
getting lockups, please file a separate bug to avoid further confusion.

As far as this bug goes: the "failed to login" problem is gone, and firewire
drives work fine on all my test machines, except one.

The x86_64 machine with an Agere Systems FW323 controller still fails to mount
any firewire drive, with kernel messages as in the original bug report:

[plug drive in]
Oct 18 12:21:18 brinstar kernel: firewire_core: phy config: card 0, new
root=ffc1, gap_count=5
[wait 30 sec]
Oct 18 12:21:49 brinstar kernel: firewire_core: giving up on config rom for node
id ffc0
[remove drive]
Oct 18 12:22:09 brinstar kernel: firewire_core: phy config: card 0, new
root=ffc0, gap_count=63

Should I split this problem into a different bug, since the sbp2 problem named
in the subject of the bug is fixed?

-- Additional comment from dpnaessens on 2007-10-18 16:53 EST --
Will, Slowly getting the hang of this bugzilla malarkey. Yep you are quite right
I am talking about a different problem there and will file accordingly. Won't
happen again. 

So for me the login problem is gone, when first plugging the drive in.

So I don't waste anyone's further time, would the fact that If I successfully
get the drive mounted without problems, do I/O, unmount it, unplug it and later
in that same session, plug it back in and then always get the failed to login
message fall under this bug or is that another separate issue? 
 

-- Additional comment from stefan-r-rhbz.de on 2007-10-18 17:24
EST --
Splitting and shifting the bug entries back and forth is all nice and well.  But
what is needed to get these bugs actually fixed is either detailed debug info¹
or a driver hacker with hardware that lets him reproduce the problem² (so that
he can produce debug info by himself).

¹) Will, if I find time at the weekend to spare, I'll post a patch to get more
info during the config rom probing.

²) My secondary PC is an i945GM based x64_64 with Agere FW323, and it works
fine.  I don't know though if or when I last tested all of my SBP-2 devices with it.

-- Additional comment from stefan-r-rhbz.de on 2007-10-18 17:33
EST --
Re comment #39:
> [...] I successfully get the drive mounted without problems,
> do I/O, unmount it, unplug it and later in that same session,
> plug it back in and then always get the failed to login message

What is logged in dmesg during the unsuccessful replugging?

-- Additional comment from wwoods on 2007-10-22 17:41 EST --
This problem is confined to just one of all the machines I've tried. I think the
number of systems affected is pretty low. Moving to F8Target.

If you've got any suggestions for things to try with this machine (Dell
Precision 380), let me know. Until then I'll wait for the patch for extra debug
info during config rom probing.

-- Additional comment from stefan-r-rhbz.de on 2007-11-07 16:54
EST --
> Until then I'll wait for the patch for extra debug info during config
> rom probing.

Someone working on an out-of-tree driver posted a draft patch for this but
hasn't resubmitted it yet.  And I have been distracted by other stuff recently
and am unable to do this during the next ~14 days due to traveling.  I hope to
get back to this thereafter.

-- Additional comment from stefan-r-rhbz.de on 2008-01-03 18:20
EST --
Re comment #43:
I haven't forgotten about it, but there are always so many other more
fascinating things in life...  Will try to get to it RSN.

-- Additional comment from stefan-r-rhbz.de on 2008-01-17 16:48
EST --
Re comment #38:
> Oct 18 12:21:49 brinstar kernel: firewire_core: giving up on config rom
> for node id ffc0

There is also an upstream bug entry now which is probably the same issue:
http://bugzilla.kernel.org/show_bug.cgi?id=9760
And there are a few other mentions of the 'giving up on config rom' failure mode
on mailinglists or via personal mail.

-- Additional comment from jwilson on 2008-01-17 17:16 EST --
So... I was getting very similar login failures with orb reply timeout messages
with a certain drive, and was going to do some packet captures with nosy and a
pcilynx card... Lo and behold, the drive is accessed and mounted just fine with
the pcilynx card essentially serving as a hub between the drive and controller.
I've verified that the drive also works with a plain firewire hub in between the
drive and controller. I've got a few other drives to check out to see if they
behave similarly, but its definitely an interesting data point... Still getting
up to speed on specs and code, no idea what the root cause might be yet.

-- Additional comment from stefan-r-rhbz.de on 2008-01-17 17:27
EST --
What if you replace the '5' by '7' in fw-card.c::gap_count_table?

-- Additional comment from jwilson on 2008-01-18 12:08 EST --
(In reply to comment #47)
> What if you replace the '5' by '7' in fw-card.c::gap_count_table?

No change, still fails in the same way.

-- Additional comment from jwilson on 2008-01-18 13:37 EST --
Got it. Disk login is successful if I simply raise SBP2_ORB_TIMEOUT in
fw-sbp2.c. First test was to go from 2000 to 5000, and bingo. Not sure how much
increase is really needed yet.

firewire_core: created new fw device fw5 (0 config rom retries, S400)
firewire_core: phy config: card 2, new root=ffc1, gap_count=7
firewire_sbp2: orb reply timed out, rcode=0x11
firewire_sbp2: logged in to fw5.1 LUN 0001 (0 retries)
scsi 11:0:0:1: Enclosure         WD       My Book Device        PQ: 0 ANSI: 4
scsi 11:0:0:1: Attached scsi generic sg2 type 13
firewire_sbp2: logged in to fw5.0 LUN 0000 (1 retries)
scsi 10:0:0:0: Direct-Access     WD       My Book          1016 PQ: 0 ANSI: 4
sd 10:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
sd 10:0:0:0: [sdb] Write Protect is off
sd 10:0:0:0: [sdb] Mode Sense: 10 00 00 00
sd 10:0:0:0: [sdb] Cache data unavailable
sd 10:0:0:0: [sdb] Assuming drive cache: write through
sd 10:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
sd 10:0:0:0: [sdb] Write Protect is off
sd 10:0:0:0: [sdb] Mode Sense: 10 00 00 00
sd 10:0:0:0: [sdb] Cache data unavailable
sd 10:0:0:0: [sdb] Assuming drive cache: write through
 sdb: sdb1
sd 10:0:0:0: [sdb] Attached SCSI disk
sd 10:0:0:0: Attached scsi generic sg3 type 0

-- Additional comment from stefan-r-rhbz.de on 2008-01-18 14:01
EST --
Argh...  sbp2 has a 20s timeout for login.  Why didn't it occur to us earlier
that there is this important difference between sbp2 and fw-sbp2?

Maybe we should use different timeouts in fw-sbp2.c::sbp2_send_management_orb()
depending on the kind of ORB.  (Command ORB vs. management ORBs.)

Now I'm off to try a bigger SBP2_ORB_TIMEOUT with the only device of mine to
which fw-sbp2 consistently fails to log in (SmartDisk VST portable CD-R/W) thusly:
Jan 17 22:29:10 stein scsi48 : SBP-2 IEEE-1394
Jan 17 22:29:12 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:14 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:16 stein firewire_sbp2: status write for unknown orb
Jan 17 22:29:16 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:18 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:21 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:23 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 17 22:29:23 stein firewire_sbp2: failed to login to fw3.0 LUN 0000

However, "firewire_core: giving up on config rom" is another story.

-- Additional comment from jwilson on 2008-01-18 14:08 EST --
In reading over the sbp2 specs, it looks like the management ORB timeout is
actually something each device's config rom is supposed to provide for itself.

From Section 7.4.8:

The mgt_ORB_timeout field shall specify, in units of 500 milliseconds, the
maximum time an initiator shall allow for a target to store a status block in
response to a management ORB. The time-out commences when the initiator receives
either ack_complete or resp_complete from the target in response to the block
write of the management ORB address to the MANAGEMENT_AGENT register.


-- Additional comment from jwilson on 2008-01-18 14:21 EST --
Doesn't actually seem to be clear whether or not a device MUST provide a
management ORB timeout value or not, so perhaps the idea of separate command orb
and management orb timeouts is the way to go for now.

And yeah, 'giving up on config rom' is yet another separate issue we still need
to track down... :\

-- Additional comment from stefan-r-rhbz.de on 2008-01-18 14:22
EST --
> Now I'm off to try a bigger SBP2_ORB_TIMEOUT with the only device of
> mine to which fw-sbp2 consistently fails to log in (SmartDisk VST
> portable CD-R/W)

Yess!  According to a few quick tests, this thing wants 6...10 seconds grace
time to send login status when powering up (it's bus-powered) and about 1 second
when already powered up.

> Maybe we should use different timeouts in fw-sbp2.c::
> sbp2_send_management_orb() depending on the kind of ORB.  (Command
> ORB vs. management ORBs.)

Ahem, it's only for management ORBs.  However, we don't want such long timeouts
for logout ORBs.

I'm not sure about reconnect ORBs.  Devices should be able to respond to them
quicker than to login ORBs because they are already powered up.

-- Additional comment from stefan-r-rhbz.de on 2008-01-18 14:36
EST --
> From Section 7.4.8:
> 
> The mgt_ORB_timeout field shall specify, in units of 500 milliseconds,
> the maximum time an initiator shall allow for a target to store a
> status block in response to a management ORB.

Right.  And in November 2006 I noted in sbp2.c "TODO:... handle
Unit_Characteristics.mgt_ORB_timeout and .ORB_size" and then forgot about it. 
(Actually we probably don't need to handle .ORB_size.)

We should indeed use mgt_ORB_timeout, but limit the value we get from there to
some reasonable limits, i.e. a sane lower limit and a sane upper limit.

-- Additional comment from jwilson on 2008-01-18 14:48 EST --
Sounds like a plan to me. Do you want to work on it, or shall I try to whip
something up?

-- Additional comment from stefan-r-rhbz.de on 2008-01-18 15:05
EST --
I'm not gonna put something together today anymore.  Maybe tomorrow unless you
do.  :-)


-- Additional comment from krh on 2008-01-18 15:06 EST --
I don't think it's worth it to try to do something clever here.  A casual look
through some of the devices I have here show that none of them have the unit
characteristics entry.  And some devices may specific a lower timeout than they
actually need etc.  Just going for 20s for at login time like the old driver did
seems like the simplest and most robust approach.

-- Additional comment from jwilson on 2008-01-18 15:32 EST --
I'll just ship a tiny diff that increases the timeout and adds a note about why
we're not currently bothering with the config rom provided value.

Those seeking instant gratification in testing can grab a firewire driver
tarball and run the little new-mods.sh script inside it, which will compile the
driver against your running kernel (requires matching kernel-devel installed),
unload the old firewire-* drivers and load up the ones just built (they just get
loaded from where they're built, not installed).

http://people.redhat.com/jwilson/misc/firewire-driver-increased-timeout.tar.bz2

-- Additional comment from stefan-r-rhbz.de on 2008-01-23 10:40
EST --
Re comment #38:
> Oct 18 12:21:49 brinstar kernel: firewire_core: giving up on config rom
> for node id ffc0
...
> Should I split this problem into a different bug, since the sbp2 problem
> named in the subject of the bug is fixed?

Bug 429598 covers this.

-- Additional comment from jwilson on 2008-01-23 17:11 EST --
This is fixed in rawhide and the latest F8 koji kernels, and the patch has been
accepted upstream. Will hold off on closing until we push a kernel package for
F8 that carries the fix.

Comment 1 Jarod Wilson 2008-01-23 22:16:10 UTC

Would be a very good thing to get into RHEL5.2, enables a LOT more firewire
storage devices to actually work with our firewire stack.

Comment 2 Jarod Wilson 2008-01-25 21:00:42 UTC

Created attachment 292996 [details]
RHEL5 backport of sbp2 login failure fixes

Initial RHEL5 backport of fixes for sbp2 login failure problems, based on
upstream changes + one pending change to use login timeout information provided
by the device's config rom.

http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=f549da95402dde45b7184934df7ad54823cbc100


http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=78ab0f0e74ade225911fe51e25038e95483820c8


http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=3b490c7ffca9fe30f734fb0f2ca4df47a96da7e4

Comment 3 Jay Fenlason 2008-08-14 17:12:23 UTC

Can you please see if the patch I just posted for 5.3 fixes this?

Comment 4 Jarod Wilson 2008-08-18 15:27:46 UTC

Yep, looks like these are covered by that patch. Brief testing results look good too.

Comment 6 Don Zickus 2008-09-10 20:13:07 UTC

in kernel-2.6.18-110.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 9 errata-xmlrpc 2009-01-20 20:11:10 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Note You need to log in before you can comment on or make changes to this bug.