242504 – fw_sbp2 status write for unknown orb, sbp2_scsi_abort

Summary: fw_sbp2 status write for unknown orb, sbp2_scsi_abort

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	7
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Kristian Høgsberg
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-06-04 15:42 UTC by David Daeschler
Modified:	2007-11-30 22:12 UTC
CC List:	5 users
Fixed In Version:	kernel-2.6.23-0.164.rc5.fc8.x86_64
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-09-13 20:29:43 UTC
Type:	---
Embargoed:
Dependent Products:

Description David Daeschler 2007-06-04 15:42:15 UTC

Description of problem:

When using the external hard drive for a C++ compilation (make -j2), the
compilation stops and all I/O to the FW disk ceases for around 30
seconds.  The kernel then prints the messages "status write for unknown orb"
and "sbp2_scsi_abort".  At this point the I/O can continue.

Version-Release number of selected component (if applicable):


How reproducible:

Always

Steps to Reproduce:
1. Plug in drive
2. Use drive
  
Actual results:

Kernel prints messages:

fw_sbp2: status write for unknown orb
fw_sbp2: sbp2_scsi_abort

..and drive is unusable for about 30 seconds, or until I stop the IO (by
pressing control C during the make) and even then I have to wait a while before
I can read or write from the drive.

Expected results:

Should just work.

Additional info:

System is SMP, dual Opteron:  AMD Opteron(tm) Processor 246

lspci of the system's FireWire devices:

01:08.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)

The disk is a LaCie 80GB RUGGED HD U2&FW&FW8 5400RPM 8MB


dmesg output relating to fw_sbp2 leading up to an event:

fw_core: created new fw device fw1 (1 config rom retries)
fw_sbp2: logged in to sbp2 unit fw1.0 (0 retries)
fw_sbp2:  - management_agent_address:    0xfffff0030000
fw_sbp2:  - command_block_agent_address: 0xfffff0100000
fw_sbp2:  - status write address:        0x000100000000
scsi7 : SBP-2 IEEE-1394
scsi 7:0:0:0: Direct-Access-RBC WDC WD80 0VE-08HDT0       10.0 PQ: 0 ANSI: 4
SCSI device sdc: 156301488 512-byte hdwr sectors (80026 MB)
sdc: Write Protect is off
sdc: Mode Sense: 11 00 00 00
SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
SCSI device sdc: 156301488 512-byte hdwr sectors (80026 MB)
sdc: Write Protect is off
sdc: Mode Sense: 11 00 00 00
SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdc: sdc1
sd 7:0:0:0: Attached scsi disk sdc
sd 7:0:0:0: Attached scsi generic sg3 type 14
fw_core: phy config: card 0, new root=ffc0, gap_count=63
fw_sbp2: management write failed, rcode 0x12
fw_sbp2: reconnected to unit fw1.0 (1 retries)
fw_sbp2: status write for unknown orb
fw_sbp2: sbp2_scsi_abort
fw_sbp2: status write for unknown orb
fw_sbp2: sbp2_scsi_abort
fw_sbp2: status write for unknown orb
fw_sbp2: sbp2_scsi_abort
fw_sbp2: status write for unknown orb
fw_sbp2: sbp2_scsi_abort
fw_sbp2: status write for unknown orb
fw_sbp2: sbp2_scsi_abort
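
For anyone comparing logs, the repeated message pairs above can be counted with a small script. This is only a sketch: the two message strings are copied from the dmesg excerpt, while the function name `count_abort_pairs` and the rule "an unknown-orb line immediately followed by an abort line counts as one event" are assumptions made for illustration, not something the driver defines.

```python
# Message texts are copied from the dmesg excerpt in this report.
# The pairing rule below is an assumption for illustration only.
UNKNOWN_ORB = "fw_sbp2: status write for unknown orb"
SCSI_ABORT = "fw_sbp2: sbp2_scsi_abort"


def count_abort_pairs(log_text):
    """Count 'unknown orb' lines that are directly followed by an abort line."""
    # Drop blank lines and surrounding whitespace before pairing neighbours.
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    pairs = 0
    for current, following in zip(lines, lines[1:]):
        # endswith() tolerates a timestamp or log-level prefix on each line.
        if current.endswith(UNKNOWN_ORB) and following.endswith(SCSI_ABORT):
            pairs += 1
    return pairs


sample = """\
fw_sbp2: reconnected to unit fw1.0 (1 retries)
fw_sbp2: status write for unknown orb
fw_sbp2: sbp2_scsi_abort
fw_sbp2: status write for unknown orb
fw_sbp2: sbp2_scsi_abort
"""
print(count_abort_pairs(sample))
```

In practice the text would come from saved `dmesg` output rather than an inline string.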

Comment 1 Stefan Richter 2007-06-10 12:27:05 UTC

In my observation, fw_sbp2 (and lower layers) generally reacts even less
gracefully to occasional bus resets than the old sbp2 (and lower layers).

Bugs in bus generation tracking?  And/or with filtered physical DMA?  To make
matters more interesting, this may come in unfortunate combination with firmware
flaws.

Comment 2 David Daeschler 2007-06-12 13:15:02 UTC

As additional information, I think that this is a bug somewhere in the new fw* code
for Fedora 7 as opposed to a firmware problem.  I have used this disk on F7 USB,
FC5 FW, FC6 FW, Windows XP FW, and Mac OS 10.3 & 10.4 FW.  I haven't had a
problem with the disk except for F7 on FireWire.

Comment 3 Stefan Richter 2007-06-12 14:50:52 UTC

If you have only this disk on the bus (like your dmesg output indicates), then
there shouldn't be the later bus reset event which is indicated by the lines
    fw_core: phy config: card 0, new root=ffc0, gap_count=63
    fw_sbp2: management write failed, rcode 0x12
    fw_sbp2: reconnected to unit fw1.0 (1 retries)
It may even be a small series of bus reset events in fast succession.  Be it one
event or several, they shouldn't be there according to the protocols involved.
This means:  The bus is probably electrically unstable.  Nonetheless, the
drivers are supposed to survive occasional bus reset events, so there is clearly
a flaw in the drivers.
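
A rough way to look for the same hints in a longer log is sketched below. It follows the reading given in this comment, that a "phy config" line followed by a "reconnected" line points to a bus reset. The regular expressions are modelled on the three quoted log lines only, and the function name `find_bus_reset_hints` and its output format are hypothetical.

```python
import re

# Patterns modelled on the log lines quoted in this comment; other kernel
# versions may word these messages differently (assumption).
PHY_CONFIG = re.compile(
    r"fw_core: phy config: card (\d+), new root=([0-9a-f]+), gap_count=(\d+)"
)
RECONNECTED = re.compile(r"fw_sbp2: reconnected to unit (\S+) \((\d+) retries\)")


def find_bus_reset_hints(log_text):
    """Return (line number, kind, detail) for lines that hint at a bus reset."""
    hints = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        phy = PHY_CONFIG.search(line)
        if phy:
            hints.append((number, "phy config", f"gap_count={phy.group(3)}"))
            continue
        rec = RECONNECTED.search(line)
        if rec:
            detail = f"unit {rec.group(1)}, retries={rec.group(2)}"
            hints.append((number, "reconnected", detail))
    return hints


excerpt = """\
fw_core: phy config: card 0, new root=ffc0, gap_count=63
fw_sbp2: management write failed, rcode 0x12
fw_sbp2: reconnected to unit fw1.0 (1 retries)
"""
for hint in find_bus_reset_hints(excerpt):
    print(hint)
```

A hit well after the initial login, with only one device on the bus, would be the unexpected event described above.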

I have seen such problems with fw-sbp2 myself, notably with a 2.5" disk
connected to a front panel connector which is internally connected to a
motherboard header via a "jumper" cable.  Such hardware tends to be electrically
noisy, and bus resets may happen out of the blue.  Due to whatever bug in
fw-sbp2, the risk of command timeouts or more serious transport failures after
such events is higher than with the old driver.  (To make matters worse with that
2.5" disk of mine, it is HFS+ formatted and the hfsplus filesystem is quick to
oops after a transport error.)

Comment 4 Stefan Richter 2007-08-21 18:25:49 UTC

I found a way to occasionally reproduce the
    fw_sbp2: status write for unknown orb
    fw_sbp2: sbp2_scsi_abort
sequence (but without any 'reconnected...' in between) and will start debugging
this.  (Don't expect fast progress, I'm doing this in spare time.)

Comment 5 Stefan Richter 2007-08-27 15:40:49 UTC

Kristian, does the kernel build mentioned in
https://bugzilla.redhat.com/show_bug.cgi?id=242254#c20 also contain your 'status
write for unknown orb' fix?

Comment 6 Kristian Høgsberg 2007-08-27 15:51:18 UTC

Yes, I meant to update bugzilla when I did it, but it was down for maintenance
over the weekend.  In the meantime, davej rebased to a more recent git snapshot
that has the latest fixes from your tree, including the "unknown orb" fix.

David, if you can give the latest rawhide kernel or the kernel from this build:

  http://koji.fedoraproject.org/koji/taskinfo?taskID=131769

a try, that'd be great.

Thanks.

Comment 7 David Daeschler 2007-09-07 13:06:15 UTC

Kristian,

Is it too late for me to try this?  I'm getting married next week and have been
really busy at work, so this is the first chance I have to be able to build a
kernel.

Better yet, is this update in the kernel I just got from up2date?:

Linux 2.6.22.4-65.fc7 #1 SMP Tue Aug 21 21:50:50 EDT 2007 x86_64 x86_64 x86_64
GNU/Linux

If I can still test, where is the best place for me to get the kernel source?

Comment 8 Stefan Richter 2007-09-07 13:39:21 UTC

According to the link in comment #6, 2.6.23* builds from Fri Aug 24 2007 and
later contain the 'unknown orb' fix.  I'm not a Fedora user myself but I guess
the first thing to try would be the latest or close to latest from
http://koji.fedoraproject.org/koji/packageinfo?packageID=8 .  Also, the
changelog of the latest 2.6.22.* package there indicates that Kristian's fw-sbp2
patch is not in the 2.6.22.* Fedora kernels.
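
As a quick first check, the base version of a kernel release string can be compared against 2.6.23. This is a sketch under a stated assumption: a base version of 2.6.23 or later is treated as "may contain the fix", which is necessary but not sufficient, since 2.6.23 builds from before Fri Aug 24 2007 would not have it. The helper names `base_version` and `may_contain_fix` are hypothetical.

```python
import re


def base_version(release):
    """Extract the leading three-part version, e.g. (2, 6, 22)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def may_contain_fix(release, first_fixed=(2, 6, 23)):
    """True if the base version is at least first_fixed.

    Assumption: this only rules kernels out. A True result still needs the
    package changelog or build date to be checked.
    """
    return base_version(release) >= first_fixed


# Release strings taken from this bug report.
for release in ("2.6.22.4-65.fc7", "2.6.23-0.164.rc5.fc8.x86_64"):
    print(release, may_contain_fix(release))
```

On a running system the release string would come from `uname -r`.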

Comment 9 David Daeschler 2007-09-10 14:25:20 UTC

I tested this out using kernel-2.6.23-0.164.rc5.fc8.x86_64 from that page.

My problem appears to be fixed in that build.  I was able to copy my entire
source tree twice without the 'unknown orb' message or any hangs.

Sorry it took so long for me to test.
Thank you all for your hard work. 
- Dave

Comment 10 Christopher Brown 2007-09-13 20:29:43 UTC

Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

This bug appears to be resolved and I am therefore closing it. If I have erred,
please forgive me and re-open with any additional information you are able to
give. I will then try and assist you if I can.

Cheers
Chris
