434830 – [firewire] disk can't be used due to buffer I/O errors

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Jarod Wilson
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-02-25 18:31 UTC by Jarod Wilson
Modified:	2008-03-26 17:15 UTC
CC List:	4 users
Fixed In Version:	2.6.24.3-50.fc8
Clone Of:
Environment:
Last Closed:	2008-03-23 03:18:14 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
message log from mounting firewire drive (2.94 KB, text/plain) 2008-03-11 05:52 UTC, Ed Lally
message log showing errors (5.50 KB, text/plain) 2008-03-11 05:53 UTC, Ed Lally
config_rom output for LaCie 120GB Porsche external drive (88 bytes, application/octet-stream) 2008-03-12 05:02 UTC, Ed Lally
config_rom output for LaCie 120GB Porsche external drive (204 bytes, application/octet-stream) 2008-03-12 05:52 UTC, Ed Lally
Gscanbus output (506 bytes, text/plain) 2008-03-12 06:00 UTC, Ed Lally

Description Jarod Wilson 2008-02-25 18:31:45 UTC

+++ This bug was initially created as a clone of Bug #271801 +++

[...trimmed to relevant bits...]

-- Additional comment from elally on 2008-01-31 21:22 EST --
I've updated to a newer kernel: Linux strauss 2.6.23.14-107.fc8 #1 SMP Mon Jan
14 22:07:11 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

Three of the errors in comment #13 have been resolved -- "firewire-sbp2 blocking
keyboard input when trying to add an SBP-2 device", "status write for unknown
orb", and "scsi scan: 96 byte inquiry failed".

Per Jarod's suggestion, I tried the koji F8 kernel, but realized that the earlier
kernel from fedora-updates had already resolved the problem, so I went back to it.

The only issue remaining is the first one -- "firewire-core unable to access the
cards if the modules are loaded early in the boot sequence", which impacts an
external hard drive and external CD-RW drive.

I put the command "modprobe -r firewire-ohci && modprobe firewire-ohci" in
/etc/rc.local to no effect.  However, if I run the same command after logging in
to GNOME, it works just fine -- the drives are recognized and mounted under
/media.  

I am attaching outputs from lsmod before and after reloading the firewire
modules.  I am also attaching dmesg output from startup through accessing the
drives.  Also, if it helps, my smolt page is at
http://www.smolts.org/show?UUID=c413b36f-7ba0-405c-ad84-98d4ae3bfb52

Please let me know if there's anything else I can try.

Thanks!

- Ed


-- Additional comment from elally on 2008-01-31 21:23 EST --
Created an attachment (id=293681)
newer dmesg output from removing/reloading firewire kernel modules


-- Additional comment from elally on 2008-01-31 21:24 EST --
Created an attachment (id=293682)
lsmod showing loaded modules before/after reloading firewire_ohci


-- Additional comment from elally on 2008-02-01 07:49 EST --
Take back my earlier report...  I did some load testing by rsync'ing a directory
from another computer and ran into a bunch of buffer IO errors within a few
seconds.  I've attached dmesg output.  I'll try moving back up to the latest
koji kernel to see if that fixes the problem.

-- Additional comment from elally on 2008-02-01 07:50 EST --
Created an attachment (id=293720)
Buffer IO errors under load


-- Additional comment from elally on 2008-02-02 19:38 EST --
I'm having problems even with koji kernel "Linux strauss 2.6.23.14-123.fc8 #1
SMP Fri Jan 25 19:54:41 EST 2008 x86_64 x86_64 x86_64 GNU/Linux".  

I'm testing the drive by rsyncing a directory from "bach" to the server
"strauss" (the one that has the firewire drive) over the LAN.  The rsync moves
along just fine for a while, but then pauses for about 30 seconds with no
apparent LAN or disk activity.  I get I/O errors followed by the message
"kernel: bad page state in process 'swapper'" appearing on the console. 
Sometime later, the computer with the drive will invariably crash (screen,
keyboard, and network all go dead) and require a reboot.

Also, the drives are still not recognized at boot -- I have to execute "modprobe
-r firewire-ohci && modprobe firewire-ohci" to have them detected.

Dmesg output is attached.

-- Additional comment from elally on 2008-02-02 19:40 EST --
Created an attachment (id=293810)
dmesg output with koji kernel


-- Additional comment from jwilson on 2008-02-04 10:01 EST --
Hi Ed,

From your dmesg output, it looks like the latest rawhide/devel kernel might get
your disks working on boot, as you're hitting the 'giving up on config rom'
problem, detailed in bug 429598. Please give that a spin and report back, and/or
wait until I get the backports to the F8 kernel done...

-- Additional comment from stefan-r-rhbz.de on 2008-02-04 12:28 EST --
Re attachment 293810:
> Feb  2 19:25:20 strauss kernel: sd 15:0:0:0: [sde] Result:
> hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
> Feb  2 19:25:20 strauss kernel: end_request: I/O error, dev sde,
> sector 38971935
> Feb  2 19:25:20 strauss kernel: sd 15:0:0:0: rejecting I/O to offline device
> Feb  2 19:25:20 strauss kernel: sd 15:0:0:0: [sde] Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK

DID_BUS_BUSY typically happens when a bus reset occurs.
DID_NO_CONNECT happens when the device was unplugged.

Well, you apparently did not unplug it, but there might have been noise on the
bus which inspired the controller to send a "self ID complete" event to the
drivers, without self ID of the disk --- or with firewire-core misinterpreting
the self ID buffer.

I saw something similar infrequently happen on my test setup:  When I plugged
something in to a bus with already a few nodes present, firewire-core
misinterpreted this as an existing device going away, rather than a new one
joining the bunch.
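
To see how often each of these host byte codes turns up in a longer log, a small tally script can help. This is only a sketch: the function name `count_hostbytes` and the regular expression are illustrative choices, and the sample text is just the excerpt quoted above, re-joined onto single lines.

```python
import re
from collections import Counter

# Matches the "hostbyte=DID_..." field in the SCSI result lines quoted above.
HOSTBYTE_RE = re.compile(r"hostbyte=(DID_[A-Z_]+)")


def count_hostbytes(log_text):
    """Return a Counter mapping each hostbyte code to its number of occurrences."""
    return Counter(HOSTBYTE_RE.findall(log_text))


# Lines taken from the dmesg excerpt in this comment (wrapped lines re-joined).
SAMPLE = """\
Feb  2 19:25:20 strauss kernel: sd 15:0:0:0: [sde] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Feb  2 19:25:20 strauss kernel: end_request: I/O error, dev sde, sector 38971935
Feb  2 19:25:20 strauss kernel: sd 15:0:0:0: rejecting I/O to offline device
Feb  2 19:25:20 strauss kernel: sd 15:0:0:0: [sde] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
"""

if __name__ == "__main__":
    for code, n in count_hostbytes(SAMPLE).most_common():
        print(code, n)
```

A burst of DID_BUS_BUSY followed by DID_NO_CONNECT, as in the sample, would fit the bus-reset-then-disappearance pattern described above; the tally alone cannot say what caused it.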

-- Additional comment from elally on 2008-02-07 23:23 EST --
Hi folks,

I loaded up the latest rawhide kernel and the drives were detected on boot --
woohoo!

Unfortunately the other problems with I/O buffers, etc., are still there.

Regarding Stefan's suggestion, both the drives in question (external CD burner
and external HD) are on two separate firewire buses, and each is the only device
on its bus.  The HD is attached to the motherboard's bus; the burner is attached
to a TI firewire 800 PCI card.

Please let me know if there's anything else I can try to work around or
troubleshoot this.

Cheers,

Ed


-- Additional comment from jwilson on 2008-02-08 13:59 EST --
Ed, exactly what kernel version was that with? I suspect some additional patches
we have queued up for rawhide, which haven't yet been in a build due to some
issues with gcc 4.3, might further help your situation.

-- Additional comment from stefan-r-rhbz.de on 2008-02-08 15:44 EST --
Patches "firewire: fw-sbp2: fix I/O errors during reconnect" and "firewire:
fw-sbp2: preemptively block sdev" may be beneficial to Ed's setup.

I suspect the ultimate problem is electrically unstable hardware here, but the
patches should make things smoother even for unreliable hardware.

The issue described in http://marc.info/?l=linux1394-devel&m=120237058319592
needs to be addressed eventually as well.  It is hopefully not of immediate
importance to Ed's setup though.


-- Additional comment from jwilson on 2008-02-25 13:23 EST --
So we have a few different bugs that have ended up in here... Here's what I'd
like to do:
[...]
2) Ed's additional issues listed in comment #13, all of which have been
resolved, save the I/O buffer problems. I'd like to open a new bug for this
issue, if it's still a problem with the latest rawhide kernel.

Comment 1 Jarod Wilson 2008-02-25 18:32:56 UTC

Nb: these I/O buffer problems look identical to what I'm frequently seeing with
a drive in a case with a Prolific PL-3507 (rev c) bridge chip... Might be
interesting to know what the bridge chip is in Ed's case.

Comment 2 Jarod Wilson 2008-03-06 05:40:36 UTC

Okay, so I think we've figured out the root cause of my I/O problems on the
PL-3507, and I posted patches to fix 'em to the linux1394-devel list just a bit
ago[*]. I've added them to rawhide, so kernel-2.6.25-0.94.rc4.fc9 and later
should carry 'em. I'm guessing Ed's I/O issues will also be resolved...


[*]
http://sourceforge.net/mailarchive/forum.php?thread_name=200803060015.40357.jwilson%40redhat.com&forum_name=linux1394-devel

Comment 3 Jarod Wilson 2008-03-10 16:48:28 UTC

Also added to F8, in kernel 2.6.24.3-23.fc8, building in koji right now.

http://koji.fedoraproject.org/koji/taskinfo?taskID=508508

Ed, please give this f8 build (or a later one) or a rawhide kernel a try, I
think you should be all set.

Comment 4 Ed Lally 2008-03-11 04:40:41 UTC

Thanks -- will give it a try and report back.

Comment 5 Ed Lally 2008-03-11 05:51:36 UTC

No luck with koji kernel 2.6.24.3-23.fc8.  No step forward, and unfortunately it
actually seems to be a step backward.  The drive is no longer automatically
detected on startup.  Once I boot, I can execute "modprobe -r firewire-ohci &&
modprobe firewire-ohci" to force a scan for it.  The drive is then detected,
but it takes about 30 seconds to do so (see attached messages log snippet).

Once the drive was recognized, I again tried the rsync from a remote system.  It
ran for about 1 minute, paused for another 30 seconds or so, then started
pushing out the usual warnings to the messages log (also attached).

Regarding Jarod's question on the bridge chip, do I need to crack the case open
for that, or is there a s/w utility that will tell me?

Please let me know if there's anything else I can try to debug this.  I'm not
much of a coder, but I'll do whatever I can to help source the problem.

Thanks,

Ed

Comment 6 Ed Lally 2008-03-11 05:52:20 UTC

Created attachment 297562
message log from mounting firewire drive

Comment 7 Ed Lally 2008-03-11 05:53:17 UTC

Created attachment 297563
message log showing errors

message log showing errors when drive stops responding, followed by messages
when unmounting the drive

Comment 8 Stefan Richter 2008-03-11 11:37:27 UTC

Ed, do you by chance use long cables, excessively bent cables, front panel or
back panel breakout connectors, or unventilated enclosures?

Could you install the old ieee1394 kernel modules from ATRPMs and see how they
work with the very same hardware configuration?  (Load ohci1394 and sbp2 instead
of firewire-ohci and firewire-sbp2.)

Jarod, the selfID complete event logging patch would be nice to have here to
check whether there are unexpected bus resets going on.

Comment 9 Stefan Richter 2008-03-11 12:08:08 UTC

> Regarding Jarod's question on the bridge chip, do I need to crack the
> case open for that, or is there a s/w utility that will tell me?

You could attach /sys/bus/firewire/devices/fwX/config_rom here so we could
hazard a guess.  (Insert the correct device name for "fwX"; it has to be one for
which also an fwX.Y exists to which firewire-sbp2 is bound.  In your last log,
this was fw1.)  The config_rom is built by firmware though and hence may lack
or even provide false information about the hardware.

OxSemi chips have further firmware identifiers and also hardware identifiers
outside of the config_rom: http://marc.info/?l=linux1394-user&m=114485393227904
A few not too difficult ways exist to access these from userspace, but it would
take some time to explain how. :-)

Comment 10 Jarod Wilson 2008-03-11 13:46:35 UTC

Damn, I was hoping that build was going to fix things... Looks like a LaCie hard
disk drive (vendor oui 00d04b == LaCie). I believe they typically use Oxford
bridges -- at least one of the LaCie drives I have here that I just poked at is
an OXFW911+ bridge.

I'll work on getting the selfID logging patch added to a 2.6.24 f8 build
sometime this week, but there should be a version of it available in rawhide
even sooner...
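
For anyone wanting to pull the vendor OUI out of a config_rom dump themselves: in the IEEE 1394 bus info block, quadlet 1 spells "1394" and quadlets 3 and 4 hold the EUI-64, whose top 24 bits are the OUI. The sketch below assumes the dump is an array of 32-bit words and lets the caller choose the byte order, since whether a given dump is host-endian or big-endian is an assumption here, not something established in this report. The function name `vendor_oui` and the sample ROM image are hypothetical; only the OUI value 00d04b comes from this thread.

```python
import struct


def vendor_oui(config_rom, byteorder="="):
    """Return the 24-bit vendor OUI from the bus info block of a config ROM image.

    byteorder is a struct prefix: "=" for host order, ">" for big-endian.
    """
    if len(config_rom) < 20:
        raise ValueError("image too short to contain a bus info block")
    quadlets = struct.unpack(byteorder + "5I", config_rom[:20])
    # Quadlet 1 must be ASCII "1394"; a mismatch suggests the wrong byte order.
    if quadlets[1] != 0x31333934:
        raise ValueError("bus name is not '1394'; wrong byte order?")
    # Quadlet 3 is the upper half of the EUI-64; its top 24 bits are the OUI.
    return quadlets[3] >> 8


# Hypothetical big-endian image: everything except the OUI 00d04b is made up.
fake_rom = struct.pack(">5I", 0x04040000, 0x31333934, 0, 0x00D04B12, 0x34567890)

if __name__ == "__main__":
    print(f"{vendor_oui(fake_rom, '>'):06x}")  # prints 00d04b
```

For a real dump, the bytes would come from reading the attached file or the sysfs attribute in binary mode, trying the other byte order if the "1394" check fails.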

Comment 11 Stefan Richter 2008-03-11 14:21:29 UTC

> Looks like a LaCie hard disk drive

Ah, I missed that.  From what I read on the internet (and it can only be true
then :-), Europeans are usually rather fond of their LaCie disks while there
seem to be many Americans having complaints about LaCie disks.  So it would be
nice if Ed, who I assume is American, could do some stress tests with the old
drivers from ATRPMs to check the extent of guilt of the new drivers.

Comment 12 Jarod Wilson 2008-03-11 14:33:48 UTC

Hey, I'm American, and I have no complaints with either of the LaCie disks I
have here! Actually quite fond of both of 'em -- one is designed to sit
perfectly under a Mac Mini, the other is a nice little 2.5" drive in a case that
can be used and powered over either USB or FireWire... :)

Comment 13 Ed Lally 2008-03-11 18:00:53 UTC

Hi Jarod and Stefan,

I'm American and can even say "y'all" if you need me to ;-)

This is indeed a LaCie disk -- it's their 120GB Porsche but it's a couple of 
years old -- it only has the FW 400 connector (no USB).  The disk did work 
fine under Fedora 6.

I will try the old ieee1394 modules from ATRPMs and report back, along with 
the config rom dump.

To Stefan's earlier questions:
* I'm using short cables (1 meter or so) and there are no sharp turns.  The 
cable attaches to a soldered connector on a Gigabyte P965 mainboard (model
GA-965P-DQ6).
* I believe the drive enclosure is ventilated, but will check for sure.

Comment 14 Ed Lally 2008-03-12 05:02:22 UTC

I've pulled a copy of the config_rom and attached it here.  This was done under
koji kernel 2.6.24.3-23.fc8 with the newer firewire_ohci drivers -- I will
downgrade to the older drivers in the next day or so.

I tried using gscanbus to get some more info on the drive, but it looks like I
had a kernel panic or oops.  I will give that another try too.

Comment 15 Ed Lally 2008-03-12 05:02:59 UTC

Created attachment 297712
config_rom output for LaCie 120GB Porsche external drive

Comment 16 Ed Lally 2008-03-12 05:52:23 UTC

Created attachment 297715
config_rom output for LaCie 120GB Porsche external drive

Comment 17 Ed Lally 2008-03-12 06:00:23 UTC

I was able to get gscanbus working on the drive.  Output is attached.

Based on the link Stefan provided to
http://marc.info/?l=linux1394-user&m=114485393227904, I was able to determine
the chip version.

Quadlet Read from 0xFFFF F0050000 (firmware ID) = 0x88000738 which indicates
OXFW911.
Quadlet Read from 0xFFFF F0090020 (hardware ID) = 0x159E96FD which didn't match
any of the hardware IDs listed.

Comment 18 Ed Lally 2008-03-12 06:00:55 UTC

Created attachment 297718
Gscanbus output

Comment 19 Stefan Richter 2008-03-12 12:10:58 UTC

Ed,
last night I remembered a firewire-ohci bug which became known at the beginning
of this year.  firewire-ohci is broken on machines with physical memory
addresses above the 4GB mark.  If I read the first few lines of your dmesg from
https://bugzilla.redhat.com/show_bug.cgi?id=271801#c5 correctly, your system is
an affected machine.  Jarod is working on the issue.

Whether this bug actually causes your I/O errors is not clear.  However, it is
at least possible.  Your errors start with a SCSI request timeout (indicated by
"firewire_sbp2: fw1.0: sbp2_scsi_abort").  Perhaps the device properly completed
the request and wrote status in firewire-sbp2's status FIFO, but firewire-ohci
failed to properly process the AR DMA event which results from the status write.

PS:  Yes, all these firmware markers tell that the bridge chip is indeed an OXFW911.
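
A quick way for others to check whether their own machine has RAM above the 4GB mark is to look for "System RAM" ranges in /proc/iomem that end at or beyond 0x100000000. The sketch below only parses text in that format; the function name `ram_above_4g` is illustrative, the sample is made up (it is not from Ed's system), and reading the real file may need root to see actual addresses.

```python
import re

FOUR_GIB = 1 << 32

# Top-level /proc/iomem entries look like "00100000-bfedffff : System RAM".
RANGE_RE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : System RAM\s*$")


def ram_above_4g(iomem_text):
    """Return (start, end) for every System RAM range reaching past 4 GiB."""
    hits = []
    for line in iomem_text.splitlines():
        m = RANGE_RE.match(line)
        if m:
            start, end = int(m.group(1), 16), int(m.group(2), 16)
            if end >= FOUR_GIB:
                hits.append((start, end))
    return hits


# Made-up sample in /proc/iomem format, for illustration only.
SAMPLE = """\
00000000-0009fbff : System RAM
00100000-bfedffff : System RAM
  00200000-0057a6b1 : Kernel code
fec00000-fec00fff : IOAPIC 0
100000000-13fffffff : System RAM
"""

if __name__ == "__main__":
    for start, end in ram_above_4g(SAMPLE):
        print(f"{start:#x}-{end:#x}")
```

A non-empty result only says the machine is of the affected kind; as noted above, it does not prove that this bug is what causes the I/O errors.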

Comment 20 Jarod Wilson 2008-03-12 21:52:12 UTC

Just posted the fix for my own problems on x86_64 w/>= 4GB of RAM a bit ago:

http://lkml.org/lkml/2008/3/12/356

Patch also added to rawhide; a kernel build with it should start soonish...

Comment 21 Jarod Wilson 2008-03-15 02:28:32 UTC

Also added to an F8 kernel build now:

http://koji.fedoraproject.org/packages/kernel/2.6.24.3/37.fc8/

Ed, please give that a spin and see if we don't finally have things playing nice
for you...

Comment 22 Jarod Wilson 2008-03-18 03:47:33 UTC

Gah. I screwed up, and the patch is NOT in the -37 kernel. It's in the currently
building -40 kernel though. Should be ready by morning...

http://koji.fedoraproject.org/koji/taskinfo?taskID=520390

Comment 23 Ed Lally 2008-03-18 03:58:05 UTC

Thanks -- I should be able to try it out later this week.

Comment 24 Fedora Update System 2008-03-21 03:15:33 UTC

kernel-2.6.24.3-50.fc8 has been submitted as an update for Fedora 8

Comment 25 Fedora Update System 2008-03-21 22:16:52 UTC

kernel-2.6.24.3-50.fc8 has been pushed to the Fedora 8 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update kernel'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-2630

Comment 26 Ed Lally 2008-03-23 03:01:58 UTC

Jarod et al., 

Looks like the last batch of fixes in kernel-2.6.24.3-50.fc8 has solved
everything.  The drive and all partitions are recognized at startup; I'm NOT
getting the 'giving up on config rom' errors; some major stress-testing of the
drive failed to turn up any problems.  In short, I think we're good to close out
this report.

Thank you all for your help in resolving this problem!!!

Regards,

Ed

Comment 27 Jarod Wilson 2008-03-23 03:18:14 UTC

Excellent, glad to hear we finally got this one licked!

Comment 28 Fedora Update System 2008-03-26 17:15:04 UTC

kernel-2.6.24.3-50.fc8 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.
