+++ This bug was initially created as a clone of Bug #271801 +++

[...trimmed to relevant bits...]

-- Additional comment from elally on 2008-01-31 21:22 EST --
I've updated to a newer kernel:

Linux strauss 2.6.23.14-107.fc8 #1 SMP Mon Jan 14 22:07:11 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

Three of the errors in comment #13 have been resolved -- "firewire-sbp2 blocking keyboard input when trying to add an SBP-2 device", "status write for unknown orb", and "scsi scan: 96 byte inquiry failed". Per Jarod's suggestion, I tried the koji F8 kernel, but realized that I still got the problem resolved with the earlier kernel from fedora-updates and went back.

The only issue remaining is the first one -- "firewire-core unable to access the cards if the modules are loaded early in the boot sequence", which impacts an external hard drive and external CD-RW drive. I put the command "modprobe -r firewire-ohci && modprobe firewire-ohci" in /etc/rc.local to no effect. However, if I run the same command after logging in to GNOME, it works just fine -- the drives are recognized and mounted under /media.

I am attaching outputs from lsmod before and after reloading the firewire modules. I am also attaching dmesg output from startup through accessing the drives. Also, if it helps, my smolt page is at http://www.smolts.org/show?UUID=c413b36f-7ba0-405c-ad84-98d4ae3bfb52

Please let me know if there's anything else I can try. Thanks!

- Ed

-- Additional comment from elally on 2008-01-31 21:23 EST --
Created an attachment (id=293681)
newer dmesg output from removing/reloading firewire kernel modules

-- Additional comment from elally on 2008-01-31 21:24 EST --
Created an attachment (id=293682)
lsmod showing loaded modules before/after reloading firewire_ohci

-- Additional comment from elally on 2008-02-01 07:49 EST --
Take back my earlier report... I did some load testing by rsync'ing a directory from another computer and ran into a bunch of buffer IO errors within a few seconds.
I've attached dmesg output. I'll try moving back up to the latest koji kernel to see if that fixes the problem.

-- Additional comment from elally on 2008-02-01 07:50 EST --
Created an attachment (id=293720)
Buffer IO errors under load

-- Additional comment from elally on 2008-02-02 19:38 EST --
I'm having problems even with koji kernel "Linux strauss 2.6.23.14-123.fc8 #1 SMP Fri Jan 25 19:54:41 EST 2008 x86_64 x86_64 x86_64 GNU/Linux".

I'm testing the drive by rsyncing a directory from "bach" to the server "strauss" (the one that has the firewire drive) over the LAN. The rsync moves along just fine for a while, but then pauses for about 30 seconds with no apparent LAN or disk activity. I get I/O errors followed by the message "kernel: bad page state in process 'swapper'" appearing on the console. Sometime later, the computer with the drive will invariably crash (screen, keyboard, and network all go dead) and require a reboot.

Also, the drive is still not recognized at boot -- I have to execute "modprobe -r firewire-ohci && modprobe firewire-ohci" to have them detected.

Dmesg output is attached.

-- Additional comment from elally on 2008-02-02 19:40 EST --
Created an attachment (id=293810)
dmesg output with koji kernel

-- Additional comment from jwilson on 2008-02-04 10:01 EST --
Hi Ed,

From your dmesg output, it looks like the latest rawhide/devel kernel might get your disks working on boot, as you're hitting the 'giving up on config rom' problem, detailed in bug 429598. Please give that a spin and report back, and/or wait until I get the backports to the F8 kernel done...
-- Additional comment from stefan-r-rhbz.de on 2008-02-04 12:28 EST --
Re attachment 293810 [details]:
> Feb 2 19:25:20 strauss kernel: sd 15:0:0:0: [sde] Result:
> hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
> Feb 2 19:25:20 strauss kernel: end_request: I/O error, dev sde,
> sector 38971935
> Feb 2 19:25:20 strauss kernel: sd 15:0:0:0: rejecting I/O to offline device
> Feb 2 19:25:20 strauss kernel: sd 15:0:0:0: [sde] Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK

DID_BUS_BUSY typically happens when a bus reset occurs. DID_NO_CONNECT happens when the device was unplugged. Well, you apparently did not unplug it, but there might have been noise on the bus which inspired the controller to send a "self ID complete" event to the drivers, without self ID of the disk --- or with firewire-core misinterpreting the self ID buffer.

I saw something similar infrequently happen on my test setup: When I plugged something in to a bus with already a few nodes present, firewire-core misinterpreted this as an existing device going away, rather than a new one joining the bunch.

-- Additional comment from elally on 2008-02-07 23:23 EST --
Hi folks,

I loaded up the latest rawhide kernel and the drives were detected on boot -- woohoo! Unfortunately the other problems with I/O buffers, etc., are still there.

Regarding Stefan's suggestion, both the drives in question (external CD burner and external HD) are on two separate firewire buses, and each is the only device on its bus. The HD is attached to the motherboard's bus; the burner is attached to a TI firewire 800 PCI card.

Please let me know if there's anything else I can try to work around or troubleshoot this.

Cheers,
Ed

-- Additional comment from jwilson on 2008-02-08 13:59 EST --
Ed, exactly what kernel version was that with? I suspect some additional patches we have queued up for rawhide, which haven't yet been in a build due to some issues with gcc 4.3, might further help your situation.
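For anyone sifting through a long messages log, Stefan's reading of the excerpt from attachment 293810 (DID_BUS_BUSY pointing at a bus reset, DID_NO_CONNECT at a device that went away) can be applied in bulk by tallying the hostbyte codes. The following is only a rough sketch: the function name count_hostbytes is made up for illustration, and the sample text is the four quoted log lines with their wrapping undone.

```python
import re
from collections import Counter

# Matches the SCSI result lines quoted from attachment 293810, e.g.
# "... [sde] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK"
HOSTBYTE_RE = re.compile(r"hostbyte=(DID_[A-Z_]+)")


def count_hostbytes(log_text):
    """Return a Counter of hostbyte codes found in a kernel log excerpt.

    count_hostbytes is a hypothetical helper name, not part of any tool.
    """
    return Counter(HOSTBYTE_RE.findall(log_text))


# The four lines quoted in the thread, unwrapped to one log entry per line.
SAMPLE = """\
Feb 2 19:25:20 strauss kernel: sd 15:0:0:0: [sde] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Feb 2 19:25:20 strauss kernel: end_request: I/O error, dev sde, sector 38971935
Feb 2 19:25:20 strauss kernel: sd 15:0:0:0: rejecting I/O to offline device
Feb 2 19:25:20 strauss kernel: sd 15:0:0:0: [sde] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
"""

if __name__ == "__main__":
    for code, n in count_hostbytes(SAMPLE).most_common():
        print(code, n)
```

A tally like this says nothing about the cause; it only shows how often each condition was logged, which may help when comparing runs under different kernels.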
-- Additional comment from stefan-r-rhbz.de on 2008-02-08 15:44 EST --
Patches "firewire: fw-sbp2: fix I/O errors during reconnect" and "firewire: fw-sbp2: preemptively block sdev" may be beneficial to Ed's setup. I suspect the ultimate problem is electrically unstable hardware here, but the patches should make things smoother even for unreliable hardware.

The issue described in http://marc.info/?l=linux1394-devel&m=120237058319592 needs to be addressed eventually as well. It is hopefully not of immediate importance to Ed's setup though.

-- Additional comment from jwilson on 2008-02-25 13:23 EST --
So we have a few different bugs that have ended up in here... Here's what I'd like to do:

[...]

2) Ed's additional issues listed in comment #13, all of which have been resolved, save the I/O buffer problems. I'd like to open a new bug for this issue, if it's still a problem with the latest rawhide kernel.
Nb: these I/O buffer problems look identical to what I'm frequently seeing with a drive in a case with a Prolific PL-3507 (rev c) bridge chip... Might be interesting to know what the bridge chip is in Ed's case.
Okay, so I think we've figured out the root cause of my I/O problems on the PL-3507, and I posted patches to fix 'em to the linux1394-devel list just a bit ago[*]. I've added them to rawhide, so kernel-2.6.25-0.94.rc4.fc9 and later should carry 'em. I'm guessing Ed's I/O issues will also be resolved... [*] http://sourceforge.net/mailarchive/forum.php?thread_name=200803060015.40357.jwilson%40redhat.com&forum_name=linux1394-devel
Also added to F8, in kernel 2.6.24.3-23.fc8, building in koji right now. http://koji.fedoraproject.org/koji/taskinfo?taskID=508508 Ed, please give this f8 build (or a later one) or a rawhide kernel a try, I think you should be all set.
Thanks -- will give it a try and report back.
No luck with koji kernel 2.6.24.3-23.fc8. No step forward, and unfortunately it actually seems to be a step backward.

The drive is no longer automatically detected on startup. Once I boot, I can execute "modprobe -r firewire-ohci && modprobe firewire-ohci" to force a scan for them. The drive is then detected, but it takes about 30 seconds to do so (see attached messages log snippet).

Once the drive was recognized, I again tried the rsync from a remote system. It ran for about 1 minute, paused for another 30 seconds or so, then started pushing out the usual warnings to the messages log (also attached).

Regarding Jarod's question on the bridge chip, do I need to crack the case open for that, or is there a s/w utility that will tell me?

Please let me know if there's anything else I can try to debug this. I'm not much of a coder, but I'll do whatever I can to help source the problem.

Thanks,
Ed
Created attachment 297562 [details] message log from mounting firewire drive
Created attachment 297563 [details] message log showing errors
message log showing errors when drive stops responding, followed by messages when unmounting the drive
Ed, do you by chance use long cables, excessively bent cables, front panel or back panel breakout connectors, or unventilated enclosures? Could you install the old ieee1394 kernel modules from ATRPMs and see how they work with the very same hardware configuration? (Load ohci1394 and sbp2 instead of firewire-ohci and firewire-sbp2.) Jarod, the selfID complete event logging patch would be nice to have here to check whether there are unexpected bus resets going on.
> Regarding Jarod's question on the bridge chip, do I need to crack the
> case open for that, or is there a s/w utility that will tell me?

You could attach /sys/bus/firewire/devices/fwX/config_rom here so we could hazard a guess. (Insert the correct device name for "fwX"; it has to be one for which also an fwX.Y exists to which firewire-sbp2 is bound. In your last log, this was fw1.) The config_rom is built by firmware though and hence may lack or even provide false information about the hardware.

OxSemi chips have further firmware identifiers and also hardware identifiers outside of the config_rom: http://marc.info/?l=linux1394-user&m=114485393227904

A few not too difficult ways exist to access these from userspace, but it would take some time to explain how. :-)
Damn, I was hoping that build was going to fix things... Looks like a LaCie hard disk drive (vendor oui 00d04b == LaCie). I believe they typically use Oxford bridges -- at least one of the LaCie drives I have here that I just poked at is an OXFW911+ bridge. I'll work on getting the selfID logging patch added to a 2.6.24 f8 build sometime this week, but there should be a version of it available in rawhide even sooner...
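For reference, the vendor OUI Jarod read off the attached config_rom sits in the bus info block at the start of the ROM: the second quadlet is the ASCII string "1394", and the upper 24 bits of the fourth quadlet are the vendor ID (the start of the device's EUI-64). Below is a minimal sketch of pulling it out of a dump. The helper name vendor_oui, the one-entry KNOWN_OUIS table, and the sample quadlets are invented for illustration, and the byte order of the sysfs file is an assumption (pass byteorder="little" if the dump turns out to be stored in host order on x86).

```python
import struct

# Second quadlet of an IEEE 1394 bus info block: ASCII "1394".
BUS_NAME_1394 = 0x31333934

# Only the OUI mentioned in this thread; NOT a complete vendor table.
KNOWN_OUIS = {0x00D04B: "LaCie"}


def vendor_oui(rom, byteorder="big"):
    """Return the 24-bit vendor OUI from a raw config ROM dump.

    vendor_oui is a hypothetical helper. byteorder ("big" or "little")
    is an assumption about how the dump was written.
    """
    if len(rom) < 20:
        raise ValueError("config ROM too short for a bus info block")
    fmt = (">" if byteorder == "big" else "<") + "5I"
    quads = struct.unpack(fmt, rom[:20])
    if quads[1] != BUS_NAME_1394:
        raise ValueError("missing '1394' bus name; wrong byte order?")
    # Quadlet 3 holds vendor ID (24 bits) followed by chip_id_hi (8 bits).
    return quads[3] >> 8


if __name__ == "__main__":
    # Made-up ROM header: only the '1394' marker and the OUI are meaningful.
    sample = struct.pack(">5I", 0x0404ABCD, BUS_NAME_1394, 0xE0646102,
                         0x00D04B12, 0x34567890)
    oui = vendor_oui(sample)
    print("%06x" % oui, KNOWN_OUIS.get(oui, "unknown vendor"))
```

To try it on a real device, the bytes could be read with open("/sys/bus/firewire/devices/fw1/config_rom", "rb").read(), using whichever fwX applies. As Stefan notes, the ROM is written by firmware, so the OUI identifies whoever programmed it and not necessarily the bridge chip.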
> Looks like a LaCie hard disk drive

Ah, I missed that. From what I read on the internet (and it can only be true then :-), Europeans are usually rather fond of their LaCie disks while there seem to be many Americans having complaints about LaCie disks. So it would be nice if Ed, who I assume is American, could do some stress tests with the old drivers from ATRPMs to check the extent of guilt of the new drivers.
Hey, I'm American, and I have no complaints with either of the LaCie disks I have here! Actually quite fond of both of 'em -- one is designed to sit perfectly under a Mac Mini, the other is a nice little 2.5" drive in a case that can be used and powered over either USB or FireWire... :)
Hi Jarod and Stefan,

I'm American and can even say "y'all" if you need me to ;-)

This is indeed a LaCie disk -- it's their 120GB Porsche but it's a couple of years old -- it only has the FW 400 connector (no USB). The disk did work fine under Fedora 6. I will try the old ieee1394 modules from ATRPMs and report back, along with the config rom dump.

To Stefan's earlier questions:
* I'm using short cables (1 meter or so) and there are no sharp turns. The cable attaches to a soldered connector on a Gigabyte P965 mainboard (model GA-965P-DQ6).
* I believe the drive enclosure is ventilated, but will check for sure.
I've pulled a copy of the config_rom and attached it here. This was done under koji kernel 2.6.24.3-23.fc8 with the newer firewire_ohci drivers -- I will downgrade to the older drivers in the next day or so. I tried using gscanbus to get some more info on the drive, but it looks like I had a kernel panic or oops. I will give that another try too.
Created attachment 297712 [details] config_rom output for LaCie 120GB Porsche external drive
Created attachment 297715 [details] config_rom output for LaCie 120GB Porsche external drive
I was able to get gscanbus working on the drive. Output is attached. Based on the link Stefan provided to http://marc.info/?l=linux1394-user&m=114485393227904, I was able to determine the chip version. Quadlet Read from 0xFFFF F0050000 (firmware ID) = 0x88000738 which indicates OXFW911. Quadlet Read from 0xFFFF F0090020 (hardware ID) = 0x159E96FD which didn't match any of the hardware IDs listed.
Created attachment 297718 [details] Gscanbus output
Ed, last night I remembered a firewire-ohci bug which became known at the beginning of this year. firewire-ohci is broken on machines with physical memory addresses above the 4GB mark. If I read the first few lines of your dmesg from https://bugzilla.redhat.com/show_bug.cgi?id=271801#c5 correctly, your system is an affected machine. Jarod is working on the issue.

Whether this bug actually causes your I/O errors is not clear. However, it is at least possible. Your errors start with a SCSI request timeout (indicated by "firewire_sbp2: fw1.0: sbp2_scsi_abort"). Perhaps the device properly completed the request and wrote status in firewire-sbp2's status FIFO, but firewire-ohci failed to properly process the AR DMA event which results from the status write.

PS: Yes, all these firmware markers tell that the bridge chip is indeed an OXFW911.
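One way to check whether a machine has physical memory above the 4GB mark, as Stefan inferred from the dmesg, is to look at the "System RAM" ranges in /proc/iomem. What follows is a small sketch under a few assumptions: the helper name ram_above_4gib and the sample memory map are invented for illustration, and it only reports where RAM is mapped, not whether firewire-ohci actually ran into the bug.

```python
import re

FOUR_GIB = 1 << 32

# /proc/iomem lines look like "00100000-bfedffff : System RAM";
# nested entries are indented, hence the leading \s*.
RANGE_RE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.+?)\s*$")


def ram_above_4gib(iomem_text):
    """Return True if any 'System RAM' range ends at or above 4 GiB.

    ram_above_4gib is a hypothetical helper name.
    """
    for line in iomem_text.splitlines():
        m = RANGE_RE.match(line)
        if m and m.group(3) == "System RAM":
            if int(m.group(2), 16) >= FOUR_GIB:
                return True
    return False


# Made-up memory map of a box with RAM remapped above the 4 GiB boundary.
SAMPLE_IOMEM = """\
00000000-0009fbff : System RAM
00100000-bfedffff : System RAM
  00200000-0056f9a3 : Kernel code
100000000-13fffffff : System RAM
"""

if __name__ == "__main__":
    print(ram_above_4gib(SAMPLE_IOMEM))
```

On a live system the text would come from reading /proc/iomem. Note that newer kernels show all-zero addresses there to unprivileged users, so it may need to be read as root.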
Just posted the fix for my own problems on x86_64 w/>= 4GB of RAM a bit ago: http://lkml.org/lkml/2008/3/12/356

Patch also added to rawhide; a kernel build with it should start soonish...
Also added to an F8 kernel build now: http://koji.fedoraproject.org/packages/kernel/2.6.24.3/37.fc8/ Ed, please give that a spin and see if we don't finally have things playing nice for you...
Gah. I screwed up, and the patch is NOT in the -37 kernel. It's in the currently building -40 kernel though. Should be ready by morning... http://koji.fedoraproject.org/koji/taskinfo?taskID=520390
Thanks -- I should be able to try it out later this week.
kernel-2.6.24.3-50.fc8 has been submitted as an update for Fedora 8
kernel-2.6.24.3-50.fc8 has been pushed to the Fedora 8 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update kernel'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F8/FEDORA-2008-2630
Jarod et al., Looks like the last batch of fixes in kernel-2.6.24.3-50.fc8 has solved everything. The drive and all partitions are recognized at startup; I'm NOT getting the 'giving up on config rom' errors; some major stress-testing of the drive failed to turn up any problems. In short, I think we're good to close out this report. Thank you all for your help in resolving this problem!!! Regards, Ed
Excellent, glad to hear we finally got this one licked!
kernel-2.6.24.3-50.fc8 has been pushed to the Fedora 8 stable repository. If problems still persist, please make note of it in this bug report.