Bug 262901
| Field | Value |
|---|---|
| Summary | Firewire stack in F7 |
| Product | [Fedora] Fedora |
| Component | kernel |
| Version | 7 |
| Hardware | All |
| OS | All |
| Status | CLOSED NEXTRELEASE |
| Severity | medium |
| Priority | medium |
| Reporter | Hugues Talbot <hugues.talbot> |
| Assignee | Jay Fenlason <fenlason> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | cebbert, chris.brown, davej, fche, jfeeney, krh, stefan-r-rhbz, zaitcev |
| Doc Type | Bug Fix |
| Last Closed | 2008-01-03 23:55:24 UTC |
Description
Hugues Talbot
2007-08-29 10:46:23 UTC
This is too general of a report, not actionable (e.g. if someone assigns this to me, I won't be able to find a lot of time to spare on it). A more focused kind of bug report would be welcome though.

Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to isolate current bugs in the fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. As Pete said, there is currently not enough detail to go on to begin troubleshooting the issue. Could you attach the following:

# lspci -vvxxx
dmesg (entire contents dump plus any output after insertion of one specific device - please indicate which one)

You might also like to test the latest kernel from rawhide which contains many fixes which may resolve the issue for you.

With reference to the suspend/resume issues, could you indicate what machine this is used on and perhaps review some of the following which may help:

# Find out if the system is locked up completely by hitting the caps lock key.
* If the capslock light doesn't toggle, the system is completely dead. Try again, but this time before suspending, activate the pm_trace functionality with echo 1 > /sys/power/pm_trace. This reprograms the real time clock to contain a few bytes of information which we can use to diagnose which driver failed to resume. After the hang, reboot, boot up again, and save the output of dmesg.
* If the capslock light does toggle, then the system did come back up, and it's possible that we just failed to reinitialise the video. http://people.freedesktop.org/~hughsient/quirk may contain further useful information to diagnose this problem.

It may also be useful to initiate the suspend from a tty (ctrl-alt-f1) and run pm-suspend ; dmesg > dmesg.out ; sync by hand. Upon resuming you'll now have some more debug info to sift through.
Additionally, this way when it resumes, you already have a console logged in from which you can type commands 'blind'. Trying vbetool post for example may bring things back to life.
# Try rmmod'ing various modules before doing the suspend. If this makes things work again, retry with a smaller set of modules unloaded. Keep retrying until you narrow down which module is to blame.
# Another trick that sometimes works to force video to come back up is to enable the BIOS password. This makes the system resume in a VGA text mode that the kernel recovers from a lot easier. Not a real solution, but it can help to diagnose other problems.
# Proprietary 3d graphics driver users should test with respective open source drivers.
# Laptops using the nv driver should be considered hibernate-only capable as per https://www.redhat.com/archives/fedora-test-list/2007-September/msg00365.html

If the problem no longer exists (which I doubt) then please close this bug as CURRENTRELEASE indicating what resolved it for you or I'll do so in a few days if there is no additional information lodged.

Hello,

OK, then let's focus the report. Forget about suspend/resume.

I have an external WD MyBook FW800 disk plugged into this PCI card (output of lspci)

00:0b.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link Layer Controller (rev 01)

I'm running F7 x86_64 on an Athlon 64 3000+ with 2GB of RAM.

With the current kernel, any kind of heavy activity, such as reading or writing a 10GB file on the disk results in a multitude of *abort* messages in dmesg and /var/log/messages as well as very slow transfers.

The old, default FW stack shipping with Linus' kernel works fine.

This behaviour has been widely reported and is easily reproducible, as evidenced by the availability of the EzPlanet kernel that does nothing but replace the RH firewire stack with the old one.

http://www.ezplanetone.com/xwiki/bin/view/KnowledgeBase/BrokenFC7FireWire

Let me know if you need more.
The suspend/resume issue is of secondary importance. A working FW stack is vital !

I appreciate that off-bugzilla you have commented you cannot afford to run rawhide kernels however as Pete commented there is currently nothing here for them to go on.

I appreciate that not having hardware work in the way you wish is frustrating but bug reports must be substantiated by debuggable output. So far, despite Pete and my requests you have not attached any.

Again, please can you run the following:

# lspci -vvxxx
# dmidecode (you may need to install this)
dmesg

and attach as separate text/plain attachments (does not need to be from a rawhide kernel) and post with any additional constructive comments you feel are prudent.

There are valid reasons for shipping a new firewire stack - the old one was equally problematic (if not more so) however yours is a case of an older stack working whilst the new one does not (yet). In this case I would counsel one of two things:

-Use Fedora Core 6 or *EL 5
-Work to get the new code functioning quicker - feedback much appreciated

I have re-assigned this bug to the firewire maintainers who can review it once the additional debugging information requested has been provided. Until then "Firewire doesn't work" is simply not good enough.

Cheers
Chris

Chris, thanks once more for adding me to the CC list.

1. The problem 'disk not accessible after resume' is a known bug and fixed in Andrew Morton's -mm test kernel by a patch provided by Kristian H. Alas this fix uncovers an additional bug which happens only with _some_ SBP-2 devices (about half of my devices): An APM notebook of mine locks up soon after resume if one of those devices was connected before and after suspend/resume. Because this second bug is more serious than the former one, I haven't forwarded Kristian's fix to Linus yet. I have to try to set up netconsole or get suspend/resume working on a non-APM machine for testing purposes or whatever to debug that second problem.
Alas I couldn't afford the time to do so yet.

This is Kristian's fix:
http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=3743cc44c82342e744aa564f27c3e1fdd74ba3d0
(this URL will become invalid during next merge to Linus) or
http://me.in-berlin.de/~s5r6/linux1394/submitted/492-firewire-fix-pci-resume-to-not-pass-in-a-__be32-config-rom.patch

As a workaround for either of the two resume bugs, one can
(1) unplug the disk,
(2) unload firewire-ohci before suspend or after resume,
(3) reload firewire-ohci,
(4) plug the disk back in.

2. Now to the other more urgent part of the report.

> With the current kernel, any kind of heavy activity, such as reading
> or writing a 10GB file on the disk results in a multitude of *abort*
> messages in dmesg and /var/log/messages as well as very slow
> transfers.
>
> The old, default FW stack shipping with Linus' kernel works fine.
>
> This behaviour has been widely reported and is easily reproducible, [...]

Command abortion in the middle of I/O, when the bus isn't touched otherwise (i.e. nothing else is plugged in or out in the FireWire bus during that) could mean that your bus is electrically unstable.

Please post the output of dmesg, either in whole if you are unsure what's relevant, or the parts that you are sure of are from when the firewire drivers are loaded, from when the disk was plugged in, and from when the I/O errors happened.

If it is an electric problem, we can't ultimately fix it in software. We can and should however try to make the new drivers at least as robust against occasional (legitimate) bus resets as the old drivers were.

PS: I already transferred many hundreds of GB with the new FireWire stack, using various 1394A and 1394B hardware. Just recently I moved 40GB disk images back and forth via FireWire.
Works all very fine for me, except that there is trouble recognizing a known buggy PL3507 based device (the old drivers have trouble with it too but exhibit different symptoms) and recognizing a very old 1394-1995 portable CD-RW. It should be noted that, compared to the old stack, the new stack drives SBP-2 devices in a more secure mode and gives better throughput on 1394A hardware.

output of lspci -vvxxx

00:0b.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link Layer Controller (rev 01) (prog-if 10 [OHCI])
        Subsystem: Texas Instruments TSB82AA2 IEEE-1394b Link Layer Controller
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (500ns min, 1000ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at e6004000 (32-bit, non-prefetchable) [size=2K]
        Region 1: Memory at e6000000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
00: 4c 10 25 80 06 00 10 02 01 10 00 0c 08 20 00 00
10: 00 40 00 e6 00 00 00 e6 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 4c 10 25 80
30: 00 00 00 00 44 00 00 00 00 00 00 00 05 01 02 04

I'll send the rest tonight.

My apologies, I hadn't noticed the "access denied" line.
Same command run as root

00:0b.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link Layer Controller (rev 01) (prog-if 10 [OHCI])
        Subsystem: Texas Instruments TSB82AA2 IEEE-1394b Link Layer Controller
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (500ns min, 1000ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at e6004000 (32-bit, non-prefetchable) [size=2K]
        Region 1: Memory at e6000000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME+
00: 4c 10 25 80 06 00 10 02 01 10 00 0c 08 20 00 00
10: 00 40 00 e6 00 00 00 e6 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 4c 10 25 80
30: 00 00 00 00 44 00 00 00 00 00 00 00 05 01 02 04
40: 00 00 00 00 01 00 02 7e 00 80 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 10 02 00 00 86 40 00 00 4c 10 25 80 00 00 01 00

Created attachment 215191 [details]
relevant part of message log during fw transfer
This happens during transfers.
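
For anyone comparing kernels, a transfer log such as the attachment above can be reduced to a few counts before and after a kernel change. The following is only a rough sketch: the function name `tally`, the pattern names, and the choice of patterns are assumptions made for illustration, and the sample lines are copied from the F8 log quoted later in this report.

```python
import re
from collections import Counter

# Illustrative patterns only; they match message fragments quoted in this report.
PATTERNS = {
    "sbp2_abort": re.compile(r"firewire_sbp2: sbp2_scsi_abort"),
    "offlined": re.compile(r"Device offlined"),
    "io_error": re.compile(r"end_request: I/O error"),
}


def tally(lines):
    """Count how many log lines match each pattern.

    Syslog's "last message repeated N times" lines are not expanded,
    so the counts are a lower bound.
    """
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample lines taken from the F8 log in this report.
SAMPLE = [
    "Dec 28 10:25:27 localhost kernel: firewire_sbp2: sbp2_scsi_abort",
    "Dec 28 10:25:37 localhost kernel: firewire_sbp2: sbp2_scsi_abort",
    "Dec 28 10:25:37 localhost kernel: sd 2:0:0:0: scsi: Device offlined - not ready after error recovery",
    "Dec 28 10:25:37 localhost kernel: end_request: I/O error, dev sdb, sector 375736200",
]

if __name__ == "__main__":
    print(dict(tally(SAMPLE)))
```

To use it on a real log, pass an open file object (for example `tally(open("/var/log/messages"))`); a kernel without the problem would be expected to show zero in every category after a large transfer.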
Created attachment 215201 [details]
Output from dmidecode
Let me know what else I can do. I've redone the transfer with the EzPlanet kernel and there is no error, and the transferred file is correct.

Re comment #8: "status write for unknown orb" should be fixed at least in fc8 kernel packages. See bug 242504.

Suspend/hibernate & resume should work (except perhaps on PPC, see bug 312871) with firewire drivers on a patchlevel like in 2.6.24-rc2. A small but important fix for a regression after 2.6.23 is pending for 2.6.24-rc3. I tested suspend and resume on an APM notebook and on an i945 based x86_64 PC; didn't test hibernation.

PS: This means drivers like in patchkit v577 or later at http://stein/~stefan/s5r6/linux1394/updates/.

If the original reporter of this issue could test and report with the latest rawhide kernel it would be appreciated.

# yum install kernel --enablerepo=development

Cheers
Chris

Hello,

I'm not even running F8 yet. Should I not try that first? I was about to convince myself to upgrade soon :-)

Hello all,

I'm sorry to report that I have the same issue again in F8.
Large transfers fail with many errors :

Dec 28 10:24:57 localhost kernel: firewire_sbp2: sbp2_scsi_abort
Dec 28 10:25:07 localhost kernel: firewire_sbp2: sbp2_scsi_abort
Dec 28 10:25:07 localhost kernel: sr 3:0:0:0: scsi: Device offlined - not ready after error recovery
Dec 28 10:25:07 localhost kernel: sr 3:0:0:0: rejecting I/O to offline device
Dec 28 10:25:27 localhost kernel:last message repeated 2 times
Dec 28 10:25:27 localhost kernel: firewire_sbp2: sbp2_scsi_abort
Dec 28 10:25:37 localhost kernel: firewire_sbp2: sbp2_scsi_abort
Dec 28 10:25:37 localhost kernel: sd 2:0:0:0: scsi: Device offlined - not ready after error recovery
Dec 28 10:25:37 localhost kernel: sd 2:0:0:0: [sdb] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Dec 28 10:25:37 localhost kernel: end_request: I/O error, dev sdb, sector 375736200
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block 16845150
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block 16845151
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block 16845152
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block 16845153
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block 16845154
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block 16845155
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block 16845156
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block 16845157
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: sd 2:0:0:0: rejecting I/O to offline device
Dec 28 10:25:37 localhost kernel:last message repeated 2 times
Dec 28 10:25:37 localhost kernel: sd 2:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Dec 28 10:25:37 localhost kernel: end_request: I/O error, dev sdb, sector 375737224
Dec 28 10:25:37 localhost kernel: sd 2:0:0:0: rejecting I/O to offline device
Dec 28 10:25:37 localhost kernel:last message repeated 112 times
Dec 28 10:25:37 localhost kernel: Aborting journal on device sdb2.
Dec 28 10:25:37 localhost kernel: __journal_remove_journal_head: freeing b_committed_data
Dec 28 10:25:38 localhost kernel: ext3_abort called.
Dec 28 10:25:38 localhost kernel: EXT3-fs error (device sdb2): ext3_journal_start_sb: Detected aborted journal
Dec 28 10:25:38 localhost kernel: Remounting filesystem read-only
Dec 28 10:25:40 localhost kernel: sd 2:0:0:0: rejecting I/O to offline device
Dec 28 10:25:43 localhost kernel:last message repeated 3 times
Dec 28 10:25:43 localhost kernel: printk: 355 messages suppressed.
Dec 28 10:25:43 localhost kernel: Buffer I/O error on device sdb2, logical block 16826771
Dec 28 10:25:43 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:43 localhost kernel: sd 2:0:0:0: rejecting I/O to offline device
Dec 28 10:25:48 localhost kernel:last message repeated 18 times
Dec 28 10:25:48 localhost kernel: printk: 18 messages suppressed.
Dec 28 10:25:48 localhost kernel: Buffer I/O error on device sdb2, logical block 4
Dec 28 10:25:48 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:48 localhost kernel: sd 2:0:0:0: rejecting I/O to offline device

This is very disappointing. I'm happy to help but installing Fedora takes hours from downloading the iso to finally getting to a testable system.
I'll try with the rawhide kernel.

OK, progress, I've just installed the latest RawHide kernel : 2.6.24-0.123.rc6.fc9

I've just transferred a 4GB iso image from one partition to the next on my external FW disk, with no error in the log. I'm checksumming right now, but previously this would fail with many error messages at the 1.5GB mark or so.

OK, the sums are identical, the transfer worked ! This is good. Congratulations and thanks to all involved.

Thanks for the feedback and for filing the original bug. Closing NEXTRELEASE

Cheers
Chris
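
The verification described in the final comments (copy a large file, then compare checksums of source and copy) could be scripted roughly as follows. This is a sketch under assumptions: the reporter did not say which checksum tool was used, so SHA-256 via Python's `hashlib` is a stand-in, and the helper names `file_digest` and `copies_match` are hypothetical.

```python
import hashlib


def file_digest(path, algo="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte images do not need to fit in memory."""
    digest = hashlib.new(algo)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path, copy_path):
    """True if both files have the same digest (assumed check, not the
    reporter's exact procedure)."""
    return file_digest(source_path) == file_digest(copy_path)
```

Calling `copies_match` with the path of the original iso and the path of its copy on the FireWire disk returns `True` only when the two digests agree. Note that a file just written may be served from the page cache, so a match is more convincing after the disk has been unmounted and remounted.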