Bug 262901

Summary:

Firewire stack in F7

Product:

[Fedora] Fedora

Reporter:

Hugues Talbot <hugues.talbot>

Component:

kernel

Assignee:

Jay Fenlason <fenlason>

Status:

CLOSED NEXTRELEASE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

CC:

cebbert, chris.brown, davej, fche, jfeeney, krh, stefan-r-rhbz, zaitcev

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

All

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2008-01-03 23:55:24 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
relevant part of message log during fw transfer	none
Output from dmidecode	none

Description Hugues Talbot 2007-08-29 10:46:23 UTC

Description of problem:

This mqy have been widely reported, but with the current F7 firewire stack,
nothing works correctly with respect to external disks

- connexion is intermittent
- when connected to the disk, lots of firewire abort messages in the logs, with
corresponding dropouts
- suspend/hibernate does not work, or more exactly on resume the disks are
unavailable


Version-Release number of selected component (if applicable):


How reproducible:

Every time.

Steps to Reproduce:
1. Boot, connect external disk to firewire port
2. Do some work on the disk, like some heavy transfer
3. Alternatively suspend/hibernate then resume
4. The disk become unavailable.
  
Actual results:

External disk unreachable. various firewire abort message on dmseg and
/var/log/messages.

Expected results:

Disk should continue working 

Additional info

00:0b.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link Layer
Controller (rev 01)

External disks are various LaCie and Western Digital MyBook Pro. Accessed with
FW800 and FW400.

The disks work find under FC6, and under FC7 with the EzPlanet alternative
kernel with the old ieee1394 kernel modules and libraries. 

Why RedHat chose to ship this buggy stack is beyond me.

Comment 1 Pete Zaitcev 2007-08-29 22:52:21 UTC

This is too general of a report, not actionable (e.g. if someone assigns this
to me, I won't be able to find a lot of time to spare on it). A more focused
kind of bug report would be welcome though.

Comment 2 Christopher Brown 2007-09-30 15:53:16 UTC

Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. As Pete said, there is
currently not enough detail to go on to begin troubleshooting the issue. Could
you attach the following:

# lspci -vvxxx
dmesg (entire contents dump plus any output after insertion of one specific
device - please indicate which one)

You might also like to test the latest kernel from rawhide which contains many
fixes which may resolve the issue for you.

With reference to the suspend/resume issues, could you indicate what machine
this is used on and perhaps review some of the following which may help:

# Find out if the system is locked up completely by hitting the caps lock key.

* If the capslock light doesn't toggle, the system is completely dead. Try
again, but this time before suspending, activate the pm_trace functionality with
echo 1 > /sys/power/pm_trace. This reprograms the real time clock to contain a
few bytes of information which we can use to diagnose which driver failed to
resume. After the hang, reboot, boot up again, and save the output of dmesg.

* If the capslock light does toggle, then the system did come back up, and it's
possible that we just failed to reinitialise the video.
http://people.freedesktop.org/~hughsient/quirk may contain further useful
information to diagnose this problem. It may also be useful to initiate the
suspend from a tty (ctrl-alt-f1) and run pm-suspend ; dmesg > dmesg.out ; sync
by hand. Upon resuming you'll now have some more debug info to sift through.
Additionally, this way when it resumes, you already have a console logged in
from which you can type commands 'blind'. Trying vbetool post for example may
bring things back to life. 

# Try rmmod'ing various modules before doing the suspend. If this makes things
work again, retry with a smaller set of modules unloaded. Keep retrying until
you narrow down which module is to blame.

# Another trick that sometimes works to force video to come back up is to enable
the BIOS password. This makes the system resume in a VGA text mode that the
kernel recovers from a lot easier. Not a real solution, but it can help to
diagnose other problems.

# Proprietary 3d graphics driver users should test with respective open source
drivers.

# Laptops using the nv driver should be considered hibernate-only capable as per
https://www.redhat.com/archives/fedora-test-list/2007-September/msg00365.html

If the problem no longer exists (which I doubt) then please close this bug as
CURRENTRELEASE indicating what resolved it for you or I'll do so in a few days
if there is no additional information lodged.

Comment 3 Hugues Talbot 2007-10-01 13:03:08 UTC

Hello,

OK, then let's focus the report. Forget about suspend/resume.

I have an external WD MyBook FW800 disk plugged into this PCI card (output of lspci)

00:0b.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link Layer
Controller (rev 01)

I'm running F7 x86_64 on an Athlon 64 3000+ with 2GB of RAM.

With the current kernel, any kind of heavy activity, such as reading or writing
a 10GB file on the disk results in a multitude of *abort* messages in dmesg and
/var/log/messages as well as very slow transfers.

The old, default FW stack shipping with Linus' kernel works fine.

This behaviour has been widely reported and is easily reproducible, as evidenced
by the availability of the EzPlanet kernel that does nothing but replace the RH
firewire stack with the old one.

http://www.ezplanetone.com/xwiki/bin/view/KnowledgeBase/BrokenFC7FireWire

Let me know if you need more. The suspend/resume issue is of secondary
importance. A working FW stack is vital !

Comment 4 Christopher Brown 2007-10-01 13:57:22 UTC

I appreciate that off-bugzilla you have commented you cannot afford to run
rawhide kernels however as Pete commented there is currently nothing here for
them to go on. I appreciate that not having hardware work in the way you wish is
frustrating but bug reports must be substantiated by debuggable output. So far,
despite Pete and my requests you have not attached any. Again, please can you
run the following:

# lspci -vvxxx
# dmidecode (you may need to install this)
dmesg


and attach as separate text/plain attachments (does not need to be from a
rawhide kernel) and post with any additional constructive comments you feel are
prudent. There are valid reasons for shipping a new firewire stack - the old one
was equally problematic (if not more so) however yours is a case of an older
stack working whilst the new one does not (yet). In this case I would counsel
one of two things:

-Use Fedora Core 6 or *EL 5
-Work to get the new code functioning quicker - feedback much appreciated

I have re-assigned this bug to the firewire maintainers who can review it once
the additional debugging information requested has been provided. Until then
"Firewire doesn't work" is simply not good enough.

Cheers
Chris

Comment 5 Stefan Richter 2007-10-01 16:19:33 UTC

Chris, thanks once more for adding me to the CC list.

1.  The problem 'disk not accessible after resume' is a known bug and fixed in
Andrew Morton's -mm test kernel by a patch provided by Kristian H.  Alas this
fix uncovers an additional bug which happens only with _some_ SBP-2 devices
(about half of my devices):  An APM notebook of mine locks up soon after resume
if one of those devices was connected before and after suspend/resume.  Because
this second bug is more serious than the former one, I haven't forwarded
Kristian's fix to Linus yet.  I have to try to set up netconsole or get
suspend/resume working on a non-APM machine for testing purposes or whatever to
debug that second problem.  Alas I couldn't afford the time to do so yet.

This is Kristian's fix:
http://git.kernel.org/?p=linux/kernel/git/ieee1394/linux1394-2.6.git;a=commit;h=3743cc44c82342e744aa564f27c3e1fdd74ba3d0
(this URL will become invalid during next merge to Linus) or
http://me.in-berlin.de/~s5r6/linux1394/submitted/492-firewire-fix-pci-resume-to-not-pass-in-a-__be32-config-rom.patch

As a workaround for either of the two resume bugs, one can (1) unplug the disk,
(2) unload firewire-ohci before suspend or after resume, (3) reload
firewire-ohci, (4) plug the disk back in.


2. Now to the other more urgent part of the report.

> With the current kernel, any kind of heavy activity, such as reading
> or writing a 10GB file on the disk results in a multitude of *abort*
> messages in dmesg and /var/log/messages as well as very slow
> transfers.
> 
> The old, default FW stack shipping with Linus' kernel works fine.
> 
> This behaviour has been widely reported and is easily reproducible,
[...]

Command abortion in the middle of I/O, when the bus isn't touched otherwise
(i.e. nothing else is plugged in or out in the FireWire bus during that) could
mean that your bus is electrically unstable.  Please post the output of dmesg,
either in whole if you are unsure what's relevant, or the parts that you are
sure of are from when the firewire drivers are loaded, from when the disk was
plugged in, and from when the I/O errors happened.

If it is an electric problem, we can't ultimately fix it in software.  We can
and should however try to make the new drivers at least as robust against
occasional (legitimate) bus resets as the old drivers were.

PS:
I already transferred many hundreds of GB with the new FireWire stack, using
various 1394A and 1394B hardware.  Just recently I moved 40GB disk images back
and forth via FireWire.  Works all very fine for me, except that there is
trouble recognizing a known buggy PL3507 based device (the old drivers have
trouble with it too but exhibit different symptoms) and recognizing a very old
1394-1995 portable CD-RW.  It should be noted that, compared to the old stack,
the new stack drives SBP-2 devices in a more secure mode and gives better
throughput on 1394A hardware.

Comment 6 Hugues Talbot 2007-10-02 12:22:10 UTC

output of lspci -vvxxx

00:0b.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link Layer
Controller (rev 01) (prog-if 10 [OHCI])
        Subsystem: Texas Instruments TSB82AA2 IEEE-1394b Link Layer Controller
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (500ns min, 1000ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at e6004000 (32-bit, non-prefetchable) [size=2K]
        Region 1: Memory at e6000000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
00: 4c 10 25 80 06 00 10 02 01 10 00 0c 08 20 00 00
10: 00 40 00 e6 00 00 00 e6 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 4c 10 25 80
30: 00 00 00 00 44 00 00 00 00 00 00 00 05 01 02 04


I'll send the rest tonight.

Comment 7 Hugues Talbot 2007-10-02 13:04:46 UTC

My apologies, I hadn't noticed the "access denied" line. Same command run as root

00:0b.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link Layer
Controller (rev 01) (prog-if 10 [OHCI])
        Subsystem: Texas Instruments TSB82AA2 IEEE-1394b Link Layer Controller
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (500ns min, 1000ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at e6004000 (32-bit, non-prefetchable) [size=2K]
        Region 1: Memory at e6000000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0+,D1+,D2+,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME+

00: 4c 10 25 80 06 00 10 02 01 10 00 0c 08 20 00 00
10: 00 40 00 e6 00 00 00 e6 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 4c 10 25 80
30: 00 00 00 00 44 00 00 00 00 00 00 00 05 01 02 04
40: 00 00 00 00 01 00 02 7e 00 80 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 10 02 00 00 86 40 00 00 4c 10 25 80 00 00 01 00

Comment 8 Hugues Talbot 2007-10-03 21:46:23 UTC

Created attachment 215191 [details]
relevant part of message log during fw transfer

This happens during transfers.

Comment 9 Hugues Talbot 2007-10-03 22:03:00 UTC

Created attachment 215201 [details]
Output from dmidecode

Comment 10 Hugues Talbot 2007-10-03 22:15:35 UTC

Let me know what else I can do. I've redone the transfer with the EzPlanet
kernel and there is no error, and the transferred file is correct.

Comment 11 Stefan Richter 2007-10-03 23:08:32 UTC

Re comment #8: "status write for unknown orb" should be fixed a least in fc8
kernel packages.  See bug 242504.

Comment 12 Stefan Richter 2007-11-08 22:24:22 UTC

Suspend/hibernate & resume should work (except perhaps on PPC, see bug 312871)
with firewire drivers on a patchlevel like in 2.6.24-rc2.  A small but important
fix for a regression after 2.6.23 is pending for 2.6.24-rc3.  I tested suspend
and resume on an APM notebook and on a i945 based x86_64 PC; didn't test
hibernation.

Comment 13 Stefan Richter 2007-11-08 22:27:50 UTC

PS:  This means drivers like in patchkit v577 or later at
http://stein/~stefan/s5r6/linux1394/updates/.

Comment 14 Stefan Richter 2007-11-08 22:28:40 UTC

Woops, http://me.in-berlin.de/~s5r6/linux1394/updates/ :-)

Comment 15 Christopher Brown 2007-12-13 18:39:33 UTC

If the original reporter of this issue could test and report with the latest
rawhide kernel it would be appreciated.

# yum install kernel --enablerepo=development

Cheers
Chris

Comment 16 Hugues Talbot 2007-12-13 22:06:29 UTC

Hello, 

I'm not even running F8 yet. Should I not try that first? I was about to
convince myself to upgrade soon :-)

Comment 17 Hugues Talbot 2007-12-28 09:44:06 UTC

Hello all,

I'm sorry to report that I have the same issue again in F8. Large transfers fail
with many errors :


Dec 28 10:24:57 localhost kernel: firewire_sbp2: sbp2_scsi_abort
Dec 28 10:25:07 localhost kernel: firewire_sbp2: sbp2_scsi_abort
Dec 28 10:25:07 localhost kernel: sr 3:0:0:0: scsi: Device offlined - not ready
after error recovery 
Dec 28 10:25:07 localhost kernel: sr 3:0:0:0: rejecting I/O to offline device
Dec 28 10:25:27 localhost kernel:last message repeated 2 times
Dec 28 10:25:27 localhost kernel: firewire_sbp2: sbp2_scsi_abort
Dec 28 10:25:37 localhost kernel: firewire_sbp2: sbp2_scsi_abort 
Dec 28 10:25:37 localhost kernel: sd 2:0:0:0: scsi: Device offlined - not ready
after error recovery
Dec 28 10:25:37 localhost kernel: sd 2:0:0:0: [sdb] Result:
hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK 
Dec 28 10:25:37 localhost kernel: end_request: I/O error, dev sdb, sector 375736200
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block
16845150
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2 
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block
16845151
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block
16845152 
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block
16845153
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2 
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block
16845154
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block
16845155 
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block
16845156
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2 
Dec 28 10:25:37 localhost kernel: Buffer I/O error on device sdb2, logical block
16845157
Dec 28 10:25:37 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:37 localhost kernel: sd 2:0:0:0: rejecting I/O to offline device 
Dec 28 10:25:37 localhost kernel:last message repeated 2 times
Dec 28 10:25:37 localhost kernel: sd 2:0:0:0: [sdb] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Dec 28 10:25:37 localhost kernel: end_request: I/O error, dev sdb, sector 375737224 
Dec 28 10:25:37 localhost kernel: sd 2:0:0:0: rejecting I/O to offline device
Dec 28 10:25:37 localhost kernel:last message repeated 112 times
Dec 28 10:25:37 localhost kernel: Aborting journal on device sdb2.
Dec 28 10:25:37 localhost kernel: __journal_remove_journal_head: freeing
b_committed_data
Dec 28 10:25:38 localhost kernel: ext3_abort called.
Dec 28 10:25:38 localhost kernel: EXT3-fs error (device sdb2):
ext3_journal_start_sb: Detected aborted journal 
Dec 28 10:25:38 localhost kernel: Remounting filesystem read-only
Dec 28 10:25:40 localhost kernel: sd 2:0:0:0: rejecting I/O to offline device
Dec 28 10:25:43 localhost kernel:last message repeated 3 times
Dec 28 10:25:43 localhost kernel: printk: 355 messages suppressed. 
Dec 28 10:25:43 localhost kernel: Buffer I/O error on device sdb2, logical block
16826771
Dec 28 10:25:43 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:43 localhost kernel: sd 2:0:0:0: rejecting I/O to offline device 
Dec 28 10:25:48 localhost kernel:last message repeated 18 times
Dec 28 10:25:48 localhost kernel: printk: 18 messages suppressed.
Dec 28 10:25:48 localhost kernel: Buffer I/O error on device sdb2, logical block 4 
Dec 28 10:25:48 localhost kernel: lost page write due to I/O error on sdb2
Dec 28 10:25:48 localhost kernel: sd 2:0:0:0: rejecting I/O to offline device



This is very disappointing. I'm happy to help but installing Fedora takes hours
from downloading the iso to finally getting to a testable system.

I'll try with the rawhide kernel.

Comment 18 Hugues Talbot 2007-12-28 10:11:36 UTC

OK, progress,


I've just installed the latest RawHide kernel : 2.6.24-0.123.rc6.fc9

I've just transferred a 4GB iso image from one partition to the next on my
external FW disk, with no error in the log. I'm checksumming right now, but
previously this would fail with many error messages at the 1.5GB mark or so.

OK, the sums are indentical, the transfer worked !

This is good. Congratulations and thanks to all involved.

Comment 19 Christopher Brown 2008-01-03 23:55:24 UTC

Thanks for the feedback and for filing the original bug. Closing NEXTRELEASE

Cheers
Chris