718077 – usb stalled endpoint disconnected some drives

Summary: usb stalled endpoint disconnected some drives

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	15
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-06-30 22:03 UTC by Carl Byington
Modified:	2012-07-11 17:52 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-07-11 17:52:43 UTC
Type:	---
Embargoed:
Dependent Products:

Description Carl Byington 2011-06-30 22:03:58 UTC

Description of problem:
Dell Inc. Vostro 230 w/ 4 Western Digital MyBook usb3 drives connected via two dual-port buffalo usb cards:

01:00.0 USB Controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03) (prog-if 30)
        Subsystem: Melco Inc Device 0241
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at fe9fe000 (64-bit, non-prefetchable) [size=8K]
        Capabilities: [50] Power Management version 3
        Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
        Capabilities: [90] MSI-X: Enable+ Count=8 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Device Serial Number ff-ff-ff-ff-ff-ff-ff-ff
        Capabilities: [150] #18
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci-hcd

The system had been running for about 2.5 months when the kernel logged:

Jun 28 12:01:20 media2 kernel: [5707721.711858] xhci_hcd 0000:03:00.0: WARN: Stalled endpoint
Jun 28 12:01:20 media2 kernel: [5707721.712129] xhci_hcd 0000:03:00.0: WARN: Stalled endpoint
Jun 28 12:08:28 media2 kernel: [5708149.922987] usb 6-1: USB disconnect, address 2
Jun 28 12:08:30 media2 kernel: [5708152.212478] usb 7-2: USB disconnect, address 2
Jun 28 12:13:30 media2 kernel: [5708452.657011] md/raid:md0: Disk failure on sdd1, disabling device.
Jun 28 12:13:30 media2 kernel: [5708452.657013] md/raid:md0: Operation continuing on 3 devices.
Jun 28 12:13:30 media2 kernel: [5708452.657028] md/raid:md0: Disk failure on sdc1, disabling device.
Jun 28 12:13:30 media2 kernel: [5708452.657029] md/raid:md0: Operation continuing on 2 devices.
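A minimal sketch of how the three event types in the excerpt above (stalled endpoint, USB disconnect, md disk failure) could be pulled out of a longer kernel log. The function name usb_raid_events is made up for illustration, and the log path in the usage comment is an assumption about where syslog writes on this system.

```shell
# Filter a kernel log on stdin down to the three event types seen above.
# The patterns are taken from the message text in this report.
usb_raid_events() {
    grep -E 'xhci_hcd .*Stalled endpoint|USB disconnect, address|md/raid:.*Disk failure'
}

# Usage (log path is an assumption):
#   usb_raid_events < /var/log/messages
```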

That dropped 2 of the 4 devices from the raid5. Since raid5 tolerates only one failed member, that took the whole array down. It was recovered by:

mdadm --stop /dev/md0
mdadm --assemble --force -v /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
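Before and after a forced reassembly like the one above, it can help to confirm which arrays the kernel considers degraded. The following is a rough sketch that reads /proc/mdstat-style text and prints any array whose member map (for example [UU__]) contains a missing member. The function name degraded_arrays is hypothetical, and the parsing assumes the usual two-line mdstat layout.

```shell
# Print "<array> <member map>" for every md array with a missing member.
# Reads /proc/mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the current array name
        match($0, /\[[U_]+\]/) {               # member map such as [UUUU] or [U__U]
            state = substr($0, RSTART, RLENGTH)
            if (index(state, "_") > 0) print name, state
        }
    '
}

# Usage:
#   degraded_arrays < /proc/mdstat
```

No output means every listed array has all members present.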



Version-Release number of selected component (if applicable):
2.6.38.6-26.rc1.fc15.x86_64

How reproducible:
Seems to be very rare - that system had been running for about 2.5 months, serving that 8TB raid over smb and nfs.


Steps to Reproduce:
1. create raid5 over multiple usb3 drives
  
Actual results:
Eventual stall drops drives from the raid bundle.

Comment 1 Yun-Fong Loh 2011-07-21 16:17:32 UTC

Might this be a duplicate of:

Bug 663186?

Comment 2 Dave Jones 2011-12-06 19:08:50 UTC

Is this still a problem with the latest updates?

Comment 3 Carl Byington 2011-12-09 16:23:10 UTC

That happened again a few weeks ago, running 2.6.38.8-32.fc15.x86_64

I have not been able to upgrade to 2.6.4x because the last time I tried that, the machine would not reboot unattended. There seems to be a timing error, where something is not waiting long enough for the md raid device to become ready.

4 usb drives, a raid5 device over those, then an lvm physical volume on the md device, an lvm volume group containing that physical volume, a logical volume in the volume group, and ext4 on the logical volume. That all boots nicely on 2.6.38.8-3{2,5}, but 2.6.40.4-5.fc15 times out somewhere and the end result is that my /mnt/share does not appear. As I recall, a manual mount command worked. I could not see how to fix the timing problem, so for now that machine is still running 2.6.38.8-32.

I will have access to the machine for testing this weekend if there is anything I can do to help debug this.

Comment 4 Carl Byington 2011-12-12 00:37:02 UTC

Ok, the machine is now running 2.6.41.4-1.fc15.x86_64 (set to reboot into 2.6.38.8-35.fc15.x86_64 if they get a power failure/restore).

However, 2.6.41.4 fails to mount the (4xUsb) - md127 - volume group. It needed a manual "vgchange -a y lvm-media" before "mount /mnt/media" would work.
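If the cause really is a boot-time race with the md device, the manual workaround above could be scripted roughly as follows. This is only a sketch under that assumption: wait_for_path and activate_media are hypothetical helper names, and the 60 second timeout is an arbitrary choice. The vgchange and mount commands are the ones quoted above.

```shell
# Poll once per second until PATH exists, giving up after TRIES attempts.
wait_for_path() {
    path=$1
    tries=$2
    while [ "$tries" -gt 0 ]; do
        [ -e "$path" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Wait for the md device, then activate the volume group and mount it.
# Needs root; the 60 second limit is an assumption, not a measured value.
activate_media() {
    wait_for_path /dev/md127 60 || return 1
    vgchange -a y lvm-media && mount /mnt/media
}
```

Something like this would have to run late in boot, after the usb drives have been enumerated.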

pvdisplay /dev/md127
  --- Physical volume ---
  PV Name               /dev/md127
  VG Name               lvm-media
  PV Size               8.19 TiB / not usable 4.50 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              2146164
  Free PE               0
  Allocated PE          2146164
  PV UUID               YA1a5L-Vjye-6GF2-9fgB-y2eb-D0pY-T22uxE
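As a quick consistency check on the figures above, the PV size should equal Total PE times PE Size. A small sketch (pe_to_tib is a made-up helper name):

```shell
# Convert a physical-extent count and extent size (MiB) to TiB, 2 decimals.
pe_to_tib() {
    awk -v pe="$1" -v size="$2" 'BEGIN { printf "%.2f\n", pe * size / (1024 * 1024) }'
}

pe_to_tib 2146164 4   # prints 8.19
```

2146164 extents of 4.00 MiB come to 8584656 MiB, which matches the reported 8.19 TiB.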


vgdisplay --verbose lvm-media
    Using volume group(s) on command line
    Finding volume group "lvm-media"
  --- Volume group ---
  VG Name               lvm-media
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  2
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               8.19 TiB
  PE Size               4.00 MiB
  Total PE              2146164
  Alloc PE / Size       2146164 / 8.19 TiB
  Free  PE / Size       0 / 0   
  VG UUID               qw5eiA-dWzr-Q3Jo-sDMs-Lafg-n2hJ-FjPOMr
   
  --- Logical volume ---
  LV Name                /dev/lvm-media/media
  VG Name                lvm-media
  LV UUID                GyyyYc-6pHi-1KT2-IGor-1vKy-5T2E-JZ72L1
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                8.19 TiB
  Current LE             2146164
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     6144
  Block device           253:3
   
  --- Physical volumes ---
  PV Name               /dev/md127     
  PV UUID               YA1a5L-Vjye-6GF2-9fgB-y2eb-D0pY-T22uxE
  PV Status             allocatable
  Total PE / Free PE    2146164 / 0

After boot with 2.6.41.4, /dev/lvm-media/media LV Status was NOT available. But a simple vgchange made it available and mountable. Booting with 2.6.38.8-35 that volume comes up available, and automounts via /etc/fstab. I don't see anything strange in the logs to cause this. Let me know if there is anything else that you need.

Comment 5 Josh Boyer 2012-06-06 15:39:29 UTC

Is this still a problem with the 2.6.43/3.3 kernel updates?

Comment 6 Carl Byington 2012-06-07 05:05:16 UTC

Similar issues (but not identical symptoms) with usb stalls on usb3 drives:

Dell Inc. Vostro 230 w/ 4 Western Digital MyBook usb3 drives connected via two dual-port buffalo usb cards:

01:00.0 USB Controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03) (prog-if 30)

on all of:

2.6.42.9-2.fc15.x86_64
2.6.42.12-1.fc15.x86_64
3.3.5-2.fc16.x86_64
3.3.7-1.fc16.x86_64

We have four Dell Vostros (two 230s, two 260s) running those four versions. All four machines have 4 WD 3TB drives in a raid5 config. They all get usb stalls leading to drives dropping off the raid system if we run them at full speed. They are currently throttled to effectively usb2 speeds.
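The report does not say how the throttling was done. To confirm what link speed each device actually negotiated, the sysfs speed attribute (in Mbit/s, e.g. 480 for usb2 and 5000 for usb3) can be listed. A minimal sketch, with usb_speeds as a hypothetical helper name:

```shell
# Print "<device> <speed in Mbit/s>" for every usb device under ROOT.
# ROOT defaults to the standard sysfs location.
usb_speeds() {
    root=${1:-/sys/bus/usb/devices}
    for f in "$root"/*/speed; do
        [ -r "$f" ] || continue                 # skip if the glob matched nothing
        dev=$(basename "$(dirname "$f")")
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}

# Usage:
#   usb_speeds
```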

I will be moving the oldest one (2.6.42.9) to Fedora 17 tomorrow to see if that helps.

The particular issue of the mount failing on boot (even when all the drives are working properly and the raid is healthy) has not occurred recently. That presumably got fixed somewhere before 2.6.42.9.

Comment 7 Josh Boyer 2012-07-11 17:52:43 UTC

Fedora 15 has reached its end of life as of June 26, 2012.  As a result, we will not be fixing any remaining bugs found in Fedora 15.

In the event that you have upgraded to a newer release and the bug you reported is still present, please reopen the bug and set the version field to the newest release you have encountered the issue with.  Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered.

Thank you for taking the time to file a report.  We hope newer versions of Fedora suit your needs.
