Bug 849966 - USB disk data corruption on unmount due to xHCI driver error (xhci_drop_endpoint called with disabled ep)
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-08-21 11:17 UTC by scd
Modified: 2013-02-13 15:41 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-02-13 15:40:37 UTC
Type: Bug
Embargoed:



Description scd 2012-08-21 11:17:45 UTC
Description of problem:
After copying large files (ca. 8 GB) to encrypted USB disks attached to a USB 3 hub, unmounting the disk frequently fails and the kernel issues the message

xHCI xhci_drop_endpoint called with disabled ep...

followed by SCSI driver errors / file system corruption: 

[sdc] Unhandled error code
sd 10:0:0:0: [sdc]  Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
sd 10:0:0:0: [sdc] CDB: Write(10): 2a 00 1d 04 18 00 00 00 08 00
end_request: I/O error, dev sdc, sector 486807552
Buffer I/O error on device dm-2, logical block 60850176
lost page write due to I/O error on dm-2
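The CDB in the third log line can be cross-checked against the `end_request` line. Assuming the standard SCSI WRITE(10) layout (opcode 0x2a, a 4-byte big-endian LBA in bytes 2-5, and a 2-byte transfer length in bytes 7-8), the failed command targeted sector 486807552 and covered 8 sectors of 512 bytes, i.e. one 4 KiB page, which matches the single "lost page write" message. `decode_write10` is a hypothetical helper written for illustration, not part of any tool mentioned in this report.

```python
def decode_write10(cdb_hex: str) -> dict:
    """Decode a SCSI WRITE(10) command descriptor block given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between byte pairs is ignored
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return {"lba": lba, "blocks": blocks}


info = decode_write10("2a 00 1d 04 18 00 00 00 08 00")
print(info)  # {'lba': 486807552, 'blocks': 8}
```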

Version-Release number of selected component (if applicable):
All Fedora 16 kernels after about 3.0

Steps to Reproduce:
1. Create an ext4 filesystem on top of a LUKS-encrypted volume on each of two USB 3 disks.
2. Attach both disks to a USB 3 hub.
3. Copy a 6-8 GB file to both disks simultaneously.
4. Try to unmount / detach the disks via D-Bus.
The error occurs with a probability of about 50%.
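Step 3 can be scripted so that both disks receive writes at the same time. The sketch below is a hypothetical helper, not something the reporter used: it starts one copy thread per destination directory and waits for both. The destination paths are assumptions (e.g. the mount points the desktop creates for the two LUKS volumes), and step 4 (unmounting via D-Bus or `umount`) is deliberately left out. With the default `sync=False`, the data stays in the page cache until unmount, as in the report; with `sync=True`, `fsync` forces writeback so that an I/O error surfaces during the copy instead.

```python
import os
import shutil
import threading


def copy_file(src: str, dest_dir: str, sync: bool) -> str:
    """Copy src into dest_dir; optionally fsync so writeback errors surface here."""
    dest = os.path.join(dest_dir, os.path.basename(src))
    with open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout, 1 << 20)  # copy in 1 MiB chunks
        if sync:
            fout.flush()
            os.fsync(fout.fileno())  # raises OSError if writeback fails
    return dest


def copy_concurrently(src: str, dest_dirs: list, sync: bool = False) -> list:
    """Copy src to every destination at the same time, one thread per disk."""
    results = [None] * len(dest_dirs)
    errors = []

    def worker(i: int, dest_dir: str) -> None:
        try:
            results[i] = copy_file(src, dest_dir, sync)
        except OSError as exc:
            errors.append((dest_dir, exc))

    threads = [threading.Thread(target=worker, args=(i, d))
               for i, d in enumerate(dest_dirs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        dest_dir, exc = errors[0]
        raise OSError(f"copy to {dest_dir} failed") from exc
    return results
```

A call such as `copy_concurrently("/tmp/big.img", ["/media/usb1", "/media/usb2"])` would cover step 3; all three paths there are hypothetical.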
  
Actual results:
File system corruption, possible data loss.

Expected results:
Filesystems must be successfully flushed to USB3 disks, and then unmounted cleanly.

Additional info:
I consider this a critical bug with high fixing priority:
o data loss due to filesystem corruption can occur (I had one filesystem's root
  directory destroyed), which is especially nasty because USB drives are frequently
  used as media for storing important backups
o the severity of the bug is completely masked if the GUI is used (i.e. "Safely
  Remove Drive" is clicked): the operation hangs for about 20 s, then a rather
  uninformative error message such as "cannot unmount drive" is shown, but the
  user is never told that the drive suffered a hard I/O error and might be
  corrupted!
o increasing the usbfs memory limit with e.g.

  modprobe usbcore usbfs_memory_mb=1000

  does not provide a workaround.
o I saw discussions on two Linux boards saying that the bug is known and that a
  fix might be available in upstream releases of the kernel / libusb; please
  backport this fix to Fedora 16 ASAP!
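Because the "Safely Remove Drive" dialog reports only a generic failure, one way to check afterwards whether a write was lost is to scan the kernel log for the messages quoted in the description. The sketch below is a hypothetical helper; its patterns are taken only from the messages in this report and are not an exhaustive list of block-layer errors.

```python
import re

# Kernel messages from this report that indicate data may not have reached the disk.
ERROR_PATTERNS = [
    re.compile(r"xhci_drop_endpoint called with disabled ep"),
    re.compile(r"end_request: I/O error, dev (\S+), sector (\d+)"),
    re.compile(r"Buffer I/O error on device (\S+), logical block (\d+)"),
    re.compile(r"lost page write due to I/O error on (\S+)"),
]


def find_io_errors(log_lines):
    """Return the kernel log lines that match any of the error patterns."""
    return [line for line in log_lines
            if any(p.search(line) for p in ERROR_PATTERNS)]
```

After a failed unmount, the input could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout.splitlines()`; on some systems reading the kernel log requires root. Any match means the drive should be checked with `fsck` before it is trusted.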

Hardware: LaCie / Hitachi HDDs attached to a Samsung 900X3A notebook.

Thanks,
Stefan

Comment 1 Justin M. Forbes 2012-08-24 16:57:41 UTC
What is the most recent kernel you have seen this on?

Comment 2 scd 2012-08-24 17:14:21 UTC
Sorry for the missing info; it's 3.4.9-1.

Meanwhile, I experienced the bug once even without writing anything to the USB drives; just having them attached and mounted for a while was enough.

Comment 3 James Harrion 2012-09-21 15:58:51 UTC
I think I have the same problem. Below is a trace from /var/log/messages.

Machine running Fedora 17 with the latest patches as of 21 September 2012.

[root@LP000138 dev]# uname -a
Linux xxx.com 3.5.3-1.fc17.x86_64 #1 SMP Wed Aug 29 18:46:34 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Plugging a USB 3 external drive into a Dell Latitude E5430:


Sep 21 16:45:37 LP000138 kernel: [  384.217898] usb 2-1.8.2: USB disconnect, device number 6
Sep 21 16:45:51 LP000138 kernel: [  397.838596] usb 4-2: new SuperSpeed USB device number 13 using xhci_hcd
Sep 21 16:45:51 LP000138 kernel: [  397.850754] usb 4-2: New USB device found, idVendor=1058, idProduct=0730
Sep 21 16:45:51 LP000138 kernel: [  397.850761] usb 4-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Sep 21 16:45:51 LP000138 kernel: [  397.850765] usb 4-2: Product: My Passport 0730
Sep 21 16:45:51 LP000138 kernel: [  397.850768] usb 4-2: Manufacturer: Western Digital
Sep 21 16:45:51 LP000138 kernel: [  397.850771] usb 4-2: SerialNumber: 575838314134315435343738
Sep 21 16:45:51 LP000138 kernel: [  397.851783] scsi10 : usb-storage 4-2:1.0
Sep 21 16:45:51 LP000138 mtp-probe: checking bus 4, device 13: "/sys/devices/pci0000:00/0000:00:14.0/usb4/4-2"
Sep 21 16:45:51 LP000138 mtp-probe: bus: 4, device: 13 was not an MTP device
Sep 21 16:46:02 LP000138 kernel: [  408.851792] scsi 10:0:0:0: Direct-Access     WD       My Passport 0730 1016 PQ: 0 ANSI: 6
Sep 21 16:46:02 LP000138 kernel: [  408.853261] sd 10:0:0:0: Attached scsi generic sg2 type 0
Sep 21 16:46:02 LP000138 kernel: [  408.853464] sd 10:0:0:0: [sdb] 1953458176 512-byte logical blocks: (1.00 TB/931 GiB)
Sep 21 16:46:02 LP000138 kernel: [  408.853634] sd 10:0:0:0: [sdb] Write Protect is off
Sep 21 16:46:02 LP000138 kernel: [  408.853790] sd 10:0:0:0: [sdb] No Caching mode page present
Sep 21 16:46:02 LP000138 kernel: [  408.853795] sd 10:0:0:0: [sdb] Assuming drive cache: write through
Sep 21 16:46:02 LP000138 kernel: [  408.854507] sd 10:0:0:0: [sdb] No Caching mode page present
Sep 21 16:46:02 LP000138 kernel: [  408.854513] sd 10:0:0:0: [sdb] Assuming drive cache: write through
Sep 21 16:46:02 LP000138 kernel: [  408.864533]  sdb:
Sep 21 16:46:02 LP000138 kernel: [  408.865243] sd 10:0:0:0: [sdb] No Caching mode page present
Sep 21 16:46:02 LP000138 kernel: [  408.865249] sd 10:0:0:0: [sdb] Assuming drive cache: write through
Sep 21 16:46:02 LP000138 kernel: [  408.865254] sd 10:0:0:0: [sdb] Attached SCSI disk
Sep 21 16:46:02 LP000138 kernel: [  408.869471] usb 4-2: Disable of device-initiated U1 failed.
Sep 21 16:46:02 LP000138 kernel: [  408.869544] usb 4-2: Disable of device-initiated U2 failed.
Sep 21 16:46:02 LP000138 kernel: [  408.971140] usb 4-2: Device not responding to set address.
Sep 21 16:46:02 LP000138 kernel: [  409.171888] usb 4-2: Device not responding to set address.
Sep 21 16:46:03 LP000138 kernel: [  409.372762] usb 4-2: device not accepting address 13, error -71
Sep 21 16:46:03 LP000138 kernel: [  409.474820] usb 4-2: Device not responding to set address.
Sep 21 16:46:03 LP000138 kernel: [  409.675726] usb 4-2: Device not responding to set address.
Sep 21 16:46:03 LP000138 kernel: [  409.876604] usb 4-2: device not accepting address 13, error -71
Sep 21 16:46:03 LP000138 kernel: [  409.978673] usb 4-2: Device not responding to set address.
Sep 21 16:46:03 LP000138 kernel: [  410.179580] usb 4-2: Device not responding to set address.
Sep 21 16:46:04 LP000138 kernel: [  410.380427] usb 4-2: device not accepting address 13, error -71
Sep 21 16:46:04 LP000138 kernel: [  410.482512] usb 4-2: Device not responding to set address.
Sep 21 16:46:04 LP000138 kernel: [  410.683379] usb 4-2: Device not responding to set address.
Sep 21 16:46:04 LP000138 kernel: [  410.884248] usb 4-2: device not accepting address 13, error -71
Sep 21 16:46:04 LP000138 kernel: [  410.884304] usb 4-2: USB disconnect, device number 13
Sep 21 16:46:04 LP000138 kernel: [  410.885735] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff8801e18279c0
Sep 21 16:46:04 LP000138 kernel: [  410.885744] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff8801e1827980
Sep 21 16:46:04 LP000138 udevd[3097]: inotify_add_watch(6, /dev/sdb, 10) failed: No such file or directory
Sep 21 16:46:04 LP000138 kernel: [  410.987331] usb 4-2: Device not responding to set address.
Sep 21 16:46:04 LP000138 kernel: [  411.188216] usb 4-2: Device not responding to set address.
Sep 21 16:46:05 LP000138 kernel: [  411.389114] usb 4-2: device not accepting address 14, error -71
Sep 21 16:46:05 LP000138 kernel: [  411.491163] usb 4-2: Device not responding to set address.
Sep 21 16:46:05 LP000138 kernel: [  411.692060] usb 4-2: Device not responding to set address.
Sep 21 16:46:05 LP000138 kernel: [  411.892911] usb 4-2: device not accepting address 15, error -71
Sep 21 16:46:05 LP000138 kernel: [  411.994978] usb 4-2: Device not responding to set address.
Sep 21 16:46:05 LP000138 kernel: [  412.195913] usb 4-2: Device not responding to set address.
Sep 21 16:46:06 LP000138 kernel: [  412.396781] usb 4-2: device not accepting address 16, error -71
Sep 21 16:46:06 LP000138 kernel: [  412.498800] usb 4-2: Device not responding to set address.
Sep 21 16:46:06 LP000138 kernel: [  412.699756] usb 4-2: Device not responding to set address.
Sep 21 16:46:06 LP000138 kernel: [  412.900598] usb 4-2: device not accepting address 17, error -71
Sep 21 16:46:06 LP000138 kernel: [  412.900626] hub 4-0:1.0: unable to enumerate USB device on port 2
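In this log the drive is detected and attached as sdb, then repeatedly fails SET_ADDRESS with error -71, and the two `xhci_drop_endpoint called with disabled ep` warnings appear right after the "USB disconnect" line. Kernel functions return negated errno values, and on Linux 71 is EPROTO, which the kernel's USB error-code documentation associates with low-level failures such as no response from the device within the bus turn-around time. The hypothetical helper below tallies these failures by error name; the small errno table is an assumption that hard-codes Linux numbering.

```python
import re
from collections import Counter

# Small subset of Linux errno numbers (assumption: Linux numbering, not portable).
LINUX_ERRNO = {19: "ENODEV", 32: "EPIPE", 71: "EPROTO", 110: "ETIMEDOUT"}

ADDR_FAIL = re.compile(r"device not accepting address (\d+), error (-\d+)")


def summarize_enumeration_failures(log_lines):
    """Count 'device not accepting address' failures per error name."""
    counts = Counter()
    for line in log_lines:
        m = ADDR_FAIL.search(line)
        if m:
            code = -int(m.group(2))  # kernel reports -errno
            counts[LINUX_ERRNO.get(code, f"errno {code}")] += 1
    return counts
```

Applied to the log above, every one of the eight "device not accepting address" lines (addresses 13 to 17) reports error -71, so the tally is a single EPROTO entry.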

Comment 4 Dave Jones 2012-10-23 15:32:37 UTC
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug; a fresh bug referencing this bug in the comment is sufficient.)

Comment 5 scd 2012-11-04 14:59:05 UTC
With kernel version 3.6.2-1.fc16.x86_64, I could not reproduce the bug in about 20 copy operations. I am not sure that it is completely gone (as I was not able to trigger it deterministically with the older kernels), but at least the probability of transfer errors has decreased significantly.
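Both halves of this conclusion can be quantified with a simple binomial estimate. It assumes that attempts fail independently and takes the roughly 50% failure rate from the original description at face value (an assumption; the reporter's figure is approximate). Under the old rate, 20 clean runs in a row would be very unlikely. At the same time, 20 clean runs only bound the new failure rate p at about 14% with 95% confidence, so the bug may well still be present:

```latex
% Chance of 20 clean runs if the old ~50% failure rate still applied
P(\text{0 failures in 20} \mid p = 0.5) = (1 - 0.5)^{20} = 2^{-20} \approx 9.5 \times 10^{-7}

% 95% upper bound on the new failure rate p after 20 clean runs
(1 - p)^{20} \ge 0.05 \;\Longrightarrow\; p \le 1 - 0.05^{1/20} \approx 0.139
```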

Comment 6 Fedora End Of Life 2013-01-16 14:33:38 UTC
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 7 Fedora End Of Life 2013-02-13 15:41:05 UTC
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

