Hide Forgot
Description of problem: Version-Release number of selected component (if applicable): How reproducible: Most of the time Steps to Reproduce: with Western Digital My Passport USB3 drive attached (and other USB3 devices) 1. time dd if=/dev/zero of=/dev/sdb conv=fsync bs=512 count=500000 Actual results: "WARN: Stalled endpoint" errors given, occasional slowdown is observed, and kernel stack traces like the one following are produced: Mar 22 18:38:27 localhost kernel: USB Mass Storage support registered. Mar 22 18:38:32 localhost kernel: scsi 6:0:0:0: Direct-Access WD My Passport 0730 1012 PQ: 0 ANSI: 6 Mar 22 18:38:32 localhost kernel: scsi 6:0:0:1: Enclosure WD SES Device 1012 PQ: 0 ANSI: 6 Mar 22 18:38:32 localhost kernel: sd 6:0:0:0: Attached scsi generic sg1 type 0 Mar 22 18:38:32 localhost kernel: scsi 6:0:0:1: Attached scsi generic sg2 type 13 Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] 976707584 512-byte logical blocks: (500 GB/465 GiB) Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Write Protect is off Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through Mar 22 18:38:34 localhost kernel: sdb: unknown partition table Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Attached SCSI disk Mar 22 18:38:34 localhost kernel: ses 6:0:0:1: Attached Enclosure device Mar 22 18:38:34 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint Mar 22 18:39:01 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint Mar 22 18:39:31 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint Mar 22 18:39:58 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint Mar 22 18:40:11 localhost kernel: irq event 33: bogus return value ffffff94 Mar 22 18:40:11 localhost kernel: Pid: 0, comm: swapper Not tainted 2.6.32-125.el6.x86_64 #1 Mar 22 18:40:11 localhost kernel: Call Trace: Mar 22 18:40:11 localhost kernel: <IRQ> [<ffffffff810d872b>] ? __report_bad_irq+0x2b/0xa0 Mar 22 18:40:11 localhost kernel: [<ffffffff810d88f4>] ? note_interrupt+0x154/0x1d0 Mar 22 18:40:11 localhost kernel: [<ffffffff810d8ecd>] ? handle_edge_irq+0xdd/0x160 Mar 22 18:40:11 localhost kernel: [<ffffffff8109e386>] ? tick_check_idle+0xb6/0xe0 Mar 22 18:40:11 localhost kernel: [<ffffffff8100df89>] ? handle_irq+0x49/0xa0 Mar 22 18:40:11 localhost kernel: [<ffffffff814e1b4c>] ? do_IRQ+0x6c/0xf0 Mar 22 18:40:11 localhost kernel: [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11 Mar 22 18:40:11 localhost kernel: <EOI> [<ffffffff812ba4ae>] ? intel_idle+0xde/0x170 Mar 22 18:40:11 localhost kernel: [<ffffffff812ba491>] ? intel_idle+0xc1/0x170 Mar 22 18:40:11 localhost kernel: [<ffffffff813eb537>] ? cpuidle_idle_call+0xa7/0x140 Mar 22 18:40:11 localhost kernel: [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110 Mar 22 18:40:11 localhost kernel: [<ffffffff814c1d2a>] ? rest_init+0x7a/0x80 Mar 22 18:40:11 localhost kernel: [<ffffffff81bbdf23>] ? start_kernel+0x418/0x424 Mar 22 18:40:11 localhost kernel: [<ffffffff81bbd33a>] ? x86_64_start_reservations+0x125/0x129 Mar 22 18:40:11 localhost kernel: [<ffffffff81bbd438>] ? x86_64_start_kernel+0xfa/0x109 Mar 22 18:40:11 localhost kernel: handlers: Mar 22 18:40:11 localhost kernel: [<ffffffffa01b6800>] (xhci_msi_irq+0x0/0x20 [xhci_hcd]) Expected results: Drive should not sporadically slow down and produce these errors. Additional info:
The stack traces are a concern and looking up what 0xffffff94 translates to is ESHUTDOWN, which is not a normal irq return value. Looking into the xhci code, sure enough, if the host controller has died then the irq handler will return ESHUTDOWN. So the host controller died, which is probably a result of the WARNs. The WARNs usually indicate the host controller and device got out of sync and the device needs to be reset. Not unusual for the usb layer, in fact there are apis for the storage layer to follow when hit with this condition. What I am guessing happened, is the device stalled and needed to be reset. The resetting part did not work out and the host controller gave up. The reset logic is still a little fragile, I'll poke at it some more and see if I can grab some upstream fixes to stabilize it. Is this reliably reproducible? Cheers, Don