Bug 692425 - USB3 storage equipment runs slow and generates "WARN: Stalled endpoint" errors, and kernel stack traces
Summary: USB3 storage equipment runs slow and generates "WARN: Stalled endpoint" error...
Keywords:
Status: CLOSED DUPLICATE of bug 696602
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.2
Hardware: All
OS: Linux
medium
medium
Target Milestone: rc
: ---
Assignee: Don Zickus
QA Contact: Caspar Zhang
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-03-31 10:25 UTC by Jason Hird
Modified: 2018-11-14 13:41 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-10-07 12:22:59 UTC
Target Upstream Version:


Attachments (Terms of Use)

Description Jason Hird 2011-03-31 10:25:20 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
Most of the time

Steps to Reproduce:
with Western Digital My Passport USB3 drive attached (and other USB3 devices)
1. time dd if=/dev/zero of=/dev/sdb conv=fsync bs=512 count=500000 

  
Actual results:
"WARN: Stalled endpoint" errors given, occasional slowdown is observed, and kernel stack traces like the one following are produced:

Mar 22 18:38:27 localhost kernel: USB Mass Storage support registered.
Mar 22 18:38:32 localhost kernel: scsi 6:0:0:0: Direct-Access     WD       My Passport 0730 1012 PQ: 0 ANSI: 6
Mar 22 18:38:32 localhost kernel: scsi 6:0:0:1: Enclosure         WD       SES Device       1012 PQ: 0 ANSI: 6
Mar 22 18:38:32 localhost kernel: sd 6:0:0:0: Attached scsi generic sg1 type 0
Mar 22 18:38:32 localhost kernel: scsi 6:0:0:1: Attached scsi generic sg2 type 13
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] 976707584 512-byte logical blocks: (500 GB/465 GiB)
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Write Protect is off
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
Mar 22 18:38:34 localhost kernel: sdb: unknown partition table
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Attached SCSI disk
Mar 22 18:38:34 localhost kernel: ses 6:0:0:1: Attached Enclosure device
Mar 22 18:38:34 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:39:01 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:39:31 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:39:58 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:40:11 localhost kernel: irq event 33: bogus return value ffffff94
Mar 22 18:40:11 localhost kernel: Pid: 0, comm: swapper Not tainted 2.6.32-125.el6.x86_64 #1
Mar 22 18:40:11 localhost kernel: Call Trace:
Mar 22 18:40:11 localhost kernel: <IRQ>  [<ffffffff810d872b>] ? __report_bad_irq+0x2b/0xa0
Mar 22 18:40:11 localhost kernel: [<ffffffff810d88f4>] ? note_interrupt+0x154/0x1d0
Mar 22 18:40:11 localhost kernel: [<ffffffff810d8ecd>] ? handle_edge_irq+0xdd/0x160
Mar 22 18:40:11 localhost kernel: [<ffffffff8109e386>] ? tick_check_idle+0xb6/0xe0
Mar 22 18:40:11 localhost kernel: [<ffffffff8100df89>] ? handle_irq+0x49/0xa0
Mar 22 18:40:11 localhost kernel: [<ffffffff814e1b4c>] ? do_IRQ+0x6c/0xf0
Mar 22 18:40:11 localhost kernel: [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
Mar 22 18:40:11 localhost kernel: <EOI>  [<ffffffff812ba4ae>] ? intel_idle+0xde/0x170
Mar 22 18:40:11 localhost kernel: [<ffffffff812ba491>] ? intel_idle+0xc1/0x170
Mar 22 18:40:11 localhost kernel: [<ffffffff813eb537>] ? cpuidle_idle_call+0xa7/0x140
Mar 22 18:40:11 localhost kernel: [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
Mar 22 18:40:11 localhost kernel: [<ffffffff814c1d2a>] ? rest_init+0x7a/0x80
Mar 22 18:40:11 localhost kernel: [<ffffffff81bbdf23>] ? start_kernel+0x418/0x424
Mar 22 18:40:11 localhost kernel: [<ffffffff81bbd33a>] ? x86_64_start_reservations+0x125/0x129
Mar 22 18:40:11 localhost kernel: [<ffffffff81bbd438>] ? x86_64_start_kernel+0xfa/0x109
Mar 22 18:40:11 localhost kernel: handlers:
Mar 22 18:40:11 localhost kernel: [<ffffffffa01b6800>] (xhci_msi_irq+0x0/0x20 [xhci_hcd])



Expected results:
Drive should not sporadically slow down and produce these errors.

Additional info:

Comment 4 Don Zickus 2011-04-01 15:18:45 UTC
The stack traces are a concern and looking up what 0xffffff94 translates to is ESHUTDOWN, which is not a normal irq return value.  Looking into the xhci code, sure enough, if the host controller has died then the irq handler will return ESHUTDOWN.

So the host controller died, which is probably a result of the WARNs.  The WARNs usually indicate the host controller and device got out of sync and the device needs to be reset.  Not unusual for the usb layer, in fact there are apis for the storage layer to follow when hit with this condition.

What I am guessing happened, is the device stalled and needed to be reset.  The resetting part did not work out and the host controller gave up.

The reset logic is still a little fragile, I'll poke at it some more and see if I can grab some upstream fixes to stabilize it.

Is this reliably reproducible?

Cheers,
Don


Note You need to log in before you can comment on or make changes to this bug.