Bug 692425

Summary: USB3 storage equipment runs slow and generates "WARN: Stalled endpoint" errors, and kernel stack traces
Product: Red Hat Enterprise Linux 6 Reporter: Jason Hird <jhird>
Component: kernelAssignee: Don Zickus <dzickus>
Status: CLOSED DUPLICATE QA Contact: Caspar Zhang <czhang>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.2CC: arozansk, cww, myllynen, qcai
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-10-07 12:22:59 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Jason Hird 2011-03-31 10:25:20 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
Most of the time

Steps to Reproduce:
with Western Digital My Passport USB3 drive attached (and other USB3 devices)
1. time dd if=/dev/zero of=/dev/sdb conv=fsync bs=512 count=500000 

  
Actual results:
"WARN: Stalled endpoint" errors given, occasional slowdown is observed, and kernel stack traces like the one following are produced:

Mar 22 18:38:27 localhost kernel: USB Mass Storage support registered.
Mar 22 18:38:32 localhost kernel: scsi 6:0:0:0: Direct-Access     WD       My Passport 0730 1012 PQ: 0 ANSI: 6
Mar 22 18:38:32 localhost kernel: scsi 6:0:0:1: Enclosure         WD       SES Device       1012 PQ: 0 ANSI: 6
Mar 22 18:38:32 localhost kernel: sd 6:0:0:0: Attached scsi generic sg1 type 0
Mar 22 18:38:32 localhost kernel: scsi 6:0:0:1: Attached scsi generic sg2 type 13
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] 976707584 512-byte logical blocks: (500 GB/465 GiB)
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Write Protect is off
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
Mar 22 18:38:34 localhost kernel: sdb: unknown partition table
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Attached SCSI disk
Mar 22 18:38:34 localhost kernel: ses 6:0:0:1: Attached Enclosure device
Mar 22 18:38:34 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:39:01 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:39:31 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:39:58 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:40:11 localhost kernel: irq event 33: bogus return value ffffff94
Mar 22 18:40:11 localhost kernel: Pid: 0, comm: swapper Not tainted 2.6.32-125.el6.x86_64 #1
Mar 22 18:40:11 localhost kernel: Call Trace:
Mar 22 18:40:11 localhost kernel: <IRQ>  [<ffffffff810d872b>] ? __report_bad_irq+0x2b/0xa0
Mar 22 18:40:11 localhost kernel: [<ffffffff810d88f4>] ? note_interrupt+0x154/0x1d0
Mar 22 18:40:11 localhost kernel: [<ffffffff810d8ecd>] ? handle_edge_irq+0xdd/0x160
Mar 22 18:40:11 localhost kernel: [<ffffffff8109e386>] ? tick_check_idle+0xb6/0xe0
Mar 22 18:40:11 localhost kernel: [<ffffffff8100df89>] ? handle_irq+0x49/0xa0
Mar 22 18:40:11 localhost kernel: [<ffffffff814e1b4c>] ? do_IRQ+0x6c/0xf0
Mar 22 18:40:11 localhost kernel: [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
Mar 22 18:40:11 localhost kernel: <EOI>  [<ffffffff812ba4ae>] ? intel_idle+0xde/0x170
Mar 22 18:40:11 localhost kernel: [<ffffffff812ba491>] ? intel_idle+0xc1/0x170
Mar 22 18:40:11 localhost kernel: [<ffffffff813eb537>] ? cpuidle_idle_call+0xa7/0x140
Mar 22 18:40:11 localhost kernel: [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
Mar 22 18:40:11 localhost kernel: [<ffffffff814c1d2a>] ? rest_init+0x7a/0x80
Mar 22 18:40:11 localhost kernel: [<ffffffff81bbdf23>] ? start_kernel+0x418/0x424
Mar 22 18:40:11 localhost kernel: [<ffffffff81bbd33a>] ? x86_64_start_reservations+0x125/0x129
Mar 22 18:40:11 localhost kernel: [<ffffffff81bbd438>] ? x86_64_start_kernel+0xfa/0x109
Mar 22 18:40:11 localhost kernel: handlers:
Mar 22 18:40:11 localhost kernel: [<ffffffffa01b6800>] (xhci_msi_irq+0x0/0x20 [xhci_hcd])



Expected results:
Drive should not sporadically slow down and produce these errors.

Additional info:

Comment 4 Don Zickus 2011-04-01 15:18:45 UTC
The stack traces are a concern and looking up what 0xffffff94 translates to is ESHUTDOWN, which is not a normal irq return value.  Looking into the xhci code, sure enough, if the host controller has died then the irq handler will return ESHUTDOWN.

So the host controller died, which is probably a result of the WARNs.  The WARNs usually indicate the host controller and device got out of sync and the device needs to be reset.  Not unusual for the usb layer, in fact there are apis for the storage layer to follow when hit with this condition.

What I am guessing happened, is the device stalled and needed to be reset.  The resetting part did not work out and the host controller gave up.

The reset logic is still a little fragile, I'll poke at it some more and see if I can grab some upstream fixes to stabilize it.

Is this reliably reproducible?

Cheers,
Don