Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 692425

Summary: USB3 storage equipment runs slow and generates "WARN: Stalled endpoint" errors, and kernel stack traces
Product: Red Hat Enterprise Linux 6 Reporter: Jason Hird <jhird>
Component: kernelAssignee: Don Zickus <dzickus>
Status: CLOSED DUPLICATE QA Contact: Caspar Zhang <czhang>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.2CC: arozansk, cww, myllynen, qcai
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-10-07 12:22:59 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jason Hird 2011-03-31 10:25:20 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
Most of the time

Steps to Reproduce:
with Western Digital My Passport USB3 drive attached (and other USB3 devices)
1. time dd if=/dev/zero of=/dev/sdb conv=fsync bs=512 count=500000 

  
Actual results:
"WARN: Stalled endpoint" errors given, occasional slowdown is observed, and kernel stack traces like the one following are produced:

Mar 22 18:38:27 localhost kernel: USB Mass Storage support registered.
Mar 22 18:38:32 localhost kernel: scsi 6:0:0:0: Direct-Access     WD       My Passport 0730 1012 PQ: 0 ANSI: 6
Mar 22 18:38:32 localhost kernel: scsi 6:0:0:1: Enclosure         WD       SES Device       1012 PQ: 0 ANSI: 6
Mar 22 18:38:32 localhost kernel: sd 6:0:0:0: Attached scsi generic sg1 type 0
Mar 22 18:38:32 localhost kernel: scsi 6:0:0:1: Attached scsi generic sg2 type 13
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] 976707584 512-byte logical blocks: (500 GB/465 GiB)
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Write Protect is off
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
Mar 22 18:38:34 localhost kernel: sdb: unknown partition table
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Assuming drive cache: write through
Mar 22 18:38:34 localhost kernel: sd 6:0:0:0: [sdb] Attached SCSI disk
Mar 22 18:38:34 localhost kernel: ses 6:0:0:1: Attached Enclosure device
Mar 22 18:38:34 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:39:01 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:39:31 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:39:58 localhost kernel: xhci_hcd 0000:01:00.0: WARN: Stalled endpoint
Mar 22 18:40:11 localhost kernel: irq event 33: bogus return value ffffff94
Mar 22 18:40:11 localhost kernel: Pid: 0, comm: swapper Not tainted 2.6.32-125.el6.x86_64 #1
Mar 22 18:40:11 localhost kernel: Call Trace:
Mar 22 18:40:11 localhost kernel: <IRQ>  [<ffffffff810d872b>] ? __report_bad_irq+0x2b/0xa0
Mar 22 18:40:11 localhost kernel: [<ffffffff810d88f4>] ? note_interrupt+0x154/0x1d0
Mar 22 18:40:11 localhost kernel: [<ffffffff810d8ecd>] ? handle_edge_irq+0xdd/0x160
Mar 22 18:40:11 localhost kernel: [<ffffffff8109e386>] ? tick_check_idle+0xb6/0xe0
Mar 22 18:40:11 localhost kernel: [<ffffffff8100df89>] ? handle_irq+0x49/0xa0
Mar 22 18:40:11 localhost kernel: [<ffffffff814e1b4c>] ? do_IRQ+0x6c/0xf0
Mar 22 18:40:11 localhost kernel: [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
Mar 22 18:40:11 localhost kernel: <EOI>  [<ffffffff812ba4ae>] ? intel_idle+0xde/0x170
Mar 22 18:40:11 localhost kernel: [<ffffffff812ba491>] ? intel_idle+0xc1/0x170
Mar 22 18:40:11 localhost kernel: [<ffffffff813eb537>] ? cpuidle_idle_call+0xa7/0x140
Mar 22 18:40:11 localhost kernel: [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
Mar 22 18:40:11 localhost kernel: [<ffffffff814c1d2a>] ? rest_init+0x7a/0x80
Mar 22 18:40:11 localhost kernel: [<ffffffff81bbdf23>] ? start_kernel+0x418/0x424
Mar 22 18:40:11 localhost kernel: [<ffffffff81bbd33a>] ? x86_64_start_reservations+0x125/0x129
Mar 22 18:40:11 localhost kernel: [<ffffffff81bbd438>] ? x86_64_start_kernel+0xfa/0x109
Mar 22 18:40:11 localhost kernel: handlers:
Mar 22 18:40:11 localhost kernel: [<ffffffffa01b6800>] (xhci_msi_irq+0x0/0x20 [xhci_hcd])



Expected results:
Drive should not sporadically slow down and produce these errors.

Additional info:

Comment 4 Don Zickus 2011-04-01 15:18:45 UTC
The stack traces are a concern and looking up what 0xffffff94 translates to is ESHUTDOWN, which is not a normal irq return value.  Looking into the xhci code, sure enough, if the host controller has died then the irq handler will return ESHUTDOWN.

So the host controller died, which is probably a result of the WARNs.  The WARNs usually indicate the host controller and device got out of sync and the device needs to be reset.  Not unusual for the usb layer, in fact there are apis for the storage layer to follow when hit with this condition.

What I am guessing happened, is the device stalled and needed to be reset.  The resetting part did not work out and the host controller gave up.

The reset logic is still a little fragile, I'll poke at it some more and see if I can grab some upstream fixes to stabilize it.

Is this reliably reproducible?

Cheers,
Don