435787 – RHEL4.7: USB stress test failure on AMD SBX00

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.7
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Bhavna Sarathy
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	335381

Reported:	2008-03-03 21:35 UTC by Bhavna Sarathy
Modified:	2009-08-10 07:27 UTC
CC List:	7 users
Fixed In Version:	RHSA-2008-0665
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-07-24 19:27:14 UTC
Target Upstream Version:
Embargoed:

Attachments
backported patch to fix it (10.15 KB, patch) 2008-03-10 10:20 UTC, Shane Huang
backported patch to fix it (updated) (10.17 KB, patch) 2008-03-11 10:24 UTC, Shane Huang
Fix-ups which worked for me (5.31 KB, patch) 2008-04-09 03:33 UTC, Pete Zaitcev
Fix-up patch (6.03 KB, patch) 2008-04-10 00:18 UTC, Pete Zaitcev
dmesg under 2.6.9-67 plus two EHCI patches (28.00 KB, text/plain) 2008-04-21 09:37 UTC, Shane Huang
all in one patch for the stress test failure (15.53 KB, patch) 2008-04-29 10:28 UTC, henry su

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2008:0665	0	normal	SHIPPED_LIVE	Moderate: Updated kernel packages for Red Hat Enterprise Linux 4.7	2008-07-24 16:41:06 UTC

Description Bhavna Sarathy 2008-03-03 21:35:31 UTC

+++ This bug was initially created as a clone of Bug #435670 +++

Description of problem:

An SB600/SB700 USB bug causes the USB stress test to fail:
http://bugzilla.kernel.org/show_bug.cgi?id=8692

The workaround consists of three Linux patches:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=07d29b63ef6b39963ab37818653284d861cf55af

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f8fa7571a928d6d0e1b7444b0ea69ec7dc7db3b6

http://lkml.org/lkml/2008/2/19/546

I backported these three patches for RHEL5.2; please add them to the kernel.

Thanks

-- Additional comment from shane.huang on 2008-03-03 04:01 EST --
Created an attachment (id=296565)
backported patch to fix USB stress test failure


-- Additional comment from shane.huang on 2008-03-03 04:04 EST --
Please add this patch and build a kernel rpm package for us;
I can ask our QA to test it. Thanks

-- Additional comment from bnagendr on 2008-03-03 09:31 EST --
Russ, please add to the 5.2 master bug list.  Peter has indicated that this
patch will go into a snapshot build.
Bhavana

-- Additional comment from rdoty on 2008-03-03 10:19 EST --
Requesting inclusion in a Beta snapshot.

-- Additional comment from bnagendr on 2008-03-03 15:16 EST --
Shane, thanks for attaching the backport.  Brew build uploaded for AMD chipset
QA team to test.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1193626

Comment 1 Bhavna Sarathy 2008-03-03 21:36:33 UTC

Russ, please add to the 4.7 tracker, for 5.2 and 4.7 bug fix parity.

Comment 2 Bhavna Sarathy 2008-03-03 21:40:45 UTC

Shane, is this fix needed for R4.7 as well?

Comment 3 Shane Huang 2008-03-04 01:02:17 UTC

Yes, but the backported patches may be different; I will check them first.

Comment 4 Shane Huang 2008-03-10 10:20:21 UTC

Created attachment 297411
backported patch to fix it

Comment 5 Shane Huang 2008-03-10 10:20:54 UTC

The third patch in http://lkml.org/lkml/2008/2/19/546
has been replaced by another one:
http://marc.info/?l=linux-usb&m=120469059715031&w=2
because the latter can make things better.

Since kernel 2.6.9 is fairly old and differs considerably from
2.6.25, porting the patches involves many dependencies.

Bhavana,

Can you add some USB experts from Red Hat to this bugzilla CC list,
so that they can help review the backported patch (comment #4)
for potential regressions?

Comment 6 Shane Huang 2008-03-11 10:24:35 UTC

Created attachment 297591
backported patch to fix it (updated)

Comment 7 Shane Huang 2008-03-11 10:25:27 UTC

The third patch http://marc.info/?l=linux-usb&m=120469059715031&w=2
has been added to Linus's source tree too:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e82cc1288fa57857c6af8c57f3d07096d4bcd9d9

I also made some modifications to the original backported patch;
please use the patch in comment #6 instead.

Comment 8 RHEL Program Management 2008-03-17 19:10:52 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 Vivek Goyal 2008-03-27 23:23:59 UTC

Committed in 68.27.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 11 Pete Zaitcev 2008-04-09 03:31:55 UTC

I was tinkering with my local testbed for bug 441552, and experienced
a hang on -68.32.EL. I did rmmod usb_storage && modprobe usb_storage
and the modprobe hung in =>command_abort=>hcd_unlink_urb. Clearly a
lost interrupt. Unplugging the device clears. I applied an additional
patch and modprobe started working.

At this point I suspect that Shane's backport was not 100% correct
for the 2.6.9 base which we have in RHEL 4. It looks like he busted
the useful code in ehci_urb_dequeue(), and I missed that when I reviewed.

I'm wondering now if the stress test AMD was running included command
aborts. That's where all bugs always crop up. Maybe they just transferred
a lot of data and called it a day. But the error processing is much
trickier than just doing transfers.

Comment 12 Pete Zaitcev 2008-04-09 03:33:15 UTC

Created attachment 301746
Fix-ups which worked for me

Comment 14 Shane Huang 2008-04-09 04:44:12 UTC

There are many differences between 2.6.9 and an up-to-date kernel like 2.6.25,
including that unlink_async() is not implemented in 2.6.9.
So it's possible that my backport is incomplete. That's why I asked
Red Hat USB experts to review my patch in comment #5.

Our stress test focused on data transfer, as you guessed;
the command abort case was not tested well.

Any other EHCI patches for RHEL4 are welcome. Thanks

BTW:
I'm not able to view BZ #441552; can you also put me on the CC list?

Comment 15 Pete Zaitcev 2008-04-09 05:45:51 UTC

There's nothing interesting in 441552 for us, it's just a use-after-free
in usb-storage. It probably has access bits set because a partner reported
it through an escalation.

I'm going to make a better fix-up patch tomorrow.

Comment 16 Shane Huang 2008-04-09 07:07:54 UTC

OK, thanks. Please provide your patch when it is ready;
I can ask our QA to test it again on RHEL4.

Comment 17 Pete Zaitcev 2008-04-10 00:18:00 UTC

Created attachment 301913
Fix-up patch

Since we're following bleeding edge upstream here, I thought it prudent
to swipe the new update_async and otherwise get a bit closer to upstream.
I tested this patch on top of -68.32.EL with basic unit tests. Now we
need to build a kernel for AMD to test. Maybe Bhavana can help?

Comment 21 Vivek Goyal 2008-04-15 18:24:58 UTC

Reverted this patch in 68.34 build as this patch had created issues. Putting the
bug back to assigned state.

Comment 22 Shane Huang 2008-04-16 06:35:21 UTC

OK. I'm combining the patch in comment #6 and the patch in comment #17 into a
single one, and will then ask our QA to test the combined patch.

I will submit the combined patch here after our QA's test. Thanks.

Comment 23 Shane Huang 2008-04-17 07:30:46 UTC

Pete's patch in comment #17 PLUS my patch in comment #6 leads to the error:
    SCSI error : <5 0 0 0> return code = 0x6000000
on kernel 2.6.9-68.34 or 2.6.9-67 when I mount external USB HDD partitions,
while I do NOT see a similar error with my patch alone.

Pete, would you please check your patch again? Did you find this error on your
platform? Thanks
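
For readers decoding the return code quoted above, here is a minimal sketch. It assumes the 2.6-era layout of the SCSI result word (status, message, host and driver bytes, as in the status_byte()/msg_byte()/host_byte()/driver_byte() macros of include/scsi/scsi.h). The toy_ names are hypothetical helpers, not kernel identifiers.

```c
#include <stdint.h>

/*
 * Hypothetical user-space helper, not kernel code.
 * Assumed layout of the 32-bit SCSI result word:
 *   bits  1..7   status byte (shifted right by one)
 *   bits  8..15  message byte
 *   bits 16..23  host byte
 *   bits 24..31  driver byte
 */
struct toy_scsi_result {
    uint8_t status;
    uint8_t msg;
    uint8_t host;
    uint8_t driver;
};

/* Split a result word into its four fields under the layout assumed above. */
static struct toy_scsi_result toy_decode_scsi_result(uint32_t result)
{
    struct toy_scsi_result r;

    r.status = (uint8_t)((result >> 1) & 0x7f);
    r.msg    = (uint8_t)((result >> 8) & 0xff);
    r.host   = (uint8_t)((result >> 16) & 0xff);
    r.driver = (uint8_t)((result >> 24) & 0xff);
    return r;
}
```

Under that assumption, toy_decode_scsi_result(0x6000000) gives a driver byte of 0x06 with the host, message and status bytes all zero. In those headers 0x06 is DRIVER_TIMEOUT, which would point at a command timeout rather than a device-reported status.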

Comment 24 Shane Huang 2008-04-18 10:21:30 UTC

Is there any update on the "SCSI error"?

Thanks

Comment 25 Pete Zaitcev 2008-04-19 01:37:05 UTC

I have done my tests from the beginning, with various versions. The result
is the same, modprobe ehci_hcd (or modprobe usb_storage) hangs with just
the patch from comment #6 and works with both. So, no change on my end
from what I reported previously.

I tried different controllers (but not SB600):
05:04.3 USB Controller: ALi Corporation USB 2.0 Controller (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI
Controller (rev 01)

I have access to an SB600:
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
But for some reason I'm unable to run RHEL 4 on that system (issues
with either LVM2 or SATA not recognized), and the USB is onboard, so
that's not tested. Lots of time wasted though.

The error Shane reported in comment #23 did not appear. However, I'm
quite ready to accept that my fix-up made it worse for his case.
So Peter was wise not to bank on my patch.

Shane, please attach an unmodified dmesg, I would like to see it.

Comment 26 Shane Huang 2008-04-21 09:37:31 UTC

Created attachment 303125
dmesg under 2.6.9-67 plus two EHCI patches

Comment 27 Shane Huang 2008-04-21 09:40:01 UTC

The "SCSI error" appears several seconds after I plug in the USB HDD;
it is not related to "mount". Please check it.

Comment 28 Shane Huang 2008-04-25 01:09:19 UTC

Pete, is there any update on your side?
Can you reproduce the SCSI error?

Comment 29 Pete Zaitcev 2008-04-25 02:23:32 UTC

No change, it's not happening here.

But actually it may be a good thing. In case the abort didn't work
properly previously and started working now, the error handler would
be able to print that message. I would not be too concerned IF the
device continued to work (if sdb1 is mountable and accessible).

Comment 30 Shane Huang 2008-04-25 02:55:46 UTC

So, will the EHCI patches in comment #6 and comment #17 be added to
RHEL4.7 kernel?

Comment 31 henry su 2008-04-29 10:26:45 UTC

The SCSI errors come from the SCSI completion processing for block I/O
requests; the external hard disk is actually mountable and accessible even
when these errors come up.
The backport of qh_refresh is not complete: we still need to refresh the
queue head when the state of the QH is idle. I combined Shane's and Pete's
patches into one; it should fix all issues (modprobe hang and SCSI errors).
Pete, please review and test this patch on your side, thank you!

Comment 32 henry su 2008-04-29 10:28:13 UTC

Created attachment 304091
all in one patch for the stress test failure

Comment 33 henry su 2008-04-30 01:13:08 UTC

If you plug in the USB hard disk, the SCSI errors come up a few minutes later;
there are no errors with the USB flash disk.

Comment 34 Pete Zaitcev 2008-04-30 01:18:43 UTC

The patch in comment #32 works for me.

Comment 35 Shane Huang 2008-04-30 01:37:39 UTC

Bhavana and Pete,

Can you add this combined patch in #32 into RHEL4.7 kernel before its final 
release?

And can you provide us one testing rpm package for our QA's test?

Thanks

Comment 36 Pete Zaitcev 2008-04-30 01:54:40 UTC

Bhavana has to get an exception for it from the PM team.

Comment 37 Russell Doty 2008-04-30 13:32:16 UTC

Bhavana, please report when the patch has its rhkernel acks.

Comment 38 Bhavna Sarathy 2008-04-30 13:51:42 UTC

I most certainly can, stay tuned.  But folks reviewing would want to know if the
patch has an exception.  Peter?

Comment 39 Peter Martuccelli 2008-04-30 14:58:42 UTC

The review process is separate, regardless RHKL ACKs will be required first for
this issue to be approved as an exception.  Plus I will need a write-up on the
combined patch - specifically the difference between the original patch(comment
6), Pete's follow-up patch(comment 17), and the combined patch(comment 32).  I
see additional calls to qh_refresh() in ehci-q.c in the combined patch, I assume
this is based on comment 31 and was required to resolve the SCSI errors.

Has anyone seen any additional SCSI errors when testing with the combined patch,
(comment 32)?

Comment 40 Pete Zaitcev 2008-04-30 16:06:38 UTC

I haven't seen the error with any combination of patches. It has
something to do with the HD enclosure AMD was using.

Comment 41 Bhavna Sarathy 2008-05-01 16:40:43 UTC

Patch posted to RHML Apr 30, two ACKs (PeteZ, PeterM).

Brew build for testing:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1300222

Comment 42 Shane Huang 2008-05-06 01:46:21 UTC

Our QA's full USB test of the x86_64 test rpm package provided by Bhavana
passed; the package includes the combined patch in comment #32.

So can you guys add the patch in comment #32 to RHEL4.7 kernel ASAP?

The test result of i386 rpm package will come out tomorrow.

Thanks

Comment 43 Shane Huang 2008-05-07 02:39:58 UTC

The i386 testing kernel with the combined patch in comment #32 can work well
too on our SB700 board.  Thanks.

Comment 44 Bhavna Sarathy 2008-05-07 20:35:03 UTC

(In reply to comment #39)
> The review process is separate, regardless RHKL ACKs will be required first for
> this issue to be approved as an exception.  Plus I will need a write-up on the
> combined patch - specifically the difference between the original patch(comment
> 6), Pete's follow-up patch(comment 17), and the combined patch(comment 32).  I
> see additional calls to qh_refresh() in ehci-q.c in the combined patch, I assume
> this is based on comment 31 and was required to resolve the SCSI errors.
> 

Pete's patch adds the following:
Function unlink_async()
Uncomments the function in QH_STATE_COMPLETING:
Updates QH_STATE_LINKED to periodically self-unlink when empty and call
unlink_async()
Adds qh_refresh() and invokes refresh in QH_STATE_IDLE

Henry's changes to comment #6 + comment#17 are:
a call added to qh_make() to refresh QH via qh_refresh()
Code added to qh_link_async() to clear halt or toggle to recover from 
silicon quirk, via a call to qh_refresh() if qh_state is QH_STATE_IDLE

Both these changes completed the backport of qh_refresh that Pete's
patch included.  These changes are necessary to fix the SCSI errors
that were occurring during the block I/O requests on AMD hardware.

Complete QA tests at AMD have shown that the combined patch in #32 works
without any issues.

> Has anyone seen any additional SCSI errors when testing with the combined patch,
> (comment 32)?

AMD has not seen any errors and Pete said he has not either.

Peter, do you have enough information for you and Russ to ask for the 
exception for this patch for RHEL4.7?    

Regards,
Bhavana
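
The refresh-on-link behavior summarized above can be sketched as a small user-space model. This is not the backported kernel code. The toy_ types and functions are hypothetical, and the token bit positions are assumed to follow the EHCI qTD token layout (data toggle in bit 31, active in bit 7, halted in bit 6, ping in bit 0). The real qh_refresh() does more than this. The sketch only shows the idea: when a queue head is linked while idle, its overlay is pointed at the first pending qTD again and stale halt/active status is cleared, while the data toggle is kept.

```c
#include <stdint.h>

/* Assumed qTD token bits (EHCI layout); names are hypothetical. */
#define TOY_QTD_TOGGLE     (1u << 31)  /* data toggle */
#define TOY_QTD_STS_ACTIVE (1u << 7)   /* transaction still active */
#define TOY_QTD_STS_HALT   (1u << 6)   /* queue halted on error */
#define TOY_QTD_STS_PING   (1u << 0)   /* high-speed PING state */

enum toy_qh_state { TOY_QH_IDLE, TOY_QH_LINKED };

/* Toy queue head: only the fields needed to show the refresh step. */
struct toy_qh {
    enum toy_qh_state state;
    uint32_t hw_token;       /* overlay status as last written back */
    uint32_t hw_qtd_next;    /* overlay pointer to the next qTD */
    uint32_t first_qtd_dma;  /* address of the first pending qTD */
};

/* Re-point the overlay and drop stale status, keeping toggle and ping. */
static void toy_qh_refresh(struct toy_qh *qh)
{
    qh->hw_qtd_next = qh->first_qtd_dma;
    qh->hw_token &= (TOY_QTD_TOGGLE | TOY_QTD_STS_PING);
}

/* Link a queue head into the schedule, refreshing it first if it was idle. */
static void toy_qh_link_async(struct toy_qh *qh)
{
    if (qh->state == TOY_QH_IDLE)
        toy_qh_refresh(qh);
    qh->state = TOY_QH_LINKED;
}
```

In this model, an idle queue head left with the halt bit set would stay halted after linking unless the refresh runs first, which is one plausible way to read the transfer stalls and SCSI errors described in this thread.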

Comment 46 Shane Huang 2008-05-28 02:05:14 UTC

I do not find the combined patch in RHEL4.7 Beta (kernel 2.6.9-70).

Can anyone tell me from which kernel version this patch
will be included?

Thanks

Comment 47 Bhavna Sarathy 2008-05-28 13:57:27 UTC

snapshot 1

Comment 48 Vivek Goyal 2008-05-29 20:51:07 UTC

Committed in 71.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 50 Shane Huang 2008-06-16 02:46:54 UTC

Verified with Snapshot1, thanks.

Comment 51 Pete Zaitcev 2008-07-10 15:52:07 UTC

Dell reports a pretty clear regression with this in bug 454479.
They often use high-speed HID devices coming from Avocent, so
EHCI is directly involved.

Comment 53 errata-xmlrpc 2008-07-24 19:27:14 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html
