Bug 435787

Summary: RHEL4.7: USB stress test failure on AMD SBX00
Product: Red Hat Enterprise Linux 4 Reporter: Bhavna Sarathy <bnagendr>
Component: kernel Assignee: Bhavna Sarathy <bnagendr>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: high    
Version: 4.7 CC: bhavna.sarathy, nmsasiku, peterm, rdoty, shane.huang, vgoyal, zaitcev
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2008-0665 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-07-24 19:27:14 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 335381    
Attachments:
Description                                     Flags
backported patch to fix it                      none
backported patch to fix it (updated)            none
Fix-ups which worked for me                     none
Fix-up patch                                    none
dmesg under 2.6.9-67 plus two EHCI patches      none
all in one patch for the stress test failure    none

Description Bhavna Sarathy 2008-03-03 21:35:31 UTC
+++ This bug was initially created as a clone of Bug #435670 +++

Description of problem:

There is an SB600/SB700 USB bug that causes the USB stress test
to fail:
http://bugzilla.kernel.org/show_bug.cgi?id=8692

The workaround consists of three Linux patches:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=07d29b63ef6b39963ab37818653284d861cf55af

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f8fa7571a928d6d0e1b7444b0ea69ec7dc7db3b6

http://lkml.org/lkml/2008/2/19/546

I backported these three patches for RHEL5.2; please add them to the kernel.

Thanks

-- Additional comment from shane.huang on 2008-03-03 04:01 EST --
Created an attachment (id=296565)
backported patch to fix USB stress test failure


-- Additional comment from shane.huang on 2008-03-03 04:04 EST --
Please add this patch and build a kernel rpm package for us;
I can ask our QA to test it. Thanks

-- Additional comment from bnagendr on 2008-03-03 09:31 EST --
Russ, please add to the 5.2 master bug list.  Peter has indicated that this
patch will go into a snapshot build.
Bhavana

-- Additional comment from rdoty on 2008-03-03 10:19 EST --
Requesting inclusion in a Beta snapshot.

-- Additional comment from bnagendr on 2008-03-03 15:16 EST --
Shane, thanks for attaching the backport.  Brew build uploaded for AMD chipset
QA team to test.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1193626

Comment 1 Bhavna Sarathy 2008-03-03 21:36:33 UTC
Russ, please add to the 4.7 tracker, for 5.2 and 4.7 bug fix parity.

Comment 2 Bhavna Sarathy 2008-03-03 21:40:45 UTC
Shane, is this fix needed for R4.7 as well?

Comment 3 Shane Huang 2008-03-04 01:02:17 UTC
Yes, but the backported patches may be different, I will check them first.


Comment 4 Shane Huang 2008-03-10 10:20:21 UTC
Created attachment 297411 [details]
backported patch to fix it

Comment 5 Shane Huang 2008-03-10 10:20:54 UTC
The third patch in http://lkml.org/lkml/2008/2/19/546
has been replaced by another one:
http://marc.info/?l=linux-usb&m=120469059715031&w=2
because the latter can make things better.

Since kernel 2.6.9 is fairly old and differs considerably from 2.6.25,
porting the patches involves many dependencies.

Bhavana,

Can you add some USB experts from Red Hat to this bug's CC list, so that
they can help review the backported patch (comment #4) for potential
regressions?


Comment 6 Shane Huang 2008-03-11 10:24:35 UTC
Created attachment 297591 [details]
backported patch to fix it (updated)

Comment 7 Shane Huang 2008-03-11 10:25:27 UTC
The third patch http://marc.info/?l=linux-usb&m=120469059715031&w=2
has also been merged into Linus's source tree:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e82cc1288fa57857c6af8c57f3d07096d4bcd9d9

I also made some modifications to the original backported patch;
please use the patch in comment #6 instead.


Comment 8 RHEL Program Management 2008-03-17 19:10:52 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 Vivek Goyal 2008-03-27 23:23:59 UTC
Committed in 68.27.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 11 Pete Zaitcev 2008-04-09 03:31:55 UTC
I was tinkering with my local testbed for bug 441552, and experienced
a hang on -68.32.EL. I did rmmod usb_storage && modprobe usb_storage
and the modprobe hung in =>command_abort=>hcd_unlink_urb. Clearly a
lost interrupt. Unplugging the device clears. I applied an additional
patch and modprobe started working.

At this point I suspect that Shane's backport was not 100% correct
for the 2.6.9 base which we have in RHEL 4. It looks like he busted
the useful code in ehci_urb_dequeue(), and I missed that when I reviewed.

I'm wondering now if the stress test AMD was running included command
aborts. That's where all bugs always crop up. Maybe they just transferred
a lot of data and called it a day. But the error processing is much
trickier than just doing transfers.

Comment 12 Pete Zaitcev 2008-04-09 03:33:15 UTC
Created attachment 301746 [details]
Fix-ups which worked for me

Comment 14 Shane Huang 2008-04-09 04:44:12 UTC
There are many differences between 2.6.9 and an up-to-date kernel like 2.6.25,
including that unlink_async() is not implemented in 2.6.9.
So it's possible that my backport is incomplete. That's why I asked
Red Hat USB experts to review my patch in comment #5.

Our stress test focused on data transfer, as you guessed;
the command abort case was not tested well.

Any other EHCI patches for RHEL4 are welcome. Thanks

BTW:
I'm not able to visit BZ #441552, can you also put me in the CC list?


Comment 15 Pete Zaitcev 2008-04-09 05:45:51 UTC
There's nothing interesting in 441552 for us, it's just a use-after-free
in usb-storage. It probably has access bits set because a partner reported
it through an escalation.

I'm going to make a better fix-up patch tomorrow.

Comment 16 Shane Huang 2008-04-09 07:07:54 UTC
OK, thanks. Please provide your patch when it is ready;
I can ask our QA to test it again on RHEL4.


Comment 17 Pete Zaitcev 2008-04-10 00:18:00 UTC
Created attachment 301913 [details]
Fix-up patch

Since we're following bleeding edge upstream here, I thought it prudent
to swipe the new update_async and otherwise get a bit closer to upstream.
I tested this patch on top of -68.32.EL with basic unit tests. Now we
need to build a kernel for AMD to test. Maybe Bhavana can help?

Comment 21 Vivek Goyal 2008-04-15 18:24:58 UTC
Reverted this patch in 68.34 build as this patch had created issues. Putting the
bug back to assigned state.

Comment 22 Shane Huang 2008-04-16 06:35:21 UTC
OK. I'm combining the patch in comment #6 and the patch in comment #17 into a
single one, and will then ask our QA to test the combined patch.

I will submit the combined patch here after our QA has tested it. Thanks.


Comment 23 Shane Huang 2008-04-17 07:30:46 UTC
Pete's patch in comment #17 PLUS my patch in comment #6 leads to the error:
    SCSI error : <5 0 0 0> return code = 0x6000000
on kernel 2.6.9-68.34 or 2.6.9-67 when I mount external USB HDD partitions,
while I do NOT see this error with my patch alone.

Pete, would you please check your patch again? Did you see this error on your
platform? Thanks
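
For readers decoding the return code quoted above, here is a minimal sketch. It assumes the usual Linux SCSI midlayer result layout (status in bits 0-7, message in bits 8-15, host in bits 16-23, driver in bits 24-31) and assumes that DRIVER_TIMEOUT is 0x06 in include/scsi/scsi.h of this kernel generation. The struct and helper names are hypothetical and are not kernel APIs; the constants should be checked against the tree actually being debugged.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed value of DRIVER_TIMEOUT from include/scsi/scsi.h; verify in your tree. */
#define ASSUMED_DRIVER_TIMEOUT 0x06u

/* Hypothetical container for the four bytes of a SCSI result word. */
struct scsi_result_fields {
    unsigned status; /* bits 0-7:   status byte reported by the device */
    unsigned msg;    /* bits 8-15:  message byte                       */
    unsigned host;   /* bits 16-23: host adapter (DID_*) code          */
    unsigned driver; /* bits 24-31: driver (DRIVER_*) code             */
};

/* Split a 32-bit result word into its four byte-wide fields. */
static inline struct scsi_result_fields split_scsi_result(uint32_t result)
{
    struct scsi_result_fields f;

    f.status = result & 0xffu;
    f.msg    = (result >> 8) & 0xffu;
    f.host   = (result >> 16) & 0xffu;
    f.driver = (result >> 24) & 0xffu;
    return f;
}

/* Print the fields and flag the assumed timeout code. */
static inline void print_scsi_result(uint32_t result)
{
    struct scsi_result_fields f = split_scsi_result(result);

    printf("driver=0x%02x host=0x%02x msg=0x%02x status=0x%02x%s\n",
           f.driver, f.host, f.msg, f.status,
           f.driver == ASSUMED_DRIVER_TIMEOUT ? " (driver timeout?)" : "");
}
```

Calling print_scsi_result(0x6000000) splits the value into a driver byte of 0x06 with host, message and status bytes all zero. Under the assumptions above, that would point to a command timeout handled by the midlayer rather than to an error status returned by the disk itself.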


Comment 24 Shane Huang 2008-04-18 10:21:30 UTC
Is there any update on the "SCSI error"?

Thanks


Comment 25 Pete Zaitcev 2008-04-19 01:37:05 UTC
I have done my tests from the beginning, with various versions. The result
is the same, modprobe ehci_hcd (or modprobe usb_storage) hangs with just
the patch from comment #6 and works with both. So, no change on my end
from what I reported previously.

I tried different controllers (but not SB600):
05:04.3 USB Controller: ALi Corporation USB 2.0 Controller (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI
Controller (rev 01)

I have access to an SB600:
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
But for some reason I'm unable to run RHEL 4 on that system (issues
with either LVM2 or SATA not recognized), and the USB is onboard, so
that's not tested. Lots of time wasted though.

The error Shane reported in comment #23 did not appear. However, I'm
quite ready to accept that my fix-up made it worse for his case.
So Peter was wise not to bank on my patch.

Shane, please attach an unmodified dmesg, I would like to see it.

Comment 26 Shane Huang 2008-04-21 09:37:31 UTC
Created attachment 303125 [details]
dmesg under 2.6.9-67 plus two EHCI patches

Comment 27 Shane Huang 2008-04-21 09:40:01 UTC
The "SCSI error" appears several seconds after I plug in the USB HDD;
it is not related to "mount". Please check it.



Comment 28 Shane Huang 2008-04-25 01:09:19 UTC
Pete, is there any update on your side?
Can you reproduce the SCSI error?

Comment 29 Pete Zaitcev 2008-04-25 02:23:32 UTC
No change, it's not happening here.

But actually it may be a good thing. In case the abort didn't work
properly previously and started working now, the error handler would
be able to print that message. I would not be too concerned IF the
device continued to work (if sdb1 is mountable and accessible).

Comment 30 Shane Huang 2008-04-25 02:55:46 UTC
So, will the EHCI patches in comment #6 and comment #17 be added to
RHEL4.7 kernel?

Comment 31 henry su 2008-04-29 10:26:45 UTC
The SCSI errors come from the SCSI completion processing for block I/O
requests. The external hard disk is in fact mountable and accessible even
when these errors come up.
The backport of qh_refresh is not complete: we still need to refresh the
queue head when the state of the QH is idle. I combined Shane's and Pete's
patches into one; it should fix all issues (modprobe hang and SCSI errors).
Pete, please review and test this patch on your side, thank you!

Comment 32 henry su 2008-04-29 10:28:13 UTC
Created attachment 304091 [details]
all in one patch for the stress test failure

Comment 33 henry su 2008-04-30 01:13:08 UTC
If you plug in the USB hard disk, the SCSI errors come up a few minutes
later; there are no errors with the USB flash disk.

Comment 34 Pete Zaitcev 2008-04-30 01:18:43 UTC
The patch in comment #32 works for me.

Comment 35 Shane Huang 2008-04-30 01:37:39 UTC
Bhavana and Pete,

Can you add the combined patch in comment #32 to the RHEL4.7 kernel before
its final release?

And can you provide us with a test rpm package for our QA to test?

Thanks


Comment 36 Pete Zaitcev 2008-04-30 01:54:40 UTC
Bhavana has to get an exception for it from the PM team.

Comment 37 Russell Doty 2008-04-30 13:32:16 UTC
Bhavana, please report when the patch has its rhkernel acks.

Comment 38 Bhavna Sarathy 2008-04-30 13:51:42 UTC
I most certainly can, stay tuned.  But folks reviewing would want to know if the
patch has an exception.  Peter?

Comment 39 Peter Martuccelli 2008-04-30 14:58:42 UTC
The review process is separate, regardless RHKL ACKs will be required first for
this issue to be approved as an exception.  Plus I will need a write-up on the
combined patch - specifically the difference between the original patch(comment
6), Pete's follow-up patch(comment 17), and the combined patch(comment 32).  I
see additional calls to qh_refresh() in ehci-q.c in the combined patch, I assume
this is based on comment 31 and was required to resolve the SCSI errors.

Has anyone seen any additional SCSI errors when testing with the combined patch,
(comment 32)?

Comment 40 Pete Zaitcev 2008-04-30 16:06:38 UTC
I haven't seen the error with any combination of patches. It has
something to do with the HD enclosure AMD was using.

Comment 41 Bhavna Sarathy 2008-05-01 16:40:43 UTC
Patch posted to RHML Apr 30, two ACKs (PeteZ, PeterM).

Brew build for testing:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1300222

Comment 42 Shane Huang 2008-05-06 01:46:21 UTC
Our QA's full USB test of the x86_64 test rpm package provided by Bhavana,
which includes the combined patch in comment #32, passed.

So can you add the patch in comment #32 to the RHEL4.7 kernel ASAP?

The test result for the i386 rpm package will come out tomorrow.

Thanks


Comment 43 Shane Huang 2008-05-07 02:39:58 UTC
The i386 test kernel with the combined patch in comment #32 also works well
on our SB700 board.  Thanks.


Comment 44 Bhavna Sarathy 2008-05-07 20:35:03 UTC
(In reply to comment #39)
> The review process is separate, regardless RHKL ACKs will be required first for
> this issue to be approved as an exception.  Plus I will need a write-up on the
> combined patch - specifically the difference between the original patch(comment
> 6), Pete's follow-up patch(comment 17), and the combined patch(comment 32).  I
> see additional calls to qh_refresh() in ehci-q.c in the combined patch, I assume
> this is based on comment 31 and was required to resolve the SCSI errors.
> 

Pete's patch adds the following:
Function unlink_async()
Uncomments the function in QH_STATE_COMPLETING:
Updates QH_STATE_LINKED to periodically self-unlink when empty and call
unlink_async()
Adds qh_refresh() and invokes refresh in QH_STATE_IDLE

Henry's changes to comment #6 + comment#17 are:
a call added to qh_make() to refresh QH via qh_refresh()
Code added to qh_link_async() to clear halt or toggle to recover from 
silicon quirk, via a call to qh_refresh() if qh_state is QH_STATE_IDLE

Both of these changes complete the backport of qh_refresh that Pete's
patch included.  These changes are necessary to fix the SCSI errors
that were occurring during the block I/O requests on AMD hardware.
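
The refresh-on-idle rule described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the toy_* types and functions are hypothetical stand-ins and are not the code in ehci-q.c, and the single overlay_stale flag stands in for whatever halt/toggle state in the queue head overlay has gone out of date while the QH sat idle.

```c
#include <stdbool.h>

/* Hypothetical, simplified subset of the QH states named in this bug. */
enum toy_qh_state {
    TOY_QH_STATE_IDLE,
    TOY_QH_STATE_LINKED
};

/* Toy queue head: only the fields needed to show the rule. */
struct toy_qh {
    enum toy_qh_state state;
    bool overlay_stale;  /* stand-in for an out-of-date halt/toggle overlay */
    int refresh_count;   /* how many times the overlay was refreshed        */
};

/* Stand-in for qh_refresh(): bring the overlay back in sync. */
static inline void toy_qh_refresh(struct toy_qh *qh)
{
    qh->overlay_stale = false;
    qh->refresh_count++;
}

/*
 * Stand-in for the link step: an idle QH is refreshed before it is put
 * back on the async schedule; an already linked QH is left alone.
 */
static inline void toy_qh_link_async(struct toy_qh *qh)
{
    if (qh->state == TOY_QH_STATE_IDLE)
        toy_qh_refresh(qh);
    qh->state = TOY_QH_STATE_LINKED;
}
```

In this model, linking an idle QH whose overlay is stale clears the stale flag exactly once, while linking an already linked QH does not refresh it again. The real driver works on hardware descriptors, so this only shows the ordering (refresh first, then link), not the actual register or descriptor updates.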

Complete QA tests at AMD have shown that the combined patch in #32 works
without any issues.

> Has anyone seen any additional SCSI errors when testing with the combined patch,
> (comment 32)?

AMD has not seen any errors and Pete said he has not either.

Peter, do you have enough information for you and Russ to ask for the 
exception for this patch for RHEL4.7?    

Regards,
Bhavana

Comment 46 Shane Huang 2008-05-28 02:05:14 UTC
I do not find the combined patch in RHEL4.7 Beta (kernel 2.6.9-70).

Can anyone tell me which kernel version will include this patch?

Thanks


Comment 47 Bhavna Sarathy 2008-05-28 13:57:27 UTC
snapshot 1

Comment 48 Vivek Goyal 2008-05-29 20:51:07 UTC
Committed in 71.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 50 Shane Huang 2008-06-16 02:46:54 UTC
verified with Snapshot1, thanks.

Comment 51 Pete Zaitcev 2008-07-10 15:52:07 UTC
Dell reports a pretty clear regression with this in bug 454479.
They often use high-speed HID devices coming from Avocent, so
EHCI is directly involved.

Comment 53 errata-xmlrpc 2008-07-24 19:27:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html