Red Hat Bugzilla – Bug 435787
RHEL4.7: USB stress test failure on AMD SBX00
Last modified: 2009-08-10 03:27:26 EDT
+++ This bug was initially created as a clone of Bug #435670 +++
Description of problem:
There is one SB600/SB700 USB bug which leads to USB stress test failure.
The workaround contains three linux patches:
I backported these three patches for RHEL5.2; please add them to the kernel.
-- Additional comment from firstname.lastname@example.org on 2008-03-03 04:01 EST --
Created an attachment (id=296565)
backported patch to fix USB stress test failure
-- Additional comment from email@example.com on 2008-03-03 04:04 EST --
You may add this patch and build one kernel rpm package for us,
I can ask our QA to test this rpm package. Thanks
-- Additional comment from firstname.lastname@example.org on 2008-03-03 09:31 EST --
Russ, please add to the 5.2 master bug list. Peter has indicated that this
patch will go into a snapshot build.
-- Additional comment from email@example.com on 2008-03-03 10:19 EST --
Requesting inclusion in a Beta snapshot.
-- Additional comment from firstname.lastname@example.org on 2008-03-03 15:16 EST --
Shane, thanks for attaching the backport. Brew build uploaded for AMD chipset
QA team to test.
Russ, please add to the 4.7 tracker, for 5.2 and 4.7 bug fix parity.
Shane, is this fix needed for R4.7 as well?
Yes, but the backported patches may be different, I will check them first.
Created attachment 297411 [details]
backported patch to fix it
The third patch in http://lkml.org/lkml/2008/2/19/546
has been replaced by another one:
because the latter can make things better.
Since kernel 2.6.9 is a little old and there is much difference between
2.6.9 and 2.6.25, there are many dependencies in porting the patches.
Can you add some USB experts from Red Hat to this bugzilla CC list,
so that they can help review the backported patch (comment #4)
for potential regression issues?
Created attachment 297591 [details]
backported patch to fix it (updated)
The third patch http://marc.info/?l=linux-usb&m=120469059715031&w=2
has been added to linus source tree too, which is:
And I also did some modification to the original backported patch,
please use the patch in comment #6 instead.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
Committed in 68.27.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
I was tinkering with my local testbed for bug 441552, and experienced
a hang on -68.32.EL. I did rmmod usb_storage && modprobe usb_storage
and the modprobe hung in =>command_abort=>hcd_unlink_urb. Clearly a
lost interrupt. Unplugging the device clears. I applied an additional
patch and modprobe started working.
At this point I suspect that Shane's backport was not 100% correct
for the 2.6.9 base which we have in RHEL 4. It looks like he busted
the useful code in ehci_urb_dequeue(), and I missed that when I reviewed.
I'm wondering now if the stress test AMD was running included command
aborts. That's where all bugs always crop up. Maybe they just transferred
a lot of data and called it a day. But the error processing is much
trickier than just doing transfers.
Created attachment 301746 [details]
Fix-ups which worked for me
There is much difference between 2.6.9 and up to date kernel like 2.6.25,
including that unlink_async() is not implemented in 2.6.9.
So it's possible that my backport is incomplete. That's why I asked
RedHat USB experts to review my patch in comment #5.
Our stress test focused on data transfer, as you guessed;
the command abort case was not tested well.
Any other EHCI patches for RHEL4 are welcome. Thanks
I'm not able to visit BZ #441552, can you also put me in the CC list?
There's nothing interesting in 441552 for us, it's just a use-after-free
in usb-storage. It probably has access bits set because a partner reported
it through an escalation.
I'm going to make a better fix-up patch tomorrow.
OK, thanks. please provide your patch when it is ready,
I can ask our QA to test it again on RHEL4.
Created attachment 301913 [details]
Since we're following bleeding edge upstream here, I thought it prudent
to swipe the new update_async and otherwise get a bit closer to upstream.
I tested this patch on top of -68.32.EL with basic unit tests. Now we
need to build a kernel for AMD to test. Maybe Bhavana can help?
Reverted this patch in 68.34 build as this patch had created issues. Putting the
bug back to assigned state.
OK. I'm combining the patch in comment #6 and patch in comment #17 into a
single one, then ask our QA to test the combined patch.
I will submit the combined patch here after our QA's test. Thanks.
Pete's patch in comment #17 PLUS my patch in comment #6 lead to error:
SCSI error : <5 0 0 0> return code = 0x6000000
on kernel 2.6.9-68.34 or 2.6.9-67 when I mount external USB HDD partitions,
while I do NOT find the similar error with my patch alone.
Pete, would you please check your patch again? Did you find this error on your
Is there any update on the "SCSI error"?
I have done my tests from the beginning, with various versions. The result
is the same, modprobe ehci_hcd (or modprobe usb_storage) hangs with just
the patch from comment #6 and works with both. So, no change on my end
from what I reported previously.
I tried different controllers (but not SB600):
05:04.3 USB Controller: ALi Corporation USB 2.0 Controller (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI
Controller (rev 01)
I have access to an SB600:
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
But for some reason I'm unable to run RHEL 4 on that system (issues
with either LVM2 or SATA not recognized), and the USB is onboard, so
that's not tested. Lots of time wasted though.
The error Shane reported in comment #23 did not appear. However, I'm
quite ready to accept that my fix-up made it worse for his case.
So Peter was wise not to bank on my patch.
Shane, please attach an unmodified dmesg, I would like to see it.
Created attachment 303125 [details]
dmesg under 2.6.9-67 plus two EHCI patches
The "SCSI error" appears several seconds after I plug in the USB HDD;
it is not related to "mount". Please check it.
Pete, Is there any update at your side?
Can you duplicate the SCSI error?
No change, it's not happening here.
But actually it may be a good thing. In case the abort didn't work
properly previously and started working now, the error handler would
be able to print that message. I would not be too concerned IF the
device continued to work (if sdb1 is mountable and accessible).
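For reference, the return code in the reported message can be split into its byte fields. The sketch below is a user-space illustration, not kernel code. It assumes the 2.6-era layout from include/scsi/scsi.h (driver byte in bits 24-31, host byte in bits 16-23, low nibble of the driver byte holding the DRIVER_* status) and the DRIVER_* values as commonly documented, so check them against the RHEL 4 headers. The names scsi_result_fields, decode_scsi_result and driver_byte_name are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the four byte fields of a SCSI result word. */
struct scsi_result_fields {
    uint8_t status_raw; /* bits 0-7, raw (the kernel macros shift/mask further) */
    uint8_t msg;        /* bits 8-15 */
    uint8_t host;       /* bits 16-23 */
    uint8_t driver;     /* bits 24-31 */
};

/* Split a 32-bit result word into its byte fields. */
static struct scsi_result_fields decode_scsi_result(uint32_t result)
{
    struct scsi_result_fields f;

    f.status_raw = (uint8_t)(result & 0xff);
    f.msg        = (uint8_t)((result >> 8) & 0xff);
    f.host       = (uint8_t)((result >> 16) & 0xff);
    f.driver     = (uint8_t)((result >> 24) & 0xff);
    return f;
}

/* Map the driver byte to a DRIVER_* name (values assumed, see lead-in). */
static const char *driver_byte_name(uint8_t driver)
{
    static const char *const names[] = {
        "DRIVER_OK",      /* 0x00 */
        "DRIVER_BUSY",    /* 0x01 */
        "DRIVER_SOFT",    /* 0x02 */
        "DRIVER_MEDIA",   /* 0x03 */
        "DRIVER_ERROR",   /* 0x04 */
        "DRIVER_INVALID", /* 0x05 */
        "DRIVER_TIMEOUT", /* 0x06 */
        "DRIVER_HARD",    /* 0x07 */
        "DRIVER_SENSE",   /* 0x08 */
    };
    const size_t count = sizeof(names) / sizeof(names[0]);
    const uint8_t code = driver & 0x0f; /* assumed: upper nibble is "suggest" bits */

    assert(count == 9);
    return code < count ? names[code] : "unknown";
}
```

Under these assumptions, 0x6000000 decodes to driver byte 0x06 (DRIVER_TIMEOUT) with the host, message, and status bytes all zero, which would be consistent with a command timing out and the error handler then running.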
So, will the EHCI patches in comment #6 and comment #17 be added to
The SCSI errors come from the SCSI completion processing for block I/O
requests. Actually, the external hard disk is mountable and accessible even
when these errors come up.
The backport of qh_refresh is not complete: we still need to refresh the
queue head when the state of the QH is idle. I combined Shane's and Pete's
patches into one; it should fix all issues (modprobe hang and SCSI errors).
Pete, please review and test this patch on your side, thank you!
Created attachment 304091 [details]
all in one patch for the stress test failure
If you plug in the USB hard disk, the SCSI errors come up a few minutes later;
there are no errors with the USB flash disk.
The patch in comment #32 works for me.
Bhavana and Pete,
Can you add this combined patch in #32 into RHEL4.7 kernel before its final
And can you provide us one testing rpm package for our QA's test?
Bhavana has to get an exception for it from the PM team.
Bhavana, please report when the patch has its rhkernel acks.
I most certainly can, stay tuned. But folks reviewing would want to know if the
patch has an exception. Peter?
The review process is separate, regardless RHKL ACKs will be required first for
this issue to be approved as an exception. Plus I will need a write-up on the
combined patch - specifically the difference between the original patch(comment
6), Pete's follow-up patch(comment 17), and the combined patch(comment 32). I
see additional calls to qh_refresh() in ehci-q.c in the combined patch, I assume
this is based on comment 31 and was required to resolve the SCSI errors.
Has anyone seen any additional SCSI errors when testing with the combined patch
(comment 32)?
I haven't seen the error with any combination of patches. It has
something to do with the HD enclosure AMD was using.
Patch posted to RHML Apr 30, two ACKs (PeteZ, PeterM).
Brew build for testing:
Our QA's full USB test of the x86_64 testing rpm package provided by Bhavana,
which includes the combined patch in comment #32, passed.
So can you guys add the patch in comment #32 to RHEL4.7 kernel ASAP?
The test result of i386 rpm package will come out tomorrow.
The i386 testing kernel with the combined patch in comment #32 also works well
on our SB700 board. Thanks.
(In reply to comment #39)
> The review process is separate, regardless RHKL ACKs will be required first for
> this issue to be approved as an exception. Plus I will need a write-up on the
> combined patch - specifically the difference between the original patch(comment
> 6), Pete's follow-up patch(comment 17), and the combined patch(comment 32). I
> see additional calls to qh_refresh() in ehci-q.c in the combined patch, I assume
> this is based on comment 31 and was required to resolve the SCSI errors.
Pete's patch adds the following:
Uncomments the function in QH_STATE_COMPLETING:
Updates QH_STATE_LINKED to periodically self-unlink when empty and call
Adds qh_refresh() and invokes refresh in QH_STATE_IDLE
Henry's changes to comment #6 + comment#17 are:
a call added to qh_make() to refresh QH via qh_refresh()
Code added to qh_link_async() to clear halt or toggle to recover from
silicon quirk, via a call to qh_refresh() if qh_state is QH_STATE_IDLE
Both these changes completed the back port of qh_refresh that Pete's
patch included. These changes are necessary to fix the SCSI errors
that were occurring during the block I/O requests on AMD hardware.
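The two additions described above can be pictured with a small user-space model. This is only a sketch of the control flow as described in this comment, not the code from the combined patch: the model_* names, the struct fields, and the reduced state set are all invented for illustration, and the real qh_refresh() operates on hardware queue head fields that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced, hypothetical state set; the driver has more states than this. */
enum model_qh_state {
    MODEL_QH_IDLE,
    MODEL_QH_LINKED
};

/* Hypothetical stand-in for a queue head. */
struct model_qh {
    enum model_qh_state state;
    bool overlay_halted;    /* stale halt bit left in the overlay area */
    unsigned refresh_count; /* how often the refresh step has run */
};

/* Stand-in for qh_refresh(): clear the stale halt state before reuse. */
static void model_qh_refresh(struct model_qh *qh)
{
    assert(qh != 0);
    qh->overlay_halted = false;
    qh->refresh_count++;
}

/* Stand-in for qh_make(): a new QH is refreshed once at creation. */
static struct model_qh model_qh_make(void)
{
    struct model_qh qh = { MODEL_QH_IDLE, false, 0 };

    model_qh_refresh(&qh);
    return qh;
}

/* Stand-in for qh_link_async(): refresh only when the QH was idle. */
static void model_qh_link_async(struct model_qh *qh)
{
    assert(qh != 0);
    if (qh->state == MODEL_QH_IDLE)
        model_qh_refresh(qh);
    qh->state = MODEL_QH_LINKED;
}
```

In this model an idle queue head that still carries a halt flag is cleaned up at the moment it is linked again, while an already linked one is left alone.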
Complete QA tests at AMD have shown that the combined patch in #32 works
without any issues.
> Has anyone seen any additional SCSI errors when testing with the combined patch,
> (comment 32)?
AMD has not seen any errors and Pete said he has not either.
Peter, do you have enough information for you and Russ to ask for the
exception for this patch for RHEL4.7?
I do not find the combined patch in RHEL4.7 Beta (kernel 2.6.9-70).
Can anyone tell me from which kernel version this patch
will be added?
Committed in 71.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
verified with Snapshot1, thanks.
Dell reports a pretty clear regression with this in bug 454479.
They often use high-speed HID devices coming from Avocent, so
EHCI is directly involved.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.