+++ This bug was initially created as a clone of Bug #435670 +++
Description of problem: There is an SB600/SB700 USB bug which leads to USB stress test failure:
http://bugzilla.kernel.org/show_bug.cgi?id=8692
The workaround consists of three Linux patches:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=07d29b63ef6b39963ab37818653284d861cf55af
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f8fa7571a928d6d0e1b7444b0ea69ec7dc7db3b6
http://lkml.org/lkml/2008/2/19/546
I backported these three patches for RHEL5.2, please add them to the kernel. Thanks
-- Additional comment from shane.huang on 2008-03-03 04:01 EST --
Created an attachment (id=296565) backported patch to fix USB stress test failure
-- Additional comment from shane.huang on 2008-03-03 04:04 EST --
You may add this patch and build a kernel rpm package for us; I can ask our QA to test this rpm package. Thanks
-- Additional comment from bnagendr on 2008-03-03 09:31 EST --
Russ, please add to the 5.2 master bug list. Peter has indicated that this patch will go into a snapshot build. Bhavana
-- Additional comment from rdoty on 2008-03-03 10:19 EST --
Requesting inclusion in a Beta snapshot.
-- Additional comment from bnagendr on 2008-03-03 15:16 EST --
Shane, thanks for attaching the backport. Brew build uploaded for AMD chipset QA team to test. http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1193626
Russ, please add to the 4.7 tracker, for 5.2 and 4.7 bug fix parity.
Shane, is this fix needed for R4.7 as well?
Yes, but the backported patches may be different, I will check them first.
Created attachment 297411: backported patch to fix it
The third patch in http://lkml.org/lkml/2008/2/19/546 has been replaced by another one: http://marc.info/?l=linux-usb&m=120469059715031&w=2 because the latter works better. Since kernel 2.6.9 is rather old and differs considerably from 2.6.25, porting the patches involves many dependencies. Bhavana, can you add some USB experts from Red Hat to this bug's CC list, so that they can help review the backported patch (comment #4) for potential regressions?
Created attachment 297591: backported patch to fix it (updated)
The third patch http://marc.info/?l=linux-usb&m=120469059715031&w=2 has been added to Linus's source tree too, as: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e82cc1288fa57857c6af8c57f3d07096d4bcd9d9 I also made some modifications to the original backported patch, so please use the patch in comment #6 instead.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in 68.27.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
I was tinkering with my local testbed for bug 441552, and experienced a hang on -68.32.EL. I did rmmod usb_storage && modprobe usb_storage and the modprobe hung in =>command_abort=>hcd_unlink_urb. Clearly a lost interrupt. Unplugging the device clears. I applied an additional patch and modprobe started working. At this point I suspect that Shane's backport was not 100% correct for the 2.6.9 base which we have in RHEL 4. It looks like he busted the useful code in ehci_urb_dequeue(), and I missed that when I reviewed. I'm wondering now if the stress test AMD was running included command aborts. That's where all bugs always crop up. Maybe they just transferred a lot of data and called it a day. But the error processing is much trickier than just doing transfers.
Created attachment 301746: Fix-ups which worked for me
There is much difference between 2.6.9 and an up-to-date kernel like 2.6.25, including that unlink_async() is not implemented in 2.6.9. So it's possible that my backport is incomplete. That's why I asked Red Hat USB experts to review my patch in comment #5. Our stress test focused on data transfer as you guessed; the command abort case was not tested well. Any other EHCI patch for RHEL4 is welcome. Thanks. BTW: I'm not able to view BZ #441552, can you also put me on the CC list?
There's nothing interesting in 441552 for us, it's just a use-after-free in usb-storage. It probably has access bits set because a partner reported it through an escalation. I'm going to make a better fix-up patch tomorrow.
OK, thanks. please provide your patch when it is ready, I can ask our QA to test it again on RHEL4.
Created attachment 301913: Fix-up patch
Since we're following bleeding edge upstream here, I thought it prudent to swipe the new update_async and otherwise get a bit closer to upstream. I tested this patch on top of -68.32.EL with basic unit tests. Now we need to build a kernel for AMD to test. Maybe Bhavana can help?
Reverted this patch in 68.34 build as this patch had created issues. Putting the bug back to assigned state.
OK. I'm combining the patch in comment #6 and patch in comment #17 into a single one, then ask our QA to test the combined patch. I will submit the combined patch here after our QA's test. Thanks.
Pete's patch in comment #17 PLUS my patch in comment #6 leads to this error on kernel 2.6.9-68.34 or 2.6.9-67 when I mount external USB HDD partitions:
SCSI error : <5 0 0 0> return code = 0x6000000
I do NOT see a similar error with my patch alone. Pete, would you please check your patch again? Did you see this error on your platform? Thanks
Is there any update on the "SCSI error"? Thanks
I have done my tests from the beginning, with various versions. The result is the same: modprobe ehci_hcd (or modprobe usb_storage) hangs with just the patch from comment #6 and works with both. So, no change on my end from what I reported previously. I tried different controllers (but not SB600):
05:04.3 USB Controller: ALi Corporation USB 2.0 Controller (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
I have access to an SB600:
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
But for some reason I'm unable to run RHEL 4 on that system (issues with either LVM2 or SATA not recognized), and the USB is onboard, so that's not tested. Lots of time wasted though. The error Shane reported in comment #23 did not appear. However, I'm quite ready to accept that my fix-up made it worse for his case. So Peter was wise not to bank on my patch. Shane, please attach an unmodified dmesg, I would like to see it.
Created attachment 303125: dmesg under 2.6.9-67 plus two EHCI patches
The "SCSI error" appears several seconds after I plug in the USB HDD; it is not related to "mount". Please check it.
Pete, Is there any update at your side? Can you duplicate the SCSI error?
No change, it's not happening here. But actually it may be a good thing. In case the abort didn't work properly previously and started working now, the error handler would be able to print that message. I would not be too concerned IF the device continued to work (if sdb1 is mountable and accessible).
So, will the EHCI patches in comment #6 and comment #17 be added to RHEL4.7 kernel?
The SCSI errors come from the SCSI completion processing for block I/O requests. The external hard disk is actually mountable and accessible even when these errors come up. The backport of qh_refresh is not complete: we still need to refresh the queue head when the state of the QH is idle. I combined Shane's and Pete's patches into one; it should fix all issues (modprobe hang and SCSI errors). Pete, please review and test this patch on your side, thank you!
Created attachment 304091: all in one patch for the stress test failure
If you plug in the USB hard disk, the SCSI errors come up a few minutes later. There are no errors with the USB flash disk.
The patch in comment #32 works for me.
Bhavana and Pete, Can you add this combined patch in #32 into RHEL4.7 kernel before its final release? And can you provide us one testing rpm package for our QA's test? Thanks
Bhavana has to get an exception for it from the PM team.
Bhavana, please report when the patch has its rhkernel acks.
I most certainly can, stay tuned. But folks reviewing would want to know if the patch has an exception. Peter?
The review process is separate; regardless, RHKL ACKs will be required first for this issue to be approved as an exception. Plus I will need a write-up on the combined patch, specifically the difference between the original patch (comment 6), Pete's follow-up patch (comment 17), and the combined patch (comment 32). I see additional calls to qh_refresh() in ehci-q.c in the combined patch; I assume this is based on comment 31 and was required to resolve the SCSI errors. Has anyone seen any additional SCSI errors when testing with the combined patch (comment 32)?
I haven't seen the error with any combination of patches. It has something to do with the HD enclosure AMD was using.
Patch posted to RHML Apr 30, two ACKs (PeteZ, PeterM). Brew build for testing: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1300222
Our QA's full USB test of the x86_64 testing rpm package provided by Bhavana, which includes the combined patch in comment #32, passed. So can you add the patch in comment #32 to the RHEL4.7 kernel ASAP? The test result for the i386 rpm package will come out tomorrow. Thanks
The i386 testing kernel with the combined patch in comment #32 also works well on our SB700 board. Thanks.
(In reply to comment #39)
> The review process is separate, regardless RHKL ACKs will be required first for
> this issue to be approved as an exception. Plus I will need a write-up on the
> combined patch - specifically the difference between the original patch(comment
> 6), Pete's follow-up patch(comment 17), and the combined patch(comment 32). I
> see additional calls to qh_refresh() in ehci-q.c in the combined patch, I assume
> this is based on comment 31 and was required to resolve the SCSI errors.
>
Pete's patch adds the following:
  Function unlink_async()
  Uncomments the function in QH_STATE_COMPLETING:
  Updates QH_STATE_LINKED to periodically self-unlink when empty and call unlink_async()
  Adds qh_refresh() and invokes refresh in QH_STATE_IDLE
Henry's changes to comment #6 + comment #17 are:
  A call added to qh_make() to refresh QH via qh_refresh()
  Code added to qh_link_async() to clear halt or toggle to recover from silicon quirk, via a call to qh_refresh() if qh_state is QH_STATE_IDLE
Both these changes completed the backport of qh_refresh that Pete's patch included. These changes are necessary to fix the SCSI errors that were occurring during the block I/O requests on AMD hardware. Complete QA tests at AMD have shown that the combined patch in #32 works without any issues.
> Has anyone seen any additional SCSI errors when testing with the combined patch,
> (comment 32)?
AMD has not seen any errors and Pete said he has not either. Peter, do you have enough information for you and Russ to ask for the exception for this patch for RHEL4.7?
Regards, Bhavana
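For readers who want a picture of the qh_link_async() change described above, here is a minimal userspace sketch. It is not the kernel code and not the contents of the attached patch: the toy_ names, the struct fields, and the two-state enum are hypothetical stand-ins, and the only behavior modeled is "refresh the queue head if it is idle, then mark it linked".

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the EHCI queue-head states
 * named in this thread; the real driver has more states. */
enum toy_qh_state { QH_STATE_IDLE, QH_STATE_LINKED };

/* Toy queue head: 'halted' models a stale halt condition left behind
 * in the QH, 'refreshes' counts how often the QH was refreshed. */
struct toy_qh {
    enum toy_qh_state state;
    bool halted;
    unsigned refreshes;
};

/* Stand-in for qh_refresh(): clears the stale halt condition. */
void toy_qh_refresh(struct toy_qh *qh)
{
    qh->halted = false;
    qh->refreshes++;
}

/* Stand-in for qh_link_async(): an idle QH is refreshed before it is
 * put back on the async schedule; an already linked QH is left alone. */
void toy_qh_link_async(struct toy_qh *qh)
{
    if (qh->state == QH_STATE_IDLE)
        toy_qh_refresh(qh);
    qh->state = QH_STATE_LINKED;
}
```

In this model, a QH that was left idle with a stale halt condition comes back clean when it is relinked, which is the effect the write-up attributes to the added qh_refresh() call.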
I do not find the combined patch in RHEL4.7 Beta (kernel 2.6.9-70). Can anyone tell me in which kernel version this patch will be added? Thanks
snapshot 1
Committed in 71.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
verified with Snapshot1, thanks.
Dell reports a pretty clear regression with this in bug 454479. They often use high-speed HID devices coming from Avocent, so EHCI is directly involved.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html