Bug 435787
Summary: | RHEL4.7: USB stress test failure on AMD SBX00 | |
---|---|---|---
Product: | Red Hat Enterprise Linux 4 | Reporter: | Bhavna Sarathy <bnagendr>
Component: | kernel | Assignee: | Bhavna Sarathy <bnagendr>
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner>
Severity: | high | Priority: | high
Version: | 4.7 | Target Milestone: | rc
Hardware: | All | OS: | Linux
Fixed In Version: | RHSA-2008-0665 | Doc Type: | Bug Fix
Last Closed: | 2008-07-24 19:27:14 UTC | Bug Blocks: | 335381
CC: | bhavna.sarathy, nmsasiku, peterm, rdoty, shane.huang, vgoyal, zaitcev | |
Description
Bhavna Sarathy
2008-03-03 21:35:31 UTC
Russ, please add this to the 4.7 tracker, for 5.2 and 4.7 bug-fix parity. Shane, is this fix needed for RHEL 4.7 as well?

Yes, but the backported patches may be different. I will check them first.

Created attachment 297411
backported patch to fix it
The third patch in http://lkml.org/lkml/2008/2/19/546 has been replaced by another one, http://marc.info/?l=linux-usb&m=120469059715031&w=2, because the latter is an improvement. Since kernel 2.6.9 is fairly old and differs a lot from 2.6.25, porting the patches involves many dependencies.

Bhavna, can you add some USB experts from Red Hat to this Bugzilla CC list, so that they can help review the backported patch (comment #4) for potential regressions?

Created attachment 297591
backported patch to fix it (updated)
The third patch, http://marc.info/?l=linux-usb&m=120469059715031&w=2, has also been added to Linus's source tree as: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e82cc1288fa57857c6af8c57f3d07096d4bcd9d9 I also made some modifications to the original backported patch; please use the patch in comment #6 instead.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Committed in 68.27.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

I was tinkering with my local testbed for bug 441552, and experienced a hang on -68.32.EL. I did rmmod usb_storage && modprobe usb_storage and the modprobe hung in =>command_abort=>hcd_unlink_urb. Clearly a lost interrupt. Unplugging the device clears the hang. I applied an additional patch and modprobe started working.

At this point I suspect that Shane's backport was not 100% correct for the 2.6.9 base which we have in RHEL 4. It looks like he busted the useful code in ehci_urb_dequeue(), and I missed that when I reviewed. I'm wondering now if the stress test AMD was running included command aborts. That's where all bugs always crop up. Maybe they just transferred a lot of data and called it a day. But the error processing is much trickier than just doing transfers.

Created attachment 301746
Fix-ups which worked for me
There is a big difference between 2.6.9 and an up-to-date kernel like 2.6.25, including that unlink_async() is not implemented in 2.6.9. So it's possible that my backport is incomplete. That's why I asked Red Hat USB experts to review my patch in comment #5. Our stress test focused on data transfer, as you guessed; the command-abort case was not tested well. Any other EHCI patches for RHEL4 are welcome. Thanks. BTW: I'm not able to view BZ #441552, can you also put me in the CC list?

There's nothing interesting in 441552 for us, it's just a use-after-free in usb-storage. It probably has access bits set because a partner reported it through an escalation. I'm going to make a better fix-up patch tomorrow.

OK, thanks. Please provide your patch when it is ready; I can ask our QA to test it again on RHEL4.

Created attachment 301913
Fix-up patch
Since we're following bleeding edge upstream here, I thought it prudent
to swipe the new update_async and otherwise get a bit closer to upstream.
I tested this patch on top of -68.32.EL with basic unit tests. Now we
need to build a kernel for AMD to test. Maybe Bhavna can help?
Reverted this patch in the 68.34 build as it had created issues. Putting the bug back to assigned state.

OK. I'm combining the patch in comment #6 and the patch in comment #17 into a single one, and will then ask our QA to test the combined patch. I will submit the combined patch here after our QA's test. Thanks.

Pete's patch in comment #17 PLUS my patch in comment #6 lead to the error:
SCSI error : <5 0 0 0> return code = 0x6000000
on kernel 2.6.9-68.34 or 2.6.9-67 when I mount external USB HDD partitions, while I do NOT see a similar error with my patch alone. Pete, would you please check your patch again? Did you find this error on your platform? Thanks

Is there any update on the "SCSI error"? Thanks

I have done my tests from the beginning, with various versions. The result is the same: modprobe ehci_hcd (or modprobe usb_storage) hangs with just the patch from comment #6 and works with both. So, no change on my end from what I reported previously. I tried different controllers (but not SB600):
05:04.3 USB Controller: ALi Corporation USB 2.0 Controller (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
I have access to an SB600:
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
But for some reason I'm unable to run RHEL 4 on that system (issues with either LVM2 or SATA not recognized), and the USB is onboard, so that's not tested. Lots of time wasted though. The error Shane reported in comment #23 did not appear. However, I'm quite ready to accept that my fix-up made it worse for his case. So Peter was wise not to bank on my patch.

Shane, please attach an unmodified dmesg, I would like to see it.

Created attachment 303125
dmesg under 2.6.9-67 plus two EHCI patches
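For readers trying to interpret the return code 0x6000000 quoted in the error above, here is a minimal decoding sketch. It assumes the result word uses the byte layout of the 2.6-era include/scsi/scsi.h macros (driver byte in bits 24-31, host byte in bits 16-23, message byte in bits 8-15, status in the low byte); that layout and the DRIVER_* names are an assumption taken from that header, not something stated in this bug. The function name decode_scsi_result is hypothetical.

```python
# Assumed layout (2.6-era include/scsi/scsi.h): driver | host | msg | status.
DRIVER_BYTE_NAMES = {
    0x00: "DRIVER_OK",
    0x01: "DRIVER_BUSY",
    0x02: "DRIVER_SOFT",
    0x03: "DRIVER_MEDIA",
    0x04: "DRIVER_ERROR",
    0x05: "DRIVER_INVALID",
    0x06: "DRIVER_TIMEOUT",
    0x07: "DRIVER_HARD",
    0x08: "DRIVER_SENSE",
}


def decode_scsi_result(result):
    """Split a SCSI result word into its four bytes (hypothetical helper)."""
    driver = (result >> 24) & 0xFF
    return {
        "driver": driver,
        # The low nibble carries the driver code; the high nibble holds
        # "suggest" hints, which are ignored for the name lookup.
        "driver_name": DRIVER_BYTE_NAMES.get(driver & 0x0F, "unknown"),
        "host": (result >> 16) & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "status": result & 0xFF,
    }


decoded = decode_scsi_result(0x6000000)
print(decoded["driver_name"], hex(decoded["host"]))
```

Under that assumed layout the code carries only a driver byte of 0x06, with the host, message and status bytes all zero, which would point at a command timeout rather than a device-reported status.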
The "SCSI error" appears several seconds after I plug in the USB HDD; it is not related to "mount". Please check it.

Pete, is there any update on your side? Can you reproduce the SCSI error?

No change, it's not happening here. But actually it may be a good thing. In case the abort didn't work properly previously and started working now, the error handler would be able to print that message. I would not be too concerned IF the device continued to work (if sdb1 is mountable and accessible).

So, will the EHCI patches in comment #6 and comment #17 be added to the RHEL4.7 kernel?

The SCSI errors come from the SCSI completion processing for block I/O requests; the external hard disk is actually mountable and accessible even when these errors come up. The backport of qh_refresh is not complete: we still need to refresh the queue head when the state of the QH is idle. I combined Shane's and Pete's patches into one; it should fix all issues (modprobe hang and SCSI errors). Pete, please review and test this patch on your side, thank you!

Created attachment 304091
all in one patch for the stress test failure
If you plug in the USB hard disk, the SCSI errors come up a few minutes later; there are no errors with the USB flash disk.

The patch in comment #32 works for me.

Bhavna and Pete, can you add the combined patch in #32 to the RHEL4.7 kernel before its final release? And can you provide us with a testing rpm package for our QA's test? Thanks

Bhavna has to get an exception for it from the PM team. Bhavna, please report when the patch has its rhkernel acks.

I most certainly can, stay tuned. But folks reviewing would want to know if the patch has an exception. Peter?

The review process is separate, regardless RHKL ACKs will be required first for this issue to be approved as an exception. Plus I will need a write-up on the combined patch - specifically the difference between the original patch (comment 6), Pete's follow-up patch (comment 17), and the combined patch (comment 32). I see additional calls to qh_refresh() in ehci-q.c in the combined patch, I assume this is based on comment 31 and was required to resolve the SCSI errors. Has anyone seen any additional SCSI errors when testing with the combined patch (comment 32)?

I haven't seen the error with any combination of patches. It has something to do with the HD enclosure AMD was using.

Patch posted to RHML Apr 30, two ACKs (PeteZ, PeterM). Brew build for testing: http://brewweb.devel.redhat.com/brew/taskinfo?taskID=1300222

Our QA's full USB test of the x86_64 testing rpm package provided by Bhavna, which includes the combined patch in comment #32, passed. So can you guys add the patch in comment #32 to the RHEL4.7 kernel ASAP? The test result for the i386 rpm package will come out tomorrow. Thanks

The i386 testing kernel with the combined patch in comment #32 also works well on our SB700 board. Thanks.

(In reply to comment #39)
> The review process is separate, regardless RHKL ACKs will be required first for
> this issue to be approved as an exception. Plus I will need a write-up on the
> combined patch - specifically the difference between the original patch(comment
> 6), Pete's follow-up patch(comment 17), and the combined patch(comment 32). I
> see additional calls to qh_refresh() in ehci-q.c in the combined patch, I assume
> this is based on comment 31 and was required to resolve the SCSI errors.
>
Pete's patch adds the following:
Function unlink_async()
Uncomments the function in QH_STATE_COMPLETING:
Updates QH_STATE_LINKED to periodically self-unlink when empty and call unlink_async()
Adds qh_refresh() and invokes refresh in QH_STATE_IDLE

Henry's changes to comment #6 + comment #17 are:
A call added to qh_make() to refresh QH via qh_refresh()
Code added to qh_link_async() to clear halt or toggle to recover from silicon quirk, via a call to qh_refresh() if qh_state is QH_STATE_IDLE

Both of these changes complete the backport of qh_refresh that Pete's patch included. They are necessary to fix the SCSI errors that were occurring during block I/O requests on AMD hardware. Complete QA tests at AMD have shown that the combined patch in #32 works without any issues.

> Has anyone seen any additional SCSI errors when testing with the combined patch,
> (comment 32)?
AMD has not seen any errors and Pete said he has not either. Peter, do you have enough information for you and Russ to ask for the exception for this patch for RHEL4.7? Regards, Bhavna

I do not find the combined patch in RHEL4.7 Beta (kernel 2.6.9-70). Can anyone tell me from which kernel version this patch will be included? Thanks

snapshot 1

Committed in 71.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

verified with Snapshot1, thanks.

Dell reports a pretty clear regression with this in bug 454479. They often use high-speed HID devices coming from Avocent, so EHCI is directly involved.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2008-0665.html
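The write-up of the combined patch earlier in this thread says qh_link_async() now calls qh_refresh() when the queue head is in QH_STATE_IDLE, to clear halt or toggle state before the QH is linked. The rule can be sketched with a toy user-space model. This is not the kernel code: the names qh_refresh, qh_link_async, QH_STATE_IDLE and QH_STATE_LINKED are borrowed from the thread, while the ToyQH class, its halted field and the refresh_when_idle switch are hypothetical stand-ins for illustration only.

```python
from dataclasses import dataclass

QH_STATE_IDLE = "idle"
QH_STATE_LINKED = "linked"


@dataclass
class ToyQH:
    """Toy queue head; 'halted' stands in for stale halt/toggle state."""
    state: str = QH_STATE_IDLE
    halted: bool = False


def qh_refresh(qh):
    # Toy refresh: drop stale halt state before the controller sees the QH.
    qh.halted = False


def qh_link_async(qh, refresh_when_idle=True):
    # The rule described in the thread: refresh an idle QH before linking it.
    # refresh_when_idle=False models a backport that lacks this step.
    if refresh_when_idle and qh.state == QH_STATE_IDLE:
        qh_refresh(qh)
    qh.state = QH_STATE_LINKED
    return qh


fixed = qh_link_async(ToyQH(halted=True))
unfixed = qh_link_async(ToyQH(halted=True), refresh_when_idle=False)
print(fixed.halted, unfixed.halted)  # prints "False True"
```

In this model a QH that goes idle with stale halt state is linked back in still halted unless the refresh runs first. Whether that is exactly the mechanism behind the SCSI errors reported here is not something the thread confirms.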