Description of problem:
System crashes at uhci_scan_schedule().

Version-Release number of selected component (if applicable):

How reproducible:
This occurred only once during our stress and breaker tests.

Steps to Reproduce:
1. Run stress and breaker tests

Actual results:
System crashes after 36 hours.

Expected results:
System does not crash.

Additional info:

crash> bt
PID: 1616  TASK: ffff8106a22bb080  CPU: 5  COMMAND: "tcp_test"
 #0 [ffff81011de17bb0] crash_kexec at ffffffff800ad517
 #1 [ffff81011de17c70] __die at ffffffff80066127
 #2 [ffff81011de17cb0] do_page_fault at ffffffff80067da7
 #3 [ffff81011de17da0] error_exit at ffffffff8005ede9
    [exception RIP: uhci_scan_schedule+162]
    RIP: ffffffff88021950  RSP: ffff81011de17e58  RFLAGS: 00010003
    RAX: 00000000001000f8  RBX: 00000000001000f8  RCX: 000000000000000b
    RDX: 0000000000000000  RSI: ffff81047a4ebf58  RDI: ffff81083eb81d50
    RBP: ffff81083eb81d50   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff81083eb81d50
    R13: ffff81083eb81c00  R14: ffff81047a4ebf58  R15: ffff81047a4ebf58
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #4 [ffff81011de17e50] uhci_scan_schedule at ffffffff88021918
 #5 [ffff81011de17ed0] uhci_irq at ffffffff88023cb8
 #6 [ffff81011de17f10] usb_hcd_irq at ffffffff801f1c1f
 #7 [ffff81011de17f20] handle_IRQ_event at ffffffff8001123b
 #8 [ffff81011de17f50] __do_IRQ at ffffffff800ba749
 #9 [ffff81011de17f90] do_IRQ at ffffffff8006d986
--- <IRQ stack> ---
#10 [ffff81047a4ebf58] ret_from_intr at ffffffff8005e615
    RIP: 000000000804ac27  RSP: 00000000f60dd010  RFLAGS: 00000293
    RAX: 0000000000000216  RBX: 00000000f60dd018  RCX: 0000000000000000
    RDX: 00000000f4c16008  RSI: 0000000000000000  RDI: 00000000f60ddb90
    RBP: 00000000f60dd018   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: fffffffffffffff4  CS: 0023  SS: 002b

This is from the log:

ACPI: PCI interrupt for device 0000:7c:00.1 disabled
Trying to free nonexistent resource <00000000a8000000-00000000afffffff>
Trying to free nonexistent resource <00000000a4800000-00000000a480ffff>
uhci_hcd 0000:7e:1d.0: remove, state 1
usb usb2: USB disconnect, address 1
usb 2-1: USB disconnect, address 2
Unable to handle kernel paging request at 0000000000100100 RIP:
 [<ffffffff88021950>] :uhci_hcd:uhci_scan_schedule+0xa2/0x89c
PGD 68de7c067 PUD 66477a067 PMD 0
Oops: 0000 [1] SMP

This is because uhci_scan_schedule() is working through uhci->skelqh, which has already been freed by uhci_stop(), called through usb_remove_hcd():

void usb_remove_hcd(struct usb_hcd *hcd)
{
        ………….
        hcd->driver->stop(hcd);
        hcd->state = HC_STATE_HALT;

        if (hcd->irq >= 0)
                free_irq(hcd->irq, hcd);
        usb_deregister_bus(&hcd->self);
        hcd_buffer_destroy(hcd);
}

Between uhci_stop() and free_irq(), a uhci_irq() occurs, resulting in the oops.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Major release. This request is not yet committed for inclusion.
I'm tempted to fix this in place by having ->stop take an appropriate lock, instead of pulling the whole hog in from upstream. But I'm open to any suggestions Stratus people may have about this. Pondering now.
Created attachment 357895
Test patch 1

I'm not certain that this is a 100% reliable solution, but let's try it out.
Please test the kernel 2.6.18-164.el5.bz516851.1 from this location: http://people.redhat.com/zaitcev/ftp/516851/
I will check with our SQA.
Does this change involve the uhci driver?
OK, I just noticed your diffs. Thanks.
Below is a comment from one of my colleagues:

The interrupt handler for the device gets unregistered by free_irq(). hcd->irq is just an int which holds the interrupt that was allocated. From what I can tell, uhci_stop() isn't touching anything, at least not directly, that is cleaned up by free_irq(). uhci_scan_schedule() does clear or set the Interrupt on Completion bit in the transfer descriptor of the uhci_hcd struct, though, and I'm still figuring out what the effect of that would be.

The window appears to be between hcd->driver->stop() and hcd->state = HC_STATE_HALT. If the state is HC_STATE_HALT, then usb_hcd_irq() does not call the driver's irq op; it just returns IRQ_NONE. Maybe the better option would be to set the state first and then call the stop op.
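To make the proposed reordering concrete, here is a minimal userspace model of it. This is a sketch, not kernel code: every model_* name is hypothetical, and the "late interrupt" is simulated as a plain function call placed right after the stop op. It only illustrates how a state check in the core irq entry would keep the driver's irq op away from an already freed schedule.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum model_state { MODEL_RUNNING, MODEL_HALT };
enum model_irqreturn { MODEL_IRQ_NONE, MODEL_IRQ_HANDLED };

struct model_hcd {
    enum model_state state;
    bool schedule_valid;  /* false once the stop op has freed the schedule */
    int bad_accesses;     /* times the irq op touched a freed schedule */
};

/* Driver irq op: walks the schedule, as uhci_irq() does via uhci_scan_schedule(). */
static enum model_irqreturn model_driver_irq(struct model_hcd *hcd)
{
    if (!hcd->schedule_valid)
        hcd->bad_accesses++;  /* stands in for the oops */
    return MODEL_IRQ_HANDLED;
}

/* Core irq entry: skips the driver op when the controller is marked halted. */
static enum model_irqreturn model_hcd_irq(struct model_hcd *hcd)
{
    if (hcd->state == MODEL_HALT)
        return MODEL_IRQ_NONE;
    return model_driver_irq(hcd);
}

/* Stop op: frees the schedule. */
static void model_stop(struct model_hcd *hcd)
{
    hcd->schedule_valid = false;
}

/* Removal path with one interrupt arriving right after the stop op. */
static int model_remove(struct model_hcd *hcd, bool halt_first)
{
    if (halt_first)
        hcd->state = MODEL_HALT;  /* proposed order: set the state, then stop */
    model_stop(hcd);
    model_hcd_irq(hcd);           /* simulated late interrupt */
    hcd->state = MODEL_HALT;      /* original order: state set after stop */
    return hcd->bad_accesses;
}
```

With halt_first false the simulated interrupt reaches the freed schedule once; with halt_first true it is turned away with MODEL_IRQ_NONE. The model is single-threaded, so it says nothing about a handler that is already running on another CPU when the state is changed.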
http://marc.info/?l=linux-usb&m=125071195604346&w=2

Alan Stern suggests that we wait out any pending interrupts after stopping the HC but before freeing the schedules.
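The ordering suggested there can be sketched as a small userspace model. Again this is only an illustration under stated assumptions: the model_* names are hypothetical, and the "wait" is simulated by running the in-flight handler to completion. In the kernel, the usual primitive for waiting on in-flight handlers is synchronize_irq(); whether the test patches use it is not stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the teardown ordering; not kernel code. */
struct model_hc {
    bool irq_enabled;        /* controller may still raise interrupts */
    bool handler_in_flight;  /* a handler has started and not yet returned */
    bool schedule_valid;     /* false once the schedules are freed */
    int bad_accesses;        /* times a handler touched freed schedules */
};

/* Let the in-flight handler finish; it touches the schedule one last time. */
static void model_finish_handler(struct model_hc *hc)
{
    if (!hc->handler_in_flight)
        return;
    if (!hc->schedule_valid)
        hc->bad_accesses++;  /* stands in for the oops */
    hc->handler_in_flight = false;
}

/* Step 1: stop the HC so that no new interrupts are raised. */
static void model_stop_hc(struct model_hc *hc)
{
    hc->irq_enabled = false;
}

/* Step 2: wait until any handler that already started has returned. */
static void model_wait_for_handlers(struct model_hc *hc)
{
    model_finish_handler(hc);
}

/* Step 3: free the schedules. */
static void model_free_schedules(struct model_hc *hc)
{
    hc->schedule_valid = false;
}

static int model_teardown(struct model_hc *hc, bool wait_first)
{
    model_stop_hc(hc);
    if (wait_first)
        model_wait_for_handlers(hc);
    model_free_schedules(hc);
    model_finish_handler(hc);  /* without the wait, the handler finishes here */
    return hc->bad_accesses;
}
```

Starting from a state with handler_in_flight set, model_teardown() reports one bad access when the wait is skipped and none when the wait comes between stopping the HC and freeing the schedules.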
Created attachment 358014
Test patch 2
Please test the kernel 2.6.18-164.el5.bz516851.2 from the same location: http://people.redhat.com/zaitcev/ftp/516851/

If 2.6.18-164.el5.bz516851.1 is already under test, don't stop the run; it's valuable too. When it finishes, please switch to the next one.
The tests completed without problems with the first set of changes. We will start tests on the second set of changes.
The tests with the second set of changes passed without problems.
I'm moving this back to RHEL 5. We'll clone this for RHEL 6 if needed. As it is, we may yet receive this through the -stable branch anyway.
Upstream commit d23356da714595b888686d22cd19061323c09190 We're looking at issues in RHEL 6 for bug 579093. Most likely it's unrelated, but I need some certainty before posting this.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-207.el5 You can download this test kernel from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
Stratus is about to begin testing this.
As noted in comment 15, Stratus has verified this fix. Stratus could not provide a timely verification using RHEL 5.6 Beta; however, Stratus has verified that the kernel patch for this problem in RHEL 5.6 beta is the same patch that was verified on 2009-08-27.
(In reply to comment #27)
> As noted in comment 15, Stratus has verified this fix. Stratus could not
> provide a timely verification using RHEL 5.6 Beta; however, Stratus has
> verified that the kernel patch for this problem in RHEL 5.6 beta is the same
> patch that was verified on 2009-08-27.

As a follow-up: Stratus stated that reproducing this issue is *very* difficult, but when it does occur it is serious. Setting to Verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html