Bug 516851 - [Stratus 5.6 bug] System crashes at uhci_scan_schedule().
Summary: [Stratus 5.6 bug] System crashes at uhci_scan_schedule().
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: x86_64
OS: Linux
high
medium
Target Milestone: rc
: 5.6
Assignee: Pete Zaitcev
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 657319
TreeView+ depends on / blocked
 
Reported: 2009-08-11 16:48 UTC by Oonkwee Lim
Modified: 2011-01-13 20:51 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 20:51:40 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Test patch 1 (590 bytes, patch)
2009-08-19 07:02 UTC, Pete Zaitcev
no flags Details | Diff
Test patch 2 (665 bytes, patch)
2009-08-20 00:27 UTC, Pete Zaitcev
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:0017 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update 2011-01-13 10:37:42 UTC

Description Oonkwee Lim 2009-08-11 16:48:16 UTC
Description of problem:
System crashes at uhci_scan_schedule().

Version-Release number of selected component (if applicable):


How reproducible:
This only occur once during our stress and breaker tests.

Steps to Reproduce:
1.Run stress and breaker tests
2.
3.
  
Actual results:
System crashes after 36 hours.

Expected results:
System does not crash.

Additional info:
crash> bt
PID: 1616   TASK: ffff8106a22bb080  CPU: 5   COMMAND: "tcp_test"
 #0 [ffff81011de17bb0] crash_kexec at ffffffff800ad517
 #1 [ffff81011de17c70] __die at ffffffff80066127
 #2 [ffff81011de17cb0] do_page_fault at ffffffff80067da7
 #3 [ffff81011de17da0] error_exit at ffffffff8005ede9
    [exception RIP: uhci_scan_schedule+162]
    RIP: ffffffff88021950  RSP: ffff81011de17e58  RFLAGS: 00010003
    RAX: 00000000001000f8  RBX: 00000000001000f8  RCX: 000000000000000b
    RDX: 0000000000000000  RSI: ffff81047a4ebf58  RDI: ffff81083eb81d50
    RBP: ffff81083eb81d50   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff81083eb81d50
    R13: ffff81083eb81c00  R14: ffff81047a4ebf58  R15: ffff81047a4ebf58
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
 #4 [ffff81011de17e50] uhci_scan_schedule at ffffffff88021918
 #5 [ffff81011de17ed0] uhci_irq at ffffffff88023cb8
 #6 [ffff81011de17f10] usb_hcd_irq at ffffffff801f1c1f
 #7 [ffff81011de17f20] handle_IRQ_event at ffffffff8001123b
 #8 [ffff81011de17f50] __do_IRQ at ffffffff800ba749
 #9 [ffff81011de17f90] do_IRQ at ffffffff8006d986
--- <IRQ stack> ---
#10 [ffff81047a4ebf58] ret_from_intr at ffffffff8005e615
    RIP: 000000000804ac27  RSP: 00000000f60dd010  RFLAGS: 00000293
    RAX: 0000000000000216  RBX: 00000000f60dd018  RCX: 0000000000000000
    RDX: 00000000f4c16008  RSI: 0000000000000000  RDI: 00000000f60ddb90
    RBP: 00000000f60dd018   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: fffffffffffffff4  CS: 0023  SS: 002b

This is from the log:

ACPI: PCI interrupt for device 0000:7c:00.1 disabled
Trying to free nonexistent resource <00000000a8000000-00000000afffffff>
Trying to free nonexistent resource <00000000a4800000-00000000a480ffff>
uhci_hcd 0000:7e:1d.0: remove, state 1
usb usb2: USB disconnect, address 1
usb 2-1: USB disconnect, address 2
Unable to handle kernel paging request at 0000000000100100 RIP:
 [<ffffffff88021950>] :uhci_hcd:uhci_scan_schedule+0xa2/0x89c
PGD 68de7c067 PUD 66477a067 PMD 0
Oops: 0000 [1] SMP

This is because uhci_scan_schedule() is working through uhci->skelqh, which is already freed by uhci_stop(), called through
void usb_remove_hcd(struct usb_hcd *hcd)
{
………….
        hcd->driver->stop(hcd);
        hcd->state = HC_STATE_HALT;

        if (hcd->irq >= 0)
                free_irq(hcd->irq, hcd);
        usb_deregister_bus(&hcd->self);
        hcd_buffer_destroy(hcd);
}

Between uhci_stop() and free_irq(), a uhci_irq() occurs, resulting in the oops.

Comment 2 RHEL Program Management 2009-08-11 17:32:33 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 3 Pete Zaitcev 2009-08-18 23:01:05 UTC
I'm tempted to fix this in-place by having ->stop taking an appropriate
lock instead of pulling whole hog in from upstream. But I'm open to
any suggestions Stratus people may have about this. Pondering now.

Comment 4 Pete Zaitcev 2009-08-19 07:02:12 UTC
Created attachment 357895 [details]
Test patch 1

I'm not certain that this is a 100% reliable solution, but let's try it out.

Comment 5 Pete Zaitcev 2009-08-19 07:53:34 UTC
Please test the kernel 2.6.18-164.el5.bz516851.1 from this location:
 http://people.redhat.com/zaitcev/ftp/516851/

Comment 6 Oonkwee Lim 2009-08-19 16:11:43 UTC
I will check with our SQA.

Comment 7 Oonkwee Lim 2009-08-19 16:12:13 UTC
Does this change involve the uhci driver?

Comment 8 Oonkwee Lim 2009-08-19 16:21:36 UTC
ok I just notice your diffs - thanks

Comment 9 Oonkwee Lim 2009-08-19 16:26:07 UTC
ok I just notice your diffs - thanks

Comment 10 Oonkwee Lim 2009-08-19 16:28:12 UTC
Below is a comment by one of my colleague:

The interrupt handler for the device gets unregistered by free_irq(). Hcd->irq is just an int which holds the interrupt that was allocated. From what I can tell uhci_stop() isn’t touching anything, directly anyways, that is cleaned up by free_irq(). uhci_scan_schedule() does clear or set the Interrupt on Completion bit in the transfer descriptor of the uhci_hcd struct though, and I’m still figuring out what the effect of that would be.

The window appears to be between hcd->driver->stop() and hcd->state = HC_STATE_HALT . If HC_STATE_HALT is the state then usb_hcd_irq() does not call the irq op for the driver it just returns IRQ_NONE. Maybe the better option would be to set the state and then call the stop op.

Comment 11 Pete Zaitcev 2009-08-19 23:38:18 UTC
http://marc.info/?l=linux-usb&m=125071195604346&w=2
Alan Stern suggests that we wait out any pending interrupts after
the stopping the HC but before the freeing the schedules.

Comment 12 Pete Zaitcev 2009-08-20 00:27:06 UTC
Created attachment 358014 [details]
Test patch 2

Comment 13 Pete Zaitcev 2009-08-20 03:43:56 UTC
Please test the kernel 2.6.18-164.el5.bz516851.2 from the same location:
 http://people.redhat.com/zaitcev/ftp/516851/
If the 2.6.18-164.el5.bz516851.1 is already under test, don't stop the
run, it's valuable too. When it runs out, switch to the next one please.

Comment 14 Oonkwee Lim 2009-08-22 00:55:30 UTC
The tests complete without problem, with the first set of change.

Will start tests on the second set of change.

Comment 15 Oonkwee Lim 2009-08-27 21:26:44 UTC
The tests, with the second set of changes, pass without problem.

Comment 17 Pete Zaitcev 2010-04-22 05:02:08 UTC
I'm moving this back to RHEL 5.
We'll clone this for RHEL 6 if needed. As it is we may yet receive
this through -stable branch anyway.

Comment 18 Pete Zaitcev 2010-04-22 05:04:31 UTC
Upstream commit d23356da714595b888686d22cd19061323c09190
We're looking at issues in RHEL 6 for bug 579093. Most likely it's
unrelated, but I need some certainty before posting this.

Comment 21 RHEL Program Management 2010-07-07 05:08:41 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 24 Jarod Wilson 2010-07-19 21:14:01 UTC
in kernel-2.6.18-207.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 26 Andrius Benokraitis 2010-10-12 18:07:54 UTC
Stratus is about to begin testing this.

Comment 27 Robert N. Evans 2010-11-11 21:18:37 UTC
As noted in comment 15, Stratus has verified this fix.  Stratus could not provide a timely verification using RHEL 5.6 Beta; however, Stratus has verified that the kernel patch for this problem in RHEL 5.6 beta is the same patch that was verified on 2009-08-27.

Comment 28 Andrius Benokraitis 2010-11-11 23:16:23 UTC
(In reply to comment #27)
> As noted in comment 15, Stratus has verified this fix.  Stratus could not
> provide a timely verification using RHEL 5.6 Beta; however, Stratus has
> verified that the kernel patch for this problem in RHEL 5.6 beta is the same
> patch that was verified on 2009-08-27.

As a follow-up - Stratus stated that reproducing this issue is *very* difficult but when it does occur is serious. Setting to Verified.

Comment 33 errata-xmlrpc 2011-01-13 20:51:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html


Note You need to log in before you can comment on or make changes to this bug.