Bug 171220 - USB: khubd deadlock on error path
Summary: USB: khubd deadlock on error path
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Pete Zaitcev
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 168429
Reported: 2005-10-19 15:37 UTC by Kimball Murray
Modified: 2007-11-30 22:07 UTC

Fixed In Version: RHSA-2006-0132
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-07 20:31:15 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch for kernel 2.6.9-22 (1.99 KB, patch)
2005-10-25 19:55 UTC, Kimball Murray
Candidate #2 - Same as before, only changed comments (2.06 KB, patch)
2005-10-26 22:52 UTC, Pete Zaitcev


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2005:808 | 0 | normal | SHIPPED_LIVE | Important: kernel security update | 2005-10-27 04:00:00 UTC
Red Hat Product Errata | RHSA-2006:0132 | 0 | qe-ready | SHIPPED_LIVE | Moderate: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 3 | 2006-03-09 16:31:00 UTC

Description Kimball Murray 2005-10-19 15:37:08 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
Stratus heavily exercises hotplug functionality in both USB and PCI space.  While doing this, we hit a deadlock in khubd and were able to take a memory dump.  Analyzing the dump with crash showed that khubd had wedged itself in hub_events(...).  hub_events calls locktree, which takes the device's serialize semaphore.  Then, via a two-part error path, hub_events ends up calling usb_disconnect, which tries to take the same semaphore.  Visual code inspection uncovers this as well.  Here is the sequence:

starting with hub_events, called from khubd...

                /* Lock the device, then check to see if we were
                 * disconnected while waiting for the lock to succeed. */
                if (locktree(hdev) < 0)        <- if successful, hdev->serialize is taken.
                        break;
                if (hdev->state != USB_STATE_CONFIGURED ||
                                !hdev->actconfig ||
                                hub != usb_get_intfdata(
                                        hdev->actconfig->interface[0]))
                        goto loop;

                if (hub->error) {        <- error path condition #1
                        dev_dbg (hub_dev, "resetting for error %d\n",
                                hub->error);

                        if (hub_reset(hub)) {    <- error path condition #2
                                dev_dbg (hub_dev,
                                        "can't reset; disconnecting\n");
                                hub_start_disconnect(hdev);

And here is hub_start_disconnect:

/* caller has locked the hub */
/* FIXME!  This routine should be subsumed into hub_reset */
static void hub_start_disconnect(struct usb_device *hdev)
{
        struct usb_device *parent = hdev->parent;
        int i;

        /* Find the device pointer to disconnect */
        if (parent) {
                for (i = 0; i < parent->maxchild; i++) {
                        if (parent->children[i] == hdev) {
                                usb_disconnect(&parent->children[i]);
                                return;
                        }
                }
        }

        dev_err(&hdev->dev, "cannot disconnect hub!\n");
}

Even the comment above hub_start_disconnect points out that the device is locked. But usb_disconnect is going to take the same lock:

void usb_disconnect(struct usb_device **pdev)
{
        struct usb_device       *udev = *pdev;
        int                     i;

        if (!udev) {
                pr_debug ("%s nodev\n", __FUNCTION__);
                return;
        }

        /* mark the device as inactive, so any further urb submissions for
         * this device (and any of its children) will fail immediately.
         * this quiesces everyting except pending urbs.
         */
        usb_set_device_state(udev, USB_STATE_NOTATTACHED);

        /* lock the bus list on behalf of HCDs unregistering their root hubs */
        if (!udev->parent)
                down(&usb_bus_list_lock);
        down(&udev->serialize);    <- and here we are deadlocked because locktree did this already.
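
To make the sequence above concrete, here is a small user-space model (not kernel code). The names model_sem, model_down_trylock, model_up and disconnect_would_block are made up for this illustration. The model assumes only what the excerpts show: the serialize semaphore is a plain, non-recursive semaphore, so a second down() by the same thread sleeps forever. A try-lock is used in place of the second down() so the demo reports the conflict instead of hanging.

```c
#include <assert.h>

/* Toy model of a non-recursive semaphore (hypothetical; this is not the
 * kernel's struct semaphore).  count == 1 means free, 0 means held. */
struct model_sem {
    int count;
};

/* Follows the kernel's down_trylock() convention: returns 0 on success,
 * nonzero if the semaphore is already held (a real down() would sleep). */
static int model_down_trylock(struct model_sem *sem)
{
    if (sem->count > 0) {
        sem->count--;
        return 0;
    }
    return 1;
}

static void model_up(struct model_sem *sem)
{
    sem->count++;
}

/* Returns 1 if the disconnect step would block, 0 if it could proceed.
 * caller_holds_lock != 0 models hub_events() having run locktree() first. */
int disconnect_would_block(int caller_holds_lock)
{
    struct model_sem serialize = { 1 };
    int blocked;

    if (caller_holds_lock) {
        int rc = model_down_trylock(&serialize);  /* locktree(hdev) */
        assert(rc == 0);                          /* first take succeeds */
        (void)rc;
    }

    /* usb_disconnect(): down(&udev->serialize), probed without sleeping */
    blocked = model_down_trylock(&serialize);
    if (!blocked)
        model_up(&serialize);  /* drop the disconnect step's own hold */

    return blocked;
}
```

Calling disconnect_would_block(1) models the error path in hub_events and reports that the second acquisition cannot succeed; disconnect_would_block(0) models a caller that does not already hold the semaphore, where the same step goes through.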


Version-Release number of selected component (if applicable):
kernel-2.6.9-22 (and earlier)

How reproducible:
Sometimes

Steps to Reproduce:
To hit this deadlock, one must first have a hub plugged into the host controller.   It shouldn't matter what is plugged into the hub itself.

It's not easy to take the error path described above. In our hardware, we have to use a hotplug disconnect on the PCI slot that hosts the ohci root hub at just the right time.  But when we do take the error path, khubd deadlocks itself.
  

Actual Results:  deadlock

Expected Results:  no deadlock

Additional info:
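
For illustration only: one generic way to avoid this kind of self-deadlock is to record, while the lock is held, that a disconnect is needed, and to perform it only after the lock has been dropped. The sketch below is a user-space model with made-up names (sketch_sem, sketch_dev, sketch_disconnect, sketch_handle_hub_error). It is not the attached patch, whose contents are not reproduced in this report, and it ignores the fact that real kernel code would have to revalidate the device state after dropping the lock.

```c
#include <assert.h>
#include <stddef.h>

/* Toy non-recursive semaphore (hypothetical): 1 = free, 0 = held. */
struct sketch_sem {
    int count;
};

/* Hypothetical stand-in for struct usb_device. */
struct sketch_dev {
    struct sketch_sem serialize;
    int attached;
};

/* Returns 0 on success, nonzero if a real down() would sleep. */
static int sketch_down_trylock(struct sketch_sem *sem)
{
    if (sem->count > 0) {
        sem->count--;
        return 0;
    }
    return 1;
}

static void sketch_up(struct sketch_sem *sem)
{
    sem->count++;
}

/* Models usb_disconnect(): takes the device lock itself.
 * Returns 0 on success, -1 if it would have blocked. */
static int sketch_disconnect(struct sketch_dev *dev)
{
    if (sketch_down_trylock(&dev->serialize))
        return -1;
    dev->attached = 0;
    sketch_up(&dev->serialize);
    return 0;
}

/* Models the hub error path with the disconnect deferred until after the
 * lock is released.  reset_fails != 0 simulates hub_reset() failing.
 * Returns 0 if the event was handled, -1 if a lock step would have blocked. */
int sketch_handle_hub_error(struct sketch_dev *dev, int reset_fails)
{
    int need_disconnect = 0;

    assert(dev != NULL);

    if (sketch_down_trylock(&dev->serialize))   /* locktree() */
        return -1;
    if (reset_fails)
        need_disconnect = 1;    /* only note it while the lock is held */
    sketch_up(&dev->serialize); /* drop the lock before disconnecting */

    return need_disconnect ? sketch_disconnect(dev) : 0;
}
```

With a device initialized as { { 1 }, 1 }, calling sketch_handle_hub_error(&dev, 1) completes without blocking, leaves the device marked detached, and leaves the semaphore free for other callers.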

Comment 1 Kimball Murray 2005-10-25 19:55:57 UTC
Created attachment 120381
Patch for kernel 2.6.9-22

Comment 2 Pete Zaitcev 2005-10-25 22:24:08 UTC
Test kernel is available at:
 ftp://people.redhat.com/zaitcev/171220/


Comment 3 Pete Zaitcev 2005-10-26 22:52:05 UTC
Created attachment 120436
Candidate #2 - Same as before, only changed comments

Comment 4 Kimball Murray 2005-10-27 15:10:06 UTC
The test wizards at Stratus have been able to reproduce this bug on a standard
PC running the RHEL4-U2 GA distro.  They used an older HP Vectra PC and
connected two USB hubs in series to it.  The hubs had external power supplies,
and could also run from the PC power through the USB port.  A USB storage stick
was attached to the first hub, and a keyboard and mouse were connected behind
the second hub (the mouse was plugged into the keyboard, so in effect it was
behind a third hub).  At some point, while data was being dumped
to the memory stick using 'dd', the power supplies were removed so that the hubs
ran off the PC power.  Everything was still working at this point. Then the
keyboard was pulled from the second hub, and plugged back in.  

The connection event never registered, because khubd was stuck in
usb_disconnect, deadlocked on a semaphore as described in this bug.  The test
crew tried rmmod uhci-hcd, but rmmod got wedged on the same semaphore (again in
usb_disconnect).  Of course, we expect rmmod to hang if khubd is stuck in
usb_disconnect.  The machine had to be rebooted to clear this condition.

It took about 30 or so tries of various plug/unplug operations before hitting
this deadlock.  It is likely possible to do this with different operations, but
since it's a little racy, it's hard to pinpoint an exact formula for causing the
bug.

Here are the backtraces from crash for the khubd and rmmod processes that
were stuck:

crash> bt 6
PID: 6      TASK: c13c9790  CPU: 0   COMMAND: "khubd"
 #0 [cfe8fef4] schedule at c030d814
 #1 [cfe8ff70] .text.lock.hub (via usb_disconnect) at c0287ebf
 #2 [cfe8ff84] hub_events at c028765b
 #3 [cfe8fff0] kernel_thread_helper at c01041d7

crash> bt 28436
PID: 28436  TASK: c0c77830  CPU: 0   COMMAND: "rmmod"
 #0 [cf30be70] schedule at c030d814
 #1 [cf30beec] .text.lock.hub (via usb_disconnect) at c0287ebf
 #2 [cf30bf00] usb_disconnect at c0285f61
 #3 [cf30bf14] usb_hcd_pci_remove at c028d4d8
 #4 [cf30bf24] pci_device_remove at c01ec1d9
 #5 [cf30bf2c] device_release_driver at c024a399
 #6 [cf30bf38] driver_detach at c024a3b8
 #7 [cf30bf44] bus_remove_driver at c024a748
 #8 [cf30bf54] driver_unregister at c024ab0e
 #9 [cf30bf60] pci_unregister_driver at c01ec393
#10 [cf30bf68] cleanup_module at d09ec588
#11 [cf30bf6c] sys_delete_module at c013b6ff
#12 [cf30bfc0] system_call at c030f918
    EAX: 00000081  EBX: bff7a840  ECX: 00000880  EDX: bff7a8a0
    DS:  007b      ESI: bff7a840  ES:  007b      EDI: 00000000
    SS:  007b      ESP: bff7a814  EBP: bff7d0f8
    CS:  0073      EIP: 001847a2  ERR: 00000081  EFLAGS: 00000246

Comment 14 Red Hat Bugzilla 2006-03-07 20:31:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0132.html


