From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225

Description of problem:
Stratus can heavily exercise hotplug functionality in both USB and PCI space. While doing this, we hit a deadlock in khubd and were able to take a dump of memory. Analyzing the dump with crash showed us that khubd had wedged itself via hub_events(...) in which locktree is called, thereby taking a serialize semaphore, and via a 2-part error path, usb_disconnect is later called, which tries to take the same semaphore. A visual code inspection can uncover this as well.

Here is the sequence, starting with hub_events, called from khubd...

        /* Lock the device, then check to see if we were
         * disconnected while waiting for the lock to succeed. */
        if (locktree(hdev) < 0)      <- if successful, hdev->semaphore is taken.
                break;
        if (hdev->state != USB_STATE_CONFIGURED ||
                        !hdev->actconfig ||
                        hub != usb_get_intfdata(
                                hdev->actconfig->interface[0]))
                goto loop;

        if (hub->error) {            <- error path condition #1
                dev_dbg (hub_dev, "resetting for error %d\n",
                        hub->error);

                if (hub_reset(hub)) {    <- error path condition #2
                        dev_dbg (hub_dev,
                                "can't reset; disconnecting\n");
                        hub_start_disconnect(hdev);

And here is hub_start_disconnect:

/* caller has locked the hub */
/* FIXME!  This routine should be subsumed into hub_reset */
static void hub_start_disconnect(struct usb_device *hdev)
{
        struct usb_device *parent = hdev->parent;
        int i;

        /* Find the device pointer to disconnect */
        if (parent) {
                for (i = 0; i < parent->maxchild; i++) {
                        if (parent->children[i] == hdev) {
                                usb_disconnect(&parent->children[i]);
                                return;
                        }
                }
        }

        dev_err(&hdev->dev, "cannot disconnect hub!\n");
}

Even the comment above hub_start_disconnect points out that the device is locked.
But usb_disconnect is going to take the same lock:

void usb_disconnect(struct usb_device **pdev)
{
        struct usb_device *udev = *pdev;
        int i;

        if (!udev) {
                pr_debug ("%s nodev\n", __FUNCTION__);
                return;
        }

        /* mark the device as inactive, so any further urb submissions for
         * this device (and any of its children) will fail immediately.
         * this quiesces everyting except pending urbs.
         */
        usb_set_device_state(udev, USB_STATE_NOTATTACHED);

        /* lock the bus list on behalf of HCDs unregistering their root hubs */
        if (!udev->parent)
                down(&usb_bus_list_lock);
        down(&udev->serialize);      <- and here we are deadlocked because locktree did this already.

Version-Release number of selected component (if applicable):
kernel-2.6.9-22 (and earlier)

How reproducible:
Sometimes

Steps to Reproduce:
To hit this deadlock, one must first have a hub plugged into the host controller. It shouldn't matter what is plugged into the hub itself. It's not easy to take the error path described above. In our hardware, we have to use a hotplug disconnect on the PCI slot that hosts the ohci root hub at just the right time. But when we do take the error path, khubd deadlocks itself.

Actual Results:
deadlock

Expected Results:
no deadlock

Additional info:
Created attachment 120381
Patch for kernel 2.6.9-22
Test kernel is available at: ftp://people.redhat.com/zaitcev/171220/
Created attachment 120436
Candidate #2 - Same as before, only changed comments
The test wizards at Stratus have been able to reproduce this bug on a standard PC running the RHEL4-U2 GA distro. They used an older HP Vectra PC and connected two USB hubs in series to it. The hubs had external power supplies, and could also run from the PC power through the USB port. A USB storage stick was attached to the first hub, and a keyboard and mouse were connected behind the second hub (the mouse was plugged into the keyboard, so in effect it sat behind a 3rd hub).

At some point, while data was being dumped to the memory stick using 'dd', the external power supplies were removed so that the hubs ran off the PC power. Everything was still working at this point. Then the keyboard was pulled from the second hub and plugged back in. The connection event never registered, because khubd was stuck in usb_disconnect, deadlocked on a semaphore as described in this bug.

The test crew tried rmmod uhci-hcd, but rmmod got wedged on the same semaphore (again in usb_disconnect). This is expected: rmmod will hang if khubd is stuck in usb_disconnect while holding the semaphore. The machine had to be rebooted to clear this condition.

It took roughly 30 tries of various plug/unplug operations before hitting this deadlock. Other sequences of operations can probably trigger it too, but since the bug is racy, it's hard to pinpoint an exact formula for causing it.
Here are the back traces from crash for both khubd, and the rmmod processes that were stuck:

crash> bt 6
PID: 6      TASK: c13c9790  CPU: 0   COMMAND: "khubd"
 #0 [cfe8fef4] schedule at c030d814
 #1 [cfe8ff70] .text.lock.hub (via usb_disconnect) at c0287ebf
 #2 [cfe8ff84] hub_events at c028765b
 #3 [cfe8fff0] kernel_thread_helper at c01041d7

crash> bt 28436
PID: 28436  TASK: c0c77830  CPU: 0   COMMAND: "rmmod"
 #0 [cf30be70] schedule at c030d814
 #1 [cf30beec] .text.lock.hub (via usb_disconnect) at c0287ebf
 #2 [cf30bf00] usb_disconnect at c0285f61
 #3 [cf30bf14] usb_hcd_pci_remove at c028d4d8
 #4 [cf30bf24] pci_device_remove at c01ec1d9
 #5 [cf30bf2c] device_release_driver at c024a399
 #6 [cf30bf38] driver_detach at c024a3b8
 #7 [cf30bf44] bus_remove_driver at c024a748
 #8 [cf30bf54] driver_unregister at c024ab0e
 #9 [cf30bf60] pci_unregister_driver at c01ec393
#10 [cf30bf68] cleanup_module at d09ec588
#11 [cf30bf6c] sys_delete_module at c013b6ff
#12 [cf30bfc0] system_call at c030f918
    EAX: 00000081  EBX: bff7a840  ECX: 00000880  EDX: bff7a8a0
    DS:  007b      ESI: bff7a840  ES:  007b      EDI: 00000000
    SS:  007b      ESP: bff7a814  EBP: bff7d0f8
    CS:  0073      EIP: 001847a2  ERR: 00000081  EFLAGS: 00000246
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html