Description of problem: When loading/unloading ohci-hcd in a loop, the following messages were printed on the serial console after the while, ... usb 3-1: device not accepting address 2, error -108 hub 3-0:1.0: cannot disable port 1 (err = -19) hub 3-0:1.0: couldn't allocate port 1 usb_device hub 3-0:1.0: cannot disable port 1 (err = -19) hub 3-0:1.0: hub_port_status failed (err = -19) hub 3-0:1.0: connect-debounce failed, port 1 disabled hub 3-0:1.0: cannot disable port 1 (err = -19) hub 3-0:1.0: hub_port_status failed (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: Cannot enable port 1. Maybe the USB cable is bad? hub 3-0:1.0: cannot disable port 1 (err = -19) hub 3-0:1.0: couldn't allocate port 1 usb_device hub 3-0:1.0: cannot disable port 1 (err = -19) usb 3-1: control timeout on ep0out As the result, ohci-hcd was not able to remove anymore, and the system could not reboot. # rmmod ohci-hcd ERROR: Removing 'ohci_hcd': Device or resource busy # rmmod -f -v ohci-hcd rmmod ohci_hcd, wait=no force ERROR: Removing 'ohci_hcd': Device or resource busy # lsmod Module Size Used by ohci_hcd 24657 0 parport_pc 29569 0 lp 15345 0 parport 44621 2 parport_pc,lp netconsole 8541 0 netdump 15552 0 md5 5953 1 ipv6 285729 14 autofs4 28105 0 sunrpc 177337 1 ds 21961 0 yenta_socket 23105 0 pcmcia_core 70097 2 ds,yenta_socket powernow_k8 14881 0 cpufreq_powersave 3265 0 loop 18769 0 joydev 12225 0 button 9313 0 battery 11465 0 ac 6985 0 ehci_hcd 35273 0 k8_edac 19561 0 edac_mc 34305 1 k8_edac snd_hda_intel 325417 0 snd_pcm_oss 60025 0 snd_mixer_oss 21825 1 snd_pcm_oss snd_pcm 110537 2 snd_hda_intel,snd_pcm_oss snd_timer 29129 1 snd_pcm snd 64265 5 snd_hda_intel,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_timer soundcore 12769 1 snd snd_page_alloc 14041 2 snd_hda_intel,snd_pcm tg3 120133 0 floppy 66065 0 sr_mod 20965 0 dm_snapshot 19201 0 dm_zero 3649 0 dm_mirror 33609 0 ext3 138961 2 jbd 69489 1 ext3 dm_mod 76969 8 dm_snapshot,dm_zero,dm_mirror ahci 28101 2 libata 124713 1 ahci sd_mod 19393 3 scsi_mod 145425 4 sr_mod,ahci,libata,sd_mod # reboot INIT: Sending pStopping HAL daemon: [FAILED] Stopping system message bus: [ OK ] Stopping atd: [ OK ] Stopping cups: [ OK ] Shutting down xfs: [ OK ] Shutting down console mouse services: [ OK ] Stopping sshd:[ OK ] Shutting down sm-client: [ OK ] Shutting down sendmail: [ OK ] Shutting down smartd: [ OK ] disabling netconsole [ OK ] disabling netdump [ OK ] Stopping xinetd: [ OK ] Stopping acpi daemon: [ OK ] Stopping crond: [ OK ] Stopping autofs: [ OK ] Shutting down ntpd: [ OK ] Stopping NFS statd: [ OK ] Stopping irqbalance: [ OK ] Stopping portmap: [ OK ] Shutting down kernel logger: [ OK ] Shutting down system logger: [ OK ] Shutting down interface eth0: [ OK ] Shutting down loopback interface: [ OK ] Stopping pcmcia: <hung here ...> I'll attach some SysRq logs captured during the hang. Version-Release number of selected component (if applicable): kernel-2.6.9-78.0.8.EL How reproducible: always Steps to Reproduce: 1. reserve hp-xw4550-01.rhts.bos.redhat.com 2. while :; do rmmod ohci-hcd; modprobe ohci-hcd; done 3. monitor the serial console and wait for the following message coming up. usb 3-1: control timeout on ep0out 4. rmmod ohci-hcd 5. reboot Actual results: In step 3, the module cannot be removed due to "ERROR: Removing 'ohci_hcd': Device or resource busy". In step 4, the machine cannot reboot. Expected results: The module should be able to remove; the machine should be able to reboot.
By the way, it is usually reproducible within 30 minutes.
Created attachment 323558 [details] some SysRq logs captured during the hang.
The following from the above looks interesting to me. rmmod D 0000000000503010 0 3904 6952 (NOTLB) 000001001fc87d38 0000000000000002 000001001fc87cc8 ffffffff00000073 0000000000000001 000000008013475a 0000010001196b00 00000000803f68a0 00000100361687f0 0000000000005eb5 Call Trace:<ffffffff801347b0>{__wake_up+54} <ffffffff803178b7>{__down+147} <ffffffff80134709>{default_wake_function+0} <ffffffff803193d5>{__down_failed+53} <ffffffff802947a9>{.text.lock.hub+55} <ffffffff802992f5>{usb_hcd_pci_remove+137} <ffffffff801f8ff0>{pci_device_remove+34} <ffffffff8024f8e8>{device_release_driver+83} <ffffffff8024fc96>{bus_remove_driver+153} <ffffffff8024ffbc>{driver_unregister+17} <ffffffff801f8e1f>{pci_unregister_driver+15} <ffffffff8014f5df>{sys_delete_module+487} <ffffffff801ef400>{__down_read_trylock+41} <ffffffff8016f570>{sys_munmap+94} <ffffffff801102f6>{system_call+126}
The symptom is different from the oops in bug 455843, but while working on that one I found that since there is a whole bunch of hanging pointers, rmmod should not be tried without it fixed. Results may be quite wild, and necesserily just the oops. Cai, plase try the kernel from this URL: http://people.redhat.com/zaitcev/ftp/455843/ There's also a hang specifically in rmmod ohci_hcd (does not happen with other HCDs). It occurs when unlinks cross with rmmod: rmmod changes the controller's state and so unlinks cannot return any callbacks, so something hangs. The following may be helpful: --- linux-2.6.9-78.12.EL/drivers/usb/host/ohci-q.c 2004-10-18 14:55:24.000000000 -0700 +++ linux-2.6.9-78.12.EL-u1/drivers/usb/host/ohci-q.c 2008-11-20 19:26:24.000000000 -0800 @@ -170,9 +170,6 @@ static int ed_schedule (struct ohci_hcd { int branch; - if (ohci->hcd.state == USB_STATE_QUIESCING) - return -EAGAIN; - ed->state = ED_OPER; ed->ed_prev = NULL; ed->ed_next = NULL; But we don't know if this is what this bug is. Let's reproduce the problem on the test kernel for bug 455843, then work from there.
The new Kernel gave an oops. hub 3-0:1.0: connect-debounce failed, port 1 disabled hub 3-0:1.0: cannot disable port 1 (err = -19) bus 3: replacing with dummies bus 4: replacing with dummies bus 5: replacing with dummies bus 6: replacing with dummies bus 2: replacing with dummies hub 3-0:1.0: hub_port_status failed (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: Cannot enable port 1. Maybe the USB cable is bad? hub 3-0:1.0: cannot disable port 1 (err = -19) hub 3-0:1.0: couldn't allocate port 1 usb_device hub 3-0:1.0: cannot disable port 1 (err = -19) bus 3: replacing with dummies bus 4: replacing with dummies bus 5: replacing with dummies bus 6: replacing with dummies bus 2: replacing with dummies hub 3-0:1.0: hub_port_status failed (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: Cannot enable port 1. Maybe the USB cable is bad? hub 3-0:1.0: cannot disable port 1 (err = -19) hub 3-0:1.0: couldn't allocate port 1 usb_device hub 3-0:1.0: cannot disable port 1 (err = -19) bus 3: replacing with dummies bus 4: replacing with dummies bus 5: replacing with dummies bus 6: replacing with dummies bus 2: replacing with dummies Unable to handle kernel NULL pointer dereference at 00000000000001b8 RIP: <ffffffffa02b8568>{:ohci_hcd:ohci_hub_resume+23} PML4 2fa21067 PGD 30378067 PMD 0 Oops: 0000 [1] SMP CPU 1 Modules linked in: ohci_hcd parport_pc lp parport md5 ipv6 autofs4 sunrpc ds yenta_socket pcmcia_core powernow_k8 cpufreq_powersave loop joydev button battery ac ehci_hcd k8_edac edac_mc snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd soundcore tg3 floppy sr_mod dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ahci libata sd_mod scsi_mod Pid: 7, comm: events/1 Not tainted 2.6.9-78.18.EL.bz455843.1smp RIP: 0010:[<ffffffffa02b8568>] <ffffffffa02b8568>{:ohci_hcd:ohci_hub_resume+23} RSP: 0018:0000010037b8de28 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000010034a95800 RCX: 0000000000000297 RDX: 0000000000000000 RSI: 000000000000006c RDI: 0000010034a95800 RBP: 0000010034a95800 R08: 0000010037b8c000 R09: 0000000300000000 R10: 0000000300000000 R11: 00000000013ddb5c R12: 00000000ffffff8d R13: 0000000000000000 R14: 0000010034a95800 R15: ffffffffa02b8af3 FS: 0000002a958a5b00(0000) GS:ffffffff8050cf80(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00000000000001b8 CR3: 0000000037c28000 CR4: 00000000000006e0 Process events/1 (pid: 7, threadinfo 0000010037b8c000, task 0000010037b65030) Stack: 0000000300000000 0000010034a95800 0000010034a95e70 0000010036fcdc30 0000010036fcdc00 ffffffffa02b8b2c 0000010034a95e68 0000010037cb9440 0000000000000297 ffffffff8014894e Call Trace:<ffffffffa02b8b2c>{:ohci_hcd:ohci_rh_resume+57} <ffffffff8014894e>{worker_thread+419} <ffffffff80134837>{default_wake_function+0} <ffffffff80134888>{__wake_up_common+67} <ffffffff80134837>{default_wake_function+0} <ffffffff801487ab>{worker_thread+0} <ffffffff8014c6a7>{kthread+200} <ffffffff80110fd3>{child_rip+8} <ffffffff8014c5df>{kthread+0} <ffffffff80110fcb>{child_rip+0} Code: 41 83 bd b8 01 00 00 00 0f 84 74 05 00 00 48 8b 05 43 6c 25 RIP <ffffffffa02b8568>{:ohci_hcd:ohci_hub_resume+23} RSP <0000010037b8de28> CR2: 00000000000001b8 <0>Kernel panic - not syncing: Oops
Even with the new Linux kernel, I can still reproduce the original problem sometimes. bus 3: replacing with dummies bus 4: replacing with dummies bus 5: replacing with dummies bus 6: replacing with dummies bus 2: replacing with dummies hub 3-0:1.0: hub_port_status failed (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: cannot reset port 1 (err = -19) hub 3-0:1.0: Cannot enable port 1. Maybe the USB cable is bad? hub 3-0:1.0: cannot disable port 1 (err = -19) hub 3-0:1.0: couldn't allocate port 1 usb_device hub 3-0:1.0: cannot disable port 1 (err = -19) bus 3: replacing with dummies bus 4: replacing with dummies bus 5: replacing with dummies bus 6: replacing with dummies bus 2: replacing with dummies usb 3-1: control timeout on ep0in
I have tried 6 times on this one. 3 times with this bug; 3 times with the oops.
I was able to reproduce the hang @hp-xw4550-01, although not the oops, sadly. Working on the hang for now. My patchlet for ed_schedule does not fix it. The key is khubd being stuck with a lost unlink callback: khubd D 0000000000001388 0 37 1 66 7 (L-TLB) 000001003770bbf8 0000000000000046 0000000000000246 0000010000000073 0000000232a679c0 0000000000000000 0000010001195aa0 0000000000000000 00000100376d87f0 000000000000031b Call Trace:<ffffffff8029552d>{hcd_submit_urb+1463} <ffffffff80318708>{wait_for_completion+167} <ffffffff80134837>{default_wake_function+0} <ffffffff80134837>{default_wake_function+0} <ffffffff80296022>{usb_start_wait_urb+185} <ffffffff80295ef4>{timeout_kill+0} <ffffffff802960d9>{usb_internal_control_msg+122} <ffffffff8011f63c>{flat_send_IPI_mask+0} <ffffffff80296168>{usb_control_msg+123} <ffffffff802968b3>{usb_get_descriptor+103} <ffffffff80296bc0>{usb_get_device_descriptor+74} <ffffffff8029337d>{hub_port_init+1193} <ffffffff80293e79>{hub_thread+1399} <ffffffff801361f8>{autoremove_wake_function+0} <ffffffff801361f8>{autoremove_wake_function+0} <ffffffff80110fd3>{child_rip+8} <ffffffff801db498>{selinux_d_instantiate+0} <ffffffff80293902>{hub_thread+0} <ffffffff80110fcb>{child_rip+0} I think rmmod is secondary to the above.
Created attachment 325802 [details] WIP - survives for a day Looks like I found the oops problem: there's a window when usb_disconnect(&hcd->self.root_hub) set the root_hub to NULL, but the HC is still running and can produce an RD event. In response we schedule the work, and it oopses with NULL root_hub. However, the hang has become elusive now. Although maybe it was caused by RD-initiated schedule poking at HC while we were stopping it. This is work in progress. The patch includes the whole of the fix for bug 455843, or else rmmod loop will inevitably oops for other reasons.
I have a test kernel 2.6.9-78.18.EL.bz455843.4 which incorporates the necessary fixes from bz455843, a fix for a hang in ed_schedule, a fix for the crash due to NULL root_hub (like comment #5). I am not exactly sure what has caused the hang in comment #6 (we know that it's a missing unlink callback, but we don't know why it occured). I am unable to reproduce it anymore on hp-xw4550-01. Please test, same location: http://people.redhat.com/zaitcev/ftp/455843/
The machine has survived without hang or panic after 8 hours.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Reproducer available and HW in RHTS.
Committed in 78.29.EL . RPMS are available at http://people.redhat.com/vgoyal/rhel4/
Pete, I have seen an Oops on another machine while testing if a RHEL4.7 EUS kernel, ... hub 2-0:1.0: hub_port_status failed (err = -19) hub 2-0:1.0: connect-debounce failed, port 1 disabled hub 2-0:1.0: cannot disable port 1 (err = -19) hub 2-0:1.0: hub_port_status failed (err = -19) hub 2-0:1.0: hub_port_status failed (err = -19) hub 2-0:1.0: connect-debounce failed, port 1 disabled hub 2-0:1.0: cannot disable port 1 (err = -19) hub 2-0:1.0: hub_port_status failed (err = -19) hub 2-0:1.0: hub_port_status failed (err = -19) hub 2-0:1.0: connect-debounce failed, port 2 disabled hub 2-0:1.0: cannot disable port 2 (err = -19) Unable to handle kernel NULL pointer dereference at virtual address 00000158 printing eip: f893e2cd *pde = 00000000 Oops: 0000 [#1] Modules linked in: ohci_hcd nfs nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 sunrpc cpufreq_powersave loop joydev button battery ac ehci_hcd k8_edac edac_mc tg3 sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sata_svw libata mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod CPU: 0 EIP: 0060:[<f893e2cd>] Not tainted VLI EFLAGS: 00010246 (2.6.9-78.0.19.EL) EIP is at ohci_hub_resume+0x17/0x3db [ohci_hcd] eax: 00000000 ebx: f3308d40 ecx: f2127024 edx: f7fef000 esi: f3308fe4 edi: ffffff8d ebp: 00000000 esp: f7feff4c ds: 007b es: 007b ss: 0068 Process events/0 (pid: 3, threadinfo=f7fef000 task=f7e50bd0) Stack: f2127024 f3308fe4 f2127024 f2127000 f7e95180 f893e6c4 f3308f94 00000283 f3308fe4 c0138904 f893e691 ffffffff ffffffff 00000001 00000000 c012007b 00010000 00000000 c0325289 f7feffcc 00000046 00000000 f7e50bd0 c012007b Call Trace: [<f893e6c4>] ohci_rh_resume+0x33/0x46 [ohci_hcd] [<c0138904>] worker_thread+0x1eb/0x2f0 [<f893e691>] ohci_rh_resume+0x0/0x46 [ohci_hcd] [<c012007b>] default_wake_function+0x0/0xc [<c0325289>] schedule+0x44d/0x606 [<c012007b>] default_wake_function+0x0/0xc [<c0138719>] worker_thread+0x0/0x2f0 [<c013d7e1>] kthread+0x69/0x91 [<c013d778>] kthread+0x0/0x91 [<c01041dd>] kernel_thread_helper+0x5/0xb Code: 7e c7 c7 43 04 00 00 00 00 fb 89 e9 5a 89 c8 5b 5e 5f 5d c3 55 57 bf 8d ff ff ff 56 53 8d 98 5c fd ff ff 56 8b 68 28 89 c6 31 c0 <83> bd 58 01 00 00 00 0f 84 b1 03 00 00 a1 20 84 37 c0 3b 83 4c <0>Fatal exception: panic in 5 seconds Kernel panic - not syncing: Fatal exception Do you think it worth trying to reproduce it for RHEL4.8 kernels? More information, https://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=7702158 ********** System Information ********** Hostname = ibm-e326m.rhts.bos.redhat.com Kernel Version = 2.6.9-78.0.19.EL Machine Hardware Name = i686 Processor Type = athlon uname -a output = Linux ibm-e326m.rhts.bos.redhat.com 2.6.9-78.0.19.EL #1 Mon Apr 6 07:39:02 EDT 2009 i686 athlon i386 GNU/Linux Swap Size = 4031 MB Mem Size = 2026 MB Number of Processors = 1 System Release = Red Hat Enterprise Linux AS release 4 (Nahant Update 7) Command Line = ro root=/dev/VolGroup00/LogVol00 console=tty0 console=ttyS0,115200 System NMI Interrupts = NMI: 0 ********** LSPCI ********** 00:01.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge 00:02.0 Host bridge: Broadcom BCM5785 [HT1000] Legacy South Bridge 00:02.1 IDE interface: Broadcom BCM5785 [HT1000] IDE 00:02.2 ISA bridge: Broadcom BCM5785 [HT1000] LPC 00:03.0 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01) 00:03.1 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01) 00:03.2 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01) 00:04.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 01) 00:06.0 PCI bridge: Broadcom BCM5780 [HT2000] PCI-X bridge (rev a3) 00:07.0 PCI bridge: Broadcom BCM5780 [HT2000] PCI-X bridge (rev a3) 00:08.0 PCI bridge: Broadcom BCM5780 [HT2000] PCI-Express Bridge (rev a3) 00:09.0 PCI bridge: Broadcom BCM5780 [HT2000] PCI-Express Bridge (rev a3) 00:0a.0 PCI bridge: Broadcom BCM5780 [HT2000] PCI-Express Bridge (rev a3) 00:0b.0 PCI bridge: Broadcom BCM5780 [HT2000] PCI-Express Bridge (rev a3) 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control 01:0d.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge (rev b2) 01:0e.0 IDE interface: Broadcom BCM5785 [HT1000] SATA (Native SATA Mode) 01:0e.1 IDE interface: Broadcom BCM5785 [HT1000] SATA (Native SATA Mode) 02:03.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08) 04:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet (rev 03) 04:04.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet (rev 03) ********** Modprob ********** alias eth0 tg3 alias eth1 tg3 alias scsi_hostadapter mptbase alias scsi_hostadapter1 mptscsi alias scsi_hostadapter2 mptspi alias scsi_hostadapter3 mptsas alias scsi_hostadapter4 mptscsih alias scsi_hostadapter5 sata_svw alias usb-controller ehci-hcd alias usb-controller1 ohci-hcd
FYI. I have not been able to this on the latest RHEL4.8 kernel -- 2.6.9-88.EL in the above RHTS machine by running the reproducer for 8 hours.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1024.html