Bug 613620 - usb bus gets locked and requires power cycle
usb bus gets locked and requires power cycle
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel (Show other bugs)
5.5
All Linux
low Severity medium
: rc
: ---
Assigned To: Don Zickus
Red Hat Kernel QE team
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2010-07-12 08:46 EDT by Greg Matthews
Modified: 2010-08-13 11:35 EDT (History)
2 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-08-13 11:35:27 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Greg Matthews 2010-07-12 08:46:29 EDT
Description of problem:

unplugging and plugging usb devices results in a complete hang of all usb ports which can only be fixed by a hard power cycle (reboot is ineffective)

Version-Release number of selected component (if applicable):
kernel: 2.6.18-164.9.1.el5

How reproducible:
fully reproducible using a WD MyBook external hard drive.


Steps to Reproduce:
1.unplug MyBook from one port
2.plug into same or other port
3.observe /var/log/messages
  
Actual results:
The following logs from /var/log/messages are cut and pasted after steps 1 and 2 above:

Jul 12 11:55:42 i04-dd003 kernel: usb 1-2: new high speed USB device using ehci_hcd and address 4
Jul 12 11:55:42 i04-dd003 kernel: usb 1-2: configuration #1 chosen from 1 choice
Jul 12 11:55:42 i04-dd003 kernel: input: Western Digital External HDD as /class/input/input4
Jul 12 11:55:42 i04-dd003 kernel: input: USB HID v1.11 Device [Western Digital External HDD] on usb-0000:00:1a.7-2
Jul 12 11:55:42 i04-dd003 kernel: Initializing USB Mass Storage driver...
Jul 12 11:55:42 i04-dd003 kernel: scsi8 : SCSI emulation for USB Mass Storage devices
Jul 12 11:55:42 i04-dd003 kernel: usbcore: registered new driver usb-storage
Jul 12 11:55:42 i04-dd003 kernel: USB Mass Storage support registered.
Jul 12 11:55:47 i04-dd003 kernel:   Vendor: WD        Model: 2500JS External   Rev: 106a
Jul 12 11:55:47 i04-dd003 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 04
Jul 12 11:55:47 i04-dd003 kernel: SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
Jul 12 11:55:47 i04-dd003 kernel: sdb: Write Protect is off
Jul 12 11:55:47 i04-dd003 kernel: sdb: assuming drive cache: write through
Jul 12 11:55:47 i04-dd003 kernel: SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
Jul 12 11:55:47 i04-dd003 kernel: sdb: Write Protect is off
Jul 12 11:55:47 i04-dd003 kernel: sdb: assuming drive cache: write through
Jul 12 11:55:47 i04-dd003 kernel:  sdb: sdb1
Jul 12 11:55:47 i04-dd003 kernel: sd 8:0:0:0: Attached scsi disk sdb
Jul 12 11:55:47 i04-dd003 kernel: sd 8:0:0:0: Attached scsi generic sg1 type 0
Jul 12 11:55:48 i04-dd003 kernel: kjournald starting.  Commit interval 5 seconds
Jul 12 11:55:48 i04-dd003 kernel: EXT3 FS on sdb1, internal journal
Jul 12 11:55:48 i04-dd003 kernel: EXT3-fs: recovery complete.
Jul 12 11:55:48 i04-dd003 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 12 11:55:59 i04-dd003 kernel: usb 1-2: USB disconnect, address 4
Jul 12 11:55:59 i04-dd003 kernel: sd 8:0:0:0: rejecting I/O to device being removed
Jul 12 11:55:59 i04-dd003 kernel: Buffer I/O error on device sdb1, logical block 1027
Jul 12 11:55:59 i04-dd003 kernel: lost page write due to I/O error on sdb1
Jul 12 11:55:59 i04-dd003 kernel: Aborting journal on device sdb1.
Jul 12 11:55:59 i04-dd003 kernel: scsi 8:0:0:0: rejecting I/O to dead device
Jul 12 11:55:59 i04-dd003 kernel: Buffer I/O error on device sdb1, logical block 1545
Jul 12 11:55:59 i04-dd003 kernel: lost page write due to I/O error on sdb1
Jul 12 11:55:59 i04-dd003 kernel: ext3_abort called.
Jul 12 11:55:59 i04-dd003 kernel: EXT3-fs error (device sdb1): ext3_put_super: Couldn't clean up the journal
Jul 12 11:55:59 i04-dd003 kernel: Remounting filesystem read-only
Jul 12 11:56:04 i04-dd003 kernel: hub 2-0:1.0: connect-debounce failed, port 3 disabled
Jul 12 11:56:06 i04-dd003 kernel: hub 2-0:1.0: connect-debounce failed, port 4 disabled
Jul 12 11:56:08 i04-dd003 kernel: hub 2-0:1.0: connect-debounce failed, port 2 disabled
Jul 12 11:56:09 i04-dd003 kernel: hub 2-0:1.0: connect-debounce failed, port 3 disabled
Jul 12 11:56:11 i04-dd003 kernel: hub 2-0:1.0: connect-debounce failed, port 4 disabled
Jul 12 11:56:13 i04-dd003 kernel: hub 2-0:1.0: connect-debounce failed, port 2 disabled
Jul 12 11:56:14 i04-dd003 kernel: hub 2-0:1.0: connect-debounce failed, port 3 disabled
Jul 12 11:56:16 i04-dd003 kernel: hub 2-0:1.0: connect-debounce failed, port 4 disabled
Jul 12 11:56:18 i04-dd003 kernel: hub 2-0:1.0: connect-debounce failed, port 2 disabled
.....

All usb ports are unusable including keyboard. A reboot is very slow (POST mem check takes ages) and problem persists until power is physically removed.

modprobe -r ehci_hcd stops the log messages but doesn't solve the problem. The hardware appears to be in some sort of infinite loop.

Expected results:
Jul 12 11:22:25 i04-dd003 kernel: usb 1-2: new high speed USB device using ehci_hcd and address 3
Jul 12 11:22:25 i04-dd003 kernel: usb 1-2: configuration #1 chosen from 1 choice
Jul 12 11:22:25 i04-dd003 kernel: input: Western Digital External HDD as /class/input/input4
Jul 12 11:22:25 i04-dd003 kernel: input: USB HID v1.11 Device [Western Digital External HDD] on usb-0000:00:1a.7-2
Jul 12 11:22:25 i04-dd003 kernel: Initializing USB Mass Storage driver...
Jul 12 11:22:25 i04-dd003 kernel: scsi8 : SCSI emulation for USB Mass Storage devices
Jul 12 11:22:25 i04-dd003 kernel: usbcore: registered new driver usb-storage
Jul 12 11:22:25 i04-dd003 kernel: USB Mass Storage support registered.
Jul 12 11:22:30 i04-dd003 kernel:   Vendor: WD        Model: 2500JS External   Rev: 106a
Jul 12 11:22:30 i04-dd003 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 04
Jul 12 11:22:30 i04-dd003 kernel: SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
Jul 12 11:22:30 i04-dd003 kernel: sdb: Write Protect is off
Jul 12 11:22:30 i04-dd003 kernel: sdb: assuming drive cache: write through
Jul 12 11:22:30 i04-dd003 kernel: SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
Jul 12 11:22:30 i04-dd003 kernel: sdb: Write Protect is off
Jul 12 11:22:30 i04-dd003 kernel: sdb: assuming drive cache: write through
Jul 12 11:22:30 i04-dd003 kernel:  sdb: sdb1
Jul 12 11:22:30 i04-dd003 kernel: sd 8:0:0:0: Attached scsi disk sdb
Jul 12 11:22:30 i04-dd003 kernel: sd 8:0:0:0: Attached scsi generic sg1 type 0
Jul 12 11:22:31 i04-dd003 kernel: kjournald starting.  Commit interval 5 seconds
Jul 12 11:22:31 i04-dd003 kernel: EXT3 FS on sdb1, internal journal
Jul 12 11:22:31 i04-dd003 kernel: EXT3-fs: recovery complete.
Jul 12 11:22:31 i04-dd003 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 12 11:22:45 i04-dd003 kernel: usb 1-2: USB disconnect, address 3
Jul 12 11:22:45 i04-dd003 kernel: sd 8:0:0:0: rejecting I/O to device being removed
Jul 12 11:22:45 i04-dd003 kernel: Buffer I/O error on device sdb1, logical block 1027
Jul 12 11:22:45 i04-dd003 kernel: lost page write due to I/O error on sdb1
Jul 12 11:22:45 i04-dd003 kernel: Aborting journal on device sdb1.
Jul 12 11:22:45 i04-dd003 kernel: scsi 8:0:0:0: rejecting I/O to dead device
Jul 12 11:22:45 i04-dd003 kernel: Buffer I/O error on device sdb1, logical block 1545
Jul 12 11:22:45 i04-dd003 kernel: lost page write due to I/O error on sdb1
Jul 12 11:22:45 i04-dd003 kernel: ext3_abort called.
Jul 12 11:22:45 i04-dd003 kernel: EXT3-fs error (device sdb1): ext3_put_super: Couldn't clean up the journal
Jul 12 11:22:45 i04-dd003 kernel: Remounting filesystem read-only
Jul 12 11:23:19 i04-dd003 kernel: usb 1-1: new high speed USB device using ehci_hcd and address 4
Jul 12 11:23:19 i04-dd003 kernel: usb 1-1: configuration #1 chosen from 1 choice
Jul 12 11:23:19 i04-dd003 kernel: scsi9 : SCSI emulation for USB Mass Storage devices
Jul 12 11:23:19 i04-dd003 kernel: input: Western Digital External HDD as /class/input/input5
Jul 12 11:23:19 i04-dd003 kernel: input: USB HID v1.11 Device [Western Digital External HDD] on usb-0000:00:1a.7-1
Jul 12 11:23:24 i04-dd003 kernel:   Vendor: WD        Model: 2500JS External   Rev: 106a
Jul 12 11:23:24 i04-dd003 kernel:   Type:   Direct-Access                      ANSI SCSI revision: 04
Jul 12 11:23:24 i04-dd003 kernel: SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
Jul 12 11:23:24 i04-dd003 kernel: sdb: Write Protect is off
Jul 12 11:23:24 i04-dd003 kernel: sdb: assuming drive cache: write through
Jul 12 11:23:24 i04-dd003 kernel: SCSI device sdb: 488397168 512-byte hdwr sectors (250059 MB)
Jul 12 11:23:24 i04-dd003 kernel: sdb: Write Protect is off
Jul 12 11:23:24 i04-dd003 kernel: sdb: assuming drive cache: write through
Jul 12 11:23:24 i04-dd003 kernel:  sdb: sdb1
Jul 12 11:23:24 i04-dd003 kernel: sd 9:0:0:0: Attached scsi disk sdb
Jul 12 11:23:24 i04-dd003 kernel: sd 9:0:0:0: Attached scsi generic sg1 type 0
Jul 12 11:23:25 i04-dd003 kernel: kjournald starting.  Commit interval 5 seconds
Jul 12 11:23:25 i04-dd003 kernel: EXT3 FS on sdb1, internal journal
Jul 12 11:23:25 i04-dd003 kernel: EXT3-fs: recovery complete.
Jul 12 11:23:25 i04-dd003 kernel: EXT3-fs: mounted filesystem with ordered data mode.

or for whatever debounce procedure to actually fully reset the bus and recover.

Additional info:

lspci info for USB ports - 2 in detail (1 EHCI and 1 UHCI)

$ lspci | grep USB
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)

$ sudo lspci -vvv -s 00:1d.7
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03) (prog-if 20 [EHCI])
        Subsystem: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Interrupt: pin A routed to IRQ 74
        Region 0: Memory at fdffe000 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [50] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Debug port
        Capabilities: [98] #13 [0306]

$ sudo lspci -vvv -s 00:1d.0
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03) (prog-if 00 [UHCI])
        Subsystem: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
        Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0
        Interrupt: pin A routed to IRQ 74
        Region 4: I/O ports at fb00 [size=32]
        Capabilities: [50] #13 [0306]
Comment 1 Greg Matthews 2010-07-16 10:52:46 EDT
Don...

let me know if I can do anything to diagnose/test. This is a showstopper for a high profile project here.

thanks

GREG
Comment 2 Don Zickus 2010-07-16 11:00:25 EDT
How long are you waiting between unplug/plug?  If you exaggerate it say 30 seconds between unplugging/plugging does the problem still happen?

Trying to determine if this is a race condition or a functional problem with broken hardware.

Cheers,
Don
Comment 3 Greg Matthews 2010-07-20 11:01:15 EDT
I tried to reproduce again to check. In fact I'm able to trigger this without a re-plug.

If I have a zebra disk enclosure plugged into USB2-1, when I plug in the MyBook into USB1-2 it triggers the debounce problem and all usb devices are non-functional until power is removed and reconnected.
Comment 4 Don Zickus 2010-08-04 15:00:15 EDT
Searching through the upstream code, it seems WD changed he behaviour of their disks.  I can't tell from above if there are certain scenarios in which you can get the device to register and mount properly.

If so, can you run the command:
./smartctl -d sat -a /dev/sdb -T permissive

It should fail, if it has bigger sense data.  In that case I should be able to backport the specialized fix.

Cheers,
Don
Comment 5 Greg Matthews 2010-08-10 03:23:39 EDT
The WD disk can be mounted properly, if there are no other disks plugged in the MyBook can be mounted but the problem can also be triggered if the usb is replugged too quickly.

The following data was captured when the disk is registered correctly (data can't be collected once the usb bus has been put into a bad state).

sudo smartctl -d sat -a /dev/sdd -T permissive
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     [No Information Found]
Serial Number:    [No Information Found]
Firmware Version: [No Information Found]
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   1
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Aug 10 08:14:40 2010 BST
SMART is only available in ATA Version 3 Revision 3 or greater.
We will try to proceed in spite of this.
SMART support is: Ambiguous - ATA IDENTIFY DEVICE words 82-83 don't show if SMART supported.
                  Checking for SMART support by trying SMART ENABLE command.
                  SMART ENABLE appeared to work!  Continuing.
SMART support is: Ambiguous - ATA IDENTIFY DEVICE words 85-87 don't show if SMART is enabled.
                  Checking to be sure by trying SMART RETURN STATUS command.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Total time to complete Offline 
data collection:                 (   0) seconds.
Offline data collection
capabilities:                    (0x00)         Offline data collection not supported.
SMART capabilities:            (0x0000) Automatic saving of SMART data                                  is not implemented.
Error logging capability:        (0x00) Error logging NOT supported.
                                        No General Purpose Logging support.

Warning: device does not support Error Logging
SMART Error Log Version: 0
No Errors Logged

Warning: device does not support Self Test Logging
SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


Device does not support Selective Self Tests/Logging
Comment 6 Don Zickus 2010-08-13 10:27:40 EDT
Hi Greg,

Thanks for the info.  After talking with some other USB folks, their experience tells them that this is most likely bad hardware somewhere.  The trick is to figure out if it is the disk, cable, or the EHCI connector on the machine.

Unfortunately, there is no easy way to debug this.  The suggestions that were presented to me basically say, try to swap things around to see what consistently works and what consistently breaks.

IOW, perhaps start with the USB cables, swapping them for something else and testing to see if things are any different.  Then either move the disk to another machine and play with it there (under RHEL-5 if possible).  Also, if possible, put Microsoft Windows on the machine in question to see if the disk fails the same way there.  

It's a lot of grunt work, I understand, but the nature of the problem is that after plugging in the disk, somehow the connection goes bad and hoses the machine.  The question is who is breaking the connection, the machine, disk, or cable.

If you could put together a table of the different combinations you tried and post it in the bugzilla, it might help narrow in where the problem is.

Sorry I can't give you a better answer.

Cheers,
Don
Comment 7 Greg Matthews 2010-08-13 10:36:06 EDT
This is what I was afraid of. I have a report open with the supplier who has contacted the motherboard manufacturer. Basically I've tried different usb cables (I thought certain cables appeared to make the failure more likely) and I've tried with the disk on different hardware where I was unable to reproduce the fault that way.

It does point towards the motherboard/usb chipset. Unfortunately I don't have much more time to debug this so we'll probably have to ditch the hardware at this point.

thanks for looking into this.

GREG
Comment 8 Don Zickus 2010-08-13 11:35:27 EDT
Hi Greg,

I am going to close this as a hardware problem.  Feel free to re-open this if you run into this again with different hardware.

Cheers,
Don

Note You need to log in before you can comment on or make changes to this bug.