Bug 1003926

Summary: PANIC: "Oops: 0002 [#1] SMP " when plugging in a dodgy USB3 SSD HD
Product: [Fedora] Fedora Reporter: kubrick <kubrick>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 19CC: dblack, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, marcelo.barbosa
Target Milestone: ---Flags: jforbes: needinfo?
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-03-10 14:45:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
vmcore-dmesg.txt
none
kernel crash backtrace
none
kernel crash backtrace 3.11.1-200.fc19.x86_64 none

Description kubrick@fgv6.net 2013-09-03 14:15:44 UTC
Description of problem:
When I plug a malfunctioning SSD HardDrive with an USB 3.0 adaptor. I sometimes have this crash.

      KERNEL: /usr/lib/debug/lib/modules/3.10.10-200.fc19.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 4
        DATE: Tue Sep  3 14:51:20 2013
      UPTIME: 04:54:09
LOAD AVERAGE: 2.12, 2.16, 2.10
       TASKS: 550
    NODENAME: fguerrazltp
     RELEASE: 3.10.10-200.fc19.x86_64
     VERSION: #1 SMP Thu Aug 29 19:05:45 UTC 2013
     MACHINE: x86_64  (2494 Mhz)
      MEMORY: 7.7 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)
         PID: 1650
     COMMAND: "goa-daemon"
        TASK: ffff880211772dc0  [THREAD_INFO: ffff8801f197a000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)

I have a core dump and can run more commands agaist it you want

Comment 1 kubrick@fgv6.net 2013-09-03 14:18:49 UTC
Created attachment 793213 [details]
vmcore-dmesg.txt

Comment 2 kubrick@fgv6.net 2013-09-03 14:19:42 UTC
Created attachment 793214 [details]
kernel crash backtrace

Comment 3 kubrick@fgv6.net 2013-09-03 20:53:54 UTC
Sorry I'm not a kernel specialist but it I make an attempt at trying to find what's wrong, that's what I think:

drivers/usb/host/xhci-ring.c:

in handle_reset_ep_completion(), TRB_TO_EP_INDEX is apparently returning an unexpected value (0x40 in the case of this crash), causing:
andl   $0xfffffffd,0x40(%rcx)
    or, in the source,
xhci->devs[slot_id]->eps[ep_index].ep_state &= ~EP_HALTED;
to trigger a NULL pointer dereference.

I don't even know if it's really possible, but shouldn't we using the "xhci_event_cmd *event" structure instead with 
	ep_index = TRB_TO_EP_ID(le32_to_cpu(event->flags)) - 1;

?

I'll try that and see how it goes.

Comment 4 kubrick@fgv6.net 2013-09-04 11:06:11 UTC
Even when it doesn't crash, this USB device / hard drive systematically corrupts the USB stack:

Sep  4 12:00:07 fguerrazltp kernel: [  533.774596] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff8802038cc500
Sep  4 12:00:07 fguerrazltp kernel: [  533.774600] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff8802038cc540
Sep  4 12:00:07 fguerrazltp systemd-udevd[2809]: inotify_add_watch(7, /dev/sdc, 10) failed: No such file or directory
Sep  4 12:00:07 fguerrazltp systemd-udevd[2809]: inotify_add_watch(7, /dev/sdc, 10) failed: No such file or directory
Sep  4 12:00:13 fguerrazltp kernel: [  539.256996] xhci_hcd 0000:00:14.0: Timeout while waiting for address device command
Sep  4 12:00:18 fguerrazltp kernel: [  544.463844] xhci_hcd 0000:00:14.0: Timeout while waiting for address device command
Sep  4 12:00:18 fguerrazltp kernel: [  544.665048] usb 4-1: device not accepting address 4, error -62
Sep  4 12:00:24 fguerrazltp kernel: [  550.229274] xhci_hcd 0000:00:14.0: Timeout while waiting for address device command
Sep  4 12:00:24 fguerrazltp kernel: [  550.229291] xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd with unknown completion code of 24.
Sep  4 12:00:29 fguerrazltp kernel: [  555.436123] xhci_hcd 0000:00:14.0: Timeout while waiting for address device command
Sep  4 12:00:29 fguerrazltp kernel: [  555.637347] usb 4-1: device not accepting address 6, error -62
Sep  4 12:00:35 fguerrazltp kernel: [  561.201584] xhci_hcd 0000:00:14.0: Timeout while waiting for address device command
Sep  4 12:00:35 fguerrazltp kernel: [  561.402824] xhci_hcd 0000:00:14.0: ERROR: unexpected command completion code 0x0.
Sep  4 12:00:35 fguerrazltp kernel: [  561.604045] usb 4-1: device not accepting address 8, error -22
Sep  4 12:00:40 fguerrazltp kernel: [  566.609664] xhci_hcd 0000:00:14.0: Timeout while waiting for a slot
Sep  4 12:00:40 fguerrazltp kernel: [  566.609677] hub 4-0:1.0: couldn't allocate port 1 usb_device

And then, when I disconnect this device and try plugging in a working USB stick:
Sep  4 12:03:23 fguerrazltp kernel: [  730.167231] xhci_hcd 0000:00:14.0: ERROR: unexpected command completion code 0x0.
Sep  4 12:03:24 fguerrazltp kernel: [  730.368462] xhci_hcd 0000:00:14.0: Setup ERROR: address device command for slot 10.
Sep  4 12:03:24 fguerrazltp kernel: [  730.569702] usb 4-1: device not accepting address 9, error -22

Comment 5 Josh Boyer 2013-09-18 20:28:54 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.11.1-200.fc19.  Please test this kernel update and let us know if you issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 6 kubrick@fgv6.net 2013-09-19 10:52:01 UTC
Yes, still happens on 3.11.1-200.fc19 :

      KERNEL: /usr/lib/debug/lib/modules/3.11.1-200.fc19.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 4
        DATE: Thu Sep 19 11:39:34 2013
      UPTIME: 00:11:49
LOAD AVERAGE: 0.93, 0.61, 0.37
       TASKS: 429
    NODENAME: fguerrazltp
     RELEASE: 3.11.1-200.fc19.x86_64
     VERSION: #1 SMP Sat Sep 14 15:04:51 UTC 2013
     MACHINE: x86_64  (2494 Mhz)
      MEMORY: 7.7 GB
       PANIC: 
WARNING: log buf data structure(s) have changed
""
         PID: 0
     COMMAND: "swapper/2"
        TASK: ffff8802133bd3e0  (1 of 4)  [THREAD_INFO: ffff880212c6e000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)

Comment 7 kubrick@fgv6.net 2013-09-19 11:00:24 UTC
Created attachment 799836 [details]
kernel crash backtrace 3.11.1-200.fc19.x86_64

Comment 8 Dustin Black 2013-09-25 00:52:51 UTC
I haven't experienced a panic, or gotten a core or abrt report, but with

$ uname -a
Linux dblack 3.11.1-200.fc19.x86_64 #1 SMP Sat Sep 14 15:04:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux


and a USB3 external HDD that I don't *think* is dodgy, I consistently get a failure of xhci similar to the OP's log above.


Sep 24 20:41:24 dblack kernel: [  310.762507] usb 3-1: new full-speed USB device number 2 using xhci_hcd
Sep 24 20:41:35 dblack chronyd[711]: Selected source 88.190.34.44
Sep 24 20:41:43 dblack kernel: [  315.760676] xhci_hcd 0000:0d:00.0: Timeout while waiting for address device command
Sep 24 20:41:43 dblack kernel: [  329.601351] xhci_hcd 0000:0d:00.0: Stopped the command ring failed, maybe the host is dead
Sep 24 20:41:43 dblack kernel: [  329.645507] xhci_hcd 0000:0d:00.0: Host not halted after 16000 microseconds.
Sep 24 20:41:43 dblack kernel: [  329.645509] xhci_hcd 0000:0d:00.0: Abort command ring failed
Sep 24 20:41:43 dblack kernel: [  329.645904] xhci_hcd 0000:0d:00.0: HC died; cleaning up
Sep 24 20:41:43 dblack rtkit-daemon[725]: The canary thread is apparently starving. Taking action.
Sep 24 20:41:43 dblack rtkit-daemon[725]: Demoting known real-time threads.
Sep 24 20:41:43 dblack rtkit-daemon[725]: Successfully demoted thread 1920 of process 1889 (/usr/bin/pulseaudio).
Sep 24 20:41:43 dblack rtkit-daemon[725]: Successfully demoted thread 1909 of process 1889 (/usr/bin/pulseaudio).
Sep 24 20:41:43 dblack rtkit-daemon[725]: Successfully demoted thread 1903 of process 1889 (/usr/bin/pulseaudio).
Sep 24 20:41:43 dblack rtkit-daemon[725]: Successfully demoted thread 1897 of process 1889 (/usr/bin/pulseaudio).
Sep 24 20:41:43 dblack rtkit-daemon[725]: Successfully demoted thread 1889 of process 1889 (/usr/bin/pulseaudio).
Sep 24 20:41:43 dblack rtkit-daemon[725]: Demoted 5 threads.
Sep 24 20:41:48 dblack kernel: [  334.852703] xhci_hcd 0000:0d:00.0: Timeout while waiting for address device command
Sep 24 20:41:48 dblack kernel: [  334.852713] xhci_hcd 0000:0d:00.0: Abort the command ring, but the xHCI is dead.
Sep 24 20:41:48 dblack kernel: [  335.053616] usb 3-1: device not accepting address 2, error -108
Sep 24 20:41:53 dblack kernel: [  340.051550] xhci_hcd 0000:0d:00.0: Timeout while waiting for a slot
Sep 24 20:41:53 dblack kernel: [  340.051562] xhci_hcd 0000:0d:00.0: Abort the command ring, but the xHCI is dead.
Sep 24 20:41:58 dblack kernel: [  345.049489] xhci_hcd 0000:0d:00.0: Timeout while waiting for a slot
Sep 24 20:41:58 dblack kernel: [  345.049502] xhci_hcd 0000:0d:00.0: Abort the command ring, but the xHCI is dead.
Sep 24 20:42:03 dblack kernel: [  350.047391] xhci_hcd 0000:0d:00.0: Timeout while waiting for a slot
Sep 24 20:42:03 dblack kernel: [  350.047401] xhci_hcd 0000:0d:00.0: Abort the command ring, but the xHCI is dead.


After this happens, subsequent plug/unplug of the device to the USB 3 port results in no logging whatsoever.

The device can be plugged into a legacy USB port and function normally (albeit slowly).

The USB 3 port no longer seems to accept connection from any USB device -- nothing is logged and no tested devices work.

So far, I can't resolve this and use the USB 3 port again (for legacy devices) without rebooting the system.

Comment 9 Justin M. Forbes 2014-01-03 22:07:06 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.12.6-200.fc19.  Please test this kernel update (or newer) and let us know if you issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20, and are still experiencing this issue, please change the version to Fedora 20.

If you experience different issues, please open a new bug report for those.

Comment 10 kubrick@fgv6.net 2014-01-08 17:39:22 UTC
As if somebody cared...

I don't own the hardware that allowed be to trigger the bug anyway.

Comment 11 Justin M. Forbes 2014-03-10 14:45:27 UTC
*********** MASS BUG UPDATE **************

This bug has been in a needinfo state for more than 1 month and is being closed with insufficient data due to inactivity. If this is still an issue with Fedora 19, please feel free to reopen the bug and provide the additional information requested.