2019788 – kernel 5.14.14 and above USB devices fail during boot

Bug 2019788 - kernel 5.14.14 and above USB devices fail during boot

Summary: kernel 5.14.14 and above USB devices fail during boot

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	35
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (3):	2018703 2019542 2019576 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2021-11-03 10:58 UTC by Dave Hatton
Modified:	2022-02-01 11:38 UTC (History)
CC List:	28 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2022-02-01 11:38:42 UTC
Type:	Bug
Embargoed:
Dependent Products:
Flags:	mathias: needinfo-

Attachments	(Terms of Use)
trace log (289.98 KB, text/plain) 2021-11-04 02:29 UTC, Brian	no flags	Details
dmesg log (125.27 KB, text/plain) 2021-11-04 02:29 UTC, Brian	no flags	Details
dmesg xhci TRB error 4 (471.60 KB, text/plain) 2021-11-06 22:52 UTC, Jeff Cook	no flags	Details
dmesg from 5.14.16-201 (125.32 KB, text/plain) 2021-11-08 23:15 UTC, Brian	no flags	Details
trace from 5.14.16-201 (289.98 KB, text/plain) 2021-11-08 23:16 UTC, Brian	no flags	Details
testpatch keep address0 mutex over retries (3.65 KB, application/mbox) 2021-11-10 21:30 UTC, Mathias Nyman	no flags	Details
dmesg log (w/patch) (125.15 KB, text/plain) 2021-11-11 13:38 UTC, Brian	no flags	Details
trace log (w/patch) (289.98 KB, text/plain) 2021-11-11 13:39 UTC, Brian	no flags	Details
testpatch v2 keep address0 mutex over retries, fixed locking (4.67 KB, application/mbox) 2021-11-22 10:15 UTC, Mathias Nyman	no flags	Details
Show Obsolete (1) View All

Description Dave Hatton 2021-11-03 10:58:46 UTC

1. Please describe the problem:

During the kernel boot, all USB devices fail. 
Unable to use keyboard/mouse to operate machine.


Similar is described in:

https://ask.fedoraproject.org/t/the-new-update-of-kernel-and-dracut-deactivated-some-usb-ports/17550

https://bbs.archlinux.org/viewtopic.php?id=270795




2. What is the Version-Release number of the kernel:

5.14.13.200 works
Anything after this kernel exhibits the issue


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

5.14.13.200 works
Anything after this kernel exhibits the issue



4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Boot the machine with 5.14.14 +



5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

not tried


6. Are you running any modules that not shipped with directly Fedora's kernel?:

no



7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 10 using xhci_hcd
Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-7: New USB device found, idVendor=051d, idProduct=0002, bcdDevice= 1.06
Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-7: New USB device strings: Mfr=3, Product=1, SerialNumber=2
Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-7: Product: Back-UPS RS 1500 FW:8.g9 .I USB FW:g9
Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-7: Manufacturer: American Power Conversion
Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-7: SerialNumber: NB0647001556
Nov 03 09:27:10 tyr.starground.co.uk kernel: hid-generic 0003:051D:0002.000A: hiddev100,hidraw9: USB HID v1.10 Device [American Power Conversion Back-UPS RS 1500 FW:8.g9 .I USB FW:g9 ] on usb-0000:00:14.0-7/input0
Nov 03 09:27:15 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110
Nov 03 09:27:31 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110
Nov 03 09:27:31 tyr.starground.co.uk kernel: usb 1-10: new full-speed USB device number 11 using xhci_hcd
Nov 03 09:27:31 tyr.starground.co.uk kernel: usb 1-10: New USB device found, idVendor=8087, idProduct=0a2b, bcdDevice= 0.01
Nov 03 09:27:31 tyr.starground.co.uk kernel: usb 1-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0
Nov 03 09:27:31 tyr.starground.co.uk rngd[278]: [rtlsdr]: Initialization Failed
Nov 03 09:27:31 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 12 using xhci_hcd
Nov 03 09:27:36 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110
Nov 03 09:27:52 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110
Nov 03 09:27:52 tyr.starground.co.uk kernel: usb 1-4-port4: attempt power cycle
Nov 03 09:27:53 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 13 using xhci_hcd
Nov 03 09:28:03 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: Abort failed to stop command ring: -110
Nov 03 09:28:03 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
Nov 03 09:28:03 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: HC died; cleaning up
Nov 03 09:28:03 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
Nov 03 09:28:03 tyr.starground.co.uk kernel: usb 1-2: USB disconnect, device number 2
Nov 03 09:28:03 tyr.starground.co.uk kernel: usb 2-4: USB disconnect, device number 2
Nov 03 09:28:03 tyr.starground.co.uk kernel: usb 1-3: USB disconnect, device number 3
Nov 03 09:28:03 tyr.starground.co.uk kernel: usb 1-4: USB disconnect, device number 4
Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-4.4: device not accepting address 13, error -108
Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-4-port4: couldn't allocate usb_device
Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-4.1: USB disconnect, device number 6
Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-4.2: USB disconnect, device number 8
Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-5: USB disconnect, device number 5
Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-6: USB disconnect, device number 7
Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-7: USB disconnect, device number 9
Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-10: USB disconnect, device number 11
Nov 03 09:28:04 tyr.starground.co.uk systemd[1]: Finished dracut initqueue hook.

Comment 1 Hans de Goede 2021-11-03 12:50:22 UTC

Thank you for your bug-report.

This is a known issue / regression since 5.14.14, see:

https://lore.kernel.org/stable/42bcbea6-5eb8-16c7-336a-2cb72e71bc36@redhat.com/

We know which patches in 5.14.14 are causing this and they will be reverted for the next 5.14.y release, for now please stay with a kernel < 5.14.14 to work around this.

Comment 2 Hans de Goede 2021-11-03 12:52:21 UTC

*** Bug 2019576 has been marked as a duplicate of this bug. ***

Comment 3 Hans de Goede 2021-11-03 12:52:24 UTC

*** Bug 2019542 has been marked as a duplicate of this bug. ***

Comment 4 Mathias Nyman 2021-11-03 23:30:29 UTC

Any chance this could be reproduced with xhci dynamic debug and tracing enabled?
To enable them at boot please add:
"usbcore.dyndbg=+p xhci_hcd.dyndbg=+p trace_event=xhci-hcd trace_buf_size=80M"
to your kernel cmdline before booting.

after boot send output of dmesg, and content of /sys/kernel/debug/tracing/trace
Thanks

Comment 5 Brian 2021-11-04 02:29:20 UTC

Created attachment 1839802 [details]
trace log

Comment 6 Brian 2021-11-04 02:29:53 UTC

Created attachment 1839803 [details]
dmesg log

Comment 7 Brian 2021-11-04 02:30:44 UTC

Mathias,

I've attached the logs, hopefully they're helpful. Let me know if I can provide any more details.

Thx,
brian

Comment 8 Arjun 2021-11-04 18:31:40 UTC

*** Bug 2018703 has been marked as a duplicate of this bug. ***

Comment 9 Hans de Goede 2021-11-05 07:11:09 UTC

This should be fixed in kernel-5.14.16-201.fc34 /  kernel-5.14.16-301.fc35 which are available in the updates-testing repo now, please give these kernels a try:

sudo dnf update --enablerepo updates-testing 'kernel*'

Comment 10 Dave Hatton 2021-11-05 08:17:09 UTC

I'm afraid I still have the issue with kernel-5.14.16-301.fc35


-- Journal begins at Tue 2021-05-11 13:07:27 BST, ends at Fri 2021-11-05 08:14:32 GMT. --
Nov 05 07:57:17 tyr.starground.co.uk kernel: microcode: microcode updated early to revision 0xea, date = 2021-01-05
Nov 05 07:57:17 tyr.starground.co.uk kernel: Linux version 5.14.16-301.fc35.x86_64 (mockbuild.fedoraproject.org) (gcc (GCC) 11.2.1 20210728 (Red Hat 11.2.1-1), GNU ld version 2.37-10.fc35) #1 SMP Wed Nov 3 13:55:42 UTC 2021
Nov 05 07:57:17 tyr.starground.co.uk kernel: Command line: BOOT_IMAGE=(hd2,gpt2)/vmlinuz-5.14.16-301.fc35.x86_64 root=UUID=5cbb8d95-2001-470e-8c03-1c77b44d0b59 ro resume=UUID=6abb24af-8338-4a6b-b26e-323ea6b06033 rhgb quiet
...
...
ov 05 07:57:23 tyr.starground.co.uk kernel: usb 1-4.2: New USB device found, idVendor=046d, idProduct=c52b, bcdDevice=12.11
Nov 05 07:57:23 tyr.starground.co.uk kernel: usb 1-4.2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Nov 05 07:57:23 tyr.starground.co.uk kernel: usb 1-4.2: Product: USB Receiver
Nov 05 07:57:23 tyr.starground.co.uk kernel: usb 1-4.2: Manufacturer: Logitech
Nov 05 07:57:23 tyr.starground.co.uk kernel: input: Logitech USB Receiver as /devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4.2/1-4.2:1.0/0003:046D:C52B.0004/input/input10
Nov 05 07:57:23 tyr.starground.co.uk kernel: hid-generic 0003:046D:C52B.0004: input,hidraw3: USB HID v1.11 Keyboard [Logitech USB Receiver] on usb-0000:00:14.0-4.2/input0
Nov 05 07:57:23 tyr.starground.co.uk kernel: input: Logitech USB Receiver Mouse as /devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4.2/1-4.2:1.1/0003:046D:C52B.0005/input/input11
Nov 05 07:57:23 tyr.starground.co.uk kernel: input: Logitech USB Receiver Consumer Control as /devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4.2/1-4.2:1.1/0003:046D:C52B.0005/input/input12
Nov 05 07:57:24 tyr.starground.co.uk kernel: input: Logitech USB Receiver System Control as /devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4.2/1-4.2:1.1/0003:046D:C52B.0005/input/input13
Nov 05 07:57:24 tyr.starground.co.uk kernel: hid-generic 0003:046D:C52B.0005: input,hiddev98,hidraw4: USB HID v1.11 Mouse [Logitech USB Receiver] on usb-0000:00:14.0-4.2/input1
Nov 05 07:57:24 tyr.starground.co.uk kernel: hid-generic 0003:046D:C52B.0006: hiddev99,hidraw5: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:00:14.0-4.2/input2
Nov 05 07:57:24 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 9 using xhci_hcd
Nov 05 07:57:24 tyr.starground.co.uk kernel: hid-generic 0003:051D:0002.0007: hiddev100,hidraw6: USB HID v1.10 Device [American Power Conversion Back-UPS RS 1500 FW:8.g9 .I USB FW:g9 ] on usb-0000:00:14.0-7/input0
Nov 05 07:57:29 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110
Nov 05 07:57:45 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110
Nov 05 07:57:45 tyr.starground.co.uk kernel: usb 1-10: new full-speed USB device number 10 using xhci_hcd
Nov 05 07:57:45 tyr.starground.co.uk kernel: usb 1-10: New USB device found, idVendor=8087, idProduct=0a2b, bcdDevice= 0.01
Nov 05 07:57:45 tyr.starground.co.uk kernel: usb 1-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0
Nov 05 07:57:45 tyr.starground.co.uk rngd[277]: [rtlsdr]: Initialization Failed
Nov 05 07:57:45 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 11 using xhci_hcd
Nov 05 07:57:50 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110
Nov 05 07:58:06 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110
Nov 05 07:58:06 tyr.starground.co.uk kernel: usb 1-4-port4: attempt power cycle
Nov 05 07:58:07 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 12 using xhci_hcd
Nov 05 07:58:17 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: Abort failed to stop command ring: -110
Nov 05 07:58:17 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
Nov 05 07:58:17 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: HC died; cleaning up
Nov 05 07:58:17 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
Nov 05 07:58:17 tyr.starground.co.uk kernel: usb 1-2: USB disconnect, device number 2
Nov 05 07:58:17 tyr.starground.co.uk kernel: usb 2-4: USB disconnect, device number 2
Nov 05 07:58:17 tyr.starground.co.uk kernel: usb 1-3: USB disconnect, device number 3
Nov 05 07:58:17 tyr.starground.co.uk kernel: usb 1-4: USB disconnect, device number 4
Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-4.4: device not accepting address 12, error -108
Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-4-port4: couldn't allocate usb_device
Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-4.1: USB disconnect, device number 6
Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-4.2: USB disconnect, device number 8
Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-5: USB disconnect, device number 5
Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-7: USB disconnect, device number 7
Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-10: USB disconnect, device number 10
Nov 05 07:58:18 tyr.starground.co.uk systemd[1]: Finished dracut initqueue hook.
...
...

Comment 11 Mathias Nyman 2021-11-05 21:24:45 UTC

Thanks for the trace and dmesg, I'm looking into those.

If you could take a couple more traces of failing cases just to check if they
have the same pattern of "enable slot A, set Address for slot B" before failing,
it would help a lot. 

Details:

Rootcause not found yet
I'm currently suspecting two xhci slots are in a "default" state
at the same time, meaning two USB devices are waiting with address 0 for host to 
give them an unique address. xHC can't handle this as stated in xhci 4.5.3.4

Normally we first enable a slot, then later set an address for the device behind it,
and move to the next slot. Trace shows it hangs after one roothub thread enables slot 6,
and the second roothub thread sets the address of slot 5, a different slot right after that.

The address0_mutex taken right after enabling a slot is supposed to prevent
two slots from being in default state at the same time.
Maybe there is some small window during a retry that leave a slot in default
state and unlocks the mutex? Need to check that. 

This could make sense as after the patch both roothubs are created almost simultaneously,
causing a higher race risk during enumeration.

Comment 12 Brian 2021-11-05 21:43:50 UTC

@mathias.nyman.com Do you want me to provide more traces from that same machine or are you looking for data from other machines? I only have one computer with this problem, unsure if that data will be helpful or not.

Thx,
brian

Comment 13 Jeff Cook 2021-11-06 22:49:14 UTC

I've been encountering same issue on a Fresco Logic FL1100 USB 3.0 Host Controller [1b73:1100] (rev 10), though it usually doesn't hit until the bus has been running for several hours; there will typically be at least one failure within 24-48 hours. Removing the PCI device and rescanning the PCI bus will restore function for another several hours when the same bug will hit again.

I was able to get a debug log the last time this occurred. Attaching a 2-minute dmesg snippet so there's some space before and after; the error occurs at 03:02:42 on line 3898. Unfortunately, the trace has since cycled out.

Comment 14 Jeff Cook 2021-11-06 22:52:40 UTC

Created attachment 1840495 [details]
dmesg xhci TRB error 4

Comment 15 allregshere 2021-11-08 21:58:17 UTC

I have a MacBook Pro late 2014 and I am seeing this issue with Kernel 5.14.14 and later.

Comment 16 Mathias Nyman 2021-11-08 22:01:31 UTC

@Brian, Same machine is fine

@Jeff, Thanks, your case seems to be related to resume.

Same theory that several xHC slots end up in default state at the same time still fits.
Log show a reset-resume caes, which will also reset the device and set xHC slot to default state.

The TRB error also fits. Specs say xHC should respond with a TRB error in case a SET_ADDRESS control transfer
is submitted while the host is currently submitting a SET ADDRESS request for another device as part
of a Address device command. (see xhci 4.5.4.1)

Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 5 ep 0 on endpoint
Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: TRB error 4, halted endpoint index = 0
Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: // Ding dong!
Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Ignoring reset ep completion code of 1
Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Removing canceled TD starting at 0x118ab8300 (dma).
Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Set TR Deq ptr 0x118ab8330, cycle 0
Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: // Ding dong!
Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Successful Set TR Deq Ptr cmd, deq = @118ab8330
Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Giveback URB 00000000740a70e1, len = 0, expected = 2, status = -71
Nov 06 03:02:42 kernel: usb 6-4: retry with reset-resume

Comment 17 Brian 2021-11-08 23:15:06 UTC

Created attachment 1840767 [details]
dmesg from 5.14.16-201

Comment 18 Brian 2021-11-08 23:16:03 UTC

Created attachment 1840768 [details]
trace from 5.14.16-201

Comment 19 Brian 2021-11-08 23:17:52 UTC

@mathias.nyman.com I've attached dmesg/trace logs from 5.14.16-201. FWIW this kernel did not fix the problem on my machine. Thanks!

Comment 20 Mathias Nyman 2021-11-10 21:30:14 UTC

Created attachment 1841103 [details]
testpatch keep address0 mutex over retries

Can you give this testpatch a try?
It tries to avoid letting several devices enter xhci slot default state during hub_port_init retries.

It also adds some extra prints to log, mostly just to see it applied successfully.
Thanks

Comment 21 Brian 2021-11-11 13:37:36 UTC

@mathias.nyman.com No go on that patch. I rebuilt the kernel with it applies and rebooted to the same symptom. I'll attach dmesg and trace log from this boot in case those are useful to you.

brian

Comment 22 Brian 2021-11-11 13:38:56 UTC

Created attachment 1841207 [details]
dmesg log (w/patch)

Comment 23 Brian 2021-11-11 13:39:37 UTC

Created attachment 1841208 [details]
trace log (w/patch)

Comment 24 Mathias Nyman 2021-11-12 09:30:04 UTC

Thanks, I think something went wrong when applying the patch.
dmesg log should show "GREPME port_init_try" if patch is applied.

Comment 25 Brian 2021-11-12 13:35:38 UTC

That's entirely possible. It's been well over a decade since I last built a kernel. Are there official Fedora instructions I can follow or (ideally) something like Copr that can do the heavy lifting?

Comment 26 Justin M. Forbes 2021-11-12 16:12:10 UTC

The easiest way would be 

fedpkg clone kernel
cd kernel
fedpkg switch-branch f35
cat /path/to/patch > linux-kernel-test.patch
fedpkg srpm
koji build --scratch --arch=x86_64 f35 /path/to/srpm

Comment 27 Brian 2021-11-12 17:48:38 UTC

Comment 28 Brian 2021-11-12 17:50:29 UTC

Justin - thx. Do I need to do anything special for fedpkg access to kernel? I've never used fedpkg before but when I try to clone I see:
  > fedpkg clone kernel
  Cloning into 'kernel'...
  Connection closed by 38.145.60.17 port 22
  fatal: Could not read from remote repository.

  Please make sure you have the correct access rights
  and the repository exists.
  Could not execute clone: Failed to execute command.

Comment 29 Justin M. Forbes 2021-11-12 18:31:32 UTC

(In reply to Brian from comment #28)
> Justin - thx. Do I need to do anything special for fedpkg access to kernel?


Sorry, you can clone it with 'fedpkg --anonymous clone kernel'

Comment 30 Adam Pribyl 2021-11-13 12:18:50 UTC

I can not boot the HP MicroServer Gen10 with kernels above 5.14.14, most of the time it just kernel panics during mode setting remaining in the black screen - nothing is visible if you attach the monitor after crash. There are many USB devices connect to the server, but I am not sure which can cause this if any.  As of now the only kernel installed during the system-upgrade is 5.14.16 this is pretty dangerous for upgrade process when started from older kernel..

Comment 31 Brian 2021-11-13 13:20:21 UTC

Justin - thanks, that worked. Well, 'fedpkg clone --anonymous kernel'. And using koji made this so much simpler so thanks for those instructions.

@mathias.nyman.com that patch worked! I'm writing this from a custom build of 5.14.18-200 with your patch applied:
  > journalctl -b -1 --no-hostname -k | grep GREPME
  Nov 13 08:11:53 kernel: usb usb1-port3: GREPME port_init_try 0
  Nov 13 08:11:53 kernel: usb usb2-port4: GREPME port_init_try 0
  Nov 13 08:11:53 kernel: usb usb1-port9: GREPME port_init_try 0
  Nov 13 08:11:53 kernel: usb 1-3-port1: GREPME port_init_try 0
  Nov 13 08:11:53 kernel: usb usb2-port3: GREPME port_init_try 0
  Nov 13 08:12:04 kernel: usb usb2-port3: GREPME port_init_try 1
  Nov 13 08:12:04 kernel: usb 1-3-port4: GREPME port_init_try 0
No issues so far with keyboard or mouse.

Let me know if you want to see the trace log or dmesg and I can attach those files but from my perspective I'd love to see this patch merged into f34 and f35. Thanks!

Comment 32 Mathias Nyman 2021-11-15 22:02:06 UTC

Thanks, I'll clean up the patch a bit (remove extra printks) and send it upstream (linux-usb) for review

Comment 33 Brian 2021-11-15 22:10:30 UTC

Thanks. Hopefully it's gets merged quickly!

Is there any way to track when the patch lands in F34 and F35? I'm blocked from upgrading kernels until then. Thanks!

Comment 34 Dave Hatton 2021-11-20 16:02:37 UTC

@mathias.nyman.com would you have an ETA on this patch? 
I'm really grateful you have fixed it, but it would be great to know when it might hit a mainstream kernel?
Sorry, don't wish to sound ungrateful for you efforts, just want to plan for upgrades.

Comment 35 Mathias Nyman 2021-11-22 09:59:37 UTC

Patch was/is on its way upstream, but it has some locking issues
Currently in usb-linus:
6ae6dc22d2d1 usb: hub: Fix usb enumeration issue due to address0 race

I need to fix that.

Could you test a new version of the patch that has fixed locking.
(apply without previous patch)

Comment 36 Mathias Nyman 2021-11-22 10:15:13 UTC

Created attachment 1842979 [details]
testpatch v2  keep address0 mutex over retries, fixed locking

testpatch v2, keep address0 mutex over retries, fixed locking issues

Adds extra "GREPMEv2" print to log for debugging purposes.
Thanks

Comment 37 Brian 2021-11-22 18:14:10 UTC

Mathias,

The new attachment boots OK for me when applied to 5.15.4.

> cat /proc/version
Linux version 5.15.4-100.fc34.x86_64 (mockbuild.fedoraproject.org) (gcc (GCC) 11.2.1 20210728 (Red Hat 11.2.1-1), GNU ld version 2.35.2-6.fc34) #1 SMP Mon Nov 22 14:36:34 UTC 2021

> sudo journalctl --no-hostname -k | grep GREPMEv2
Nov 22 13:11:05 kernel: usb usb1-port3: GREPMEv2 port_init_try 0
Nov 22 13:11:05 kernel: usb usb2-port3: GREPMEv2 port_init_try 0
Nov 22 13:11:05 kernel: usb usb1-port9: GREPMEv2 port_init_try 0
Nov 22 13:11:05 kernel: usb usb2-port4: GREPMEv2 port_init_try 0
Nov 22 13:11:05 kernel: usb 1-3-port1: GREPMEv2 port_init_try 0
Nov 22 13:11:05 kernel: usb 1-3-port4: GREPMEv2 port_init_try 0

Thanks,
brian

Comment 38 Mathias Nyman 2021-11-23 09:26:56 UTC

Thanks for testing,
I'll submit a patch upstream that fixes the locking issues of the already upstreamed first patch.
Final result will be same as the v2 patch, (without the extra GREPMEv2 printing)

-Mathias

Comment 39 Brian 2021-11-23 13:28:17 UTC

Thanks for you work on this, Mathias!

Still curious if anyone CCed on this ticket can help clarify how to track when this patch will find its way into F34 and F35? I'm currently on F34 and unable to upgrade kernels (or go to F35) until this patch makes its way back into Fedora somewhere but I'm not even sure how to know when that will be safe for my system. Thanks in advance!

Comment 40 Hans de Goede 2021-11-23 14:11:25 UTC

(In reply to Brian from comment #39)
> Still curious if anyone CCed on this ticket can help clarify how to track when this patch will find its way into F34 and F35?

What you can do is watch here:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/?h=linux-5.15.y

For Matthias' patch to show up and once it has shown up there in a new 5.15.y release, then wait for that release to become available in fedora-updates-testing. You can check the latest kernel builds available for Fedora here:

https://koji.fedoraproject.org/koji/packageinfo?packageID=8

Comment 41 Brian 2021-11-23 14:13:36 UTC

Thanks Hans!

Comment 42 Brian 2021-12-01 17:51:19 UTC

I saw that the patch got merged into 5.15.6 earlier and just successfully tested build 5.15.6-100.fc34 (https://koji.fedoraproject.org/koji/buildinfo?buildID=1861751).

Comment 43 Peter Janes 2021-12-02 00:18:52 UTC

Sad to report that build 5.15.6-200.fc35.x86_64 (https://koji.fedoraproject.org/koji/buildinfo?buildID=1861750) doesn't resolve the issue here.  Logitech optical mouse lights up at boot but turns off after a timeout, and my external CD/DVD drives are nonfunctional as well.  (My HP LaserJet printer also isn't recognized, although that may predate the bug.)

Comment 44 Dave Hatton 2021-12-02 10:46:23 UTC

5.15.6-200.fc35.x86_64 also fails for me with the same (orginal) symptoms.

Comment 45 Hans de Goede 2021-12-02 12:40:47 UTC

Hmm, so it seems that we still have an issue somewhere, starting with 5.14.14, at least in Dave Hatton's case (the original reporter of this bug) it has been reported that 5.14.13 does not have this problem

Peter and Dave, can you please provide the output of:

lspci -nn | grep USB

So that we know which vendor/model controller we are dealing with here?

Between 5.14.13 an 5.14.14 there are only a few USB commits, and thus only a few likely candidates:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=2b6c75bf92027ebf79f1cb579b5ff3ac5d4f218d
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=e54abefe703ab7c4e5983e889babd1447738ca42
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=fa3093d37cceb5d126ef8a4c12a9fbf373052915
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=eacfdec26656dc81a04ca1460cddb98aa4da731a
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=db96c1d87c953fa79e026b1e3cb40fff5b413edd

As I mentioned in bug 2019576 I believe that this one:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=e54abefe703ab7c4e5983e889babd1447738ca42

Is a likely candidate for causing this issue, based on me assuming that some controllers may not like writing only half of the 64 bit register.

Some of the others only apply to certain models xHCI controllers. So if one of you is using one of those models then they might explain the problem too.

I have started a test 5.15.6 kernel-build with commit e54abefe703ab7c4e5983e889babd1447738ca42 reverted here:
https://koji.fedoraproject.org/koji/taskinfo?taskID=79510050

Note this is still building ATM, this takes a couple of hours. Pete, Dave, please give this build a try and let us know if it helps (and also please provide "lspci -nn | grep USB" output). 

Here are some generic instructions for installing a kernel directly from koji:
https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt

Comment 46 Dave Hatton 2021-12-02 13:15:48 UTC

Hi Hans, requested output.


[root@tyr ~]# lspci -nn | grep USB
00:14.0 USB controller [0c03]: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller [8086:a2af]

Comment 47 Adam Pribyl 2021-12-02 13:25:47 UTC

I am not the one on request list, but anyway this is the easiest thing I can do:

# lspci -nn | grep USB
00:10.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller [1022:7914] (rev 20)
00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller [1022:7908] (rev 49)

Comment 48 Peter Janes 2021-12-02 14:34:31 UTC

As requested.

# lspci -nn | grep USB
00:14.0 USB controller [0c03]: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller [8086:a12f] (rev 31)

Comment 49 Dave Hatton 2021-12-02 18:32:41 UTC

Hans, 

Thank you! I'm up and running on the test kernel you built ...

# cat /proc/cmdline 
BOOT_IMAGE=(hd2,gpt2)/vmlinuz-5.15.6-200.bz2019788.fc35.x86_64 root=UUID=5cbb8d95-2001-470e-8c03-1c77b44d0b59 ro resume=UUID=6abb24af-8338-4a6b-b26e-323ea6b06033 rhgb quiet

Comment 50 Dave Hatton 2021-12-03 09:20:26 UTC

Hi Hans,

A follow up ... whilst this kernel build  is much better, my logitech wireless keyboard/mouse are now not recovering correctly following suspension overnight.

It took a number of unplug/plug cycles of the dongle before they were recognized.

Dave

Comment 51 Tobias Bachmann 2021-12-04 13:07:13 UTC

I can confirm that reverting e54abefe703ab7c4e5983e889babd1447738ca42 does solve my issue of XHCI dying on boot. I'm on a MacBook Pro Mid 2014 (MacBookPro11,1), running Arch Linux.

# lspci -nn | grep USB
00:14.0 USB controller [0c03]: Intel Corporation 8 Series USB xHCI HC [8086:9c31] (rev 04)

Comment 52 Hans de Goede 2021-12-04 14:03:08 UTC

Ok, so now we have 2 confirmations that commit e54abefe703ab7c4e5983e889babd1447738ca42 (5.14.y hash) aka commit ff0e50d3564f33b7f4b35cadeabd951d66cfc570 (Linus' master branch hash) is causing problems on several Intel xHCI controllers.

I believe that the problem here likely is that the data-bus to the xHCI is actually a 64 bits data bus and the 32 bit write introduced by this commit is causing a read + modify + write over the 64 bit data bus to change just the 32 bit.

This actually makes the chances of hitting the problem the commit triies to fix larger on these controllers since now we do:

sw: read 32 bit
hw: read 64 bit, return 32 bit
sw: modify it
sw: write 32 bit
hw: read 64 bit
hw: modify 32 bit of 64 bit wor5d
hw: write 64 bit

Note this is just a theory, but it seems plausible to me. I will start a discussion about this upstream.

Comment 53 Hans de Goede 2021-12-04 14:17:57 UTC

Upstream discussion about this:
https://lore.kernel.org/linux-usb/0a4d723c-221f-a42a-434c-34283a4cdab7@redhat.com/

Comment 54 Adam Pribyl 2021-12-04 15:38:23 UTC

OK, this kernel fixed boot problem also for me i.e. it is not just Intel XHCI.

Comment 55 Adam Pribyl 2021-12-04 15:56:56 UTC

(In reply to Adam Pribyl from comment #54)
> OK, this kernel fixed boot problem also for me i.e. it is not just Intel
> XHCI.

Pardon me, I jumped directly from 5.14.13 to 5.15.6-200.bz2019788, but then I tried also a 5.15.6 from distro, and it works for me too, so this was something different.

Comment 56 Hans de Goede 2021-12-06 08:52:47 UTC

A proper fix (instead of a revert) for commit e54abefe703ab7 has now landed upstream:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=09f736aa95476631227d2dc0e6b9aeee1ad7ed58

This is in the just released 5.16-rc4 and should show up in a newer 5.15.y release soon-ish too.

I expect 5.16-rc4 to show up in koji:
https://koji.fedoraproject.org/koji/packageinfo?packageID=8

Pretty soon, so if you are interested in this fix, you can give 5.16-rc4 a try once it has been build.

Comment 57 Tobias Bachmann 2021-12-06 18:14:16 UTC

5.16-rc4 works for me.

Comment 58 Dave Hatton 2021-12-07 08:55:46 UTC

5.16-rc4 works for me, but as mentioned my logitech wireless keyboard and mouse no longer function correctly following a deep suspend. I've raised ticket 2029140 to cover this.

Comment 59 Arjun 2021-12-15 12:50:32 UTC

My peripherals are back on the stable 5.15.7, thank you.

Comment 60 Dave Hatton 2022-02-01 10:31:44 UTC

To follow up.
I recently carried out a fresh installation of Fedora35 on the PC having this issue.
I'm now running kernel 5.15.18-200.fc35.x86_64.
I'm no longer experiencing the problem with USB devices failing at boot time. I'm not sure if it is the later kernel or the fresh installation that resolved this.
I'd be happy to close this issue now.

Comment 61 Hans de Goede 2022-02-01 11:38:42 UTC

The upstream kernel recently has seen a bunch of fixes related to race conditions in the xHCI driver which trigger duing device enumeration. So it is likely that it is the move to 5.15.18 fixed this.

Thank you for letting us know that this is fixed, I'll close this now.

Note You need to log in before you can comment on or make changes to this bug.

acaringi
adscvr
airlied
alciregi
allregshere
arjunamernath
bfallik
bskeggs
bugzilla.redhat.com
covex
hdegoede
jarodwilson
jeff
jeremy
jforbes
jglisse
jonathan
josef
kernel-maint
lgoncalv
linville
masami256
mathias.nyman
mchehab
ptalbert
sergio.pasra
steved
tobachmann