1. Please describe the problem: During the kernel boot, all USB devices fail. Unable to use keyboard/mouse to operate machine. Similar is described in: https://ask.fedoraproject.org/t/the-new-update-of-kernel-and-dracut-deactivated-some-usb-ports/17550 https://bbs.archlinux.org/viewtopic.php?id=270795 2. What is the Version-Release number of the kernel: 5.14.13.200 works Anything after this kernel exhibits the issue 3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 : 5.14.13.200 works Anything after this kernel exhibits the issue 4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below: Boot the machine with 5.14.14 + 5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``: not tried 6. Are you running any modules that not shipped with directly Fedora's kernel?: no 7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag. Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 10 using xhci_hcd Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-7: New USB device found, idVendor=051d, idProduct=0002, bcdDevice= 1.06 Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-7: New USB device strings: Mfr=3, Product=1, SerialNumber=2 Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-7: Product: Back-UPS RS 1500 FW:8.g9 .I USB FW:g9 Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-7: Manufacturer: American Power Conversion Nov 03 09:27:10 tyr.starground.co.uk kernel: usb 1-7: SerialNumber: NB0647001556 Nov 03 09:27:10 tyr.starground.co.uk kernel: hid-generic 0003:051D:0002.000A: hiddev100,hidraw9: USB HID v1.10 Device [American Power Conversion Back-UPS RS 1500 FW:8.g9 .I USB FW:g9 ] on usb-0000:00:14.0-7/input0 Nov 03 09:27:15 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110 Nov 03 09:27:31 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110 Nov 03 09:27:31 tyr.starground.co.uk kernel: usb 1-10: new full-speed USB device number 11 using xhci_hcd Nov 03 09:27:31 tyr.starground.co.uk kernel: usb 1-10: New USB device found, idVendor=8087, idProduct=0a2b, bcdDevice= 0.01 Nov 03 09:27:31 tyr.starground.co.uk kernel: usb 1-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0 Nov 03 09:27:31 tyr.starground.co.uk rngd[278]: [rtlsdr]: Initialization Failed Nov 03 09:27:31 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 12 using xhci_hcd Nov 03 09:27:36 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110 Nov 03 09:27:52 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110 Nov 03 09:27:52 tyr.starground.co.uk kernel: usb 1-4-port4: attempt power cycle Nov 03 09:27:53 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 13 using xhci_hcd Nov 03 09:28:03 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: Abort failed to stop command ring: -110 Nov 03 09:28:03 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead Nov 03 09:28:03 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: HC died; cleaning up Nov 03 09:28:03 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command Nov 03 09:28:03 tyr.starground.co.uk kernel: usb 1-2: USB disconnect, device number 2 Nov 03 09:28:03 tyr.starground.co.uk kernel: usb 2-4: USB disconnect, device number 2 Nov 03 09:28:03 tyr.starground.co.uk kernel: usb 1-3: USB disconnect, device number 3 Nov 03 09:28:03 tyr.starground.co.uk kernel: usb 1-4: USB disconnect, device number 4 Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-4.4: device not accepting address 13, error -108 Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-4-port4: couldn't allocate usb_device Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-4.1: USB disconnect, device number 6 Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-4.2: USB disconnect, device number 8 Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-5: USB disconnect, device number 5 Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-6: USB disconnect, device number 7 Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-7: USB disconnect, device number 9 Nov 03 09:28:04 tyr.starground.co.uk kernel: usb 1-10: USB disconnect, device number 11 Nov 03 09:28:04 tyr.starground.co.uk systemd[1]: Finished dracut initqueue hook.
Thank you for your bug-report. This is a known issue / regression since 5.14.14, see: https://lore.kernel.org/stable/42bcbea6-5eb8-16c7-336a-2cb72e71bc36@redhat.com/ We know which patches in 5.14.14 are causing this and they will be reverted for the next 5.14.y release, for now please stay with a kernel < 5.14.14 to work around this.
*** Bug 2019576 has been marked as a duplicate of this bug. ***
*** Bug 2019542 has been marked as a duplicate of this bug. ***
Any chance this could be reproduced with xhci dynamic debug and tracing enabled? To enable them at boot please add: "usbcore.dyndbg=+p xhci_hcd.dyndbg=+p trace_event=xhci-hcd trace_buf_size=80M" to your kernel cmdline before booting. after boot send output of dmesg, and content of /sys/kernel/debug/tracing/trace Thanks
Created attachment 1839802 [details] trace log
Created attachment 1839803 [details] dmesg log
Mathias, I've attached the logs, hopefully they're helpful. Let me know if I can provide any more details. Thx, brian
*** Bug 2018703 has been marked as a duplicate of this bug. ***
This should be fixed in kernel-5.14.16-201.fc34 / kernel-5.14.16-301.fc35 which are available in the updates-testing repo now, please give these kernels a try: sudo dnf update --enablerepo updates-testing 'kernel*'
I'm afraid I still have the issue with kernel-5.14.16-301.fc35 -- Journal begins at Tue 2021-05-11 13:07:27 BST, ends at Fri 2021-11-05 08:14:32 GMT. -- Nov 05 07:57:17 tyr.starground.co.uk kernel: microcode: microcode updated early to revision 0xea, date = 2021-01-05 Nov 05 07:57:17 tyr.starground.co.uk kernel: Linux version 5.14.16-301.fc35.x86_64 (mockbuild.fedoraproject.org) (gcc (GCC) 11.2.1 20210728 (Red Hat 11.2.1-1), GNU ld version 2.37-10.fc35) #1 SMP Wed Nov 3 13:55:42 UTC 2021 Nov 05 07:57:17 tyr.starground.co.uk kernel: Command line: BOOT_IMAGE=(hd2,gpt2)/vmlinuz-5.14.16-301.fc35.x86_64 root=UUID=5cbb8d95-2001-470e-8c03-1c77b44d0b59 ro resume=UUID=6abb24af-8338-4a6b-b26e-323ea6b06033 rhgb quiet ... ... ov 05 07:57:23 tyr.starground.co.uk kernel: usb 1-4.2: New USB device found, idVendor=046d, idProduct=c52b, bcdDevice=12.11 Nov 05 07:57:23 tyr.starground.co.uk kernel: usb 1-4.2: New USB device strings: Mfr=1, Product=2, SerialNumber=0 Nov 05 07:57:23 tyr.starground.co.uk kernel: usb 1-4.2: Product: USB Receiver Nov 05 07:57:23 tyr.starground.co.uk kernel: usb 1-4.2: Manufacturer: Logitech Nov 05 07:57:23 tyr.starground.co.uk kernel: input: Logitech USB Receiver as /devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4.2/1-4.2:1.0/0003:046D:C52B.0004/input/input10 Nov 05 07:57:23 tyr.starground.co.uk kernel: hid-generic 0003:046D:C52B.0004: input,hidraw3: USB HID v1.11 Keyboard [Logitech USB Receiver] on usb-0000:00:14.0-4.2/input0 Nov 05 07:57:23 tyr.starground.co.uk kernel: input: Logitech USB Receiver Mouse as /devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4.2/1-4.2:1.1/0003:046D:C52B.0005/input/input11 Nov 05 07:57:23 tyr.starground.co.uk kernel: input: Logitech USB Receiver Consumer Control as /devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4.2/1-4.2:1.1/0003:046D:C52B.0005/input/input12 Nov 05 07:57:24 tyr.starground.co.uk kernel: input: Logitech USB Receiver System Control as /devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4.2/1-4.2:1.1/0003:046D:C52B.0005/input/input13 Nov 05 07:57:24 tyr.starground.co.uk kernel: hid-generic 0003:046D:C52B.0005: input,hiddev98,hidraw4: USB HID v1.11 Mouse [Logitech USB Receiver] on usb-0000:00:14.0-4.2/input1 Nov 05 07:57:24 tyr.starground.co.uk kernel: hid-generic 0003:046D:C52B.0006: hiddev99,hidraw5: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:00:14.0-4.2/input2 Nov 05 07:57:24 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 9 using xhci_hcd Nov 05 07:57:24 tyr.starground.co.uk kernel: hid-generic 0003:051D:0002.0007: hiddev100,hidraw6: USB HID v1.10 Device [American Power Conversion Back-UPS RS 1500 FW:8.g9 .I USB FW:g9 ] on usb-0000:00:14.0-7/input0 Nov 05 07:57:29 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110 Nov 05 07:57:45 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110 Nov 05 07:57:45 tyr.starground.co.uk kernel: usb 1-10: new full-speed USB device number 10 using xhci_hcd Nov 05 07:57:45 tyr.starground.co.uk kernel: usb 1-10: New USB device found, idVendor=8087, idProduct=0a2b, bcdDevice= 0.01 Nov 05 07:57:45 tyr.starground.co.uk kernel: usb 1-10: New USB device strings: Mfr=0, Product=0, SerialNumber=0 Nov 05 07:57:45 tyr.starground.co.uk rngd[277]: [rtlsdr]: Initialization Failed Nov 05 07:57:45 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 11 using xhci_hcd Nov 05 07:57:50 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110 Nov 05 07:58:06 tyr.starground.co.uk kernel: usb 1-4.4: device descriptor read/64, error -110 Nov 05 07:58:06 tyr.starground.co.uk kernel: usb 1-4-port4: attempt power cycle Nov 05 07:58:07 tyr.starground.co.uk kernel: usb 1-4.4: new high-speed USB device number 12 using xhci_hcd Nov 05 07:58:17 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: Abort failed to stop command ring: -110 Nov 05 07:58:17 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead Nov 05 07:58:17 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: HC died; cleaning up Nov 05 07:58:17 tyr.starground.co.uk kernel: xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command Nov 05 07:58:17 tyr.starground.co.uk kernel: usb 1-2: USB disconnect, device number 2 Nov 05 07:58:17 tyr.starground.co.uk kernel: usb 2-4: USB disconnect, device number 2 Nov 05 07:58:17 tyr.starground.co.uk kernel: usb 1-3: USB disconnect, device number 3 Nov 05 07:58:17 tyr.starground.co.uk kernel: usb 1-4: USB disconnect, device number 4 Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-4.4: device not accepting address 12, error -108 Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-4-port4: couldn't allocate usb_device Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-4.1: USB disconnect, device number 6 Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-4.2: USB disconnect, device number 8 Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-5: USB disconnect, device number 5 Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-7: USB disconnect, device number 7 Nov 05 07:58:18 tyr.starground.co.uk kernel: usb 1-10: USB disconnect, device number 10 Nov 05 07:58:18 tyr.starground.co.uk systemd[1]: Finished dracut initqueue hook. ... ...
Thanks for the trace and dmesg, I'm looking into those. If you could take a couple more traces of failing cases just to check if they have the same pattern of "enable slot A, set Address for slot B" before failing, it would help a lot. Details: Rootcause not found yet I'm currently suspecting two xhci slots are in a "default" state at the same time, meaning two USB devices are waiting with address 0 for host to give them an unique address. xHC can't handle this as stated in xhci 4.5.3.4 Normally we first enable a slot, then later set an address for the device behind it, and move to the next slot. Trace shows it hangs after one roothub thread enables slot 6, and the second roothub thread sets the address of slot 5, a different slot right after that. The address0_mutex taken right after enabling a slot is supposed to prevent two slots from being in default state at the same time. Maybe there is some small window during a retry that leave a slot in default state and unlocks the mutex? Need to check that. This could make sense as after the patch both roothubs are created almost simultaneously, causing a higher race risk during enumeration.
@mathias.nyman.com Do you want me to provide more traces from that same machine or are you looking for data from other machines? I only have one computer with this problem, unsure if that data will be helpful or not. Thx, brian
I've been encountering same issue on a Fresco Logic FL1100 USB 3.0 Host Controller [1b73:1100] (rev 10), though it usually doesn't hit until the bus has been running for several hours; there will typically be at least one failure within 24-48 hours. Removing the PCI device and rescanning the PCI bus will restore function for another several hours when the same bug will hit again. I was able to get a debug log the last time this occurred. Attaching a 2-minute dmesg snippet so there's some space before and after; the error occurs at 03:02:42 on line 3898. Unfortunately, the trace has since cycled out.
Created attachment 1840495 [details] dmesg xhci TRB error 4
I have a MacBook Pro late 2014 and I am seeing this issue with Kernel 5.14.14 and later.
@Brian, Same machine is fine @Jeff, Thanks, your case seems to be related to resume. Same theory that several xHC slots end up in default state at the same time still fits. Log show a reset-resume caes, which will also reset the device and set xHC slot to default state. The TRB error also fits. Specs say xHC should respond with a TRB error in case a SET_ADDRESS control transfer is submitted while the host is currently submitting a SET ADDRESS request for another device as part of a Address device command. (see xhci 4.5.4.1) Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Transfer error for slot 5 ep 0 on endpoint Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: TRB error 4, halted endpoint index = 0 Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: // Ding dong! Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Ignoring reset ep completion code of 1 Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Removing canceled TD starting at 0x118ab8300 (dma). Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Set TR Deq ptr 0x118ab8330, cycle 0 Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: // Ding dong! Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Successful Set TR Deq Ptr cmd, deq = @118ab8330 Nov 06 03:02:42 kernel: xhci_hcd 0000:07:00.0: Giveback URB 00000000740a70e1, len = 0, expected = 2, status = -71 Nov 06 03:02:42 kernel: usb 6-4: retry with reset-resume
Created attachment 1840767 [details] dmesg from 5.14.16-201
Created attachment 1840768 [details] trace from 5.14.16-201
@mathias.nyman.com I've attached dmesg/trace logs from 5.14.16-201. FWIW this kernel did not fix the problem on my machine. Thanks!
Created attachment 1841103 [details] testpatch keep address0 mutex over retries Can you give this testpatch a try? It tries to avoid letting several devices enter xhci slot default state during hub_port_init retries. It also adds some extra prints to log, mostly just to see it applied successfully. Thanks
@mathias.nyman.com No go on that patch. I rebuilt the kernel with it applies and rebooted to the same symptom. I'll attach dmesg and trace log from this boot in case those are useful to you. brian
Created attachment 1841207 [details] dmesg log (w/patch)
Created attachment 1841208 [details] trace log (w/patch)
Thanks, I think something went wrong when applying the patch. dmesg log should show "GREPME port_init_try" if patch is applied.
That's entirely possible. It's been well over a decade since I last built a kernel. Are there official Fedora instructions I can follow or (ideally) something like Copr that can do the heavy lifting?
The easiest way would be fedpkg clone kernel cd kernel fedpkg switch-branch f35 cat /path/to/patch > linux-kernel-test.patch fedpkg srpm koji build --scratch --arch=x86_64 f35 /path/to/srpm
@
Justin - thx. Do I need to do anything special for fedpkg access to kernel? I've never used fedpkg before but when I try to clone I see: > fedpkg clone kernel Cloning into 'kernel'... Connection closed by 38.145.60.17 port 22 fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists. Could not execute clone: Failed to execute command.
(In reply to Brian from comment #28) > Justin - thx. Do I need to do anything special for fedpkg access to kernel? Sorry, you can clone it with 'fedpkg --anonymous clone kernel'
I can not boot the HP MicroServer Gen10 with kernels above 5.14.14, most of the time it just kernel panics during mode setting remaining in the black screen - nothing is visible if you attach the monitor after crash. There are many USB devices connect to the server, but I am not sure which can cause this if any. As of now the only kernel installed during the system-upgrade is 5.14.16 this is pretty dangerous for upgrade process when started from older kernel..
Justin - thanks, that worked. Well, 'fedpkg clone --anonymous kernel'. And using koji made this so much simpler so thanks for those instructions. @mathias.nyman.com that patch worked! I'm writing this from a custom build of 5.14.18-200 with your patch applied: > journalctl -b -1 --no-hostname -k | grep GREPME Nov 13 08:11:53 kernel: usb usb1-port3: GREPME port_init_try 0 Nov 13 08:11:53 kernel: usb usb2-port4: GREPME port_init_try 0 Nov 13 08:11:53 kernel: usb usb1-port9: GREPME port_init_try 0 Nov 13 08:11:53 kernel: usb 1-3-port1: GREPME port_init_try 0 Nov 13 08:11:53 kernel: usb usb2-port3: GREPME port_init_try 0 Nov 13 08:12:04 kernel: usb usb2-port3: GREPME port_init_try 1 Nov 13 08:12:04 kernel: usb 1-3-port4: GREPME port_init_try 0 No issues so far with keyboard or mouse. Let me know if you want to see the trace log or dmesg and I can attach those files but from my perspective I'd love to see this patch merged into f34 and f35. Thanks!
Thanks, I'll clean up the patch a bit (remove extra printks) and send it upstream (linux-usb) for review
Thanks. Hopefully it's gets merged quickly! Is there any way to track when the patch lands in F34 and F35? I'm blocked from upgrading kernels until then. Thanks!
@mathias.nyman.com would you have an ETA on this patch? I'm really grateful you have fixed it, but it would be great to know when it might hit a mainstream kernel? Sorry, don't wish to sound ungrateful for you efforts, just want to plan for upgrades.
Patch was/is on its way upstream, but it has some locking issues Currently in usb-linus: 6ae6dc22d2d1 usb: hub: Fix usb enumeration issue due to address0 race I need to fix that. Could you test a new version of the patch that has fixed locking. (apply without previous patch)
Created attachment 1842979 [details] testpatch v2 keep address0 mutex over retries, fixed locking testpatch v2, keep address0 mutex over retries, fixed locking issues Adds extra "GREPMEv2" print to log for debugging purposes. Thanks
Mathias, The new attachment boots OK for me when applied to 5.15.4. > cat /proc/version Linux version 5.15.4-100.fc34.x86_64 (mockbuild.fedoraproject.org) (gcc (GCC) 11.2.1 20210728 (Red Hat 11.2.1-1), GNU ld version 2.35.2-6.fc34) #1 SMP Mon Nov 22 14:36:34 UTC 2021 > sudo journalctl --no-hostname -k | grep GREPMEv2 Nov 22 13:11:05 kernel: usb usb1-port3: GREPMEv2 port_init_try 0 Nov 22 13:11:05 kernel: usb usb2-port3: GREPMEv2 port_init_try 0 Nov 22 13:11:05 kernel: usb usb1-port9: GREPMEv2 port_init_try 0 Nov 22 13:11:05 kernel: usb usb2-port4: GREPMEv2 port_init_try 0 Nov 22 13:11:05 kernel: usb 1-3-port1: GREPMEv2 port_init_try 0 Nov 22 13:11:05 kernel: usb 1-3-port4: GREPMEv2 port_init_try 0 Thanks, brian
Thanks for testing, I'll submit a patch upstream that fixes the locking issues of the already upstreamed first patch. Final result will be same as the v2 patch, (without the extra GREPMEv2 printing) -Mathias
Thanks for you work on this, Mathias! Still curious if anyone CCed on this ticket can help clarify how to track when this patch will find its way into F34 and F35? I'm currently on F34 and unable to upgrade kernels (or go to F35) until this patch makes its way back into Fedora somewhere but I'm not even sure how to know when that will be safe for my system. Thanks in advance!
(In reply to Brian from comment #39) > Still curious if anyone CCed on this ticket can help clarify how to track when this patch will find its way into F34 and F35? What you can do is watch here: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/?h=linux-5.15.y For Matthias' patch to show up and once it has shown up there in a new 5.15.y release, then wait for that release to become available in fedora-updates-testing. You can check the latest kernel builds available for Fedora here: https://koji.fedoraproject.org/koji/packageinfo?packageID=8
Thanks Hans!
I saw that the patch got merged into 5.15.6 earlier and just successfully tested build 5.15.6-100.fc34 (https://koji.fedoraproject.org/koji/buildinfo?buildID=1861751).
Sad to report that build 5.15.6-200.fc35.x86_64 (https://koji.fedoraproject.org/koji/buildinfo?buildID=1861750) doesn't resolve the issue here. Logitech optical mouse lights up at boot but turns off after a timeout, and my external CD/DVD drives are nonfunctional as well. (My HP LaserJet printer also isn't recognized, although that may predate the bug.)
5.15.6-200.fc35.x86_64 also fails for me with the same (orginal) symptoms.
Hmm, so it seems that we still have an issue somewhere, starting with 5.14.14, at least in Dave Hatton's case (the original reporter of this bug) it has been reported that 5.14.13 does not have this problem Peter and Dave, can you please provide the output of: lspci -nn | grep USB So that we know which vendor/model controller we are dealing with here? Between 5.14.13 an 5.14.14 there are only a few USB commits, and thus only a few likely candidates: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=2b6c75bf92027ebf79f1cb579b5ff3ac5d4f218d https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=e54abefe703ab7c4e5983e889babd1447738ca42 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=fa3093d37cceb5d126ef8a4c12a9fbf373052915 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=eacfdec26656dc81a04ca1460cddb98aa4da731a https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=db96c1d87c953fa79e026b1e3cb40fff5b413edd As I mentioned in bug 2019576 I believe that this one: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.14.y&id=e54abefe703ab7c4e5983e889babd1447738ca42 Is a likely candidate for causing this issue, based on me assuming that some controllers may not like writing only half of the 64 bit register. Some of the others only apply to certain models xHCI controllers. So if one of you is using one of those models then they might explain the problem too. I have started a test 5.15.6 kernel-build with commit e54abefe703ab7c4e5983e889babd1447738ca42 reverted here: https://koji.fedoraproject.org/koji/taskinfo?taskID=79510050 Note this is still building ATM, this takes a couple of hours. Pete, Dave, please give this build a try and let us know if it helps (and also please provide "lspci -nn | grep USB" output). Here are some generic instructions for installing a kernel directly from koji: https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt
Hi Hans, requested output. [root@tyr ~]# lspci -nn | grep USB 00:14.0 USB controller [0c03]: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller [8086:a2af]
I am not the one on request list, but anyway this is the easiest thing I can do: # lspci -nn | grep USB 00:10.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller [1022:7914] (rev 20) 00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller [1022:7908] (rev 49)
As requested. # lspci -nn | grep USB 00:14.0 USB controller [0c03]: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller [8086:a12f] (rev 31)
Hans, Thank you! I'm up and running on the test kernel you built ... # cat /proc/cmdline BOOT_IMAGE=(hd2,gpt2)/vmlinuz-5.15.6-200.bz2019788.fc35.x86_64 root=UUID=5cbb8d95-2001-470e-8c03-1c77b44d0b59 ro resume=UUID=6abb24af-8338-4a6b-b26e-323ea6b06033 rhgb quiet
Hi Hans, A follow up ... whilst this kernel build is much better, my logitech wireless keyboard/mouse are now not recovering correctly following suspension overnight. It took a number of unplug/plug cycles of the dongle before they were recognized. Dave
I can confirm that reverting e54abefe703ab7c4e5983e889babd1447738ca42 does solve my issue of XHCI dying on boot. I'm on a MacBook Pro Mid 2014 (MacBookPro11,1), running Arch Linux. # lspci -nn | grep USB 00:14.0 USB controller [0c03]: Intel Corporation 8 Series USB xHCI HC [8086:9c31] (rev 04)
Ok, so now we have 2 confirmations that commit e54abefe703ab7c4e5983e889babd1447738ca42 (5.14.y hash) aka commit ff0e50d3564f33b7f4b35cadeabd951d66cfc570 (Linus' master branch hash) is causing problems on several Intel xHCI controllers. I believe that the problem here likely is that the data-bus to the xHCI is actually a 64 bits data bus and the 32 bit write introduced by this commit is causing a read + modify + write over the 64 bit data bus to change just the 32 bit. This actually makes the chances of hitting the problem the commit triies to fix larger on these controllers since now we do: sw: read 32 bit hw: read 64 bit, return 32 bit sw: modify it sw: write 32 bit hw: read 64 bit hw: modify 32 bit of 64 bit wor5d hw: write 64 bit Note this is just a theory, but it seems plausible to me. I will start a discussion about this upstream.
Upstream discussion about this: https://lore.kernel.org/linux-usb/0a4d723c-221f-a42a-434c-34283a4cdab7@redhat.com/
OK, this kernel fixed boot problem also for me i.e. it is not just Intel XHCI.
(In reply to Adam Pribyl from comment #54) > OK, this kernel fixed boot problem also for me i.e. it is not just Intel > XHCI. Pardon me, I jumped directly from 5.14.13 to 5.15.6-200.bz2019788, but then I tried also a 5.15.6 from distro, and it works for me too, so this was something different.
A proper fix (instead of a revert) for commit e54abefe703ab7 has now landed upstream: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=09f736aa95476631227d2dc0e6b9aeee1ad7ed58 This is in the just released 5.16-rc4 and should show up in a newer 5.15.y release soon-ish too. I expect 5.16-rc4 to show up in koji: https://koji.fedoraproject.org/koji/packageinfo?packageID=8 Pretty soon, so if you are interested in this fix, you can give 5.16-rc4 a try once it has been build.
5.16-rc4 works for me.
5.16-rc4 works for me, but as mentioned my logitech wireless keyboard and mouse no longer function correctly following a deep suspend. I've raised ticket 2029140 to cover this.
My peripherals are back on the stable 5.15.7, thank you.
To follow up. I recently carried out a fresh installation of Fedora35 on the PC having this issue. I'm now running kernel 5.15.18-200.fc35.x86_64. I'm no longer experiencing the problem with USB devices failing at boot time. I'm not sure if it is the later kernel or the fresh installation that resolved this. I'd be happy to close this issue now.
The upstream kernel recently has seen a bunch of fixes related to race conditions in the xHCI driver which trigger duing device enumeration. So it is likely that it is the move to 5.15.18 fixed this. Thank you for letting us know that this is fixed, I'll close this now.