1. Please describe the problem:
Thinkpad T480s with Thinkpad Thunderbolt 3 Dock can no longer resume from sleep since kernel-6.4.0-0.rc0.20230428git33afd4b76393.7.fc39. There's only a black screen, the laptop is unresponsive, ssh can't connect, and it has to be force-rebooted.
** This only happens if the dock is attached. When not attached, resume works correctly. **

2. What is the Version-Release number of the kernel:
Latest tested stable kernel: kernel-6.4.8-200.fc38.x86_64
Latest tested unstable kernel: kernel-6.5.0-0.rc5.20230808git14f9643dc90a.37.fc39
Resume is 100% broken in both.

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
I narrowed the issue to a **single day** of kernel changes:
kernel-6.4.0-0.rc0.20230427git6e98b09da931.5.fc39 - works 100%
kernel-6.4.0-0.rc0.20230428git33afd4b76393.7.fc39 - broken 100%
(The latest kernel in the 6.3 series, kernel-6.3.13-200.fc38, works 100% as well.)

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:
Trivial to reproduce, and completely reliable. Have the dock connected and suspend the laptop, then try to resume it. With the older kernels it works 100% of the time; with the newer kernels it breaks 100% of the time.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:
Yes, see 2).

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No.

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
I can't get any kernel log after a failed resume. The logs are not written to the disk. The previous boot's journal ends with a suspend message.
@jforbes Since I narrowed down the kernel changes to a single day, I assume there's a high chance to get this fixed. I can try to bisect even individual commits, if required. But communication with kernel devs is my biggest worry. Can you try to reach out to the best person on my behalf, or at least advise me how to do that (who, how)? Thanks a lot.
Created attachment 1982541: lspci
> @jforbes Since I narrowed down the kernel changes to a single day, I assume there's a high chance to get this fixed. I can try to bisect even individual commits, if required. But communication with kernel devs is my biggest worry. Can you try to reach out to the best person on my behalf, or at least advise me how to do that (who, how)? Thanks a lot.

While helpful, a single day in the merge window (rc0) is not a trivial number of commits. That day in particular was over 2700. If it is strictly in the Thunderbolt code, there are few (4). Unfortunately, Thunderbolt interacts with USB and PCI as well, bringing the total closer to 300 commits. A bisect would be helpful. If you don't have the time to do so, I can reach out, but if you are willing, the Thunderbolt maintainers tend to respond to bugs on bugzilla.kernel.org, and they have a bot which interfaces with linux-usb.org. Select USB as the component if filing a bugzilla there. Either way, let me know what you do here. We can either track the upstream bug to get a fix backported sooner, or I can act as an intermediary.
> That day in particular was over 2700.

Ouch, I had no idea. I'll try to do the bisect and then file a bug in bugzilla.kernel.org. I found fedbisect [1], but it hasn't been touched in 6 years. Is it still the right tool for this job, or is there some other Fedora-specific tool/guide elsewhere? Thanks a lot for the advice.

[1] https://pagure.io/fedbisect
That would probably be a poor tool for the job. I highly recommend doing a bisect the upstream way, as it is massively faster than building rpms for each one. We spend more time doing packaging bits than we do building the actual kernel. https://docs.kernel.org/admin-guide/bug-bisect.html has a quick guide. Your starting good is 6e98b09da931 and your starting bad is 33afd4b76393.
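For reference, a minimal sketch of what the upstream bisect could look like with the two commits named above. It assumes a clone of the mainline kernel tree in the current directory. The helper names `bisect_steps` and `start_kernel_bisect` are made up for illustration and are not part of git or any Fedora tooling, and the step count is only a rough ceil(log2(n)) estimate, not git's own figure.

```shell
#!/bin/sh
# Rough estimate of how many kernels a bisect has to test for n candidate
# commits: each step halves the range, so this is about ceil(log2(n)).
# (Hypothetical helper, for illustration only.)
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# Start a bisect between the last good and first bad snapshot from this bug.
# Run inside a clone of the mainline kernel tree. (Hypothetical wrapper.)
start_kernel_bisect() {
    git bisect start
    git bisect bad 33afd4b76393
    git bisect good 6e98b09da931
}

# After building, booting and testing suspend/resume on each kernel that git
# checks out, mark the result with one of:
#   git bisect good
#   git bisect bad
# When git names the first bad commit, save the record and clean up:
#   git bisect log > bisect.log
#   git bisect reset

echo "about $(bisect_steps 2700) test kernels for ~2700 commits"
```

Calling `start_kernel_bisect` in the kernel tree begins the session; every later step is a build, a boot with the dock attached, a suspend/resume attempt, and one `git bisect good` or `git bisect bad`.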
Justin, I finally bisected this to be caused by the following commit. I verified that it fails to resume in 5/5 attempts, and the last tested good commit successfully resumes in 5/5 attempts. So I'm quite certain this is the source of regression. It's a change in drivers/pci/pci-driver.c. Should I still report it upstream according to your instructions in comment 2, or (since this is in PCI and not Thunderbolt) report it upstream differently? Thanks!

e8b908146d44310473e43b3382eca126e12d279c is the first bad commit
commit e8b908146d44310473e43b3382eca126e12d279c
Author: Mika Westerberg <mika.westerberg.com>
Date:   Tue Apr 4 08:27:13 2023 +0300

    PCI/PM: Increase wait time after resume

    PCIe r6.0 sec 6.6.1 prescribes that a device must be able to respond to config requests within 1.0 s (PCI_RESET_WAIT) after exiting conventional reset and this same delay is prescribed when coming out of D3cold (as that involves reset too).

    A device that requires more than 1 second to initialize after reset may respond to config requests with Request Retry Status completions (sec 2.3.1), and we accommodate that in Linux with a 60 second cap (PCIE_RESET_READY_POLL_MS).

    Previously we waited up to PCIE_RESET_READY_POLL_MS only in the reset code path, not in the resume path. However, a device has surfaced, namely Intel Titan Ridge xHCI, which requires a longer delay also in the resume code path. Make the resume code path to use this same extended delay as the reset path.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=216728
    Link: https://lore.kernel.org/r/20230404052714.51315-2-mika.westerberg@linux.intel.com
    Reported-by: Chris Chiu <chris.chiu>
    Signed-off-by: Mika Westerberg <mika.westerberg.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas>
    Cc: Lukas Wunner <lukas>

 drivers/pci/pci-driver.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Created attachment 1983351: git bisect log
I also tested this with a different laptop, Thinkpad P1 gen 3. It resumes just fine with that dock. So this is not a general issue, but there's some connection between the dock and T480s which makes it exhibit the problem.
Nice work finding the commit! I would likely email linux-pci.org and CC the Signed-off-by and Reported-by emails on that commit. Explain the bisection and the symptoms.
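One way to collect the people to CC is to pull the trailer lines straight from the commit message. A small sketch, assuming a kernel git tree is available; `commit_contacts` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Print the Signed-off-by, Reported-by and Cc trailer lines found in a
# commit message read from stdin, with leading indentation removed.
# (Hypothetical helper, for illustration only.)
commit_contacts() {
    grep -E '^[[:space:]]*(Signed-off-by|Reported-by|Cc):' |
        sed -E 's/^[[:space:]]+//'
}

# Typical use inside the kernel tree (not executed here):
#   git show -s --format=%B e8b908146d44 | commit_contacts
```

The kernel tree also ships `scripts/get_maintainer.pl`, which can suggest maintainers and mailing lists for a file, for example with `-f drivers/pci/pci-driver.c`; its output may be a useful cross-check against the trailers.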