1. Please describe the problem:
//Edit: You can see a debugging summary in comment 23.
Thinkpad T480s with Thinkpad Thunderbolt 3 Dock can no longer resume from sleep since kernel-6.4.0-0.rc0.20230428git33afd4b76393.7.fc39. There's only a black screen, the laptop is not responsive, ssh can't connect, and it has to be force-rebooted.
//Edit: Turns out the resume finishes OK, but you have to wait for ~70 seconds.
** This only happens if the dock is attached. When not attached, resume works correctly. **

2. What is the Version-Release number of the kernel:
Latest tested stable kernel: kernel-6.4.8-200.fc38.x86_64
Latest tested unstable kernel: kernel-6.5.0-0.rc5.20230808git14f9643dc90a.37.fc39
The resume is 100% broken in both of them.

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
I narrowed the issue to a **single day** of kernel changes:
kernel-6.4.0-0.rc0.20230427git6e98b09da931.5.fc39 - works 100%
kernel-6.4.0-0.rc0.20230428git33afd4b76393.7.fc39 - broken 100%
(The latest kernel in the 6.3 series, kernel-6.3.13-200.fc38, works 100% as well.)

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:
Trivial to reproduce, and completely reliable. Have the dock connected and suspend the laptop, then try to resume it. With the older kernels it works 100% of the time; with the newer kernels it breaks 100% of the time.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:
Yes, see 2).

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No.

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
I can't get any kernel log after a failed resume. The logs are not written to the disk. The previous boot journal ends with a suspend message.

@jforbes Since I narrowed down the kernel changes to a single day, I assume there's a high chance to get this fixed. I can try to bisect even individual commits, if required. But communication with kernel devs is my biggest worry. Can you try to reach out to the best person on my behalf, or at least advise me how to do that (who, how)? Thanks a lot.
Created attachment 1982541 [details] lspci
> @jforbes Since I narrowed down the kernel changes to a single
> day, I assume there's a high chance to get this fixed. I can try to bisect
> even individual commits, if required. But communication with kernel devs is
> my biggest worry. Can you try to reach out to the best person on my behalf,
> or at least advise me how to do that (who, how)? Thanks a lot.

While helpful, a single day in the merge window (rc0) is not a trivial number of commits. That day in particular was over 2700. If it is strictly in the thunderbolt code, there are few (4). Unfortunately thunderbolt interacts with USB and PCI as well, bringing the total commits closer to 300. A bisect would be helpful. If you don't have the time to do so, I can reach out, but if you are willing, Thunderbolt tends to react to bugs on bugzilla.kernel.org and they have a bot which interfaces with linux-usb.org. Select USB as the component if filing a bugzilla there. Either way, let me know what you do here. We can either track the upstream to get a fix backported sooner, or I can act as an intermediary.
> That day in particular was over 2700.

Ouch, I had no idea. I'll try to do the bisect and then file a bug in bugzilla.kernel.org. I found fedbisect [1], but it hasn't been touched in 6 years. Is it still the tool for this job, or is there some other fedora-specific tool/guide elsewhere? Thanks a lot for advice.

[1] https://pagure.io/fedbisect
That would probably be a poor tool for the job. I highly recommend doing a bisect the upstream way as it is massively faster than building rpms for each one. We spend more time doing packaging bits than we do building the actual kernel. https://docs.kernel.org/admin-guide/bug-bisect.html has a quick guide. Your starting good is 6e98b09da931 and your starting bad is 33afd4b76393.
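To give a feel for why bisecting is practical even with thousands of commits, here is a minimal sketch of the underlying binary search. It is an illustration only: it assumes a simple linear history (real kernel history is a graph, and git bisect chooses its test points differently), and the names `bisect_first_bad` and `is_bad` are hypothetical, standing in for "build this commit, suspend, and see whether resume works".

```python
def bisect_first_bad(commits, is_bad):
    """Return (first_bad_commit, builds_tested).

    Assumes `commits` is ordered oldest to newest, that commits[0] is
    known good and that commits[-1] is known bad.
    """
    good, bad = 0, len(commits) - 1
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1  # one kernel build plus one suspend/resume test
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad], tested


if __name__ == "__main__":
    # Toy history: 2700 commits after the known-good one, numbered 1..2700.
    history = list(range(2701))
    first_bad, tested = bisect_first_bad(history, lambda c: c >= 1234)
    print(first_bad, tested)
```

Under these assumptions, a range of about 2700 commits needs at most 12 build-and-test rounds, since each round halves the remaining range.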
Justin, I finally bisected this to the following commit. I verified that it fails to resume in 5/5 attempts, and the last tested good commit successfully resumes in 5/5 attempts. So I'm quite certain this is the source of the regression. It's a change in drivers/pci/pci-driver.c. Should I still report it upstream according to your instructions in comment 2, or (since this is in PCI and not Thunderbolt) report it upstream differently? Thanks!

e8b908146d44310473e43b3382eca126e12d279c is the first bad commit
commit e8b908146d44310473e43b3382eca126e12d279c
Author: Mika Westerberg <mika.westerberg.com>
Date:   Tue Apr 4 08:27:13 2023 +0300

    PCI/PM: Increase wait time after resume

    PCIe r6.0 sec 6.6.1 prescribes that a device must be able to respond to
    config requests within 1.0 s (PCI_RESET_WAIT) after exiting conventional
    reset and this same delay is prescribed when coming out of D3cold (as that
    involves reset too).

    A device that requires more than 1 second to initialize after reset may
    respond to config requests with Request Retry Status completions (sec
    2.3.1), and we accommodate that in Linux with a 60 second cap
    (PCIE_RESET_READY_POLL_MS).

    Previously we waited up to PCIE_RESET_READY_POLL_MS only in the reset code
    path, not in the resume path. However, a device has surfaced, namely Intel
    Titan Ridge xHCI, which requires a longer delay also in the resume code
    path.

    Make the resume code path to use this same extended delay as the reset
    path.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=216728
    Link: https://lore.kernel.org/r/20230404052714.51315-2-mika.westerberg@linux.intel.com
    Reported-by: Chris Chiu <chris.chiu>
    Signed-off-by: Mika Westerberg <mika.westerberg.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas>
    Cc: Lukas Wunner <lukas>

 drivers/pci/pci-driver.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
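The effect of raising the cap from 1 second to 60 seconds can be illustrated with a toy model. This is not the kernel code: the two constants come from the commit message above, but the function name `simulated_wait_ms` and the doubling poll interval starting at 1 ms are assumptions made purely for illustration.

```python
PCI_RESET_WAIT_MS = 1_000          # 1.0 s, per the commit message
PCIE_RESET_READY_POLL_MS = 60_000  # 60 s cap, per the commit message


def simulated_wait_ms(ready_after_ms, timeout_ms):
    """Return (total_wait_ms, device_ready) for a toy polling loop.

    Assumption: the poll interval starts at 1 ms and doubles after every
    unsuccessful poll, and polling stops once the next interval would
    exceed `timeout_ms`.
    """
    waited = 0
    delay = 1
    while waited < ready_after_ms:  # device is still not responding
        if delay > timeout_ms:
            return waited, False    # give up
        waited += delay
        delay *= 2
    return waited, True


if __name__ == "__main__":
    never = float("inf")  # a device that never comes back
    print(simulated_wait_ms(never, PCI_RESET_WAIT_MS))
    print(simulated_wait_ms(never, PCIE_RESET_READY_POLL_MS))
```

In this model, a device that never answers costs roughly one second under the old cap and roughly a minute under the new one, which is the same order of magnitude as the resume delay described in this report.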
Created attachment 1983351 [details] git bisect log
I also tested this with a different laptop, Thinkpad P1 gen 3. It resumes just fine with that dock. So this is not a general issue, but there's some connection between the dock and T480s which makes it exhibit the problem.
Nice work finding the commit! I would likely email linux-pci.org and CC the Signed-off-by and Reported-by emails on that commit. Explain the bisection and the symptoms.
I sent the email report (with some additional debugging details) to the linux-pci kernel list (with the regressions list CCed, per upstream instructions):
https://lore.kernel.org/linux-pci/CA+cBOTeWrsTyANjLZQ=bGoBQ_yOkkV1juyRvJq-C8GOrbW6t9Q@mail.gmail.com/T/#u
https://lore.kernel.org/regressions/CA+cBOTeWrsTyANjLZQ=bGoBQ_yOkkV1juyRvJq-C8GOrbW6t9Q@mail.gmail.com/T/#u

I also updated the kernel bugzilla ticket which the regression commit linked to:
https://bugzilla.kernel.org/show_bug.cgi?id=216728

Thanks for the pointers, Justin.
Turns out the resume is not broken, but delayed by an additional 60 seconds. I'll continue the discussion on the upstream kernel lists, but I'll add the required attachments here so that I can link to them from the kernel lists (I assume attaching them directly to the email is not appreciated).
Created attachment 1984636 [details] dmesg after delayed resume
Created attachment 1984637 [details] journal after delayed resume
Created attachment 1984638 [details] lspci -vv before suspend
Created attachment 1984639 [details] lspci -vv after delayed resume
Created attachment 1984726 [details] dmesg after fast resume
Created attachment 1984727 [details] fwupdmgr get-devices output
Created attachment 1984728 [details] fwupdmgr get-devices output
Created attachment 1984785 [details] dmesg with TB assist mode
Created attachment 1984786 [details] dmesg with devices unavailable
Created attachment 1984802 [details] acpidump
Created attachment 1984803 [details] dmesg with thunderbolt.dyndbg=+p
Created attachment 1985262 [details] dmesg with TB user security level
After extensive debugging on the upstream kernel lists (see comment 9), this is most probably related to the laptop's firmware. When the Thunderbolt security level is set to Secure Connection, it fails to properly reconnect to the dock after resume. With the latest kernels, that causes a ~60 second delay. When the security is lowered to User Authorization, it works fine. Kernel developers suggested contacting Lenovo and asking them to look into this and possibly publish a Thunderbolt firmware update.

@mpearson Hi Mark, is this something you'd be interested in looking into? I don't know what Lenovo's support plans are for Thinkpad T480s, but it still seems to receive updated firmware.
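For anyone who wants to check which security level the kernel sees, here is a minimal sketch. It reads the `security` attribute that the Linux thunderbolt driver exposes per domain under /sys/bus/thunderbolt/devices. The function name is hypothetical, and the mapping of the sysfs values `secure` and `user` to the BIOS labels "Secure Connection" and "User Authorization" is my assumption.

```python
from pathlib import Path


def thunderbolt_security_levels(sysfs_root="/sys/bus/thunderbolt/devices"):
    """Return {domain_name: security_level} for each Thunderbolt domain.

    Returns an empty dict if no domain is present, for example when the
    thunderbolt driver is not loaded.
    """
    levels = {}
    for domain in sorted(Path(sysfs_root).glob("domain*")):
        attr = domain / "security"
        if attr.is_file():
            levels[domain.name] = attr.read_text().strip()
    return levels


if __name__ == "__main__":
    for name, level in thunderbolt_security_levels().items():
        print(f"{name}: {level}")
```

The `sysfs_root` parameter exists only so the function can be pointed at a test directory; on a real system the default path should be the right one.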
Hi Kamil, Ack - I've created internal ticket LO-2616 for tracking. We'll confirm we can reproduce the issue and follow up with the FW team and get their feedback. Mark
Created attachment 1990815 [details] dmesg with TB secure security level and Wake by TB enabled
Mark, I found out that this problem only happens when I have the "Wake by Thunderbolt 3" option disabled (which is a non-default value). See more here:
https://lore.kernel.org/linux-pci/CA+cBOTeWrsTyANjLZQ=bGoBQ_yOkkV1juyRvJq-C8GOrbW6t9Q@mail.gmail.com/T/#ma7b6e1740c042aa624cbf0cef63cd887ae5fd90d
Thanks. Finally got feedback from the FW team and they are saying this is by design and that the Wake by TB setting should be enabled. Not sure there is much else we can do here - let me know if we're missing anything important. Mark
Thanks a lot for your update, Mark. Sigh, this was a very long exercise regarding a BIOS option that could've been either not present (since the system doesn't work correctly when it is set to a non-default value) or better explained :-/ Anyway, I'm going to close this bug - it only happens with non-default BIOS settings and it seems that there's no further action related to this. Thanks for looking into it.