2230357 – resume with a Thunderbolt dock delayed 60 seconds with commit e8b908146d44 "PCI/PM: Increase wait time after resume"

Bug 2230357 - resume with a Thunderbolt dock delayed 60 seconds with commit e8b908146d44 "PCI/PM: Increase wait time after resume"

Summary: resume with a Thunderbolt dock delayed 60 seconds with commit e8b908146d44 "P...

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	38
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2184978
TreeView+	depends on / blocked

Reported:	2023-08-09 11:12 UTC by Kamil Páral
Modified:	2023-11-10 08:41 UTC (History)
CC List:	19 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2023-11-10 08:41:33 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
lspci (2.87 KB, text/plain) 2023-08-09 11:14 UTC, Kamil Páral	no flags	Details
git bisect log (2.38 KB, text/plain) 2023-08-15 07:22 UTC, Kamil Páral	no flags	Details
dmesg after delayed resume (124.09 KB, text/plain) 2023-08-22 16:28 UTC, Kamil Páral	no flags	Details
journal after delayed resume (346.77 KB, text/plain) 2023-08-22 16:28 UTC, Kamil Páral	no flags	Details
lspci -vv before suspend (92.78 KB, text/plain) 2023-08-22 16:29 UTC, Kamil Páral	no flags	Details
lspci -vv after delayed resume (92.68 KB, text/plain) 2023-08-22 16:29 UTC, Kamil Páral	no flags	Details
dmesg after fast resume (122.99 KB, text/plain) 2023-08-23 07:55 UTC, Kamil Páral	no flags	Details
fwupdmgr get-devices output (10.26 KB, text/plain) 2023-08-23 08:08 UTC, Kamil Páral	no flags	Details
fwupdmgr get-devices output (9.53 KB, text/plain) 2023-08-23 08:10 UTC, Kamil Páral	no flags	Details
dmesg with TB assist mode (115.01 KB, text/plain) 2023-08-23 12:58 UTC, Kamil Páral	no flags	Details
dmesg with devices unavailable (126.62 KB, text/plain) 2023-08-23 12:59 UTC, Kamil Páral	no flags	Details
acpidump (1.22 MB, text/plain) 2023-08-23 13:59 UTC, Kamil Páral	no flags	Details
dmesg with thunderbolt.dyndbg=+p (136.59 KB, text/plain) 2023-08-23 13:59 UTC, Kamil Páral	no flags	Details
dmesg with TB user security level (100.53 KB, text/plain) 2023-08-25 08:13 UTC, Kamil Páral	no flags	Details
dmesh with TB secure security level and Wake by TB enabled (100.32 KB, text/plain) 2023-09-27 12:57 UTC, Kamil Páral	no flags	Details
Show Obsolete (1) View All

Description Kamil Páral 2023-08-09 11:12:42 UTC

1. Please describe the problem:

//Edit: You can see a debugging summary in comment 23.

Thinkpad T480s with Thinkpad Thunderbolt 3 Dock can no longer resume from sleep since kernel-6.4.0-0.rc0.20230428git33afd4b76393.7.fc39. There's only black screen, the laptop is not responsive, ssh can't connect, it has to be force-rebooted. //Edit: Turns out the resume finishes OK, but you have to wait for ~70 seconds.

** This only happens if the dock is attached. When not attached, resume works correctly. **

2. What is the Version-Release number of the kernel:

Latest tested stable kernel: kernel-6.4.8-200.fc38.x86_64
Latest tested unstable kernel: kernel-6.5.0-0.rc5.20230808git14f9643dc90a.37.fc39

The resume is 100% broken in them.

3. Did it work previously in Fedora? If so, what kernel version did the issue
*first* appear? Old kernels are available for download at
https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

I narrowed the issue to a **single day** of kernel changes:

kernel-6.4.0-0.rc0.20230427git6e98b09da931.5.fc39 - works 100%
kernel-6.4.0-0.rc0.20230428git33afd4b76393.7.fc39 - broken 100%

(The latest kernel in 6.3 series, kernel-6.3.13-200.fc38, works 100% as well).

4. Can you reproduce this issue? If so, please provide the steps to reproduce
the issue below:

Trivial to reproduce, and completely reliable. Have the dock connected and suspend the laptop, then try to resume it. In the older kernels, it works 100%, in the newer kernel, it breaks 100%.

5. Does this problem occur with the latest Rawhide kernel? To install the
Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
``sudo dnf update --enablerepo=rawhide kernel``:

Yes, see 2).

6. Are you running any modules that not shipped with directly Fedora's kernel?:

No.

7. Please attach the kernel logs. You can get the complete kernel log
for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
issue occurred on a previous boot, use the journalctl ``-b`` flag.

I can't get any kernel log after a failed resume. The logs are not written to the disk. The previous boot journal ends with a suspend message.

@jforbes Since I narrowed down the kernel changes to a single day, I assume there's a high chance to get this fixed. I can try to bisect even individual commits, if required. But communication with kernel devs is my biggest worry. Can you try to reach out to the best person on my behalf, or at least advise me how to do that (who, how)? Thanks a lot.

Comment 1 Kamil Páral 2023-08-09 11:14:53 UTC

Created attachment 1982541 [details]
lspci

Comment 2 Justin M. Forbes 2023-08-09 13:23:08 UTC

> @jforbes Since I narrowed down the kernel changes to a single
> day, I assume there's a high chance to get this fixed. I can try to bisect
> even individual commits, if required. But communication with kernel devs is
> my biggest worry. Can you try to reach out to the best person on my behalf,
> or at least advise me how to do that (who, how)? Thanks a lot.

While helpful, a single day in the merge window (rc0) is not a trivial number of commits. That day in particular was over 2700.  If it is strictly in the thunderbolt code, there are few (4).  Unfortunately  thunderbolt interacts with USB and PCI as well, bringing the total commits closer to 300.  A bisect would be helpful.  If you don't have the time to do so, I can reach out, but if you are willing, Thunderbolt tends to react to bugs on bugzilla.kernel.org and they have bot which interfaces with linux-usb.org. Select USB as the component if filing a bugzilla there.  Either way, let me know what you do here. We can either track the upstream to get a fix backported sooner, or I can act as an intermediary.

Comment 3 Kamil Páral 2023-08-09 13:38:36 UTC

> That day in particular was over 2700.

Ouch, I had no idea. I'll try to do the bisect and then file a bug in bugzilla.kernel.org.
I found fedbisect [1], but it hasn't been touched in 6 years. Is it still the tool for this job, or is there some other fedora-specific tool/guide elsewhere?
Thanks a lot for advice.

[1] https://pagure.io/fedbisect

Comment 4 Justin M. Forbes 2023-08-09 16:21:21 UTC

That would probably be a poor tool for the job. I highly recommend doing a bisect the upstream way as it is massively faster than building rpms for each one. We spent more time doing packaging bits than we do building the actual kernel.  https://docs.kernel.org/admin-guide/bug-bisect.html has a quick guide.  Your starting good is 6e98b09da931 and your starting bad is 33afd4b76393

Comment 5 Kamil Páral 2023-08-15 07:22:30 UTC

Justin, I finally bisected this to be caused by the following commit. I verified that it fails to resume in 5/5 attempts, and the last tested good commit successfully resumes in 5/5 attempts. So I'm quite certain this is the source of regression. It's a change in drivers/pci/pci-driver.c. Should I still report it upstream according to your instructions in comment 2, or (since this is in PCI and not Thunderbolt) report it upstream differently? Thanks!


e8b908146d44310473e43b3382eca126e12d279c is the first bad commit
commit e8b908146d44310473e43b3382eca126e12d279c
Author: Mika Westerberg <mika.westerberg.com>
Date:   Tue Apr 4 08:27:13 2023 +0300

    PCI/PM: Increase wait time after resume
    
    PCIe r6.0 sec 6.6.1 prescribes that a device must be able to respond to
    config requests within 1.0 s (PCI_RESET_WAIT) after exiting conventional
    reset and this same delay is prescribed when coming out of D3cold (as that
    involves reset too).
    
    A device that requires more than 1 second to initialize after reset may
    respond to config requests with Request Retry Status completions (sec
    2.3.1), and we accommodate that in Linux with a 60 second cap
    (PCIE_RESET_READY_POLL_MS).
    
    Previously we waited up to PCIE_RESET_READY_POLL_MS only in the reset code
    path, not in the resume path.  However, a device has surfaced, namely Intel
    Titan Ridge xHCI, which requires a longer delay also in the resume code
    path.
    
    Make the resume code path to use this same extended delay as the reset
    path.
    
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=216728
    Link: https://lore.kernel.org/r/20230404052714.51315-2-mika.westerberg@linux.intel.com
    Reported-by: Chris Chiu <chris.chiu>
    Signed-off-by: Mika Westerberg <mika.westerberg.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas>
    Cc: Lukas Wunner <lukas>

 drivers/pci/pci-driver.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comment 6 Kamil Páral 2023-08-15 07:22:55 UTC

Created attachment 1983351 [details]
git bisect log

Comment 7 Kamil Páral 2023-08-15 08:00:41 UTC

I also tested this with a different laptop, Thinkpad P1 gen 3. It resumes just fine with that dock. So this is not a general issue, but there's some connection between the dock and T480s which makes it exhibit the problem.

Comment 8 Justin M. Forbes 2023-08-15 11:37:01 UTC

Nice work finding the commit!  I would likely email linux-pci.org and CC the Signed-off-by and Reported-by emails on that commit. Explain the bisection and the symptoms.

Comment 9 Kamil Páral 2023-08-21 12:39:09 UTC

I sent the email report (with some additional debugging details) to the linux-pci kernel list (and regressions list CCed, per upstream instructions):
https://lore.kernel.org/linux-pci/CA+cBOTeWrsTyANjLZQ=bGoBQ_yOkkV1juyRvJq-C8GOrbW6t9Q@mail.gmail.com/T/#u
https://lore.kernel.org/regressions/CA+cBOTeWrsTyANjLZQ=bGoBQ_yOkkV1juyRvJq-C8GOrbW6t9Q@mail.gmail.com/T/#u

I also updated the kernel bugzilla ticket which the regression commit linked to:
https://bugzilla.kernel.org/show_bug.cgi?id=216728

Thanks for pointers, Justin.

Comment 10 Kamil Páral 2023-08-22 16:27:46 UTC

Turns out the resume is not broken, but delayed with additional 60 seconds. I'll continue the discussion in upstream kernel lists, but add required attachments here, so that I can link them from kernel lists (I assume attaching them directly to the email is not appreciated).

Comment 11 Kamil Páral 2023-08-22 16:28:28 UTC

Created attachment 1984636 [details]
dmesg after delayed resume

Comment 12 Kamil Páral 2023-08-22 16:28:49 UTC

Created attachment 1984637 [details]
journal after delayed resume

Comment 13 Kamil Páral 2023-08-22 16:29:11 UTC

Created attachment 1984638 [details]
lspci -vv before suspend

Comment 14 Kamil Páral 2023-08-22 16:29:37 UTC

Created attachment 1984639 [details]
lspci -vv after delayed resume

Comment 15 Kamil Páral 2023-08-23 07:55:36 UTC

Created attachment 1984726 [details]
dmesg after fast resume

Comment 16 Kamil Páral 2023-08-23 08:08:30 UTC

Created attachment 1984727 [details]
fwupdmgr get-devices output

Comment 17 Kamil Páral 2023-08-23 08:10:25 UTC

Created attachment 1984728 [details]
fwupdmgr get-devices output

Comment 18 Kamil Páral 2023-08-23 12:58:39 UTC

Created attachment 1984785 [details]
dmesg with TB assist mode

Comment 19 Kamil Páral 2023-08-23 12:59:13 UTC

Created attachment 1984786 [details]
dmesg with devices unavailable

Comment 20 Kamil Páral 2023-08-23 13:59:34 UTC

Created attachment 1984802 [details]
acpidump

Comment 21 Kamil Páral 2023-08-23 13:59:57 UTC

Created attachment 1984803 [details]
dmesg with thunderbolt.dyndbg=+p

Comment 22 Kamil Páral 2023-08-25 08:13:11 UTC

Created attachment 1985262 [details]
dmesg with TB user security level

Comment 23 Kamil Páral 2023-08-25 11:25:05 UTC

After extensive debugging in upstream kernel (see comment 9), this is most probably related to the laptop's firmware. When the Thunderbolt security level is set to Secure Connection, it fails to properly reconnect to the dock after resume. With the latest kernels, that causes a ~60 second delay. When the security is lowered to User Authorization, it works fine. Kernel developers suggested to contact Lenovo and ask them to look into this and possibly publish a Thunderbolt firmware update.

@mpearson Hi Mark, is this something you'd be interested in looking into? I don't know what Lenovo's support plans are for Thinkpad T480s, but it still seems to receive updated firmware.

Comment 24 Mark Pearson 2023-08-28 13:36:07 UTC

Hi Kamil,

Ack - I've created internal ticket LO-2616 for tracking. We'll confirm we can reproduce the issue and follow up with the FW team and get their feedback.

Mark

Comment 25 Kamil Páral 2023-09-27 12:57:06 UTC

Created attachment 1990815 [details]
dmesh with TB secure security level and Wake by TB enabled

Comment 26 Kamil Páral 2023-09-27 14:14:40 UTC

Mark, I found out that this problem only happens when I have "Wake by Thunderbolt 3" option disabled (which is a non-default value). See more here:
https://lore.kernel.org/linux-pci/CA+cBOTeWrsTyANjLZQ=bGoBQ_yOkkV1juyRvJq-C8GOrbW6t9Q@mail.gmail.com/T/#ma7b6e1740c042aa624cbf0cef63cd887ae5fd90d

Comment 27 Mark Pearson 2023-11-08 14:33:17 UTC

Thanks. Finally got feedback from the FW team and they are saying this is by design and that the Wake by TB setting should be enabled.
Not sure there is much else we can do here - let me know if we're missing anything important.

Mark

Comment 28 Kamil Páral 2023-11-10 08:41:33 UTC

Thanks a lot for your update, Mark. Sigh, this was a very long exercise regarding a BIOS option that could've been either not present (since the system doesn't work correctly when this is in a non-default value) or better explained :-/ Anyway, I'm going to close this bug - it only happens in non-default BIOS settings and it seems that there's no further action related to this. Thanks for looking into it.

Note You need to log in before you can comment on or make changes to this bug.