Bug 2230357 - resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume"
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 38
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Blocks: 2184978
Reported: 2023-08-09 11:12 UTC by Kamil Páral
Modified: 2023-08-15 11:37 UTC

Type: Bug


Attachments
lspci (2.87 KB, text/plain)
2023-08-09 11:14 UTC, Kamil Páral
git bisect log (2.38 KB, text/plain)
2023-08-15 07:22 UTC, Kamil Páral

Description Kamil Páral 2023-08-09 11:12:42 UTC
1. Please describe the problem:

Thinkpad T480s with Thinkpad Thunderbolt 3 Dock can no longer resume from sleep since kernel-6.4.0-0.rc0.20230428git33afd4b76393.7.fc39. There's only a black screen, the laptop is unresponsive, ssh can't connect, and it has to be force-rebooted.

** This only happens if the dock is attached. When not attached, resume works correctly. **


2. What is the Version-Release number of the kernel:

Latest tested stable kernel: kernel-6.4.8-200.fc38.x86_64
Latest tested unstable kernel: kernel-6.5.0-0.rc5.20230808git14f9643dc90a.37.fc39

The resume is 100% broken in them.


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :


I narrowed the issue to a **single day** of kernel changes:

kernel-6.4.0-0.rc0.20230427git6e98b09da931.5.fc39  - works 100%
kernel-6.4.0-0.rc0.20230428git33afd4b76393.7.fc39  - broken 100%

(The latest kernel in 6.3 series, kernel-6.3.13-200.fc38, works 100% as well).


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Trivial to reproduce, and completely reliable. Have the dock connected and suspend the laptop, then try to resume it. With the older kernels it works 100% of the time; with the newer kernels it breaks 100% of the time.


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Yes, see 2).


6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No.


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

I can't get any kernel log after a failed resume. The logs are not written to the disk. The previous boot journal ends with a suspend message.


@jforbes Since I narrowed down the kernel changes to a single day, I assume there's a high chance to get this fixed. I can try to bisect even individual commits, if required. But communication with kernel devs is my biggest worry. Can you try to reach out to the best person on my behalf, or at least advise me how to do that (who, how)? Thanks a lot.

Comment 1 Kamil Páral 2023-08-09 11:14:53 UTC
Created attachment 1982541
lspci

Comment 2 Justin M. Forbes 2023-08-09 13:23:08 UTC
> @jforbes Since I narrowed down the kernel changes to a single
> day, I assume there's a high chance to get this fixed. I can try to bisect
> even individual commits, if required. But communication with kernel devs is
> my biggest worry. Can you try to reach out to the best person on my behalf,
> or at least advise me how to do that (who, how)? Thanks a lot.

While helpful, a single day in the merge window (rc0) is not a trivial number of commits. That day in particular was over 2700. If the problem is strictly in the Thunderbolt code, there are few (4). Unfortunately, Thunderbolt interacts with USB and PCI as well, bringing the total closer to 300 commits. A bisect would be helpful. If you don't have the time to do so, I can reach out, but if you are willing, the Thunderbolt maintainers tend to react to bugs on bugzilla.kernel.org, and they have a bot which interfaces with the linux-usb mailing list. Select USB as the component if filing a bug there. Either way, let me know what you do here. We can either track the upstream report to get a fix backported sooner, or I can act as an intermediary.

Comment 3 Kamil Páral 2023-08-09 13:38:36 UTC
> That day in particular was over 2700.

Ouch, I had no idea. I'll try to do the bisect and then file a bug in bugzilla.kernel.org.
I found fedbisect [1], but it hasn't been touched in 6 years. Is it still the tool for this job, or is there some other fedora-specific tool/guide elsewhere?
Thanks a lot for advice.

[1] https://pagure.io/fedbisect

Comment 4 Justin M. Forbes 2023-08-09 16:21:21 UTC
That would probably be a poor tool for the job. I highly recommend doing the bisect the upstream way, as it is massively faster than building RPMs for each step. We spend more time on packaging bits than we do building the actual kernel. https://docs.kernel.org/admin-guide/bug-bisect.html has a quick guide. Your starting good commit is 6e98b09da931 and your starting bad commit is 33afd4b76393.
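
For a rough sense of why bisecting is practical even across some 2700 commits, here is a minimal Python sketch of the binary search that a bisect performs. It is an illustration only: the commit names, the linear history, and the position of the culprit are all hypothetical, and real kernel history is a graph rather than a list, so the step count git reports can differ.

```python
def first_bad(commits, is_bad):
    """Binary-search for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and
    every commit after the first bad one is also bad.
    Returns (first_bad_commit, number_of_tests_run).
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1  # one kernel build plus one suspend/resume test
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tests


# Hypothetical linear history: c0 is the good start, c2700 the bad end.
commits = [f"c{i}" for i in range(2701)]
culprit_index = 1234  # hypothetical position of the regression

result, tests = first_bad(commits, lambda c: int(c[1:]) >= culprit_index)
print(f"first bad commit: {result}, found after {tests} tests")
```

Each test halves the remaining range, so the number of builds grows with the logarithm of the commit count rather than with the count itself.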

Comment 5 Kamil Páral 2023-08-15 07:22:30 UTC
Justin, I finally bisected this to be caused by the following commit. I verified that it fails to resume in 5/5 attempts, and the last tested good commit successfully resumes in 5/5 attempts. So I'm quite certain this is the source of regression. It's a change in drivers/pci/pci-driver.c. Should I still report it upstream according to your instructions in comment 2, or (since this is in PCI and not Thunderbolt) report it upstream differently? Thanks!


e8b908146d44310473e43b3382eca126e12d279c is the first bad commit
commit e8b908146d44310473e43b3382eca126e12d279c
Author: Mika Westerberg <mika.westerberg.com>
Date:   Tue Apr 4 08:27:13 2023 +0300

    PCI/PM: Increase wait time after resume
    
    PCIe r6.0 sec 6.6.1 prescribes that a device must be able to respond to
    config requests within 1.0 s (PCI_RESET_WAIT) after exiting conventional
    reset and this same delay is prescribed when coming out of D3cold (as that
    involves reset too).
    
    A device that requires more than 1 second to initialize after reset may
    respond to config requests with Request Retry Status completions (sec
    2.3.1), and we accommodate that in Linux with a 60 second cap
    (PCIE_RESET_READY_POLL_MS).
    
    Previously we waited up to PCIE_RESET_READY_POLL_MS only in the reset code
    path, not in the resume path.  However, a device has surfaced, namely Intel
    Titan Ridge xHCI, which requires a longer delay also in the resume code
    path.
    
    Make the resume code path to use this same extended delay as the reset
    path.
    
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=216728
    Link: https://lore.kernel.org/r/20230404052714.51315-2-mika.westerberg@linux.intel.com
    Reported-by: Chris Chiu <chris.chiu>
    Signed-off-by: Mika Westerberg <mika.westerberg.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas>
    Cc: Lukas Wunner <lukas>

 drivers/pci/pci-driver.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
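
The difference between the two timeouts named in the commit message can be illustrated with a small Python model. This is a simplified sketch, not the kernel's code: the doubling poll delay and the function name are assumptions made for illustration, and only the two constants (1.0 s and 60 s) come from the commit message.

```python
PCI_RESET_WAIT_MS = 1000           # 1.0 s, per the commit message
PCIE_RESET_READY_POLL_MS = 60_000  # 60 s cap, per the commit message


def wait_for_device(ready_after_ms, timeout_ms):
    """Model polling a device with doubling delays (no real sleeping).

    ready_after_ms is a hypothetical time at which the device starts
    answering config requests. Returns (became_ready, elapsed_ms).
    """
    elapsed = 0
    delay = 1
    while elapsed < ready_after_ms:
        if delay > timeout_ms:
            return False, elapsed  # give up: next delay exceeds the cap
        elapsed += delay           # stands in for sleeping `delay` ms
        delay *= 2
    return True, elapsed


# A hypothetical device that needs 3 s to become ready after resume.
print(wait_for_device(3000, PCI_RESET_WAIT_MS))
print(wait_for_device(3000, PCIE_RESET_READY_POLL_MS))
```

In this model the shorter timeout gives up on the slow device, while the longer one keeps polling until the device answers.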

Comment 6 Kamil Páral 2023-08-15 07:22:55 UTC
Created attachment 1983351
git bisect log

Comment 7 Kamil Páral 2023-08-15 08:00:41 UTC
I also tested this with a different laptop, Thinkpad P1 gen 3. It resumes just fine with that dock. So this is not a general issue, but there's some connection between the dock and T480s which makes it exhibit the problem.

Comment 8 Justin M. Forbes 2023-08-15 11:37:01 UTC
Nice work finding the commit! I would likely email the linux-pci mailing list and CC the Signed-off-by and Reported-by addresses on that commit. Explain the bisection and the symptoms.

