Bug 2230357
| Summary: | resume with a Thunderbolt dock broke with commit e8b908146d44 "PCI/PM: Increase wait time after resume" | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Kamil Páral <kparal> | ||||||
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> | ||||||
| Status: | NEW --- | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
| Severity: | unspecified | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 38 | CC: | acaringi, adscvr, airlied, alciregi, bskeggs, hdegoede, hpa, jarod, jforbes, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, ptalbert, steved | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | Type: | Bug | |||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 2184978 | ||||||||
Description
Kamil Páral
2023-08-09 11:12:42 UTC
Created attachment 1982541 [details]
lspci
> @jforbes Since I narrowed down the kernel changes to a single
> day, I assume there's a high chance to get this fixed. I can try to bisect
> even individual commits, if required. But communication with kernel devs is
> my biggest worry. Can you try to reach out to the best person on my behalf,
> or at least advise me how to do that (who, how)? Thanks a lot.
While helpful, a single day in the merge window (rc0) is not a trivial number of commits. That day in particular was over 2700. If it is strictly in the Thunderbolt code, there are few (4). Unfortunately Thunderbolt interacts with USB and PCI as well, bringing the total commits closer to 300. A bisect would be helpful. If you don't have the time to do so, I can reach out, but if you are willing, the Thunderbolt maintainers tend to respond to bugs on bugzilla.kernel.org, and they have a bot which interfaces with linux-usb.org. Select USB as the component if filing a bugzilla there. Either way, let me know what you do here. We can either track the upstream bug to get a fix backported sooner, or I can act as an intermediary.
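To put the commit counts above in perspective, here is a rough sketch of how many build-and-test rounds a bisect needs for each candidate range. It assumes a plain binary search over a linear history (real merge history can shift the count by a round or two), and the helper name `bisect_rounds` is made up for illustration.

```python
def bisect_rounds(num_commits: int) -> int:
    """Worst-case number of build-and-test rounds for a binary search
    that isolates one bad commit among num_commits candidates."""
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    # ceil(log2(n)) computed with integers to avoid float rounding.
    return (num_commits - 1).bit_length()


# Commit counts quoted in the comment above.
for label, count in [("whole day of merges", 2700),
                     ("Thunderbolt + USB + PCI", 300),
                     ("Thunderbolt only", 4)]:
    print(f"{label}: about {bisect_rounds(count)} rounds")
```

Even the full 2700-commit range comes out to roughly a dozen kernel builds, which is why bisecting is practical here.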
> That day in particular was over 2700.

Ouch, I had no idea. I'll try to do the bisect and then file a bug in bugzilla.kernel.org. I found fedbisect [1], but it hasn't been touched in 6 years. Is it still the tool for this job, or is there some other Fedora-specific tool/guide elsewhere? Thanks a lot for advice.

[1] https://pagure.io/fedbisect

That would probably be a poor tool for the job. I highly recommend doing a bisect the upstream way, as it is massively faster than building rpms for each one. We spend more time doing packaging bits than we do building the actual kernel. https://docs.kernel.org/admin-guide/bug-bisect.html has a quick guide. Your starting good is 6e98b09da931 and your starting bad is 33afd4b76393.

Justin, I finally bisected this to the following commit. I verified that it fails to resume in 5/5 attempts, and the last tested good commit successfully resumes in 5/5 attempts. So I'm quite certain this is the source of the regression. It's a change in drivers/pci/pci-driver.c. Should I still report it upstream according to your instructions in comment 2, or (since this is in PCI and not Thunderbolt) report it upstream differently? Thanks!

e8b908146d44310473e43b3382eca126e12d279c is the first bad commit
commit e8b908146d44310473e43b3382eca126e12d279c
Author: Mika Westerberg <mika.westerberg.com>
Date:   Tue Apr 4 08:27:13 2023 +0300

    PCI/PM: Increase wait time after resume

    PCIe r6.0 sec 6.6.1 prescribes that a device must be able to respond to config requests within 1.0 s (PCI_RESET_WAIT) after exiting conventional reset and this same delay is prescribed when coming out of D3cold (as that involves reset too).

    A device that requires more than 1 second to initialize after reset may respond to config requests with Request Retry Status completions (sec 2.3.1), and we accommodate that in Linux with a 60 second cap (PCIE_RESET_READY_POLL_MS).

    Previously we waited up to PCIE_RESET_READY_POLL_MS only in the reset code path, not in the resume path. However, a device has surfaced, namely Intel Titan Ridge xHCI, which requires a longer delay also in the resume code path. Make the resume code path to use this same extended delay as the reset path.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=216728
    Link: https://lore.kernel.org/r/20230404052714.51315-2-mika.westerberg@linux.intel.com
    Reported-by: Chris Chiu <chris.chiu>
    Signed-off-by: Mika Westerberg <mika.westerberg.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas>
    Cc: Lukas Wunner <lukas>

 drivers/pci/pci-driver.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Created attachment 1983351 [details]
git bisect log
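For readers unfamiliar with the wait the quoted commit message describes, the following is a toy simulation of polling a device until it responds, with a cap on the total wait. It is not the kernel's code: the doubling delay schedule and the function name `simulate_wait` are assumptions made for illustration. Only the two time limits (1.0 s and 60 s) come from the commit message.

```python
PCI_RESET_WAIT_MS = 1000          # 1.0 s, as stated in the commit message
PCIE_RESET_READY_POLL_MS = 60000  # 60 s cap, as stated in the commit message


def simulate_wait(ready_after_ms: int, timeout_ms: int):
    """Simulate polling a device that starts answering config requests
    ready_after_ms after resume. Delays double each round (an assumed
    schedule). Returns the total time waited in ms, or None if the device
    was still not ready when timeout_ms ran out."""
    waited = 0
    delay = 1
    while waited < ready_after_ms:      # device not answering yet
        if waited >= timeout_ms:
            return None                 # gave up: cap reached
        step = min(delay, timeout_ms - waited)
        waited += step                  # stands in for a real sleep
        delay *= 2
    return waited


# A hypothetical device that needs 1.5 s to come back:
print(simulate_wait(1500, PCI_RESET_WAIT_MS))         # times out under the 1 s limit
print(simulate_wait(1500, PCIE_RESET_READY_POLL_MS))  # succeeds under the 60 s cap
```

The sketch only shows the intent of the longer cap. It also shows the other side of that change: a device that never answers is now waited on for up to 60 s instead of 1 s. It does not explain why resume fails with this dock.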
I also tested this with a different laptop, ThinkPad P1 gen 3. It resumes just fine with that dock. So this is not a general issue, but there's some connection between the dock and T480s which makes it exhibit the problem.

Nice work finding the commit! I would likely email linux-pci.org and CC the Signed-off-by and Reported-by emails on that commit. Explain the bisection and the symptoms.