Bug 1737046 - [bisected] 5.1.20-300.fc30 regression: no wakeup from suspend to RAM (also present in 5.2.5-200.fc30 testing and vanilla 5.3-rc2)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 30
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1735786
Reported: 2019-08-02 12:34 UTC by Matthias Andree
Modified: 2020-05-01 12:35 UTC
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-01 12:35:44 UTC
Type: Bug
Embargoed:




Links
System        ID      Private  Priority  Status  Summary  Last Updated
Linux Kernel  204413  0        None      None    None     2019-08-07 19:26:36 UTC

Description Matthias Andree 2019-08-02 12:34:56 UTC
Description of problem:
kernel-core-5.1.20-300.fc30.x86_64 fails to wake from suspend (STR),
5.1.19-300.fc30 did fine on the same computer.
5.2.5-200.fc30 fails in a different way.

Version-Release number of selected component (if applicable):
5.1.20-300.fc30.x86_64

How reproducible:
always

Steps to Reproduce:
1. boot and log into GNOME desktop
2. click pause symbol to suspend the computer to RAM, wait until suspended
3. press key on keyboard, or power button

Actual results:
The computer tries to wake up and the HDD LED blinks briefly, but the console does not wake. Another computer on the network cannot ping the waking computer.

Expected results:
The computer wakes up properly, as it did with kernel-core-5.1.19-300.fc30.x86_64 and earlier kernels.

Additional info:
With PM tracing enabled, the next boot reported:
[    0.827930] PM:   hash matches drivers/base/power/main.c:1021
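For anyone reproducing this, the PM trace procedure can be sketched roughly as below. This assumes a kernel built with CONFIG_PM_TRACE_RTC and root access; per the kernel's s2ram documentation the trace value is kept in the RTC, so the clock will be wrong after the reboot. The helper name pm_trace_hits is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes CONFIG_PM_TRACE_RTC=y and root privileges.
#
# Before suspending (the RTC is used to hold the trace value, so the
# wall clock will need resetting afterwards):
#   echo 1 > /sys/power/pm_trace
#   systemctl suspend
# After the failed resume, power-cycle the machine and on the next boot run:
#   dmesg | pm_trace_hits

# pm_trace_hits (hypothetical helper): read kernel log lines on stdin and
# print the source location(s) reported after "hash matches".
pm_trace_hits() {
  sed -n 's/.*hash matches \([^ ]*\).*/\1/p'
}
```

Fed the log line quoted above, this prints drivers/base/power/main.c:1021.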

It appears that suspend to disk still works.

The computer has an NVIDIA GeForce 1060 PCIe graphics board, but 5.1.19 and earlier suspended properly, and the 5.1.20 and 5.2.5 suspend issues also occur when the nvidia kernel modules are renamed out of the way and nouveau remains blocked, so this is not an NVIDIA driver issue.

Comment 1 Matthias Andree 2019-08-02 12:41:10 UTC
Note that this is reproducible without the NVIDIA proprietary/binary drivers, and that for me 5.1.19 still suspends properly with the new NVIDIA binary driver.
Note that Bugzilla search didn't turn up bug 1735786 when I searched for regression or suspend bugs... sorry. Setting Depends:.

Comment 2 naaa 2019-08-02 15:38:39 UTC
With kernel 5.2.5, on the second resume, the desktop GUI can show up for me, but all windows and GNOME are frozen. If a browser is open on a page, I can scroll that page but cannot do anything else. I have to force a reboot. This is with nouveau.

Comment 3 Matthias Andree 2019-08-02 16:17:06 UTC
I have bisected this with "git bisect" on the vanilla stable kernel, branch stable/linux-5.1.y (because my starting points, good 5.1.19 and bad 5.1.20, are both on that branch).
The failure-inducing commit on the branch is 3c795a8e3481e4dec071b5956e7177e816f6e7f1 (see below), which was cherry-picked from master's c2bf1fc212f7e6f25ace1af8f0b3ac061ea48ba5 (merged through cf2d213e49fdf47e4c10dc629a3659e0026a54b8, v5.3-rc1~167)
and was also picked into stable/linux-5.2.y as 5817d78eba34f6c86f5462ae2c5212f80a013357 (v5.2.3~291).
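The bisection described above can be sketched as two small shell helpers around the standard git bisect subcommands. This assumes a kernel clone with the stable tags v5.1.19 and v5.1.20 available, and that each candidate is built, installed and booted by whatever method you normally use; the helper names bisect_start and bisect_mark are hypothetical.

```shell
#!/bin/sh
# Sketch only: run inside a kernel git clone that has the v5.1.19 and
# v5.1.20 stable tags. Helper names are hypothetical.

# bisect_start: v5.1.20 fails to resume (bad), v5.1.19 works (good).
# git then checks out a commit roughly halfway between the two.
bisect_start() {
  git bisect start &&
  git bisect bad v5.1.20 &&
  git bisect good v5.1.19
}

# bisect_mark good|bad: record the result after building, installing and
# booting the checked-out kernel and doing a real "systemctl suspend" plus
# wakeup. Repeat until git names the first bad commit, then "git bisect reset".
bisect_mark() {
  case "$1" in
    good|bad) git bisect "$1" ;;
    *) echo "usage: bisect_mark good|bad" >&2; return 2 ;;
  esac
}
```

Marking each step by hand is deliberate: "git bisect run" is awkward here because a failed resume leaves the machine unresponsive until it is power-cycled.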

Sasha Levin's signoff is only on the stable branches, not on master.

------------------------------------------------------------
commit 3c795a8e3481e4dec071b5956e7177e816f6e7f1 (refs/bisect/bad)
Author: Mika Westerberg <mika.westerberg.com>  2019-06-12 12:57:38
Committer: Greg Kroah-Hartman <gregkh>  2019-07-26 09:12:37
Parent: 70cc29dba925b8a99a4917c2b5fa6702d0d496d1 (bpf: fix callees pruning callers)
Child:  a98c15177f72ae3c0a736bb324e66c279bf94899 (net: netsec: initialize tx ring on ndo_open)
Branch: remotes/stable/linux-5.1.y
Follows: v5.1.19
Precedes: v5.1.20

    PCI: Add missing link delays required by the PCIe spec
    
    [ Upstream commit c2bf1fc212f7e6f25ace1af8f0b3ac061ea48ba5 ]
    
    Currently Linux does not follow PCIe spec regarding the required delays
    after reset. A concrete example is a Thunderbolt add-in-card that
    consists of a PCIe switch and two PCIe endpoints:
    
      +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller
                                      +-01.0-[04-36]-- DS hotplug port
                                      +-02.0-[37]----00.0 xHCI controller
                                      \-04.0-[38-6b]-- DS hotplug port
    
    The root port (1b.0) and the PCIe switch downstream ports are all PCIe
    gen3 so they support 8GT/s link speeds.
    
    We wait for the PCIe hierarchy to enter D3cold (runtime):
    
      pcieport 0000:00:1b.0: power state changed by ACPI to D3cold
    
    When it wakes up from D3cold, according to the PCIe 4.0 section 5.8 the
    PCIe switch is put to reset and its power is re-applied. This means that
    we must follow the rules in PCIe 4.0 section 6.6.1.
[...]
    Signed-off-by: Mika Westerberg <mika.westerberg.com>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki>
    Signed-off-by: Sasha Levin <sashal>

 drivers/pci/pci.c               | 29 +++++++++++++++++++----------
 drivers/pci/pci.h               |  1 +
 drivers/pci/pcie/portdrv_core.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 86 insertions(+), 10 deletions(-)

Comment 4 naaa 2019-08-02 16:48:14 UTC
It sounds like this was meant for PCIe 4.0 hardware? I have a PCIe 3.0 motherboard (https://www.msi.com/Motherboard/B350M-PRO-VDH/Specification), and the GPU should be PCIe 3.0 as well. Interesting.

Comment 5 Matthias Andree 2019-08-02 18:11:25 UTC
MSI X370 SLI PLUS (alias MS-7A33) here, with Ryzen 7 1700 and Zotac-based NVIDIA GeForce 1060-6GB, so no PCIe 4.0 HW anywhere.

Comment 6 Justin M. Forbes 2019-08-03 16:19:56 UTC
Can someone test this with the 5.3 rc kernel and see if it is broken there, or if we need to pull in an additional patch?

Comment 7 Matthias Andree 2019-08-03 17:18:26 UTC
Justin, I did that yesterday on a vanilla 5.3-rc2 so I could report where the error came from, see https://bugzilla.kernel.org/show_bug.cgi?id=204413#c2 (the link was already in the external trackers):
5.2.5 and 5.3-rc2 also need a "git revert" of the offending patch for me.
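Testing a tree with the patch reverted can be sketched as below. The per-tree hashes are the ones listed in comment 3 (3c795a8e on 5.1.y, 5817d78e on 5.2.y, c2bf1fc2 on master); the helper names are hypothetical, and the mapping only makes sense for releases that already contain the commit.

```shell
#!/bin/sh
# Sketch only; helper names are hypothetical.

# offending_commit BASE: print the hash of "PCI: Add missing link delays
# required by the PCIe spec" in the tree that BASE belongs to. Only valid
# for bases that already contain it (v5.1.20+, v5.2.3+, v5.3-rc1+).
offending_commit() {
  case "$1" in
    v5.1.*)   echo 3c795a8e3481e4dec071b5956e7177e816f6e7f1 ;;
    v5.2.*)   echo 5817d78eba34f6c86f5462ae2c5212f80a013357 ;;
    v5.3-rc*) echo c2bf1fc212f7e6f25ace1af8f0b3ac061ea48ba5 ;;
    *)        echo "unknown base: $1" >&2; return 1 ;;
  esac
}

# checkout_with_revert BASE: check out BASE (detached HEAD) and commit a
# revert of the offending patch on top, ready to build and test.
checkout_with_revert() {
  commit=$(offending_commit "$1") || return
  git checkout "$1" && git revert --no-edit "$commit"
}
```

For example, "checkout_with_revert v5.3-rc2" reverts c2bf1fc212f7e6f25ace1af8f0b3ac061ea48ba5 on top of 5.3-rc2.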

Bjorn Helgaas pointed Mika Westerberg to my report, see https://www.spinics.net/lists/linux-pci/msg85535.html

I haven't yet tested Fedora-derived kernels other than the broken kernel-core-5.1.20-300.fc30.x86_64 from @updates, since the vanilla kernel fails in the same manner as Fedora's 5.1.20.

Let me know if (a) the Fedora kernel is worth testing nonetheless and (b) if yes, whether the instructions in https://fedoraproject.org/wiki/Building_a_custom_kernel#Building_a_kernel_from_the_exploded_git_trees are still current.

Note: PM testing per the 01.org instructions was fruitless; you need to do the real thing, meaning "systemctl suspend" and wakeup, to trigger the bug. Any pm_test setting other than "none" will mask it.
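A guard against the pm_test pitfall might look like this. It assumes a kernel with CONFIG_PM_DEBUG, where /sys/power/pm_test lists the available modes with the active one in brackets (for example "[none] core processors platform devices freezer"); the helper names are hypothetical.

```shell
#!/bin/sh
# Sketch only; helper names are hypothetical.

# pm_test_mode [FILE]: print the active pm_test mode, i.e. the word shown
# in brackets in FILE (default /sys/power/pm_test).
pm_test_mode() {
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "${1:-/sys/power/pm_test}"
}

# real_suspend: only suspend when pm_test is absent or set to "none",
# because any other mode stops short of a real suspend and masks this bug.
real_suspend() {
  if [ -r /sys/power/pm_test ]; then
    mode=$(pm_test_mode) || return
    if [ "$mode" != none ]; then
      echo "pm_test is '$mode'; as root run: echo none > /sys/power/pm_test" >&2
      return 1
    fi
  fi
  systemctl suspend
}
```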

Comment 8 naaa 2019-08-03 17:21:17 UTC
(In reply to Justin M. Forbes from comment #6)
> Can someone test this with the 5.3 rc kernel and see if it is broken there,
> or if we need to pull in an additional patch?

$ curl -s https://repos.fedorapeople.org/repos/thl/kernel-vanilla.repo | sudo tee /etc/yum.repos.d/kernel-vanilla.repo
$ sudo dnf --enablerepo=kernel-vanilla-mainline update

reboot and use kernel 5.3.0-0.rc2.git4.1.vanilla.knurd.1.fc30 x86_64...

I suspended twice, and the second suspend fails exactly the same way for me. No difference.

Comment 9 Fedora Update System 2019-08-06 12:43:51 UTC
FEDORA-2019-a7f551b8c9 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-a7f551b8c9

Comment 10 Matthias Andree 2019-08-06 16:32:22 UTC
Ouch. Note that upstream has chosen to revert the offending commit, see <https://bugzilla.kernel.org/show_bug.cgi?id=204413#c12>

Comment 11 Matthias Andree 2019-08-06 18:00:07 UTC
Warning, Linux stable 5.2.6 and 5.2.7 still have this regression.

Comment 12 Matthias Andree 2019-08-06 19:36:00 UTC
kernel-core-5.2.6-200.fc30.x86_64 seems to work for me; it has survived two suspend/resume cycles without ill effect.

Comment 13 naaa 2019-08-06 20:31:36 UTC
Yes, so far it does here as well =)

Comment 14 Fedora Update System 2019-08-07 01:07:42 UTC
kernel-5.2.6-200.fc30, kernel-headers-5.2.6-200.fc30, kernel-tools-5.2.6-200.fc30 have been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-a7f551b8c9

Comment 15 naaa 2019-08-07 18:10:34 UTC
Alright, so 5.2.6-200 still hasn't had a suspend issue. I consider it fixed.

Thank you, Matthias and all others who helped fix it :-)

Comment 16 Fedora Update System 2019-08-09 01:03:16 UTC
kernel-5.2.6-200.fc30, kernel-headers-5.2.6-200.fc30, kernel-tools-5.2.6-200.fc30 have been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.

Comment 17 Matthias Andree 2019-08-10 19:35:19 UTC
Fedora's 5.2.6-200.fc30.x86_64 can properly resume-from-suspend again for me (because Justin reverted the offending patch), the upstream vanilla kernel has the revert queued up for 5.2.9 and 5.3-rc4. Upstream debugging still ongoing, see kernel.org bug.

Comment 18 Ben Cotton 2020-04-30 21:22:46 UTC
This message is a reminder that Fedora 30 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '30'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 30 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 19 Matthias Andree 2020-05-01 11:00:06 UTC
5.6.6-200.fc31.x86_64 also works. What's necessary to close this bug with a proper status? "CLOSED-FIXED" isn't available to me (probably due to workflow restrictions in the Bugzilla configuration).

Comment 20 Hans de Goede 2020-05-01 12:35:44 UTC
Since this is fixed by a (kernel) update, the proper resolution is ERRATA, I'll close it with this resolution right away.

