Bug 1853960 - 5.7 kernel regression: unable to suspend, AER errors without pci=noaer
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 32
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-06 00:18 UTC by Robert Hancock
Modified: 2020-08-08 02:13 UTC (History)

Fixed In Version:
Doc Type: ---
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-08 02:13:47 UTC
Type: Bug
Embargoed:


Attachments
dmesg from bootup, suspend attempt (190.65 KB, text/plain)
2020-07-06 00:18 UTC, Robert Hancock


Links
System        ID      Private  Priority  Status  Summary  Last Updated
Linux Kernel  208667  0        None      None    None     2020-07-23 00:43:46 UTC

Description Robert Hancock 2020-07-06 00:18:28 UTC
Created attachment 1699965
dmesg from bootup, suspend attempt

1. Please describe the problem:
My system (based on an Asus PRIME H270-PRO motherboard) fails to suspend properly under 5.7 kernels. It starts to suspend but then immediately wakes back up again.

I also noticed a bunch of PCIe AER error spam in dmesg that did not occur with 5.6-based kernels, for example:

[   12.909890] pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[   12.909890] pcieport 0000:00:1c.0: AER:   device [8086:a292] error status/mask=00003000/00002000
[   12.909891] pcieport 0000:00:1c.0: AER:    [12] Timeout               
[   12.909896] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[   12.909899] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[   12.909900] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[   12.909902] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[   12.909903] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[   12.909906] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[   12.910012] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[   12.910015] pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[   12.910015] pcieport 0000:00:1c.0: AER:   device [8086:a292] error status/mask=00001000/00002000
[   12.910016] pcieport 0000:00:1c.0: AER:    [12] Timeout               
[   12.910020] pcieport 0000:00:1c.0: AER: Corrected error received: 0000:00:1c.0
[   12.910023] pcieport 0000:00:1c.0: AER: can't find device of ID00e0
[   12.910157] pcieport 0000:00:1c.0: AER: Multiple Corrected error received: 0000:00:1c.0
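
For readers decoding the log excerpt above, here is a minimal shell sketch, not part of the original report. It assumes the standard PCIe AER Correctable Error Status bit layout and the usual 16-bit bus/device/function encoding of a PCIe ID. The function names `aer_cor_bit_name`, `decode_aer_cor`, and `decode_pci_id` are hypothetical helpers, not kernel or pciutils tools.

```shell
#!/bin/sh
# Hypothetical helper: name a bit of the AER Correctable Error Status
# register (assumed standard PCIe bit layout).
aer_cor_bit_name() {
    case "$1" in
        0)  echo "Receiver Error" ;;
        6)  echo "Bad TLP" ;;
        7)  echo "Bad DLLP" ;;
        8)  echo "REPLAY_NUM Rollover" ;;
        12) echo "Replay Timer Timeout" ;;
        13) echo "Advisory Non-Fatal Error" ;;
        14) echo "Corrected Internal Error" ;;
        15) echo "Header Log Overflow" ;;
        *)  echo "Reserved" ;;
    esac
}

# Hypothetical helper: list every bit set in the "status" half of a
# status/mask pair and say whether the "mask" half masks it.
# Arguments are hex strings without the 0x prefix, as printed in dmesg.
decode_aer_cor() {
    status=$(( 0x$1 ))
    mask=$(( 0x$2 ))
    bit=0
    while [ "$bit" -lt 32 ]; do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
                state=masked
            else
                state=unmasked
            fi
            echo "[$bit] $(aer_cor_bit_name "$bit") ($state)"
        fi
        bit=$(( bit + 1 ))
    done
}

# Hypothetical helper: split a 16-bit ID (bus in bits 15:8, device in
# bits 7:3, function in bits 2:0) into bus:device.function form.
decode_pci_id() {
    id=$(( 0x$1 ))
    printf '%02x:%02x.%x\n' $(( id >> 8 )) $(( (id >> 3) & 0x1f )) $(( id & 0x7 ))
}
```

Under those assumptions, `decode_aer_cor 00003000 00002000` reports bit 12 (Replay Timer Timeout) as unmasked and bit 13 (Advisory Non-Fatal Error) as masked, which would fit the log showing only "[12] Timeout". `decode_pci_id 00e0` yields 00:1c.0, so the ID in the "can't find device" messages appears to refer to the root port itself.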

Device 1c.0 is a PCI Express root port, which is connected to an ASMedia PCIe to PCI bridge:

00:1c.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #3 (rev f0)
02:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 04)

2. What is the Version-Release number of the kernel:

kernel-5.7.7-200.fc32.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

kernel-5.7.6-201.fc32.x86_64 was the first version I have seen that had the problem. kernel-5.6.19-300.fc32.x86_64 works fine.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
Fails every time on this system.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Have not tried

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Comment 1 Robert Hancock 2020-07-06 00:22:43 UTC
It appears that the AER errors are related to the suspend failure, as suspend works if the pci=noaer option is added to the kernel command line. I am guessing that these errors occurring during the suspend process are causing the machine to immediately wake up again.
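
For anyone wanting to try the same workaround, here is a minimal sketch, not taken from the report, for checking whether a boot used pci=noaer. The helper names `has_kernel_arg` and `report_noaer` are hypothetical. The grubby commands in the comments are the usual way to edit kernel arguments on Fedora, but treat them as an assumption about your setup.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the argument $1 appears as a whole word
# in the kernel command line string $2. This does not handle combined
# forms such as pci=noaer,nomsi.
has_kernel_arg() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Hypothetical helper: report whether the running kernel was booted
# with pci=noaer, based on /proc/cmdline.
report_noaer() {
    if has_kernel_arg pci=noaer "$(cat /proc/cmdline)"; then
        echo "AER disabled (pci=noaer present)"
    else
        echo "AER enabled (pci=noaer absent)"
    fi
}

# To add or remove the option persistently (run as root):
#   grubby --update-kernel=ALL --args="pci=noaer"
#   grubby --update-kernel=ALL --remove-args="pci=noaer"
```

Calling `report_noaer` after a reboot shows which state the running kernel is in, which helps when comparing suspend behavior with and without the option.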

Comment 2 Robert Hancock 2020-07-11 00:28:48 UTC
Reported to LKML: https://lkml.org/lkml/2020/7/10/1267

Comment 3 Robert Hancock 2020-07-22 00:02:00 UTC
As I posted on LKML, it seems that the issue may have been caused by an upstream change that went into the 5.7 stable series to enable PCIe ASPM on PCIe to PCI bridges:

commit 66ff14e59e8a30690755b08bc3042359703fb07a
Author: Kai-Heng Feng <kai.heng.feng>
Date:   Wed May 6 01:34:21 2020 +0800

    PCI/ASPM: Allow ASPM on links to PCIe-to-PCI/PCI-X Bridges
    
    7d715a6c1ae5 ("PCI: add PCI Express ASPM support") added the ability for
    Linux to enable ASPM, but for some undocumented reason, it didn't enable
    ASPM on links where the downstream component is a PCIe-to-PCI/PCI-X Bridge.
    
    Remove this exclusion so we can enable ASPM on these links.
    
    The Dell OptiPlex 7080 mentioned in the bugzilla has a TI XIO2001
    PCIe-to-PCI Bridge.  Enabling ASPM on the link leading to it allows the
    Intel SoC to enter deeper Package C-states, which is a significant power
    savings.
    
    [bhelgaas: commit log]
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207571
    Link: https://lore.kernel.org/r/20200505173423.26968-1-kai.heng.feng@canonical.com
    Signed-off-by: Kai-Heng Feng <kai.heng.feng>
    Signed-off-by: Bjorn Helgaas <bhelgaas>
    Reviewed-by: Mika Westerberg <mika.westerberg.com>

Disabling ASPM manually on this ASMedia bridge device as well as the PCIe root port it is connected to seems to resolve the problem:

setpci -s 00:1c.0 0x50.B=0x00
setpci -s 02:00.0 0x90.B=0x00
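
The two commands above write 0x00 to a whole byte, presumably the low byte of the PCIe Link Control register on each device, which clears every bit in that byte and not only the ASPM Control field (bits 1:0). A narrower variant is sketched below. It assumes the `CAP_EXP` capability name and the `value:mask` syntax documented for setpci in pciutils. The function names `clear_aspm_bits` and `disable_aspm` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: given the current low byte of Link Control,
# print the value with only ASPM Control (bits 1:0) cleared.
clear_aspm_bits() {
    printf '0x%02x\n' $(( $1 & ~0x03 & 0xff ))
}

# Hypothetical wrapper: clear only bits 1:0 of Link Control (offset 0x10
# in the PCI Express capability) on the device at address $1.
# Requires root. CAP_EXP lets setpci locate the capability itself, so no
# device-specific offset such as 0x50 or 0x90 is needed.
disable_aspm() {
    setpci -s "$1" CAP_EXP+10.b=0:3
}
```

For example, `disable_aspm 00:1c.0` followed by `disable_aspm 02:00.0` targets the same root port and bridge as the original commands. Like those commands, the change does not persist across a reboot.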

Comment 5 Robert Hancock 2020-07-31 07:14:37 UTC
Patch has been merged into mainline: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b361663c5a40c8bc758b7f7f2239f7a192180e7c

I have nominated it for stable kernels as well, as the previous patch that exposed the issue was added to stable.

Comment 6 Robert Hancock 2020-08-08 02:13:47 UTC
Fixed in build kernel-5.7.14-200.fc32: https://koji.fedoraproject.org/koji/buildinfo?buildID=1586714

