1679140 – Bad r8169 performance on some networks after suspend

Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	29
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance

Reported:	2019-02-20 12:48 UTC by Loïc Yhuel
Modified:	2019-11-27 23:18 UTC
CC List:	20 users
Last Closed:	2019-11-27 23:18:15 UTC
Type:	Bug

Attachments
lspci -vvv after a reboot (correct performance) (41.35 KB, text/plain) 2019-02-20 19:06 UTC, Loïc Yhuel
lspci -vvv after a suspend (bad performance) (41.35 KB, text/plain) 2019-02-20 19:07 UTC, Loïc Yhuel

Description Loïc Yhuel 2019-02-20 12:48:49 UTC

1. Please describe the problem:
On my Dell Inspiron Gaming 15 7567, I have Ethernet performance issues on some networks (which work properly when connected through a USB adapter), but only after a suspend (I have never seen the issue after a reboot).

Depending on which network I am connected to, the result can be:
 - ok
 - the link goes up/down a few times at 1Gbps and ends up at 100Mbps: unplugging and replugging the cable is enough to fix it
 - link detected at 1Gbps, but very slow: often 20-50Mbps, sometimes 100Mbps
 - link detected at 1Gbps, but no traffic (one time, forcing the link speed fixed the issue)

2. What is the Version-Release number of the kernel:
4.20.8-200.fc29.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
I think 4.16-4.17 were OK (but I'm not 100% sure, since the issue is partly random); I'm not sure about 4.18-4.19.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
It almost always happens if I connect to the network at work after the laptop has been suspended/resumed. But at home everything works fine after a resume.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
not yet tested

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
no

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
Boot on work network (works ok) :
[    1.790237] libphy: r8169: probed
[    1.790604] r8169 0000:02:00.0 eth0: RTL8168h/8111h, d4:81:d7:95:e6:ab, XID 54100800, IRQ 125
[    1.790605] r8169 0000:02:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[    1.800538] r8169 0000:02:00.0 enp2s0: renamed from eth0
[    5.984953] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
[    9.622956] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off
[    9.627005] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off

On resume from suspend :
[101949.684539] r8169 0000:02:00.0 enp2s0: Link is Down
[101950.303035] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
[101954.623975] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off
[101954.635536] r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off

Comment 1 Loïc Yhuel 2019-02-20 12:55:58 UTC

rmmod r8169, then modprobe r8169 doesn't help

/sys/class/net/enp2s0/phydev/phy_id : 0x001cc800

I tried comparing the output of "ethtool -d enp2s0" (not easy, since ethtool doesn't support my NIC, so I have to look up the register addresses in the kernel), and I didn't find anything conclusive.
Could the kernel make the PHY less stable, or something like that? I don't know if there is a way to dump low-level parameters ("ethtool --phy-statistics" isn't supported).

Comment 2 Steve 2019-02-20 17:09:41 UTC

[    1.790604] r8169 0000:02:00.0 eth0: RTL8168h/8111h, d4:81:d7:95:e6:ab, XID 54100800, IRQ 125

That is the same device as in:

Bug 1671958 - Slow RX speed with RTL8168

If you are able to build a kernel, you could try the diagnostic patch suggested in Bug 1671958, Comment 21.

If your BIOS supports it, you could also try "setting ASPM to 'off'" (Bug 1671958, Comment 64).

Comment 3 Loïc Yhuel 2019-02-20 17:50:48 UTC

ASPM should already be disabled :
[    0.448125] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[    0.545240] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    0.545270] acpi PNP0A08:00: _OSC failed (AE_ERROR); disabling ASPM
[    0.552454] pci 0000:00:1c.5: ASPM: current common clock configuration is broken, reconfiguring

Btw, it wouldn't explain why I don't see the issue on all networks (unless there is an external factor).

Comment 4 Steve 2019-02-20 18:18:26 UTC

(In reply to Loïc Yhuel from comment #0)
...
> It almost always happens if I connect to the network at work after the laptop
> has been suspended/resumed. But at home everything works fine after a resume.
...

(In reply to Loïc Yhuel from comment #3)
...
> Btw, it wouldn't explain why I don't see the issue on all networks (unless there is an external factor).

What do you have on the other end in both cases? And are your cables Cat 5e or better?

Comment 5 Loïc Yhuel 2019-02-20 19:04:21 UTC

(In reply to Steve from comment #4)
> (In reply to Loïc Yhuel from comment #0)
> ...
> > It almost always happens if I connect to the network at work after the laptop
> > has been suspended/resumed. But at home everything works fine after a resume.
> ...
> 
> (In reply to Loïc Yhuel from comment #3)
> ...
> > Btw, it wouldn't explain why I don't see the issue on all networks (unless there is an external factor).
> 
> What do you have on the other end in both cases? And are your cables Cat 5e
> or better?

Cat 5e cables (I tried several ones).

At home, I'm using a Netgear gigabit switch.
At work, I'm either connected to a big Cisco switch (on a wall plug, so the cables are probably quite long), or to a small gigabit switch (either a TP-Link switch, or the one built into the VoIP phone) connected to the Cisco one. Everything seems fine after a reboot, or when using a USB adapter.

The only difference between the two configurations seems to be the flow control:
 - at home : r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
 - at work : r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off (but it's the same after a reboot with good performance, or after a suspend with the issue)

Comment 6 Loïc Yhuel 2019-02-20 19:06:46 UTC

Created attachment 1536822 [details]
lspci -vvv after a reboot (correct performance)

Comment 7 Loïc Yhuel 2019-02-20 19:07:28 UTC

Created attachment 1536823 [details]
lspci -vvv after a suspend (bad performance)

Comment 8 Loïc Yhuel 2019-02-20 23:19:10 UTC

Looking at the pcie_no_aspm function, it seems the kernel traces at boot only mean that the kernel will keep the HW as configured, so the traces are IMHO misleading (the kernel doesn't have ASPM control, which doesn't mean the UEFI didn't already enable it, and that seems to be the case here).

Comparing the lspci -vvv output, I see ASPM_L1.2 is disabled on the first boot, but gets enabled after a suspend/resume.
It seems to be a difference from the lspci outputs posted on the other bug.
It affects:
 - 00:1c.0 : PCI bridge connected to the RTL8168H
 - 02:00.0 : RTL8168H
 - 00:1d.0 : PCI bridge connected to the NVMe
 - 04:00.0 : NVMe (Toshiba America Info Systems XG4 NVMe SSD Controller)

Does that mean the UEFI boots with ASPM L1.1, but sets ASPM L1.2 on suspend or on resume ?
If it causes the performance issue, I wonder why I can't reproduce it at home (no issue, with the same lspci output), and whether there is a way to explicitly disable the ASPM_L1.2 flag.

I will try the different patches tomorrow, by just rebuilding r8169.ko from linux-stable (linux-4.20.y branch).
I can just say for now that :
 - at home, they don't break anything
 - With https://bugzilla.redhat.com/show_bug.cgi?id=1671958#c60 : "r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control" (if I understand correctly the UEFI enables ASPM, and doesn't allow the kernel to disable it)
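
The lspci comparison described above can be scripted. The following is a minimal Python sketch, not part of the original report. It assumes the two attachments were saved as text files (the file names are supplied on the command line and are hypothetical), and that lspci -vvv prints the enabled L1 sub-states on a line starting with "L1SubCtl1:" under each device, with flags such as "ASPM_L1.2+" or "ASPM_L1.2-". The function names are made up for this example.

```python
import re
import sys

# Matches a device header such as "02:00.0 Ethernet controller: ..."
# (optionally prefixed with a PCI domain like "0000:").
HEADER = re.compile(r"^(?:[0-9a-f]{4}:)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")


def aspm_l12_state(lspci_text):
    """Map each PCI address to True/False for ASPM_L1.2 in its L1SubCtl1 line."""
    states = {}
    device = None
    for line in lspci_text.splitlines():
        header = HEADER.match(line)
        if header:
            device = header.group(1)
            continue
        # Only the control register line is relevant; the capability line
        # (L1SubCap) carries similar flags but only says what is supported.
        if device and "L1SubCtl1:" in line:
            flag = re.search(r"ASPM_L1\.2([+-])", line)
            if flag:
                states[device] = flag.group(1) == "+"
    return states


def changed_devices(before_text, after_text):
    """Return the devices whose ASPM_L1.2 state differs between two dumps."""
    before = aspm_l12_state(before_text)
    after = aspm_l12_state(after_text)
    return sorted(dev for dev in after if before.get(dev) != after[dev])


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
        for dev in changed_devices(f_before.read(), f_after.read()):
            print(dev, "ASPM_L1.2 state changed")
```

Run with the "after reboot" dump as the first argument and the "after suspend" dump as the second; it prints one line per device whose ASPM_L1.2 flag changed.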

Comment 9 Steve 2019-02-21 00:26:50 UTC

Adding Heiner Kallweit to CC list.

Heiner: Do you need any more info for this bug?

Comment 10 Loïc Yhuel 2019-02-21 10:03:21 UTC

With https://bugzilla.redhat.com/show_bug.cgi?id=1671958#c21 :
The issue is immediately fixed after unloading the module and loading the patched one, and survives a suspend/resume cycle.

With https://bugzilla.redhat.com/show_bug.cgi?id=1671958#c47 :
The issue is still present.

With https://bugzilla.redhat.com/show_bug.cgi?id=1671958#c60 :
The issue is still present ("r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control", so I assume it doesn't do anything on my machine).

So, if I understand correctly, enabling the ASPM on the Realtek chip causes my issue (not sure why it doesn't affect all networks), but only if the UEFI had activated ASPM with L1.2, which only happens after a suspend/resume cycle.
Is there a way to keep the "rtl_hw_aspm_clkreq_enable(tp, true)" call, but restrict it to ASPM L1.1?

Is there any additional test I can do ?

Comment 11 Heiner Kallweit 2019-02-21 14:17:01 UTC

(In reply to Steve from comment #9)
> Adding Heiner Kallweit to CC list.
> 
> Heiner: Do you need any more info for this bug?

The existing information is enough (at least with regard to what I need from users).
The cause of the issue seems to be that the chip's RX FIFO is too small to buffer all incoming packets while PCIe exits deeper power-saving states (at least L1.2; I'm not sure whether this applies to L1.1 too). The deeper the power-saving state, the longer the exit latency. I also checked with Realtek what they do in their Windows driver: they have some heuristics and dynamically switch off ASPM under higher load, which is a not-too-nice workaround.

It is unclear which chip versions are affected. So far we have reports about RTL8168h only, but this could be because newer boards, which support L1 sub-states, often have this Realtek chip version on board. I have to check whether L1 sub-states can be disabled selectively. So far I'm only aware of the option to disable L1 completely.

Comment 12 Steve 2019-02-21 18:28:27 UTC

(In reply to Heiner Kallweit from comment #11)
...
> The existing information is enough (at least with regard to what I need from users).
...

Thanks for your informative explanation.

(In reply to Loïc Yhuel from comment #3)
...
> Btw, it wouldn't explain why I don't see the issue on all networks (unless there is an external factor).

Devices on both ends have to agree on the power state of the link:

Active State Power Management
October 22, 2009 
https://teledynelecroy.com/doc/active-state-power-management

ethtool shows some info about the "Link partner":

# ethtool <device name from "nmcli d">

Comment 13 Steve 2019-02-21 20:49:41 UTC

(In reply to Heiner Kallweit from comment #11)
...
> The deeper the power-saving state, the longer the exit latency. I also checked with
> Realtek what they do in their Windows driver: they have some heuristics
> and dynamically switch off ASPM under higher load, which is a not-too-nice workaround.
...

Not sure if this is helpful, but some PCIe devices support "Latency Tolerance Reporting". From the attached lspci output:

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
...
    Capabilities: [170 v1] Latency Tolerance Reporting
        Max snoop latency: 3145728ns
        Max no snoop latency: 3145728ns
...

And for my RTL8168g/8111g:

# lspci -s 03:00.0 -vvv
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 11)
...
    Capabilities: [170 v1] Latency Tolerance Reporting
        Max snoop latency: 71680ns
        Max no snoop latency: 71680ns
...

A web search for "Latency Tolerance Reporting" found this:

Using L1 Sub-States to Reduce Power Consumption in PCI Express-Based Devices
By Scott Knowlton, Sr. Product Marketing Manager, Synopsys, Inc.
https://www.synopsys.com/designware-ip/technical-bulletin/reduce-power-consumption.html

Comment 14 Loïc Yhuel 2019-02-21 20:54:18 UTC

(In reply to Heiner Kallweit from comment #11)
> The cause of the issue seems to be that the chip's RX FIFO is too small to
> buffer all incoming packets while PCIe exits deeper power-saving states (at
> least L1.2; I'm not sure whether this applies to L1.1 too). The deeper the
> power-saving state, the longer the exit latency. I also checked with
> Realtek what they do in their Windows driver: they have some heuristics
> and dynamically switch off ASPM under higher load, which is a not-too-nice
> workaround.
> 
So the network differences probably come from the rx pause support :
 - when it's enabled, the chip is probably able to ask the link partner to pause when the RX FIFO is full until the PCIe link is operational
 - when it isn't supported by the link partner, there is a risk of frame loss, which could explain the performance issues

I wonder if the resulting packet loss, causing for example the TCP sender to slow down, could be enough to trigger ASPM again, creating a vicious circle.
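
To get a feel for the buffering problem, here is a back-of-the-envelope Python sketch, not part of the original comment. It computes how many bytes arrive on a saturated 1Gbps link during a given delay. The two delays are the "Max snoop latency" values from the lspci output in Comment 13; they are upper bounds reported through LTR, not measured PCIe exit latencies, and the thread does not state the chip's RX FIFO size, so this only illustrates the scale involved.

```python
NS_PER_SECOND = 1_000_000_000
GIGABIT_PER_SECOND = 1_000_000_000  # link speed in bits per second


def bytes_arriving(link_bps, latency_ns):
    """Bytes received at full line rate while the PCIe link is unavailable."""
    return link_bps * latency_ns // (8 * NS_PER_SECOND)


# LTR "Max snoop latency" values quoted in Comment 13 (assumed here as
# worst-case delays, which is an illustration and not a measurement).
for latency_ns in (71_680, 3_145_728):
    print(latency_ns, "ns ->", bytes_arriving(GIGABIT_PER_SECOND, latency_ns), "bytes")
```

At 1Gbps one byte arrives every 8ns, so 71680ns corresponds to 8960 bytes and 3145728ns to 393216 bytes. Without pause frames, anything beyond what the RX FIFO can hold during such a delay would be dropped.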

Comment 15 Steve 2019-02-22 02:20:06 UTC

(In reply to Loïc Yhuel from comment #14)
...
> So the network differences probably come from the rx pause support :
...

If that is implemented with "pause frames", ethtool will show what is supported by the link partner. Could you post the output for:

# ethtool <device name from "nmcli d">

with your two test networks?

For comparison, see Bug 1671958, Comment 44.

Comment 16 Loïc Yhuel 2019-02-26 09:44:41 UTC

(In reply to Steve from comment #15)
> (In reply to Loïc Yhuel from comment #14)
> ...
> > So the network differences probably come from the rx pause support :
> ...
> 
> If that is implemented with "pause frames", ethtool will show what is
> supported by the link partner. Could you post the output for:
> 
> # ethtool <device name from "nmcli d">
> 
> with your two test networks?
> 
> For comparison, see Bug 1671958, Comment 44.
ethtool enp2s0 | grep pause :
 - at home :
        Supported pause frame use: Symmetric Receive-only
        Advertised pause frame use: Symmetric Receive-only
        Link partner advertised pause frame use: Symmetric Receive-only
 - at work :
        Supported pause frame use: Symmetric Receive-only
        Advertised pause frame use: Symmetric Receive-only
        Link partner advertised pause frame use: No

Which matches the kernel traces in Comment 5:
The only difference between the two configurations seems to be the flow control:
 - at home : r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control rx/tx
 - at work : r8169 0000:02:00.0 enp2s0: Link is Up - 1Gbps/Full - flow control off
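
The check above can be automated when comparing several networks. This is a minimal Python sketch, not part of the original comment, which parses the "pause frame use" lines of ethtool output as they appear above. It only looks at what the link partner advertises and does not implement the full pause resolution rules; the function names are made up for this example.

```python
def pause_fields(ethtool_text):
    """Collect the 'pause frame use' lines of ethtool output into a dict."""
    fields = {}
    for line in ethtool_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and "pause frame use" in key:
            fields[key.strip()] = value.strip()
    return fields


def partner_advertises_pause(ethtool_text):
    """True if the link partner advertises any kind of pause frame support."""
    value = pause_fields(ethtool_text).get("Link partner advertised pause frame use")
    return value is not None and value != "No"


if __name__ == "__main__":
    work = """\
        Supported pause frame use: Symmetric Receive-only
        Advertised pause frame use: Symmetric Receive-only
        Link partner advertised pause frame use: No
    """
    print("partner advertises pause:", partner_advertises_pause(work))
```

With the "at work" output it reports False, and with the "at home" output it reports True, matching the "flow control off" and "flow control rx/tx" kernel messages.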

Comment 17 Steve 2019-02-27 14:24:13 UTC

(In reply to Loïc Yhuel from comment #16)
...
>  - at work :
...
>         Link partner advertised pause frame use: No
...

Thanks. Was that while connected directly to the Cisco switch?

Apparently flow control is disabled intentionally in some cases. Further:

"If you do choose to disable flow control, it makes the most sense to disable it on both endpoints. Mismatched configuration could potentially cause performance issues or other problems."

To flow or not to flow?
Justin Parisi
May 7, 2015
https://blogs.cisco.com/perspectives/to-flow-or-not-to-flow

Comment 18 Laura Abbott 2019-04-09 20:44:08 UTC

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.
 
Fedora 29 has now been rebased to 5.0.6. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.
 
If you experience different issues, please open a new bug report for those.

Comment 19 Loïc Yhuel 2019-04-11 11:15:28 UTC

The issue is still present with kernel-5.0.7-200.fc29.x86_64.

Comment 20 jeffj1101 2019-05-06 21:50:25 UTC

I can confirm this issue is still present on F30. The problem is with the kernel not loading r8169 module when resuming from a suspend. The network link stays down. Therefore I get no network after resuming from a suspend (using Ethernet). Running 'modprobe r8169' solves the problem. Below is a snippet from the journalctl. 

Fedora 30
Kernel: 5.0.11-300.fc30.x86_64
NetworkManager.x86_64 ver: 1:1.16.0-1.fc30



May 06 22:02:25  avahi-daemon[765]: Joining mDNS multicast group on interface enp2s0.IPv6 with address <IP ADDRSS REMOVED FOR THIS BUG REPORT>.
May 06 22:02:25  NetworkManager[869]: <info>  [1557176545.8710] manager: sleep: wake requested (sleeping: yes  enabled: yes)
May 06 22:02:25  avahi-daemon[765]: Registering new address record for <IP ADDRSS REMOVED FOR THIS BUG REPORT> on enp2s0.*.
May 06 22:02:25  NetworkManager[869]: <info>  [1557176545.8711] device (enp2s0): state change: activated -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
May 06 22:02:25  avahi-daemon[765]: Withdrawing address record for <IP ADDRSS REMOVED FOR THIS BUG REPORT> on enp2s0.
May 06 22:02:25  NetworkManager[869]: <info>  [1557176545.8769] dhcp4 (enp2s0): canceled DHCP transaction, DHCP client pid 1108
May 06 22:02:25  avahi-daemon[765]: Leaving mDNS multicast group on interface enp2s0.IPv6 with address <IP ADDRSS REMOVED FOR THIS BUG REPORT>.
May 06 22:02:25  NetworkManager[869]: <info>  [1557176545.8769] dhcp4 (enp2s0): state changed bound -> done
May 06 22:02:25  avahi-daemon[765]: Interface enp2s0.IPv6 no longer relevant for mDNS.
May 06 22:02:25  NetworkManager[869]: <info>  [1557176545.8779] dhcp6 (enp2s0): canceled DHCP transaction
May 06 22:02:25  dnsmasq[1089]: reading /etc/resolv.conf
May 06 22:02:25  NetworkManager[869]: <info>  [1557176545.8779] dhcp6 (enp2s0): state changed terminated -> done
May 06 22:02:25  dnsmasq[1089]: using nameserver <IP ADDRSS REMOVED FOR THIS BUG REPORT>
May 06 22:02:25  systemd[1]: Stopped target Suspend.
May 06 22:02:25  dnsmasq[1089]: no servers found in /etc/resolv.conf, will retry
May 06 22:02:25  NetworkManager[869]: <info>  [1557176545.8973] manager: NetworkManager state is now CONNECTED_GLOBAL
May 06 22:02:25  NetworkManager[869]: <info>  [1557176545.9121] device (enp2s0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'managed')
May 06 22:02:25  kernel: Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
May 06 22:02:25  systemd[1]: Starting Network Manager Script Dispatcher Service...
May 06 22:02:25  systemd[1]: Stopped target Bluetooth.
May 06 22:02:25  systemd[1]: Reached target Bluetooth.
May 06 22:02:26  kernel: r8169 0000:02:00.0 enp2s0: Link is Down
May 06 22:02:26  NetworkManager[869]: <warn>  [1557176546.0404] dns-sd-resolved[0x558e2e723420]: Failed: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Could not activate remote peer.
May 06 22:02:26  NetworkManager[869]: <warn>  [1557176546.0404] dns-sd-resolved[0x558e2e723420]: Failed: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Could not activate remote peer.
May 06 22:02:26  NetworkManager[869]: <warn>  [1557176546.0405] dns-sd-resolved[0x558e2e723420]: Failed: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Could not activate remote peer.
May 06 22:02:26  NetworkManager[869]: <warn>  [1557176546.0405] dns-sd-resolved[0x558e2e723420]: Failed: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Could not activate remote peer.
May 06 22:02:27  kernel: r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
May 06 22:02:27  kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready
May 06 22:02:27  NetworkManager[869]: <info>  [1557176547.6671] device (enp2s0): carrier: link connected
May 06 22:02:27  NetworkManager[869]: <info>  [1557176547.6678] device (enp2s0): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
May 06 22:02:27  NetworkManager[869]: <info>  [1557176547.6694] policy: auto-activating connection 'enp2s0' (<IP ADDRSS REMOVED FOR THIS BUG REPORT>)
May 06 22:02:27  NetworkManager[869]: <info>  [1557176547.6709] device (enp2s0): Activation: starting connection 'enp2s0' (<IP ADDRSS REMOVED FOR THIS BUG REPORT>)
May 06 22:02:27  NetworkManager[869]: <info>  [1557176547.6711] device (enp2s0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
May 06 22:02:27  NetworkManager[869]: <info>  [1557176547.6721] manager: NetworkManager state is now CONNECTING
May 06 22:02:27  NetworkManager[869]: <info>  [1557176547.6724] device (enp2s0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
May 06 22:02:27  audit: NETFILTER_CFG table=raw family=2 entries=28
May 06 22:02:27  audit: NETFILTER_CFG table=mangle family=2 entries=43
May 06 22:02:27  audit: NETFILTER_CFG table=nat family=2 entries=58
May 06 22:02:27  audit: NETFILTER_CFG table=filter family=2 entries=107
May 06 22:02:27  audit: NETFILTER_CFG table=raw family=10 entries=30
May 06 22:02:27  audit: NETFILTER_CFG table=mangle family=10 entries=42
May 06 22:02:27  audit: NETFILTER_CFG table=nat family=10 entries=53
May 06 22:02:27  audit: NETFILTER_CFG table=filter family=10 entries=98
May 06 22:02:27  NetworkManager[869]: <info>  [1557176547.6887] device (enp2s0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
May 06 22:02:27  NetworkManager[869]: <info>  [1557176547.6892] dhcp4 (enp2s0): activation: beginning transaction (timeout in 45 seconds)
May 06 22:02:27  NetworkManager[869]: <info>  [1557176547.6915] dhcp4 (enp2s0): dhclient started with pid 1460
May 06 22:02:27  avahi-daemon[765]: Joining mDNS multicast group on interface enp2s0.IPv6 with address <IP ADDRSS REMOVED FOR THIS BUG REPORT>.
May 06 22:02:27  avahi-daemon[765]: New relevant interface enp2s0.IPv6 for mDNS.
May 06 22:02:27  avahi-daemon[765]: Registering new address record for <IP ADDRSS REMOVED FOR THIS BUG REPORT> on enp2s0.*.
May 06 22:02:27  dhclient[1460]: DHCPREQUEST on enp2s0 to <IP ADDRSS REMOVED FOR THIS BUG REPORT> port 67 (xid=0x919f5a7e)
May 06 22:02:28  kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 06 22:02:28  kernel: ata1.00: configured for UDMA/100
May 06 22:02:28  systemd[1]: sssd-kcm.service: Succeeded.
May 06 22:02:28  audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=sssd-kcm comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 06 22:02:28  systemd[1]: Starting SSSD Kerberos Cache Manager...
May 06 22:02:28  ModemManager[776]: <info>  Couldn't check support for device '/sys/devices/pci0000:00/0000:00:03.1/0000:02:00.0': not supported by any plugin
May 06 22:02:28  systemd[1]: Started Load/Save RF Kill Switch Status.
May 06 22:02:28  audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-rfkill comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 06 22:02:28  audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res>
May 06 22:02:28  nm-dispatcher[1450]: req:1 'down' [enp2s0]: new request (3 scripts)
May 06 22:02:28  systemd[1]: Started Network Manager Script Dispatcher Service.
May 06 22:02:28  nm-dispatcher[1450]: req:1 'down' [enp2s0]: start running ordered scripts...
May 06 22:02:29  chronyd[782]: Forward time jump detected!

Comment 21 Loïc Yhuel 2019-05-06 21:55:38 UTC

(In reply to jeffj1101 from comment #20)
> I can confirm this issue is still present on F30. The problem is with the
> kernel not loading r8169 module when resuming from a suspend. The network
> link stays down. Therefore I get no network after resuming from a suspend
> (using Ethernet). Running 'modprobe r8169' solves the problem. Below is a
> snippet from the journalctl. 
> 
This is a different issue.

Comment 22 Loïc Yhuel 2019-05-16 11:33:15 UTC

The problem is still present with 5.0.16-200.fc29.x86_64, so the linux-stable commits "r8169: disable ASPM again" and "r8169: disable default rx interrupt coalescing on RTL8168" do not fix the issue for me.
It's probably due to the lack of ASPM control : "r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control".

Commenting out the "rtl_hw_aspm_clkreq_enable(tp, true);" call in rtl_hw_start_8168h_1 still works.

Comment 23 Heiner Kallweit 2019-05-16 17:48:02 UTC

You can try to disable ASPM in the BIOS, or set the kernel boot parameter pcie_aspm.policy=performance.
You can achieve the same with "echo performance > /sys/module/pcie_aspm/parameters/policy".

"Commenting out the "rtl_hw_aspm_clkreq_enable(tp, true);" call in rtl_hw_start_8168h_1 still works."
Changing the code this way would prevent unaffected users who want ASPM from enabling it (by an upcoming sysfs attribute).

Comment 24 Loïc Yhuel 2019-05-16 18:22:56 UTC

(In reply to Heiner Kallweit from comment #23)
> You can try to disable ASPM in the BIOS.
There is no option for this.

> Or set kernel boot parameter
> pcie_aspm.policy=performance.
It has no effect (the dmesg doesn't change, except the cmdline of course), probably due to Linux having no control over ASPM :
# dmesg | grep -i aspm
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.0.16-200.fc29.x86_64 root=UUID=5e7506af-3135-48f0-8612-e0da08ead4f2 ro rd.md=0 rd.lvm=0 rd.dm=0 rd.luks=0 vconsole.keymap=fr LANG=en_US.UTF-8 rhgb quiet selinux=0 audit=0 libahci.ignore_sss=1 raid=noautodetect nouveau.noaccel=1 pcie_aspm.policy=performance
[    0.350793] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.0.16-200.fc29.x86_64 root=UUID=5e7506af-3135-48f0-8612-e0da08ead4f2 ro rd.md=0 rd.lvm=0 rd.dm=0 rd.luks=0 vconsole.keymap=fr LANG=en_US.UTF-8 rhgb quiet selinux=0 audit=0 libahci.ignore_sss=1 raid=noautodetect nouveau.noaccel=1 pcie_aspm.policy=performance
[    0.471527] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it
[    0.569377] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI]
[    0.569408] acpi PNP0A08:00: _OSC failed (AE_ERROR); disabling ASPM
[    0.576670] pci 0000:00:1c.5: ASPM: current common clock configuration is broken, reconfiguring
[    1.957109] r8169 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control

> You can achieve the same with "echo performance >
> /sys/module/pcie_aspm/parameters/policy".
echo: write error: Operation not permitted
pcie_aspm_set_policy returns -EPERM
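
The same check can be done from a script. This is a minimal Python sketch, not part of the original comment. It assumes the sysfs file lists all available policies on one line with the active one in square brackets (for example "default performance [powersave] powersupersave"), and it treats a failed write, such as the EPERM seen above, as a normal outcome to report. The function names are made up for this example.

```python
import re
from pathlib import Path

POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_policy(policy_text):
    """Return the policy shown in square brackets, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


def try_set_policy(name, path=POLICY_PATH):
    """Try to write a policy name; return False instead of raising on failure."""
    try:
        path.write_text(name)
        return True
    except OSError as exc:  # includes PermissionError (EPERM) and missing file
        print(f"cannot set ASPM policy to {name!r}: {exc}")
        return False


if __name__ == "__main__":
    if POLICY_PATH.exists():
        print("active policy:", active_policy(POLICY_PATH.read_text()))
        try_set_policy("performance")
```

On a machine where the firmware does not hand ASPM control to the OS, the write is expected to fail in the same way as the echo command above.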

Comment 25 Heiner Kallweit 2019-05-16 18:57:27 UTC

(In reply to Loïc Yhuel from comment #24)
> (In reply to Heiner Kallweit from comment #23)
> # dmesg | grep -i aspm
> [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.0.16-200.fc29.x86_64
> root=UUID=5e7506af-3135-48f0-8612-e0da08ead4f2 ro rd.md=0 rd.lvm=0 rd.dm=0
> rd.luks=0 vconsole.keymap=fr LANG=en_US.UTF-8 rhgb quiet selinux=0 audit=0
> libahci.ignore_sss=1 raid=noautodetect nouveau.noaccel=1
> pcie_aspm.policy=performance
> [    0.350793] Kernel command line:
> BOOT_IMAGE=/boot/vmlinuz-5.0.16-200.fc29.x86_64
> root=UUID=5e7506af-3135-48f0-8612-e0da08ead4f2 ro rd.md=0 rd.lvm=0 rd.dm=0
> rd.luks=0 vconsole.keymap=fr LANG=en_US.UTF-8 rhgb quiet selinux=0 audit=0
> libahci.ignore_sss=1 raid=noautodetect nouveau.noaccel=1
> pcie_aspm.policy=performance
> [    0.471527] ACPI FADT declares the system doesn't support PCIe ASPM, so
> disable it
> [    0.569377] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM
> ClockPM Segments MSI]
> [    0.569408] acpi PNP0A08:00: _OSC failed (AE_ERROR); disabling ASPM
> [    0.576670] pci 0000:00:1c.5: ASPM: current common clock configuration is
> broken, reconfiguring

All this doesn't look good. Seems like your BIOS is completely broken with regard to ASPM.

Comment 26 Loïc Yhuel 2019-05-16 20:49:01 UTC

(In reply to Heiner Kallweit from comment #25)
> All this doesn't look good. Seems like your BIOS is completely broken with
> regard to ASPM.
I tried booting the Fedora 30 ISO in legacy mode just in case, but I see the same errors as in UEFI mode.

Now (not sure if it's the latest BIOS or the kernel changes) ASPM_L1.2 is enabled on the RTL8168 even on cold boot, so the bug appears without having to suspend then resume.
At least it's more consistent...

Comment 27 Justin M. Forbes 2019-08-20 17:39:41 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 5.2.9-100.fc29.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.

Comment 28 Loïc Yhuel 2019-08-26 09:06:50 UTC

The issue is still present with 5.2.9-100.fc29.x86_64.

Comment 29 Heiner Kallweit 2019-08-26 13:38:23 UTC

Can you check with a 5.3-rc kernel?

Comment 30 Loïc Yhuel 2019-08-27 10:03:21 UTC

With 5.3.0-0.rc5.git0.1.fc31.x86_64, I cannot reproduce the issue any more.

I guess your "r8169: don't activate ASPM in chip if OS can't control ASPM" fixed it, thanks.

Comment 31 Ben Cotton 2019-10-31 18:54:40 UTC

This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 29 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 32 Ben Cotton 2019-11-27 23:18:15 UTC

Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
