Bug 1737207 - r8169 frequently drops network link on kernel >= 5.1
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 29
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-08-03 23:44 UTC by Michael Chapman
Modified: 2019-11-27 23:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-27 23:28:27 UTC
Type: Bug
Embargoed:


Description Michael Chapman 2019-08-03 23:44:55 UTC
On all kernel-5.1.x versions, the r8169 NIC frequently drops then reacquires its link:

[Aug 3 23:49] r8169 0000:05:00.0 lan: Link is Down
[  +3.499100] r8169 0000:05:00.0 lan: Link is Up - 100Mbps/Full - flow control off
[Aug 3 23:55] r8169 0000:05:00.0 lan: Link is Down
[  +3.509606] r8169 0000:05:00.0 lan: Link is Up - 100Mbps/Full - flow control off
[Aug 3 23:56] r8169 0000:05:00.0 lan: Link is Down
[  +3.499138] r8169 0000:05:00.0 lan: Link is Up - 100Mbps/Full - flow control off
[ +25.505626] r8169 0000:05:00.0 lan: Link is Down
[  +3.498342] r8169 0000:05:00.0 lan: Link is Up - 100Mbps/Full - flow control off
...
[Aug 4 04:14] r8169 0000:05:00.0 lan: Link is Down
[  +3.509697] r8169 0000:05:00.0 lan: Link is Up - 100Mbps/Full - flow control off
[ +34.337795] r8169 0000:05:00.0 lan: Link is Down
[  +3.498381] r8169 0000:05:00.0 lan: Link is Up - 100Mbps/Full - flow control off
[Aug 4 04:15] r8169 0000:05:00.0 lan: Link is Down
[  +3.498463] r8169 0000:05:00.0 lan: Link is Up - 100Mbps/Full - flow control off

These errors started at around six hours of uptime, continued for about five hours, then suddenly stopped. It's been a few hours since then, and I'm not sure if they will reappear.

This problem has been encountered on these kernels:

  kernel-5.1.15-200.fc29.x86_64
  kernel-5.1.16-200.fc29.x86_64
  kernel-5.1.21-200.fc29.x86_64

The last known good kernel is:

  kernel-5.0.18-200.fc29.x86_64

I note that there are a couple of other bug reports related to r8169 and suspend-to-RAM (e.g. bug #1580079, bug #1679140); however, I am getting these errors *without* suspending my machine.

Comment 1 Michael Chapman 2019-08-03 23:57:01 UTC
> It's been a few hours since then, and I'm not sure if they will reappear.

And literally the moment I posted this bug, the problems started again:

[Aug 4 09:45] r8169 0000:05:00.0 lan: Link is Down
[  +3.508534] r8169 0000:05:00.0 lan: Link is Up - 100Mbps/Full - flow control off
[Aug 4 09:46] r8169 0000:05:00.0 lan: Link is Down
[  +3.509945] r8169 0000:05:00.0 lan: Link is Up - 100Mbps/Full - flow control off
[ +12.851776] r8169 0000:05:00.0 lan: Link is Down
[  +3.498260] r8169 0000:05:00.0 lan: Link is Up - 100Mbps/Full - flow control off
...

There's no obvious correlation with any local activity. This machine does a small amount of network traffic almost all of the time.

Comment 2 Michael Chapman 2019-08-04 00:49:33 UTC
Forgot to provide the hardware details:

05:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)

Comment 3 Heiner Kallweit 2019-08-04 08:47:34 UTC
This issue seems to be system-dependent. It would be best if you could bisect it.

Comment 4 Heiner Kallweit 2019-08-05 08:56:05 UTC
What you could try in addition:
1. Disable EEE: ethtool --set-eee <if> eee off
2. Check a more recent kernel (5.2, 5.3-rc, linux-next)

Comment 5 Michael Chapman 2019-08-07 07:21:15 UTC
So just a quick update, I am in the middle of bisecting the problem across v5.0..v5.1.15. That's a lot of commits though, and it will take a while -- especially since it seems like I have to leave the machine for many hours to see whether the problem appears.

I might put aside the current bisection and start a new one only considering changes in drivers/net/ethernet/realtek/. That might help me narrow the range down more quickly.
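
A path-limited bisection of the kind described above, together with a way to judge each step from the kernel log, could look roughly like the sketch below. The `v5.0` and `v5.1.15` endpoints come from the range quoted above; the `link_down_count` helper is an illustrative assumption, and a count of zero proves little unless the machine has been up for many hours.

```shell
#!/bin/sh
# Sketch only. Restrict the bisection to the Realtek driver directory
# (bad revision first, then good, then the path filter):
#
#   git bisect start v5.1.15 v5.0 -- drivers/net/ethernet/realtek/
#
# After booting each candidate kernel and waiting, count link drops.

# link_down_count (hypothetical helper): reads kernel log text on stdin
# and prints how many r8169 "Link is Down" messages it contains.
link_down_count() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so the exit status is deliberately ignored.
    grep -c 'r8169 .*Link is Down' || true
}

# Typical use on a running system:
#   dmesg | link_down_count
```

Limiting the bisection to one directory assumes the regression really lives in the driver; if it does not, the bisection ends on an unrelated commit and the full range has to be searched after all.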

> 1. Disable EEE: ethtool --set-eee <if> eee off

I've only been able to try this once when the problem was happening. It did immediately make the problem stop... but I'm not entirely sure if this was permanent.

My bisection is *not* currently covering this commit:

    commit b6c7fa401625d949e5e370f32e74f22c3bbaed51
    Author: Heiner Kallweit <hkallweit1>
    Date:   Fri Jan 25 20:39:42 2019 +0100

        r8169: enable EEE per default on chip versions from RTL8168g

        Enable EEE per default on chip versions from RTL8168g.

        Signed-off-by: Heiner Kallweit <hkallweit1>
        Signed-off-by: David S. Miller <davem>

which I'm guessing you think might be where the problem started. But that could just be because I've erroneously marked some bad commits as good, having not waited long enough for the problems to appear.

> 2. Check a more recent kernel (5.2, 5.3-rc, linux-next)

I've yet to try this.

Comment 6 Heiner Kallweit 2019-08-07 08:10:32 UTC
If the issue is really caused by EEE, then it could also be the link partner who's to blame. Especially early EEE implementations of different vendors often had compatibility problems.
So you could check with a different link partner, or simply add the command to disable EEE to a startup script.
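
As a rough illustration of the startup-script approach, the following sketch wraps the `ethtool` command from comment 4 in a small shell function. The function name `disable_eee`, the `ETHTOOL` override variable, and the interface name `lan` (taken from the log lines in the description) are assumptions for illustration; how the script gets hooked into boot depends on the network manager in use.

```shell
#!/bin/sh
# Sketch only: disable Energy-Efficient Ethernet (EEE) on one interface.

# ETHTOOL can be overridden (e.g. for a dry run); defaults to ethtool.
: "${ETHTOOL:=ethtool}"

# disable_eee IFACE (hypothetical helper)
disable_eee() {
    iface=$1
    if [ -z "$iface" ]; then
        echo "usage: disable_eee IFACE" >&2
        return 2
    fi
    # Same command as suggested in comment 4.
    "$ETHTOOL" --set-eee "$iface" eee off
}

# Example (requires root):
#   disable_eee lan
#   ethtool --show-eee lan   # check the resulting EEE state
```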

Comment 7 Michael Chapman 2019-08-09 07:30:28 UTC
OK, the bisection homed in on that commit this time. So yes, looks like the issue is due to EEE.

My specific hardware appears to be an 8168h:

   [    2.123709] r8169 0000:05:00.0 eth0: RTL8168h/8111h, 1c:1b:0d:73:8c:84, XID 541, IRQ 124
   [    2.123921] r8169 0000:05:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]

but I'm guessing if this problem hasn't been reported by anyone else the problem is really with my switch. Disabling EEE manually does appear to make the problem go away.

Comment 8 Heiner Kallweit 2019-08-10 16:53:38 UTC
I checked EEE settings against the vendor driver r8168 and it has additional settings for RTL8168h.
Could you please apply the following patch, re-enable EEE, and check whether the issue is gone?
On kernel versions before 5.3 you would have to adjust the patch due to the renaming r8169.c -> r8169_main.c.


diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index ad219f87a..7cd9135d0 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -2420,6 +2420,16 @@ static void rtl8168g_config_eee_phy(struct rtl8169_private *tp)
 	phy_modify_paged(tp->phydev, 0x0a43, 0x11, 0, BIT(4));
 }
 
+static void rtl8168h_config_eee_phy(struct rtl8169_private *tp)
+{
+	struct phy_device *phydev = tp->phydev;
+
+	rtl8168g_config_eee_phy(tp);
+
+	phy_modify_paged(phydev, 0xa4a, 0x11, 0x0000, 0x0200);
+	phy_modify_paged(phydev, 0xa42, 0x14, 0x0000, 0x0080);
+}
+
 static void rtl8169s_hw_phy_config(struct rtl8169_private *tp)
 {
 	static const struct phy_reg phy_reg_init[] = {
@@ -3487,7 +3497,7 @@ static void rtl8168h_1_hw_phy_config(struct rtl8169_private *tp)
 	phy_modify_paged(tp->phydev, 0x0a44, 0x11, BIT(7), 0);
 
 	rtl8168g_disable_aldps(tp);
-	rtl8168g_config_eee_phy(tp);
+	rtl8168h_config_eee_phy(tp);
 	rtl_enable_eee(tp);
 }
 
-- 
2.22.0
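
For kernels older than 5.3, the file rename mentioned above can be handled by rewriting the paths in the patch before applying it. The sketch below assumes the patch is saved as `eee-8168h.patch` (a hypothetical filename) and that only the file name differs; the hunk offsets may still not match an older tree, in which case `patch` will report fuzz or rejects that need manual fixing.

```shell
#!/bin/sh
# Sketch only: retarget a patch from r8169_main.c to the pre-5.3 name r8169.c.

# retarget_patch (hypothetical helper): filters a patch from stdin to stdout,
# rewriting only the file-header lines, never the hunk contents.
retarget_patch() {
    sed -e '/^diff --git /s#/r8169_main\.c#/r8169.c#g' \
        -e '/^--- a\//s#/r8169_main\.c#/r8169.c#' \
        -e '/^+++ b\//s#/r8169_main\.c#/r8169.c#'
}

# Example, run from the top of the kernel source tree:
#   retarget_patch < eee-8168h.patch | patch -p1 --dry-run
```

Running with `--dry-run` first shows whether the hunks apply cleanly without touching the tree.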

Comment 9 Michael Chapman 2019-08-11 00:26:08 UTC
No, this patch did not help.

Comment 10 Heiner Kallweit 2019-08-11 08:51:49 UTC
OK. Thanks for testing anyway.

Comment 11 Justin M. Forbes 2019-08-20 17:45:48 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 5.2.9-100.fc29.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.

Comment 12 Michael Chapman 2019-09-05 12:02:30 UTC
Any chance for a module parameter to disable EEE on this driver completely?

Comment 13 Michael Chapman 2019-09-05 12:05:24 UTC
Whoops, submitted before finishing my comment.

Any chance for a module parameter to disable EEE on this driver completely? Unfortunately it's very difficult to get `ethtool --set-eee $if eee off` called at the right time. I was hoping I'd be able to run it from udev, but that is not reliable. This setting can't be applied while the link is down, and the EEE state is always re-enabled whenever the link is brought up.
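
One possible way to work around the timing problem from user space is to wait until the kernel reports carrier on the interface and only then run `ethtool`. The sketch below polls the `carrier` attribute under `/sys/class/net`; the helper name `wait_for_carrier`, the `SYSFS_NET` override, and the one-second polling interval are assumptions for illustration. Because the EEE setting is reset on every link-up, such a script would have to run again after each link cycle, so it is at best a stopgap.

```shell
#!/bin/sh
# Sketch only: wait for link-up before applying a setting.

# Root of the per-interface sysfs tree; overridable for testing.
: "${SYSFS_NET:=/sys/class/net}"

# wait_for_carrier IFACE [TRIES] (hypothetical helper)
# Polls once per second; returns 0 once carrier reads 1,
# or 1 if it never does within TRIES attempts (default 30).
wait_for_carrier() {
    iface=$1
    tries=${2:-30}
    while [ "$tries" -gt 0 ]; do
        # Reading carrier fails while the interface is administratively
        # down, so a read error is treated like "no carrier".
        state=$(cat "$SYSFS_NET/$iface/carrier" 2>/dev/null) || state=0
        [ "$state" = "1" ] && return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep 1
    done
    return 1
}

# Example:
#   wait_for_carrier lan 60 && ethtool --set-eee lan eee off
```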

Comment 14 Heiner Kallweit 2019-09-25 19:58:43 UTC
Typically all network managers provide an option to run scripts after having brought up the network, e.g. to set system time from network.
I agree that it's not too nice that on a link down/up cycle EEE is reset to what is supported, and doesn't respect a user setting.
Can you test the following on top of linux-next? It should prevent this.

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 74f81fe03..b5694c248 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -680,6 +680,7 @@ struct rtl8169_private {
 	struct rtl8169_counters *counters;
 	struct rtl8169_tc_offsets tc_offset;
 	u32 saved_wolopts;
+	int eee_adv;
 
 	const char *fw_name;
 	struct rtl_fw *rtl_fw;
@@ -2067,6 +2068,10 @@ static int rtl8169_set_eee(struct net_device *dev, struct ethtool_eee *data)
 	}
 
 	ret = phy_ethtool_set_eee(tp->phydev, data);
+
+	if (!ret)
+		tp->eee_adv = phy_read_mmd(dev->phydev, MDIO_MMD_AN,
+					   MDIO_AN_EEE_ADV);
 out:
 	pm_runtime_put_noidle(d);
 	return ret;
@@ -2097,7 +2102,16 @@ static const struct ethtool_ops rtl8169_ethtool_ops = {
 static void rtl_enable_eee(struct rtl8169_private *tp)
 {
 	struct phy_device *phydev = tp->phydev;
-	int supported = phy_read_mmd(phydev, MDIO_MMD_PCS, MDIO_PCS_EEE_ABLE);
+	int supported;
+
+	/* respect EEE advertisement the user may have set */
+	if (tp->eee_adv >= 0) {
+		phy_write_mmd(phydev, MDIO_MMD_AN, MDIO_AN_EEE_ADV,
+			      tp->eee_adv);
+		return;
+	}
+
+	supported = phy_read_mmd(phydev, MDIO_MMD_PCS, MDIO_PCS_EEE_ABLE);
 
 	if (supported > 0)
 		phy_write_mmd(phydev, MDIO_MMD_AN, MDIO_AN_EEE_ADV, supported);
@@ -7069,6 +7083,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	tp->pci_dev = pdev;
 	tp->msg_enable = netif_msg_init(debug.msg_enable, R8169_MSG_DEFAULT);
 	tp->supports_gmii = ent->driver_data == RTL_CFG_NO_GBIT ? 0 : 1;
+	tp->eee_adv = -1;
 
 	/* Get the *optional* external "ether_clk" used on some boards */
 	rc = rtl_get_ether_clk(tp);
-- 
2.23.0

Comment 15 Michael Chapman 2019-09-26 07:44:45 UTC
(In reply to Heiner Kallweit from comment #14)
> Typically all network managers provide an option to run scripts after having
> brought up the network, e.g. to set system time from network.

Not systemd-networkd, and my experience with that project tells me their response will be "we want to apply these settings _before_ bringing the link up, and the kernel shouldn't touch it just because the link was cycled; therefore, this is a kernel bug".

> I agree that it's not too nice that on a link down/up cycle EEE is reset to
> what is supported, and doesn't respect a user setting.
> Can you test the following on top of linux-next? It should prevent this.

Thanks, I'll give it a go!

Comment 16 Ben Cotton 2019-10-31 18:48:49 UTC
This message is a reminder that Fedora 29 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '29'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 29 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 17 Ben Cotton 2019-11-27 23:28:27 UTC
Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

