1650984 – RTL8111 NIC not working with kernel 4.19.2

Bug 1650984 - RTL8111 NIC not working with kernel 4.19.2

Summary: RTL8111 NIC not working with kernel 4.19.2

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	29
Hardware:	x86_64
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (7):	1654736 1654858 1657458 1661680 1661943 1662862 1662910 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2018-11-18 13:26 UTC by Norbert Jurkeit
Modified:	2019-08-21 13:35 UTC (History)
CC List:	40 users (show)
Fixed In Version:	kernel-4.19.6-300.fc29 kernel-4.19.6-200.fc28
Clone Of:
Environment:
Last Closed:	2019-08-21 13:35:53 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
journalctl of 4.19.6-200 when generic PHY refuses to work (100.99 KB, text/plain) 2018-12-10 03:08 UTC, Alex Villacís Lasso	no flags	Details
journalctl of 4.19.6-200 when generic PHY works (94.62 KB, text/plain) 2018-12-10 03:09 UTC, Alex Villacís Lasso	no flags	Details
lsmod on failed startup (6.41 KB, text/plain) 2018-12-15 11:30 UTC, Frank Crawford	no flags	Details
lsmod following r8169 reload (6.41 KB, text/plain) 2018-12-15 11:32 UTC, Frank Crawford	no flags	Details
dmesg of newest kernel with load/unload cycle (68.32 KB, text/plain) 2018-12-25 14:52 UTC, oli	no flags	Details
dmesg after updating to kernel 4.19.12-301.rhbz1650984_2.fc29.x86_64 (68.16 KB, text/plain) 2018-12-26 09:02 UTC, oli	no flags	Details
dmesg-4.19.10-300.fc29.x86_64 (5.99 KB, text/plain) 2018-12-26 11:08 UTC, Elia Devito	no flags	Details
dmesg-4.19.10-300.fc29.x86_64 (4.37 KB, text/plain) 2018-12-26 11:09 UTC, Elia Devito	no flags	Details
dmesg-4.19.12-301.rhbz1650984.fc29.x86_64 (4.37 KB, text/plain) 2018-12-26 11:10 UTC, Elia Devito	no flags	Details
dmesg-4.19.12-301.rhbz1650984.fc29.x86_64 (6.81 KB, text/plain) 2018-12-26 11:10 UTC, Elia Devito	no flags	Details
dmesg of 4.19.12-301.rhbz1650984_4.fc29.x86_64 (68.46 KB, text/plain) 2018-12-27 13:59 UTC, oli	no flags	Details
dmesg of 4.19.12-301.rhbz1650984_3.fc29.x86_64 (68.00 KB, text/plain) 2018-12-27 14:01 UTC, oli	no flags	Details
dmesg for 4.20.4-200.fc29.x86_64 (68.64 KB, text/plain) 2019-01-26 00:37 UTC, oli	no flags	Details
Show Obsolete (1) View All

Description Norbert Jurkeit 2018-11-18 13:26:24 UTC

My desktop PC has 2 Ethernet interfaces, the first one located on the motherboard and listed by lshw as "RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller" (RTL8111DL according to motherboard user manual) and the second located on a PCI card listed as "RTL8169 PCI Gigabit Ethernet Controller". Both interfaces use driver r8169.

After upgrade to kernel-4.19.2-300.fc29.x86_64 the first interface no longer works. According to ifconfig the interface is up, but not running and both LEDs besides the connector stay off. As a result, NetworkManager cannot establish a connection. The systemd journal does not contain any suspicious messages.

When I choose previous kernel 4.18.18 from Grub menu everything works fine again, so a hardware damage is unlikely.

This bug is very similar to bug 1628202 except that now the first NIC is affected. Seems that driver r8169 is broken again.

Comment 1 Michael Wiktowy 2018-11-20 14:56:26 UTC

I have the identical problem with the same network controller with the latest kernel upgrade. I only have one NIC on motherboard though.

Booting to kernel.x86_64 4.19.2-300.fc29 does not work.
Reverting to kernel.x86_64 4.18.18-300.fc29 works again.

'lspci | grep net' output:
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 01)

Logs don't indicate anything with kernel 4.19 other than:

Nov 18 21:36:05 desktop kernel: libphy: r8169: probed
Nov 18 21:36:05 desktop kernel: r8169 0000:03:00.0 eth0: RTL8168b/8111b, 00:24:8c:0d:92:d1, XID 38000000, IRQ 24
Nov 18 21:36:05 desktop kernel: r8169 0000:03:00.0 eth0: jumbo features [frames: 4080 bytes, tx checksumming: ko]

Where as NIC module load with 4.18 kernel is:

Nov 17 21:34:18 desktop kernel: r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
Nov 17 21:34:18 desktop kernel: r8169 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control
Nov 17 21:34:18 desktop kernel: r8169 0000:03:00.0 eth0: RTL8168b/8111b, 00:24:8c:0d:92:d1, XID 38000000, IRQ 24
Nov 17 21:34:18 desktop kernel: r8169 0000:03:00.0 eth0: jumbo features [frames: 4080 bytes, tx checksumming: ko]

Comment 2 Jeremy Cline 2018-11-20 16:24:46 UTC

It looks like there were a number of changes in v4.19 for r8169. To start with, please attach the complete dmesg output for each boot.

It would be very helpful if you bisected[0] the kernel between v4.18 and v4.19 to figure out which change introduced the regression.

[0] https://docs.fedoraproject.org/en-US/quick-docs/kernel/troubleshooting/index.html#_bisecting_the_kernel

Comment 3 Marc Dionne 2018-11-20 17:25:15 UTC

I have the same issue on an old machine that needs the r8169 driver.

The problem is that new r8169 driver in 4.19 has an indirect dependency on the "realtek" phy driver that's not encoded anywhere, which is probably an upstream kernel bug.  For my custom kernel I work around it by building 'realtek' in the kernel rather than as a module (CONFIG_REALTEK_PHY=y).

Once the machine is booted, the network can be enabled by loading realtek and reloading the r8169 module:

$ modprobe realtek
$ rmmod r8169
$ modprobe r8169

Comment 4 Norbert Jurkeit 2018-11-20 18:15:26 UTC

@Marc: The commands listed by you also work for me. Many thanks for that!

@Jeremy: As Marc was already able to pinpoint the fault you probably no longer need dmesg outputs or bisect. Please let me know if I should provide them, though (bisect may take some time however as I haven't done this before).

Comment 5 Jeremy Cline 2018-11-20 19:16:18 UTC

Norbert, Marc,

I recommend reporting this to upstream. The folks and lists to email should be:

* Realtek linux nic maintainers <nic_swsd>
* "David S. Miller" <davem>
* netdev.org
* linux-kernel.org

Comment 6 Michael Wiktowy 2018-11-21 01:36:10 UTC

As a more permanent temporary work-around:
'echo realtek > /etc/modules-load.d/realtek.conf'
causes systemd to load the module on boot.

Comment 7 Norbert Jurkeit 2018-11-21 10:01:06 UTC

That's strange... I did some more tests (without the proposed workarounds) and noticed that the fault only occurs after reboot from a previous session with kernel 4.18.18. It does not occur (at least for me) after power-on and in subsequent reboots of kernel 4.19.2. I also noticed that while the fault is present sometimes it is sufficient to issue either "modprobe realtek" or "rmmod r8169; modprobe r8169". Obviously the behavior is erratic as if some registers were not fully initialized by the new driver.

Anyway, I will write an email to the proposed list and ask them to take care of the issue.

Comment 8 ikixxx@gmail.com 2018-11-21 11:28:49 UTC

I have the same problem (02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 01))

In my case the fault is always there, cold boot or reboot it doesn't matter.

Proposed workarounds work.

Comment 9 Michael Wiktowy 2018-11-21 17:12:08 UTC

I've had similar intermittent functioning too. When I made the realtek.conf systemd loader, I removed it to see if the NIC would stay down and to my surprise, it stayed working. After rebooting into Windows it stopped working again so I put the conf file back in place. So it would appear that whatever functioning the realtek module enables is somewhat sticky.

Comment 10 Heiner Kallweit 2018-11-21 20:07:08 UTC

The issue seems to affect certain chip-versions / systems only. So I'd like to get an overview. Could everybody facing this issue please provide the following info:

- dmesg line with the chip XID (just grep for "XID")
- dmesg line with the used PHY driver (grep for "attached PHY driver")
  (if the error occurs and when loading the realtek module manually)
- PHY ID (value of /sys/class/net/<ifname>/phydev/phy_id)

Comment 11 Michael Wiktowy 2018-11-21 20:15:34 UTC

This is for a system where the network is working due to systemd manual realtek module load above:

[root@desktop ~]# dmesg |grep XID
[    3.448502] r8169 0000:03:00.0 eth0: RTL8168b/8111b, 00:24:8c:0d:92:d1, XID 38000000, IRQ 24

[root@desktop ~]# dmesg |grep "attached PHY driver"
[    9.640822] RTL8211B Gigabit Ethernet r8169-300:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-300:00, irq=IGNORE)

[root@desktop ~]# cat /sys/class/net/enp3s0/phydev/phy_id 
0x001cc912

Like was indicated, even after removing this conf file, the fix is persistent through warm and cold reboots. I will try booting to another kernel/OS to trigger it to not work again and see if there is any difference with above dmesg commands.

Comment 12 Marc Dionne 2018-11-21 20:39:41 UTC

For my case:

[    2.385162] r8169 0000:06:00.0 eth0: RTL8168c/8111c, 00:24:21:32:03:4b, XID 3c4000c0, IRQ 24

When the network failed to start:

[    6.773503] Generic PHY r8169-600:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)

but later on after manually loading the 'realtek' driver:

[15710.839318] RTL8211B Gigabit Ethernet r8169-600:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)

/sys/class/net/enp3s0/phydev/phy_id is 0x001cc912, but that's after manually loading 'realtek', so I suspect it would be different at boot when in the failed state.

Comment 13 Norbert Jurkeit 2018-11-21 21:49:07 UTC

(In reply to Heiner Kallweit from comment #10)

> - dmesg line with the chip XID (just grep for "XID")

[    3.580112] r8169 0000:01:00.0 eth0: RTL8168d/8111d, 00:25:22:06:b2:1c, XID 281000c0, IRQ 30
[    3.585926] r8169 0000:06:00.0 eth1: RTL8169sb/8110sb, 00:11:6b:97:15:b3, XID 10000000, IRQ 21

Please note that my machine has 2 Ethernet interfaces. The first one is affected by this issue, the second works fine.


> - dmesg line with the used PHY driver (grep for "attached PHY driver")

When the error does NOT occur (after cold start in my case):

[   27.896411] Generic PHY r8169-100:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)
[   28.032308] Generic PHY r8169-600:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)

>   (if the error occurs and when loading the realtek module manually)

[   24.084694] Generic PHY r8169-100:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)
[   24.221617] Generic PHY r8169-600:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)
[  501.749429] RTL8211B Gigabit Ethernet r8169-100:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)
[  501.939784] RTL8211C Gigabit Ethernet r8169-600:00: attached PHY driver [RTL8211C Gigabit Ethernet] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)

When the error occurs I can also restore proper operation with just "rmmod r8169; modprobe r8169" without "modprobe realtek". But even then the 2 extra entries for RTL8211B and RTL8211C appear.


> - PHY ID (value of /sys/class/net/<ifname>/phydev/phy_id)

0x001cc912


Unfortunately no other OS besides Fedora is installed on my machine, but according to Comment 9 I would expect that booting Windows before kernel 4.19.2 also invokes the issue.

Comment 14 Michael Wiktowy 2018-11-22 04:46:23 UTC

To contrast with my working config in Comment 11, FWIW, here are those dmesg outputs after removing manual module loading work-around, rebooting into dual-boot Windows and then booting back and having networking fail:

[root@desktop ~]# dmesg |grep XID
[    3.465029] r8169 0000:03:00.0 eth0: RTL8168b/8111b, 00:24:8c:0d:92:d1, XID 38000000, IRQ 24
[root@desktop ~]# dmesg |grep "attached PHY driver"
[    9.753340] Generic PHY r8169-300:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-300:00, irq=IGNORE)
[root@desktop ~]# cat /sys/class/net/enp3s0/phydev/phy_id 
0x001cc912

Comment 15 Heiner Kallweit 2018-11-22 20:34:58 UTC

The following seems to fix the issue. At least it works for Marc. If some of you also have the option to compile an own kernel and test, this would be appreciated.

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 55202a0ac..aeccb2323 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -2254,6 +2254,7 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner)
        new_driver->mdiodrv.driver.probe = phy_probe;
        new_driver->mdiodrv.driver.remove = phy_remove;
        new_driver->mdiodrv.driver.owner = owner;
+       new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS;

        retval = driver_register(&new_driver->mdiodrv.driver);
        if (retval) {

Comment 16 Norbert Jurkeit 2018-11-23 17:57:29 UTC

(In reply to Heiner Kallweit from comment #15)

This patch also works for me.

I managed to build a Fedora kernel of version 4.19.3 first without, then with the patch applied. As expected the issue is also present in the unpatched kernel 4.19.3, but is fixed for me by including the additional instruction (i.e. the RTL8111 based interface not only comes up correctly after a cold start, but also after reboot from older kernel 4.18.18).

Many thanks for finding a solution so quickly!

Comment 17 Heiner Kallweit 2018-11-29 11:54:35 UTC

Fixed with commit c85ddecae6e5 ("net: phy: add workaround for issue where PHY driver doesn't bind to the device"). It's available in linux-next and it's expected to be included in 4.19.6.

Comment 18 Jeremy Cline 2018-11-29 22:12:28 UTC

It doesn't look to have made it into v4.19.6, so I'll add it as a patch for the Fedora v4.19.6 build. Thanks!

Comment 19 Jeremy Cline 2018-11-30 21:04:57 UTC

*** Bug 1654736 has been marked as a duplicate of this bug. ***

Comment 20 Fedora Update System 2018-12-03 18:51:42 UTC

kernel-tools-4.19.6-300.fc29 kernel-headers-4.19.6-300.fc29 kernel-4.19.6-300.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2018-ce033b4abb

Comment 21 Fedora Update System 2018-12-03 18:52:55 UTC

kernel-headers-4.19.6-200.fc28 kernel-tools-4.19.6-200.fc28 kernel-4.19.6-200.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-18c53c4a64

Comment 22 Fedora Update System 2018-12-04 03:38:14 UTC

kernel-4.19.6-200.fc28, kernel-headers-4.19.6-200.fc28, kernel-tools-4.19.6-200.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-18c53c4a64

Comment 23 Fedora Update System 2018-12-04 05:02:30 UTC

kernel-4.19.6-300.fc29, kernel-headers-4.19.6-300.fc29, kernel-tools-4.19.6-300.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-ce033b4abb

Comment 24 Norbert Jurkeit 2018-12-04 19:39:41 UTC

I hate to say this, but neither kernel-4.19.6-300.fc29.x86_64 nor kernel-4.19.6-200.fc28.x86_64 from updates-testing fixes the issue for me.

I then built a kernel the same way as 2 weeks ago using following instructions:

$ fedpkg clone -a kernel
$ cd kernel
$ git checkout -b my_kernel origin/f29

In file kernel.spec replace
    # define builid .local
by
    ## define buildid .local
    %define buildid .my_kernel

$ fedpkg local

This time however I didn't need to add the patch from Comment 15 as it was already in place.

The produced kernel-4.19.6-300.my_kernel.fc29.x86_64 does fix the issue, similar to kernel-4.19.3-300.my_kernel.fc29.x86_64 before.

Did probably something go wrong in koji build or are different settings used there?

Comment 25 Jeremy Cline 2018-12-04 20:33:17 UTC

Hmm. There's no configuration difference, Koji builds from the same repository. The build logs look fine to me, too.

If you repeatedly boot into the updates-testing kernel does it fail every time? When you repeatedly boot into your locally built kernel does it work every time?

Comment 26 Norbert Jurkeit 2018-12-04 21:51:44 UTC

(In reply to Jeremy Cline from comment #25)
> If you repeatedly boot into the updates-testing kernel does it fail every
> time? When you repeatedly boot into your locally built kernel does it work
> every time?

Yes and yes. So far I rebooted 3 times into both kernels each from a session with kernel 4.18.18 and in all cases the RTL8111 interface came up with the self-compiled kernel but never with the kernel from updates-testing.

Are perhaps different gcc versions in use? Mine is currently gcc-8.2.1-5.fc29.x86_64.rpm. This should not make any difference, but I remember a case where a major version upgrade caused trouble with LVM2.

Comment 27 Fedora Update System 2018-12-06 03:56:56 UTC

kernel-4.19.6-300.fc29, kernel-headers-4.19.6-300.fc29, kernel-tools-4.19.6-300.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

Comment 28 Jeremy Cline 2018-12-06 14:36:46 UTC

(In reply to Norbert Jurkeit from comment #26)
> (In reply to Jeremy Cline from comment #25)
> > If you repeatedly boot into the updates-testing kernel does it fail every
> > time? When you repeatedly boot into your locally built kernel does it work
> > every time?
> 
> Yes and yes. So far I rebooted 3 times into both kernels each from a session
> with kernel 4.18.18 and in all cases the RTL8111 interface came up with the
> self-compiled kernel but never with the kernel from updates-testing.
> 
> Are perhaps different gcc versions in use? Mine is currently
> gcc-8.2.1-5.fc29.x86_64.rpm. This should not make any difference, but I
> remember a case where a major version upgrade caused trouble with LVM2.

That kernel was built with the same version of gcc, so it's not that. I'm really not sure what could cause the discrepancy. Does v4.19.7 (just submitted to updates-testing) exhibit the same behavior?

Comment 29 Norbert Jurkeit 2018-12-06 16:29:13 UTC

(In reply to Jeremy Cline from comment #28)

> That kernel was built with the same version of gcc, so it's not that. I'm
> really not sure what could cause the discrepancy. Does v4.19.7 (just
> submitted to updates-testing) exhibit the same behavior?

kernel-4.19.7-300.fc29.x86_64.rpm just downloaded from koji neither fixes the issue for me.

I would appreciate to hear from other users hit by this issue whether the provided compiled kernels work for them. I also have no clue and would like to know what makes my self-compiled kernel behave differently, but if the issue is fixed for everyone else it is not worth to put more effort into this.

Comment 30 Marc Dionne 2018-12-06 16:52:15 UTC

FWIW I've been getting network reliably with kernel-4.19.6-300.fc29 from the repositories, as well as with locally built kernels from mainline.

Comment 31 rune 2018-12-06 19:10:37 UTC

Using kernel 19.6-300.fc29.x86_64 from the stable repo without any workarounds (NIC does not work, so not fixed for me):

# dmesg |grep XID
[    5.136382] r8169 0000:01:00.0 eth0: RTL8168d/8111d, 00:25:22:2f:bb:9c, XID 281000c0, IRQ 31

# dmesg |grep "attached PHY driver"
[   23.020448] Generic PHY r8169-100:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)

After adding realtek to modules-load.d and a reboot (NIC now works): 

# dmesg |grep XID
[    5.075131] r8169 0000:01:00.0 eth0: RTL8168d/8111d, 00:25:22:2f:bb:9c, XID 281000c0, IRQ 31

# dmesg |grep "attached PHY driver"
[   48.382706] RTL8211B Gigabit Ethernet r8169-100:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)
   
# cat /sys/class/net/enp1s0/phydev/phy_id 
0x001cc912

Comment 32 Fedora Update System 2018-12-07 02:14:27 UTC

kernel-4.19.6-200.fc28, kernel-headers-4.19.6-200.fc28, kernel-tools-4.19.6-200.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.

Comment 33 Alex Villacís Lasso 2018-12-10 03:07:18 UTC

I have a Fedora 28 machine that is also affected by this bug. Updating to 4.19.6-200 does NOT fix the bug.

I have noticed that this bug is unreliable - sometimes the generic PHY driver works, sometimes it does not. When it did not work, unloading and reloading r8169 managed to make it work.

Comment 34 Alex Villacís Lasso 2018-12-10 03:08:21 UTC

Created attachment 1512939 [details]
journalctl of 4.19.6-200 when generic PHY refuses to work

Comment 35 Alex Villacís Lasso 2018-12-10 03:09:08 UTC

Created attachment 1512940 [details]
journalctl of 4.19.6-200 when generic PHY works

Comment 36 Heiner Kallweit 2018-12-10 10:33:35 UTC

The patch as mentioned in comment 15 is included in mainline since 4.19.7. With the patch the dedicated PHY driver (RTL8211B) should be loaded immediately.
Possible would be also that the fix does fix the issue for most people, but not for all. But this seems to be less likely than a build issue with the Fedora-built kernel, see comment 24.
Maybe affected people could try to build a mainline 4.19.8 (latest available version) kernel and test with this one.

Comment 37 Frank Crawford 2018-12-11 12:19:27 UTC

Unfortunately the fix also doesn't correct the issue for me either.  I've tried both kernel-4.19.6-200.fc28.x86_64 and kernel-4.19.7-200.fc28.x86_64.

I do have a slight difference to the other users in that I have two NICs in my system, both using the r8169 module, one of which works, the other of which doesn't.  They are slightly different models, so that is the likely issue.

The output from journalctl for the failed boot was:

Dec 11 22:46:32 bits.crawford.emu.id.au kernel: libphy: r8169: probed
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 eth0: RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 38000000, IRQ 28
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 eth0: jumbo features [frames: 4080 bytes, tx checksumming: ko]
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: libphy: r8169: probed
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 eth1: RTL8168evl/8111evl, 50:e5:49:45:87:2a, XID 2c900800, IRQ 29
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 eth1: jumbo features [frames: 9200 bytes, tx checksumming: ko]
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 enp5s0: renamed from eth0
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 enp7s0: renamed from eth1
Dec 11 22:46:38 bits.crawford.emu.id.au kernel: Generic PHY r8169-500:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
Dec 11 22:46:38 bits.crawford.emu.id.au kernel: Generic PHY r8169-700:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
Dec 11 22:46:42 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control rx/tx

I haven't yet had a chance to try unloading and reloading the r8169 modules, but will try that in the next few days.

Comment 38 Heiner Kallweit 2018-12-11 16:15:28 UTC

(In reply to Frank Crawford from comment #37)
> Unfortunately the fix also doesn't correct the issue for me either.  I've
> tried both kernel-4.19.6-200.fc28.x86_64 and kernel-4.19.7-200.fc28.x86_64.
> 
This doesn't rule out there's a build issue with the Fedora kernel. Could you build a mainline kernel or build a kernel as described in comment 24 and test?

> I do have a slight difference to the other users in that I have two NICs in
> my system, both using the r8169 module, one of which works, the other of
> which doesn't.  They are slightly different models, so that is the likely
> issue.
> 
> The output from journalctl for the failed boot was:
> 
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: libphy: r8169: probed
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 eth0:
> RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 38000000, IRQ 28
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 eth0:
> jumbo features [frames: 4080 bytes, tx checksumming: ko]
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: libphy: r8169: probed
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 eth1:
> RTL8168evl/8111evl, 50:e5:49:45:87:2a, XID 2c900800, IRQ 29
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 eth1:
> jumbo features [frames: 9200 bytes, tx checksumming: ko]
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 enp5s0:
> renamed from eth0
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 enp7s0:
> renamed from eth1
> Dec 11 22:46:38 bits.crawford.emu.id.au kernel: Generic PHY r8169-500:00:
> attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
> Dec 11 22:46:38 bits.crawford.emu.id.au kernel: Generic PHY r8169-700:00:
> attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
> Dec 11 22:46:42 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 enp7s0:
> Link is Up - 1Gbps/Full - flow control rx/tx
> 
RTL8168evl has an integrated RTL8211E PHY which is compatible with genphy. Therefore it works even if genphy is used instead of the dedicated RTL8211E PHY driver.
RTL8168B has an integrated RTL8211B PHY that causes problems when used with the genphy driver.

> I haven't yet had a chance to try unloading and reloading the r8169 modules,
> but will try that in the next few days.

Comment 39 Frank Crawford 2018-12-12 04:19:14 UTC

(In reply to Heiner Kallweit from comment #38)
> (In reply to Frank Crawford from comment #37)
> > Unfortunately the fix also doesn't correct the issue for me either.  I've
> > tried both kernel-4.19.6-200.fc28.x86_64 and kernel-4.19.7-200.fc28.x86_64.
> > 
> This doesn't rule out there's a build issue with the Fedora kernel. Could
> you build a mainline kernel or build a kernel as described in comment 24 and
> test?

Okay.  I was planning to try and build a mainline kernel soon anyway (I keep running out of time) so will prioritise that this weekend.

Comment 40 Frank Crawford 2018-12-12 10:23:29 UTC

(In reply to Heiner Kallweit from comment #38)
> (In reply to Frank Crawford from comment #37)
> > Unfortunately the fix also doesn't correct the issue for me either.  I've
> > tried both kernel-4.19.6-200.fc28.x86_64 and kernel-4.19.7-200.fc28.x86_64.
> > 
> This doesn't rule out there's a build issue with the Fedora kernel. Could
> you build a mainline kernel or build a kernel as described in comment 24 and
> test?

Okay, I built a mainline kernel today (4.19.8) and it still has the same problems.  I also had a little time to test if forcing the realtek module fixes the issue (i.e. rmmod r8169; modprobe realtek; modprobe r8169) and it did correct the problem.

The output from the boot is below, but really just shows the same details:

[    3.432656] libphy: r8169: probed
[    3.432818] r8169 0000:05:00.0 eth0: RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 38000000, IRQ 28
[    3.432821] r8169 0000:05:00.0 eth0: jumbo features [frames: 4080 bytes, tx checksumming: ko]
[    3.435652] libphy: r8169: probed
[    3.436033] r8169 0000:07:00.0 eth1: RTL8168evl/8111evl, 50:e5:49:45:87:2a, XID 2c900800, IRQ 29
[    3.436035] r8169 0000:07:00.0 eth1: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[    3.496722] r8169 0000:05:00.0 enp5s0: renamed from eth0
[    3.535335] r8169 0000:07:00.0 enp7s0: renamed from eth1
[   10.872345] Generic PHY r8169-500:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[   10.977828] Generic PHY r8169-700:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
[   14.201181] r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control rx/tx
[  108.646084] libphy: r8169: probed
[  108.646705] r8169 0000:05:00.0 eth0: RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 38000000, IRQ 28
[  108.646708] r8169 0000:05:00.0 eth0: jumbo features [frames: 4080 bytes, tx checksumming: ko]
[  108.649246] libphy: r8169: probed
[  108.650080] r8169 0000:07:00.0 eth1: RTL8168evl/8111evl, 50:e5:49:45:87:2a, XID 2c900880, IRQ 29
[  108.650083] r8169 0000:07:00.0 eth1: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[  108.655795] r8169 0000:05:00.0 enp5s0: renamed from eth0
[  108.681706] r8169 0000:07:00.0 enp7s0: renamed from eth1
[  108.717321] RTL8211B Gigabit Ethernet r8169-500:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[  108.822097] RTL8211E Gigabit Ethernet r8169-700:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
[  111.144179] r8169 0000:05:00.0 enp5s0: Link is Up - 1Gbps/Full - flow control rx/tx
[  111.960253] r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control rx/tx

Comment 41 Heiner Kallweit 2018-12-12 10:56:13 UTC

Hmm, even if there would be a timing issue I'd expect that at least for the second network interface the proper PHY driver is loaded.
Is after the first failed attempt module "realtek" loaded at all? A "lsmod" output would be helpful.

Comment 42 Frank Crawford 2018-12-12 11:20:37 UTC

My apologies, it probably is a bit confusing that I just dumped the data there.

The output up to 15secs, was the normal boot, at that point I did "rmmod r8169; modprobe realtek; modprobe r8169", which leads to the rest of the output.

When I get a chance to reboot I'll get an lsmod and upload it (this is my main server, hence the two NICs).

Comment 43 Jeremy Cline 2018-12-12 14:28:47 UTC

*** Bug 1654736 has been marked as a duplicate of this bug. ***

Comment 44 Alex Villacís Lasso 2018-12-12 17:27:00 UTC

(In reply to Alex Villacís Lasso from comment #33)
> I have a Fedora 28 machine that is also affected by this bug. Updating to
> 4.19.6-200 does NOT fix the bug.
> 
> I have noticed that this bug is unreliable - sometimes the generic PHY
> driver works, sometimes it does not. When it did not work, unloading and
> reloading r8169 managed to make it work.

I must note - in my particular case I did NOT "modprobe realtek" between unloading and reloading r8169. Yet the reload fixed things in that instance. Everyone here should make note of whether the explicit "modprobe realtek" is required in your particular cases between unloading and reloading r8169.

Comment 45 Heiner Kallweit 2018-12-12 19:30:37 UTC

Right, it shouldn't have been necessary to do a "modprobe realtek". The module is loaded at boot, but there is (was) a timing issue causing driver and PHY device probe to collide. Therefore the genphy driver was used.

Comment 46 Frank Crawford 2018-12-15 11:30:35 UTC

Created attachment 1514587 [details]
lsmod on failed startup

lsmod output on fresh reboot

Comment 47 Frank Crawford 2018-12-15 11:32:09 UTC

Created attachment 1514588 [details]
lsmod following r8169 reload

The output of lsmod following "rmmod r8169; modprobe r8169".

Comment 48 Heiner Kallweit 2018-12-15 11:39:09 UTC

(In reply to Frank Crawford from comment #46)
> Created attachment 1514587 [details]
> lsmod on failed startup
> 
> lsmod output on fresh reboot

Even in this case the "realtek" module is loaded. This matches the pattern observed by the other affected users.

Comment 49 Frank Crawford 2018-12-15 11:45:47 UTC

Okay, I rebooted the system a couple of times, and you are correct that doing a "modprobe realtek" isn't necessary, just "rmmod r8169; modprobe r8169" is sufficient.

I've loaded up the modprobes from before and after the reload, and unfortunately there is little difference, only the location of the r8169 module in the list.

Also, FYI, while I know it shouldn't be necessary or do anything, if I add the line "softdep 48169 pre: realtek" to /etc/modprobe.d/local.conf then every thing seems to come up correctly.  Now I know that that has a typo "48169" should have been "r8169", which makes this even weirder.

Is it possible the issue is in the realtek module and not in the r8169 module?

Comment 50 Frank Crawford 2018-12-15 11:49:57 UTC

Also, I just looked at the dmesg output from that session, and both times the NIC appears to have bound to the realtek driver, but it didn't come up during the initial boot, but did on the reload!  See the output below:

[    3.091212] libphy: r8169: probed
[    3.091752] r8169 0000:05:00.0 eth0: RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 3
8000000, IRQ 28
[    3.091761] r8169 0000:05:00.0 eth0: jumbo features [frames: 4080 bytes, tx c
hecksumming: ko]
[    3.093858] libphy: r8169: probed
[    3.094328] r8169 0000:07:00.0 eth1: RTL8168evl/8111evl, 50:e5:49:45:87:2a, X
ID 2c900800, IRQ 29
[    3.094330] r8169 0000:07:00.0 eth1: jumbo features [frames: 9200 bytes, tx c
hecksumming: ko]
[    3.122238] r8169 0000:07:00.0 enp7s0: renamed from eth1
[    3.133382] r8169 0000:05:00.0 enp5s0: renamed from eth0
[   13.762322] RTL8211B Gigabit Ethernet r8169-500:00: attached PHY driver [RTL8
211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[   13.844037] RTL8211E Gigabit Ethernet r8169-700:00: attached PHY driver [RTL8
211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
[   14.015943] r8169 0000:07:00.0 enp7s0: Link is Down
[   17.080825] r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control
 rx/tx
[   84.669233] libphy: r8169: probed
[   84.670011] r8169 0000:05:00.0 eth0: RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 3
8000000, IRQ 28
[   84.670015] r8169 0000:05:00.0 eth0: jumbo features [frames: 4080 bytes, tx c
hecksumming: ko]
[   84.676368] r8169 0000:05:00.0 enp5s0: renamed from eth0
[   84.678825] libphy: r8169: probed
[   84.687877] r8169 0000:07:00.0 eth0: RTL8168evl/8111evl, 50:e5:49:45:87:2a, X
ID 2c900880, IRQ 29
[   84.687884] r8169 0000:07:00.0 eth0: jumbo features [frames: 9200 bytes, tx c
hecksumming: ko]
[   84.693005] r8169 0000:07:00.0 enp7s0: renamed from eth0
[   84.704510] RTL8211B Gigabit Ethernet r8169-500:00: attached PHY driver [RTL8
211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[   84.786049] RTL8211E Gigabit Ethernet r8169-700:00: attached PHY driver [RTL8
211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
[   86.890229] r8169 0000:05:00.0 enp5s0: Link is Up - 1Gbps/Full - flow control
 rx/tx
[   87.915586] r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control
 rx/tx

Comment 51 Heiner Kallweit 2018-12-15 12:05:34 UTC

If the dedicated RTL8211B PHY driver is loaded and your network still doesn't come up, then there's a different problem. But it doesn't seem to be in the drivers because the user from comment 1 has the same chip version and for him the issue seems to be fixed.

Comment 52 Frank Crawford 2018-12-15 12:27:45 UTC

(In reply to Heiner Kallweit from comment #51)
> If the dedicated RTL8211B PHY driver is loaded and your network still
> doesn't come up, then there's a different problem. But it doesn't seem to be
> in the drivers because the user from comment 1 has the same chip version and
> for him the issue seems to be fixed.

I'll play around with the system a bit more and see what I can find, since it all be the same underlying issue, just different manifestation.  As you noted, the current fix is a workaround, not a fix for the real issue, which is still not known.

Comment 53 Jeremy Cline 2018-12-18 15:29:56 UTC

*** Bug 1657458 has been marked as a duplicate of this bug. ***

Comment 54 Norbert Jurkeit 2018-12-20 13:10:41 UTC

(In reply to Heiner Kallweit from comment #51)
> If the dedicated RTL8211B PHY driver is loaded and your network still
> doesn't come up, then there's a different problem. But it doesn't seem to be
> in the drivers because the user from comment 1 has the same chip version and
> for him the issue seems to be fixed.

I did some more testing with the newest Fedora kernel-4.19.10-300.fc29.x86_64 (installed as binary from the repo as well as compiled by myself) and for the first time with a mainline kernel built from linux-4.19.10.tar.xz and the config file taken from the Fedora kernel. In opposite to my success reports in Comment 16 and Comment 24 neither of them fixes the issue for me, i.e. after reboot from kernel 4.18.18 I still have to issue "rmmod r8169; modprobe r8169" to make the interface work.

According to my experience the behavior is stable within a particular kernel, but varies between kernel versions and sometimes even between builds of the same version as described in Comment 24.

Comment 55 Heiner Kallweit 2018-12-20 19:00:36 UTC

Thanks for the feedback, then the first attempt to work around the issue wasn't good enough. Something is asynchronous and disturbs.
Could you try the following patch? It will apply to latest linux-next only, for older kernel versions you could simply add an unconditional msleep(20) at the end of phy_device_create().


diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 7d5d69860..d6f464aa5 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -557,6 +557,7 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 {
 	struct phy_device *dev;
 	struct mdio_device *mdiodev;
+	bool delay = false;
 
 	/* We allocate the device, and initialize the default values */
 	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
@@ -613,14 +614,19 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 			if (!(c45_ids->devices_in_package & (1 << i)))
 				continue;
 
-			request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT,
-				       MDIO_ID_ARGS(c45_ids->device_ids[i]));
+			if (!request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT,
+					MDIO_ID_ARGS(c45_ids->device_ids[i])))
+				delay = true;
 		}
 	} else {
-		request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT,
-			       MDIO_ID_ARGS(phy_id));
+		if (!request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT,
+				    MDIO_ID_ARGS(phy_id)))
+			delay = true;
 	}
 
+	if (delay)
+		msleep(20);
+
 	device_initialize(&mdiodev->dev);
 
 	return dev;
-- 
2.20.1

Comment 56 Heiner Kallweit 2018-12-21 09:21:20 UTC

One more potential fix to test:
Please add a call to wait_for_device_probe() at the end of phy_device_create(), directly before the return statement.
This way we may be able to avoid a dumb delay.

Comment 57 Norbert Jurkeit 2018-12-21 13:03:49 UTC

Success! I again tested with mainline kernel 4.19.10 and a msleep(20) statement inserted just before the return statement of function phy_device_create seems to fix the issue for me.

At first I concluded from the patch in Comment 55 that msleep(20) needs to be inserted before device_initialize(&mdiodev->dev), but that did not help.

I also tried wait_for_device_probe() instead of msleep(20), but then the boot process gets stuck after message "Reached target Initrd Raid Device" (the root fs including /boot is indeed located on a MD Raid device).

Comment 58 Heiner Kallweit 2018-12-21 13:10:07 UTC

Thanks for testing! Strange that it makes a difference whether the msleep() is before or after the device_initialize(). Have to check this.
I'd like to ask you for one more test. If you just use async_synchronize_full() instead of wait_for_device_probe(), does this help?

Comment 59 Heiner Kallweit 2018-12-21 13:46:44 UTC

Sorry, may I ask you for one more test:
If the device_initialize() call in phy_device_create() is moved to directly after dev_set_name(), does this help?

Comment 60 Norbert Jurkeit 2018-12-21 14:40:53 UTC

async_synchronize_full() as suggested in Comment 58 lets the boot process finish normally, but does not fix the issue. To avoid a compiler error I also added "#include <linux/async.h>"; please let me know if this was wrong.

I just started a build of the Fedora kernel to see whether the msleep(20) patch also works there. Unfortunately the procedure known to me takes nearly 4 hours and in case of interruption I have to start all over, but after that I will do the test from Comment 59.

Comment 61 Norbert Jurkeit 2018-12-21 18:52:42 UTC

(In reply to Heiner Kallweit from comment #59)
> Sorry, may I ask you for one more test:
> If the device_initialize() call in phy_device_create() is moved to directly
> after dev_set_name(), does this help?

Good news. This modification tested with the mainline kernel also fixes the issue for me.

The test of the msleep(20) patch applied to the Fedora kernel of same version 4.19.10 was also successful.

Comment 62 Heiner Kallweit 2018-12-21 19:04:06 UTC

Very good, then it seems we have a fix w/o having to add a dumb delay. I would submit this fix as part of a patch removing the recently added preliminary fix (which fixed it in som ecases but not in general).
The following would be the patch, could you please test it? Then I would mention you as "Tested-by:" in the patch.

---
 drivers/net/phy/phy_device.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 7d5d69860..3cacf9fe9 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -589,6 +589,7 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 		dev->c45_ids = *c45_ids;
 	dev->irq = bus->irq[addr];
 	dev_set_name(&mdiodev->dev, PHY_ID_FMT, bus->id, addr);
+	device_initialize(&mdiodev->dev);
 
 	dev->state = PHY_DOWN;
 
@@ -621,8 +622,6 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 			       MDIO_ID_ARGS(phy_id));
 	}
 
-	device_initialize(&mdiodev->dev);
-
 	return dev;
 }
 EXPORT_SYMBOL(phy_device_create);
@@ -2255,14 +2254,6 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner)
 	new_driver->mdiodrv.driver.remove = phy_remove;
 	new_driver->mdiodrv.driver.owner = owner;
 
-	/* The following works around an issue where the PHY driver doesn't bind
-	 * to the device, resulting in the genphy driver being used instead of
-	 * the dedicated driver. The root cause of the issue isn't known yet
-	 * and seems to be in the base driver core. Once this is fixed we may
-	 * remove this workaround.
-	 */
-	new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS;
-
 	retval = driver_register(&new_driver->mdiodrv.driver);
 	if (retval) {
 		pr_err("%s: Error %d in registering driver\n",
-- 
2.20.1

Comment 63 Norbert Jurkeit 2018-12-21 20:34:59 UTC

The patch works for me, i.e. I copied the text to patchfile, ran "patch phy_device.c patchfile", checked the contents, then "make", "make modules_install", "make install" and finally rebooted 3 times from kernel 4.18.18 and checked the network to come up.

However I wonder whether other users should test the patch too before it is submitted to the mainline kernel. We have seen in Comment 30 and also on bodhi.fedoraproject.org that the previous fix with PROBE_FORCE_SYNCHRONOUS has fixed the issue for some users. I don't understand how the driver works but wouldn't exclude the chance that the previous fix is still needed for them.

Comment 64 Heiner Kallweit 2018-12-21 20:40:02 UTC

Thanks for testing. And it's right, I'd appreciate if more affected users could test whether the patch fixes the issue for them.

Comment 65 Heiner Kallweit 2018-12-23 11:06:29 UTC

(In reply to Frank Crawford from comment #52)
> (In reply to Heiner Kallweit from comment #51)
> > If the dedicated RTL8211B PHY driver is loaded and your network still
> > doesn't come up, then there's a different problem. But it doesn't seem to be
> > in the drivers because the user from comment 1 has the same chip version and
> > for him the issue seems to be fixed.
> 
> I'll play around with the system a bit more and see what I can find, since
> it all be the same underlying issue, just different manifestation.  As you
> noted, the current fix is a workaround, not a fix for the real issue, which
> is still not known.

You are one of the affected users where the first fix attempt didn't help. Could you test the latest fix from comment 62 ?

Comment 66 Frank Crawford 2018-12-23 11:32:07 UTC

(In reply to Heiner Kallweit from comment #65)
> You are one of the affected users where the first fix attempt didn't help.
> Could you test the latest fix from comment 62 ?

Heiner, yes, I will within the next day, and let you know the results.

Comment 67 Frank Crawford 2018-12-24 11:11:41 UTC

(In reply to Heiner Kallweit from comment #65)
> You are one of the affected users where the first fix attempt didn't help.
> Could you test the latest fix from comment 62 ?

Heiner, okay, I had a chance to patch and rebuild it with 4.19.12 and then rebooted a few times from different states, and it has come up fine every time.

So, yes, I'm willing to say this patch seems to have fixed my issue.

Comment 68 Heiner Kallweit 2018-12-24 11:34:29 UTC

Great, thanks for testing. Then I'll submit this patch.

Comment 69 Hans de Goede 2018-12-24 12:03:29 UTC

*** Bug 1661680 has been marked as a duplicate of this bug. ***

Comment 70 Hans de Goede 2018-12-24 20:46:52 UTC

*** Bug 1661943 has been marked as a duplicate of this bug. ***

Comment 71 Hans de Goede 2018-12-24 20:47:03 UTC

*** Bug 1654858 has been marked as a duplicate of this bug. ***

Comment 72 oli 2018-12-25 00:21:31 UTC

if someone could provide a .rpm, i would test it too :)

Comment 73 Hans de Goede 2018-12-25 10:18:47 UTC

(In reply to oli from comment #72)
> if someone could provide a .rpm, i would test it too :)

I've started a kernel scratch-build with the patch added here:

https://koji.fedoraproject.org/koji/taskinfo?taskID=31646312

This should be done in a couple of hours, instructions for testing kernels directly from koji are here:
https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt

Comment 74 oli 2018-12-25 14:04:34 UTC

I just tested the new kernel. It did not fix the problem with my nic
If you need more information, please tell me what i should do.

Comment 75 Heiner Kallweit 2018-12-25 14:18:16 UTC

(In reply to oli from comment #74)
> I just tested the new kernel. It did not fix the problem with my nic
> If you need more information, please tell me what i should do.

We don't even know what your problem is. Please add:
- description of problem
- last kernel version w/o this problem
- does removing / reloading module r8169 help?
- output of "dmesg | grep r8169"

Comment 76 oli 2018-12-25 14:42:42 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=1661943

i will try to reload the nic module and see what happens

Comment 77 oli 2018-12-25 14:46:17 UTC

thanks, when i do rmmod r8169 && modprobe r8169 the device gets up and running as it should - at least its a workaround, thanks

Comment 78 Heiner Kallweit 2018-12-25 14:48:10 UTC

What's still missing is a dmesg from the test kernel incl. module remove/reload cycle.

Comment 79 oli 2018-12-25 14:52:46 UTC

Created attachment 1516661 [details]
dmesg of newest kernel with load/unload cycle

Comment 80 Heiner Kallweit 2018-12-25 15:13:42 UTC

(In reply to Hans de Goede from comment #73)
> (In reply to oli from comment #72)
> > if someone could provide a .rpm, i would test it too :)
> 
> I've started a kernel scratch-build with the patch added here:
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=31646312
> 
> This should be done in a couple of hours, instructions for testing kernels
> directly from koji are here:
> https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt

Hans, could you please build a test kernel with the following patch instead?
The delay is the less sophisticated, but obviously safer method to prevent the issue.
For another affected user it helped, would like to see whether it fixes the issue also for oli.


diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 26c41ede5..09e0ad928 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -600,6 +600,9 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 
 	device_initialize(&mdiodev->dev);
 
+	/* prevent timing issue causing phy driver not bound sometimes */
+	msleep(20);
+
 	return dev;
 }
 EXPORT_SYMBOL(phy_device_create);
-- 
2.20.1

Comment 81 Hans de Goede 2018-12-25 22:33:37 UTC

(In reply to Heiner Kallweit from comment #80)
> Hans, could you please build a test kernel with the following patch instead?
> The delay is the less sophisticated, but obviously safer method to prevent
> the issue.
> For another affected user it helped, would like to see whether it fixes the
> issue also for oli.

Ok, a new kernel with that patch is building here now:

https://koji.fedoraproject.org/koji/taskinfo?taskID=31661328

This should be done in a couple of hours.

Comment 82 oli 2018-12-26 09:00:14 UTC

I downloaded that kernel, installed as instructed from https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt, removed my ethhack.service (called ethhack.sh on startup) and ethhack.sh (rmmod and modprobe of r8169) and rebooted. 
I instantly got a connection. To make sure it was that, i rebooted the old scratch kernel ... and it was working there too? I dont understand it.
So i bootet the old official fedora kernel, this worked now too.
I dont understand why, but for me this is solved. Thanks!

Comment 83 oli 2018-12-26 09:02:48 UTC

Created attachment 1516774 [details]
dmesg after updating to kernel 4.19.12-301.rhbz1650984_2.fc29.x86_64

Comment 84 Heiner Kallweit 2018-12-26 10:07:25 UTC

(In reply to oli from comment #83)
> Created attachment 1516774 [details]
> dmesg after updating to kernel 4.19.12-301.rhbz1650984_2.fc29.x86_64

Thanks for testing. Still strange that your attached dmesg from the latest test kernel includes the following line:
Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
I would expect a dedicated PHY driver here. If you remove and reload module r8169, which PHY driver is loaded?

Comment 85 Elia Devito 2018-12-26 11:08:11 UTC

not fixed for me, seem that my machine load Generic PHY every time (on boot and after r8169 module reload)

Comment 86 Elia Devito 2018-12-26 11:08:50 UTC

Created attachment 1516819 [details]
dmesg-4.19.10-300.fc29.x86_64

Comment 87 Elia Devito 2018-12-26 11:09:24 UTC

Created attachment 1516820 [details]
dmesg-4.19.10-300.fc29.x86_64

Comment 88 Elia Devito 2018-12-26 11:10:11 UTC

Created attachment 1516821 [details]
dmesg-4.19.12-301.rhbz1650984.fc29.x86_64

Comment 89 Elia Devito 2018-12-26 11:10:36 UTC

Created attachment 1516822 [details]
dmesg-4.19.12-301.rhbz1650984.fc29.x86_64

Comment 90 Heiner Kallweit 2018-12-26 11:21:54 UTC

(In reply to Elia Devito from comment #85)
> not fixed for me, seem that my machine load Generic PHY every time (on boot
> and after r8169 module reload)

For RTL8168h it's correct that the genphy driver is used. You describe a totally different issue.
The issue we talk about here affects (AFAIK only) chips with a RTL8211C PHY because this PHY isn't compatible with genphy. Link doesn't come up in this case.
You have an unstable link. It's less likely that this is a driver issue, same chip version works totally fine here with latest driver.
How does your system behave with a 4.18 kernel? And if you switch cable and/or switch?

Comment 91 oli 2018-12-26 12:47:53 UTC

[root@localhost oli]# rmmod r8169 && modprobe r8169
(latest lines from dmesg regarding nic driver) :
[13736.406887] libphy: r8169: probed
[13736.407479] r8169 0000:02:00.0 eth0: RTL8168d/8111d, 00:25:22:0c:80:0a, XID 281000c0, IRQ 24
[13736.407482] r8169 0000:02:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[13736.415866] r8169 0000:02:00.0 enp2s0: renamed from eth0
[13736.434338] IPv6: ADDRCONF(NETDEV_UP): enp2s0: link is not ready
[13736.435042] RTL8211B Gigabit Ethernet r8169-200:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
[13736.564882] IPv6: ADDRCONF(NETDEV_UP): enp2s0: link is not ready
[13738.074544] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
[13738.074554] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready

> which PHY driver is loaded?
can you help me a bit? i tested lsmod | grep phy; but i did not get any output
is the information provided in the dmesg pasted above?

Comment 92 oli 2018-12-26 12:50:28 UTC

whats also pretty wierd to me, it works now with all kernels which i do not understand. but i deleted the ethhack.sh (which did rmmod && modprobe) and also the service for the startup, so either its kinda random bug (timing?) or it works now... somehow
i will try this evening to power off the machine completly and then start an old kernel which should not work and verify that again

Comment 93 Heiner Kallweit 2018-12-26 13:00:05 UTC

(In reply to oli from comment #91)
> [root@localhost oli]# rmmod r8169 && modprobe r8169
> (latest lines from dmesg regarding nic driver) :
> [13736.406887] libphy: r8169: probed
> [13736.407479] r8169 0000:02:00.0 eth0: RTL8168d/8111d, 00:25:22:0c:80:0a,
> XID 281000c0, IRQ 24
> [13736.407482] r8169 0000:02:00.0 eth0: jumbo features [frames: 9200 bytes,
> tx checksumming: ko]
> [13736.415866] r8169 0000:02:00.0 enp2s0: renamed from eth0
> [13736.434338] IPv6: ADDRCONF(NETDEV_UP): enp2s0: link is not ready
> [13736.435042] RTL8211B Gigabit Ethernet r8169-200:00: attached PHY driver
> [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)

That's the line I was looking for. (BTW: in comment 90 I meant RTL8211B).
Now the dedicated RTL8211B PHY driver is correctly loaded.

As explanation for others involved in the discussion here:
r8169 supports ~ 50 chip versions. The integrated PHY's partially use the same ID as existing external Realtek PHY's, the PHY's in later chip versions use a generic Realtek ID and therefore the generic PHY driver is used.

Comment 94 Heiner Kallweit 2018-12-26 18:00:20 UTC

Still the latest fix attempt doesn't fully cure the issue. According the log in comment 83 again genphy driver was bound instead of the dedicated PHY driver.

And the root cause of the issue seems to be well hidden. I'd appreciate if affected people could test with an original 4.19 (w/o any of the fix attempts) and just the following patch applied.
I'd be curious whether the warning is triggered and the proper dedicated driver is bound.

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 8ac10af17..7f9a01a34 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -804,7 +804,7 @@ static int __device_attach(struct device *dev, bool allow_async)
 		if (ret == 0)
 			ret = 1;
 		else {
-			dev->driver = NULL;
+			WARN(1, "%s\n", dev_name(dev));
 			ret = 0;
 		}
 	} else {
-- 
2.20.1

Comment 95 oli 2018-12-27 00:22:51 UTC

@Heiner

I can test kernels, but i need a .rpm 
Unfortunately, i drive back home in about 12h, then i have no access to this machine anymore and can not test it.

Comment 96 Hans de Goede 2018-12-27 11:20:49 UTC

(In reply to Heiner Kallweit from comment #94)
> Still the latest fix attempt doesn't fully cure the issue. According the log
> in comment 83 again genphy driver was bound instead of the dedicated PHY
> driver.
> 
> And the root cause of the issue seems to be well hidden. I'd appreciate if
> affected people could test with an original 4.19 (w/o any of the fix
> attempts) and just the following patch applied.
> I'd be curious whether the warning is triggered and the proper dedicated
> driver is bound.
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 8ac10af17..7f9a01a34 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -804,7 +804,7 @@ static int __device_attach(struct device *dev, bool
> allow_async)
>  		if (ret == 0)
>  			ret = 1;
>  		else {
> -			dev->driver = NULL;
> +			WARN(1, "%s\n", dev_name(dev));
>  			ret = 0;
>  		}
>  	} else {
> -- 
> 2.20.1

For people who want to give this a test, a scratch kernel-build with the above change is building here now:
https://koji.fedoraproject.org/koji/taskinfo?taskID=31688858

As usual it takes a couple of hours for this to finish.

Comment 97 Hans de Goede 2018-12-27 11:33:26 UTC

I've started another scratch-build of the kernel here:
https://koji.fedoraproject.org/koji/taskinfo?taskID=31688943

With an attempt to workaround this. Since getting this fixed properly seems to be tricky and quite a lot of people are hitting this, this kernel build simply builds the realtek phy driver into the kernel by setting CONFIG_REALTEK_PHY=y, which will hopefully work around the probe issue.

So if you are hitting this bug, please give both the build from comment 96 a try to help Heiner debug this, as well as give the build from this comment a try. If setting CONFIG_REALTEK_PHY=y is a valid workaround then I will push that to the Fedora kernels for now as a workaround.

Comment 98 oli 2018-12-27 13:59:36 UTC

Created attachment 1517059 [details]
dmesg of 4.19.12-301.rhbz1650984_4.fc29.x86_64

Comment 99 oli 2018-12-27 14:01:04 UTC

Created attachment 1517072 [details]
dmesg of 4.19.12-301.rhbz1650984_3.fc29.x86_64

Comment 100 oli 2018-12-27 14:01:51 UTC

have to go now, sorry that i can not test anymore (until about middle of jan.). and thanks for your help/engagment!

Comment 101 Heiner Kallweit 2018-12-27 14:13:30 UTC

oli, thanks for testing! Results indicate that:
- building realtek phy driver module into the kernel instead of module avoids the issue
- the code path I mentioned in comment 94 is not related to the issue

Comment 102 Norbert Jurkeit 2018-12-27 14:22:20 UTC

The test kernels provided by Hans gave me following results:

From Comment 73, relocated device_initialize statement: NOT OK, although it did work with my self-built mainline kernel 4.19.10
From Comment 81, additional msleep statement: OK
From Comment 96, WARN statement: OK (what does this do anyway?)
From Comment 97, with built-in realtek phy driver: OK

I then also tried next stock Fedora kernel 4.19.12-301.fc29.x86_64 which is available in koji for some days and presumably contains none of the above patches. To my surprise it also works for me, i.e. network comes up after reboot from kernel 4.18.18.

It seems to be a matter of luck whether one of the patches works, is necessary at all or even breaks things.

Comment 103 Hans de Goede 2018-12-27 15:12:06 UTC

Heiner, since this seems to be a race condition triggering somewhere, building in the realtek-phy module seems like a reasonable workaround for now from a downstream (Fedora) perspective, do you agree?

For upstream it would be nice to eventually get to the bottom of this.

Comment 104 Heiner Kallweit 2018-12-27 15:39:49 UTC

(In reply to Hans de Goede from comment #103)
> Heiner, since this seems to be a race condition triggering somewhere,
> building in the realtek-phy module seems like a reasonable workaround for
> now from a downstream (Fedora) perspective, do you agree?
> 
> For upstream it would be nice to eventually get to the bottom of this.

Yes, building the realtek module in seems to avoid the issue. Fine with me for downstream.
Unfortunately on my systems I can't reproduce the problem. Root cause seems to be neither in r8169 nor phylib, but somewhere in the driver base core.
I'll come back once I have something else to test or a status update.

Comment 105 Hans de Goede 2018-12-27 16:25:21 UTC

(In reply to Heiner Kallweit from comment #104)
> Yes, building the realtek module in seems to avoid the issue. Fine with me
> for downstream.

Ok, I've added this change to the Fedora-kernel configs now. This will get picked up by the next official Fedora kernel build.

Comment 106 Heiner Kallweit 2018-12-27 19:57:47 UTC

Because I can't reproduce the issue here I need some further debug information from affected systems.
The following patch (applies to 4.19.12) emits quite some debug info. I hope the debug output overhead doesn't impact timings in a way that the issue doesn't occur any longer (similar to Heisenberg uncertainty principle). Please attach full dmesg output to this bug ticket (only if issue occurs and Generic PHY driver is loaded when it should not).

Hans, would be great if you could again build a test kernel for the ones who need a rpm to test. Thanks a lot.

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index edfc9f0b1..845b6019c 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -450,6 +450,8 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 	bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE) &&
 			   !drv->suppress_bind_attrs;
 
+	pr_info("hk: in %s\n", __func__);
+
 	if (defer_all_probes) {
 		/*
 		 * Value of defer_all_probes can be set only by
@@ -468,7 +470,7 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 		return ret;
 
 	atomic_inc(&probe_count);
-	pr_debug("bus: '%s': %s: probing driver %s with device %s\n",
+	pr_info("bus: '%s': %s: probing driver %s with device %s\n",
 		 drv->bus->name, __func__, drv->name, dev_name(dev));
 	WARN_ON(!list_empty(&dev->devres_head));
 
@@ -537,12 +539,14 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 	goto done;
 
 probe_failed:
+	pr_info("hk: really_probe probe_fail, ret = %d\n", ret);
 	dma_deconfigure(dev);
 dma_failed:
 	if (dev->bus)
 		blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
 					     BUS_NOTIFY_DRIVER_NOT_BOUND, dev);
 pinctrl_bind_failed:
+	pr_info("hk: really_probe pinctrl_fail, ret = %d\n", ret);
 	device_links_no_driver(dev);
 	devres_release_all(dev);
 	driver_sysfs_remove(dev);
@@ -736,6 +740,7 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
 		return -EBUSY;
 
 	ret = driver_match_device(drv, dev);
+	pr_info("hk: dev_match: drv = %s dev = %s, ret %d\n", drv->name, dev_name(dev), ret);
 	if (ret == 0) {
 		/* no match */
 		return 0;
@@ -748,6 +753,7 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
 	} /* ret > 0 means positive match */
 
 	async_allowed = driver_allows_async_probing(drv);
+	pr_info("hk: async_allowed = %d\n", async_allowed);
 
 	if (async_allowed)
 		data->have_async = true;
@@ -772,6 +778,7 @@ static void __device_attach_async_helper(void *_dev, async_cookie_t cookie)
 	if (dev->parent)
 		pm_runtime_get_sync(dev->parent);
 
+	pr_info("hk: call __device_attach_driver from __device_attach_async_helper\n");
 	bus_for_each_drv(dev->bus, NULL, &data, __device_attach_driver);
 	dev_dbg(dev, "async probe completed\n");
 
@@ -812,6 +819,7 @@ static int __device_attach(struct device *dev, bool allow_async)
 		if (dev->parent)
 			pm_runtime_get_sync(dev->parent);
 
+		pr_info("hk: call __device_attach_driver from __device_attach\n");
 		ret = bus_for_each_drv(dev->bus, NULL, &data,
 					__device_attach_driver);
 		if (!ret && allow_async && data.have_async) {
@@ -878,6 +886,7 @@ static int __driver_attach(struct device *dev, void *data)
 	 */
 
 	ret = driver_match_device(drv, dev);
+	pr_info("hk: dr_match: drv = %s dev = %s, ret %d\n", drv->name, dev_name(dev), ret);
 	if (ret == 0) {
 		/* no match */
 		return 0;
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 20d1be2b0..9349dea75 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -457,6 +457,8 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 	 */
 	request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT, MDIO_ID_ARGS(phy_id));
 
+	pr_info("hk: request_module returned\n");
+
 	device_initialize(&mdiodev->dev);
 
 	return dev;
@@ -1927,14 +1929,6 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner)
 	new_driver->mdiodrv.driver.remove = phy_remove;
 	new_driver->mdiodrv.driver.owner = owner;
 
-	/* The following works around an issue where the PHY driver doesn't bind
-	 * to the device, resulting in the genphy driver being used instead of
-	 * the dedicated driver. The root cause of the issue isn't known yet
-	 * and seems to be in the base driver core. Once this is fixed we may
-	 * remove this workaround.
-	 */
-	new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS;
-
 	retval = driver_register(&new_driver->mdiodrv.driver);
 	if (retval) {
 		pr_err("%s: Error %d in registering driver\n",
@@ -1943,7 +1937,7 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner)
 		return retval;
 	}
 
-	pr_debug("%s: Registered new driver\n", new_driver->name);
+	pr_info("%s: Registered new driver\n", new_driver->name);
 
 	return 0;
 }
-- 
2.20.1

Comment 107 Jeremy Cline 2018-12-28 04:49:14 UTC

(In reply to Heiner Kallweit from comment #106)
> Because I can't reproduce the issue here I need some further debug
> information from affected systems.
> The following patch (applies to 4.19.12) emits quite some debug info. I hope
> the debug output overhead doesn't impact timings in a way that the issue
> doesn't occur any longer (similar to Heisenberg uncertainty principle).
> Please attach full dmesg output to this bug ticket (only if issue occurs and
> Generic PHY driver is loaded when it should not).
> 
> Hans, would be great if you could again build a test kernel for the ones who
> need a rpm to test. Thanks a lot.
> 

I started a build with the debug patch and CONFIG_REALTEK_PHY=m at https://koji.fedoraproject.org/koji/taskinfo?taskID=31702356

Comment 108 Hans de Goede 2019-01-02 08:16:52 UTC

Note the workaround using CONFIG_REALTEK_PHY=y is available in the official Fedora kernel builds now, starting with the 4.19.13 builds, which are in update-testing atm.

Comment 109 Yanko Kaneti 2019-01-02 11:33:18 UTC

*** Bug 1662910 has been marked as a duplicate of this bug. ***

Comment 110 Hans de Goede 2019-01-02 12:47:19 UTC

*** Bug 1662862 has been marked as a duplicate of this bug. ***

Comment 111 Heiner Kallweit 2019-01-06 17:23:44 UTC

The following new proposal for a workaround is basically the same as described in comment 49, just as part of the driver code.
It ensures the the PHY driver module is loaded before r8169. I'd appreciate test results.


diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 7fe918026..c6b98024b 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -708,6 +708,7 @@ module_param(use_dac, int, 0);
 MODULE_PARM_DESC(use_dac, "Enable PCI DAC. Unsafe on 32 bit PCI slot.");
 module_param_named(debug, debug.msg_enable, int, 0);
 MODULE_PARM_DESC(debug, "Debug verbosity level (0=none, ..., 16=all)");
+MODULE_SOFTDEP("pre: realtek");
 MODULE_LICENSE("GPL");
 MODULE_FIRMWARE(FIRMWARE_8168D_1);
 MODULE_FIRMWARE(FIRMWARE_8168D_2);
-- 
2.20.1

Comment 112 Norbert Jurkeit 2019-01-07 17:14:17 UTC

I tested the patch from comment 111 with mainline kernels 4.19.10 and 4.19.13 using default setting CONFIG_REALTEK_PHY=m and it makes the issue vanish in both cases. However due to previous experience I am cautious to say that this patch provides the final fix.

My recent observation that the issue only occurs when the Ethernet link is powered down before boot of a 4.19.* kernel should also be mentioned here. According to the LAN port status LED of my Wifi router this power-down happens during reboot from kernel 4.18.18, but not during reboot from kernel 4.19.* or when the PC is initially switched on.

Comment 113 Marc Dionne 2019-01-07 17:35:03 UTC

After being unable to reproduce for a while, even with kernels that used to fail consistently, I ran into this again today while booting into a 5.0-rc1 kernel for the first time.

I experimented with MODULE_SOFTDEP as described in comment 111 back in december, and it seemed to help at the time; but that's still just another workaround that doesn't explain what's going on.

Another thing I should note is that in my case, the "realtek" module does NOT show up in the lsmod output when the network fails to start (some of the mails on the netdev thread suggest that's different for some users).

Comment 114 Heiner Kallweit 2019-01-07 18:40:17 UTC

(In reply to Marc Dionne from comment #113)
[...]
> 
> Another thing I should note is that in my case, the "realtek" module does
> NOT show up in the lsmod output when the network fails to start (some of the
> mails on the netdev thread suggest that's different for some users).

Thanks for the feedback! That's interesting. In such a case (network fails to start and "realtek" module isn't loaded) what's the value of /sys/class/net/<if>/phydev/phy_id ?
If it's the correct value, then request_module() in phy_device_create() seems to fail. And unfortunately we don't check the return value yet.
If you can reproduce the problem, could you add a debug output printing the return value of request_module() ? Or do you need a patch for it?

Comment 115 Heiner Kallweit 2019-01-07 19:12:05 UTC

(In reply to Heiner Kallweit from comment #114)
> (In reply to Marc Dionne from comment #113)
> [...]
> > 
> > Another thing I should note is that in my case, the "realtek" module does
> > NOT show up in the lsmod output when the network fails to start (some of the
> > mails on the netdev thread suggest that's different for some users).
> 
> Thanks for the feedback! That's interesting. In such a case (network fails
> to start and "realtek" module isn't loaded) what's the value of
> /sys/class/net/<if>/phydev/phy_id ?
> If it's the correct value, then request_module() in phy_device_create()
> seems to fail. And unfortunately we don't check the return value yet.
> If you can reproduce the problem, could you add a debug output printing the
> return value of request_module() ? Or do you need a patch for it?

To rule out that the module is loaded but the init function fails: Do you have any "Error xx in registering driver" message in the syslog?

Comment 116 Marc Dionne 2019-01-07 19:36:27 UTC

(In reply to Heiner Kallweit from comment #114)
> (In reply to Marc Dionne from comment #113)
> [...]
> > 
> > Another thing I should note is that in my case, the "realtek" module does
> > NOT show up in the lsmod output when the network fails to start (some of the
> > mails on the netdev thread suggest that's different for some users).
> 
> Thanks for the feedback! That's interesting. In such a case (network fails
> to start and "realtek" module isn't loaded) what's the value of
> /sys/class/net/<if>/phydev/phy_id ?
> If it's the correct value, then request_module() in phy_device_create()
> seems to fail. And unfortunately we don't check the return value yet.
> If you can reproduce the problem, could you add a debug output printing the
> return value of request_module() ? Or do you need a patch for it?

I can add some debugging, but unfortunately, after fixing the initial boot with a rmmod/modprobe of r8169, the machine now consistently boots with networking.  (btw the ID for my device is 0x1cc912)

There's obviously some other variable that affects the behaviour.  Looks like failure is more likely for a newly installed kernel, or booting into a different kernel than the previous boot.  After a successful boot that gets networking (or failed but manually fixed), the following boots pretty much always succeed, until something changes again.  Some possibilities would be the device state, although at least in my case, powering down completely doesn't seem to have any effect, or perhaps something in user space that would cache the dependency somewhere, helping subsequent boots.

Comment 117 Marc Dionne 2019-01-07 19:39:19 UTC

(In reply to Heiner Kallweit from comment #115)
> (In reply to Heiner Kallweit from comment #114)
> > (In reply to Marc Dionne from comment #113)
> > [...]
> > > 
> > > Another thing I should note is that in my case, the "realtek" module does
> > > NOT show up in the lsmod output when the network fails to start (some of the
> > > mails on the netdev thread suggest that's different for some users).
> > 
> > Thanks for the feedback! That's interesting. In such a case (network fails
> > to start and "realtek" module isn't loaded) what's the value of
> > /sys/class/net/<if>/phydev/phy_id ?
> > If it's the correct value, then request_module() in phy_device_create()
> > seems to fail. And unfortunately we don't check the return value yet.
> > If you can reproduce the problem, could you add a debug output printing the
> > return value of request_module() ? Or do you need a patch for it?
> 
> To rule out that the module is loaded but the init function fails: Do you
> have any "Error xx in registering driver" message in the syslog?

No there is no trace of such a message during a boot where the network failed to start.

Comment 118 Heiner Kallweit 2019-01-07 19:42:03 UTC

(In reply to Marc Dionne from comment #116)
> (In reply to Heiner Kallweit from comment #114)
> > (In reply to Marc Dionne from comment #113)
> > [...]
> > > 
> > > Another thing I should note is that in my case, the "realtek" module does
> > > NOT show up in the lsmod output when the network fails to start (some of the
> > > mails on the netdev thread suggest that's different for some users).
> > 
> > Thanks for the feedback! That's interesting. In such a case (network fails
> > to start and "realtek" module isn't loaded) what's the value of
> > /sys/class/net/<if>/phydev/phy_id ?
> > If it's the correct value, then request_module() in phy_device_create()
> > seems to fail. And unfortunately we don't check the return value yet.
> > If you can reproduce the problem, could you add a debug output printing the
> > return value of request_module() ? Or do you need a patch for it?
> 
> I can add some debugging, but unfortunately, after fixing the initial boot
> with a rmmod/modprobe of r8169, the machine now consistently boots with
> networking.  (btw the ID for my device is 0x1cc912)
> 
Background of the question was to rule out that for whatever reason a wrong PHY ID is read, thus preventing module realtek from being loaded.
Do you see this ID also in case of network failure?

Comment 119 oli 2019-01-07 19:46:21 UTC

From Friday evening (11.1) until sunday i can test new kernels - i am then back on the affected machine. 
If anyone wants me to test something, please give me the link to a .rpm i should test, then i will test it and report.

Comment 120 Marc Dionne 2019-01-07 19:54:40 UTC

(In reply to Heiner Kallweit from comment #118)
> (In reply to Marc Dionne from comment #116)
> > (In reply to Heiner Kallweit from comment #114)
> > > (In reply to Marc Dionne from comment #113)
> > > [...]
> > > > 
> > > > Another thing I should note is that in my case, the "realtek" module does
> > > > NOT show up in the lsmod output when the network fails to start (some of the
> > > > mails on the netdev thread suggest that's different for some users).
> > > 
> > > Thanks for the feedback! That's interesting. In such a case (network fails
> > > to start and "realtek" module isn't loaded) what's the value of
> > > /sys/class/net/<if>/phydev/phy_id ?
> > > If it's the correct value, then request_module() in phy_device_create()
> > > seems to fail. And unfortunately we don't check the return value yet.
> > > If you can reproduce the problem, could you add a debug output printing the
> > > return value of request_module() ? Or do you need a patch for it?
> > 
> > I can add some debugging, but unfortunately, after fixing the initial boot
> > with a rmmod/modprobe of r8169, the machine now consistently boots with
> > networking.  (btw the ID for my device is 0x1cc912)
> > 
> Background of the question was to rule out that for whatever reason a wrong
> PHY ID is read, thus preventing module realtek from being loaded.
> Do you see this ID also in case of network failure?

I won't be able to answer that, or see whether request_module has failed, until another failure occurs.  That's the "unfortunate" part.. :)

Comment 121 Heiner Kallweit 2019-01-08 21:58:35 UTC

(In reply to Heiner Kallweit from comment #111)
> The following new proposal for a workaround is basically the same as
> described in comment 49, just as part of the driver code.
> It ensures the the PHY driver module is loaded before r8169. I'd appreciate
> test results.
> 
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c
> b/drivers/net/ethernet/realtek/r8169.c
> index 7fe918026..c6b98024b 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -708,6 +708,7 @@ module_param(use_dac, int, 0);
>  MODULE_PARM_DESC(use_dac, "Enable PCI DAC. Unsafe on 32 bit PCI slot.");
>  module_param_named(debug, debug.msg_enable, int, 0);
>  MODULE_PARM_DESC(debug, "Debug verbosity level (0=none, ..., 16=all)");
> +MODULE_SOFTDEP("pre: realtek");
>  MODULE_LICENSE("GPL");
>  MODULE_FIRMWARE(FIRMWARE_8168D_1);
>  MODULE_FIRMWARE(FIRMWARE_8168D_2);
> -- 
> 2.20.1

This change has been committed to the "net" tree and is queued for stable, means it should be included in 5.0-rc2 and later in the stable kernel versions.

Comment 122 Daniele De Lorenzi 2019-01-10 07:47:59 UTC

I still have issue with the latest kernel 4.19.13-300.fc29 In my case the issue is slightly different from the described one, after the boot at random time (~10/15 min) I loose internet connectivity and I cannot even ping my default gateway (both LEDs on the NIC stop working).
I resolve with rmmod/modprobe of r8169.
I don't know if this can be a useful info (maybe this trigger the issue) but in my case the networking during the boot is slowly to start due to the STP of the switch that take the NIC port in learning state.

Comment 123 Marc Dionne 2019-01-15 14:44:05 UTC

(In reply to Heiner Kallweit from comment #118)
> (In reply to Marc Dionne from comment #116)
> > (In reply to Heiner Kallweit from comment #114)
> > > (In reply to Marc Dionne from comment #113)
> > > [...]
> > > > 
> > > > Another thing I should note is that in my case, the "realtek" module does
> > > > NOT show up in the lsmod output when the network fails to start (some of the
> > > > mails on the netdev thread suggest that's different for some users).
> > > 
> > > Thanks for the feedback! That's interesting. In such a case (network fails
> > > to start and "realtek" module isn't loaded) what's the value of
> > > /sys/class/net/<if>/phydev/phy_id ?
> > > If it's the correct value, then request_module() in phy_device_create()
> > > seems to fail. And unfortunately we don't check the return value yet.
> > > If you can reproduce the problem, could you add a debug output printing the
> > > return value of request_module() ? Or do you need a patch for it?
> > 
> > I can add some debugging, but unfortunately, after fixing the initial boot
> > with a rmmod/modprobe of r8169, the machine now consistently boots with
> > networking.  (btw the ID for my device is 0x1cc912)
> > 
> Background of the question was to rule out that for whatever reason a wrong
> PHY ID is read, thus preventing module realtek from being loaded.
> Do you see this ID also in case of network failure?

Just had another failure, after rebooting into a newly built kernel.  The correct ID is shown in /sys/class/net/<if>/phydev/phy_id, even when the device is associated with the generic PHY driver.

Marc

Comment 124 Heiner Kallweit 2019-01-15 21:22:59 UTC

Thanks for the update! Then it looks like request_module() sometimes fails, unfortunately we don't check the return code yet. I'll add this to the code.

Comment 125 Heiner Kallweit 2019-01-24 08:27:02 UTC

The change described in comment 111 is included in 4.20.4. Therefore it would be interesting to hear whether anybody still has the described issue with 4.20.4.

Comment 126 Hans de Goede 2019-01-24 09:31:25 UTC

(In reply to Heiner Kallweit from comment #125)
> The change described in comment 111 is included in 4.20.4. Therefore it
> would be interesting to hear whether anybody still has the described issue
> with 4.20.4.

Note testing this will require doing a manual build of 4.20.4 with CONFIG_REALTEK_PHY=m as Fedora currently sets that to 'y' to work around this issue.

Comment 127 Jeremy Cline 2019-01-24 11:12:18 UTC

(In reply to Hans de Goede from comment #126)
> (In reply to Heiner Kallweit from comment #125)
> > The change described in comment 111 is included in 4.20.4. Therefore it
> > would be interesting to hear whether anybody still has the described issue
> > with 4.20.4.
> 
> Note testing this will require doing a manual build of 4.20.4 with
> CONFIG_REALTEK_PHY=m as Fedora currently sets that to 'y' to work around
> this issue.

It actually looks like the config accidentally got flipped back to CONFIG_REALTEK_PHY=m with the rebase to v4.20.3, so folks can test the official v4.20.4 build.

Comment 128 oli 2019-01-26 00:36:54 UTC

I just tested 4.20.4-200.fc29.x86_64, seems ok to me - network works. but i can not guarantee that i can reproduce the problem, because since i rmmod the module a couple of weeks ago, since then it somehow works and i can not reproduce it even with a "broken" kernel which should lead to the behaviour.

Comment 129 oli 2019-01-26 00:37:22 UTC

Created attachment 1523606 [details]
dmesg for 4.20.4-200.fc29.x86_64

Comment 130 Norbert Jurkeit 2019-01-26 10:01:26 UTC

(In reply to Heiner Kallweit from comment #125)
> The change described in comment 111 is included in 4.20.4. Therefore it
> would be interesting to hear whether anybody still has the described issue
> with 4.20.4.

With stock Fedora kernel-4.20.3-200.fc29.x86_64 the interface does not come up after reboot from an old kernel which powers down the link while with kernel-4.20.4-200.fc29.x86_64 it does. So besides comment 112 this is another indication that this change might fix the issue for me.

Comment 131 Turgut Kalfaoglu 2019-02-01 01:57:30 UTC

Same issue on an Acer Nitro 5 laptop. Ethernet will not activate under Fedora 29 kernel 4.20.4-200.fc29.x86_64

Workaround did not work.

# dmesg |grep XID
[    3.696682] r8169 0000:02:00.1 eth0: RTL8411, 98:28:a6:28:f8:6a, XID 5c800800, IRQ 54
[  560.935157] r8169 0000:02:00.1 eth0: RTL8411, 98:28:a6:28:f8:6a, XID 5c800880, IRQ 54

# dmesg |grep "attached PHY"
[   26.900765] Generic PHY r8169-201:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-201:00, irq=IGNORE)
[  560.982478] Generic PHY r8169-201:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-201:00, irq=IGNORE)

# more /sys/class/net/enp2s0f1/phydev/phy_id
0x001cc800

Comment 132 Turgut Kalfaoglu 2019-02-01 02:02:40 UTC

Attached is what happens after I do a 
# rmmod r8169; modprobe r8169



Feb 01 05:01:06 tklaptop2 kernel: r8169 0000:02:00.1 enp2s0f1: Link is Down
Feb 01 05:01:06 tklaptop2 dhclient[4401]: receive_packet failed on enp2s0f1: Network is down
Feb 01 05:01:06 tklaptop2 avahi-daemon[977]: Interface enp2s0f1.IPv6 no longer relevant for mDNS.
Feb 01 05:01:06 tklaptop2 avahi-daemon[977]: Leaving mDNS multicast group on interface enp2s0f1.IPv6 with address fe80::7a2c:503c:2f9d:8238.
Feb 01 05:01:06 tklaptop2 avahi-daemon[977]: Withdrawing address record for fe80::7a2c:503c:2f9d:8238 on enp2s0f1.
Feb 01 05:01:06 tklaptop2 NetworkManager[1000]: <info>  [1548986466.5131] device (enp2s0f1): state change: ip-config -> unmanaged (reason 'removed', sys-iface-state: 'removed')
Feb 01 05:01:06 tklaptop2 NetworkManager[1000]: <info>  [1548986466.5455] dhcp4 (enp2s0f1): canceled DHCP transaction, DHCP client pid 4401
Feb 01 05:01:06 tklaptop2 NetworkManager[1000]: <info>  [1548986466.5455] dhcp4 (enp2s0f1): state changed unknown -> done
Feb 01 05:01:06 tklaptop2 kernel: libphy: r8169: probed
Feb 01 05:01:06 tklaptop2 kernel: r8169 0000:02:00.1 eth0: RTL8411, 98:28:a6:28:f8:6a, XID 5c800880, IRQ 54
Feb 01 05:01:06 tklaptop2 kernel: r8169 0000:02:00.1 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
Feb 01 05:01:06 tklaptop2 NetworkManager[1000]: <info>  [1548986466.5752] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/8)
Feb 01 05:01:06 tklaptop2 systemd-udevd[4782]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 01 05:01:06 tklaptop2 kernel: r8169 0000:02:00.1 enp2s0f1: renamed from eth0
Feb 01 05:01:06 tklaptop2 NetworkManager[1000]: <info>  [1548986466.6109] device (enp2s0f1): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Feb 01 05:01:06 tklaptop2 kernel: IPv6: ADDRCONF(NETDEV_UP): enp2s0f1: link is not ready
Feb 01 05:01:06 tklaptop2 kernel: Generic PHY r8169-201:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-201:00, irq=IGNORE)
Feb 01 05:01:06 tklaptop2 kernel: IPv6: ADDRCONF(NETDEV_UP): enp2s0f1: link is not ready
Feb 01 05:01:09 tklaptop2 ModemManager[987]: <info>  Couldn't check support for device '/sys/devices/pci0000:00/0000:00:01.6/0000:02:00.1': not supported by any plugin
Feb 01 05:01:10 tklaptop2 kernel: r8169 0000:02:00.1 enp2s0f1: Link is Up - 1Gbps/Full - flow control rx/tx
Feb 01 05:01:10 tklaptop2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f1: link becomes ready
Feb 01 05:01:10 tklaptop2 NetworkManager[1000]: <info>  [1548986470.1274] device (enp2s0f1): carrier: link connected
Feb 01 05:01:10 tklaptop2 NetworkManager[1000]: <info>  [1548986470.1281] device (enp2s0f1): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
Feb 01 05:01:10 tklaptop2 NetworkManager[1000]: <info>  [1548986470.1295] policy: auto-activating connection 'Wired connection 1' (30b5ba9f-544d-433f-92c5-4800918916e1)
Feb 01 05:01:10 tklaptop2 NetworkManager[1000]: <info>  [1548986470.1312] device (enp2s0f1): Activation: starting connection 'Wired connection 1' (30b5ba9f-544d-433f-92c5-4800918916e1)
Feb 01 05:01:10 tklaptop2 NetworkManager[1000]: <info>  [1548986470.1314] device (enp2s0f1): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Feb 01 05:01:10 tklaptop2 NetworkManager[1000]: <info>  [1548986470.1325] device (enp2s0f1): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Feb 01 05:01:10 tklaptop2 NetworkManager[1000]: <info>  [1548986470.1333] device (enp2s0f1): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Feb 01 05:01:10 tklaptop2 NetworkManager[1000]: <info>  [1548986470.1339] dhcp4 (enp2s0f1): activation: beginning transaction (timeout in 45 seconds)
Feb 01 05:01:10 tklaptop2 NetworkManager[1000]: <info>  [1548986470.1370] dhcp4 (enp2s0f1): dhclient started with pid 4788
Feb 01 05:01:10 tklaptop2 avahi-daemon[977]: Joining mDNS multicast group on interface enp2s0f1.IPv6 with address fe80::7a2c:503c:2f9d:8238.
Feb 01 05:01:10 tklaptop2 avahi-daemon[977]: New relevant interface enp2s0f1.IPv6 for mDNS.
Feb 01 05:01:10 tklaptop2 avahi-daemon[977]: Registering new address record for fe80::7a2c:503c:2f9d:8238 on enp2s0f1.*.
Feb 01 05:01:10 tklaptop2 dhclient[4788]: DHCPDISCOVER on enp2s0f1 to 255.255.255.255 port 67 interval 5 (xid=0xb2a3c01c)
Feb 01 05:01:15 tklaptop2 dhclient[4788]: DHCPDISCOVER on enp2s0f1 to 255.255.255.255 port 67 interval 13 (xid=0xb2a3c01c)
Feb 01 05:01:28 tklaptop2 dhclient[4788]: DHCPDISCOVER on enp2s0f1 to 255.255.255.255 port 67 interval 10 (xid=0xb2a3c01c)

(goes on waiting)

Comment 133 Turgut Kalfaoglu 2019-02-01 02:20:28 UTC

I just discovered something: 

If I reboot the machine, the ethernet cannot connect.
If I turn it OFF and then ON, it then connects!

I just did several reboots and off/on cycles to test it out.
I suspect an initialization step is missing in this card's driver that needs to be added or corrected.

Comment 134 Heiner Kallweit 2019-02-01 06:43:50 UTC

(In reply to Turgut Kalfaoglu from comment #131)
> Same issue on an Acer Nitro 5 laptop. Ethernet will not activate under
> Fedora 29 kernel 4.20.4-200.fc29.x86_64
> 
> Workaround did not work.
> 
> # dmesg |grep XID
> [    3.696682] r8169 0000:02:00.1 eth0: RTL8411, 98:28:a6:28:f8:6a, XID
> 5c800800, IRQ 54
> [  560.935157] r8169 0000:02:00.1 eth0: RTL8411, 98:28:a6:28:f8:6a, XID
> 5c800880, IRQ 54
> 
> # dmesg |grep "attached PHY"
> [   26.900765] Generic PHY r8169-201:00: attached PHY driver [Generic PHY]
> (mii_bus:phy_addr=r8169-201:00, irq=IGNORE)
> [  560.982478] Generic PHY r8169-201:00: attached PHY driver [Generic PHY]
> (mii_bus:phy_addr=r8169-201:00, irq=IGNORE)
> 
> # more /sys/class/net/enp2s0f1/phydev/phy_id
> 0x001cc800

No, not same issue. Symptoms are different and also root cause. For this PHY ID there is no dedicated driver, it's supposed to be used with the genphy driver.
This looks quite weird:
systemd-udevd[4782]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Apart from that link comes up normally.

What you could test to rule out an ASPM issue:
set kernel parameter pcie_aspm.policy=performance

Please create a separate bug ticket and add following info:
- full dmesg output from a cold boot and a reboot
- chip register diff (diff of "ethtool -d enp2s0f1")
- last working and first failing kernel version
- if it's a regression, a bisect would be much appreciated

Comment 135 Turgut Kalfaoglu 2019-02-01 10:29:07 UTC

Thank you very much -- I created a new ticket with the information you asked..

https://bugzilla.redhat.com/show_bug.cgi?id=1671691

Comment 136 old_man999 2019-02-17 22:10:30 UTC

All I see at the moment is that https://github.com/fintecheando/r8168 works with recent 4.20 kernel while r8169 still fails.
(VENDOR: 10ec DEVICE: 8168)

Comment 137 old_man999 2019-02-17 22:18:04 UTC

kernel: r8169 0000:26:00.0 eth0: RTL8168e/8111e, , XID 2c200000, IRQ 37

Comment 138 Ionut Radu 2019-02-23 17:24:02 UTC

My realtek NIC is not functioning after each kernel update, then it will eventually function until the next kernel update

04:00.1 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 12)

I see some errors in the journal:

Feb 22 23:39:12 mendol4 kernel: ------------[ cut here ]------------
Feb 22 23:39:12 mendol4 kernel: NETDEV WATCHDOG: enp4s0f1 (r8169): transmit queue 0 timed out
Feb 22 23:39:12 mendol4 kernel: WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x1f3/0x200
Feb 22 23:39:12 mendol4 kernel: Modules linked in: ccm fuse rfcomm bnep binfmt_misc arc4 sunrpc ath10k_pci ath10k_core intel_rapl snd_hda_codec_hdmi snd_soc_skl x86_pkg>
Feb 22 23:39:12 mendol4 kernel:  intel_lpss elan_i2c acpi_pad pcc_cpufreq i915 nouveau kvmgt mdev vfio rtsx_pci_sdmmc mmc_core kvm mxm_wmi ttm irqbypass i2c_algo_bit cr>
Feb 22 23:39:12 mendol4 kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.20.10-200.fc29.x86_64 #1
Feb 22 23:39:12 mendol4 kernel: Hardware name: Acer Aspire E5-575G/Ironman_SK, BIOS V1.47 09/06/2018
Feb 22 23:39:12 mendol4 kernel: RIP: 0010:dev_watchdog+0x1f3/0x200
Feb 22 23:39:12 mendol4 kernel: Code: 00 48 63 4d e0 eb 93 4c 89 e7 c6 05 10 8c b2 00 01 e8 01 86 fc ff 89 d9 4c 89 e6 48 c7 c7 b0 77 18 96 48 89 c2 e8 97 60 8a ff <0f>>
Feb 22 23:39:12 mendol4 kernel: RSP: 0018:ffff998b26a03e98 EFLAGS: 00010282
Feb 22 23:39:12 mendol4 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000083f
Feb 22 23:39:12 mendol4 kernel: RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
Feb 22 23:39:12 mendol4 kernel: RBP: ffff998b236de480 R08: 0000000000000050 R09: 0000000000000003
Feb 22 23:39:12 mendol4 kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff998b236de000
Feb 22 23:39:12 mendol4 kernel: R13: 0000000000000000 R14: ffff998b26a03ee8 R15: 0000000000000000
Feb 22 23:39:12 mendol4 kernel: FS:  0000000000000000(0000) GS:ffff998b26a00000(0000) knlGS:0000000000000000
Feb 22 23:39:12 mendol4 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 22 23:39:12 mendol4 kernel: CR2: 00007fc2e6b96000 CR3: 000000002920a006 CR4: 00000000003606f0
Feb 22 23:39:12 mendol4 kernel: Call Trace:
Feb 22 23:39:12 mendol4 kernel:  <IRQ>
Feb 22 23:39:12 mendol4 kernel:  ? pfifo_fast_dequeue+0x160/0x160
Feb 22 23:39:12 mendol4 kernel:  call_timer_fn+0x2b/0x130
Feb 22 23:39:12 mendol4 kernel:  run_timer_softirq+0x3ad/0x3e0
Feb 22 23:39:12 mendol4 kernel:  ? tick_sched_timer+0x37/0x70
Feb 22 23:39:12 mendol4 kernel:  ? __hrtimer_run_queues+0x110/0x280
Feb 22 23:39:12 mendol4 kernel:  __do_softirq+0xe3/0x30a
Feb 22 23:39:12 mendol4 kernel:  ? ktime_get+0x36/0xa0
Feb 22 23:39:12 mendol4 kernel:  irq_exit+0x100/0x110
Feb 22 23:39:12 mendol4 kernel:  smp_apic_timer_interrupt+0x74/0x140
Feb 22 23:39:12 mendol4 kernel:  apic_timer_interrupt+0xf/0x20
Feb 22 23:39:12 mendol4 kernel:  </IRQ>
Feb 22 23:39:12 mendol4 kernel: RIP: 0010:cpuidle_enter_state+0xb9/0x330
Feb 22 23:39:12 mendol4 kernel: Code: e8 5c c2 96 ff 80 7c 24 0b 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 4c 02 00 00 31 ff e8 8e 82 9c ff fb 66 0f 1f 44 00 00 <85>>
Feb 22 23:39:12 mendol4 kernel: RSP: 0018:ffffffff96203e80 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
lines 6438-6475

Comment 139 Steven Mercurio 2019-07-03 23:16:05 UTC

This is sort of related.  I have a dual 1G port express card that also is a 8111E/8168 chipset (Startech EC20005).  I am on FC28 kernel 5.0.16.

Any chance there is any drivers I can test to see if they work?   Right now the card only recognizes one port using r8169.  It saw 2 very briefly but not any longer.

Comment 140 Heiner Kallweit 2019-07-04 07:58:57 UTC

(In reply to Steven Mercurio from comment #139)
> This is sort of related.  I have a dual 1G port express card that also is a
> 8111E/8168 chipset (Startech EC20005).  I am on FC28 kernel 5.0.16.
> 
> Any chance there is any drivers I can test to see if they work?   Right now
> the card only recognizes one port using r8169.  It saw 2 very briefly but
> not any longer.

This isn't related at all and hijacking a bug for generic questions isn't really appreciated. In your case I suppose it's about the PCIe switch on the card.
And how is somebody supposed to say anything w/o at least some more detailed information like verbose lspci output and full syslog.

Comment 141 Norbert Jurkeit 2019-07-04 09:19:58 UTC

I don't have an overview who might still be affected by this bug but in my case it has not occurred with any Fedora kernel since 4.20.4, probably due to commit "r8169: load Realtek PHY driver module before r8169".

Many thanks to Heiner and all others who worked on this!

Comment 142 Justin M. Forbes 2019-08-20 17:46:23 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 5.2.9-100.fc29.  Please test this kernel update (or newer) and let us know if you issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.

Comment 143 Norbert Jurkeit 2019-08-21 09:40:53 UTC

(In reply to Justin M. Forbes from comment #142)

As already stated in comment 141, I haven't encountered the issue since Fedora kernel 4.20.4, i.e the network interface is always initialised correctly, even if it was powered down during reboot. This also applies to the latest available kernels 5.2.7-100.fc29 (currently in updates) and 5.2.9-100.fc29 (currently in updates-testing).

As far as I know the RTL8111 chip has been manufactured over many years in different variants and I can speak only for mine which is now 10 years old. Unless issues persist for other users this bug can be closed.

Comment 144 Justin M. Forbes 2019-08-21 13:35:53 UTC

Thanks for the update.

Note You need to log in before you can comment on or make changes to this bug.

airlied
alexvillacislasso
amfrazee
bskeggs
dark-vex
eliadevito
ewk
fedora-kernel-ethernet-realtek
frank
fukidid
gregorio.gervasio
hdegoede
hkallweit1
ichavero
itamar
jarodwilson
jcline
jforbes
jglisse
john.j5live
jonathan
josef
joshuaconley
kernel-maint
lapotinval
linville
malc996
marc.c.dionne
mchehab
michael.wiktowy
mjg59
norbert.jurkeit
old_man999
olze
russell
smercuri
steved
tritri24
turgut
yaneti