Bug 1650984 - RTL8111 NIC not working with kernel 4.19.2
Summary: RTL8111 NIC not working with kernel 4.19.2
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel (Show other bugs)
(Show other bugs)
Version: 29
Hardware: x86_64 Unspecified
unspecified
medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Keywords: Reopened
: 1654736 1654858 1657458 1661680 1661943 1662862 1662910 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-11-18 13:26 UTC by Norbert Jurkeit
Modified: 2019-01-15 21:22 UTC (History)
37 users (show)

Fixed In Version: kernel-4.19.6-300.fc29 kernel-4.19.6-200.fc28
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-06 03:56:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
journalctl of 4.19.6-200 when generic PHY refuses to work (100.99 KB, text/plain)
2018-12-10 03:08 UTC, Alex Villacís Lasso
no flags Details
journalctl of 4.19.6-200 when generic PHY works (94.62 KB, text/plain)
2018-12-10 03:09 UTC, Alex Villacís Lasso
no flags Details
lsmod on failed startup (6.41 KB, text/plain)
2018-12-15 11:30 UTC, Frank Crawford
no flags Details
lsmod following r8169 reload (6.41 KB, text/plain)
2018-12-15 11:32 UTC, Frank Crawford
no flags Details
dmesg of newest kernel with load/unload cycle (68.32 KB, text/plain)
2018-12-25 14:52 UTC, oli
no flags Details
dmesg after updating to kernel 4.19.12-301.rhbz1650984_2.fc29.x86_64 (68.16 KB, text/plain)
2018-12-26 09:02 UTC, oli
no flags Details
dmesg-4.19.10-300.fc29.x86_64 (5.99 KB, text/plain)
2018-12-26 11:08 UTC, Elia Devito
no flags Details
dmesg-4.19.10-300.fc29.x86_64 (4.37 KB, text/plain)
2018-12-26 11:09 UTC, Elia Devito
no flags Details
dmesg-4.19.12-301.rhbz1650984.fc29.x86_64 (4.37 KB, text/plain)
2018-12-26 11:10 UTC, Elia Devito
no flags Details
dmesg-4.19.12-301.rhbz1650984.fc29.x86_64 (6.81 KB, text/plain)
2018-12-26 11:10 UTC, Elia Devito
no flags Details
dmesg of 4.19.12-301.rhbz1650984_4.fc29.x86_64 (68.46 KB, text/plain)
2018-12-27 13:59 UTC, oli
no flags Details
dmesg of 4.19.12-301.rhbz1650984_3.fc29.x86_64 (68.00 KB, text/plain)
2018-12-27 14:01 UTC, oli
no flags Details

Description Norbert Jurkeit 2018-11-18 13:26:24 UTC
My desktop PC has 2 Ethernet interfaces, the first one located on the motherboard and listed by lshw as "RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller" (RTL8111DL according to motherboard user manual) and the second located on a PCI card listed as "RTL8169 PCI Gigabit Ethernet Controller". Both interfaces use driver r8169.

After upgrade to kernel-4.19.2-300.fc29.x86_64 the first interface no longer works. According to ifconfig the interface is up, but not running and both LEDs besides the connector stay off. As a result, NetworkManager cannot establish a connection. The systemd journal does not contain any suspicious messages.

When I choose previous kernel 4.18.18 from Grub menu everything works fine again, so a hardware damage is unlikely.

This bug is very similar to bug 1628202 except that now the first NIC is affected. Seems that driver r8169 is broken again.

Comment 1 Michael Wiktowy 2018-11-20 14:56:26 UTC
I have the identical problem with the same network controller with the latest kernel upgrade. I only have one NIC on motherboard though.

Booting to kernel.x86_64 4.19.2-300.fc29 does not work.
Reverting to kernel.x86_64 4.18.18-300.fc29 works again.

'lspci | grep net' output:
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 01)

Logs don't indicate anything with kernel 4.19 other than:

Nov 18 21:36:05 desktop kernel: libphy: r8169: probed
Nov 18 21:36:05 desktop kernel: r8169 0000:03:00.0 eth0: RTL8168b/8111b, 00:24:8c:0d:92:d1, XID 38000000, IRQ 24
Nov 18 21:36:05 desktop kernel: r8169 0000:03:00.0 eth0: jumbo features [frames: 4080 bytes, tx checksumming: ko]

Where as NIC module load with 4.18 kernel is:

Nov 17 21:34:18 desktop kernel: r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
Nov 17 21:34:18 desktop kernel: r8169 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control
Nov 17 21:34:18 desktop kernel: r8169 0000:03:00.0 eth0: RTL8168b/8111b, 00:24:8c:0d:92:d1, XID 38000000, IRQ 24
Nov 17 21:34:18 desktop kernel: r8169 0000:03:00.0 eth0: jumbo features [frames: 4080 bytes, tx checksumming: ko]

Comment 2 Jeremy Cline 2018-11-20 16:24:46 UTC
It looks like there were a number of changes in v4.19 for r8169. To start with, please attach the complete dmesg output for each boot.

It would be very helpful if you bisected[0] the kernel between v4.18 and v4.19 to figure out which change introduced the regression.

[0] https://docs.fedoraproject.org/en-US/quick-docs/kernel/troubleshooting/index.html#_bisecting_the_kernel

Comment 3 Marc Dionne 2018-11-20 17:25:15 UTC
I have the same issue on an old machine that needs the r8169 driver.

The problem is that new r8169 driver in 4.19 has an indirect dependency on the "realtek" phy driver that's not encoded anywhere, which is probably an upstream kernel bug.  For my custom kernel I work around it by building 'realtek' in the kernel rather than as a module (CONFIG_REALTEK_PHY=y).

Once the machine is booted, the network can be enabled by loading realtek and reloading the r8169 module:

$ modprobe realtek
$ rmmod r8169
$ modprobe r8169

Comment 4 Norbert Jurkeit 2018-11-20 18:15:26 UTC
@Marc: The commands listed by you also work for me. Many thanks for that!

@Jeremy: As Marc was already able to pinpoint the fault you probably no longer need dmesg outputs or bisect. Please let me know if I should provide them, though (bisect may take some time however as I haven't done this before).

Comment 5 Jeremy Cline 2018-11-20 19:16:18 UTC
Norbert, Marc,

I recommend reporting this to upstream. The folks and lists to email should be:

* Realtek linux nic maintainers <nic_swsd@realtek.com>
* "David S. Miller" <davem@davemloft.net>
* netdev@vger.kernel.org
* linux-kernel@vger.kernel.org

Comment 6 Michael Wiktowy 2018-11-21 01:36:10 UTC
As a more permanent temporary work-around:
'echo realtek > /etc/modules-load.d/realtek.conf'
causes systemd to load the module on boot.

Comment 7 Norbert Jurkeit 2018-11-21 10:01:06 UTC
That's strange... I did some more tests (without the proposed workarounds) and noticed that the fault only occurs after reboot from a previous session with kernel 4.18.18. It does not occur (at least for me) after power-on and in subsequent reboots of kernel 4.19.2. I also noticed that while the fault is present sometimes it is sufficient to issue either "modprobe realtek" or "rmmod r8169; modprobe r8169". Obviously the behavior is erratic as if some registers were not fully initialized by the new driver.

Anyway, I will write an email to the proposed list and ask them to take care of the issue.

Comment 8 ikixxx@gmail.com 2018-11-21 11:28:49 UTC
I have the same problem (02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 01))

In my case the fault is always there, cold boot or reboot it doesn't matter.

Proposed workarounds work.

Comment 9 Michael Wiktowy 2018-11-21 17:12:08 UTC
I've had similar intermittent functioning too. When I made the realtek.conf systemd loader, I removed it to see if the NIC would stay down and to my surprise, it stayed working. After rebooting into Windows it stopped working again so I put the conf file back in place. So it would appear that whatever functioning the realtek module enables is somewhat sticky.

Comment 10 Heiner Kallweit 2018-11-21 20:07:08 UTC
The issue seems to affect certain chip-versions / systems only. So I'd like to get an overview. Could everybody facing this issue please provide the following info:

- dmesg line with the chip XID (just grep for "XID")
- dmesg line with the used PHY driver (grep for "attached PHY driver")
  (if the error occurs and when loading the realtek module manually)
- PHY ID (value of /sys/class/net/<ifname>/phydev/phy_id)

Comment 11 Michael Wiktowy 2018-11-21 20:15:34 UTC
This is for a system where the network is working due to systemd manual realtek module load above:

[root@desktop ~]# dmesg |grep XID
[    3.448502] r8169 0000:03:00.0 eth0: RTL8168b/8111b, 00:24:8c:0d:92:d1, XID 38000000, IRQ 24

[root@desktop ~]# dmesg |grep "attached PHY driver"
[    9.640822] RTL8211B Gigabit Ethernet r8169-300:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-300:00, irq=IGNORE)

[root@desktop ~]# cat /sys/class/net/enp3s0/phydev/phy_id 
0x001cc912

Like was indicated, even after removing this conf file, the fix is persistent through warm and cold reboots. I will try booting to another kernel/OS to trigger it to not work again and see if there is any difference with above dmesg commands.

Comment 12 Marc Dionne 2018-11-21 20:39:41 UTC
For my case:

[    2.385162] r8169 0000:06:00.0 eth0: RTL8168c/8111c, 00:24:21:32:03:4b, XID 3c4000c0, IRQ 24

When the network failed to start:

[    6.773503] Generic PHY r8169-600:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)

but later on after manually loading the 'realtek' driver:

[15710.839318] RTL8211B Gigabit Ethernet r8169-600:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)

/sys/class/net/enp3s0/phydev/phy_id is 0x001cc912, but that's after manually loading 'realtek', so I suspect it would be different at boot when in the failed state.

Comment 13 Norbert Jurkeit 2018-11-21 21:49:07 UTC
(In reply to Heiner Kallweit from comment #10)

> - dmesg line with the chip XID (just grep for "XID")

[    3.580112] r8169 0000:01:00.0 eth0: RTL8168d/8111d, 00:25:22:06:b2:1c, XID 281000c0, IRQ 30
[    3.585926] r8169 0000:06:00.0 eth1: RTL8169sb/8110sb, 00:11:6b:97:15:b3, XID 10000000, IRQ 21

Please note that my machine has 2 Ethernet interfaces. The first one is affected by this issue, the second works fine.


> - dmesg line with the used PHY driver (grep for "attached PHY driver")

When the error does NOT occur (after cold start in my case):

[   27.896411] Generic PHY r8169-100:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)
[   28.032308] Generic PHY r8169-600:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)

>   (if the error occurs and when loading the realtek module manually)

[   24.084694] Generic PHY r8169-100:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)
[   24.221617] Generic PHY r8169-600:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)
[  501.749429] RTL8211B Gigabit Ethernet r8169-100:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)
[  501.939784] RTL8211C Gigabit Ethernet r8169-600:00: attached PHY driver [RTL8211C Gigabit Ethernet] (mii_bus:phy_addr=r8169-600:00, irq=IGNORE)

When the error occurs I can also restore proper operation with just "rmmod r8169; modprobe r8169" without "modprobe realtek". But even then the 2 extra entries for RTL8211B and RTL8211C appear.


> - PHY ID (value of /sys/class/net/<ifname>/phydev/phy_id)

0x001cc912


Unfortunately no other OS besides Fedora is installed on my machine, but according to Comment 9 I would expect that booting Windows before kernel 4.19.2 also invokes the issue.

Comment 14 Michael Wiktowy 2018-11-22 04:46:23 UTC
To contrast with my working config in Comment 11, FWIW, here are those dmesg outputs after removing manual module loading work-around, rebooting into dual-boot Windows and then booting back and having networking fail:

[root@desktop ~]# dmesg |grep XID
[    3.465029] r8169 0000:03:00.0 eth0: RTL8168b/8111b, 00:24:8c:0d:92:d1, XID 38000000, IRQ 24
[root@desktop ~]# dmesg |grep "attached PHY driver"
[    9.753340] Generic PHY r8169-300:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-300:00, irq=IGNORE)
[root@desktop ~]# cat /sys/class/net/enp3s0/phydev/phy_id 
0x001cc912

Comment 15 Heiner Kallweit 2018-11-22 20:34:58 UTC
The following seems to fix the issue. At least it works for Marc. If some of you also have the option to compile an own kernel and test, this would be appreciated.

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 55202a0ac..aeccb2323 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -2254,6 +2254,7 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner)
        new_driver->mdiodrv.driver.probe = phy_probe;
        new_driver->mdiodrv.driver.remove = phy_remove;
        new_driver->mdiodrv.driver.owner = owner;
+       new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS;

        retval = driver_register(&new_driver->mdiodrv.driver);
        if (retval) {

Comment 16 Norbert Jurkeit 2018-11-23 17:57:29 UTC
(In reply to Heiner Kallweit from comment #15)

This patch also works for me.

I managed to build a Fedora kernel of version 4.19.3 first without, then with the patch applied. As expected the issue is also present in the unpatched kernel 4.19.3, but is fixed for me by including the additional instruction (i.e. the RTL8111 based interface not only comes up correctly after a cold start, but also after reboot from older kernel 4.18.18).

Many thanks for finding a solution so quickly!

Comment 17 Heiner Kallweit 2018-11-29 11:54:35 UTC
Fixed with commit c85ddecae6e5 ("net: phy: add workaround for issue where PHY driver doesn't bind to the device"). It's available in linux-next and it's expected to be included in 4.19.6.

Comment 18 Jeremy Cline 2018-11-29 22:12:28 UTC
It doesn't look to have made it into v4.19.6, so I'll add it as a patch for the Fedora v4.19.6 build. Thanks!

Comment 19 Jeremy Cline 2018-11-30 21:04:57 UTC
*** Bug 1654736 has been marked as a duplicate of this bug. ***

Comment 20 Fedora Update System 2018-12-03 18:51:42 UTC
kernel-tools-4.19.6-300.fc29 kernel-headers-4.19.6-300.fc29 kernel-4.19.6-300.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2018-ce033b4abb

Comment 21 Fedora Update System 2018-12-03 18:52:55 UTC
kernel-headers-4.19.6-200.fc28 kernel-tools-4.19.6-200.fc28 kernel-4.19.6-200.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-18c53c4a64

Comment 22 Fedora Update System 2018-12-04 03:38:14 UTC
kernel-4.19.6-200.fc28, kernel-headers-4.19.6-200.fc28, kernel-tools-4.19.6-200.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-18c53c4a64

Comment 23 Fedora Update System 2018-12-04 05:02:30 UTC
kernel-4.19.6-300.fc29, kernel-headers-4.19.6-300.fc29, kernel-tools-4.19.6-300.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-ce033b4abb

Comment 24 Norbert Jurkeit 2018-12-04 19:39:41 UTC
I hate to say this, but neither kernel-4.19.6-300.fc29.x86_64 nor kernel-4.19.6-200.fc28.x86_64 from updates-testing fixes the issue for me.

I then built a kernel the same way as 2 weeks ago using following instructions:

$ fedpkg clone -a kernel
$ cd kernel
$ git checkout -b my_kernel origin/f29

In file kernel.spec replace
    # define builid .local
by
    ## define buildid .local
    %define buildid .my_kernel

$ fedpkg local

This time however I didn't need to add the patch from Comment 15 as it was already in place.

The produced kernel-4.19.6-300.my_kernel.fc29.x86_64 does fix the issue, similar to kernel-4.19.3-300.my_kernel.fc29.x86_64 before.

Did probably something go wrong in koji build or are different settings used there?

Comment 25 Jeremy Cline 2018-12-04 20:33:17 UTC
Hmm. There's no configuration difference, Koji builds from the same repository. The build logs look fine to me, too.

If you repeatedly boot into the updates-testing kernel does it fail every time? When you repeatedly boot into your locally built kernel does it work every time?

Comment 26 Norbert Jurkeit 2018-12-04 21:51:44 UTC
(In reply to Jeremy Cline from comment #25)
> If you repeatedly boot into the updates-testing kernel does it fail every
> time? When you repeatedly boot into your locally built kernel does it work
> every time?

Yes and yes. So far I rebooted 3 times into both kernels each from a session with kernel 4.18.18 and in all cases the RTL8111 interface came up with the self-compiled kernel but never with the kernel from updates-testing.

Are perhaps different gcc versions in use? Mine is currently gcc-8.2.1-5.fc29.x86_64.rpm. This should not make any difference, but I remember a case where a major version upgrade caused trouble with LVM2.

Comment 27 Fedora Update System 2018-12-06 03:56:56 UTC
kernel-4.19.6-300.fc29, kernel-headers-4.19.6-300.fc29, kernel-tools-4.19.6-300.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

Comment 28 Jeremy Cline 2018-12-06 14:36:46 UTC
(In reply to Norbert Jurkeit from comment #26)
> (In reply to Jeremy Cline from comment #25)
> > If you repeatedly boot into the updates-testing kernel does it fail every
> > time? When you repeatedly boot into your locally built kernel does it work
> > every time?
> 
> Yes and yes. So far I rebooted 3 times into both kernels each from a session
> with kernel 4.18.18 and in all cases the RTL8111 interface came up with the
> self-compiled kernel but never with the kernel from updates-testing.
> 
> Are perhaps different gcc versions in use? Mine is currently
> gcc-8.2.1-5.fc29.x86_64.rpm. This should not make any difference, but I
> remember a case where a major version upgrade caused trouble with LVM2.

That kernel was built with the same version of gcc, so it's not that. I'm really not sure what could cause the discrepancy. Does v4.19.7 (just submitted to updates-testing) exhibit the same behavior?

Comment 29 Norbert Jurkeit 2018-12-06 16:29:13 UTC
(In reply to Jeremy Cline from comment #28)

> That kernel was built with the same version of gcc, so it's not that. I'm
> really not sure what could cause the discrepancy. Does v4.19.7 (just
> submitted to updates-testing) exhibit the same behavior?

kernel-4.19.7-300.fc29.x86_64.rpm just downloaded from koji neither fixes the issue for me.

I would appreciate to hear from other users hit by this issue whether the provided compiled kernels work for them. I also have no clue and would like to know what makes my self-compiled kernel behave differently, but if the issue is fixed for everyone else it is not worth to put more effort into this.

Comment 30 Marc Dionne 2018-12-06 16:52:15 UTC
FWIW I've been getting network reliably with kernel-4.19.6-300.fc29 from the repositories, as well as with locally built kernels from mainline.

Comment 31 rune 2018-12-06 19:10:37 UTC
Using kernel 19.6-300.fc29.x86_64 from the stable repo without any workarounds (NIC does not work, so not fixed for me):

# dmesg |grep XID
[    5.136382] r8169 0000:01:00.0 eth0: RTL8168d/8111d, 00:25:22:2f:bb:9c, XID 281000c0, IRQ 31

# dmesg |grep "attached PHY driver"
[   23.020448] Generic PHY r8169-100:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)

After adding realtek to modules-load.d and a reboot (NIC now works): 

# dmesg |grep XID
[    5.075131] r8169 0000:01:00.0 eth0: RTL8168d/8111d, 00:25:22:2f:bb:9c, XID 281000c0, IRQ 31

# dmesg |grep "attached PHY driver"
[   48.382706] RTL8211B Gigabit Ethernet r8169-100:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)
   
# cat /sys/class/net/enp1s0/phydev/phy_id 
0x001cc912

Comment 32 Fedora Update System 2018-12-07 02:14:27 UTC
kernel-4.19.6-200.fc28, kernel-headers-4.19.6-200.fc28, kernel-tools-4.19.6-200.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.

Comment 33 Alex Villacís Lasso 2018-12-10 03:07:18 UTC
I have a Fedora 28 machine that is also affected by this bug. Updating to 4.19.6-200 does NOT fix the bug.

I have noticed that this bug is unreliable - sometimes the generic PHY driver works, sometimes it does not. When it did not work, unloading and reloading r8169 managed to make it work.

Comment 34 Alex Villacís Lasso 2018-12-10 03:08 UTC
Created attachment 1512939 [details]
journalctl of 4.19.6-200 when generic PHY refuses to work

Comment 35 Alex Villacís Lasso 2018-12-10 03:09 UTC
Created attachment 1512940 [details]
journalctl of 4.19.6-200 when generic PHY works

Comment 36 Heiner Kallweit 2018-12-10 10:33:35 UTC
The patch as mentioned in comment 15 is included in mainline since 4.19.7. With the patch the dedicated PHY driver (RTL8211B) should be loaded immediately.
Possible would be also that the fix does fix the issue for most people, but not for all. But this seems to be less likely than a build issue with the Fedora-built kernel, see comment 24.
Maybe affected people could try to build a mainline 4.19.8 (latest available version) kernel and test with this one.

Comment 37 Frank Crawford 2018-12-11 12:19:27 UTC
Unfortunately the fix also doesn't correct the issue for me either.  I've tried both kernel-4.19.6-200.fc28.x86_64 and kernel-4.19.7-200.fc28.x86_64.

I do have a slight difference to the other users in that I have two NICs in my system, both using the r8169 module, one of which works, the other of which doesn't.  They are slightly different models, so that is the likely issue.

The output from journalctl for the failed boot was:

Dec 11 22:46:32 bits.crawford.emu.id.au kernel: libphy: r8169: probed
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 eth0: RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 38000000, IRQ 28
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 eth0: jumbo features [frames: 4080 bytes, tx checksumming: ko]
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: libphy: r8169: probed
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 eth1: RTL8168evl/8111evl, 50:e5:49:45:87:2a, XID 2c900800, IRQ 29
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 eth1: jumbo features [frames: 9200 bytes, tx checksumming: ko]
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 enp5s0: renamed from eth0
Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 enp7s0: renamed from eth1
Dec 11 22:46:38 bits.crawford.emu.id.au kernel: Generic PHY r8169-500:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
Dec 11 22:46:38 bits.crawford.emu.id.au kernel: Generic PHY r8169-700:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
Dec 11 22:46:42 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control rx/tx

I haven't yet had a chance to try unloading and reloading the r8169 modules, but will try that in the next few days.

Comment 38 Heiner Kallweit 2018-12-11 16:15:28 UTC
(In reply to Frank Crawford from comment #37)
> Unfortunately the fix also doesn't correct the issue for me either.  I've
> tried both kernel-4.19.6-200.fc28.x86_64 and kernel-4.19.7-200.fc28.x86_64.
> 
This doesn't rule out there's a build issue with the Fedora kernel. Could you build a mainline kernel or build a kernel as described in comment 24 and test?

> I do have a slight difference to the other users in that I have two NICs in
> my system, both using the r8169 module, one of which works, the other of
> which doesn't.  They are slightly different models, so that is the likely
> issue.
> 
> The output from journalctl for the failed boot was:
> 
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: libphy: r8169: probed
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 eth0:
> RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 38000000, IRQ 28
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 eth0:
> jumbo features [frames: 4080 bytes, tx checksumming: ko]
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: libphy: r8169: probed
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 eth1:
> RTL8168evl/8111evl, 50:e5:49:45:87:2a, XID 2c900800, IRQ 29
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 eth1:
> jumbo features [frames: 9200 bytes, tx checksumming: ko]
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:05:00.0 enp5s0:
> renamed from eth0
> Dec 11 22:46:32 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 enp7s0:
> renamed from eth1
> Dec 11 22:46:38 bits.crawford.emu.id.au kernel: Generic PHY r8169-500:00:
> attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
> Dec 11 22:46:38 bits.crawford.emu.id.au kernel: Generic PHY r8169-700:00:
> attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
> Dec 11 22:46:42 bits.crawford.emu.id.au kernel: r8169 0000:07:00.0 enp7s0:
> Link is Up - 1Gbps/Full - flow control rx/tx
> 
RTL8168evl has an integrated RTL8211E PHY which is compatible with genphy. Therefore it works even if genphy is used instead of the dedicated RTL8211E PHY driver.
RTL8168B has an integrated RTL8211B PHY that causes problems when used with the genphy driver.

> I haven't yet had a chance to try unloading and reloading the r8169 modules,
> but will try that in the next few days.

Comment 39 Frank Crawford 2018-12-12 04:19:14 UTC
(In reply to Heiner Kallweit from comment #38)
> (In reply to Frank Crawford from comment #37)
> > Unfortunately the fix also doesn't correct the issue for me either.  I've
> > tried both kernel-4.19.6-200.fc28.x86_64 and kernel-4.19.7-200.fc28.x86_64.
> > 
> This doesn't rule out there's a build issue with the Fedora kernel. Could
> you build a mainline kernel or build a kernel as described in comment 24 and
> test?

Okay.  I was planning to try and build a mainline kernel soon anyway (I keep running out of time) so will prioritise that this weekend.

Comment 40 Frank Crawford 2018-12-12 10:23:29 UTC
(In reply to Heiner Kallweit from comment #38)
> (In reply to Frank Crawford from comment #37)
> > Unfortunately the fix also doesn't correct the issue for me either.  I've
> > tried both kernel-4.19.6-200.fc28.x86_64 and kernel-4.19.7-200.fc28.x86_64.
> > 
> This doesn't rule out there's a build issue with the Fedora kernel. Could
> you build a mainline kernel or build a kernel as described in comment 24 and
> test?

Okay, I built a mainline kernel today (4.19.8) and it still has the same problems.  I also had a little time to test if forcing the realtek module fixes the issue (i.e. rmmod r8169; modprobe realtek; modprobe r8169) and it did correct the problem.

The output from the boot is below, but really just shows the same details:

[    3.432656] libphy: r8169: probed
[    3.432818] r8169 0000:05:00.0 eth0: RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 38000000, IRQ 28
[    3.432821] r8169 0000:05:00.0 eth0: jumbo features [frames: 4080 bytes, tx checksumming: ko]
[    3.435652] libphy: r8169: probed
[    3.436033] r8169 0000:07:00.0 eth1: RTL8168evl/8111evl, 50:e5:49:45:87:2a, XID 2c900800, IRQ 29
[    3.436035] r8169 0000:07:00.0 eth1: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[    3.496722] r8169 0000:05:00.0 enp5s0: renamed from eth0
[    3.535335] r8169 0000:07:00.0 enp7s0: renamed from eth1
[   10.872345] Generic PHY r8169-500:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[   10.977828] Generic PHY r8169-700:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
[   14.201181] r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control rx/tx
[  108.646084] libphy: r8169: probed
[  108.646705] r8169 0000:05:00.0 eth0: RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 38000000, IRQ 28
[  108.646708] r8169 0000:05:00.0 eth0: jumbo features [frames: 4080 bytes, tx checksumming: ko]
[  108.649246] libphy: r8169: probed
[  108.650080] r8169 0000:07:00.0 eth1: RTL8168evl/8111evl, 50:e5:49:45:87:2a, XID 2c900880, IRQ 29
[  108.650083] r8169 0000:07:00.0 eth1: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[  108.655795] r8169 0000:05:00.0 enp5s0: renamed from eth0
[  108.681706] r8169 0000:07:00.0 enp7s0: renamed from eth1
[  108.717321] RTL8211B Gigabit Ethernet r8169-500:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[  108.822097] RTL8211E Gigabit Ethernet r8169-700:00: attached PHY driver [RTL8211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
[  111.144179] r8169 0000:05:00.0 enp5s0: Link is Up - 1Gbps/Full - flow control rx/tx
[  111.960253] r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control rx/tx

Comment 41 Heiner Kallweit 2018-12-12 10:56:13 UTC
Hmm, even if there would be a timing issue I'd expect that at least for the second network interface the proper PHY driver is loaded.
Is after the first failed attempt module "realtek" loaded at all? A "lsmod" output would be helpful.

Comment 42 Frank Crawford 2018-12-12 11:20:37 UTC
My apologies, it probably is a bit confusing that I just dumped the data there.

The output up to 15secs, was the normal boot, at that point I did "rmmod r8169; modprobe realtek; modprobe r8169", which leads to the rest of the output.

When I get a chance to reboot I'll get an lsmod and upload it (this is my main server, hence the two NICs).

Comment 43 Jeremy Cline 2018-12-12 14:28:47 UTC
*** Bug 1654736 has been marked as a duplicate of this bug. ***

Comment 44 Alex Villacís Lasso 2018-12-12 17:27:00 UTC
(In reply to Alex Villacís Lasso from comment #33)
> I have a Fedora 28 machine that is also affected by this bug. Updating to
> 4.19.6-200 does NOT fix the bug.
> 
> I have noticed that this bug is unreliable - sometimes the generic PHY
> driver works, sometimes it does not. When it did not work, unloading and
> reloading r8169 managed to make it work.

I must note - in my particular case I did NOT "modprobe realtek" between unloading and reloading r8169. Yet the reload fixed things in that instance. Everyone here should make note of whether the explicit "modprobe realtek" is required in your particular cases between unloading and reloading r8169.

Comment 45 Heiner Kallweit 2018-12-12 19:30:37 UTC
Right, it shouldn't have been necessary to do a "modprobe realtek". The module is loaded at boot, but there is (was) a timing issue causing driver and PHY device probe to collide. Therefore the genphy driver was used.

Comment 46 Frank Crawford 2018-12-15 11:30 UTC
Created attachment 1514587 [details]
lsmod on failed startup

lsmod output on fresh reboot

Comment 47 Frank Crawford 2018-12-15 11:32 UTC
Created attachment 1514588 [details]
lsmod following r8169 reload

The output of lsmod following "rmmod r8169; modprobe r8169".

Comment 48 Heiner Kallweit 2018-12-15 11:39:09 UTC
(In reply to Frank Crawford from comment #46)
> Created attachment 1514587 [details]
> lsmod on failed startup
> 
> lsmod output on fresh reboot

Even in this case the "realtek" module is loaded. This matches the pattern observed by the other affected users.

Comment 49 Frank Crawford 2018-12-15 11:45:47 UTC
Okay, I rebooted the system a couple of times, and you are correct that doing a "modprobe realtek" isn't necessary, just "rmmod r8169; modprobe r8169" is sufficient.

I've loaded up the modprobes from before and after the reload, and unfortunately there is little difference, only the location of the r8169 module in the list.

Also, FYI, while I know it shouldn't be necessary or do anything, if I add the line "softdep 48169 pre: realtek" to /etc/modprobe.d/local.conf then every thing seems to come up correctly.  Now I know that that has a typo "48169" should have been "r8169", which makes this even weirder.

Is it possible the issue is in the realtek module and not in the r8169 module?

Comment 50 Frank Crawford 2018-12-15 11:49:57 UTC
Also, I just looked at the dmesg output from that session, and both times the NIC appears to have bound to the realtek driver, but it didn't come up during the initial boot, but did on the reload!  See the output below:

[    3.091212] libphy: r8169: probed
[    3.091752] r8169 0000:05:00.0 eth0: RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 3
8000000, IRQ 28
[    3.091761] r8169 0000:05:00.0 eth0: jumbo features [frames: 4080 bytes, tx c
hecksumming: ko]
[    3.093858] libphy: r8169: probed
[    3.094328] r8169 0000:07:00.0 eth1: RTL8168evl/8111evl, 50:e5:49:45:87:2a, X
ID 2c900800, IRQ 29
[    3.094330] r8169 0000:07:00.0 eth1: jumbo features [frames: 9200 bytes, tx c
hecksumming: ko]
[    3.122238] r8169 0000:07:00.0 enp7s0: renamed from eth1
[    3.133382] r8169 0000:05:00.0 enp5s0: renamed from eth0
[   13.762322] RTL8211B Gigabit Ethernet r8169-500:00: attached PHY driver [RTL8
211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[   13.844037] RTL8211E Gigabit Ethernet r8169-700:00: attached PHY driver [RTL8
211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
[   14.015943] r8169 0000:07:00.0 enp7s0: Link is Down
[   17.080825] r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control
 rx/tx
[   84.669233] libphy: r8169: probed
[   84.670011] r8169 0000:05:00.0 eth0: RTL8168b/8111b, 00:e0:4c:14:2e:87, XID 3
8000000, IRQ 28
[   84.670015] r8169 0000:05:00.0 eth0: jumbo features [frames: 4080 bytes, tx c
hecksumming: ko]
[   84.676368] r8169 0000:05:00.0 enp5s0: renamed from eth0
[   84.678825] libphy: r8169: probed
[   84.687877] r8169 0000:07:00.0 eth0: RTL8168evl/8111evl, 50:e5:49:45:87:2a, X
ID 2c900880, IRQ 29
[   84.687884] r8169 0000:07:00.0 eth0: jumbo features [frames: 9200 bytes, tx c
hecksumming: ko]
[   84.693005] r8169 0000:07:00.0 enp7s0: renamed from eth0
[   84.704510] RTL8211B Gigabit Ethernet r8169-500:00: attached PHY driver [RTL8
211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[   84.786049] RTL8211E Gigabit Ethernet r8169-700:00: attached PHY driver [RTL8
211E Gigabit Ethernet] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
[   86.890229] r8169 0000:05:00.0 enp5s0: Link is Up - 1Gbps/Full - flow control
 rx/tx
[   87.915586] r8169 0000:07:00.0 enp7s0: Link is Up - 1Gbps/Full - flow control
 rx/tx

Comment 51 Heiner Kallweit 2018-12-15 12:05:34 UTC
If the dedicated RTL8211B PHY driver is loaded and your network still doesn't come up, then there's a different problem. But it doesn't seem to be in the drivers because the user from comment 1 has the same chip version and for him the issue seems to be fixed.

Comment 52 Frank Crawford 2018-12-15 12:27:45 UTC
(In reply to Heiner Kallweit from comment #51)
> If the dedicated RTL8211B PHY driver is loaded and your network still
> doesn't come up, then there's a different problem. But it doesn't seem to be
> in the drivers because the user from comment 1 has the same chip version and
> for him the issue seems to be fixed.

I'll play around with the system a bit more and see what I can find, since it all be the same underlying issue, just different manifestation.  As you noted, the current fix is a workaround, not a fix for the real issue, which is still not known.

Comment 53 Jeremy Cline 2018-12-18 15:29:56 UTC
*** Bug 1657458 has been marked as a duplicate of this bug. ***

Comment 54 Norbert Jurkeit 2018-12-20 13:10:41 UTC
(In reply to Heiner Kallweit from comment #51)
> If the dedicated RTL8211B PHY driver is loaded and your network still
> doesn't come up, then there's a different problem. But it doesn't seem to be
> in the drivers because the user from comment 1 has the same chip version and
> for him the issue seems to be fixed.

I did some more testing with the newest Fedora kernel-4.19.10-300.fc29.x86_64 (installed as binary from the repo as well as compiled by myself) and for the first time with a mainline kernel built from linux-4.19.10.tar.xz and the config file taken from the Fedora kernel. In opposite to my success reports in Comment 16 and Comment 24 neither of them fixes the issue for me, i.e. after reboot from kernel 4.18.18 I still have to issue "rmmod r8169; modprobe r8169" to make the interface work.

According to my experience the behavior is stable within a particular kernel, but varies between kernel versions and sometimes even between builds of the same version as described in Comment 24.

Comment 55 Heiner Kallweit 2018-12-20 19:00:36 UTC
Thanks for the feedback, then the first attempt to work around the issue wasn't good enough. Something is asynchronous and disturbs.
Could you try the following patch? It will apply to latest linux-next only, for older kernel versions you could simply add an unconditional msleep(20) at the end of phy_device_create().


diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 7d5d69860..d6f464aa5 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -557,6 +557,7 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 {
 	struct phy_device *dev;
 	struct mdio_device *mdiodev;
+	bool delay = false;
 
 	/* We allocate the device, and initialize the default values */
 	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
@@ -613,14 +614,19 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 			if (!(c45_ids->devices_in_package & (1 << i)))
 				continue;
 
-			request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT,
-				       MDIO_ID_ARGS(c45_ids->device_ids[i]));
+			if (!request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT,
+					MDIO_ID_ARGS(c45_ids->device_ids[i])))
+				delay = true;
 		}
 	} else {
-		request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT,
-			       MDIO_ID_ARGS(phy_id));
+		if (!request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT,
+				    MDIO_ID_ARGS(phy_id)))
+			delay = true;
 	}
 
+	if (delay)
+		msleep(20);
+
 	device_initialize(&mdiodev->dev);
 
 	return dev;
-- 
2.20.1

Comment 56 Heiner Kallweit 2018-12-21 09:21:20 UTC
One more potential fix to test:
Please add a call to wait_for_device_probe() at the end of phy_device_create(), directly before the return statement.
This way we may be able to avoid a dumb delay.

Comment 57 Norbert Jurkeit 2018-12-21 13:03:49 UTC
Success! I again tested with mainline kernel 4.19.10 and a msleep(20) statement inserted just before the return statement of function phy_device_create seems to fix the issue for me.

At first I concluded from the patch in Comment 55 that msleep(20) needs to be inserted before device_initialize(&mdiodev->dev), but that did not help.

I also tried wait_for_device_probe() instead of msleep(20), but then the boot process gets stuck after message "Reached target Initrd Raid Device" (the root fs including /boot is indeed located on a MD Raid device).

Comment 58 Heiner Kallweit 2018-12-21 13:10:07 UTC
Thanks for testing! Strange that it makes a difference whether the msleep() is before or after the device_initialize(). Have to check this.
I'd like to ask you for one more test. If you just use async_synchronize_full() instead of wait_for_device_probe(), does this help?

Comment 59 Heiner Kallweit 2018-12-21 13:46:44 UTC
Sorry, may I ask you for one more test:
If the device_initialize() call in phy_device_create() is moved to directly after dev_set_name(), does this help?

Comment 60 Norbert Jurkeit 2018-12-21 14:40:53 UTC
async_synchronize_full() as suggested in Comment 58 lets the boot process finish normally, but does not fix the issue. To avoid a compiler error I also added "#include <linux/async.h>"; please let me know if this was wrong.

I just started a build of the Fedora kernel to see whether the msleep(20) patch also works there. Unfortunately the procedure known to me takes nearly 4 hours and in case of interruption I have to start all over, but after that I will do the test from Comment 59.

Comment 61 Norbert Jurkeit 2018-12-21 18:52:42 UTC
(In reply to Heiner Kallweit from comment #59)
> Sorry, may I ask you for one more test:
> If the device_initialize() call in phy_device_create() is moved to directly
> after dev_set_name(), does this help?

Good news. This modification tested with the mainline kernel also fixes the issue for me.

The test of the msleep(20) patch applied to the Fedora kernel of same version 4.19.10 was also successful.

Comment 62 Heiner Kallweit 2018-12-21 19:04:06 UTC
Very good, then it seems we have a fix w/o having to add a dumb delay. I would submit this fix as part of a patch removing the recently added preliminary fix (which fixed it in som ecases but not in general).
The following would be the patch, could you please test it? Then I would mention you as "Tested-by:" in the patch.

---
 drivers/net/phy/phy_device.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 7d5d69860..3cacf9fe9 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -589,6 +589,7 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 		dev->c45_ids = *c45_ids;
 	dev->irq = bus->irq[addr];
 	dev_set_name(&mdiodev->dev, PHY_ID_FMT, bus->id, addr);
+	device_initialize(&mdiodev->dev);
 
 	dev->state = PHY_DOWN;
 
@@ -621,8 +622,6 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 			       MDIO_ID_ARGS(phy_id));
 	}
 
-	device_initialize(&mdiodev->dev);
-
 	return dev;
 }
 EXPORT_SYMBOL(phy_device_create);
@@ -2255,14 +2254,6 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner)
 	new_driver->mdiodrv.driver.remove = phy_remove;
 	new_driver->mdiodrv.driver.owner = owner;
 
-	/* The following works around an issue where the PHY driver doesn't bind
-	 * to the device, resulting in the genphy driver being used instead of
-	 * the dedicated driver. The root cause of the issue isn't known yet
-	 * and seems to be in the base driver core. Once this is fixed we may
-	 * remove this workaround.
-	 */
-	new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS;
-
 	retval = driver_register(&new_driver->mdiodrv.driver);
 	if (retval) {
 		pr_err("%s: Error %d in registering driver\n",
-- 
2.20.1

Comment 63 Norbert Jurkeit 2018-12-21 20:34:59 UTC
The patch works for me, i.e. I copied the text to patchfile, ran "patch phy_device.c patchfile", checked the contents, then "make", "make modules_install", "make install" and finally rebooted 3 times from kernel 4.18.18 and checked the network to come up.

However I wonder whether other users should test the patch too before it is submitted to the mainline kernel. We have seen in Comment 30 and also on bodhi.fedoraproject.org that the previous fix with PROBE_FORCE_SYNCHRONOUS has fixed the issue for some users. I don't understand how the driver works but wouldn't exclude the chance that the previous fix is still needed for them.

Comment 64 Heiner Kallweit 2018-12-21 20:40:02 UTC
Thanks for testing. And it's right, I'd appreciate if more affected users could test whether the patch fixes the issue for them.

Comment 65 Heiner Kallweit 2018-12-23 11:06:29 UTC
(In reply to Frank Crawford from comment #52)
> (In reply to Heiner Kallweit from comment #51)
> > If the dedicated RTL8211B PHY driver is loaded and your network still
> > doesn't come up, then there's a different problem. But it doesn't seem to be
> > in the drivers because the user from comment 1 has the same chip version and
> > for him the issue seems to be fixed.
> 
> I'll play around with the system a bit more and see what I can find, since
> it all be the same underlying issue, just different manifestation.  As you
> noted, the current fix is a workaround, not a fix for the real issue, which
> is still not known.

You are one of the affected users where the first fix attempt didn't help. Could you test the latest fix from comment 62 ?

Comment 66 Frank Crawford 2018-12-23 11:32:07 UTC
(In reply to Heiner Kallweit from comment #65)
> You are one of the affected users where the first fix attempt didn't help.
> Could you test the latest fix from comment 62 ?

Heiner, yes, I will within the next day, and let you know the results.

Comment 67 Frank Crawford 2018-12-24 11:11:41 UTC
(In reply to Heiner Kallweit from comment #65)
> You are one of the affected users where the first fix attempt didn't help.
> Could you test the latest fix from comment 62 ?

Heiner, okay, I had a chance to patch and rebuild it with 4.19.12 and then rebooted a few times from different states, and it has come up fine every time.

So, yes, I'm willing to say this patch seems to have fixed my issue.

Comment 68 Heiner Kallweit 2018-12-24 11:34:29 UTC
Great, thanks for testing. Then I'll submit this patch.

Comment 69 Hans de Goede 2018-12-24 12:03:29 UTC
*** Bug 1661680 has been marked as a duplicate of this bug. ***

Comment 70 Hans de Goede 2018-12-24 20:46:52 UTC
*** Bug 1661943 has been marked as a duplicate of this bug. ***

Comment 71 Hans de Goede 2018-12-24 20:47:03 UTC
*** Bug 1654858 has been marked as a duplicate of this bug. ***

Comment 72 oli 2018-12-25 00:21:31 UTC
if someone could provide a .rpm, i would test it too :)

Comment 73 Hans de Goede 2018-12-25 10:18:47 UTC
(In reply to oli from comment #72)
> if someone could provide a .rpm, i would test it too :)

I've started a kernel scratch-build with the patch added here:

https://koji.fedoraproject.org/koji/taskinfo?taskID=31646312

This should be done in a couple of hours, instructions for testing kernels directly from koji are here:
https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt

Comment 74 oli 2018-12-25 14:04:34 UTC
I just tested the new kernel. It did not fix the problem with my nic
If you need more information, please tell me what i should do.

Comment 75 Heiner Kallweit 2018-12-25 14:18:16 UTC
(In reply to oli from comment #74)
> I just tested the new kernel. It did not fix the problem with my nic
> If you need more information, please tell me what i should do.

We don't even know what your problem is. Please add:
- description of problem
- last kernel version w/o this problem
- does removing / reloading module r8169 help?
- output of "dmesg | grep r8169"

Comment 76 oli 2018-12-25 14:42:42 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1661943

i will try to reload the nic module and see what happens

Comment 77 oli 2018-12-25 14:46:17 UTC
thanks, when i do rmmod r8169 && modprobe r8169 the device gets up and running as it should - at least its a workaround, thanks

Comment 78 Heiner Kallweit 2018-12-25 14:48:10 UTC
What's still missing is a dmesg from the test kernel incl. module remove/reload cycle.

Comment 79 oli 2018-12-25 14:52 UTC
Created attachment 1516661 [details]
dmesg of newest kernel with load/unload cycle

Comment 80 Heiner Kallweit 2018-12-25 15:13:42 UTC
(In reply to Hans de Goede from comment #73)
> (In reply to oli from comment #72)
> > if someone could provide a .rpm, i would test it too :)
> 
> I've started a kernel scratch-build with the patch added here:
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=31646312
> 
> This should be done in a couple of hours, instructions for testing kernels
> directly from koji are here:
> https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt

Hans, could you please build a test kernel with the following patch instead?
The delay is the less sophisticated, but obviously safer method to prevent the issue.
For another affected user it helped, would like to see whether it fixes the issue also for oli.


diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 26c41ede5..09e0ad928 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -600,6 +600,9 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 
 	device_initialize(&mdiodev->dev);
 
+	/* prevent timing issue causing phy driver not bound sometimes */
+	msleep(20);
+
 	return dev;
 }
 EXPORT_SYMBOL(phy_device_create);
-- 
2.20.1

Comment 81 Hans de Goede 2018-12-25 22:33:37 UTC
(In reply to Heiner Kallweit from comment #80)
> Hans, could you please build a test kernel with the following patch instead?
> The delay is the less sophisticated, but obviously safer method to prevent
> the issue.
> For another affected user it helped, would like to see whether it fixes the
> issue also for oli.

Ok, a new kernel with that patch is building here now:

https://koji.fedoraproject.org/koji/taskinfo?taskID=31661328

This should be done in a couple of hours.

Comment 82 oli 2018-12-26 09:00:14 UTC
I downloaded that kernel, installed as instructed from https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt, removed my ethhack.service (called ethhack.sh on startup) and ethhack.sh (rmmod and modprobe of r8169) and rebooted. 
I instantly got a connection. To make sure it was that, i rebooted the old scratch kernel ... and it was working there too? I dont understand it.
So i bootet the old official fedora kernel, this worked now too.
I dont understand why, but for me this is solved. Thanks!

Comment 83 oli 2018-12-26 09:02 UTC
Created attachment 1516774 [details]
dmesg after updating to kernel 4.19.12-301.rhbz1650984_2.fc29.x86_64

Comment 84 Heiner Kallweit 2018-12-26 10:07:25 UTC
(In reply to oli from comment #83)
> Created attachment 1516774 [details]
> dmesg after updating to kernel 4.19.12-301.rhbz1650984_2.fc29.x86_64

Thanks for testing. Still strange that your attached dmesg from the latest test kernel includes the following line:
Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
I would expect a dedicated PHY driver here. If you remove and reload module r8169, which PHY driver is loaded?

Comment 85 Elia Devito 2018-12-26 11:08:11 UTC
not fixed for me, seem that my machine load Generic PHY every time (on boot and after r8169 module reload)

Comment 86 Elia Devito 2018-12-26 11:08 UTC
Created attachment 1516819 [details]
dmesg-4.19.10-300.fc29.x86_64

Comment 87 Elia Devito 2018-12-26 11:09 UTC
Created attachment 1516820 [details]
dmesg-4.19.10-300.fc29.x86_64

Comment 88 Elia Devito 2018-12-26 11:10 UTC
Created attachment 1516821 [details]
dmesg-4.19.12-301.rhbz1650984.fc29.x86_64

Comment 89 Elia Devito 2018-12-26 11:10 UTC
Created attachment 1516822 [details]
dmesg-4.19.12-301.rhbz1650984.fc29.x86_64

Comment 90 Heiner Kallweit 2018-12-26 11:21:54 UTC
(In reply to Elia Devito from comment #85)
> not fixed for me, seem that my machine load Generic PHY every time (on boot
> and after r8169 module reload)

For RTL8168h it's correct that the genphy driver is used. You describe a totally different issue.
The issue we talk about here affects (AFAIK only) chips with a RTL8211C PHY because this PHY isn't compatible with genphy. Link doesn't come up in this case.
You have an unstable link. It's less likely that this is a driver issue, same chip version works totally fine here with latest driver.
How does your system behave with a 4.18 kernel? And if you switch cable and/or switch?

Comment 91 oli 2018-12-26 12:47:53 UTC
[root@localhost oli]# rmmod r8169 && modprobe r8169
(latest lines from dmesg regarding nic driver) :
[13736.406887] libphy: r8169: probed
[13736.407479] r8169 0000:02:00.0 eth0: RTL8168d/8111d, 00:25:22:0c:80:0a, XID 281000c0, IRQ 24
[13736.407482] r8169 0000:02:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[13736.415866] r8169 0000:02:00.0 enp2s0: renamed from eth0
[13736.434338] IPv6: ADDRCONF(NETDEV_UP): enp2s0: link is not ready
[13736.435042] RTL8211B Gigabit Ethernet r8169-200:00: attached PHY driver [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)
[13736.564882] IPv6: ADDRCONF(NETDEV_UP): enp2s0: link is not ready
[13738.074544] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
[13738.074554] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready

> which PHY driver is loaded?
can you help me a bit? i tested lsmod | grep phy; but i did not get any output
is the information provided in the dmesg pasted above?

Comment 92 oli 2018-12-26 12:50:28 UTC
whats also pretty wierd to me, it works now with all kernels which i do not understand. but i deleted the ethhack.sh (which did rmmod && modprobe) and also the service for the startup, so either its kinda random bug (timing?) or it works now... somehow
i will try this evening to power off the machine completly and then start an old kernel which should not work and verify that again

Comment 93 Heiner Kallweit 2018-12-26 13:00:05 UTC
(In reply to oli from comment #91)
> [root@localhost oli]# rmmod r8169 && modprobe r8169
> (latest lines from dmesg regarding nic driver) :
> [13736.406887] libphy: r8169: probed
> [13736.407479] r8169 0000:02:00.0 eth0: RTL8168d/8111d, 00:25:22:0c:80:0a,
> XID 281000c0, IRQ 24
> [13736.407482] r8169 0000:02:00.0 eth0: jumbo features [frames: 9200 bytes,
> tx checksumming: ko]
> [13736.415866] r8169 0000:02:00.0 enp2s0: renamed from eth0
> [13736.434338] IPv6: ADDRCONF(NETDEV_UP): enp2s0: link is not ready
> [13736.435042] RTL8211B Gigabit Ethernet r8169-200:00: attached PHY driver
> [RTL8211B Gigabit Ethernet] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE)

That's the line I was looking for. (BTW: in comment 90 I meant RTL8211B).
Now the dedicated RTL8211B PHY driver is correctly loaded.

As explanation for others involved in the discussion here:
r8169 supports ~ 50 chip versions. The integrated PHY's partially use the same ID as existing external Realtek PHY's, the PHY's in later chip versions use a generic Realtek ID and therefore the generic PHY driver is used.

Comment 94 Heiner Kallweit 2018-12-26 18:00:20 UTC
Still the latest fix attempt doesn't fully cure the issue. According the log in comment 83 again genphy driver was bound instead of the dedicated PHY driver.

And the root cause of the issue seems to be well hidden. I'd appreciate if affected people could test with an original 4.19 (w/o any of the fix attempts) and just the following patch applied.
I'd be curious whether the warning is triggered and the proper dedicated driver is bound.

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 8ac10af17..7f9a01a34 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -804,7 +804,7 @@ static int __device_attach(struct device *dev, bool allow_async)
 		if (ret == 0)
 			ret = 1;
 		else {
-			dev->driver = NULL;
+			WARN(1, "%s\n", dev_name(dev));
 			ret = 0;
 		}
 	} else {
-- 
2.20.1

Comment 95 oli 2018-12-27 00:22:51 UTC
@Heiner

I can test kernels, but i need a .rpm 
Unfortunately, i drive back home in about 12h, then i have no access to this machine anymore and can not test it.

Comment 96 Hans de Goede 2018-12-27 11:20:49 UTC
(In reply to Heiner Kallweit from comment #94)
> Still the latest fix attempt doesn't fully cure the issue. According the log
> in comment 83 again genphy driver was bound instead of the dedicated PHY
> driver.
> 
> And the root cause of the issue seems to be well hidden. I'd appreciate if
> affected people could test with an original 4.19 (w/o any of the fix
> attempts) and just the following patch applied.
> I'd be curious whether the warning is triggered and the proper dedicated
> driver is bound.
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 8ac10af17..7f9a01a34 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -804,7 +804,7 @@ static int __device_attach(struct device *dev, bool
> allow_async)
>  		if (ret == 0)
>  			ret = 1;
>  		else {
> -			dev->driver = NULL;
> +			WARN(1, "%s\n", dev_name(dev));
>  			ret = 0;
>  		}
>  	} else {
> -- 
> 2.20.1

For people who want to give this a test, a scratch kernel-build with the above change is building here now:
https://koji.fedoraproject.org/koji/taskinfo?taskID=31688858

As usual it takes a couple of hours for this to finish.

Comment 97 Hans de Goede 2018-12-27 11:33:26 UTC
I've started another scratch-build of the kernel here:
https://koji.fedoraproject.org/koji/taskinfo?taskID=31688943

With an attempt to workaround this. Since getting this fixed properly seems to be tricky and quite a lot of people are hitting this, this kernel build simply builds the realtek phy driver into the kernel by setting CONFIG_REALTEK_PHY=y, which will hopefully work around the probe issue.

So if you are hitting this bug, please give both the build from comment 96 a try to help Heiner debug this, as well as give the build from this comment a try. If setting CONFIG_REALTEK_PHY=y is a valid workaround then I will push that to the Fedora kernels for now as a workaround.

Comment 98 oli 2018-12-27 13:59 UTC
Created attachment 1517059 [details]
dmesg of 4.19.12-301.rhbz1650984_4.fc29.x86_64

Comment 99 oli 2018-12-27 14:01 UTC
Created attachment 1517072 [details]
dmesg of 4.19.12-301.rhbz1650984_3.fc29.x86_64

Comment 100 oli 2018-12-27 14:01:51 UTC
have to go now, sorry that i can not test anymore (until about middle of jan.). and thanks for your help/engagment!

Comment 101 Heiner Kallweit 2018-12-27 14:13:30 UTC
oli, thanks for testing! Results indicate that:
- building realtek phy driver module into the kernel instead of module avoids the issue
- the code path I mentioned in comment 94 is not related to the issue

Comment 102 Norbert Jurkeit 2018-12-27 14:22:20 UTC
The test kernels provided by Hans gave me following results:

From Comment 73, relocated device_initialize statement: NOT OK, although it did work with my self-built mainline kernel 4.19.10
From Comment 81, additional msleep statement: OK
From Comment 96, WARN statement: OK (what does this do anyway?)
From Comment 97, with built-in realtek phy driver: OK

I then also tried next stock Fedora kernel 4.19.12-301.fc29.x86_64 which is available in koji for some days and presumably contains none of the above patches. To my surprise it also works for me, i.e. network comes up after reboot from kernel 4.18.18.

It seems to be a matter of luck whether one of the patches works, is necessary at all or even breaks things.

Comment 103 Hans de Goede 2018-12-27 15:12:06 UTC
Heiner, since this seems to be a race condition triggering somewhere, building in the realtek-phy module seems like a reasonable workaround for now from a downstream (Fedora) perspective, do you agree?

For upstream it would be nice to eventually get to the bottom of this.

Comment 104 Heiner Kallweit 2018-12-27 15:39:49 UTC
(In reply to Hans de Goede from comment #103)
> Heiner, since this seems to be a race condition triggering somewhere,
> building in the realtek-phy module seems like a reasonable workaround for
> now from a downstream (Fedora) perspective, do you agree?
> 
> For upstream it would be nice to eventually get to the bottom of this.

Yes, building the realtek module in seems to avoid the issue. Fine with me for downstream.
Unfortunately on my systems I can't reproduce the problem. Root cause seems to be neither in r8169 nor phylib, but somewhere in the driver base core.
I'll come back once I have something else to test or a status update.

Comment 105 Hans de Goede 2018-12-27 16:25:21 UTC
(In reply to Heiner Kallweit from comment #104)
> Yes, building the realtek module in seems to avoid the issue. Fine with me
> for downstream.

Ok, I've added this change to the Fedora-kernel configs now. This will get picked up by the next official Fedora kernel build.

Comment 106 Heiner Kallweit 2018-12-27 19:57:47 UTC
Because I can't reproduce the issue here I need some further debug information from affected systems.
The following patch (applies to 4.19.12) emits quite some debug info. I hope the debug output overhead doesn't impact timings in a way that the issue doesn't occur any longer (similar to Heisenberg uncertainty principle). Please attach full dmesg output to this bug ticket (only if issue occurs and Generic PHY driver is loaded when it should not).

Hans, would be great if you could again build a test kernel for the ones who need a rpm to test. Thanks a lot.

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index edfc9f0b1..845b6019c 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -450,6 +450,8 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 	bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE) &&
 			   !drv->suppress_bind_attrs;
 
+	pr_info("hk: in %s\n", __func__);
+
 	if (defer_all_probes) {
 		/*
 		 * Value of defer_all_probes can be set only by
@@ -468,7 +470,7 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 		return ret;
 
 	atomic_inc(&probe_count);
-	pr_debug("bus: '%s': %s: probing driver %s with device %s\n",
+	pr_info("bus: '%s': %s: probing driver %s with device %s\n",
 		 drv->bus->name, __func__, drv->name, dev_name(dev));
 	WARN_ON(!list_empty(&dev->devres_head));
 
@@ -537,12 +539,14 @@ static int really_probe(struct device *dev, struct device_driver *drv)
 	goto done;
 
 probe_failed:
+	pr_info("hk: really_probe probe_fail, ret = %d\n", ret);
 	dma_deconfigure(dev);
 dma_failed:
 	if (dev->bus)
 		blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
 					     BUS_NOTIFY_DRIVER_NOT_BOUND, dev);
 pinctrl_bind_failed:
+	pr_info("hk: really_probe pinctrl_fail, ret = %d\n", ret);
 	device_links_no_driver(dev);
 	devres_release_all(dev);
 	driver_sysfs_remove(dev);
@@ -736,6 +740,7 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
 		return -EBUSY;
 
 	ret = driver_match_device(drv, dev);
+	pr_info("hk: dev_match: drv = %s dev = %s, ret %d\n", drv->name, dev_name(dev), ret);
 	if (ret == 0) {
 		/* no match */
 		return 0;
@@ -748,6 +753,7 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
 	} /* ret > 0 means positive match */
 
 	async_allowed = driver_allows_async_probing(drv);
+	pr_info("hk: async_allowed = %d\n", async_allowed);
 
 	if (async_allowed)
 		data->have_async = true;
@@ -772,6 +778,7 @@ static void __device_attach_async_helper(void *_dev, async_cookie_t cookie)
 	if (dev->parent)
 		pm_runtime_get_sync(dev->parent);
 
+	pr_info("hk: call __device_attach_driver from __device_attach_async_helper\n");
 	bus_for_each_drv(dev->bus, NULL, &data, __device_attach_driver);
 	dev_dbg(dev, "async probe completed\n");
 
@@ -812,6 +819,7 @@ static int __device_attach(struct device *dev, bool allow_async)
 		if (dev->parent)
 			pm_runtime_get_sync(dev->parent);
 
+		pr_info("hk: call __device_attach_driver from __device_attach\n");
 		ret = bus_for_each_drv(dev->bus, NULL, &data,
 					__device_attach_driver);
 		if (!ret && allow_async && data.have_async) {
@@ -878,6 +886,7 @@ static int __driver_attach(struct device *dev, void *data)
 	 */
 
 	ret = driver_match_device(drv, dev);
+	pr_info("hk: dr_match: drv = %s dev = %s, ret %d\n", drv->name, dev_name(dev), ret);
 	if (ret == 0) {
 		/* no match */
 		return 0;
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 20d1be2b0..9349dea75 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -457,6 +457,8 @@ struct phy_device *phy_device_create(struct mii_bus *bus, int addr, int phy_id,
 	 */
 	request_module(MDIO_MODULE_PREFIX MDIO_ID_FMT, MDIO_ID_ARGS(phy_id));
 
+	pr_info("hk: request_module returned\n");
+
 	device_initialize(&mdiodev->dev);
 
 	return dev;
@@ -1927,14 +1929,6 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner)
 	new_driver->mdiodrv.driver.remove = phy_remove;
 	new_driver->mdiodrv.driver.owner = owner;
 
-	/* The following works around an issue where the PHY driver doesn't bind
-	 * to the device, resulting in the genphy driver being used instead of
-	 * the dedicated driver. The root cause of the issue isn't known yet
-	 * and seems to be in the base driver core. Once this is fixed we may
-	 * remove this workaround.
-	 */
-	new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS;
-
 	retval = driver_register(&new_driver->mdiodrv.driver);
 	if (retval) {
 		pr_err("%s: Error %d in registering driver\n",
@@ -1943,7 +1937,7 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner)
 		return retval;
 	}
 
-	pr_debug("%s: Registered new driver\n", new_driver->name);
+	pr_info("%s: Registered new driver\n", new_driver->name);
 
 	return 0;
 }
-- 
2.20.1

Comment 107 Jeremy Cline 2018-12-28 04:49:14 UTC
(In reply to Heiner Kallweit from comment #106)
> Because I can't reproduce the issue here I need some further debug
> information from affected systems.
> The following patch (applies to 4.19.12) emits quite some debug info. I hope
> the debug output overhead doesn't impact timings in a way that the issue
> doesn't occur any longer (similar to Heisenberg uncertainty principle).
> Please attach full dmesg output to this bug ticket (only if issue occurs and
> Generic PHY driver is loaded when it should not).
> 
> Hans, would be great if you could again build a test kernel for the ones who
> need a rpm to test. Thanks a lot.
> 

I started a build with the debug patch and CONFIG_REALTEK_PHY=m at https://koji.fedoraproject.org/koji/taskinfo?taskID=31702356

Comment 108 Hans de Goede 2019-01-02 08:16:52 UTC
Note the workaround using CONFIG_REALTEK_PHY=y is available in the official Fedora kernel builds now, starting with the 4.19.13 builds, which are in update-testing atm.

Comment 109 Yanko Kaneti 2019-01-02 11:33:18 UTC
*** Bug 1662910 has been marked as a duplicate of this bug. ***

Comment 110 Hans de Goede 2019-01-02 12:47:19 UTC
*** Bug 1662862 has been marked as a duplicate of this bug. ***

Comment 111 Heiner Kallweit 2019-01-06 17:23:44 UTC
The following new proposal for a workaround is basically the same as described in comment 49, just as part of the driver code.
It ensures the the PHY driver module is loaded before r8169. I'd appreciate test results.


diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 7fe918026..c6b98024b 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -708,6 +708,7 @@ module_param(use_dac, int, 0);
 MODULE_PARM_DESC(use_dac, "Enable PCI DAC. Unsafe on 32 bit PCI slot.");
 module_param_named(debug, debug.msg_enable, int, 0);
 MODULE_PARM_DESC(debug, "Debug verbosity level (0=none, ..., 16=all)");
+MODULE_SOFTDEP("pre: realtek");
 MODULE_LICENSE("GPL");
 MODULE_FIRMWARE(FIRMWARE_8168D_1);
 MODULE_FIRMWARE(FIRMWARE_8168D_2);
-- 
2.20.1

Comment 112 Norbert Jurkeit 2019-01-07 17:14:17 UTC
I tested the patch from comment 111 with mainline kernels 4.19.10 and 4.19.13 using default setting CONFIG_REALTEK_PHY=m and it makes the issue vanish in both cases. However due to previous experience I am cautious to say that this patch provides the final fix.

My recent observation that the issue only occurs when the Ethernet link is powered down before boot of a 4.19.* kernel should also be mentioned here. According to the LAN port status LED of my Wifi router this power-down happens during reboot from kernel 4.18.18, but not during reboot from kernel 4.19.* or when the PC is initially switched on.

Comment 113 Marc Dionne 2019-01-07 17:35:03 UTC
After being unable to reproduce for a while, even with kernels that used to fail consistently, I ran into this again today while booting into a 5.0-rc1 kernel for the first time.

I experimented with MODULE_SOFTDEP as described in comment 111 back in december, and it seemed to help at the time; but that's still just another workaround that doesn't explain what's going on.

Another thing I should note is that in my case, the "realtek" module does NOT show up in the lsmod output when the network fails to start (some of the mails on the netdev thread suggest that's different for some users).

Comment 114 Heiner Kallweit 2019-01-07 18:40:17 UTC
(In reply to Marc Dionne from comment #113)
[...]
> 
> Another thing I should note is that in my case, the "realtek" module does
> NOT show up in the lsmod output when the network fails to start (some of the
> mails on the netdev thread suggest that's different for some users).

Thanks for the feedback! That's interesting. In such a case (network fails to start and "realtek" module isn't loaded) what's the value of /sys/class/net/<if>/phydev/phy_id ?
If it's the correct value, then request_module() in phy_device_create() seems to fail. And unfortunately we don't check the return value yet.
If you can reproduce the problem, could you add a debug output printing the return value of request_module() ? Or do you need a patch for it?

Comment 115 Heiner Kallweit 2019-01-07 19:12:05 UTC
(In reply to Heiner Kallweit from comment #114)
> (In reply to Marc Dionne from comment #113)
> [...]
> > 
> > Another thing I should note is that in my case, the "realtek" module does
> > NOT show up in the lsmod output when the network fails to start (some of the
> > mails on the netdev thread suggest that's different for some users).
> 
> Thanks for the feedback! That's interesting. In such a case (network fails
> to start and "realtek" module isn't loaded) what's the value of
> /sys/class/net/<if>/phydev/phy_id ?
> If it's the correct value, then request_module() in phy_device_create()
> seems to fail. And unfortunately we don't check the return value yet.
> If you can reproduce the problem, could you add a debug output printing the
> return value of request_module() ? Or do you need a patch for it?

To rule out that the module is loaded but the init function fails: Do you have any "Error xx in registering driver" message in the syslog?

Comment 116 Marc Dionne 2019-01-07 19:36:27 UTC
(In reply to Heiner Kallweit from comment #114)
> (In reply to Marc Dionne from comment #113)
> [...]
> > 
> > Another thing I should note is that in my case, the "realtek" module does
> > NOT show up in the lsmod output when the network fails to start (some of the
> > mails on the netdev thread suggest that's different for some users).
> 
> Thanks for the feedback! That's interesting. In such a case (network fails
> to start and "realtek" module isn't loaded) what's the value of
> /sys/class/net/<if>/phydev/phy_id ?
> If it's the correct value, then request_module() in phy_device_create()
> seems to fail. And unfortunately we don't check the return value yet.
> If you can reproduce the problem, could you add a debug output printing the
> return value of request_module() ? Or do you need a patch for it?

I can add some debugging, but unfortunately, after fixing the initial boot with a rmmod/modprobe of r8169, the machine now consistently boots with networking.  (btw the ID for my device is 0x1cc912)

There's obviously some other variable that affects the behaviour.  Looks like failure is more likely for a newly installed kernel, or booting into a different kernel than the previous boot.  After a successful boot that gets networking (or failed but manually fixed), the following boots pretty much always succeed, until something changes again.  Some possibilities would be the device state, although at least in my case, powering down completely doesn't seem to have any effect, or perhaps something in user space that would cache the dependency somewhere, helping subsequent boots.

Comment 117 Marc Dionne 2019-01-07 19:39:19 UTC
(In reply to Heiner Kallweit from comment #115)
> (In reply to Heiner Kallweit from comment #114)
> > (In reply to Marc Dionne from comment #113)
> > [...]
> > > 
> > > Another thing I should note is that in my case, the "realtek" module does
> > > NOT show up in the lsmod output when the network fails to start (some of the
> > > mails on the netdev thread suggest that's different for some users).
> > 
> > Thanks for the feedback! That's interesting. In such a case (network fails
> > to start and "realtek" module isn't loaded) what's the value of
> > /sys/class/net/<if>/phydev/phy_id ?
> > If it's the correct value, then request_module() in phy_device_create()
> > seems to fail. And unfortunately we don't check the return value yet.
> > If you can reproduce the problem, could you add a debug output printing the
> > return value of request_module() ? Or do you need a patch for it?
> 
> To rule out that the module is loaded but the init function fails: Do you
> have any "Error xx in registering driver" message in the syslog?

No there is no trace of such a message during a boot where the network failed to start.

Comment 118 Heiner Kallweit 2019-01-07 19:42:03 UTC
(In reply to Marc Dionne from comment #116)
> (In reply to Heiner Kallweit from comment #114)
> > (In reply to Marc Dionne from comment #113)
> > [...]
> > > 
> > > Another thing I should note is that in my case, the "realtek" module does
> > > NOT show up in the lsmod output when the network fails to start (some of the
> > > mails on the netdev thread suggest that's different for some users).
> > 
> > Thanks for the feedback! That's interesting. In such a case (network fails
> > to start and "realtek" module isn't loaded) what's the value of
> > /sys/class/net/<if>/phydev/phy_id ?
> > If it's the correct value, then request_module() in phy_device_create()
> > seems to fail. And unfortunately we don't check the return value yet.
> > If you can reproduce the problem, could you add a debug output printing the
> > return value of request_module() ? Or do you need a patch for it?
> 
> I can add some debugging, but unfortunately, after fixing the initial boot
> with a rmmod/modprobe of r8169, the machine now consistently boots with
> networking.  (btw the ID for my device is 0x1cc912)
> 
Background of the question was to rule out that for whatever reason a wrong PHY ID is read, thus preventing module realtek from being loaded.
Do you see this ID also in case of network failure?

Comment 119 oli 2019-01-07 19:46:21 UTC
From Friday evening (11.1) until sunday i can test new kernels - i am then back on the affected machine. 
If anyone wants me to test something, please give me the link to a .rpm i should test, then i will test it and report.

Comment 120 Marc Dionne 2019-01-07 19:54:40 UTC
(In reply to Heiner Kallweit from comment #118)
> (In reply to Marc Dionne from comment #116)
> > (In reply to Heiner Kallweit from comment #114)
> > > (In reply to Marc Dionne from comment #113)
> > > [...]
> > > > 
> > > > Another thing I should note is that in my case, the "realtek" module does
> > > > NOT show up in the lsmod output when the network fails to start (some of the
> > > > mails on the netdev thread suggest that's different for some users).
> > > 
> > > Thanks for the feedback! That's interesting. In such a case (network fails
> > > to start and "realtek" module isn't loaded) what's the value of
> > > /sys/class/net/<if>/phydev/phy_id ?
> > > If it's the correct value, then request_module() in phy_device_create()
> > > seems to fail. And unfortunately we don't check the return value yet.
> > > If you can reproduce the problem, could you add a debug output printing the
> > > return value of request_module() ? Or do you need a patch for it?
> > 
> > I can add some debugging, but unfortunately, after fixing the initial boot
> > with a rmmod/modprobe of r8169, the machine now consistently boots with
> > networking.  (btw the ID for my device is 0x1cc912)
> > 
> Background of the question was to rule out that for whatever reason a wrong
> PHY ID is read, thus preventing module realtek from being loaded.
> Do you see this ID also in case of network failure?

I won't be able to answer that, or see whether request_module has failed, until another failure occurs.  That's the "unfortunate" part.. :)

Comment 121 Heiner Kallweit 2019-01-08 21:58:35 UTC
(In reply to Heiner Kallweit from comment #111)
> The following new proposal for a workaround is basically the same as
> described in comment 49, just as part of the driver code.
> It ensures the the PHY driver module is loaded before r8169. I'd appreciate
> test results.
> 
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c
> b/drivers/net/ethernet/realtek/r8169.c
> index 7fe918026..c6b98024b 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -708,6 +708,7 @@ module_param(use_dac, int, 0);
>  MODULE_PARM_DESC(use_dac, "Enable PCI DAC. Unsafe on 32 bit PCI slot.");
>  module_param_named(debug, debug.msg_enable, int, 0);
>  MODULE_PARM_DESC(debug, "Debug verbosity level (0=none, ..., 16=all)");
> +MODULE_SOFTDEP("pre: realtek");
>  MODULE_LICENSE("GPL");
>  MODULE_FIRMWARE(FIRMWARE_8168D_1);
>  MODULE_FIRMWARE(FIRMWARE_8168D_2);
> -- 
> 2.20.1

This change has been committed to the "net" tree and is queued for stable, means it should be included in 5.0-rc2 and later in the stable kernel versions.

Comment 122 Daniele De Lorenzi 2019-01-10 07:47:59 UTC
I still have issue with the latest kernel 4.19.13-300.fc29 In my case the issue is slightly different from the described one, after the boot at random time (~10/15 min) I loose internet connectivity and I cannot even ping my default gateway (both LEDs on the NIC stop working).
I resolve with rmmod/modprobe of r8169.
I don't know if this can be a useful info (maybe this trigger the issue) but in my case the networking during the boot is slowly to start due to the STP of the switch that take the NIC port in learning state.

Comment 123 Marc Dionne 2019-01-15 14:44:05 UTC
(In reply to Heiner Kallweit from comment #118)
> (In reply to Marc Dionne from comment #116)
> > (In reply to Heiner Kallweit from comment #114)
> > > (In reply to Marc Dionne from comment #113)
> > > [...]
> > > > 
> > > > Another thing I should note is that in my case, the "realtek" module does
> > > > NOT show up in the lsmod output when the network fails to start (some of the
> > > > mails on the netdev thread suggest that's different for some users).
> > > 
> > > Thanks for the feedback! That's interesting. In such a case (network fails
> > > to start and "realtek" module isn't loaded) what's the value of
> > > /sys/class/net/<if>/phydev/phy_id ?
> > > If it's the correct value, then request_module() in phy_device_create()
> > > seems to fail. And unfortunately we don't check the return value yet.
> > > If you can reproduce the problem, could you add a debug output printing the
> > > return value of request_module() ? Or do you need a patch for it?
> > 
> > I can add some debugging, but unfortunately, after fixing the initial boot
> > with a rmmod/modprobe of r8169, the machine now consistently boots with
> > networking.  (btw the ID for my device is 0x1cc912)
> > 
> Background of the question was to rule out that for whatever reason a wrong
> PHY ID is read, thus preventing module realtek from being loaded.
> Do you see this ID also in case of network failure?

Just had another failure, after rebooting into a newly built kernel.  The correct ID is shown in /sys/class/net/<if>/phydev/phy_id, even when the device is associated with the generic PHY driver.

Marc

Comment 124 Heiner Kallweit 2019-01-15 21:22:59 UTC
Thanks for the update! Then it looks like request_module() sometimes fails, unfortunately we don't check the return code yet. I'll add this to the code.


Note You need to log in before you can comment on or make changes to this bug.