Bug 1660649 - r8169: hang or reboot when networking is enabled with kernel 4.19.9 on Intel Skylake system
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 29
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-12-18 22:15 UTC by Steven Usdansky
Modified: 2020-10-26 22:08 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-17 20:05:31 UTC
Type: Bug
Embargoed:


Description Steven Usdansky 2018-12-18 22:15:36 UTC
Description of problem:
System fails to boot with kernel 4.19.8-300.fc29.x86_64 with networking enabled unless nolapic is passed as a kernel parameter (in which case only one CPU is found)

Version-Release number of selected component (if applicable):
kernel 4.19.8-300.fc29.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Select kernel 4.19.8-300.fc29.x86_64 from grub menu
2. Sit back and wait for reboot or lockup

Actual results:
System either locks up or reboots prior to showing the login screen

Expected results:
System boots normally, lets me login, automatically connects to network

Additional info:
Bug 1658623 contains the jumble of my previous attempt to track this down. My apologies for the mess.

Manjaro kernel 4.19.6-1 boots properly (without nolapic kernel parameter; shows 4 cpus; networking works as expected)

Bug is apparently also present in Arch 4.19 kernel: 
  https://bbs.archlinux.org/viewtopic.php?id=242645

Running F29 Mate spin 
Booted kernel-4.18.16 (all is well)
 Unchecked Enable Networking in panel applet
 Rebooted into kernel 4.19.9 (selinux=0 but did not add nolapic)
 Boot worked, was able to get to Mate Desktop.
 Ran nm-applet program to display networking applet in panel
 Checked  Enable Networking (using panel applet) 
Tried two times, with two different results
 a) System totally locked up; no keyboard response - had to use PC's power button to poweroff
 b) System rebooted itself

Comment 1 Heiner Kallweit 2019-01-08 10:35:23 UTC
It may be relevant that r8169 uses MSI-X whilst r8168 supports MSI only. That nolapic is needed indicates a broken BIOS; maybe some change in kernel IRQ handling triggered a BIOS bug.
It would be interesting to know whether the kernel command line parameter pci=nomsi helps too.
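
When testing parameters such as nolapic or pci=nomsi, it helps to confirm that the parameter actually reached the booted kernel. A minimal sketch, assuming a POSIX shell; has_param is a hypothetical helper name and not part of any kernel tooling:

```shell
# has_param CMDLINE PARAM
# Succeeds (exit status 0) if PARAM appears as a whole word in CMDLINE,
# e.g. "nolapic" or "pci=nomsi". Hypothetical helper, not a kernel tool.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage on a running system:
#   has_param "$(cat /proc/cmdline)" pci=nomsi && echo "pci=nomsi is active"
```

The match is on whole space-separated words, so pci=nomsi does not match a longer word that merely starts with the same text.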

Comment 2 Steven Usdansky 2019-01-08 19:53:43 UTC
Kernel command line parameter pci=nomsi does not help.
Tried kernel-5.0.0-0.rc1.git0.1.fc30 (as well as a few 4.20 Fedora kernels) with the same result.

Comment 3 Heiner Kallweit 2019-01-08 20:00:45 UTC
Thanks. Then basically the only option you have is bisecting to find out which commit causes a conflict with your (most likely) broken BIOS.
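
For reference, a bisection starts from one known-good and one known-bad revision. The sketch below only prints the git commands instead of running them; bisect_plan is a hypothetical helper, and the choice of the mainline tags v4.18 and v4.19 is an assumption based on the report (4.18.16 works, 4.19.x fails), not something verified in this bug:

```shell
# bisect_plan GOOD BAD
# Prints (does not run) the git commands that start a bisection between
# a known-good and a known-bad revision. Hypothetical helper.
bisect_plan() {
    printf 'git bisect start\n'
    printf 'git bisect bad %s\n' "$2"
    printf 'git bisect good %s\n' "$1"
}

# Assumed endpoints: v4.18 boots, v4.19 hangs.
bisect_plan v4.18 v4.19
```

After these commands git checks out a revision to build and boot; each result is then reported with "git bisect good" or "git bisect bad" until git names the first bad commit, and "git bisect reset" restores the original checkout.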

Comment 4 Steve 2019-01-08 23:02:45 UTC
Steven: Could you verify that you have the latest linux-firmware package:

$ rpm -q linux-firmware 
linux-firmware-20181219-89.git0f22c852.fc29.noarch

FWIW, the specific firmware files are here:
$ ls /lib/firmware/rtl_nic

Based on your logs and the source, rtl8168e-3.fw is the specific firmware file:

$ grep FIRMWARE_8168E_3 linux-4.19/drivers/net/ethernet/realtek/r8169.c
#define FIRMWARE_8168E_3	"rtl_nic/rtl8168e-3.fw"
	[RTL_GIGA_MAC_VER_34] = {"RTL8168evl/8111evl",	FIRMWARE_8168E_3},
MODULE_FIRMWARE(FIRMWARE_8168E_3);

(That's for background info only ...)
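
The firmware check above can also be scripted. A minimal sketch, assuming a POSIX shell; fw_status is a hypothetical helper name, while the directory and file name are the ones quoted from the driver source above:

```shell
# fw_status DIR NAME
# Prints "present" if DIR/NAME is a non-empty regular file, else "missing".
# Hypothetical helper for illustration.
fw_status() {
    if [ -f "$1/$2" ] && [ -s "$1/$2" ]; then
        echo present
    else
        echo missing
    fi
}

# Usage on the affected system:
#   fw_status /lib/firmware/rtl_nic rtl8168e-3.fw
```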

Comment 5 Steve 2019-01-09 13:31:22 UTC
Koji has all the prerelease 4.19 kernels here:
https://koji.fedoraproject.org/koji/packageinfo?packageID=8

After looking at the Changelog and the "build.log" for several, I concluded that the ones that have "git0" in the version correspond to the kernel.org rcN releases.

Thus:

kernel-4.19.0-0.rc1.git0.1.fc30
https://koji.fedoraproject.org/koji/buildinfo?buildID=1139720

kernel-4.19.0-0.rc2.git0.1.fc30
https://koji.fedoraproject.org/koji/buildinfo?buildID=1141549

Etc.

If you want to try test installs, you only need the kernel-core and kernel-modules packages for x86_64.

Install with:

# dnf install kernel*.rpm

(Tested in an F29 VM.)

Caution: Prerelease kernels could have bugs, so back up anything important.

Comment 6 Steve 2019-01-09 13:40:33 UTC
Important detail:

To stop dnf from removing older kernels, edit /etc/dnf/dnf.conf:

#installonly_limit=3
installonly_limit=0 <<<<< This stops dnf from removing older kernels.
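
Note that the "<<<<<" text above is commentary and should not be copied into the file. To confirm which value is in effect after editing, something like the following sketch could be used; installonly_limit_of is a hypothetical helper, and it assumes the simple key=value layout shown above:

```shell
# installonly_limit_of FILE
# Prints the last uncommented installonly_limit value found in FILE.
# Hypothetical helper; lines starting with "#" are ignored.
installonly_limit_of() {
    sed -n 's/^[[:space:]]*installonly_limit[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}

# Usage:
#   installonly_limit_of /etc/dnf/dnf.conf
```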

Comment 7 Steve 2019-01-09 13:46:37 UTC
(In reply to Steve from comment #5)
...
> If you want to try test installs, you only need the kernel-core and kernel-modules packages for x86_64.
...

For this problem, it might be better to use the debug versions:

kernel-debug-core...
kernel-debug-modules...

Comment 8 Steven Usdansky 2019-01-09 14:07:00 UTC
I have the latest firmware. Been using git0.1 test kernels for a while (needed debug off for my previous desktop PC's video).

Found this "r8169: don’t use MSI-X on RTL8106e [Linux 4.19]" at 
   https://www.systutorials.com/linux-kernels/771433/r8169-dont-use-msi-x-on-rtl8106e-linux-4-19/ 
Possibly relevant??

Comment 9 Heiner Kallweit 2019-01-09 14:31:06 UTC
(In reply to Steven Usdansky from comment #8)
> 
> Found this "r8169: don’t use MSI-X on RTL8106e [Linux 4.19]" at 
> https://www.systutorials.com/linux-kernels/771433/r8169-dont-use-msi-x-on-rtl8106e-linux-4-19/ 
> Possibly relevant??

I don't think so. This patch has since been reverted because the actual issue was a low-level problem with an Intel PCI bridge:
083874549fdf ("PCI: Reprogram bridge prefetch registers on resume")

Comment 10 Steve 2019-01-09 14:57:15 UTC
(In reply to Steven Usdansky from comment #8)
> I have the latest firmware. Been using git0.1 test kernels for a while
> (needed debug off for my previous desktop PCs video)

Have you tried kernel-4.19.0-0.rc1.git0.1.fc30?
https://koji.fedoraproject.org/koji/buildinfo?buildID=1139720

That would narrow the problem down, but not by much. :-)

Here is the git log for v4.19-rc1:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/?h=v4.19-rc1

> Found this "r8169: don’t use MSI-X on RTL8106e [Linux 4.19]" at 
> https://www.systutorials.com/linux-kernels/771433/r8169-dont-use-msi-x-on-rtl8106e-linux-4-19/ 
> Possibly relevant??

That patch appears to apply to RTL_GIGA_MAC_VER_39, while you have RTL_GIGA_MAC_VER_34 (Comment 4):

r8169: don't use MSI-X on RTL8106e
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7bb05b85bc2d1a1b647b91424b2ed4a18e6ecd81

Comment 11 Steve 2019-01-09 15:16:26 UTC
Heiner: The Arch Linux user (Blitz67) reported display corruption* and Steven did too, in one case:

"Had one hang at the starting LightDM message wherein background was green and flickering." (Bug 1658623, Comment 10)

Is there a way that enabling networking could cause display corruption?

* https://bbs.archlinux.org/viewtopic.php?pid=1821449#p1821449

Comment 12 Heiner Kallweit 2019-01-09 15:22:36 UTC
(In reply to Steve from comment #11)
> Heiner: The Arch Linux user (Blitz67) reported display corruption* and
> Steven did too, in one case:
> 
> "Had one hang at the starting LightDM message wherein background was green
> and flickering." (Bug 1658623, Comment 10)
> 
> Is there a way that enabling networking could cause display corruption?
> 
> * https://bbs.archlinux.org/viewtopic.php?pid=1821449#p1821449

Not directly at least. But of course a driver could trigger some issue in an underlying subsystem like PCI core. I think you'll find the commit to blame only by bisecting.

Comment 13 Justin M. Forbes 2019-01-29 16:13:51 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 4.20.5-200.fc29.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 14 Steven Usdansky 2019-01-30 14:02:39 UTC
Bug persists with kernel 4.20.5-200.fc29.x86_64

Comment 15 Heiner Kallweit 2019-01-30 14:16:30 UTC
Disabling ASPM could also be worth a try; best set it to "performance" or "off" via the kernel parameter pcie_aspm.policy.
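
To see which ASPM policy is actually in effect, the kernel lists the available policies with the active one in square brackets. A minimal sketch, assuming that listing is exposed at /sys/module/pcie_aspm/parameters/policy on the running kernel; active_aspm_policy is a hypothetical helper name:

```shell
# active_aspm_policy STRING
# Prints the bracketed (active) entry of a policy listing such as
# "default [performance] powersave powersupersave".
# Prints nothing if no entry is bracketed. Hypothetical helper.
active_aspm_policy() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage on a running system (path is an assumption, see above):
#   active_aspm_policy "$(cat /sys/module/pcie_aspm/parameters/policy)"
```

As far as I know, the policy names accepted by pcie_aspm.policy do not include "off"; switching ASPM off entirely is done with the separate kernel parameter pcie_aspm=off.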

Comment 16 Steven Usdansky 2019-02-04 15:38:22 UTC
Setting kernel parameter pcie_aspm.policy did not help. 

I'm currently using Rawhide as my primary OS, and the bug persists in kernel-5.0.0-0.rc4.git3.1.fc30.x86_64. 

I'm thinking this bug and bug #1648366 should be closed, even though unresolved, and a new bug opened for the kernel to be used in F30.

Comment 17 Wiktor Wandachowicz 2019-03-08 08:22:39 UTC
I have found that pressing [Esc] during system start (to see boot messages instead of the graphical logo) curiously allowed the system to boot normally, as I described in https://bugzilla.redhat.com/show_bug.cgi?id=1648366#c11
My Fedora 29 installation is running inside VirtualBox 6.0.4 r128413 using kernel 4.20.13-200.fc29.x86_64 (the previous 4.20.8-200.fc29.x86_64 doesn't have this issue), and this is 100% repeatable there.

Comment 18 Steven Usdansky 2019-03-09 01:17:10 UTC
I did not see the error when running inside VirtualBox with any 4.19 or 4.20 kernel, which makes sense because the error is due to the way my hardware interacts with the Realtek r8169 module, which is not the wired connection seen when running within the virtualized environment. My fix, which enables me to boot into Fedora, and use the Realtek r8169 module with all four threads of my i3-6100U, is documented in comment #2 of bug 1674268

Comment 19 Justin M. Forbes 2019-08-20 17:41:44 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 5.2.9-100.fc29.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.

Comment 20 Justin M. Forbes 2019-09-17 20:05:31 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

Comment 21 Steven Usdansky 2020-10-26 22:08:35 UTC
Issue resolved for me with later kernels

