Bug 710650

Summary: Heavy network utilization crashes Fedora 15 host, driver r8169
Product: [Fedora] Fedora
Component: kernel
Version: 15
Hardware: x86_64
OS: Linux
Status: CLOSED NEXTRELEASE
Severity: unspecified
Priority: unspecified
Reporter: szt <tszalay>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: andy, anjo9292, antonio, corey.yeatman, didier.belhomme, gansalmon, itamar, joerg, jonathan, kernel-maint, madhu.chinakonda, marco.hartgring, p.zandbergen, thomas
Doc Type: Bug Fix
Last Closed: 2012-06-06 13:26:34 UTC

Description szt 2011-06-03 22:03:48 UTC
Description of problem:

Copying or reading large files (200-300 MB or more each) from an external source attached via eth0 to the local hard disk crashes Fedora 15 hosts. The Ethernet hardware is a Realtek RTL8111/8168B PCI Express Gigabit Ethernet controller.

Version-Release number of selected component (if applicable):

kernel-2.6.38.5-24.fc15.x86_64
kernel-2.6.38.6-26.rc1.fc15.x86_64
kernel-2.6.38.6-27.fc15.x86_64

How reproducible:

Copy one or more large files from an NFS mount to a local drive, or copy one or more large files to the host with scp, over the Realtek RTL8111/8168B controller.

Steps to Reproduce:

1. cp -avi /nfs_mount_drive/directory .
2. sha1sum /nfs_mount_drive/directory/*
3. scp -pr ./directory_containing_large_files fedora15:.

Any of the commands above can crash the Fedora 15 host.

Actual results:

Fedora 15 host crashes with kernel oops, or simply reboots.

Expected results:

Copying or reading large files completes without problems.

Additional info:

The computer has 4GB RAM.

The host was initially installed with Fedora 14, preupgraded to Fedora 15 Beta, and then yum-updated to Fedora 15. The problem has existed since Fedora 15 Beta.

Before the system crashes, a series of similar entries appears in /var/log/messages:

May 30 14:35:44 rimfall kernel: [ 6744.721874] r8169 0000:09:00.0: eth0: link up
May 30 14:35:45 rimfall kernel: [ 6744.742857] r8169 0000:09:00.0: eth0: link up
May 30 14:35:45 rimfall kernel: [ 6744.765818] r8169 0000:09:00.0: eth0: link up
May 30 14:35:45 rimfall kernel: [ 6745.011625] r8169 0000:09:00.0: eth0: link up
May 30 14:35:45 rimfall kernel: [ 6745.056563] r8169 0000:09:00.0: eth0: link up
May 30 14:35:45 rimfall kernel: [ 6745.103531] r8169 0000:09:00.0: eth0: link up
May 30 14:35:45 rimfall kernel: [ 6745.361230] r8169 0000:09:00.0: eth0: link up
May 30 14:35:45 rimfall kernel: [ 6745.421170] r8169 0000:09:00.0: eth0: link up
May 30 14:35:45 rimfall kernel: [ 6745.474120] r8169 0000:09:00.0: eth0: link up
May 30 14:35:45 rimfall kernel: [ 6745.541033] r8169 0000:09:00.0: eth0: link up
May 30 14:35:50 rimfall kernel: [ 6749.750417] net_ratelimit: 44 callbacks suppressed
May 30 14:35:50 rimfall kernel: [ 6749.750426] r8169 0000:09:00.0: eth0: link up
May 30 14:35:50 rimfall kernel: [ 6749.778443] r8169 0000:09:00.0: eth0: link up
May 30 14:35:50 rimfall kernel: [ 6750.064112] r8169 0000:09:00.0: eth0: link up
May 30 14:35:50 rimfall kernel: [ 6750.324837] r8169 0000:09:00.0: eth0: link up
May 30 14:35:50 rimfall kernel: [ 6750.337834] r8169 0000:09:00.0: eth0: link up
May 30 14:35:50 rimfall kernel: [ 6750.390761] r8169 0000:09:00.0: eth0: link up
May 30 14:35:50 rimfall kernel: [ 6750.466663] r8169 0000:09:00.0: eth0: link up
May 30 14:35:50 rimfall kernel: [ 6750.523598] r8169 0000:09:00.0: eth0: link up
May 30 14:35:51 rimfall kernel: [ 6750.760353] r8169 0000:09:00.0: eth0: link up
May 30 14:35:51 rimfall kernel: [ 6750.837280] r8169 0000:09:00.0: eth0: link up
May 30 14:35:55 rimfall kernel: [ 6754.778988] net_ratelimit: 37 callbacks suppressed
May 30 14:35:55 rimfall kernel: [ 6754.778997] r8169 0000:09:00.0: eth0: link up
May 30 14:35:55 rimfall kernel: [ 6754.801983] r8169 0000:09:00.0: eth0: link up
May 30 14:35:55 rimfall kernel: [ 6754.878892] r8169 0000:09:00.0: eth0: link up
May 30 14:35:55 rimfall kernel: [ 6755.128600] r8169 0000:09:00.0: eth0: link up
May 30 14:35:55 rimfall kernel: [ 6755.159580] r8169 0000:09:00.0: eth0: link up
May 30 14:35:55 rimfall kernel: [ 6755.204521] r8169 0000:09:00.0: eth0: link up
May 30 14:35:55 rimfall kernel: [ 6755.281398] r8169 0000:09:00.0: eth0: link up
May 30 14:35:55 rimfall kernel: [ 6755.529188] r8169 0000:09:00.0: eth0: link up
May 30 14:35:55 rimfall kernel: [ 6755.566121] r8169 0000:09:00.0: eth0: link up
May 30 14:35:56 rimfall kernel: [ 6755.818861] r8169 0000:09:00.0: eth0: link up
May 30 14:36:00 rimfall kernel: [ 6759.827491] net_ratelimit: 38 callbacks suppressed
May 30 14:36:00 rimfall kernel: [ 6759.827499] r8169 0000:09:00.0: eth0: link up
May 30 14:36:00 rimfall kernel: [ 6760.075220] r8169 0000:09:00.0: eth0: link up
May 30 14:36:00 rimfall kernel: [ 6760.327952] r8169 0000:09:00.0: eth0: link up
May 30 14:36:00 rimfall kernel: [ 6760.481792] r8169 0000:09:00.0: eth0: link up
May 30 14:36:00 rimfall kernel: [ 6760.526746] r8169 0000:09:00.0: eth0: link up
May 30 14:36:00 rimfall kernel: [ 6760.574697] r8169 0000:09:00.0: eth0: link up
May 30 14:36:00 rimfall kernel: [ 6760.622615] r8169 0000:09:00.0: eth0: link up
May 30 14:36:01 rimfall kernel: [ 6760.871399] r8169 0000:09:00.0: eth0: link up
May 30 14:36:01 rimfall kernel: [ 6760.920301] r8169 0000:09:00.0: eth0: link up
May 30 14:36:01 rimfall kernel: [ 6761.180036] r8169 0000:09:00.0: eth0: link up
May 30 14:36:05 rimfall kernel: [ 6765.215638] net_ratelimit: 32 callbacks suppressed
May 30 14:36:05 rimfall kernel: [ 6765.215646] r8169 0000:09:00.0: eth0: link up
May 30 14:36:46 rimfall kernel: imklog 5.7.9, log source = /proc/kmsg started.
May 30 14:36:46 rimfall rsyslogd: [origin software="rsyslogd" swVersion="5.7.9" x-pid="910" x-info="http://www.rsyslog.com"] start
May 30 14:36:46 rimfall mcelog.setup[900]: read: No such device
May 30 14:36:46 rimfall kernel: [    0.000000] Initializing cgroup subsys cpuset
May 30 14:36:46 rimfall kernel: [    0.000000] Initializing cgroup subsys cpu
May 30 14:36:46 rimfall kernel: [    0.000000] Linux version 2.6.38.6-27.fc15.x86_64 (mockbuild.fedoraproject.org) (gcc version 4.6.0 20110428 (Red Hat 4.6.0-6) (GCC) ) #1 SMP Sun May 15 17:23:28 UTC 2011
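To quantify a burst like the one in the log above, the "link up" messages can be counted per second. The following is only a minimal sketch: the function name link_up_per_second is hypothetical, and it assumes the classic syslog layout shown above (month, day and HH:MM:SS as the first three fields). Because the kernel rate-limits these messages ("net_ratelimit: N callbacks suppressed"), the real number of events is higher than what the log contains.

```shell
#!/bin/sh
# Count r8169 "link up" messages per syslog second.
# $1: path to a syslog file such as /var/log/messages.
link_up_per_second() {
    # Fields 1-3 of a classic syslog line are month, day and HH:MM:SS.
    grep 'r8169.*link up' "$1" |
        awk '{ count[$1 " " $2 " " $3]++ }
             END { for (t in count) print t, count[t] }' |
        sort
}
```

Running link_up_per_second /var/log/messages (as a user allowed to read the file) prints one line per second that contained at least one message, followed by the count for that second.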

Comment 1 szt 2011-06-09 20:29:48 UTC
Same with kernel 2.6.38.7-30.fc15.x86_64.

Comment 2 Mace Moneta 2011-06-20 02:05:16 UTC
I was having this problem; about a dozen re-occurrences.  I had jumbo frames set on the interface; changed the MTU back to 1500 and haven't had another event.
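For anyone who wants to check whether jumbo frames are configured before trying this workaround, here is a small sketch that reads the MTU from sysfs. The function names show_mtu and is_jumbo are hypothetical, eth0 is an assumed interface name, and the second argument exists only so the sysfs root can be replaced for testing.

```shell
#!/bin/sh
# Print the MTU of an interface as exposed in sysfs.
# $1: interface name (defaults to eth0, an assumption)
# $2: sysfs root (defaults to /sys; overridable for testing)
show_mtu() {
    cat "${2:-/sys}/class/net/${1:-eth0}/mtu"
}

# Succeed if the MTU is above the 1500-byte Ethernet default,
# i.e. jumbo frames are configured on the interface.
is_jumbo() {
    [ "$(show_mtu "$@")" -gt 1500 ]
}
```

If is_jumbo eth0 succeeds, the MTU can be put back to the default as root with "ip link set dev eth0 mtu 1500".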

Comment 3 szt 2011-06-20 09:08:30 UTC
(In reply to comment #2)
> I was having this problem; about a dozen re-occurrences.  I had jumbo frames
> set on the interface; changed the MTU back to 1500 and haven't had another
> event.

I have never used an MTU larger than 1500 bytes.

With kernel 2.6.38.8-32.fc15.x86_64: when I attempted to read one large file from an NFS-mounted drive, the computer suddenly rebooted with no explanation in /var/log/messages.

Comment 4 Pim Zandbergen 2011-06-27 14:12:54 UTC
I had the exact same problem. It completely went away after switching to the r8168 driver, version 8.024.00, from the Realtek site.

Comment 5 Didier Belhomme 2011-06-29 19:35:55 UTC
I had exactly the same problem as Pim, with the same cure: installing the r8168 driver in place of the r8169 driver that came with the kernel solved the problem. I use kernel 2.6.35.13-92.fc14.x86_64 from Fedora 14.

In my case, the computer rebooted when opening a vncviewer session.

Does anyone know how to report the problem to the kernel team (at Red Hat or directly to kernel.org)?

Regards all,

Didier (happy again, being able to connect at 1000 !)

Comment 6 szt 2011-06-29 22:27:13 UTC
(In reply to comment #4)
> I had the exact same problem. It completely went away after switching to the
> r8168 driver, version 8.024.00, from the Realtek site.

Thanks, the genuine Realtek driver works like a charm.

Comment 7 szt 2011-06-29 22:59:25 UTC
(In reply to comment #5)

Didier,

> I had exactly the same problem than Pim, with the same cure : installing the
> r8168 driver replacing the r8169 that came with the kernel, solved the problem.
> I use kernel 2.6.35.13-92.fc14.x86_64 from Fedora 14.

I use or manage three other F14 boxes with the same network card, all running the 2.6.35.13-92.fc14.x86_64 kernel. All work fine. (Two desktops with different Gigabyte mainboards and a Lenovo ThinkPad Edge, all with integrated Realtek RTL8111/8168B cards.)

> Do someone knows how to report the problem to the kernel team (in RedHat or
> directly to kernel.org) ?

I'm quite sure they received the opening ticket (and probably all the comments) via email. Check "Email sent to:" at the top of this webpage.

Tamás

Comment 8 Mace Moneta 2011-06-29 23:57:34 UTC
The model reported depends on the command.  While lspci reports RTL8111/8168B:

$ lspci | grep -i eth
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)

The dmesg output on my machine reports RTL8168c/8111c:

$ grep -i 8169 /var/log/dmesg
[   11.644904] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[   11.644925] r8169 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[   11.644981] r8169 0000:03:00.0: setting latency timer to 64
[   11.645061] r8169 0000:03:00.0: irq 44 for MSI/MSI-X
[   11.645233] r8169 0000:03:00.0: eth0: RTL8168c/8111c at 0xffffc90006ade000, 00:30:48:b0:96:f0, XID 1c4000c0 IRQ 44
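Since the marketing name differs between tools, the XID printed by r8169 at probe time is probably the most useful value to quote when comparing machines. A minimal sketch for pulling the chip name and XID out of dmesg-style text follows; the function name r8169_chip_xid is hypothetical, and the pattern assumes the probe-line format shown above.

```shell
#!/bin/sh
# Extract the chip name and XID that r8169 logged at probe time.
# Reads dmesg-style text on stdin and prints "<chip> <xid>" for
# every line of the form
#   r8169 <pci address>: ethN: RTLxxxx at <addr>, <mac>, XID <hex> IRQ <n>
r8169_chip_xid() {
    sed -n 's/.*r8169 .*: eth[0-9]*: \(RTL[^ ]*\) at .* XID \([0-9a-fA-F]*\).*/\1 \2/p'
}
```

Typical use would be "dmesg | r8169_chip_xid" or "r8169_chip_xid < /var/log/dmesg".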

Comment 9 szt 2011-06-30 00:26:42 UTC
(In reply to comment #8)
> The model reported depends on the command.

Additional information for the original bug report: the r8169 kernel module reports the Ethernet card as RTL8168d/8111d.

Comment 10 Didier Belhomme 2011-06-30 08:03:33 UTC
More information about the setup:
- I've only seen this problem when the system is connected to a gigabit switch (never when connected at 100 Mbps).
- The computer is a laptop from asmobile (ASUS OEM), model Z97V.
- The reported card (from lspci | grep -i eth) is: 06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
- The output of grep -i 8168 /var/log/dmesg:

[    0.148168] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
[    3.032044] r8168 Gigabit Ethernet driver 8.024.00-NAPI loaded
[    3.032073] r8168 0000:06:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[    3.032096] r8168 0000:06:00.0: setting latency timer to 64
[    3.032229] r8168 0000:06:00.0: irq 45 for MSI/MSI-X
[    3.033127] eth%d: RTL8168B/8111B at 0xffffc90011296000, 00:22:15:ff:8e:3a, IRQ 45
[    3.058457] r8168: This product is covered by one or more of the following patents: US5,307,459, US5,434,872, US5,732,094, US6,570,884, US6,115,776, and US6,327,625.
[    3.058460] eth0: Identified chip type is 'RTL8168C/8111C'.
[    3.058462] r8168  Copyright (C) 2011  Realtek NIC software team <nicfae> 

With this driver, the system is running well at 1000 Mbps.

Regards,

Didier.

Comment 11 Pim Zandbergen 2011-06-30 09:07:51 UTC
Allow me to correct and extend the statements from my comment #4.

My symptoms were different; heavy network traffic would just hang my box, without any syslog information, not even while running in the text console mode.

I have watchdog running, using iTCO_wdt, but it would not reset the system.

Problems started with kernel 2.6.38 and up on Fedora 14, both with vanilla kernels and Fedora rawhide kernels. Things were fine up to 2.6.37.

Sometimes running ttcp between this and another host would reproduce the problem within a minute, but not always. In real life, the box would hang when a PVR, using my Fedora box as an NFS server, started an HD recording.

As I found out the Realtek driver would solve the problem for these experimental kernels in Fedora 14, I expected it would be necessary for the standard Fedora 15 kernel too, and it was necessary indeed.

My system is based on an MSI MS-7522 (X58 Platinum SLI) with dual onboard RTL8111/8168 NICs.

The r8168 driver identifies the NICs as RTL8168B/8111B
The r8169 driver identifies the NICs as RTL8168c/8111c
lspci identifies the NICs as RTL8111/8168B rev 02 (10ec:8168 sub 1462:7522)
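Because the two drivers name the same hardware differently, it helps to confirm which one is actually bound to the interface before and after switching. The sketch below reads the driver symlink from sysfs; the function name bound_driver is hypothetical, eth0 is an assumed interface name, and the second argument exists only so the sysfs root can be replaced for testing.

```shell
#!/bin/sh
# Print the kernel driver bound to a network interface, e.g. r8169 or r8168.
# $1: interface name (defaults to eth0, an assumption)
# $2: sysfs root (defaults to /sys; overridable for testing)
bound_driver() {
    link="${2:-/sys}/class/net/${1:-eth0}/device/driver"
    # The driver entry is a symlink to the driver's directory in sysfs.
    [ -L "$link" ] || return 1
    basename "$(readlink "$link")"
}
```

The same information is reported by "ethtool -i eth0" (the "driver:" line) and by "lspci -k" (the "Kernel driver in use:" line).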

Comment 12 Corey 2011-07-25 02:44:04 UTC
Same issue here using the 2.6.38.8-35.fc15.x86_64 kernel.  As mentioned by others above, switching from the r8169 driver back to the r8168 driver from the Realtek website has solved this issue (for now).

Issue only seems to be triggered when copying large amounts of data (in my case, large .mts video files).  Whole machine locked up, no kernel panics or messages seen before it happened.

Comment 13 Pim Zandbergen 2011-08-16 15:20:53 UTC
I see a number of similar reports on bugzilla.kernel.org:

https://bugzilla.kernel.org/show_bug.cgi?id=29282
https://bugzilla.kernel.org/show_bug.cgi?id=32962
https://bugzilla.kernel.org/show_bug.cgi?id=34172

The last one points to a possible fix by Hayes Wang.

Comment 14 Andrew Haveland-Robinson 2011-09-22 14:57:33 UTC
Happened to me too with 2.6.40.4-5.fc15.x86_64 on Gigabyte EX58-UD5.

I upgraded from fc14 to fc15 and wanted to resize the RAID partitions, but had big problems with existing LVM PVs that could not be shrunk. So I tore everything down, repartitioned, remade the PVs, VGs and LVs, updated the UUIDs in grub and fstab, and edited the initramfs to include the new mdadm.conf. After that, no matter what I did, the new root LV would not mount at /sysroot, which it certainly should have done!

So back to beginning - reinstalled from live usb to get a bootable system then used fc15 liveusb to restore system files and data (while keeping new fstab, mdadm.conf and lvm backup).
After temporary system made, I used fc15 live usb to restore 100Gb of files over NFS. No problems.

Booted into this newly restored fc15 (kernel 2.6.40.4-5) so it could do its job while restoring large archives from the backup server.

Then a few seconds later, screen fills with kernel messages:
Sep 22 14:11:27 odin kernel: [  978.857918] r8169 0000:08:00.0: eth0: link up
Sep 22 14:11:27 odin kernel: [  978.864923] r8169 0000:08:00.0: eth0: link up
Sep 22 14:11:27 odin kernel: [  978.868872] r8169 0000:08:00.0: eth0: link up
Sep 22 14:11:27 odin kernel: [  978.877846] r8169 0000:08:00.0: eth0: link up
Sep 22 14:11:27 odin kernel: [  978.890801] r8169 0000:08:00.0: eth0: link up
Sep 22 14:11:27 odin kernel: [  978.902770] r8169 0000:08:00.0: eth0: link up
...

System locks up, drive light still on. I wait to see if it'll come back.
Then get "watchdog detected hard lockup" errors... Have to do hard reset.
After this happened a few times, I found this thread and this link:
http://code.google.com/p/r8168/updates/list

Downloaded r8168-8.025.00.tar.bz2, followed the instructions and issues listed there, and managed to compile and install r8168.ko even though it was an Ubuntu flavour.

wget http://r8168.googlecode.com/files/r8168-8.025.00.tar.bz2
tar -jxf r8168-8.025.00.tar.bz2
cd r8168-8.025.00
./autorun.sh

I continued to restore files, and so far the machine hasn't locked up even though the raids are resyncing in the background.

I suggest escalating this bug to severe status. For me it was a showstopper, and I hope the info here will help others.

Comment 15 Joerg 2011-09-28 13:00:00 UTC
I have the same issue as #11.

No logs, no messages, system just freezes.

However, I have a second NIC (Intel e1000) in my system, which carries the network traffic. The Realtek NIC is not active and no cable is connected to it; the driver is merely loaded.

The problem was solved by blacklisting r8169. So something in the r8169 driver must harm the system even when the NIC is not in use.
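For reference, a module is normally blacklisted with a one-line file under /etc/modprobe.d. The sketch below writes such a file; the function name blacklist_module and the file name blacklist-<module>.conf are my own choices and not something stated in this report. Note that a "blacklist" line only stops automatic loading by hardware alias; it does not stop an explicit modprobe.

```shell
#!/bin/sh
# Write a modprobe.d snippet that stops a module from being auto-loaded.
# $1: module name, e.g. r8169
# $2: target directory (defaults to /etc/modprobe.d, which needs root)
blacklist_module() {
    dir="${2:-/etc/modprobe.d}"
    printf 'blacklist %s\n' "$1" > "$dir/blacklist-$1.conf"
}
```

As root, "blacklist_module r8169" followed by a reboot (or "modprobe -r r8169" if nothing is using the module) should keep the driver from loading.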

Comment 16 Pim Zandbergen 2011-12-06 18:51:36 UTC
I have upgraded to Fedora 16, and again am using a standard Fedora kernel with standard Fedora Realtek drivers.

The system freezes seem to be gone; now only the NFS service freezes under heavy load. "service nfs-service restart" does not seem to actually restart anything. All other network traffic continues.


I'm not sure whether these NFS freezes are related to the Realtek drivers.