Bug 654147

Summary: crash on lots of traffic to rtl8111/8168b interface
Product: [Fedora] Fedora Reporter: J. Bruce Fields <bfields>
Component: kernel Assignee: Ivan Vecera <ivecera>
Status: CLOSED WONTFIX QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: low    
Version: 14 CC: bugzilla, dougsland, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, romieu, sgruszka, tom
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-08-16 18:19:27 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description J. Bruce Fields 2010-11-17 01:20:00 UTC
On the problem machine, HOST, I run

iperf -s

and on another machine, run

iperf -c HOST

Then the kernel spews several messages per second like

r8169 0000:02:00.0: eth1: link up

sometimes interspersed with

NOHZ: local_softirq_pending 08

If I do this enough times, eventually HOST will spontaneously reboot.
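The "enough times" part of the reproduction can be scripted. The following is only a sketch: the function name `repro_loop`, the `IPERF` override variable, and the default run count and duration are illustrative assumptions, not part of the original report. It uses only the standard iperf client options `-c` (client mode) and `-t` (duration in seconds).

```shell
# IPERF may be overridden (for example with a stub for a dry run);
# by default the real iperf binary is used.
IPERF=${IPERF:-iperf}

# repro_loop HOST [RUNS] [SECONDS]
# Run the iperf client against HOST several times in a row.
repro_loop() {
    host=$1
    runs=${2:-10}
    secs=${3:-10}
    i=1
    while [ "$i" -le "$runs" ]; do
        echo "run $i/$runs against $host" >&2
        # Stop at the first failed run (for example, if HOST has rebooted).
        "$IPERF" -c "$host" -t "$secs" || return 1
        i=$((i + 1))
    done
}

# Example, on the client machine, with "iperf -s" already running on HOST:
#   repro_loop HOST 20 30
```

Watching the kernel log on HOST while this runs should show whether the "link up" messages appear.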

This is on 2.6.35-48.fc14.x86_64.

lspci reports the interface as "02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)"

I put in another card (a D-link PCI gigabit card) and it worked fine.

Another machine with similar hardware (including the same Realtek controller) also works fine.  The other machine is running F13, whereas the problematic machine is running F14.

Comment 1 J. Bruce Fields 2010-11-17 02:44:46 UTC
I also booted my other machine to a F14 live disk and found it crashed (whereas it doesn't with F13).

So this is likely a regression.

Comment 2 Stanislaw Gruszka 2010-12-20 19:28:16 UTC
Can you log into a virtual console (or use a serial console), reproduce the problem, and provide the call trace / dmesg output?

Also try the current 2.6.35.10 kernel from koji; it includes some r8169 patches that may help: http://koji.fedoraproject.org/koji/buildinfo?buildID=210654

Comment 3 Stanislaw Gruszka 2011-02-22 11:15:04 UTC
Does the pcie_aspm=off boot option help?
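When testing a boot option like this, it helps to confirm that the running kernel was actually booted with it. A minimal sketch, assuming a helper named `has_boot_param` (a hypothetical name) that looks for the option as a whole word in `/proc/cmdline`:

```shell
# has_boot_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole word on the kernel command line.
# CMDLINE defaults to the contents of /proc/cmdline.
has_boot_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    # Pad with spaces so that only complete words match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   has_boot_param pcie_aspm=off && echo "ASPM disabled on the command line"
```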

Comment 4 bugzilla 2011-03-07 20:34:18 UTC
i've got the same problem - hard crashes (not reboots) when doing large transfers over nfs/rsync at gigabit speeds, worked fine on f13

#lspci
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)

#uname
2.6.35.11-83.fc14.x86_64

i've seen the same problem before with intel e1000 cards, but never realtek.

Comment 5 bugzilla 2011-03-07 22:14:22 UTC
pcie_aspm=off doesn't help

Comment 6 bugzilla 2011-03-07 22:58:00 UTC
r8168-8.021.00 (r8168.ko) driver from realtek seems to fix the issue, but it essentially slows the link down to 100mbps speeds, so it's not a real fix (if you slow down the regular r8169 driver to 100mbps using ethtool you get the same result).
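For reference, slowing the link down with ethtool could look like the sketch below. The function name `limit_link_100` and the `ETHTOOL` override variable are illustrative assumptions; the `-s ... speed 100 duplex full autoneg off` arguments are standard ethtool settings. With autonegotiation off, the link partner may need matching settings to avoid a duplex mismatch.

```shell
# ETHTOOL may be overridden (for example with a stub for a dry run).
ETHTOOL=${ETHTOOL:-ethtool}

# limit_link_100 IFACE
# Force IFACE to 100 Mbps full duplex with autonegotiation off.
limit_link_100() {
    iface=$1
    "$ETHTOOL" -s "$iface" speed 100 duplex full autoneg off
}

# Example (as root; eth1 is the interface name from the original report):
#   limit_link_100 eth1
#   ethtool eth1 | grep -i speed    # check the resulting link speed
```

The setting does not survive a reboot or a driver reload, so it would have to be reapplied each time.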

Comment 7 Stanislaw Gruszka 2011-03-08 07:13:16 UTC
Did you try whether this is fixed in the upstream/rawhide kernel? Can we get logs from the crash somehow (a photo of the virtual console, or kdump)?

Comment 8 bugzilla 2011-03-08 08:26:14 UTC
i get nothing in the logs or on a vty i'm afraid.

how do i install an f15 (rawhide) kernel?

i noticed there's a kernel-2.6.35.11-85.fc14 in koji but the changelog doesn't mention any r8169 fixes.

this is the onboard nic on a "Gigabyte GA-P55M-UD2 iP55" motherboard btw.

Comment 9 bugzilla 2011-03-08 14:18:20 UTC
i don't seem to be seeing the problem with the 2.6.38-0.rc7.git2.3.fc16.x86_64 kernel today, i'll report back if i do.

i've used this motherboard with f10, f12, f13 and now kind of f16, and the only problem seems to be f14.

Comment 10 bugzilla 2011-03-09 13:32:38 UTC
the rawhide kernel didn't fix the issue, i had a crash overnight with quite low traffic (vpn) and today with high traffic (rsync).

Comment 11 Stanislaw Gruszka 2011-03-09 14:07:49 UTC
Since we do not have any logs, it's hard to tell whether this is the same issue or not. Let's assume it is, so we have not resolved the f13 -> f14 regression, which should be a 2.6.34 -> 2.6.35 kernel regression. Can you confirm that 2.6.34 works? And can you also test some older 2.6.35 to see whether the bug was added while backporting fixes to that version?

For example, you can install these kernels (with "rpm -ivh --force --nodeps"):
http://koji.fedoraproject.org/koji/buildinfo?buildID=182791
http://koji.fedoraproject.org/koji/buildinfo?buildID=188323

There are not many r8169 commits between 2.6.34 and 2.6.35, and all of them look OK to me. I wonder if the problem is really in the driver; maybe it is in the net stack or another part of the kernel. It's not possible to tell without logs. I think bisection will be needed to find the offending commit.

Comment 12 Stanislaw Gruszka 2011-03-09 15:22:14 UTC
(In reply to comment #11)
> In example you can install these kernels (by "rpm -ivh --force --nodeps")
> http://koji.fedoraproject.org/koji/buildinfo?buildID=182791
> http://koji.fedoraproject.org/koji/buildinfo?buildID=188323

I missed that these kernels have been deleted. I think the best option would be to install from a git tree; could you do this? Some description is here:
https://bugzilla.redhat.com/show_bug.cgi?id=640612#c37
Kernels would be:
git://git.kernel.org/pub/scm/linux/kernel/git/longterm/linux-2.6.35.y.git
git://git.kernel.org/pub/scm/linux/kernel/git/longterm/linux-2.6.34.y.git

Comment 13 Stanislaw Gruszka 2011-03-09 15:24:08 UTC
To check an older 2.6.35 release, you will need to switch to the proper tag, e.g.: git checkout -b b2.6.35.1 v2.6.35.1
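The bisection mentioned in comment 11 could be started roughly as follows. This is a sketch under assumptions: the helper names `run` and `start_bisect` and the `DRY_RUN` variable are hypothetical, and it presumes that v2.6.34 tests good and v2.6.35 tests bad, which nobody in this report has confirmed yet. The `git bisect` subcommands themselves are standard git.

```shell
# run CMD...: execute CMD, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# start_bisect GOOD BAD: begin a bisection between two tags.
start_bisect() {
    good=$1
    bad=$2
    run git bisect start &&
    run git bisect bad "$bad" &&
    run git bisect good "$good"
}

# Example, inside a kernel git checkout:
#   start_bisect v2.6.34 v2.6.35
#   ... build, boot and test the commit git checks out, then mark it:
#   git bisect good    # or: git bisect bad
#   ... repeat until git names the first bad commit, then:
#   git bisect reset
```

Each step requires a kernel build and a reproduction attempt, so a reliable way to trigger the crash matters more than the bisect commands themselves.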

Comment 14 bugzilla 2011-03-09 15:34:53 UTC
i've just had a crash with almost no network traffic, so i'm starting to think it's nvidia/psu/ram or something other than the network controller (although it's a bit of a coincidence that i've only had it since upgrading to f14!)

i've ordered a via velocity pcie nic to see if we can eliminate that.

i can't really experiment too much with this machine as i need it for work so can't install git/koji kernels (rawhide are at least semi-supported!)

Comment 15 Francois Romieu 2011-03-10 11:01:38 UTC
bugzilla.uk 2011-03-08 03:26:14 EST
[...]
> this is the onboard nic on a "Gigabyte GA-P55M-UD2 iP55" motherboard btw.

This could be similar to:
http://marc.info/?l=linux-kernel&m=129829058425946&w=2
(Gigabyte P55-USB3, r8169 XID = 0c100000, same XID as yours?).

It should not hurt to include f60ac8e7ab7cbb413a0131d5665b053f9f386526 ("prevent
RxFIFO induced loops in the irq handler.") as it will avoid some - supposedly
soft - infinite loops.

-- 
Ueimor

Comment 16 bugzilla 2011-03-10 11:21:10 UTC
it could be the same issue as when i've seen the crashes i'm getting 10MB/sec kind of transfer speeds instead of the usual 60MB/sec

i just tried a fedora 15 alpha livecd but the nic wouldn't even come up! it reported a link and 10mbps speed, but there was no link light and no traffic would flow, "ethtool -s" did nothing either.

i'm going to try to go back to the kernel from the install dvd - 2.6.35.6-45.fc14 and see how that goes.

Comment 17 Stanislaw Gruszka 2011-03-10 15:04:26 UTC
Thanks Francois. I applied it and two other patches that were needed to compile it on 2.6.35. The kernel build is here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=2900722

bugzilla.uk please test.

Comment 18 bugzilla 2011-03-11 01:06:35 UTC
i've reverted back to fedora 13 now i'm afraid.

whilst using clonezilla to backup my f14 install and restore my f13 install i transferred upwards of 150gb over nfs4 at gigabit speeds without crashing.

also passed memtest86+

Comment 19 Stanislaw Gruszka 2011-03-11 08:41:27 UTC
You can still install the test kernel with "rpm -ivh --force --nodeps"

Comment 20 bugzilla 2011-03-11 09:09:24 UTC
but surely as f13 appears to be rock solid (still working overnight) that hints that it's not the kernel (and the rawhide kernel didn't fix it either) ?

Comment 21 Stanislaw Gruszka 2011-03-11 10:19:17 UTC
I'm confused: did you test the f13 kernel with f14 user space, and did it hang there?
> (and the rawhide kernel didn't fix it either)?
Indeed, but perhaps the rawhide hang was a different issue, so I would like to test the patch Francois pointed to with an older kernel.

Comment 22 Stanislaw Gruszka 2011-03-14 14:53:22 UTC
J. Bruce, can you test kernel from comment 17 (and also pcie_aspm=off)?

Comment 23 bugzilla 2011-03-14 16:30:30 UTC
i think my crashes may be due to a failing western digital hard disk as i've just had a crash on fedora 13 with a lot of disk activity and minimal network traffic.

Comment 24 bugzilla 2011-03-19 22:07:33 UTC
it's not this disk as i've replaced it and i'm still getting hangs on fedora 13 with 2.6.34.8-68.fc13.x86_64

the only pattern seems to be high disk activity (and some network).

could it be the disk controller - although why it has worked perfectly for 2 years or so and now isn't - i don't understand:

03:00.0 SATA controller: JMicron Technology Corp. JMB362/JMB363 Serial ATA Controller (rev 02) (prog-if 01 [AHCI 1.0])

i'm thinking its hardware, although memtest86+ passed and temperatures are fine. it could be psu or sata cables i guess. haven't received the new nic yet. i'll try re-seating the ram.

i've setup syslog to log all kern.* to /var/log/kernel so hopefully i'll get some logging.

i don't really want to just throw the system away, but i can hardly even use it for parts as i'm not sure which (if any) are defective.

Comment 25 J. Bruce Fields 2011-06-20 01:10:37 UTC
Apologies for the long silence.  I re-tested today with 2.6.35.13-92, and still see messages in the logs like:

Jun 19 20:04:39 phile kernel: [ 6610.613371] r8169 0000:02:00.0: eth1: link up
Jun 19 20:04:39 phile kernel: [ 6610.613383] NOHZ: local_softirq_pending 08
Jun 19 20:04:39 phile kernel: [ 6610.618318] r8169 0000:02:00.0: eth1: link up
Jun 19 20:31:15 phile kernel: [ 8204.552008] r8169 0000:02:00.0: eth1: link up
Jun 19 20:31:15 phile kernel: [ 8204.599186] r8169 0000:02:00.0: eth1: link up
Jun 19 20:31:15 phile kernel: [ 8204.599198] NOHZ: local_softirq_pending 08
Jun 19 20:31:55 phile kernel: [ 8244.649373] r8169 0000:02:00.0: eth1: link up

but am unable to reproduce the spontaneous reboots after a little over an hour of testing.

The same results were seen after rebooting with pcie_aspm=off.  (Though I only tested for a few minutes with pcie_aspm=off.)
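To compare runs (for instance with and without pcie_aspm=off), the two message types can be counted from the log. A minimal sketch, assuming a helper named `count_r8169_events` (a hypothetical name) that reads log lines on standard input; the log file path in the example is an assumption and depends on the syslog configuration.

```shell
# count_r8169_events: read kernel log lines on stdin and print how many
# r8169 "link up" and NOHZ "local_softirq_pending" messages they contain.
count_r8169_events() {
    awk '
        /r8169 .*link up/              { up++ }
        /NOHZ: local_softirq_pending/  { nohz++ }
        END { printf "link_up=%d nohz=%d\n", up, nohz }
    '
}

# Example:
#   count_r8169_events < /var/log/messages
#   dmesg | count_r8169_events
```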

Comment 26 J. Bruce Fields 2011-06-20 02:31:23 UTC
Update: I have since been able to reproduce the spontaneous reboots on 2.6.35.13-92, both with pcie_aspm on and off.

Comment 27 bugzilla 2011-06-20 07:31:43 UTC
i've replaced my psu, motherboard, ram and graphics card and am still getting lockups and the NOHZ/link up messages and system freezes (not reboots).

fedora 15 kernel 2.6.38.8-32.fc15.x86_64

the new motherboard, like all consumer motherboards these days also has a realtek 8111e nic:

04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)

so it would seem to be at the driver level, not hardware.

going to try a pcie intel e1000 nic next before switching to debian6.

Comment 28 Stanislaw Gruszka 2011-06-20 07:54:45 UTC
It would be great if we could get a call trace of the hang.  Perhaps installing kernel-debug will produce one.

the-jedi.co.uk, you can confirm whether the Realtek driver is responsible for the hangs by blacklisting the r8169 module.
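Blacklisting is usually done with a file under /etc/modprobe.d. The sketch below assumes a helper named `blacklist_module` and a file name of the form `blacklist-MODULE.conf`; both are illustrative choices, and only the `blacklist r8169` line itself is standard modprobe configuration. A blacklist entry only stops automatic loading, so the module can still be loaded by hand, and with it blacklisted the onboard NIC will have no driver.

```shell
# blacklist_module MODULE [DIR]
# Write a modprobe blacklist entry for MODULE into DIR
# (default /etc/modprobe.d, which needs root).
blacklist_module() {
    module=$1
    dir=${2:-/etc/modprobe.d}
    printf 'blacklist %s\n' "$module" > "$dir/blacklist-$module.conf"
}

# Example (as root), followed by a reboot:
#   blacklist_module r8169
#   lsmod | grep r8169    # after the reboot: no output means it is not loaded
```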

Comment 30 bugzilla 2011-06-20 11:50:31 UTC
sorry, i've switched to debian wheezy now, i'll report back if that stops the problem.

i noticed on debian there's a specific 8111e firmware package, does fedora need such a thing perhaps?

Comment 31 Itamar Reis Peixoto 2011-06-20 12:00:57 UTC
Switching to debian doesn't help to fix the problem. Please help by providing the requested information.

Comment 32 bugzilla 2011-06-20 12:10:31 UTC
i'm not going to reinstall fedora unless debian crashes too.

if debian doesn't crash then i'd say it's something added to the fedora-specific kernel/build options after fedora 13 (and still in 14/15).

sorry but this is my work machine and i need it stable, i can't experiment anymore.

Comment 34 Tamas Vincze 2011-06-20 13:09:02 UTC
The only fix that worked for me on Fedora 14 was to limit the link speed to 100 Mbps.

Comment 35 J. Bruce Fields 2011-06-20 13:10:27 UTC
Upstream bz, for what it's worth: https://bugzilla.kernel.org/show_bug.cgi?id=32962

Comment 36 Fedora End Of Life 2012-08-16 18:19:30 UTC
This message is a notice that Fedora 14 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 14. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained.  At this time, all open bugs with a Fedora 'version'
of '14' have been closed as WONTFIX.

(Please note: Our normal process is to give advanced warning of this 
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen 
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we were unable to fix it before Fedora 14 reached end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" (top right of this page) and open it against that 
version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping