On the problem machine, HOST, I run
and on another machine, run
iperf -c HOST
Then the kernel spews several messages per second like
r8169 0000:02:00.0: eth1: link up
sometimes interspersed with
NOHZ: local_softirq_pending 08
If I do this enough times, eventually HOST will spontaneously reboot.
This is on 2.6.35-48.fc14.x86_64.
lspci reports the interface as "02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)"
I put in another card (a D-link PCI gigabit card) and it worked fine.
Another machine with similar hardware (including the same Realtek controller) also works fine. The other machine is running F13, whereas the problematic machine is running F14.
I also booted my other machine to a F14 live disk and found it crashed (whereas it doesn't with F13).
So this is likely a regression.
Can you log in on a virtual console (or use a serial console), reproduce the problem, and provide the call trace / dmesg output?
Also try the current 188.8.131.52 kernel from koji; it includes some r8169 patches that may help: http://koji.fedoraproject.org/koji/buildinfo?buildID=210654
Does the pcie_aspm=off boot option help?
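Before drawing conclusions from a pcie_aspm=off test, it may be worth confirming that the option actually reached the running kernel. A minimal sketch that checks /proc/cmdline; the helper name has_boot_param and its optional file argument are hypothetical, added only for illustration:

```shell
#!/bin/sh
# Return 0 if the given parameter appears as a whole word on the kernel
# command line. The second argument (a file to read instead of
# /proc/cmdline) is an illustrative convenience, not a standard interface.
has_boot_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Pad with spaces so only complete, space-separated words match.
    case " $(cat "$cmdline_file") " in
        *" $param "*) return 0 ;;
        *)            return 1 ;;
    esac
}
```

Usage would be something like `has_boot_param pcie_aspm=off && echo "ASPM disabled on the command line"`.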
i've got the same problem - hard crashes (not reboots) when doing large transfers over nfs/rsync at gigabit speeds, worked fine on f13
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
i've seen the same problem before with intel e1000 cards, but never realtek.
pcie_aspm=off doesn't help
r8168-8.021.00 (r8168.ko) driver from realtek seems to fix the issue, but it essentially slows the link down to 100mbps speeds, so it's not a real fix (if you slow down the regular r8169 driver to 100mbps using ethtool you get the same result).
Did you try whether this is fixed in the upstream/rawhide kernel? Can we get logs from the crash somehow (a photo of the virtual console, or kdump)?
i get nothing in the logs or on a vty i'm afraid.
how do i install an f15 (rawhide) kernel?
i noticed there's a kernel-184.108.40.206-85.fc14 in koji but the changelog doesn't mention any r8169 fixes.
this is the onboard nic on a "Gigabyte GA-P55M-UD2 iP55" motherboard btw.
i don't seem to be seeing the problem with the 2.6.38-0.rc7.git2.3.fc16.x86_64 kernel today, i'll report back if i do.
i've used this motherboard with f10, f12, f13 and now kind of f16, and the only problem seems to be f14.
the rawhide kernel didn't fix the issue, i had a crash overnight with quite low traffic (vpn) and today with high traffic (rsync).
Since we do not have any logs, it's hard to tell whether this is the same issue or not. Let's assume it is, so we have not resolved the f13 -> f14 regression, which should be a 2.6.34 -> 2.6.35 kernel regression. Can you confirm that 2.6.34 works? Can you also test an older 2.6.35 to check whether the bug was introduced while backporting fixes to that version?
For example, you can install these kernels (with "rpm -ivh --force --nodeps")
There are not many r8169 commits between 2.6.34 and 2.6.35, and all of them look OK to me. I wonder whether the problem is really in the driver; maybe it is in the net stack or another part of the kernel. It's not possible to tell without logs. I think bisection will be needed to find the cause.
(In reply to comment #11)
> In example you can install these kernels (by "rpm -ivh --force --nodeps")
I missed that these kernels have been deleted. I think the best option would be to install from a git tree; could you do this? Some description is here:
Kernels would be:
To check an older 2.6.35 you will need to switch to the proper tag, e.g.: git checkout -b b220.127.116.11 v18.104.22.168.
i've just had a crash with almost no network traffic, so starting to think its nvidia/psu/ram or something not network controller (although a bit of a coincidence i've only had it since upgrading to f14!)
i've ordered a via velocity pcie nic to see if we can eliminate that.
i can't really experiment too much with this machine as i need it for work so can't install git/koji kernels (rawhide are at least semi-supported!)
firstname.lastname@example.org 2011-03-08 03:26:14 EST
> this is the onboard nic on a "Gigabyte GA-P55M-UD2 iP55" motherboard btw.
This could be similar to:
(Gigabyte P55-USB3, r8169 XID = 0c100000, same XID as yours?).
It should not hurt to include f60ac8e7ab7cbb413a0131d5665b053f9f386526 ("prevent
RxFIFO induced loops in the irq handler.") as it will avoid some - supposedly
soft - infinite loops.
it could be the same issue as when i've seen the crashes i'm getting 10MB/sec kind of transfer speeds instead of the usual 60MB/sec
i just tried a fedora 15 alpha livecd but the nic wouldn't even come up! it reported a link and 10mbps speed, but there was no link light and no traffic would flow, "ethtool -s" did nothing either.
i'm going to try to go back to the kernel from the install dvd - 22.214.171.124-45.fc14 and see how that goes.
Thanks Francois. I applied it and two other patches that were needed to compile it on 2.6.35. Kernel build is here:
email@example.com please test.
i've reverted back to fedora 13 now i'm afraid.
whilst using clonezilla to backup my f14 install and restore my f13 install i transferred upwards of 150gb over nfs4 at gigabit speeds without crashing.
also passed memtest86+
You can still install the test kernel with "rpm -ivh --force --nodeps"
but surely as f13 appears to be rock solid (still working overnight) that hints that its not the kernel (and the rawhide kernel didn't fix it either) ?
I'm confused: did you test the f13 kernel with f14 user space, and it hangs there?
> (and the rawhide kernel didn't fix it either)?
Indeed, but perhaps the rawhide hang was a different issue, so I would like to test the patch Francois pointed to with the older kernel.
J. Bruce, can you test the kernel from comment 17 (and also pcie_aspm=off)?
i think my crashes may be due to a failing western digital hard disk as i've just had a crash on fedora 13 with a lot of disk activity and minimal network traffic.
its not this disk as i've replaced it and still getting hangs on fedora 13 with 126.96.36.199-68.fc13.x86_64
the only pattern seems to be high disk activity (and some network).
could it be the disk controller - although why it has worked perfectly for 2 years or so and now isn't - i don't understand:
03:00.0 SATA controller: JMicron Technology Corp. JMB362/JMB363 Serial ATA Controller (rev 02) (prog-if 01 [AHCI 1.0])
i'm thinking its hardware, although memtest86+ passed and temperatures are fine. it could be psu or sata cables i guess. haven't received the new nic yet. i'll try re-seating the ram.
i've setup syslog to log all kern.* to /var/log/kernel so hopefully i'll get some logging.
i don't really want to just throw the system away, but i can hardly even use it for parts as i'm not sure which (if any) are defective.
Apologies for the long silence. I re-tested today with 188.8.131.52-92, and still see messages in the logs like:
Jun 19 20:04:39 phile kernel: [ 6610.613371] r8169 0000:02:00.0: eth1: link up
Jun 19 20:04:39 phile kernel: [ 6610.613383] NOHZ: local_softirq_pending 08
Jun 19 20:04:39 phile kernel: [ 6610.618318] r8169 0000:02:00.0: eth1: link up
Jun 19 20:31:15 phile kernel: [ 8204.552008] r8169 0000:02:00.0: eth1: link up
Jun 19 20:31:15 phile kernel: [ 8204.599186] r8169 0000:02:00.0: eth1: link up
Jun 19 20:31:15 phile kernel: [ 8204.599198] NOHZ: local_softirq_pending 08
Jun 19 20:31:55 phile kernel: [ 8244.649373] r8169 0000:02:00.0: eth1: link up
but am unable to reproduce the spontaneous reboots after a little over an hour of testing.
The same results were seen after rebooting with pcie_aspm=off. (Though I only tested for a few minutes with pcie_aspm=off.)
Update: I have since been able to produce spontaneous reboots on 184.108.40.206-92, both with pcie_aspm on and off.
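For anyone comparing runs (for example with and without pcie_aspm=off), counting the two message types in the kernel log may be easier than eyeballing it. A rough sketch, assuming the messages look like the excerpt above; the function name count_link_events is hypothetical:

```shell
#!/bin/sh
# Read a kernel log on stdin and count r8169 "link up" messages and
# NOHZ softirq warnings. The patterns are based only on the log lines
# quoted in this report; other drivers or kernels may word them differently.
count_link_events() {
    awk '
        /r8169 .*: link up/            { up++ }
        /NOHZ: local_softirq_pending/  { nohz++ }
        END { printf "link_up=%d nohz=%d\n", up, nohz }
    '
}
```

It could be run as `count_link_events < /var/log/messages`, or on the output of `dmesg`.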
i've replaced my psu, motherboard, ram and graphics card and am still getting lockups and the NOHZ/link up messages and system freezes (not reboots).
fedora 15 kernel 220.127.116.11-32.fc15.x86_64
the new motherboard, like all consumer motherboards these days also has a realtek 8111e nic:
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)
so it would seem to be at the driver level, not hardware.
going to try a pcie intel e1000 nic next before switching to debian6.
It would be great if we could get a call trace of the hang. Perhaps installing kernel-debug will give us one.
the-jedi.co.uk, you can confirm whether the realtek driver is responsible for the hangs by blacklisting the r8169 module.
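One way to blacklist the module is a one-line file under /etc/modprobe.d. A minimal sketch; the helper name blacklist_module and the file name blacklist-r8169.conf are assumptions made here, and if the driver is also loaded from the initramfs a further step may be needed:

```shell
#!/bin/sh
# Write a modprobe "blacklist" entry for a module so it is no longer
# loaded automatically. The second argument (target directory) exists
# only so the helper can be tried outside /etc; run as root for real use.
blacklist_module() {
    module="$1"
    confdir="${2:-/etc/modprobe.d}"
    printf 'blacklist %s\n' "$module" > "$confdir/blacklist-$module.conf"
}
```

For example, `blacklist_module r8169` followed by a reboot; removing the file undoes the change.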
sorry, i've switched to debian wheezy now, i'll report back if that stops the problem.
i noticed on debian there's a specific 8111e firmware package, does fedora need such a thing perhaps?
Switching to Debian doesn't help to fix the problem. Please help by giving the required information.
i'm not going to reinstall fedora unless debian crashes too.
if debian doesn't crash then i'd say its something added to the fedora-specific kernel/build options after fedora 13 (and still in 14/15).
sorry but this is my work machine and i need it stable, i can't experiment anymore.
The only fix that worked for me on Fedora 14 was to limit the link speed to 100 Mbps.
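For reference, a sketch of how the link can be forced down to 100 Mbps with ethtool. The interface name is an assumption (eth1 is taken from the logs earlier in this report), the wrapper limit_link_speed and its DRY_RUN switch are hypothetical, and the setting does not persist across reboots:

```shell
#!/bin/sh
# Force an interface to 100 Mbps full duplex with autonegotiation off.
# With DRY_RUN=1 the command is printed instead of executed, which is an
# illustrative convenience so it can be inspected without root.
limit_link_speed() {
    iface="$1"
    set -- ethtool -s "$iface" speed 100 duplex full autoneg off
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Running `limit_link_speed eth1` as root applies the setting; `ethtool eth1` should then report the new speed. This is a workaround only, since it gives up gigabit throughput.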
Upstream bz, for what it's worth: https://bugzilla.kernel.org/show_bug.cgi?id=32962
This message is a notice that Fedora 14 is now at end of life. Fedora
has stopped maintaining and issuing updates for Fedora 14. It is
Fedora's policy to close all bug reports from releases that are no
longer maintained. At this time, all open bugs with a Fedora 'version'
of '14' have been closed as WONTFIX.
(Please note: Our normal process is to give advanced warning of this
occurring, but we forgot to do that. A thousand apologies.)
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen
this bug and simply change the 'version' to a later Fedora version.
Bug Reporter: Thank you for reporting this issue and we are sorry that
we were unable to fix it before Fedora 14 reached end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to click on
"Clone This Bug" (top right of this page) and open it against that
version of Fedora.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
The process we are following is described here: