Bug 654147
| Summary: | crash on lots of traffic to rtl8111/8168b interface | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | J. Bruce Fields <bfields> |
| Component: | kernel | Assignee: | Ivan Vecera <ivecera> |
| Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Priority: | low |
| Version: | 14 | CC: | bugzilla, dougsland, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, romieu, sgruszka, tom |
| Hardware: | x86_64 | OS: | Unspecified |
| Doc Type: | Bug Fix | Last Closed: | 2012-08-16 18:19:27 UTC |
Description
J. Bruce Fields
2010-11-17 01:20:00 UTC
I also booted my other machine to a F14 live disk and found it crashed (whereas it doesn't with F13). So this is likely a regression.

Can you log into a virtual console (or use a serial console), reproduce the problem and provide a call trace / dmesg? Also try the current 2.6.35.10 kernel from koji; it includes some r8169 patches, which may help: http://koji.fedoraproject.org/koji/buildinfo?buildID=210654

Does the pcie_aspm=off boot option help?

i've got the same problem - hard crashes (not reboots) when doing large transfers over nfs/rsync at gigabit speeds, worked fine on f13
#lspci
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)
#uname
2.6.35.11-83.fc14.x86_64
i've seen the same problem before with intel e1000 cards, but never realtek.

pcie_aspm=off doesn't help

r8168-8.021.00 (r8168.ko) driver from realtek seems to fix the issue, but it essentially slows the link down to 100mbps speeds, so it's not a real fix (if you slow down the regular r8169 driver to 100mbps using ethtool you get the same result).

Did you try whether this is fixed in the upstream/rawhide kernel? Can we get logs from the crash somehow (virtual console photo or kdump)?

i get nothing in the logs or on a vty i'm afraid. how do i install an f15 (rawhide) kernel? i noticed there's a kernel-2.6.35.11-85.fc14 in koji but the changelog doesn't mention any r8169 fixes. this is the onboard nic on a "Gigabyte GA-P55M-UD2 iP55" motherboard btw.

i don't seem to be seeing the problem with the 2.6.38-0.rc7.git2.3.fc16.x86_64 kernel today, i'll report back if i do. i've used this motherboard with f10, f12, f13 and now kind of f16, and the only problem seems to be f14.

the rawhide kernel didn't fix the issue, i had a crash overnight with quite low traffic (vpn) and today with high traffic (rsync).

Since we do not have any logs, it's hard to tell if this is the same issue or not. Let's assume it is, so we have not resolved the f13 -> f14 regression.
That should be a 2.6.34 -> 2.6.35 kernel regression. Can you confirm that 2.6.34 works? And also test some older 2.6.35 to see whether the bug was added while backporting fixes to that version? For example, you can install these kernels (with "rpm -ivh --force --nodeps"):
http://koji.fedoraproject.org/koji/buildinfo?buildID=182791
http://koji.fedoraproject.org/koji/buildinfo?buildID=188323
There are not many r8169 commits between 2.6.34 and 2.6.35 and all of them look ok to me. I wonder if the problem is really in the driver; maybe it is in the net stack or another part of the kernel. It's not possible to tell without logs. I think bisection will be needed to find the fix.

(In reply to comment #11)
> In example you can install these kernels (by "rpm -ivh --force --nodeps")
> http://koji.fedoraproject.org/koji/buildinfo?buildID=182791
> http://koji.fedoraproject.org/koji/buildinfo?buildID=188323
I missed that these kernels are deleted. I think the best option would be to install from the git tree; could you do this? Some description is here: https://bugzilla.redhat.com/show_bug.cgi?id=640612#c37
Kernels would be:
git://git.kernel.org/pub/scm/linux/kernel/git/longterm/linux-2.6.35.y.git
git://git.kernel.org/pub/scm/linux/kernel/git/longterm/linux-2.6.34.y.git
To check an older 2.6.35 you will need to switch to the proper tag, e.g.: git checkout -b b2.6.35.1 v2.6.35.1.

i've just had a crash with almost no network traffic, so starting to think its nvidia/psu/ram or something not network controller (although a bit of a coincidence i've only had it since upgrading to f14!) i've ordered a via velocity pcie nic to see if we can eliminate that. i can't really experiment too much with this machine as i need it for work so can't install git/koji kernels (rawhide are at least semi-supported!)

bugzilla.uk 2011-03-08 03:26:14 EST
[...]
> this is the onboard nic on a "Gigabyte GA-P55M-UD2 iP55" motherboard btw.
This could be similar to: http://marc.info/?l=linux-kernel&m=129829058425946&w=2 (Gigabyte P55-USB3, r8169 XID = 0c100000, same XID as yours?). It should not hurt to include f60ac8e7ab7cbb413a0131d5665b053f9f386526 ("prevent RxFIFO induced loops in the irq handler.") as it will avoid some - supposedly soft - infinite loops.
-- Ueimor

it could be the same issue as when i've seen the crashes i'm getting 10MB/sec kind of transfer speeds instead of the usual 60MB/sec

i just tried a fedora 15 alpha livecd but the nic wouldn't even come up! it reported a link and 10mbps speed, but there was no link light and no traffic would flow, "ethtool -s" did nothing either. i'm going to try to go back to the kernel from the install dvd - 2.6.35.6-45.fc14 and see how that goes.

Thanks Francois. I applied it and two other patches that were needed to compile it on 2.6.35. Kernel build is here: http://koji.fedoraproject.org/koji/taskinfo?taskID=2900722
bugzilla.uk please test.

i've reverted back to fedora 13 now i'm afraid. whilst using clonezilla to backup my f14 install and restore my f13 install i transferred upwards of 150gb over nfs4 at gigabit speeds without crashing. also passed memtest86+

You can still install the test kernel with "rpm -ivh --force --nodeps"

but surely as f13 appears to be rock solid (still working overnight) that hints that its not the kernel (and the rawhide kernel didn't fix it either) ?

I'm confused, you tested the f13 kernel with f14 user space and it hangs there?
> (and the rawhide kernel didn't fix it either)?
Indeed, but perhaps the rawhide hang was a different issue, so I would like to test the patch Francois pointed to with an older kernel.
J. Bruce, can you test the kernel from comment 17 (and also pcie_aspm=off)?

i think my crashes may be due to a failing western digital hard disk as i've just had a crash on fedora 13 with a lot of disk activity and minimal network traffic.

its not this disk as i've replaced it and still getting hangs on fedora 13 with 2.6.34.8-68.fc13.x86_64 the only pattern seems to be high disk activity (and some network). could it be the disk controller - although why it has worked perfectly for 2 years or so and now isn't - i don't understand:
03:00.0 SATA controller: JMicron Technology Corp. JMB362/JMB363 Serial ATA Controller (rev 02) (prog-if 01 [AHCI 1.0])

i'm thinking its hardware, although memtest86+ passed and temperatures are fine. it could be psu or sata cables i guess. haven't received the new nic yet. i'll try re-seating the ram.

i've setup syslog to log all kern.* to /var/log/kernel so hopefully i'll get some logging. i don't really want to just throw the system away, but i can hardly even use it for parts as i'm not sure which (if any) are defective.

Apologies for the long silence. I re-tested today with 2.6.35.13-92, and still see messages in the logs like:
Jun 19 20:04:39 phile kernel: [ 6610.613371] r8169 0000:02:00.0: eth1: link up
Jun 19 20:04:39 phile kernel: [ 6610.613383] NOHZ: local_softirq_pending 08
Jun 19 20:04:39 phile kernel: [ 6610.618318] r8169 0000:02:00.0: eth1: link up
Jun 19 20:31:15 phile kernel: [ 8204.552008] r8169 0000:02:00.0: eth1: link up
Jun 19 20:31:15 phile kernel: [ 8204.599186] r8169 0000:02:00.0: eth1: link up
Jun 19 20:31:15 phile kernel: [ 8204.599198] NOHZ: local_softirq_pending 08
Jun 19 20:31:55 phile kernel: [ 8244.649373] r8169 0000:02:00.0: eth1: link up
but am unable to reproduce the spontaneous reboots after a little over an hour of testing. The same results were seen after rebooting with pcie_aspm=off. (Though I only tested for a few minutes with pcie_aspm=off.)
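For anyone collecting similar evidence, here is a minimal Python sketch for tallying the two message types in the log excerpt above (repeated r8169 "link up" lines and "NOHZ: local_softirq_pending" warnings). It only assumes the syslog line layout shown in that excerpt; the function name `summarize` and the regular expressions are illustrative and do not come from any tool mentioned in this bug.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt in this bug:
#   "Jun 19 20:04:39 phile kernel: [ 6610.613371] r8169 0000:02:00.0: eth1: link up"
LINE_RE = re.compile(r"kernel: \[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
LINK_UP_RE = re.compile(r"^r8169 \S+ (?P<iface>\w+): link up$")


def summarize(lines):
    """Return (link-up count per interface, number of NOHZ softirq warnings)."""
    link_up = Counter()
    nohz = 0
    for line in lines:
        m = LINE_RE.search(line.strip())
        if not m:
            continue  # not a kernel line in the assumed layout
        msg = m.group("msg")
        lm = LINK_UP_RE.match(msg)
        if lm:
            link_up[lm.group("iface")] += 1
        elif msg.startswith("NOHZ: local_softirq_pending"):
            nohz += 1
    return dict(link_up), nohz
```

It could be fed an open log file, for example `summarize(open(path))`, where `path` is wherever kernel messages end up on the machine in question (one commenter above routes kern.* to /var/log/kernel). A high link-up count on a cable that was never unplugged would at least confirm the flapping, though it says nothing about the cause.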
Update: I have since been able to produce spontaneous reboots on 2.6.35.13-92, both with pcie_aspm on and off.

i've replaced my psu, motherboard, ram and graphics card and am still getting lockups and the NOHZ/link up messages and system freezes (not reboots). fedora 15 kernel 2.6.38.8-32.fc15.x86_64 the new motherboard, like all consumer motherboards these days also has a realtek 8111e nic:
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)
so it would seem to be at the driver level, not hardware. going to try a pcie intel e1000 nic next before switching to debian6.

It would be great if we could get a call trace of the hang. Perhaps installing kernel-debug will give one.

the-jedi.co.uk, you can confirm that the realtek driver is responsible for the hangs by blacklisting the r8169 module.

sorry, i've switched to debian wheezy now, i'll report back if that stops the problem. i noticed on debian there's a specific 8111e firmware package, does fedora need such a thing perhaps?

Switching to Debian doesn't help to fix the problem. Please help by giving the required information.

i'm not going to reinstall fedora unless debian crashes too. if debian doesn't crash then i'd say its something added to the fedora-specific kernel/build options after fedora 13 (and still in 14/15). sorry but this is my work machine and i need it stable, i can't experiment anymore.

The only fix that worked for me on Fedora 14 was to limit the link speed to 100 Mbps.

Upstream bz, for what it's worth: https://bugzilla.kernel.org/show_bug.cgi?id=32962

This message is a notice that Fedora 14 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 14. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At this time, all open bugs with a Fedora 'version' of '14' have been closed as WONTFIX.
(Please note: Our normal process is to give advanced warning of this occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, feel free to reopen this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that we were unable to fix it before Fedora 14 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" (top right of this page) and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping