Bug 642861 - Kernel freezes under moderate network load, driver r8169
Summary: Kernel freezes under moderate network load, driver r8169
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 14
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Ivan Vecera
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-10-13 23:35 UTC by Mike Khusid
Modified: 2013-04-02 23:43 UTC (History)
CC List: 17 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-08-16 18:19:19 UTC
Type: ---
Embargoed:


Attachments
Image from kernel panic. (266.38 KB, image/jpeg)
2011-02-19 11:47 UTC, Tomi Leppikangas
output from "lspci -vvv -s 04:00.0" with param "pcie_aspm=off" (2.80 KB, text/plain)
2011-02-19 11:52 UTC, Tomi Leppikangas

Description Mike Khusid 2010-10-13 23:35:23 UTC
Description of problem:
Kernel 2.6.35.6-39.fc14.i686 freezes under moderate network load (such as yum update).  The computer stops being pingable first, shortly thereafter console freezes.

The network device is an RTL8102e PCI Express Fast Ethernet controller (rev 02), using the r8169 driver.

Version-Release number of selected component (if applicable):
kernel version 2.6.35.6-39.fc14.i686
r8169 driver version is 2.3LK-NAPI

How reproducible:
every time

Steps to Reproduce:
1. Start computer
2. run yum update
3. ping computer from another computer
  
Actual results:
kernel freeze

Expected results:
no freeze, successful yum update

Additional info:
Base hardware: Dell Mini 12 (Inspiron 1210).

Nothing of notice in /var/log/messages.

Comment 1 Mike Khusid 2010-10-13 23:36:01 UTC
Adding cc per http://fedoraproject.org/wiki/KernelBugTriage

Comment 2 Stanislaw Gruszka 2010-11-26 13:07:46 UTC
Can you log in on a virtual console (Ctrl+Alt+F2), try to reproduce the problem, and take a photo when the kernel crashes?

Comment 3 Alex G. 2011-02-02 18:32:55 UTC
Can you test with the newest kernel from updates? I had a very similar problem and it disappeared with newer kernels.

Comment 4 Tomi Leppikangas 2011-02-19 11:40:58 UTC
I started to get similar crashes with recent Fedora kernels. It might have started with 2.6.35.10-74. I also tried a stock 2.6.36.3 kernel, but it crashed as well. Crashes occurred after around 1-3h of light use; I only have a 5MB connection, so network traffic is quite light.

The last kernel I have tested is 2.6.35.11-83.

Now I tried the parameter pcie_aspm=off and have had no crash since.

Usually I don't see any panic message because X freezes, but one time I managed to get it to crash with the console open; I'll attach an image from that. At first I thought it was a CPU/memory problem, but I ran memtest86 for quite a long time without any errors.

mcedecode says this:
CPU 0: Machine Check Exception: 0000000000000004 Bank 0: b200004000000800
TSC bbca6bea68
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
Wed Feb 16 00:15:21 2011
CPU 0 BANK 0 TSC bbca6bea68 (null)
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 4
CPUID Vendor Intel Family 6 Model 15
PROCESSOR 0:6f8 TIME 1297808121 SOCKET 0 APIC 0

Kernel panic: CPU context corrupt
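
For readers who want to check the decode above by hand, here is a minimal Python sketch that extracts the flag bits from the STATUS and MCGSTATUS values in the log. It assumes the architectural bit layout of IA32_MCi_STATUS and IA32_MCG_STATUS as documented in the Intel SDM; the function and table names are made up for illustration and are not part of mcelog.

```python
# Sketch only: flag positions assumed from the Intel SDM layout of
# IA32_MCi_STATUS (bits 63..57) and IA32_MCG_STATUS (bits 2..0).
MCI_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
             59: "MISCV", 58: "ADDRV", 57: "PCC"}
MCG_FLAGS = {2: "MCIP", 1: "EIPV", 0: "RIPV"}


def decode_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name
            for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_mci_status(status):
    """Split a 64-bit bank status word into flags and the two 16-bit codes."""
    return {
        "flags": decode_flags(status, MCI_FLAGS),
        "mca_error_code": status & 0xFFFF,              # bits 15..0
        "model_specific_code": (status >> 16) & 0xFFFF, # bits 31..16
    }


if __name__ == "__main__":
    # Values copied from the log above.
    print(decode_mci_status(0xB200004000000800))
    print(decode_flags(0x4, MCG_FLAGS))
```

With the values from this log, the set flags are VAL, UC, EN and PCC, which lines up with the "Uncorrected error", "Error enabled" and "Processor context corrupt" lines, and MCGSTATUS 4 is the MCIP bit.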


bug#617936, bug#538920 and bug#617936 are quite similar.

Comment 5 Tomi Leppikangas 2011-02-19 11:47:56 UTC
Created attachment 479670 [details]
Image from kernel panic.

Comment 6 Tomi Leppikangas 2011-02-19 11:52:27 UTC
Created attachment 479671 [details]
output from "lspci -vvv -s 04:00.0" with param "pcie_aspm=off"

Comment 7 Stanislaw Gruszka 2011-02-21 07:44:11 UTC
OK, so pcie_aspm=off helps. I thought we had disabled this by default on r8169 ...

Comment 8 Stanislaw Gruszka 2011-02-22 10:55:17 UTC
We disable ASPM in r8169, but only on RHEL6. Not upstream, not in stable, not in Fedora; it seems nobody cared to post the patch. I'm going to send it now.

Comment 9 Tomi Leppikangas 2011-02-22 18:38:11 UTC
I am now pretty sure that my problems were caused by faulty hardware. The CPU or motherboard seems to be broken, so pcie_aspm=off didn't help for me. Sorry about the misleading info.

Comment 10 Mike Khusid 2011-02-22 18:41:36 UTC
Just want to put my 2c that my hardware is working fine with other Linux distros, so the problem seems unique to Fedora.  I haven't tried the newer kernel yet per Comment 3.

Comment 11 Stanislaw Gruszka 2011-02-23 08:35:21 UTC
Try pcie_aspm=off. We enabled ASPM in our kernel; upstream and on other distros, I suppose, ASPM is disabled by default.
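
When testing this, it is worth confirming that the option actually reached the running kernel. A minimal Python sketch, assuming the boot command line is read from /proc/cmdline (the helper names are hypothetical):

```python
def aspm_cmdline_setting(cmdline):
    """Return the value of the last pcie_aspm= option in `cmdline`, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("pcie_aspm="):
            value = token.split("=", 1)[1]
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


# Usage on the affected machine:
#   print(aspm_cmdline_setting(read_cmdline()))
```

A result of "off" means the parameter was passed at boot; None means it is missing from the boot loader entry.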

Comment 12 bugzilla 2011-03-07 23:04:29 UTC
pcie_aspm=off doesn't help with 2.6.35.11-83.fc14.x86_64

04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)

card worked fine in f10/12/13

Comment 13 bugzilla 2011-03-07 23:09:25 UTC
similar to https://bugzilla.redhat.com/show_bug.cgi?id=642861

Comment 14 Stanislaw Gruszka 2011-03-15 06:20:48 UTC
For those for whom pcie_aspm=off does not help and who can reproduce the problem, please test this kernel:
http://koji.fedoraproject.org/koji/taskinfo?taskID=2900722

Comment 15 bugzilla 2011-06-03 10:40:43 UTC
still getting this problem with F15 release.

kernel info:

Linux greivous 2.6.38.6-27.fc15.x86_64 #1 SMP Sun May 15 17:23:28 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

dmesg warnings:

Jun  3 12:24:45 greivous kernel: [44605.706282] NOHZ: local_softirq_pending 08
Jun  3 12:33:08 greivous kernel: [   42.167021] r8169 0000:04:00.0: eth0: link down
Jun  3 12:33:11 greivous kernel: [   45.269826] r8169 0000:04:00.0: eth0: link up

Comment 16 James 2011-10-22 15:33:19 UTC
Another report for F15. This was hard locking my kernel. Magic SysRq did not work.

04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)

Linux bay 2.6.40.6-0.fc15.x86_64 #1 SMP Tue Oct 4 00:39:50 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

I have since replaced the kernel's r8169 driver with Realtek's proprietary r8168 driver, and it works without issues.

Comment 17 Francois Romieu 2011-10-22 18:19:05 UTC
James, please send dmesg including XID line from the r8169 driver and complete
lspci -tv. There is a wide range of 8168 (resp. 810x) devices which share the
same PCI identifiers.

-- 
Ueimor

Comment 18 James 2011-10-23 02:41:19 UTC
(In reply to comment #17)
If it helps any, here's the dmesg from when I was running Fedora 10 and it worked fine:
Oct 10 02:33:44 bay klogd: r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
Oct 10 02:33:44 bay klogd: r8169 0000:04:00.0: PCI INT A -> Link[AE3A] -> GSI 16 (level, low) -> IRQ 16
Oct 10 02:33:44 bay klogd: r8169 0000:04:00.0: no MSI. Back to INTx.
Oct 10 02:33:44 bay klogd: eth0: RTL8168c/8111c at 0xffffc20000642000, 00:1f:bc:03:b4:c9, XID 3c4000c0 IRQ 16


Now, the dmesg from Fedora 15:
Oct 18 15:08:34 bay klogd: [   14.483231] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
Oct 18 15:08:34 bay klogd: [   14.483506] r8169 0000:04:00.0: PCI INT A -> Link[AE3A] -> GSI 16 (level, low) -> IRQ 16
Oct 18 15:08:34 bay klogd: [   14.484238] r8169 0000:04:00.0: eth0: RTL8168c/8111c at 0xffffc9000585e000, 00:1f:bc:03:b4:c9, XID 1c4000c0 IRQ 40


And the dmesg from Realtek's driver (nice bug in the output text...):
Oct 22 10:55:43 bay klogd: [   12.818021] r8168 Gigabit Ethernet driver 8.025.00-NAPI loaded
Oct 22 10:55:44 bay klogd: [   12.818060] r8168 0000:04:00.0: PCI INT A -> Link[AE3A] -> GSI 16 (level, low) -> IRQ 16
Oct 22 10:55:44 bay klogd: [   12.818988] eth%d: RTL8168B/8111B at 0xffffc90005894000, 00:1f:bc:03:b4:c9, IRQ 40
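
In case it helps others collect the same information, here is a small Python sketch that pulls the chip name, XID and IRQ out of r8169 probe lines like the ones above. The regular expression is an assumption based only on the log format shown in this comment; other driver versions may print the line differently.

```python
import re

# Pattern modelled on the r8169 probe lines quoted in this comment.
XID_RE = re.compile(
    r"(?P<chip>RTL\S+) at 0x[0-9a-f]+, "
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}), "
    r"XID (?P<xid>[0-9a-f]{8}) IRQ (?P<irq>\d+)"
)


def parse_probe_line(line):
    """Return (chip, xid, irq) for an r8169 probe line, or None if no XID."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return m.group("chip"), m.group("xid"), int(m.group("irq"))


if __name__ == "__main__":
    f10 = ("Oct 10 02:33:44 bay klogd: eth0: RTL8168c/8111c at "
           "0xffffc20000642000, 00:1f:bc:03:b4:c9, XID 3c4000c0 IRQ 16")
    print(parse_probe_line(f10))
```

Run over the two r8169 logs above, this reports XID 3c4000c0 with IRQ 16 on Fedora 10 and XID 1c4000c0 with IRQ 40 on Fedora 15; the r8168 line carries no XID and yields None.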

Now lspci -tv:
-[0000:00]-+-00.0  nVidia Corporation MCP78S [GeForce 8200] Memory Controller
           +-01.0  nVidia Corporation MCP78S [GeForce 8200] LPC Bridge
           +-01.1  nVidia Corporation MCP78S [GeForce 8200] SMBus
           +-01.2  nVidia Corporation MCP78S [GeForce 8200] Memory Controller
           +-01.3  nVidia Corporation MCP78S [GeForce 8200] Co-Processor
           +-01.4  nVidia Corporation MCP78S [GeForce 8200] Memory Controller
           +-02.0  nVidia Corporation MCP78S [GeForce 8200] OHCI USB 1.1 Controller
           +-02.1  nVidia Corporation MCP78S [GeForce 8200] EHCI USB 2.0 Controller
           +-04.0  nVidia Corporation MCP78S [GeForce 8200] OHCI USB 1.1 Controller
           +-04.1  nVidia Corporation MCP78S [GeForce 8200] EHCI USB 2.0 Controller
           +-06.0  nVidia Corporation MCP78S [GeForce 8200] IDE
           +-07.0  nVidia Corporation MCP72XE/MCP72P/MCP78U/MCP78S High Definition Audio
           +-08.0-[01]--+-07.0  Conexant Systems, Inc. CX23880/1/2/3 PCI Video and Audio Decoder
           |            +-07.1  Conexant Systems, Inc. CX23880/1/2/3 PCI Video and Audio Decoder [Audio Port]
           |            +-07.2  Conexant Systems, Inc. CX23880/1/2/3 PCI Video and Audio Decoder [MPEG Port]
           |            \-07.4  Conexant Systems, Inc. CX23880/1/2/3 PCI Video and Audio Decoder [IR Port]
           +-09.0  nVidia Corporation Device 0584
           +-10.0-[02]----00.0  nVidia Corporation GT200 [GeForce GTX 260]
           +-12.0-[03]--
           +-13.0-[04]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
           +-14.0-[05]--
           +-18.0  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
           +-18.1  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
           +-18.2  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
           \-18.3  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control

And I thought you might want to see lspci -vv:
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
        Subsystem: eVga.com. Corp. Device 8111
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 40
        Region 0: I/O ports at 9c00 [size=256]
        Region 2: Memory at fdbff000 (64-bit, non-prefetchable) [size=4K]
        Region 4: Memory at fdaf0000 (64-bit, prefetchable) [size=64K]
        [virtual] Expansion ROM at fda00000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0100c  Data: 4161
        Capabilities: [70] Express (v1) Endpoint, MSI 01
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [b0] MSI-X: Enable- Count=2 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00000800
        Capabilities: [d0] Vital Product Data
                Unknown small resource type 05, will not decode more.
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout+ NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [140 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [160 v1] Device Serial Number 3e-00-00-00-68-4c-e0-00
        Kernel driver in use: r8168
        Kernel modules: r8168
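
Since ASPM is the suspect in this bug, the two relevant fields in the dump above are LnkCap (what the link supports) and LnkCtl (what is currently enabled). A minimal Python sketch for extracting them from saved "lspci -vv" text, assuming the field layout shown above (the function name is hypothetical):

```python
import re


def aspm_state(lspci_text):
    """Return (supported, enabled) ASPM strings from `lspci -vv` output.

    Either element is None when the matching LnkCap/LnkCtl field is absent.
    """
    cap = re.search(r"LnkCap:.*?ASPM ([^,]+),", lspci_text)
    ctl = re.search(r"LnkCtl:\s*ASPM ([^;]+);", lspci_text)
    return (cap.group(1) if cap else None,
            ctl.group(1) if ctl else None)


if __name__ == "__main__":
    sample = (
        "LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, "
        "Latency L0 <512ns, L1 <64us\n"
        "LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+\n"
    )
    print(aspm_state(sample))
```

For the output above this gives "L0s L1" as supported and "Disabled" as the current setting, i.e. ASPM was off on this link at the time the dump was taken (with the r8168 driver loaded).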

Comment 19 Stanislaw Gruszka 2011-10-24 12:00:11 UTC
(In reply to comment #16)

> I have since replaced the kernel's r8169 driver with Realtek's proprietary
> r8168 driver, and it works without issues.

You mean this http://code.google.com/p/r8168 ? It's a GPL driver. I wonder why Realtek prefers to write (and maintain for each kernel release) a separate GPL driver, instead of making possibly small improvements to the in-kernel driver and making their hardware work out of the box? Ehh, we do not live in a sane world.

Comment 20 James 2011-10-24 15:00:10 UTC
(In reply to comment #19)
> You mean this http://code.google.com/p/r8168 ? It's a GPL driver. I wonder why
> Realtek prefers to write (and maintain for each kernel release) a separate GPL
> driver, instead of making possibly small improvements to the in-kernel driver
> and making their hardware work out of the box? Ehh, we do not live in a sane
> world.

That's the one, but I got it direct from Realtek: http://www.realtek.com/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=5&Level=5&Conn=4&DownTypeID=3&GetDown=false#2

Comment 21 Francois Romieu 2011-10-24 21:38:30 UTC
(In reply to comment #19)
[...]
> You mean this http://code.google.com/p/r8168 ? It's a GPL driver. I wonder why
> Realtek prefers to write (and maintain for each kernel release) a separate GPL
> driver, instead of making possibly small improvements to the in-kernel driver
> and making their hardware work out of the box? Ehh, we do not live in a sane
> world.

Actually Realtek both maintains its own driver - mostly forked from an old in
tree driver - and contributes new chipset support and fixes to the r8169 kernel
driver. Though not perfect, the situation has improved in this regard.

James, can you send the ID of your last known working F10 kernel ?

Your F15 kernel is 2.6.40.6-0.fc15.x86_64 and you use a 1500 byte MTU, right ?

How does your system lock up : idle or under moderate/high network load ?

-- 
Ueimor

Comment 22 Henry Wertz 2011-11-15 23:26:58 UTC
I've had freezes on my Mini 12 under certain types of load, from the Dell-specific Ubuntu 8.04 all the way to 11.04, if I use SMP. If I run "nosmp" it is rock solid.

I found VERY little speed difference between using SMP (really hyperthreading) and disabling it (I suspect gcc uses the CPU's resources pretty efficiently, so the second "virtual CPU" has very little to use).

I figure either 1) the system overheats (it may run just a bit cooler with nosmp...) or 2) the GMA500 driver is not SMP-safe. I saw freezes with the vesa, psb, and emgd drivers -- but all 3 jump into closed-source code (vesa calls the VESA VBE in the BIOS, and psb and emgd have binary blobs that execute on the CPU).
I could be wrong, but I'd assume there'd be more complaints about b43 and r8169 being non-SMP-safe (from crashing other systems) if they were the culprit.

Comment 23 Henry Wertz 2011-11-15 23:34:57 UTC
Sorry to "double post", but I should add that my lockups also occurred especially when downloading stuff, and most often when downloading and playing a video. I suspect that if it is a non-SMP-safe driver, it's not r8169 or b43; they just generate plenty of interrupts to trip up whatever critical section is not disabling interrupts.

Comment 24 Fedora End Of Life 2012-08-16 18:19:22 UTC
This message is a notice that Fedora 14 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 14. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained.  At this time, all open bugs with a Fedora 'version'
of '14' have been closed as WONTFIX.

(Please note: Our normal process is to give advanced warning of this 
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen 
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we were unable to fix it before Fedora 14 reached end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" (top right of this page) and open it against that 
version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

