Summary: | Kernel freezes under moderate network load, driver r8169 | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Mike Khusid <mkhusid>
Component: | kernel | Assignee: | Ivan Vecera <ivecera>
Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | high | Docs Contact: |
Priority: | low | |
Version: | 14 | CC: | bugzilla, dougsland, fedora, gansalmon, hwertz10, itamar, ivecera, jofernan, jonathan, kernel-maint, madhu.chinakonda, mschmidt, romieu, sgruszka, stuffcorpse, tomi.leppikangas, tom
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2012-08-16 18:19:19 UTC | Type: | ---
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Attachments: |
|
Description
Mike Khusid
2010-10-13 23:35:23 UTC
Adding cc per http://fedoraproject.org/wiki/KernelBugTriage

Can you log into a virtual console (Alt+Ctrl+F2), try to reproduce, and take a photo when the kernel crashes?

Can you test with the newest kernel from updates? I had a very similar problem and it disappeared with newer kernels.

I started to get similar crashes with recent Fedora kernels. It might have started with 2.6.35.10-74. I also tried the stock 2.6.36.3 kernel, but it also crashed. The crash occurred after around 1-3h of light use; I only have a 5 MB connection, so network traffic is quite light. The last kernel I have tested is 2.6.35.11-83. Now I tried the parameter pcie_aspm=off and there has been no crash after that. Usually I don't see any panic message because X freezes, but one time I managed to get it to crash with a console open; I'll attach an image from that. At first I thought it was a CPU/memory problem, but I ran memtest86 for quite a long time without any errors. mcedecode says this:

CPU 0: Machine Check Exception: 0000000000000004 Bank 0: b200004000000800
TSC bbca6bea68
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
Wed Feb 16 00:15:21 2011
CPU 0 BANK 0 TSC bbca6bea68 (null)
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE timeout
BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 4
CPUID Vendor Intel Family 6 Model 15
PROCESSOR 0:6f8 TIME 1297808121 SOCKET 0 APIC 0
Kernel panic: CPU context corrupt

bug#617936, bug#538920 and bug#617936 are quite similar.

Created attachment 479670 [details]
Image from kernel panic.
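The flag lines in the mcedecode output above ("Uncorrected error", "Error enabled", "Processor context corrupt", "MCIP") can be cross-checked by hand against the raw STATUS and MCGSTATUS values. The following is a minimal sketch, assuming the architectural bit layout of IA32_MCi_STATUS and IA32_MCG_STATUS from the Intel SDM; the helper names `decode_flags` and `mca_error_code` are made up for illustration and are not part of mcelog or any other tool. It does not attempt to decode the model-specific fields.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}

# Flag bits of IA32_MCG_STATUS.
MCG_STATUS_FLAGS = {
    0: "RIPV",  # restart IP valid
    1: "EIPV",  # error IP valid
    2: "MCIP",  # machine check in progress
}


def decode_flags(value, table):
    """Return the names of the flags set in value, highest bit first."""
    return [table[bit] for bit in sorted(table, reverse=True) if (value >> bit) & 1]


def mca_error_code(status):
    """Return the low 16 bits of a bank status value (the MCA error code)."""
    return status & 0xFFFF


if __name__ == "__main__":
    status = 0xB200004000000800  # "STATUS" value from the report
    mcg_status = 0x4             # "MCGSTATUS" value from the report
    print("bank flags:", decode_flags(status, MCI_STATUS_FLAGS))
    print("MCG flags: ", decode_flags(mcg_status, MCG_STATUS_FLAGS))
    print("MCA error code: 0x%04x" % mca_error_code(status))
```

With the values from the report, the set bank flags are VAL, UC, EN and PCC, and the only MCG flag set is MCIP, which agrees with the decoded text.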
Created attachment 479671 [details]
output from "lspci -vvv -s 04:00.0" with param "pcie_aspm=off"
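To confirm whether pcie_aspm=off actually took effect for the NIC, the LnkCap and LnkCtl lines of the lspci output are the relevant ones. Below is a rough sketch that pulls the ASPM fields out of saved lspci text. It assumes the line format seen in the `lspci -vv` output quoted later in this report ("LnkCap: ... ASPM L0s L1, ..." and "LnkCtl: ASPM Disabled; ..."); other lspci versions may format these lines differently, and the function names are hypothetical.

```python
import re


def aspm_supported(lspci_text):
    """Return the ASPM states advertised in LnkCap, or None if not found."""
    match = re.search(r"LnkCap:.*?ASPM ([^,]+),", lspci_text)
    return match.group(1).strip() if match else None


def aspm_setting(lspci_text):
    """Return the ASPM setting reported in LnkCtl, or None if not found."""
    match = re.search(r"LnkCtl:\s*ASPM ([^;]+);", lspci_text)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    # Two lines copied from the lspci -vv output in this report.
    sample = (
        "LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, "
        "Latency L0 <512ns, L1 <64us\n"
        "LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+\n"
    )
    print("supported:", aspm_supported(sample))
    print("setting:  ", aspm_setting(sample))
```

In practice the text would come from something like `lspci -vvv -s 04:00.0` saved to a file, typically run as root so that the capability list is fully visible.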
Ok, so pcie_aspm=off helps. I thought we had disabled this by default on r8169 ...

We disable ASPM in r8169, but only on RHEL6. No upstream, no stable, no Fedora; it seems nobody cared to post the patch. I'm going to send it now.

I am now pretty sure that my problems were caused by faulty hardware. The CPU or motherboard seems to be broken, so pcie_aspm=off didn't help for me. Sorry about the misleading info.

Just want to put in my 2c that my hardware is working fine with other Linux distros, so the problem seems unique to Fedora. I haven't tried the newer kernel yet per Comment 3.

Try pcie_aspm=off. We enabled this in the kernel; on upstream and other distros, I suppose, ASPM is disabled by default.

pcie_aspm=off doesn't help with 2.6.35.11-83.fc14.x86_64

04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 03)

The card worked fine in f10/12/13.

For those for whom pcie_aspm=off does not help and who can reproduce the problem, please test this kernel: http://koji.fedoraproject.org/koji/taskinfo?taskID=2900722

Still getting this problem with the F15 release.

kernel info:
Linux greivous 2.6.38.6-27.fc15.x86_64 #1 SMP Sun May 15 17:23:28 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

dmesg warnings:
Jun 3 12:24:45 greivous kernel: [44605.706282] NOHZ: local_softirq_pending 08
Jun 3 12:33:08 greivous kernel: [ 42.167021] r8169 0000:04:00.0: eth0: link down
Jun 3 12:33:11 greivous kernel: [ 45.269826] r8169 0000:04:00.0: eth0: link up

Another report for F15. This was hard locking my kernel. Magic SysRq did not work.

04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
Linux bay 2.6.40.6-0.fc15.x86_64 #1 SMP Tue Oct 4 00:39:50 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

I have since replaced the kernel's r8169 driver with Realtek's proprietary r8168 driver, and it works without issues.

James, please send dmesg including the XID line from the r8169 driver and complete lspci -tv.
There is a wide range of 8168 (resp. 810x) devices which share the same PCI identifiers.

-- Ueimor

(In reply to comment #17)
If it helps any, here's the dmesg from when I was running Fedora 10 and it worked fine:

Oct 10 02:33:44 bay klogd: r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
Oct 10 02:33:44 bay klogd: r8169 0000:04:00.0: PCI INT A -> Link[AE3A] -> GSI 16 (level, low) -> IRQ 16
Oct 10 02:33:44 bay klogd: r8169 0000:04:00.0: no MSI. Back to INTx.
Oct 10 02:33:44 bay klogd: eth0: RTL8168c/8111c at 0xffffc20000642000, 00:1f:bc:03:b4:c9, XID 3c4000c0 IRQ 16

Now, the dmesg from Fedora 15:

Oct 18 15:08:34 bay klogd: [ 14.483231] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
Oct 18 15:08:34 bay klogd: [ 14.483506] r8169 0000:04:00.0: PCI INT A -> Link[AE3A] -> GSI 16 (level, low) -> IRQ 16
Oct 18 15:08:34 bay klogd: [ 14.484238] r8169 0000:04:00.0: eth0: RTL8168c/8111c at 0xffffc9000585e000, 00:1f:bc:03:b4:c9, XID 1c4000c0 IRQ 40

And the dmesg from Realtek's driver (nice bug in the output text...):

Oct 22 10:55:43 bay klogd: [ 12.818021] r8168 Gigabit Ethernet driver 8.025.00-NAPI loaded
Oct 22 10:55:44 bay klogd: [ 12.818060] r8168 0000:04:00.0: PCI INT A -> Link[AE3A] -> GSI 16 (level, low) -> IRQ 16
Oct 22 10:55:44 bay klogd: [ 12.818988] eth%d: RTL8168B/8111B at 0xffffc90005894000, 00:1f:bc:03:b4:c9, IRQ 40

Now lspci -tv:

-[0000:00]-+-00.0  nVidia Corporation MCP78S [GeForce 8200] Memory Controller
           +-01.0  nVidia Corporation MCP78S [GeForce 8200] LPC Bridge
           +-01.1  nVidia Corporation MCP78S [GeForce 8200] SMBus
           +-01.2  nVidia Corporation MCP78S [GeForce 8200] Memory Controller
           +-01.3  nVidia Corporation MCP78S [GeForce 8200] Co-Processor
           +-01.4  nVidia Corporation MCP78S [GeForce 8200] Memory Controller
           +-02.0  nVidia Corporation MCP78S [GeForce 8200] OHCI USB 1.1 Controller
           +-02.1  nVidia Corporation MCP78S [GeForce 8200] EHCI USB 2.0 Controller
           +-04.0  nVidia Corporation MCP78S [GeForce 8200] OHCI USB 1.1 Controller
           +-04.1  nVidia Corporation MCP78S [GeForce 8200] EHCI USB 2.0 Controller
           +-06.0  nVidia Corporation MCP78S [GeForce 8200] IDE
           +-07.0  nVidia Corporation MCP72XE/MCP72P/MCP78U/MCP78S High Definition Audio
           +-08.0-[01]--+-07.0  Conexant Systems, Inc. CX23880/1/2/3 PCI Video and Audio Decoder
           |            +-07.1  Conexant Systems, Inc. CX23880/1/2/3 PCI Video and Audio Decoder [Audio Port]
           |            +-07.2  Conexant Systems, Inc. CX23880/1/2/3 PCI Video and Audio Decoder [MPEG Port]
           |            \-07.4  Conexant Systems, Inc. CX23880/1/2/3 PCI Video and Audio Decoder [IR Port]
           +-09.0  nVidia Corporation Device 0584
           +-10.0-[02]----00.0  nVidia Corporation GT200 [GeForce GTX 260]
           +-12.0-[03]--
           +-13.0-[04]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
           +-14.0-[05]--
           +-18.0  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
           +-18.1  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
           +-18.2  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
           \-18.3  Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control

And I thought you might want to see lspci -vv:

04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
    Subsystem: eVga.com. Corp. Device 8111
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 40
    Region 0: I/O ports at 9c00 [size=256]
    Region 2: Memory at fdbff000 (64-bit, non-prefetchable) [size=4K]
    Region 4: Memory at fdaf0000 (64-bit, prefetchable) [size=64K]
    [virtual] Expansion ROM at fda00000 [disabled] [size=128K]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee0100c Data: 4161
    Capabilities: [70] Express (v1) Endpoint, MSI 01
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 256 bytes, MaxReadReq 4096 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
            ClockPM+ Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    Capabilities: [b0] MSI-X: Enable- Count=2 Masked-
        Vector table: BAR=4 offset=00000000
        PBA: BAR=4 offset=00000800
    Capabilities: [d0] Vital Product Data
        Unknown small resource type 05, will not decode more.
    Capabilities: [100 v1] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout+ NonFatalErr+
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [140 v1] Virtual Channel
        Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb: Fixed- WRR32- WRR64- WRR128-
        Ctrl: ArbSelect=Fixed
        Status: InProgress-
        VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status: NegoPending- InProgress-
    Capabilities: [160 v1] Device Serial Number 3e-00-00-00-68-4c-e0-00
    Kernel driver in use: r8168
    Kernel modules: r8168

(In reply to comment #16)
> I have since replaced the kernel's r8169 driver with Realtek's proprietary
> r8168 driver, and it works without issues.

You mean this http://code.google.com/p/r8168 ? It's GPL driver. I wonder why Realtek prefer to write (and maintain for each kernel release) separate GPL driver, instead of making possible small improvements of in kernel driver, and make they hardware work out of the box? Ehh, we do not live in sane world.

(In reply to comment #19)
> You mean this http://code.google.com/p/r8168 ? It's GPL driver. I wonder why
> Realtek prefer to write (and maintain for each kernel release) separate GPL
> driver, instead of making possible small improvements of in kernel driver, and
> make they hardware work out of the box? Ehh, we do not live in sane world.

That's the one, but I got it direct from Realtek: http://www.realtek.com/downloads/downloadsView.aspx?Langid=1&PNid=13&PFid=5&Level=5&Conn=4&DownTypeID=3&GetDown=false#2

(In reply to comment #19)
[...]
> You mean this http://code.google.com/p/r8168 ? It's GPL driver. I wonder why
> Realtek prefer to write (and maintain for each kernel release) separate GPL
> driver, instead of making possible small improvements of in kernel driver, and
> make they hardware work out of the box? Ehh, we do not live in sane world.

Actually Realtek both maintains its own driver - mostly forked from an old in-tree driver - and contributes new chipset support and fixes to the r8169 kernel driver. Though not perfect, the situation has improved in this regard.

James, can you send the ID of your last known working F10 kernel?
Your F15 kernel is 2.6.40.6-0.fc15.x86_64 and you use a 1500 byte MTU, right?
How does your system lock up: idle or under moderate/high network load?

-- Ueimor

I've had freezes on my Mini 12 under certain types of load, from the Dell-specific Ubuntu 8.04 all the way to 11.04, if I use SMP. If I run "nosmp" it is rock solid. I found VERY little speed difference between using SMP (really hyperthreading) versus disabling it (I suspect gcc pretty efficiently uses the CPU's resources, so the second "virtual CPU" has very little to use). I figure either 1) the system overheats (it may run just a bit cooler with nosmp...) or 2) the GMA500 driver is not SMP-safe. I saw freezes with the vesa, psb, and emgd drivers -- but all 3 jump into closed-source code (vesa calls the VESA VBE in the BIOS, and psb and emgd have binary blobs that execute on the CPU). I could be wrong, but I'd assume there'd be more complaints about b43 and r8169 being non-SMP-safe (from crashing other systems) if they were the culprit.

Sorry to "double post", but I should add that my lockups also occurred especially when downloading stuff, and most of all when downloading and playing a video. I suspect that if it is a non-SMP-safe driver, it's not r8169 or b43; they just generate plenty of interrupts to trip up whatever critical section is not disabling interrupts.
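For anyone collecting the XID line requested in comment 17 from several kernels, the probe line can be picked out of a saved log mechanically. This is only a sketch, assuming the r8169 probe-line format shown in the Fedora 10 and Fedora 15 dmesg excerpts earlier in this report; the function name `parse_r8169_probe` and the dictionary keys are made up for illustration, and other driver versions may print the line differently.

```python
import re

# Matches e.g. "eth0: RTL8168c/8111c at 0xffff..., 00:1f:bc:03:b4:c9, XID 1c4000c0 IRQ 40"
PROBE_RE = re.compile(
    r"(eth\d+): (\S+) at (0x[0-9a-f]+), "
    r"((?:[0-9a-f]{2}:){5}[0-9a-f]{2}), "
    r"XID ([0-9a-f]{8}) IRQ (\d+)"
)


def parse_r8169_probe(line):
    """Return the fields of an r8169 probe line as a dict, or None if no match."""
    match = PROBE_RE.search(line)
    if match is None:
        return None
    iface, chip, _mmio, mac, xid, irq = match.groups()
    return {"iface": iface, "chip": chip, "mac": mac, "xid": xid, "irq": int(irq)}


if __name__ == "__main__":
    # Lines copied from the dmesg excerpts in this report.
    lines = [
        "Oct 10 02:33:44 bay klogd: eth0: RTL8168c/8111c at 0xffffc20000642000, "
        "00:1f:bc:03:b4:c9, XID 3c4000c0 IRQ 16",
        "Oct 18 15:08:34 bay klogd: [ 14.484238] r8169 0000:04:00.0: eth0: "
        "RTL8168c/8111c at 0xffffc9000585e000, 00:1f:bc:03:b4:c9, XID 1c4000c0 IRQ 40",
    ]
    for line in lines:
        print(parse_r8169_probe(line))
```

The line printed by Realtek's r8168 driver in this report carries no XID field (and shows the unexpanded `eth%d`), so the sketch returns None for it.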
This message is a notice that Fedora 14 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 14. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At this time, all open bugs with a Fedora 'version' of '14' have been closed as WONTFIX.

(Please note: Our normal process is to give advanced warning of this occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, feel free to reopen this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that we were unable to fix it before Fedora 14 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" (top right of this page) and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping