Bug 617936 - "pcie_aspm=off" option required for stable operation of R8169 Driver Under Load
Summary: "pcie_aspm=off" option required for stable operation of R8169 Driver Under Load
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 13
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-07-25 06:02 UTC by jrickman
Modified: 2011-11-28 08:47 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-06 08:24:19 UTC
Type: ---
Embargoed:



Description jrickman 2010-07-25 06:02:37 UTC
Description of problem:

All released versions of Fedora Core 13 will "hard lock" when used on a Supermicro X7SLA-H motherboard with the built-in Realtek RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02) while the system is under "heavy" network load. The lock-up requires a power recycle to clear. Only once did I see the commonly reported "transmit queue 0 timed out" message in "/var/log/messages".

Version-Release number of selected component (if applicable):

All public released versions of Fedora Core 13 as provided on public mirrors.

How reproducible:

Load Fedora Core 13 with the "minimal" choice. Strip out all packages that cannot be run at "init 3" to reduce any conflicts or load on the system. Ensure that SSH, sftp-server, and "top" are available. Set up a portion of the hard disk for SFTP files. Do not make any tweaks to "sysctl.conf" for networking; accept the installed defaults. All hard disks are Seagate Barracuda SATA drives on the motherboard controller.

On Windows PC use VanDyke SecureFX in SFTP mode. Attempt to transfer a folder containing 20GB of assorted sized video files to the Linux device. The network is 100Mbps FastEthernet using 2 ports on the same Linksys switch.

Steps to Reproduce:
1. Start both machines.
2. Using the local console on Linux, log in and run "top". Press "1" to see all CPUs.
3. Start file transfer
4. File transfer will randomly halt. Linux machine will be "hard locked up".
5. Make note of display in "top"
6. Power recycle Linux computer. Transfer should self-restart on Windows. If not, restart file transfer.
7. Linux box will "hard lock". Make note of "top" info.
  
Actual results:

Ran many test runs, possibly up to 100, changing various things along the way. Used the XFS, ReiserFS, and Ext3 file systems. Toggled HPET in the BIOS. Tried both R8169 ports on the motherboard. Changed RAM to vendor-approved RAM. Tried RAM sizes: 1G, 2G, 4G. Tried SATA drives on the motherboard controller (Intel ICH7) and on a PCIe x4 Marvell-based 88SX7042 controller board. All failed every time. 100% reproducible.

Expected results:

Linux computer should not "hard lock" and require power recycle.

Additional info:

Inserted HP-branded Intel 10/100 PCI FastEthernet card using Intel 82557/8/9 chipset. Repeated tests. No failure at all. Measured throughput was not really different from R8169 chips: between 2500KB/s and 2800KB/s per VanDyke SecureFX GUI.
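
As a rough cross-check of the throughput figures above, the following Python sketch converts the KB/s values shown by the SecureFX GUI into Mbit/s and compares them with the nominal 100 Mbit/s link rate. It assumes 1 KB = 1000 bytes, which the GUI may not use, and the helper name `kb_per_s_to_mbit_per_s` is made up for illustration.

```python
def kb_per_s_to_mbit_per_s(kb_per_s: float, bytes_per_kb: int = 1000) -> float:
    """Convert a transfer rate in KB/s to Mbit/s.

    bytes_per_kb is an assumption: pass 1024 if the tool reports binary KB.
    """
    return kb_per_s * bytes_per_kb * 8 / 1_000_000


LINK_MBIT_PER_S = 100.0  # nominal FastEthernet rate from the report

for rate in (2500, 2800):
    mbit = kb_per_s_to_mbit_per_s(rate)
    share = mbit / LINK_MBIT_PER_S
    print(f"{rate} KB/s = {mbit:.1f} Mbit/s ({share:.0%} of the link rate)")
```

Under that assumption the reported rates work out to roughly 20 to 22 Mbit/s, so the lock-ups occurred well below the nominal capacity of the link.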

In all failure cases "top" showed memory remaining ranging from 60MB to 1GB. Amount of memory remaining at time of failure was never ever consistent.

[root@fatman ~]# lspci -s 02:00.0 -vvv 
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
        Subsystem: Super Micro Computer Inc Device 8168
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 27
        Region 0: I/O ports at c800 [size=256]
        Region 2: Memory at fe8ff000 (64-bit, non-prefetchable) [size=4K]
        Region 4: Memory at fddf0000 (64-bit, prefetchable) [size=64K]
        Expansion ROM at fe8c0000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0f00c  Data: 4189
        Capabilities: [70] Express (v1) Endpoint, MSI 01
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot-
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [b0] MSI-X: Enable- Count=2 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00000800
        Capabilities: [d0] Vital Product Data
                Unknown small resource type 05, will not decode more.
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout+ NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [140 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntrySize=0
                Arb:    Fixed- WRR32- WRR64- WRR128- 100ns- - - onfig- TableOffset=0
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Fixed- RR32-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
                        Status: NegoPending- InProgress-
        Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
        Kernel driver in use: r8169
        Kernel modules: r8169

[root@fatman ~]# lspci -s 04:00.0 -vvv
04:00.0 Ethernet controller: Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 (rev 01)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64 (2000ns min, 14000ns max)
        Interrupt: pin A routed to IRQ 20
        Region 0: Memory at fdfff000 (32-bit, prefetchable) [size=4K]
        Region 1: I/O ports at ec00 [size=32]
        Region 2: Memory at feb00000 (32-bit, non-prefetchable) [size=1M]
        Expansion ROM at fea00000 [disabled] [size=1M]
        Kernel driver in use: e100
        Kernel modules: e100

# dmidecode 2.10
SMBIOS 2.5 present.
27 structures occupying 1363 bytes.
Table at 0x000FD170.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 1.0a   
        Release Date: 07/10/2009
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 1024 kB
        Characteristics:
                ISA is supported
                PCI is supported
                PNP is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                ESCD support is available
                Boot from CD is supported
                Selectable boot is supported
                BIOS ROM is socketed
                EDD is supported
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                3.5"/2.88 MB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                CGA/mono video services are supported (int 10h)
                ACPI is supported
                USB legacy is supported
                LS-120 boot is supported
                ATAPI Zip drive boot is supported
                BIOS boot specification is supported
                Targeted content distribution is supported
        BIOS Revision: 8.15

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: Supermicro
        Product Name: X7SLA
        Version: 1234567890
        Serial Number: 1234567890
        UUID: 00020003-0004-0005-0006-000700080009
        Wake-up Type: Power Switch
        SKU Number: To Be Filled By O.E.M.
        Family: To Be Filled By O.E.M.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
        Manufacturer: Supermicro
        Product Name: X7SLA
        Version: 1234567890
        Serial Number: 1234567890
        Asset Tag: To Be Filled By O.E.M.
        Features:
                Board is a hosting board
                Board is replaceable
        Location In Chassis: To Be Filled By O.E.M.
        Chassis Handle: 0x0003
        Type: Motherboard
        Contained Object Handles: 0

Comment 1 Chuck Ebbert 2010-07-25 21:04:44 UTC
Try adding "pcie_aspm=off" to the kernel boot options. Also, try kernel-2.6.34.1-29 from koji, which has a fix that should work on some machines without needing that option.
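
For anyone verifying that the option was actually passed to the running kernel after a reboot, here is a minimal Python sketch. It assumes the kernel command line is readable from /proc/cmdline; the function names are hypothetical, and the check only confirms that the option is present, not that ASPM is really off on every link.

```python
def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """Return True if 'pcie_aspm=off' appears as an option on the command line."""
    # Kernel options are whitespace-separated tokens.
    return "pcie_aspm=off" in cmdline.split()


def read_kernel_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

On a live system, `aspm_disabled_on_cmdline(read_kernel_cmdline())` should return True once the option has been added to the boot loader entry and the machine has been rebooted.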

Comment 2 jrickman 2010-07-26 13:56:10 UTC
Adding "pcie_aspm=off" corrects this problem for this motherboard and the R8169 NIC combination.

I have a different machine with a PCIe-connected RTL8111/8168B chipset that does not experience this issue at all; "lspci" shows that PCIe ASPM is enabled on it.

[root@fw ~]# lspci -s 01:00.0 -vvv
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
        Subsystem: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 25
        Region 0: I/O ports at ce00 [size=256]
        Region 2: Memory at fdbff000 (64-bit, non-prefetchable) [size=4K]
        Region 4: Memory at fdcf0000 (64-bit, prefetchable) [size=64K]
        [virtual] Expansion ROM at fdc00000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
[snip]
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot-
                LnkCtl: ASPM L0s L1 Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
[snip]

This other machine uses a different motherboard (Jetway NC92-330) and runs FC 12 2.6.32.14-127.fc12.i686.PAE. I have run this machine for over a year and routinely run 16~18 Mbps data streams through it but never to it. Any thoughts why do I not see the issue on this other machine?
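
To compare machines without reading the full dumps by eye, the ASPM fields can be pulled out of saved "lspci -vvv" output. This is only a sketch: it assumes the LnkCap/LnkCtl line layout shown in this report (other pciutils versions may print these lines differently), and the function name `link_aspm_states` is made up for illustration.

```python
import re


def link_aspm_states(lspci_text: str) -> dict:
    """Extract ASPM fields from `lspci -vvv` output for a single device.

    Returns {"supported": ..., "enabled": ...}. A value is None when the
    matching LnkCap or LnkCtl line is not present in the text.
    """
    # LnkCap lists what the link supports, e.g. "ASPM L0s L1,".
    supported = re.search(r"LnkCap:.*?ASPM ([^,;]+)[,;]", lspci_text)
    # LnkCtl shows the current setting, e.g. "ASPM L0s L1 Enabled;" or "ASPM Disabled;".
    enabled = re.search(r"LnkCtl:\s*ASPM ([^;]+);", lspci_text)
    return {
        "supported": supported.group(1).strip() if supported else None,
        "enabled": enabled.group(1).strip() if enabled else None,
    }
```

Fed the text of either dump in this report, the function returns "L0s L1" as supported and "L0s L1 Enabled" as the current setting, which matches the observation that ASPM is enabled on both boards even though only one of them locks up.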

Comment 3 Chuck Ebbert 2010-07-26 16:45:49 UTC
(In reply to comment #2)
> This other machine uses a different motherboard (Jetway NC92-330) and runs FC
> 12 2.6.32.14-127.fc12.i686.PAE. I have run this machine for over a year and
> routinely run 16~18 Mbps data streams through it but never to it. Any thoughts
> why do I not see the issue on this other machine?    

There are many variants of the r8169 cards. It may be that some can work with ASPM and some cannot. Also, the code in our 2.6.33 kernel enables ASPM even when the motherboard claims not to support it. 2.6.34.1-29 has a fix for the latter issue.

Comment 4 jrickman 2010-07-27 00:24:53 UTC
I would accept "pcie_aspm=off" as a Fedora installation default when R8169 chipsets are recognized and until improvements are made in identifying R8169 chipset revisions and ASPM support within those revisions. The end user can always enable ASPM ("pcie_aspm=on") later.

Based on the web searching that I have done regarding "pcie_aspm", setting "pcie_aspm=on" seems to cause problems rather than provide benefit. When benefit is noted it appears to be marginal at best; one post stated 0.5W power savings. Perhaps in the future we will see the "pcie_aspm" technology evolve into something really valuable.

Chuck, thank you very much for your assistance. I am looking forward to the next public release and/or update to Fedora.

Comment 5 Martin F 2010-10-25 05:18:33 UTC
I had a similar issue with a Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02) with r8169, and pcie_aspm=off seems to help with the 2.6.34.7-61.fc13.x86_64 kernel.

Just as a note, I also have a Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01) on a different motherboard (the only difference being rev 01 rather than rev 02), and that one has no problem whatsoever with r8169. I'm guessing the 2.6.34.1-29 fix mentioned in comment 3 worked for rev 01 but not rev 02.

Anyway, bug#538920 and bug#620047 both seem related to this.

Comment 6 Bug Zapper 2011-06-01 12:55:16 UTC
This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 13 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 7 jrickman 2011-06-06 08:24:19 UTC
Retested this issue on same hardware.

Unable to duplicate in Fedora Core 14 kernels:
2.6.35.13-91.fc14.i686
2.6.35.13-91.fc14.i686.PAE

OK to close.

Comment 8 bugzilla 2011-06-06 09:21:11 UTC
I'm still seeing issues in Fedora 15, but pcie_aspm=off is not fixing them.

Comment 9 Need Real Name 2011-11-28 08:47:12 UTC
I suspect that pcie_aspm=off not working is responsible for a failing installation (bug 707702).

