Bug 885828 - realtek: working network is disabled "shortly" after reboot on FC17
Summary: realtek: working network is disabled "shortly" after reboot on FC17
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 17
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-12-10 17:59 UTC by wwaustin
Modified: 2013-01-08 09:11 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2013-01-08 09:10:04 UTC
Type: Bug
Embargoed:


Attachments
gzipped tar file containing output of dmesg, /var/log/messages from last boot, listing of hardware on machine, output of "rpm -qa" command (78.89 KB, application/x-gzip)
2012-12-10 17:59 UTC, wwaustin

Description wwaustin 2012-12-10 17:59:17 UTC
Created attachment 661000
gzipped tar file containing output of dmesg, /var/log/messages from last boot, listing of hardware on machine, output of "rpm -qa" command

Description of problem:

----
NOTE - This machine uses the rpmfusion nvidia drivers.  Removing them and going to the Fedora-supplied drivers does not change anything below - first one and then both ethernet cards will become irreversibly disabled.
----

This machine is a workstation connected both directly to the internet (no proxy) and to a local (192.168.1.xxx) LAN - it has 2 network cards.

When the machine is rebooted, both cards work "normally" and behave as expected.
The cards are:
   eth0 - Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
   eth1 - D-Link System Inc DGE-560T PCI Express Gigabit Ethernet Adapter (rev 13)

Depending on system activity, sometimes after a few minutes and always within half an hour or so, eth0 (the Realtek card) is suddenly disabled and I lose access to local printers, other machines, etc.
NetworkManager grays out the "enable/disable" buttons for BOTH cards, and the card cannot be re-enabled.
The puzzling part is that if I put my previous FC16 drive back in the system instead of the FC17 drive, then both cards work perfectly and there are no problems.

The first time this happened, I assumed that the Realtek module on the motherboard had failed and I bought the Realtek pciexpress card which is being disabled here (the nic on the MB still works fine but only under FC16 - but FWIW it is never enabled under FC17 at all - hence the new card).

A few hours or a day or so later, eth1 joins it as disabled, and the machine has no access to anything.  Period.  (I tried a USB NIC "borrowed" from another machine, and it, too, was eventually disabled, so I suspect a problem in the kernel, NetworkManager, or some other networking component.)

So far the only "WORKAROUND" I've found is to reboot the machine, and then the problem comes back again every time.

The weird part here is that at some point (a few kernels back, but I don't know which kernel - sorry) the cards were not disabled that I know of, but that could be merely because I was working for 2 weeks on a critical project which required coding/compiling/debugging but no network access...  




Version-Release number of selected component (if applicable):
The "obvious" culprit is NetworkManager; however, I'm not positive, so I'm giving more info than requested.
Kernel: kernel-3.6.9-2.fc17.x86_64 (has happened with 3.6.8-2 and 3.6.7-4 as well, but before that not sure).
NetworkManager


How reproducible:
Happens every time

Steps to Reproduce:
1. Boot the machine
2. Log in. Then do a telnet, ssh, rcp, etc. to another machine or print to the printer.
3. Wait a few minutes - the Realtek card (eth0) will have been disabled.
4. Wait a few hours and the D-Link card (eth1) will also have been disabled.
  
Actual results:
First eth0 and then eventually (a few hours up to say 24 hours later) eth1 will also be disabled


Expected results:
Should never be disabled.



Additional info:

Comment 1 Jirka Klimes 2012-12-14 10:40:02 UTC
The logs you've attached don't reveal any obvious problem. However, the messages log seems quite truncated. Would you attach a more complete log, so we can see the initialization after boot and from NetworkManager start?
Also, please paste output of:
$ nmcli dev status

What is output of:
$ ip a

Also, try disabling or relabelling SELinux; there are a bunch of error messages.
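
For anyone gathering the same information, the two commands requested above can be captured into a single text report. This is only a minimal sketch, not part of the original request: the helper name collect is hypothetical, and it assumes Python 3.7 or later is available on the affected machine.

```python
import subprocess


def collect(commands):
    """Run each command and return one text report with all outputs."""
    sections = []
    for cmd in commands:
        header = "$ " + " ".join(cmd)
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=30
            )
            body = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing tool or a hang is recorded instead of aborting.
            body = "failed to run: {}\n".format(exc)
        sections.append(header + "\n" + body)
    return "\n".join(sections)


if __name__ == "__main__":
    # The two commands asked for in this comment.
    print(collect([["nmcli", "dev", "status"], ["ip", "a"]]))
```

Running it once right after boot and again after an interface has been disabled gives two reports that can be compared and attached.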

Comment 2 wwaustin 2013-01-03 16:32:53 UTC
I'll try to be brief.

Sorry to have taken so long to respond - the holidays messed up my schedule
beyond belief.

To make a long story short, I continued to read several different threads
about problems which sounded similar to mine on FC17.

A couple of other people described the same problem there, and their issue
was the r8169 driver for Realtek NICs.

Here's the output from lspci -vv -b for the NIC on the motherboard:

05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B \
    PCI Express Gigabit Ethernet controller (rev 03)
        Subsystem: Giga-byte Technology GA-EP45-DS5 Motherboard
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- \
    Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- \
    <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 5
        Region 0: I/O ports at be00
        Region 2: Memory at fd3ff000 (64-bit, prefetchable)
        Region 4: Memory at fd3f8000 (64-bit, prefetchable)
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,\
                    D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0200c  Data: 4162
        Capabilities: [70] Express (v2) Endpoint, MSI 01
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s \
                    <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- \
                    Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 4096 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ \
                    TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, \
                    Latency L0 <512ns, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- \
                    CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ \
                    DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- \
                    SpeedDis-, Selectable De-emphasis: -6dB
                         Transmit Margin: Normal Operating Range, 
                            EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, 
                            EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, 
                            LinkEqualizationRequest-
        Capabilities: [ac] MSI-X: Enable- Count=4 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00000800
        Capabilities: [cc] Vital Product Data
                Unknown small resource type 00, will not decode more.
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- \
                    RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- \
                    RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- \
                    RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [140 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [160 v1] Device Serial Number 03-00-00-00-68-4c-e0-00
        Kernel driver in use: r8169

Originally I had assumed that the nic on my MB had gone bad.  As it happened
the pcie card I bought to "replace" it used another Realtek chip and the
same driver.
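
A quick way to confirm which driver each card is bound to is to pull the "Kernel driver in use" line out of saved lspci -vv output. The following is a rough sketch under stated assumptions: the function name drivers_by_device is hypothetical, and it assumes each device entry starts at column 0 with its detail lines indented, as in the listing above.

```python
import re


def drivers_by_device(lspci_text):
    """Map each PCI device header line to its 'Kernel driver in use' value."""
    drivers = {}
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device entry.
            current = line.strip()
            drivers[current] = None
        else:
            match = re.search(r"Kernel driver in use:\s*(\S+)", line)
            if match and current is not None:
                drivers[current] = match.group(1)
    return drivers
```

Any two devices that map to r8169 share the same driver, so swapping one for the other would not be expected to avoid a driver-level problem.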

After reading everything I could find, I finally decided to wipe out my
installation (after doing an rpm -qa > /tmp/rpm-qa.save) and then restore
everything from scratch.

When the system booted I STILL couldn't see the Realtek card (the pcie card).

So then, following a suggestion that someone made on a H/W discussion
list, I removed the newer chip and went back to the original nic on my MB
(re-enabling it that is).   It now works just fine.

I don't know whether they filed bug reports or not, but several other
folks had reported having what sounds like the same problem:  that is, at some
point one of the fc17 kernels had an r8169 driver that stopped the card
from working and for a couple of kernels from even being recognized.
A very recent kernel fixed the problem - for the older NIC on my MB anyway.

So I suspect that if there is an open Bugzilla issue about the "realtek
problem," this is just another manifestation of it and you should go ahead
and mark it as such.  Otherwise, I'm back up and running now, so this is no
longer an issue for me.

Strangely after everything was "working" again, I DID try disabling the nic
on the MB and trying the pcie card again (which had worked at one point
with the kernel which introduced the problem in the first place - and
sorry, I'm no longer sure which one that was).  But it still didn't work.
If I still had the card I'd update my bug report accordingly, but I swapped
it with someone for a different card (using a different chip) and I don't
have the ability to do any "testing" with it at this point - sorry.

Comment 3 Jirka Klimes 2013-01-08 09:10:04 UTC
(In reply to comment #2)
> I'll try to be brief.
> 
> Strangely after everything was "working" again, I DID try disabling the nic
> on the MB and trying the pcie card again (which had worked at one point
> with the kernel which introduced the problem in the first place - and
> sorry, I'm no longer sure which one that was).  But it still didn't work.
> If I still had the card I'd update my bug report accordingly, but I swapped
> it with someone for a different card (using a different chip) and I don't
> have the ability to do any "testing" with it at this point - sorry.

Well, as the problem was a Realtek driver, I will move the bug to kernel, but close it for now. Should you have problems with the chip in the future, consider reopening the bug and attaching any relevant data. Thanks.

