Bug 408891

Summary: [PATCH] tg3: system re-ordering mem-mapped io causes eth link to go down
Product: [Fedora] Fedora
Reporter: Steven Samorodin <samorodin>
Component: kernel
Assignee: Michael Chan <mchan>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 8
CC: benlu, chris.brown, mcarlson
Target Milestone: ---
Target Release: ---
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-01-09 07:29:55 UTC
Attachments:
tarball of various commands that may provide insight (flags: none)
debug patch (flags: none)
dmesg (flags: none)

Description Steven Samorodin 2007-12-03 17:21:09 UTC
Description of problem:
I got this message in /var/log/messages:

Nov 28 13:35:22 prophecy kernel: tg3: eth0: The system may be
re-ordering memory-mapped I/O cycles to the network device, attempting
to recover.  Please report the problem to the driver maintainer and
include system chipset information.
Nov 28 13:35:22 prophecy kernel: tg3: eth0: Link is down.
Nov 28 13:35:24 prophecy kernel: tg3: eth0: Link is up at 100 Mbps, full
duplex.
Nov 28 13:35:24 prophecy kernel: tg3: eth0: Flow control is on for TX and
on for RX.

After this I briefly lost my network connection; it recovered almost
immediately.  I know I lost it because my VPN connection dropped.

Version-Release number of selected component (if applicable):
I'm running Fedora 8 on a brand new Dell dual core box.

$ uname -a
Linux prophecy 2.6.23.1-49.fc8 #1 SMP Thu Nov 8 21:41:26 EST 2007 i686 i686 i386
GNU/Linux

With this network card (built in to the motherboard) and no others
installed in the system:

eth0: Tigon3 [partno(BCM95754) rev b002 PHY(5787)] (PCI Express)
10/100/1000Base-T Ethernet 00:1a:a0:cc:6b:cd


How reproducible:
I've seen it twice in a couple of weeks of running Fedora 8.

Steps to Reproduce:
Sorry, I'm not sure what the root cause is.  I did nothing more than install
Fedora 8 on a new machine and use it to write some code.  I don't think the
particular network services I was using have any bearing on this bug.
  
Actual results:

See the messages included above.  Ethernet link lost for a moment.

Expected results:

Expect network connectivity to not go down intermittently.

Additional info:
See attachment.  Reported this to the tg3 maintainer (Michael Chan) and he
recommended that I file a bug here.

Comment 1 Steven Samorodin 2007-12-03 17:21:09 UTC
Created attachment 275921
tarball of various commands that may provide insight

Comment 2 Michael Chan 2007-12-03 23:19:51 UTC
Created attachment 276391
debug patch

Comment 3 Michael Chan 2007-12-03 23:21:01 UTC
I have received similar reports, and my suspicion is that
skb_shinfo(skb)->nr_frags is changed between ->hard_start_xmit() and tx
completion, when the driver frees the skb.  The driver relies on nr_frags to
find the packet boundaries in the tx ring.

If this is easily reproducible, please try the attached debug patch in comment
#2.  It should prevent the problem and will print a warning whenever it detects
that the nr_frags field is corrupted.
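
The failure mode described above can be illustrated with a toy model.  This is
only a sketch in Python, not tg3 code and not the contents of the debug patch:
the names Skb, queue() and complete() are hypothetical, and the ring is a plain
list.  It assumes each packet occupies one head descriptor plus one descriptor
per fragment, and that the completion path steps over 1 + nr_frags descriptors
per packet.  Saving the fragment count at queue time (which is what the warning
text in comment #8 suggests the patch compares against) keeps the walk aligned
even if nr_frags changes afterwards.

```python
from dataclasses import dataclass


@dataclass
class Skb:
    """Toy stand-in for a socket buffer (hypothetical, not the kernel struct)."""
    name: str
    nr_frags: int


def queue(ring, skb):
    """Model of transmit: one head descriptor plus one per fragment.

    Returns the fragment count seen at queue time so the caller can keep it
    as a private snapshot.
    """
    ring.append((skb, True))           # head descriptor
    for _ in range(skb.nr_frags):
        ring.append((skb, False))      # fragment descriptor
    return skb.nr_frags


def complete(ring, snapshots=None):
    """Walk the ring the way a completion handler might.

    With snapshots=None the live skb.nr_frags is trusted.  Otherwise the saved
    count is used and any mismatch is recorded as a warning.
    Returns (freed_names, warnings, in_sync).
    """
    freed, warnings = [], []
    i = 0
    while i < len(ring):
        skb, is_head = ring[i]
        if not is_head:
            # We landed in the middle of a packet: boundaries are lost.
            return freed, warnings, False
        frags = skb.nr_frags
        if snapshots is not None:
            saved = snapshots[skb.name]
            if saved != frags:
                warnings.append(
                    "skb frags corrupted: orig: %d now: %d" % (saved, frags))
            frags = saved
        i += 1 + frags
        freed.append(skb.name)
    return freed, warnings, i == len(ring)


if __name__ == "__main__":
    ring = []
    a, b = Skb("a", 2), Skb("b", 1)
    snapshots = {s.name: queue(ring, s) for s in (a, b)}
    a.nr_frags = 1                     # changed while still queued
    print(complete(ring))              # walk loses sync after packet "a"
    print(complete(ring, snapshots))   # walk stays aligned and logs a warning
```

In this model the unprotected walk stops as soon as it lands on a fragment
descriptor where it expected a packet head, which is loosely analogous to the
driver noticing an inconsistent tx ring.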

Comment 4 Christopher Brown 2008-01-29 22:06:02 UTC
Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try to assist you in resolving it if I can.

Have you been able to test the above patch?

Comment 5 Steven Samorodin 2008-05-24 05:32:41 UTC
I haven't built a kernel since before 1.0.  I'm going to figure out how to do
that and apply and test the suggested patch.  I don't have tg3.c on my box which
probably means I don't have a source RPM for the kernel installed.  I'll update
this bug when I have more.

Comment 6 Steven Samorodin 2008-05-24 19:58:47 UTC
I followed the instructions on this page:
http://fedoraproject.org/wiki/Docs/CustomKernel 

However, after following the instructions for supposedly installing the kernel
on the running system, e.g. rpm -ivh
~/rpmbuild/RPMS/i686/kernel-2.6.24.7-92.fc8.i686.rpm

I'm not sure if I'm running my kernel.

$ uname -a
Linux prophecy 2.6.24.7-92.fc8 #1 SMP Wed May 7 16:50:09 EDT 2008 i686 i686 i386
GNU/Linux

Given the date in uname it seems like this is not my kernel.  I tried
rebooting and didn't see my kernel in the list on the boot loader.  What am I
missing?
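
The date comparison described above can be scripted.  The following is a
minimal sketch, assuming the kernel version string carries a build timestamp
in the same form as the uname output quoted in this comment; build_stamp() is
a hypothetical helper name, and the result still has to be compared by eye
with the time the local rpmbuild finished.

```python
import platform
import re


def build_stamp(version):
    """Extract the build timestamp from a kernel version string.

    Expects something like '#1 SMP Wed May 7 16:50:09 EDT 2008' and returns
    the part from the weekday through the four-digit year, or None if no such
    timestamp is present.
    """
    m = re.search(r"\b(Mon|Tue|Wed|Thu|Fri|Sat|Sun)\b.*\b\d{4}\b", version)
    return m.group(0) if m else None


if __name__ == "__main__":
    # platform.release() and platform.version() report the same release and
    # version fields that uname prints.
    print("release:", platform.release())
    print("built:  ", build_stamp(platform.version()))
```

If the reported build time predates the local build, the running kernel is
most likely the distribution's package rather than the rebuilt one.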

Comment 7 Steven Samorodin 2008-05-30 18:08:52 UTC
To answer my own question: those instructions do work, but of course you have
to reboot.  I've been running with the patch for a day now, and so far so good.
I'll update this bug in a week or so if I don't experience any further crashes.

Comment 8 Michael Chan 2008-05-30 18:44:39 UTC
The debug patch will print some debug information when it sees that the skb 
frags are corrupted.  Please provide us the dmesg as well.

Please check for this message in the dmesg log:

"skb frags corrupted: orig: %d now: %d\n"

Comment 9 Steven Samorodin 2008-05-30 19:06:33 UTC
Created attachment 307231
dmesg

I grepped for "skb frags" but didn't see it in the output of dmesg.  I realize
now that I should have updated the date or version of the tg3 driver just to
ensure that I'm really running your patch.

Comment 10 Steven Samorodin 2008-05-30 19:10:52 UTC
I'd also like to amend the Actual Results portion of the original bug.  I've had
this machine freeze up a few times which is what got me digging around and made
me remember this bug.  I haven't been able to attribute the freezes to anything
other than this problem.  The freeze appears to be a total software freeze of
the box.  It is not ping'able, mouse/keyboard are entirely frozen out, I can't
switch to another virtual console, and I can't ctrl-alt-delete to reboot.  I
pretty much have to hold down the power button and reboot.

I'm not 100% sure that the tg3 issue is the cause of those freezes, but right
now it is my leading suspect.

Comment 11 Michael Chan 2008-05-30 20:28:29 UTC
If you used to see this in your dmesg before the patch:

tg3: eth0: The system may be re-ordering memory-mapped I/O cycles to the 
network device, attempting to recover.  

and the system eventually crashed, then the patch can be used to confirm that
the message above and the eventual crash were caused by SKB corruption.

With the patch, you'll see "skb frags corrupted" instead of the "re-ordering" 
message and the crash.
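
A saved dmesg or /var/log/messages file can be checked for both messages in
one pass.  This is a rough sketch, assuming the patched driver prints the
warning exactly in the format quoted in comment #8 and that each message
occupies a single log line; scan_log() and the result keys are hypothetical
names chosen for illustration.

```python
import re
import sys

# Format quoted in comment #8: "skb frags corrupted: orig: %d now: %d"
FRAGS_RE = re.compile(r"skb frags corrupted: orig: (\d+) now: (\d+)")
# Fragment of the original tg3 warning from the bug description.
REORDER_TEXT = "re-ordering memory-mapped I/O cycles"


def scan_log(text):
    """Return the (orig, now) pairs from each 'skb frags corrupted' warning
    and the number of lines containing the 're-ordering' warning."""
    corrupted = [(int(m.group(1)), int(m.group(2)))
                 for m in FRAGS_RE.finditer(text)]
    reordering = sum(1 for line in text.splitlines() if REORDER_TEXT in line)
    return {"frags_corrupted": corrupted, "reordering_warnings": reordering}


if __name__ == "__main__":
    # Example: dmesg | python3 scan_log.py
    print(scan_log(sys.stdin.read()))
```

An empty frags_corrupted list together with zero reordering_warnings only
shows that neither condition was logged during the period covered by the file.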

Comment 12 Steven Samorodin 2008-06-24 17:23:54 UTC
I've been running the patch since 5/29/08 and I've yet to see the "skb frags
corrupted" message in /var/log/messages.  I still have intermittent hangs of my
system with nothing in the logs.  I'm starting to think that maybe, despite
this being a new machine, I have some bad RAM.  So I'm not sure what to tell
you concerning this patch.  It may well have made things better for me, but
this is still the least stable Linux box I've ever had.

Comment 13 Michael Chan 2008-06-25 00:35:01 UTC
I think the issue causing the original symptoms in this BZ has been found.  
See this thread discussing the same issue on the BNX2 driver.

http://marc.info/?t=121362387400001&r=1&w=2

In a nutshell, the TSO code can change an skb while it is still queued in the 
driver, causing the BNX2 driver to crash and the TG3 driver to first print out 
the "re-ordering" message and then crash.  The fix is going to be in the 
netstack.

Comment 14 Bug Zapper 2008-11-26 08:48:24 UTC
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 15 Bug Zapper 2009-01-09 07:29:55 UTC
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.