Bug 227420 - sky2 eth0: tx timeout
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: kernel
Version: 6
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Kernel Maintainer List
QA Contact: Brian Brock
Keywords: Reopened
Reported: 2007-02-05 16:35 EST by Roland McGrath
Modified: 2009-10-18 23:47 EDT (History)
Fixed In Version: 2.6.20-1.2948.fc6
Doc Type: Bug Fix
Last Closed: 2007-05-05 05:05:46 EDT

Attachments
A shell script to monitor for and attempt to fix a sky2 driver wedge (2.09 KB, text/plain)
2007-05-05 11:34 EDT, Douglas Needham
A rc script to start sky2mon (807 bytes, application/octet-stream)
2007-05-05 11:35 EDT, Douglas Needham

Description Roland McGrath 2007-02-05 16:35:31 EST
Description of problem:

sky2 ethernet interface wedges sometimes, maybe under heavy load, maybe
randomly.  Sometimes rmmod sky2;modprobe sky2 makes it OK.  Sometimes it wedges
or shows "BUG: soft lockup detected on CPU#0!" so hard-reboot is needed.
Sometimes it says:

sky2 eth0: tx timeout
sky2 status report lost?


Version-Release number of selected component (if applicable):
kernel-2.6.19-1.2895.fc6 (i686)

Hardware is a new Intel Mac mini.  Driver says:

sky2 v1.10 addr 0x90200000 irq 17 Yukon-EC (0xb6) rev 2
Comment 1 Roland McGrath 2007-02-05 19:07:34 EST
I just had the interface wedge with no kernel messages and no crash.
It just stopped exchanging packets.  rmmod sky2;modprobe sky2 fixed it.
Comment 2 Chuck Ebbert 2007-03-20 11:09:43 EDT
Should be fixed now. A major sky2 update went into 2.6.19-1.2911.x and all those
updates are in current 2.6.20-1.2933. Reopen if you see problems.
Comment 3 Roland McGrath 2007-03-20 16:52:52 EDT
I do still get interface lockups on a daily basis using 2.6.19-1.2911.6.5.fc6.
I have not seen the "tx timeout" message in a long time.  But the behavior is
the same: packets stop flowing until I rmmod and modprobe sky2.

Linux patootie 2.6.19-1.2911.6.5.fc6 #1 SMP Sun Mar 4 16:41:13 EST 2007 i686
i686 i386 GNU/Linux
Comment 4 Chuck Ebbert 2007-03-20 16:57:13 EDT
Can you test 2.6.20-1.2933 ? Even more fixes went into that one.
I thought Stephen Hemminger's patchset for 2.6.19 was supposed to
fix everything, though.
Comment 5 Roland McGrath 2007-03-20 22:59:56 EDT
I just got the same silent lossage behavior with 2.6.20-1.2933.fc6
(rmmod;modprobe fixed it).  There were no complaints at all in dmesg.
Comment 6 Roland McGrath 2007-03-21 02:20:22 EDT
When I left it in the nonworking state for long enough, it did produce lots of
messages like this:

NETDEV WATCHDOG: eth0: transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 272 .. 249 report=295 done=295
sky2 status report lost?
NETDEV WATCHDOG: eth0: transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 295 .. 272 report=295 done=295
sky2 hardware hung? flushing
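
Since the failure is sometimes silent and sometimes produces the messages above, it can help to tally them when comparing kernels. The following is a minimal sketch, not part of the original report: the function name `sky2_log_summary` is hypothetical, and the patterns are taken only from the message text quoted in this comment.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints how often
# each sky2 timeout-related message (as quoted above) appears.
sky2_log_summary() {
  awk '
    /NETDEV WATCHDOG: .*transmit timed out/ { watchdog++ }
    /sky2 .*: tx timeout/                   { timeout++ }
    /sky2 status report lost\?/             { lost++ }
    /sky2 hardware hung\? flushing/         { hung++ }
    END {
      # Counters that never matched print as 0.
      printf "watchdog=%d tx_timeout=%d report_lost=%d hw_hung=%d\n",
             watchdog, timeout, lost, hung
    }'
}
```

A typical use would be `dmesg | sky2_log_summary`; a line of all zeros corresponds to the "silent" variant of the hang described in comment 5.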

Comment 7 Chuck Ebbert 2007-03-21 09:39:10 EDT
There is a new patch for tx timeout errors.  I'll put it in a test
kernel.
Comment 8 Chuck Ebbert 2007-03-26 10:45:40 EDT
Test kernels (version 1.2937) for this issue are at:

http://people.redhat.com/cebbert

Please test and report back.

(Stephen Hemminger says this problem is triggered by faulty switches.)
Comment 9 Roland McGrath 2007-03-26 21:57:33 EDT
No change with 1.2937 for sky2.  It did make X stop working, however.
Comment 10 Chuck Ebbert 2007-03-27 11:27:51 EDT
Can you figure out which patch broke X? Any help in the X logs?

I see 2933 worked. There are 19 new patches in 1.2937 (#1800-1818)
plus a couple that came with final 2.6.20.4. The 19 new ones are most
suspect...
Comment 11 Roland McGrath 2007-03-27 13:20:08 EDT
Actually I don't care about X on that machine.  I just happened to notice it.
I can compare the logs from working and nonworking.
Comment 12 Roland McGrath 2007-03-27 22:24:41 EDT
Well, X started working.  Go figure.  When I killed gdm to make it try after
having given up, it just worked.  I figured it was the first try at boot that
must be the problem, but when I rebooted it worked fine.

I still don't have any improvement with the sky2 behavior, which is what I'm
here for.
Comment 13 Chuck Ebbert 2007-03-28 09:24:24 EDT
Stephen Hemminger says it's broken switches that cause the problem and since his
switches work he can't reproduce it. The patch that went in is the latest
attempt to fix this but he's working in the dark...
Comment 14 Douglas Needham 2007-03-28 12:07:43 EDT
What sorts of "broken switches" (compile, or network)?  If we are talking about
network switches, I would appreciate knowing the details, as I work with the
network protocols as a part of my job, and have worked on several NIC drivers
for other OSes including UW, and can do low level network traces.  I have been
experiencing this same problem for several months now, but the frequency varies.
 Sometimes, I have run a decent load for 24-48 hours with no problem, and
sometimes I need only have the machine running 10 mins before the network
interface hangs.  Right now, my new machine is unusable using FC6, though no
problems have been seen under XP (gag) so far.  And other machines on this same
segment (which is a Linksys 8-port workgroup hub which consolidates several
machines to a single port of a DES-3226 switch elsewhere in my house), using FC6
and other NICs have no problem.

I would also try the 2933 kernel, but this machine also uses the nvidia driver,
and I don't really have time at this moment to go messing with the machine to
get X running without this driver.
Comment 15 Chuck Ebbert 2007-03-28 12:28:57 EDT
I'm not sure what more can be done, short of taking the bug reports upstream.
The maintainer cannot reproduce the TX timeouts on his hardware:

http://lkml.org/lkml/2007/3/21/304
Comment 16 Roland McGrath 2007-03-28 15:29:29 EDT
The switch I use is a new one I bought at the same time I bought the machine
with the sky2.  I bought it because it was cheap, so I don't have a hard time
believing there is something crappy about it.  OTOH, it was cheap enough ($30)
that anyone wanting to debug could just get one.  Also, I've never noticed any
problem whatsoever with all the other NIC flavors I have plugged into the same
switch.
Comment 17 Chuck Ebbert 2007-03-29 11:36:49 EDT
Discussion is ongoing on linux-kernel:

http://lkml.org/lkml/2007/3/16/369

Comment 18 Andrew Bartlett 2007-03-30 08:04:10 EDT
I've got this with 2.6.20-1.2933.fc6 and an older netgear 100Mbit switch (no
doubt cheap and nasty :-).

Locks up regularly with NFS home directories.  I'm currently using a script
found in the centos bugzilla to autorestart the NIC (before, rather than after,
my wife complains/reboots).

Aside from NFS complaining, I've not seen other messages from the kernel.

http://bugs.centos.org/view.php?id=1540
Comment 19 Chuck Ebbert 2007-04-17 19:02:36 EDT
Starting with version 2945, the kernel will have a new set of fixes
from 2.6.20.7.

Please test when the new kernels become available.
Comment 20 Roland McGrath 2007-04-25 07:11:50 EDT
Is kernel-2.6.20-1.2945.fc6 ever getting pushed to updates-testing?  It's been
built for some time but not published.
Comment 21 Chuck Ebbert 2007-04-25 11:05:14 EDT
(In reply to comment #20)
> Is kernel-2.6.20-1.2945.fc6 ever getting pushed to updates-testing?  It's been
> built for some time but not published.

Probably not until more fixes go in.
Comment 22 George Shearer 2007-05-03 23:33:39 EDT
sky2 still hangs on occasion on my Gigabyte GA-P965-DS3... rmmod/insmod fixes...
Using the newly released 2.6.20-1.2948 kernel. :(

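#!/bin/sh
# Watchdog: reload the sky2 module whenever the test host stops answering pings.
export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin
while true; do
  # TEST is non-empty only when all three pings were lost.
  TEST=$(ping -c 3 192.168.1.254 | grep '100% packet loss')
  if [ -n "${TEST}" ]; then
    # Take the interface down, then reload the driver to clear the hang.
    ifdown eth0
    rmmod sky2
    modprobe sky2
    sleep 5
    # Notify by mail and print a timestamped line for the record.
    echo sky2 bug happened again | mail -s sky2 doc
    echo Lame... $(date)
  fi
  # Check once a minute.
  sleep 60
done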
Comment 23 Chuck Ebbert 2007-05-04 10:21:36 EDT
sky2 chip 88e8056 on Gigabyte motherboards is now blacklisted in 2.6.21.
Driver won't even try to work with it...
Comment 24 Roland McGrath 2007-05-05 05:05:46 EDT
2.6.20-1.2948.fc6 has now been running without the sky2 problem for far longer
than any other kernel has managed recently.  I think the problem I had is fixed.
Comment 25 Douglas Needham 2007-05-05 11:34:21 EDT
Created attachment 154206 [details]
A shell script to monitor for and attempt to fix a sky2 driver wedge

This script is the workhorse of monitoring for the sky2 driver wedge bug, and
attempting to recover from it.  Instances of restarting the current X session
have been seen when the bug triggers the restart, but this does not happen
every time.  Just so you are aware.
Comment 26 Douglas Needham 2007-05-05 11:35:34 EDT
Created attachment 154207 [details]
A rc script to start sky2mon
Comment 27 Douglas Needham 2007-05-05 11:41:21 EDT
I am still having the problems with my Asus P5W DH, which uses the 88e8053
chipset.  The first re-occurrence of the problem was within the first few hours
of updating to 2948 and rebooting, but it does seem to be less frequent.  Since
George was nice enough to share his nice script, I have attached my script for
others to use.  Main differences would be that I used syslog instead of mail
(but I like the mail idea as a backup notification channel), and I ping a list
of hosts instead of a single host to avoid a false detection.  It also now keeps
track of how many iterations of the monitoring loop have occurred before a
reset, as well as the time interval, so that some metric can be applied to the
resets.

This one is about as difficult to solve as one I worked on with another ethernet
driver at a major UNIX shop.  Hope we can fix this one in less time.
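
The differences described above (several ping targets, syslog notification, and a count of loop iterations and elapsed time per reset) could look roughly like the sketch below. This is not the attached sky2mon script, whose contents are not shown here: the names `all_hosts_down`, `reset_message`, `monitor`, `HOSTS` and `PING_CMD` are hypothetical, the addresses are placeholders, and the reset sequence simply mirrors the script in comment 22.

```shell
#!/bin/sh
# Hypothetical sketch only; NOT the attached sky2mon script.
# HOSTS holds placeholder addresses; replace them with hosts on your network.
HOSTS="${HOSTS:-192.168.1.1 192.168.1.254}"
# Overridable so the detection logic can be exercised without a network.
PING_CMD="${PING_CMD:-ping -c 3 -q}"

# Returns 0 only if every listed host fails to answer, so that one host
# being down is not mistaken for a wedged interface.
all_hosts_down() {
  [ "$#" -gt 0 ] || return 1          # an empty list never counts as a wedge
  for host in "$@"; do
    if $PING_CMD "$host" >/dev/null 2>&1; then
      return 1                        # at least one host answered
    fi
  done
  return 0
}

# Builds the log line: $1 = checks since last reset, $2 = seconds since then.
reset_message() {
  echo "sky2 wedge detected after $1 checks ($2 s since last reset); reloading driver"
}

# Main loop; must run as root. Not started automatically by this file.
monitor() {
  count=0
  last=$(date +%s)
  while true; do
    count=$((count + 1))
    # $HOSTS is deliberately unquoted so it splits into one argument per host.
    if all_hosts_down $HOSTS; then
      now=$(date +%s)
      logger -t sky2mon "$(reset_message "$count" "$((now - last))")"
      ifdown eth0
      rmmod sky2
      modprobe sky2
      sleep 5
      count=0
      last=$now
    fi
    sleep 60
  done
}
```

Requiring every host to be unreachable trades a slightly slower detection for fewer spurious driver reloads, which matters because each reload interrupts all traffic on the interface.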
Comment 28 George Shearer 2007-05-09 09:33:12 EDT
(In reply to comment #24)
> 2.6.20-1.2948.fc6 has now been running without the sky2 problem for far longer
> than any other kernel has managed recently.  I think the problem I had is fixed.
> 

Actually, I noted in comment #22 that this problem had occurred on my Gigabyte
board, which is true. But, it has only happened ONE TIME since boot. It's been
up for nearly 6 days now without a problem. This is a vast improvement over
previous kernels.
Comment 29 George Shearer 2007-05-18 15:16:17 EDT
(In reply to comment #28)
> (In reply to comment #24)
> > 2.6.20-1.2948.fc6 has now been running without the sky2 problem for far longer
> > than any other kernel has managed recently.  I think the problem I had is fixed.
> > 
> 
> Actually, I noted in comment #22 that this problem had occurred on my Gigabyte
> board, which is true. But, it has only happened ONE TIME since boot. It's been
> up for nearly 6 days now without a problem. This is a vast improvement over
> previous kernels.
> 

Ehh.. Unfortunately, I have to report that it has happened a few more times.
It's not nearly as bad as it used to be. However, occurrence rate definitely
increases with bandwidth utilization.
Comment 30 George Shearer 2007-05-26 12:26:19 EDT
Strangely, the sky2 hang is far more likely to occur when transferring large
files between my server (which has the sky2 eth0) and a wireless notebook.

It's less likely to occur when transferring the same file to or from a wired
workstation. Strange, because, the wired workstation can achieve as much as 60
MB/sec rates while the wireless workstation may only achieve around 2.6MB/sec.

If a period of time goes by, say, a few days with relatively no network activity
on the server.. and if I suddenly start transferring a file via a wireless
notebook, the sky2 hang will happen almost every time predictably.
Comment 31 George Shearer 2007-06-07 11:11:39 EDT
Sadly, I must report that this bug still exists on kernel 2.6.21-1.394 on Fedora
7. It happens with great reliability anytime there is sustained network load.
The script in comment #22 still fixes the hang.

Over the course of 8 hours or so, with an average incoming ethernet speed of 14
megabits/sec (downloading stuff), here is the frequency...

Lame... Wed Jun 6 23:10:00 EDT 2007
Lame... Wed Jun 6 23:14:23 EDT 2007
Lame... Thu Jun 7 00:23:46 EDT 2007
Lame... Thu Jun 7 01:01:13 EDT 2007
Lame... Thu Jun 7 03:25:08 EDT 2007
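
To turn a log like the one above into a rough measure of how often the hang occurs, the gaps between consecutive entries can be computed. The sketch below is an illustration, not something posted in the original thread: the function name `hang_intervals` is hypothetical, and it assumes GNU `date` (for the `-d` option) and the "Lame... <date output>" line format produced by the script in comment 22.

```shell
#!/bin/sh
# Hypothetical helper: reads lines such as
#   Lame... Wed Jun 6 23:10:00 EDT 2007
# on stdin and prints the number of seconds between consecutive entries.
# Assumes GNU date, which can parse the timestamp via -d.
hang_intervals() {
  prev=""
  # The first word ("Lame...") goes to _tag, the rest of the line to stamp.
  while read -r _tag stamp; do
    now=$(date -d "$stamp" +%s) || continue   # skip lines date cannot parse
    if [ -n "$prev" ]; then
      echo $((now - prev))
    fi
    prev=$now
  done
}
```

With the watchdog output saved to a file, `hang_intervals < sky2.log` prints one interval per line (the file name is a placeholder).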

Comment 32 George Shearer 2007-06-18 15:59:57 EDT
This problem still exists.. with no change.. on kernel 2.6.21-1.3228.fc7.i686 :(
Comment 33 George Shearer 2007-07-21 14:37:17 EDT
This bug still exists in kernel 2.6.22.1-27.fc7.i686 :(
Comment 34 George Shearer 2007-08-02 11:44:31 EDT
This bug still exists in kernel 2.6.22.1-33.fc7.i686 :( 

Hello? Am I the only one using the sky2 driver? Is there some alternative I
don't know about?
Comment 35 Andrew Bartlett 2007-08-02 20:39:56 EDT
In my experience the degree of trouble you have with this NIC is inversely
proportional to the price you pay for the NIC.

My wife endured a world of pain on our mac mini, before we took the trip across
town to replace the cheap 100Mbit switch with a Linksys Gigabit part.

I've posted the faulty switch to the developer, but he didn't manage to
reproduce anything on the current drivers.  
Comment 36 George Shearer 2007-08-05 17:13:48 EDT
This is not a switch issue. Same exact hardware works fine running Winblows.
This is a straight up bug in the sky2 driver, and it has existed for some time now.
Comment 37 Scott A. Hughes 2009-10-18 23:47:18 EDT
This bug seems to have been closed for some time, but I can confirm that the problem still occurs with the most recent version of RHEL4. It does in fact seem to be a switch-related issue.

I am using Red Hat Enterprise Linux ES release 4 (Nahant Update 8) with all previously-released errata relevant to my system applied.

The version of the kernel RPM I am using is as follows.

kernel-2.6.9-89.0.11.EL

The output of "uname -r" is as follows.

2.6.9-89.0.11.ELsmp

I came across this problem after I connected a switch to my network that I had not used for some time (years).

As reported by others above, the interface using sky2 fails entirely. This problem occurred consistently with the switch (100M network hub) attached, and stopped occurring once I removed it from the network. (A system reboot was required to enable the interface to operate correctly again.) As such, based on my own observations, I can say that a faulty switch/hub does seem to trigger the problem.

This bug has been closed for some time, though evidently not fixed, and I am not able to reopen it myself. I am sending this information in the hope that Red Hat will re-examine the issue.

I plan to dispose of the problematic hub, however I would be happy to send it (for free) to Red Hat (or whoever would use it for testing) if that were to help solve/fix the problem.
