Description of problem:
sky2 ethernet interface wedges sometimes, maybe under heavy load, maybe randomly. Sometimes rmmod sky2;modprobe sky2 makes it OK. Sometimes it wedges or shows "BUG: soft lockup detected on CPU#0!" so hard-reboot is needed. Sometimes it says:
sky2 eth0: tx timeout
sky2 status report lost?
Version-Release number of selected component (if applicable):
kernel-2.6.19-1.2895.fc6 (i686)
Hardware is a new Intel Mac mini. Driver says:
sky2 v1.10 addr 0x90200000 irq 17 Yukon-EC (0xb6) rev 2
I just had the interface wedge with no kernel messages and no crash. It just stopped exchanging packets. rmmod sky2;modprobe sky2 fixed it.
Should be fixed now. A major sky2 update went into 2.6.19-1.2911.x and all those updates are in current 2.6.20-1.2933. Reopen if you see problems.
I do still get interface lockups on a daily basis using 2.6.19-1.2911.6.5.fc6. I have not seen the "tx timeout" message in a long time. But the behavior is the same: packets stop flowing until I rmmod and modprobe sky2.
Linux patootie 2.6.19-1.2911.6.5.fc6 #1 SMP Sun Mar 4 16:41:13 EST 2007 i686 i686 i386 GNU/Linux
Can you test 2.6.20-1.2933 ? Even more fixes went into that one. I thought Stephen Hemminger's patchset for 2.6.19 was supposed to fix everything, though.
I just got the same silent lossage behavior with 2.6.20-1.2933.fc6 (rmmod;modprobe fixed it). There were no complaints at all in dmesg.
When I left it in the nonworking state for long enough, it did produce lots of messages like this:
NETDEV WATCHDOG: eth0: transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 272 .. 249 report=295 done=295
sky2 status report lost?
NETDEV WATCHDOG: eth0: transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 295 .. 272 report=295 done=295
sky2 hardware hung? flushing
There is a new patch for tx timeout errors. I'll put it in a test kernel.
Test kernels (version 1.2937) for this issue are at: http://people.redhat.com/cebbert Please test and report back. (Stephen Hemminger says this problem is triggered by faulty switches.)
No change with 1.2937 for sky2. It did make X stop working, however.
Can you figure out which patch broke X? Any help in the X logs? I see 2933 worked. There are 19 new patches in 1.2937 (#1800-1818) plus a couple that came with final 2.6.20.4. The 19 new ones are most suspect...
Actually I don't care about X on that machine. I just happened to notice it. I can compare the logs from working and nonworking.
Well, X started working. Go figure. When I killed gdm to make it try after having given up, it just worked. I figured it was the first try at boot that must be the problem, but when I rebooted it worked fine. I still don't have any improvement with the sky2 behavior, which is what I'm here for.
Stephen Hemminger says it's broken switches that cause the problem and since his switches work he can't reproduce it. The patch that went in is the latest attempt to fix this but he's working in the dark...
What sorts of "broken switches" (compile or network)? If we are talking about network switches, I would appreciate knowing the details, as I work with the network protocols as a part of my job, have worked on several NIC drivers for other OSes including UW, and can do low-level network traces. I have been experiencing this same problem for several months now, but the frequency varies. Sometimes I have run a decent load for 24-48 hours with no problem, and sometimes the machine needs to be running for only 10 minutes before the network interface hangs. Right now, my new machine is unusable under FC6, though no problems have been seen under XP (gag) so far. Other machines on the same segment (a Linksys 8-port workgroup hub which consolidates several machines to a single port of a DES-3226 switch elsewhere in my house), running FC6 with other NICs, have no problem. I would also try the 2933 kernel, but this machine also uses the nvidia driver, and I don't really have time at this moment to go messing with the machine to get X running without this driver.
I'm not sure what more can be done, short of taking the bug reports upstream. The maintainer cannot reproduce the TX timeouts on his hardware: http://lkml.org/lkml/2007/3/21/304
The switch I use is a new one I bought at the same time I bought the machine with the sky2. I bought it because it was cheap, so I don't have a hard time believing there is something crappy about it. OTOH, it was cheap enough ($30) that anyone wanting to debug could just get one. Also, I've never noticed any problem whatsoever with all the other NIC flavors I have plugged into the same switch.
Discussion is ongoing on linux-kernel: http://lkml.org/lkml/2007/3/16/369
I've got this with 2.6.20-1.2933.fc6 and an older netgear 100Mbit switch (no doubt cheap and nasty :-). Locks up regularly with NFS home directories. I'm currently using a script found in the centos bugzilla to autorestart the NIC (before, rather than after, my wife complains/reboots). Aside from NFS complaining, I've not seen other messages from the kernel. http://bugs.centos.org/view.php?id=1540
Starting with version 2945, the kernel will have a new set of fixes from 2.6.20.7. Please test when the new kernels become available.
Is kernel-2.6.20-1.2945.fc6 ever getting pushed to updates-testing? It's been built for some time but not published.
(In reply to comment #20)
> Is kernel-2.6.20-1.2945.fc6 ever getting pushed to updates-testing? It's been
> built for some time but not published.
Probably not until more fixes go in.
sky2 still hangs on occasion on my Gigabyte GA-P965-DS3... rmmod/insmod fixes... Using the newly released 2.6.20-1.2948 kernel. :(
#!/bin/sh
export PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin
while true; do
    TEST=`ping -c 3 192.168.1.254 | grep '100% packet loss'`
    if [ -n "${TEST}" ]; then
        ifdown eth0
        rmmod sky2
        modprobe sky2
        sleep 5
        echo sky2 bug happened again | mail -s sky2 doc
        echo Lame... `date`
    fi
    sleep 60
done
sky2 chip 88e8056 on Gigabyte motherboards is now blacklisted in 2.6.21. Driver won't even try to work with it...
2.6.20-1.2948.fc6 has now been running without the sky2 problem for far longer than any other kernel has managed recently. I think the problem I had is fixed.
Created attachment 154206: A shell script to monitor for and attempt to fix a sky2 driver wedge
This script is the workhorse of monitoring for the sky2 driver wedge bug, and attempting to recover from it. Instances of restarting the current X session have been seen when the bug triggers the restart, but this does not happen every time. Just so you are aware.
Created attachment 154207: A rc script to start sky2mon
I am still having the problems with my Asus P5W DH, which uses the 88e8053 chipset. The first re-occurrence of the problem was within the first few hours of updating to 2948 and rebooting, but it does seem to be less frequent. Since George was nice enough to share his nice script, I have attached my script for others to use. Main differences would be that I used syslog instead of mail (but I like the mail idea as a backup notification channel), and I ping a list of hosts instead of a single host to avoid a false detection. It also now keeps track of how many iterations of the monitoring loop have occurred before a reset, as well as the time interval, so that some metric can be applied to the resets. This one is about as difficult to solve as one I worked on with another ethernet driver at a major UNIX shop. Hope we can fix this one in less time.
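For readers without access to the attachment, the approach described above can be sketched roughly as follows. This is not the attached script, only a minimal illustration of the same idea under a few assumptions: the host addresses are placeholders, a wedge is declared only when every host in the list fails to answer, the overridable PING_CMD variable and the trailing ifup are additions for illustration, and the syslog tag reuses the sky2mon name from the rc script.

```shell
#!/bin/sh
# Placeholder host list (assumption): use addresses that should always answer.
HOSTS="192.168.1.254 192.168.1.1"
# PING_CMD can be overridden so the logic can be exercised without a network.
PING_CMD=${PING_CMD:-"ping -c 3 -q"}

# Succeed if at least one host in the list answers.
any_host_alive() {
    for host in $HOSTS; do
        if $PING_CMD "$host" >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

# Log the reset to syslog with the loop count and elapsed seconds,
# then reload the driver (the ifup step is an assumption).
reset_nic() {
    logger -t sky2mon "sky2 wedge after $1 checks, $2 seconds; reloading driver"
    ifdown eth0
    rmmod sky2
    modprobe sky2
    ifup eth0
}

# Poll once a minute, counting iterations and time between resets.
monitor() {
    checks=0
    start=$(date +%s)
    while true; do
        checks=$((checks + 1))
        if ! any_host_alive; then
            now=$(date +%s)
            reset_nic "$checks" "$((now - start))"
            checks=0
            start=$(date +%s)
        fi
        sleep 60
    done
}

# Start the loop only when invoked with "run", so the functions can be sourced.
if [ "${1:-}" = "run" ]; then
    monitor
fi
```

Run as root with the single argument "run". Checking several hosts means that one unreachable peer does not trigger a needless driver reload.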
(In reply to comment #24)
> 2.6.20-1.2948.fc6 has now been running without the sky2 problem for far longer
> than any other kernel has managed recently. I think the problem I had is fixed.
>
Actually, I noted in comment #22 that this problem had occurred on my Gigabyte board, which is true. But, it has only happened ONE TIME since boot. It's been up for nearly 6 days now without a problem. This is a vast improvement over previous kernels.
(In reply to comment #28)
> (In reply to comment #24)
> > 2.6.20-1.2948.fc6 has now been running without the sky2 problem for far longer
> > than any other kernel has managed recently. I think the problem I had is fixed.
> >
>
> Actually, I noted in comment #22 that this problem had occurred on my Gigabyte
> board, which is true. But, it has only happened ONE TIME since boot. It's been
> up for nearly 6 days now without a problem. This is a vast improvement over
> previous kernels.
>
Ehh.. Unfortunately, I have to report that it has happened a few more times. It's not nearly as bad as it used to be. However, the occurrence rate definitely increases with bandwidth utilization.
Strangely, the sky2 hang is far more likely to occur when transferring large files between my server (which has the sky2 eth0) and a wireless notebook. It's less likely to occur when transferring the same file to or from a wired workstation. Strange, because, the wired workstation can achieve as much as 60 MB/sec rates while the wireless workstation may only achieve around 2.6MB/sec. If a period of time goes by, say, a few days with relatively no network activity on the server.. and if I suddenly start transferring a file via a wireless notebook, the sky2 hang will happen almost every time predictably.
Sadly, I must report that this bug still exists on kernel 2.6.21-1.394 on Fedora 7. It happens with great reliability anytime there is sustained network load. The script in comment #22 still fixes the hang. Over the course of 8 hours or so, with an average incoming ethernet speed of 14 megabits/sec (downloading stuff), here is the frequency...
Lame... Wed Jun 6 23:10:00 EDT 2007
Lame... Wed Jun 6 23:14:23 EDT 2007
Lame... Thu Jun 7 00:23:46 EDT 2007
Lame... Thu Jun 7 01:01:13 EDT 2007
Lame... Thu Jun 7 03:25:08 EDT 2007
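To put numbers on that frequency, the gaps between the reset timestamps above can be worked out with a few lines of shell. This is only a sketch: the function names are made up for illustration, it assumes the times are in order and less than 24 hours apart, and since the watchdog polls once a minute each gap is only accurate to about a minute.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
# ${x#0} strips one leading zero so "08" is not parsed as invalid octal.
to_secs() {
    old_ifs=$IFS
    IFS=:
    set -- $1
    IFS=$old_ifs
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# Print the gap in seconds between each pair of consecutive times.
reset_gaps() {
    prev=""
    for t in "$@"; do
        cur=$(to_secs "$t")
        if [ -n "$prev" ]; then
            gap=$(( cur - prev ))
            # A negative gap means the clock passed midnight.
            if [ "$gap" -lt 0 ]; then
                gap=$(( gap + 86400 ))
            fi
            echo "$gap"
        fi
        prev=$cur
    done
}

# The five reset times from the log above.
reset_gaps 23:10:00 23:14:23 00:23:46 01:01:13 03:25:08
```

For these five resets the gaps come out to 263, 4163, 2247 and 8635 seconds, so anywhere from under five minutes to almost two and a half hours.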
This problem still exists.. with no change.. on kernel 2.6.21-1.3228.fc7.i686 :(
This bug still exists in kernel 2.6.22.1-27.fc7.i686 :(
This bug still exists in kernel 2.6.22.1-33.fc7.i686 :( Hello? Am I the only one using the sky2 driver? is there some alternative I don't know about..
In my experience the degree of trouble you have with this NIC is inversely proportional to the price you pay for the NIC. My wife endured a world of pain on our mac mini, before we took the trip across town to replace the cheap 100Mbit switch with a Linksys Gigabit part. I've posted the faulty switch to the developer, but he didn't manage to reproduce anything on the current drivers.
This is not a switch issue. Same exact hardware works fine running Winblows. This is a straight up bug in the sky2 driver, and it has existed for some time now.
This bug seems to have been closed for some time, but I can confirm that the problem still occurs with the most recent version of RHEL4. It does in fact seem to be a switch-related issue. I am using Red Hat Enterprise Linux ES release 4 (Nahant Update 8) with all previously released errata relevant to my system applied. The version of the kernel RPM I am using is:
kernel-2.6.9-89.0.11.EL
The output of "uname -r" is:
2.6.9-89.0.11.ELsmp
I came across this problem after I connected a switch to my network that I had not used for some time (years). As reported by others above, the interface using sky2 fails entirely. This problem occurred consistently with the switch (a 100M network hub) attached, and stopped occurring once I removed it from the network. (A system reboot was required to get the interface operating correctly again.) Based on my own observations, a faulty switch/hub does seem to trigger the problem. This bug has been closed for some time, evidently without being fixed, and I am not able to reopen it myself, so I am sending this information in the hope that Red Hat will re-examine the issue. I plan to dispose of the problematic hub, but I would be happy to send it (for free) to Red Hat (or whoever would use it for testing) if that would help fix the problem.