Description of problem:
sky2 transmitter lockup. All networking stops every few days; a reboot is necessary. The problem and its solution are described in: http://bugzilla.kernel.org/show_bug.cgi?id=6839
Version-Release number of selected component (if applicable):
We have problems with the Marvell 88E8053 gigabit Ethernet interface. This interface is used on all recent ASUS motherboards (e.g. P5W DH Deluxe). As a consequence, RHEL networking is unreliable on all these systems.
How reproducible:
Difficult, but the solution is known :-) A patch that fixes this bug is available in kernel 2.6.18.1. Can this patch be made available for RHEL 4.4 (backported to kernel 2.6.9)?
Confirmed (seeing the same problem). See also: http://marc.theaimsgroup.com/?l=linux-netdev&m=116227589707824&w=2
*** Bug 216801 has been marked as a duplicate of this bug. ***
Created attachment 143079 [details] upstream-fix Upstream commit that resolves this is: 470ea7eba4aaa517533f9b02ac9a104e77264548
Comment on attachment 143079 [details] upstream-fix Wrong file....
Created attachment 143081 [details] correct patch correct upstream fix
In comment of patch (id=143081) the comment states: "Only the Yukon-FE chip is Marvell 88E803X (10/100 only) are affected." We have trouble with a different chipset: Marvell 88E8053 (gigabit) So I am wondering whether the comment is wrong (has to be updated?) or does this patch not solve this particular problem?
I wondered that too, but there doesn't seem to be anything specific to that hardware in the patch, so I'm adding to some new test kernels. You should have something to test on a box in a few hours.
Test kernels with the attached patch are available here: http://people.redhat.com/agospoda/#rhel4 And feedback would be greatly appreciated.
Do you have a RHEL4 smp i686 kernel for me? I prepared a test environment, but there is no 32 bits test kernel (only 64 bits). Currently running: 2.6.9-42.0.3.ELsmp #1 SMP Fri Oct 6 06:21:39 CDT 2006 i686 i686 i386 GNU/Linux
Sorry about that. My gtest.5 builds did not include i686 for some reason. Here is a link to an older kernel that should work for you. Please let me know if you need any other versions: http://people.redhat.com/agospoda/bz/216799/
Updated kernels (including the 32-bit builds!) are available here: http://people.redhat.com/agospoda/#rhel4
Installed the 2.6.9-42.29.EL.gtest.4smp kernel on our server about one week ago. So far, no problems have occurred. Last night, I did a stress test between two servers. Only one of these servers has the Marvell network interface. I copied 1 Terabyte of data from server 1 to server 2 while simultaneously copying 1 Terabyte of data from server 2 to server 1. So, heavy load in both directions for a long period. No problems occurred during this test. It is no proof, but I am pretty confident that the patch solves the problem. Thanks for your effort to provide the patch.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Having lots of issues with the sky2 timeout under heavy traffic. Using an Intel® Server Board SE7520BB2.
lspci | grep Marvell
04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8050 PCI-E ASF Gigabit Ethernet Controller (rev 18)
I was running the latest kernel, 2.6.9-42.ELsmp #1 SMP, when this was encountered. Traffic of over 10Mbps would cause a timeout in about 15-30 minutes:
kernel: NETDEV WATCHDOG: eth1: transmit timed out
kernel: sky2 eth1: tx timeout
kernel: sky2 status report lost?
Server[4640]: Failed to open log file, log aborted.
I tested 2.6.9-55.EL.gtest.19smp and the server seemed to be solid, but only for 4-5 hours before I got similar error messages again:
kernel: NETDEV WATCHDOG: eth1: transmit timed out
kernel: sky2 eth1: tx timeout
kernel: sky2 hardware hung? flushing
Did the hardware recover on its own or did you need to reboot and or unload/reload the module to make the sky2 device operational again?
Two ways to fix the problem:
1. Unload and reload the module: "# rmmod sky2 && modprobe sky2" gets the network working right away.
2. Rebooting the server also fixes the issue.
The hardware was not able to recover on its own after being left on for 12 hours.
Frank
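The reload workaround described above could be automated along the following lines. This is only a minimal sketch: it assumes the kernel log lines look like the ones quoted in this report, and the function names and the dry-run default are illustrative choices, not part of any existing tool.

```python
import subprocess

# Message fragments quoted in this report that accompany a hung sky2 transmitter.
HANG_MARKERS = ("tx timeout", "hardware hung", "status report lost")


def is_sky2_hang(line):
    """Return True if a kernel log line looks like a sky2 transmit hang."""
    return "sky2" in line and any(marker in line for marker in HANG_MARKERS)


def reload_commands(module="sky2"):
    """Commands equivalent to `rmmod sky2 && modprobe sky2`."""
    return [["rmmod", module], ["modprobe", module]]


def reload_driver(dry_run=True):
    """Run the reload commands; with dry_run=True only return them."""
    commands = reload_commands()
    if not dry_run:
        for command in commands:
            # check=True stops at the first failure, like `&&` in the shell.
            subprocess.run(command, check=True)
    return commands
```

A watcher would call `is_sky2_hang()` on each new log line and, after a match, call `reload_driver(dry_run=False)` as root. Reloading drops every connection on the interface, so this is a stopgap rather than a fix.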
Someone managed to backport an upstream sky2 driver to RHEL4. You can find the srpm here: http://people.redhat.com/nhorman/rpms/kernel-2.6.9-55.3.EL.bz228733.src.rpm
if someone can confirm that the kernel andy referenced in comment #19 fixes this issue, I can propose it for 4.6
I built and installed kernel-smp-2.6.9-55.3.EL.bz228733.i686.rpm on 3 different servers (identical hardware). I get the following in /var/log/messages at boot time (on each server):
Jun 14 13:07:34 el4-node1 kernel: sky2: probe of 0000:02:00.0 failed with error -125173760
Jun 14 13:09:34 el4-node2 kernel: sky2: probe of 0000:02:00.0 failed with error -125173760
Jun 14 13:07:21 el4-node3 kernel: sky2: probe of 0000:02:00.0 failed with error -125173760
And the network device doesn't exist; eth0 becomes the e1000 NIC that is normally eth1.
my bad. Looks like the probe routine changed function signatures upstream, and as a result was returning void where an integer was expected. This link: http://people.redhat.com/nhorman/rpms/kernel-2.6.9-55.3.EL.bz228733.2.src.rpm Is a new srpm that holds the fix for that. Let me know how it goes. Thanks!
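A quick way to see that the reported probe error was a leftover register value rather than a real error code is to reinterpret it as an unsigned 32-bit number. The sketch below is only illustrative arithmetic; the helper name is made up for this note.

```python
def as_u32_hex(value):
    """Show a signed 32-bit kernel return value as unsigned hexadecimal."""
    return f"0x{value & 0xFFFFFFFF:08x}"


# The error printed by the failed probe in the boot logs.
print(as_u32_hex(-125173760))  # 0xf88a0000
```

Real errno codes are small negative integers, whereas 0xf88a0000 looks like a kernel address, which fits the explanation that a function declared void was called where an int return was expected.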
I built and installed kernel-smp-2.6.9-55.3.EL.bz228733.2.i686.rpm on the same 3 servers as comment #23. I get a kernel panic when eth0 is initialized:
Unable to handle kernel NULL pointer dereference at virtual address 0000017c
printing eip:
c0282286
*pde = 35939001
Oops: 0000 [#1]
SMP
Modules linked in: netconsole netdump parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc dm_mirror dm_mod button battery ac ftdi_sio usbserial uhci_hcd ehci_hcd hw_random sky2 e1000 ext3 jbd ata_piix libata sd_mod scsi_mod
CPU: 2
EIP: 0060:[<c0282286>] Not tainted VLI
EFLAGS: 00010292 (2.6.9-55.3.EL.bz228733.2smp)
EIP is at netif_receive_skb+0x19/0x2ec
eax: f4ee2e80 ebx: f5a58800 ecx: 00000608 edx: 00000000
esi: f4ee2e80 edi: 0000003c ebp: f5a58a40 esp: c03d3f64
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo=c03d3000 task=f7e40b30)
Stack: f4ee2e80 00000001 f4ee2e80 f5a58800 f4ee2e80 0000003c f5a58a40 f88abb28
       00000282 003c0300 f55a8010 00000003 00000000 00000040 f7f3dd80 00000000
       00000000 40000000 f5a58800 00000040 f7f3dd80 f88ac29e c03d3fd4 00000000
Call Trace:
[<f88abb28>] sky2_status_intr+0x212/0x455 [sky2]
[<f88ac29e>] sky2_poll+0x5c/0xbf [sky2]
[<c0282704>] net_rx_action+0xae/0x160
[<c0126a14>] __do_softirq+0x4c/0xb1
[<c010819f>] do_softirq+0x4f/0x56
=======================
[<c0107ab4>] do_IRQ+0x1a2/0x1ae
[<c02d6da8>] common_interrupt+0x18/0x20
[<c01040e8>] mwait_idle+0x33/0x42
[<c01040a0>] cpu_idle+0x26/0x3b
Code: 00 00 89 72 28 e8 6b 48 ea ff 53 9d eb 80 5b 5e 5f c3 55 57 56 53 83 ec 0c 89 44 24 08 c7 44 24 04 01 00 00 00 89 04 24 8b 50 18 <83> ba 7c 01 00 00 00 74 6f 31 c0 f6 42 58 20 74 14 0f b7 82 ae
hmm, that's odd. I've located a sky2 card down here, and I've been pummeling it for about an hour now with icmp traffic in and out with no problems. Looking at the backtrace above, I put your oops in this section of code:
    /* Update receiver after 16 frames */
    if (++buf_write[le->link] == RX_BUF_WRITE) {
        sky2_put_idx(hw, rxqaddr[le->link], sky2->rx_put);
        buf_write[le->link] = 0;
    }
Looking at it, both buf_write and rxqaddr are statically defined and should never be NULL, and the sky2, le and hw pointers all get dereferenced previously in the function, indicating that if there were going to be a NULL pointer exception, it should have happened earlier in the function. About the only cause for this oops that I could see would be if le->link were greater than 2 and we overran one of buf_write or rxqaddr (both of which are statically defined arrays). Since my card seems to be working with that kernel just fine, do you think you can add some debug code to sky2_status_intr to see what exactly is NULL when we oops? Thanks!
Note there is a new kernel on my people page that fixes the above problem. Please test it out and report results.
I too am having this problem. I am running a Pentium 4 dual core SMP system with kernel 2.6.9-55.0.9 SMP. My Marvell chipset is as follows:
01:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 15)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 15)
03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 15)
04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 15)
I have tried both the patch suggested by Andy Gospodarek, which I built myself into a new module, and the ready-built kernel by Neil Horman at: http://people.redhat.com/nhorman/rpms/kernel-smp-2.6.9-55.3.EL.bz228733.i686.rpm
I still get one interface or another locking up under heavy load after between 0.5 and 3 hours. Running Neil's kernel I get:
Oct 30 16:40:25 app201 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Oct 30 16:40:25 app201 kernel: sky2 eth0: tx timeout
Oct 30 16:40:25 app201 kernel: sky2 hardware hung? flushing
Oct 30 16:41:58 app201 kernel: sky2 eth0: disabling interface
Running my own kernel with Andy's patch I get much the same:
Oct 31 14:48:37 app201 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Oct 31 14:48:37 app201 kernel: sky2 eth0: tx timeout
Oct 31 14:48:37 app201 kernel: sky2 hardware hung? flushing
Oct 31 14:53:47 app201 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Oct 31 14:53:47 app201 kernel: sky2 eth0: tx timeout
Oct 31 14:53:47 app201 kernel: sky2 status report lost?
Oct 31 14:54:17 app201 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Oct 31 14:54:17 app201 kernel: sky2 eth0: tx timeout
Oct 31 14:54:17 app201 kernel: sky2 hardware hung? flushing
Oct 31 14:59:37 app201 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Oct 31 14:59:37 app201 kernel: sky2 eth0: tx timeout
Oct 31 14:59:37 app201 kernel: sky2 status report lost?
Oct 31 14:59:53 app201 su(pam_unix)[2713]: session closed for user root
Oct 31 15:00:07 app201 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Oct 31 15:00:07 app201 kernel: sky2 eth0: tx timeout
Oct 31 15:00:07 app201 kernel: sky2 hardware hung? flushing
The machine I am using has 8 NIC sockets. 4 are unused. The 4 that are used use the sky2 driver. eth0 and eth1 are used to form a bridge, br0. The error is triggered when two machines are sending and receiving large amounts of data over the bridge. If you need any further details of my setup, please just let me know. Thanks, Pete Philips.
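To see how quickly the timeouts recur, the syslog excerpts can be reduced to the gaps between consecutive "tx timeout" lines. The sketch below assumes the syslog line layout quoted in this comment; the function name is illustrative, and because syslog stamps carry no year, the `year` argument is only a placeholder.

```python
import re
from datetime import datetime

# Matches syslog lines such as:
#   Oct 31 14:48:37 app201 kernel: sky2 eth0: tx timeout
TX_TIMEOUT = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: sky2 (\w+): tx timeout", re.M
)


def timeout_gaps(log_text, iface, year=2007):
    """Seconds between consecutive tx timeouts on one interface.

    `year` is a placeholder (assumption): syslog stamps do not include one.
    """
    times = [
        datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        for stamp, name in TX_TIMEOUT.findall(log_text)
        if name == iface
    ]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]
```

Applied to the Oct 31 excerpt, this yields gaps of 310, 30, 320 and 30 seconds for eth0, i.e. the interface never stays up for long once the first timeout has occurred.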
Thanks for the feedback, Pete. It sounds like you are getting an error similar to what others are seeing upstream. I'll talk with Neil and see if we can get something going soon to resolve this.
Out of interest, has anyone reported this problem with RHEL5?
I can confirm, after conducting a test over the weekend, that RHEL5 with the kernel-2.6.18-8.el5 kernel is also affected by this bug. The only problem is there was no log output to confirm this. I set up a machine with two Marvell NICs in a bridge configuration. I then set up two other machines (client/server) to continually pass large files between them, over the bridge. I left this going over the weekend. On Monday morning the bridge was no longer active and the client/server pair could no longer communicate. This was reset using the standard "rmmod sky2 ; modprobe sky2" reset procedure. After that the bridge came back. Since there was no log output it cannot be conclusively said to be the same problem, but it certainly looks like it.
Further experimentation reveals that this problem only manifests itself if the server in question is running a bridge and the two ports associated with the bridge use different speeds (10/100/Gbit) or different duplex settings. If both interfaces are set identically then the problem does not appear.
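Anyone who wants to check their own bridge for this condition could compare the link settings of the member ports. The sketch below assumes the text comes from running `ethtool <iface>` and that it contains the usual "Speed:" and "Duplex:" lines; the function names are illustrative only.

```python
import re


def link_settings(ethtool_output):
    """Extract (speed, duplex) from the text printed by `ethtool <iface>`."""
    speed = re.search(r"^\s*Speed:\s*(\S+)", ethtool_output, re.M)
    duplex = re.search(r"^\s*Duplex:\s*(\S+)", ethtool_output, re.M)
    return (speed.group(1) if speed else None,
            duplex.group(1) if duplex else None)


def ports_match(outputs):
    """True if every bridge port reports the same speed and duplex.

    `outputs` maps an interface name to its `ethtool <iface>` output.
    """
    settings = {link_settings(text) for text in outputs.values()}
    return len(settings) <= 1
```

If `ports_match()` returns False for the ports of br0, the setup matches the configuration in which the lockup was observed here.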
I can confirm that this bug is also present in RHEL 5.1 (kernel-2.6.18-53.el5). After only half an hour of putting data through my bridge I get these messages:
Nov 20 12:27:40 secerno kernel: sky2 eth0: tx timeout
Nov 20 12:27:40 secerno kernel: sky2 eth0: disabling interface
Nov 20 12:27:40 secerno kernel: sky2 eth0: enabling interface
Nov 20 12:27:40 secerno kernel: sky2 eth0: ram buffer 48K
Nov 20 12:27:43 secerno kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control rx
Although eth0 is disabled / enabled as stated in the log, the bridge is no longer functional after this event. Pete.
Interesting find that this problem manifests itself when the links are set to different speed/duplex. Have you happened to notice if the sky2-based hardware is the slower or faster link, or does it even matter?
The machine I am using has four Marvell 88E8053 NICs so both sides of the bridge are exactly the same NIC.
Pete, is this hardware still problematic for you? So I'm back to looking at this and as it stands right now there appears to be an issue with sky2 that becomes apparent when you have a sky2 interface as part of a bridging interface. The reason that is significant, is that the bridging devices can often push more traffic through the box than when using the device as an endpoint. I realize that this is a painful issue, so I would like to see if we can get it resolved. Now that we have a decent idea of why this might happen I'm going to see if I can get my hands on some sky2 hardware (I think we have some in the office) and look at reproducing the issue so I can understand better why it's happening. I have a feeling this might still be a problem upstream, but once we get a good handle on how to reproduce it, then we can start to figure out if it's still upstream or not.
Hi Andy, Yes I can confirm that this is still a problem. I can also confirm that this problem is also present in the latest RHEL5.1 kernel. As you suggest, the crucial element in reproducing the behaviour seems to be the use of a bridge. Pete.
Created attachment 302651 [details] new sky2 backport I've not tested it yet (no sky2 hardware in hand at the moment), but I've done this backport of the latest sky2 driver that a co-worker has been using on a 2.6.25 kernel, and he has been unable to reproduce any lockups or crashes with it. If you could give it a spin, I'd appreciate it. Thanks!
My test kernels have been updated to include a patch for this bugzilla. http://people.redhat.com/agospoda/#rhel4 Please test them and report back your results.
With the backported sky2 driver in the test kernel (smp-2.6.9-70.EL.gtest.44) I get no network connectivity. The driver loads and finds the NIC but there's just no connection. This is on a Gigabyte GA-965P-DS3 with an onboard Marvell 88E8053 chip. With the stock RHEL4 kernel (smp-2.6.9-67.0.15.EL) I see tx timeouts that the driver cannot recover from; reloading the driver fixes it. In fact I'm getting tx timeouts right now and it's entirely reproducible. Just start a couple of torrents (in this case Fedora 9 DVD images) and wait. Within 30 minutes tx timeouts will kill the network. This is with very moderate load, just around 600kb/s down and 40kb/s up.
sky2 eth0: tx timeout
sky2 eth0: transmit ring 269 .. 228 report=269 done=269
sky2 hardware hung? flushing
sky2 eth0: tx timeout
sky2 eth0: transmit ring 268 .. 227 report=269 done=269
sky2 eth0: status report lost?
... repeat last three lines ad nauseam
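The "transmit ring" lines can be turned into the diagnosis the driver prints on the following line. The sketch below rests on an assumption that is not stated anywhere in this report: that the four numbers are, in order, the driver's consumer index, its producer index, the index last reported by the status unit, and the index the hardware marks as done, and that the driver compares them in the order shown. The function name and the returned strings are illustrative.

```python
import re

RING = re.compile(r"transmit ring (\d+) \.\. (\d+) report=(\d+) done=(\d+)")


def classify_tx_timeout(line):
    """Guess which diagnosis follows a 'transmit ring' line.

    Assumed field order: consumer .. producer, report, done.
    """
    match = RING.search(line)
    if match is None:
        return None
    consumer, _producer, report, done = map(int, match.groups())
    if report != done:
        return "status burst pending"   # hardware finished more than was reported
    if report != consumer:
        return "status report lost"     # driver never saw the completion report
    return "hardware hung"              # everything agrees, yet nothing moves
```

Under that assumption the two ring lines quoted above classify as "hardware hung" and "status report lost" respectively, which agrees with the messages that follow them in the log.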
Andy / Neil, Sorry but I am unable to perform further testing of this issue with RHEL4 as my organisation have since moved over to RHEL5. I can however confirm that it remains a problem in RHEL5. Do I need to report it separately as a RHEL5 bug or does this ticket cover it? Thanks, Pete. pete.philips
No problem, Pete. I actually read yesterday that a firmware update will likely fix the tx timeout problems that have been plaguing sky2 for a while. Here is a copy of the email that was sent out to address it:
"Subject: [sky2, solved] transmit timeouts and firmware update...
I (and a lot of other users) have been experiencing the frequent sky2 transmit timeout problem [1] (on 88E8053/Yukon2 EC gig hardware); this is a result of the embedded NIC controller locking up, and I've found that updating the firmware addresses this issue. I'm still seeing a previous and different issue [2] from time to time though (silicon bug?).
Marvell shipping broken firmware is completely unpublicised or acknowledged, however updated firmware is available through your motherboard vendor, so all hope it not lost after all...
My 8053/EC is using firmware 2.2 (previously 1.9) - you can check in DOS with 'yukondg.exe' from http://www.marvell.com/drivers/files/yukondg_v6.53.4.3.zip .
Thanks, Daniel
--- [1]
NETDEV WATCHDOG: eth0 (sky2): transmit timed out
sky2 eth0: tx timeout
sky2 eth0: transmit ring 20 .. 491 report=20 done=20
...
--- [2]
sky2 eth0: hung mac 1:119 fifo 7 (90:163)
sky2 eth0: receiver hang detected
sky2 eth0: disabling interface
sky2 eth0: enabling interface
--"
Andy, Thanks for the information. The ZIP file only has a DOS utility, which is always a little tricky ;-) Do you know how to determine the firmware revision from Linux? If I use lspci I get:
01:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 15)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 15)
03:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 15)
04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit Ethernet Controller (rev 15)
I wonder if "rev 15" corresponds to firmware 1.5? Alternatively, if I use "ethtool -i eth4" I get:
driver: sky2
version: 1.14
firmware-version: N/A
bus-info: 0000:01:00.0
The "N/A" doesn't sound promising. Regards, Pete.
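For collecting this information from several machines, the `ethtool -i` output can be parsed into a dictionary. This is a minimal sketch that assumes the "key: value" layout quoted in this comment; the function name is illustrative.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i <iface>` output into a dict of its key/value lines."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only: bus-info values contain colons too.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


sample = """driver: sky2
version: 1.14
firmware-version: N/A
bus-info: 0000:01:00.0"""

info = parse_ethtool_info(sample)
print(info["driver"], info["firmware-version"])  # sky2 N/A
```

A value of "N/A" means only that the driver did not report a firmware version through this interface; it says nothing about which firmware the NIC actually runs.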
Just a note of warning. In the original thread on linux-net which #46 quoted from there are links to firmware downloads from Gigabyte and Jetway. That firmware is for the *88E8056 only* and it *will* trash your 88E8053 nic without so much as a warning. I did just that on my GA-965P-DS3 (rev 1.0) yesterday and am currently waiting on a reply from Gigabyte techsupport. It's no big loss really since it was basically useless anyway but still unfortunate.
Thanks for the warning, Tom!
Pete, the lspci information is not going to contain the firmware version -- that should specifically be hardware related. Unfortunately ethtool doesn't seem to show the correct version for your sky2 either. Hopefully anyone with this problem will be able to get firmware/BIOS updates that will help out. Both Neil and I tried quite a bit to reproduce this problem with our pci-e sky2 cards and I'm starting to understand why we could not when it seemed pretty easy for most who had on-board sky2 cards. If you do have success with a firmware/BIOS update for your on-board sky2 cards, please post the model of your motherboard and firmware version that fixed it if you don't mind. It would be great for us to collect a list so we can help others that have problems. Thanks!
Updating PM score.
Since RHEL 4.8 External Beta has begun, and this bugzilla remains unresolved, it has been rejected as it is not proposed as exception or blocker.
any update to this, Lodewijk?
(In reply to comment #55)
> any update to this, Lodewijk?
This problem caused serious, unpredictable instability in production servers (network traffic stopping after a few days) in November 2006. I bought new Intel network cards almost immediately, because I needed stable servers at short notice. Looking back, that was not a bad choice, as this issue still exists in March 2009. So, no comments from my side except that I am disappointed in Marvell. People should avoid these Marvell network interfaces as they are not well supported on Linux.
I assume by that you mean that using Andy's test kernels, the problem still exists, correct?
ping, any update here?
closing due to inactivity. No update in 2 months.