Bug 186397 - Problem with the sky2.ko network driver (Marvell GigE card driver)
Status: CLOSED CANTFIX
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686 Linux
Priority: medium Severity: medium
Assigned To: John W. Linville
QA Contact: Brian Brock
Reported: 2006-03-23 06:40 EST by Ariel Biener
Modified: 2007-11-30 17:07 EST (History)
Last Closed: 2006-10-12 11:08:48 EDT
Description Ariel Biener 2006-03-23 06:40:35 EST
Description of problem:

The sky2.ko driver gets stuck after running for a few hours
on a workstation with minimal to moderate network activity (see
details below)

Version-Release number of selected component (if applicable):
2.6.9-34smp (kernel)

How reproducible:
Always

Steps to Reproduce:
1. Just run the system as normal, on the hardware described below
  
Actual results:
Networking gets stuck, system continues to run with no network.

Expected results:
System should operate normally, without the network getting stuck.

Additional info:

Below is the e-mail sent to the sky2.ko driver writer/maintainer.
It includes all the details about this problem, along with the
relevant hardware and other details.

------------------

  Hello Stephen,


    The sky2.ko driver seems to break in the environment described below. What I
mean by "broken" is that after working for a few hours, it gets stuck, and
networking is gone. I used the sk98lin driver from Marvell until yesterday, when
the last up2date of RHEL4 brought on board a new driver, named sky2.ko, which
supposedly supports my interface (and in fact it did). I thus started using it
instead of the sk98lin.ko driver (from Intel -> Marvell), but after a few hours it
got stuck (below are the messages from the kernel, and then the system config
and parameters):

Mar 23 03:06:08 fireball kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 23 03:06:08 fireball kernel: sky2 transmit interrupt missed? recovered
Mar 23 03:07:43 fireball kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 23 03:07:43 fireball kernel: sky2 eth0: tx timeout


Below are the initialization messages from the sk98lin.ko Intel/Marvell 
driver, so you can further compare to your sky2.ko driver:

Mar 23 13:13:03 fireball kernel: eth0: network connection up using port A
Mar 23 13:13:03 fireball kernel:     speed:           100
Mar 23 13:13:03 fireball kernel:     autonegotiation: yes
Mar 23 13:13:03 fireball kernel:     duplex mode:     full
Mar 23 13:13:03 fireball kernel:     flowctrl:        none
Mar 23 13:13:03 fireball kernel:     irq moderation:  disabled
Mar 23 13:13:03 fireball kernel:     tcp offload:     enabled
Mar 23 13:13:03 fireball kernel:     scatter-gather:  enabled
Mar 23 13:13:03 fireball kernel:     tx-checksum:     enabled
Mar 23 13:13:03 fireball kernel:     rx-checksum:     enabled



System configuration:


RHEL 4 WS (update 3), running on Intel Desktop board chipset 925,
processor is a Pentium 4HT 3.2 (non 64bit), 512MB of DDR2 memory.

Kernel is 2.6.9-34smp (RHEL 4WS update 3).


the Marvell GigE card is detected as:

04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8050 PCI-E ASF Gigabit Ethernet Controller (rev 17)


Running `strings' on the driver yields the following:

parm=debug:Debug level (0=none,...,16=all)
parm=copybreak:Receive copy threshold
description=Marvell Yukon 2 Gigabit Ethernet driver
author=Stephen Hemminger <shemminger@osdl.org>
license=GPL
version=0.13 DD105319C547861EB68046F
vermagic=2.6.9-34.ELsmp SMP 686 REGPARM 4KSTACKS gcc-3.4


thanks for your time,

--Ariel
 --
 Ariel Biener, CISO
 Tel-Aviv University CIT div.
 e-mail: ariel@aristo.tau.ac.il phone: 03-6406086
 PGP key:    http://www.tau.ac.il/~ariel/pgp.html
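
For anyone trying to quantify how often the interface stalls, the watchdog messages quoted in the e-mail above can be pulled out of a syslog file with a short script. The following is a minimal Python sketch, not part of the original report; the function name `find_tx_timeouts` and the regular expression are illustrative assumptions based only on the message format shown above.

```python
import re

# Matches the syslog form seen in this report, e.g.
# "Mar 23 03:06:08 fireball kernel: NETDEV WATCHDOG: eth0: transmit timed out"
WATCHDOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"NETDEV WATCHDOG:\s+(?P<iface>\w+):\s+transmit timed out"
)


def find_tx_timeouts(lines):
    """Return (timestamp, host, interface) for each watchdog timeout line."""
    hits = []
    for line in lines:
        m = WATCHDOG_RE.match(line)
        if m:
            hits.append((m.group("stamp"), m.group("host"), m.group("iface")))
    return hits


# Sample input copied from the kernel messages in the description.
sample = """\
Mar 23 03:06:08 fireball kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 23 03:06:08 fireball kernel: sky2 transmit interrupt missed? recovered
Mar 23 03:07:43 fireball kernel: NETDEV WATCHDOG: eth0: transmit timed out
Mar 23 03:07:43 fireball kernel: sky2 eth0: tx timeout
""".splitlines()

for stamp, host, iface in find_tx_timeouts(sample):
    print(stamp, host, iface)
```

On a real system the same function could be fed the lines of the system log file instead of the sample; the path of that file depends on the syslog configuration.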
Comment 1 John W. Linville 2006-03-23 09:40:52 EST
Test kernels w/ a very recent version of sky2 are available here: 
 
   http://people.redhat.com/linville/kernels/rhel4/ 
 
Please give them a try and post the results here...thanks! 
Comment 2 Ariel Biener 2006-03-23 14:29:27 EST
Hi John,

  

    While I cannot run a test kernel on a production machine, I can enclose
the answer from the `sky2' developer/maintainer (I contacted him directly
as well). He suggested using the latest 1.1 sky2 version, and sent me both
sky2.h and sky2.c. What version are you using in the 2.6.9-34.6smp kernel?
See Stephen Hemminger's answer below.

--Ariel

Date: Thu, 23 Mar 2006 09:24:04 -0800
From: Stephen Hemminger <shemminger@osdl.org>
To: Ariel Biener <ariel@aristo.tau.ac.il>
Subject: Re: Marvell Yukon 2 Gigabit Ethernet driver, version = 0.13
 DD105319C547861EB68046F
Message-ID: <20060323092404.47a65145@localhost.localdomain>
In-Reply-To: <200603231340.23302.ariel@aristo.tau.ac.il>
References: <200603231340.23302.ariel@aristo.tau.ac.il>
X-Mailer: Sylpheed-Claws 2.0.0 (GTK+ 2.8.6; i486-pc-linux-gnu)
Mime-Version: 1.0

I would recommend you take the latest 1.1 version and try it.
There are lots of race issues resolved after I finally got the hardware
documentation.  Nothing is perfect, but this version is way more stable.

Comment 3 John W. Linville 2006-03-23 14:33:36 EST
The stock RHEL4 kernel is currently at 0.13.  The test kernels at the location 
in comment 1 are using 1.1. 
 
I would suggest you at least try the test kernels.  If the driver you have now 
is locking-up, how much worse could it be? :-) 
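
Since the stock and test kernels carry different sky2 versions, it can help to confirm which one a booted kernel actually loaded. The sketch below is a hedged illustration in Python: it parses text in the `key: value` layout that `ethtool -i` prints (the sample is the output quoted later in comment 22) and compares the version against a minimum. The helper names are hypothetical, and the comparison assumes purely numeric dotted versions, so a string such as `1.3-rc1` would need extra handling.

```python
def parse_ethtool_info(text):
    """Parse 'key: value' lines, as printed by `ethtool -i <iface>`, into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so "bus-info: 0000:04:00.0" stays intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_tuple(version):
    """Turn a numeric dotted version such as '0.13' or '1.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_at_least(info, minimum):
    """True if the reported driver version is at or above the given minimum."""
    return version_tuple(info["version"]) >= version_tuple(minimum)


# Sample text taken from the ethtool output in comment 22.
sample = "driver: sky2\nversion: 1.6\nfirmware-version: N/A\nbus-info: 0000:04:00.0\n"
info = parse_ethtool_info(sample)
print(info["driver"], info["version"], is_at_least(info, "1.1"))
```

Comparing tuples rather than strings matters here: as text, "0.13" and "1.1" happen to sort correctly, but "1.10" would sort below "1.6".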
Comment 4 Ariel Biener 2006-03-23 18:42:04 EST
Well, actually, I switched back to using the sk98lin driver from Intel, which
works fine; however, a stock driver is of course a lot more preferable. Any idea
when you expect the next kernel upgrade release?

Regardless, since my whole bug report was meant to help myself and others
who may encounter this, I will install a test kernel for a week, and report
back whether this fixes it or not.

--Ariel
Comment 5 John W. Linville 2006-03-24 10:38:51 EST
Cool...let me know how it goes!  BTW, next kernel update won't be until 
May/June IIRC... 
Comment 6 Bill Hoover 2006-04-28 14:56:50 EDT
I am also testing this, as I ran into this problem with the NICs on the
PenguinComputing Xeon blades.  I'll let you know how things look once the
machines have been run under load for a while.
Comment 7 Bill Hoover 2006-05-01 13:41:46 EDT
I'm afraid there are still issues.  I got the following on one of the nodes.  I
haven't put any load on them yet, so I may have more problems when they are
loaded.  Once this got into this state it was off the network till I rebooted.

Apr 27 09:29:35 compute-105-12 kernel: sky2 v1.1 addr 0xf6a00000 irq 169 Yukon-XL (0xb3) rev 1
Apr 27 09:29:35 compute-105-12 kernel: divert: allocating divert_blk for eth0
Apr 27 09:29:35 compute-105-12 kernel: sky2 eth0: addr 00:a0:d1:e4:66:0d
Apr 27 09:29:35 compute-105-12 kernel: divert: allocating divert_blk for eth1
Apr 27 09:29:35 compute-105-12 kernel: sky2 eth1: addr 00:a0:d1:e4:66:0e
Apr 27 09:29:35 compute-105-12 kernel: sky2 eth0: enabling interface
Apr 27 09:29:35 compute-105-12 kernel: sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control none
Apr 28 18:32:13 compute-105-12 kernel: sky2 eth0: tx timeout
Apr 28 18:32:13 compute-105-12 kernel: sky2 eth0: transmit ring 489 .. 449 report=491 done=491
Apr 28 18:32:13 compute-105-12 kernel: sky2 status report lost?
Apr 28 18:32:23 compute-105-12 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 28 18:32:23 compute-105-12 kernel: sky2 eth0: tx timeout
Apr 28 18:32:23 compute-105-12 kernel: sky2 eth0: transmit ring 491 .. 451 report=491 done=491
Apr 28 18:32:23 compute-105-12 kernel: sky2 hardware hung? flushing
Apr 28 18:40:28 compute-105-12 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 28 18:40:28 compute-105-12 kernel: sky2 eth0: tx timeout
Apr 28 18:40:28 compute-105-12 kernel: sky2 eth0: transmit ring 451 .. 410 report=491 done=491
Apr 28 18:40:28 compute-105-12 kernel: sky2 status report lost?
Apr 28 18:41:08 compute-105-12 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Apr 28 18:41:08 compute-105-12 kernel: sky2 eth0: tx timeout
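
The three "transmit ring" lines in this log pair up with the message that follows each of them in a consistent way: when the first ring number equals `report` (and `report` equals `done`), the driver prints "hardware hung? flushing"; when it differs, the driver prints "status report lost?". The Python sketch below encodes that reading so other logs can be checked against it. It is inferred only from the three samples in this comment, not from the driver source; treating the first two numbers as ring consumer and producer indices is an assumption, and the `report != done` case does not occur in this log, so its label is a placeholder.

```python
import re

# Field names "cons" and "prod" are assumptions about what the two ring numbers mean.
RING_RE = re.compile(
    r"transmit ring (?P<cons>\d+) \.\. (?P<prod>\d+)\s+"
    r"report=(?P<report>\d+) done=(?P<done>\d+)"
)


def classify(cons, report, done):
    """Guess which follow-up message matches a 'transmit ring' line."""
    if report != done:
        # Not observed in this bug report; placeholder label only.
        return "report and done differ (case not seen in this log)"
    if report != cons:
        return "status report lost?"
    return "hardware hung? flushing"


def classify_line(line):
    """Parse one log line and classify it, or return None if it does not match."""
    m = RING_RE.search(line)
    if not m:
        return None
    return classify(int(m["cons"]), int(m["report"]), int(m["done"]))


# The three ring lines from the log above, with the wrapped parts joined.
samples = [
    "sky2 eth0: transmit ring 489 .. 449 report=491 done=491",
    "sky2 eth0: transmit ring 491 .. 451 report=491 done=491",
    "sky2 eth0: transmit ring 451 .. 410 report=491 done=491",
]
for line in samples:
    print(line, "->", classify_line(line))
```

If this reading holds, `report` and `done` stayed at 491 across all three timeouts while the ring indices kept moving, which fits the picture of a transmitter that stopped making progress.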
Comment 8 Bill Hoover 2006-05-02 14:19:25 EDT
Should I file this upstream in kernel bugzilla also, or do you want to handle that?
Comment 9 John W. Linville 2006-05-02 14:27:56 EDT
There are some further upstream changes.  Let me get a test kernel together 
with them to see if it covers this problem.  If not, we can go upstream with 
the problem. 
Comment 10 Bill Hoover 2006-05-02 14:54:40 EDT
OK - but...I either need the SRPM for it so I can rebuild, or else I need the
patch from bug 173843 in it also.  As you can see in the text, I took your SRPM
and added that patch to it.
Comment 11 John W. Linville 2006-05-02 15:01:06 EDT
I can't promise to include that patch, but I do always publish SRPMs. 
Comment 12 Bill Hoover 2006-05-03 17:30:43 EDT
Are you going to be using sky2 1.3-rc1?
Comment 13 John W. Linville 2006-05-04 10:28:26 EDT
The build was already in progress when Stephen posted that.  I have 1.2 in the  
test kernels here:  
 
  http://people.redhat.com/linville/kernels/rhel4/ 
 
Please give those a try and post the results here...thanks!  
Comment 14 John W. Linville 2006-05-19 13:13:40 EDT
Test kernels w/ sky2 1.3 now available at the same location... 
Comment 15 Anthony Dodson 2006-06-19 11:11:00 EDT
I still see the same problem with kernel 2.6.9-39.EL.jwltest.143smp, which
includes version 1.3 of the driver.
Comment 16 Ariel Biener 2006-08-10 13:13:32 EDT
Hi,

   I also still see the problem, on RHEL4.3-WS. The interface gets stuck every
few days, and a reboot is required since the module is stuck.

Linux fireball.tau.ac.il 2.6.9-34.0.2.ELsmp #1 SMP Fri Jun 30 10:33:58 EDT 2006 i686 i686 i386 GNU/Linux

--Ariel
Comment 17 John W. Linville 2006-08-10 19:28:41 EDT
Test kernels w/ sky2 1.5 are available here:

   http://people.redhat.com/linville/kernels/rhel4/

Please give them a try and post the results here...thanks!
Comment 18 Anthony Dodson 2006-08-16 13:27:39 EDT
I have the same problem with the 1.5 driver in kernel 2.6.9-42.EL.jwltest.156smp.
Comment 19 Anthony Dodson 2006-08-24 15:28:01 EDT
And, I experience the same problem with the non-SMP kernel in
kernel-2.6.9-42.2.EL.jwltest.160.i686.rpm.
Comment 20 John W. Linville 2006-08-29 17:52:26 EDT
One of the patches that went into sky2 1.6 has this comment:

    [PATCH] sky2: status interrupt handling improvement

    More changes to prevent losing status and causing hangs.
    The hardware is smarter than I gave it credit for.
    Clearing the status IRQ causes the status state machine to
    toggle an IRQ if needed and post any more transmits.

Test kernels w/ this and other patches to bring sky2 up-to-date w/ 1.6 are 
available at the same location as in comment 17.  I hate to keep spinning you 
off to random new versions, but given the 'causing hangs' fix comment above, 
would you mind giving this new kernel a try?  Thanks!
Comment 21 John W. Linville 2006-10-12 11:08:48 EDT
Closed due to lack of response.  Please reopen when the requested information 
becomes available...thanks!
Comment 22 Frank hoang 2007-05-09 16:52:31 EDT
Sorry for reopening this.
I'm having issues with the sky2 driver: frequent sky2 tx timeouts under heavy traffic.
I'm using an Intel® Server Board SE7520BB2.

#lspci | grep Marvell
04:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8050 PCI-E ASF Gigabit Ethernet Controller (rev 18)
#ethtool -i eth1
driver: sky2
version: 1.1

I was running the latest kernel, 2.6.9-42.0.10.ELsmp #1 SMP, when this was encountered.
Traffic of over 10Mbps would cause a timeout in about 15-30 minutes:
kernel: NETDEV WATCHDOG: eth1: transmit timed out
kernel: sky2 eth1: tx timeout 
kernel: sky2 status report lost? 
Server[4640]: Failed to open log file, log aborted.
The fix is to reboot the server or run #rmmod sky2 && modprobe sky2 to reload
the module.
I tested 2.6.9-55.EL.gtest.19smp from http://people.redhat.com/agospoda/#rhel4
and the server seemed solid, but only for a few hours before I got
slightly similar error messages again:

kernel: NETDEV WATCHDOG: eth1: transmit timed out
kernel: sky2 eth1: tx timeout
kernel: sky2 hardware hung? flushing
The current driver version is 1.6:
# ethtool -i eth1
driver: sky2
version: 1.6
firmware-version: N/A
bus-info: 0000:04:00.0
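
The reload workaround described in this comment (`rmmod sky2 && modprobe sky2`) could be automated as a stopgap. The following Python sketch is an assumption-laden illustration rather than a recommended fix: the helper names are hypothetical, the timeout markers are taken from the log lines quoted in this bug, and the real commands need root and will briefly drop every interface handled by the module. It defaults to a dry run that only prints what it would do.

```python
import subprocess

# Substrings taken from the kernel messages quoted in this bug report.
TIMEOUT_MARKERS = ("tx timeout", "transmit timed out")


def needs_reload(lines, iface):
    """True if any log line reports a transmit timeout for the given interface."""
    tag = iface + ":"  # e.g. "eth1:", so "eth1" does not match "eth10:"
    return any(
        tag in line and any(marker in line for marker in TIMEOUT_MARKERS)
        for line in lines
    )


def reload_commands(module="sky2"):
    """Commands equivalent to `rmmod sky2 && modprobe sky2`."""
    return [["rmmod", module], ["modprobe", module]]


def reload_driver(module="sky2", dry_run=True):
    """Unload and reload the module; with dry_run=True only print the commands."""
    for cmd in reload_commands(module):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stops at the first failing command


sample = [
    "kernel: NETDEV WATCHDOG: eth1: transmit timed out",
    "kernel: sky2 eth1: tx timeout",
    "kernel: sky2 hardware hung? flushing",
]
if needs_reload(sample, "eth1"):
    reload_driver(dry_run=True)
```

After a real reload the interface may also need to be brought up and reconfigured again, depending on how networking is managed on the host.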
