Bug 228825

Summary: e1000 driver does not work properly with Tyan Tempest i5000PX (S5380) onboard nic
Product: Red Hat Enterprise Linux 4
Reporter: Magnus Pfeffer <pfeffer>
Component: kernel
Assignee: Andy Gospodarek <agospoda>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 4.0
CC: davem, jbaron, linville, nhorman, peterm, tgraf
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: 2.6.9-55.ELsmp
Doc Type: Bug Fix
Last Closed: 2007-06-27 21:15:22 UTC
Type: ---
Attachments:
Description                                  Flags
lspci output plain, verbose and numerical    none
es2kick.patch                                none
Ethereal screenshot                          none

Description Magnus Pfeffer 2007-02-15 13:06:19 UTC
Description of problem:

We are using Tyan Tempest i5000PX (S5380) server mainboards with RHEL 4.

The onboard NICs are recognized by the e1000 driver, but only work at speeds
up to 100 Mbit/s. If the NICs are connected to a gigabit switch, the link comes
up at 1000 Mbit/s full duplex (according to the kernel log), but the connection
is not usable: neither incoming nor outgoing connections are possible.

We upgraded to the latest RHEL kernel 2.6.9-42.0.8.ELsmp but the problem persists.

Knoppix/Debian kernels (2.6.18) do not show this behaviour; the NICs work
properly at all speeds (10/100/1000).

Version-Release number of selected component (if applicable):

See attached lspci output.


How reproducible:
Easily.

Steps to Reproduce:
1. Connect the NIC to a gigabit switch.
2. Observe that no connections are possible in either direction.
Actual results:
No connectivity.

Expected results:
Connectivity at gigabit speed.

Additional info:

Comment 1 Magnus Pfeffer 2007-02-15 13:06:19 UTC
Created attachment 148108 [details]
lspci output plain, verbose and numerical

Comment 2 John W. Linville 2007-02-28 19:53:34 UTC
Can we see the output of ethtool and mii-tool on the NICs in question?

Comment 3 Magnus Pfeffer 2007-03-01 16:37:05 UTC
(In reply to comment #2)
> Can we see the output of ethtool and mii-tool on the NICs in question?

With working 100MBit link:

[root@aleph oracle_tables]# mii-tool -v
eth0: negotiated 100baseTx-FD flow-control, link ok
  product info: vendor 00:50:43, model 10 rev 2
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
eth1: no link
  product info: vendor 00:50:43, model 10 rev 2
  basic mode:   autonegotiation enabled
  basic status: no link
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control

[root@aleph oracle_tables]# ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
        Link detected: yes
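
The ethtool output above can also be checked mechanically. The following is a minimal sketch, not part of the original report, that compares the negotiated speed with the fastest advertised link mode. The function names `parse_ethtool` and `link_below_best_mode` are hypothetical, and `SAMPLE` is an abridged copy of the eth0 output quoted above.

```python
import re

# Abridged copy of the `ethtool eth0` output posted in this comment.
SAMPLE = """\
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Link detected: yes
"""


def parse_ethtool(text):
    """Return (advertised speeds in Mbit/s, negotiated speed in Mbit/s or None)."""
    advertised = []
    in_advertised = False
    speed = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Advertised link modes:"):
            # The mode list starts on this line and continues on the next ones.
            in_advertised = True
            stripped = stripped.split(":", 1)[1]
        elif ":" in stripped:
            # Any other "Key: value" line ends the continuation block.
            in_advertised = False
        if in_advertised:
            advertised += [int(m) for m in re.findall(r"(\d+)base", stripped)]
        if stripped.startswith("Speed:"):
            match = re.search(r"(\d+)Mb/s", stripped)
            speed = int(match.group(1)) if match else None
    return advertised, speed


def link_below_best_mode(text):
    """True if the link negotiated a lower speed than the best advertised mode."""
    advertised, speed = parse_ethtool(text)
    return bool(advertised) and speed is not None and speed < max(advertised)


if __name__ == "__main__":
    print(parse_ethtool(SAMPLE))
    print(link_below_best_mode(SAMPLE))
```

For the sample above, the NIC advertises 1000baseT/Full but the link runs at 100 Mbit/s, which matches the "working 100MBit link" case described in this comment.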



Comment 4 Andy Gospodarek 2007-03-05 20:18:12 UTC
Created attachment 149285 [details]
es2kick.patch

There is an upstream patch that might address this issue, but I'm not certain. 
Here's the description:

commit bb8e3311ef9de8e72f45f910e4a977c313c7009c
Author: Jeff Garzik <jeff@garzik.org>
Date:	Fri Dec 15 11:06:17 2006 -0500

    e1000: workaround for the ESB2 NIC RX unit issue

    In rare occasions, ESB2 systems would end up started without the RX
    unit being turned on. Add a check that runs post-init to work around
    this issue.

    Originally from Jesse Brandeburg <jesse.brandeburg@intel.com>,
    rewritten to use feature flags by me.

    Signed-off-by: Jeff Garzik <jeff@garzik.org>

Can you tell me if you narrowed this problem down to one that is related to RX,
TX, or both?  For example, can you receive frames by running tcpdump/wireshark
on this interface?  What about generating some traffic with 'arping' and
checking whether or not it came out onto the wire?  This might help me since I
don't have that specific hardware around.
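
The RX/TX check described above could be scripted roughly as follows. This is a sketch under assumptions, not a procedure from the thread: the helper names `probe_commands` and `classify` are hypothetical, the interface name and peer address in the example are placeholders, and the commands are only built, never executed. The `tcpdump` options (`-n`, `-e`, `-i`, the `arp` filter) and the `arping` options (`-I`, `-c`) are the standard ones.

```python
def probe_commands(iface, peer_ip, count=4):
    """Build (capture, send) command lines for an ARP-based RX/TX check.

    capture: run on the host under test (RX) or on the peer (TX) to watch
             for ARP frames on the wire.
    send:    generates ARP requests toward peer_ip from the given interface.
    """
    capture = ["tcpdump", "-n", "-e", "-i", iface, "arp"]
    send = ["arping", "-I", iface, "-c", str(count), peer_ip]
    return capture, send


def classify(tx_seen_on_peer, rx_seen_locally):
    """Summarize which direction appears broken from two manual observations."""
    if tx_seen_on_peer and rx_seen_locally:
        return "both directions pass frames"
    if tx_seen_on_peer:
        return "RX suspected"
    if rx_seen_locally:
        return "TX suspected"
    return "RX and TX suspected"


if __name__ == "__main__":
    # "eth0" and 192.0.2.1 are placeholder values for illustration only.
    capture, send = probe_commands("eth0", "192.0.2.1")
    print(" ".join(capture))
    print(" ".join(send))
```

Running the capture on a second machine attached to the same switch shows whether frames sent by `arping` reach the wire. Running it on the affected host shows whether incoming frames are received.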

Comment 5 Magnus Pfeffer 2007-03-12 15:40:39 UTC
Created attachment 149828 [details]
Ethereal screenshot

Comment 6 Magnus Pfeffer 2007-03-12 15:56:56 UTC
Both send and receive seem to work, but the TCP connection gets out of step
after a few packets. See the attached Ethereal screenshot.


Comment 7 Neil Horman 2007-03-12 17:24:41 UTC
Do you have firewalls running on either the .45 or the .53 host?  Can you turn
them off for the purposes of testing?  The screenshot you provided suggests
that you have an iptables rule that is misbehaving and dropping some TCP
frames that it shouldn't be.

Comment 8 Magnus Pfeffer 2007-03-15 07:22:56 UTC
There is no firewall running on the servers in question. There are no iptables
rules set. We can send you a full tcpdump log file, but there is little more to
see than in the already posted screenshot: TCP connections do not work once the
server is connected to a gigabit switch.

I'd like to repeat: Simply plugging the server into a 100 Mbit/s switch solves
the problem completely. With a Debian/Knoppix kernel, gigabit connections work
with no problems at all.



Comment 9 Andy Gospodarek 2007-03-15 13:46:42 UTC
Magnus,

Thanks for the information.  I don't see any patches that immediately address
this issue, but I will keep looking.  

As a data point, could you disable TSO on the Tyan system and see if that helps?
You can do this with ethtool:

ethtool -K ethX tso [on|off]

Thanks.
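
For repeated testing, the TSO toggle above could be wrapped in a small helper. This is a minimal sketch with hypothetical names (`tso_command`, `set_tso`). It defaults to a dry run that only returns the command line, since actually changing offload settings requires root and a real interface.

```python
import subprocess


def tso_command(iface, enable):
    """Build the `ethtool -K <iface> tso on|off` command line."""
    return ["ethtool", "-K", iface, "tso", "on" if enable else "off"]


def set_tso(iface, enable, dry_run=True):
    """Return the command as a string; execute it only when dry_run is False."""
    cmd = tso_command(iface, enable)
    if not dry_run:
        # Raises CalledProcessError if ethtool exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    # "eth0" is a placeholder interface name.
    print(set_tso("eth0", enable=False))
```

The current offload settings can be inspected with `ethtool -k ethX` (lowercase `-k`) before and after the change.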

Comment 10 Magnus Pfeffer 2007-03-23 08:52:20 UTC
Hello,

we tried to disable TSO as suggested and tried a few other ethtool switches for
good measure. The issue remains the same.

Yours,

Magnus Pfeffer

Comment 11 Magnus Pfeffer 2007-03-26 07:18:16 UTC
Andy,

As the server is supposed to enter production use in May, we decided to buy an
additional PCIe Gigabit LAN card. Can you suggest a maker/model that would
definitely work with RHEL AS 4.0? The hardware compatibility lists we found only
listed complete systems.

Thanks,

Magnus

Comment 12 Magnus Pfeffer 2007-03-26 12:13:39 UTC
Hello,

using the latest test kernel from
http://people.redhat.com/linville/kernels/rhel4/ solved the problem. 

Yours,

Magnus

Comment 13 Andy Gospodarek 2007-03-26 13:43:39 UTC
That is excellent news.  Every patch in Linville's latest test kernels will
also appear in the next update, so this should be resolved in 4.5.  If you
would like to test kernels to be sure, you can grab them here:

http://people.redhat.com/jbaron/rhel4/

Comment 14 Andy Gospodarek 2007-05-03 14:58:28 UTC
Have the new kernels for RHEL 4.5 resolved this issue?



Comment 15 Magnus Pfeffer 2007-06-26 16:09:46 UTC
Kernel 2.6.9-55.ELsmp fixed the issue. Please close the bug.

Thanks for the support.
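
A minimal sketch for checking whether a host already runs at least the kernel reported as fixed above (2.6.9-55.ELsmp). The names `kernel_key` and `has_fix` are hypothetical, and the comparison is only meaningful between RHEL 4 kernels built from the same 2.6.9 base. It says nothing about other distributions' kernels.

```python
import platform
import re

FIXED_RELEASE = "2.6.9-55.ELsmp"  # version reported as fixed in this bug


def kernel_key(release):
    """Turn '2.6.9-42.0.8.ELsmp' into a comparable tuple (2, 6, 9, 42, 0, 8).

    Leading numeric components are kept; parsing stops at the first
    non-numeric token (for example 'ELsmp').
    """
    key = []
    for token in re.split(r"[.-]", release):
        if not token.isdigit():
            break
        key.append(int(token))
    return tuple(key)


def has_fix(release, fixed=FIXED_RELEASE):
    """True if `release` sorts at or after the fixed kernel release."""
    return kernel_key(release) >= kernel_key(fixed)


if __name__ == "__main__":
    running = platform.release()
    print(running, "has fix" if has_fix(running) else "predates fix")
```

With the versions mentioned in this report, `has_fix("2.6.9-42.0.8.ELsmp")` is false and `has_fix("2.6.9-55.ELsmp")` is true.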