Bug 194408

Summary: Intel PRO/1000 w/e1000 fails at 1Gb speed
Product: Red Hat Enterprise Linux 4
Reporter: Larry R. Irwin <larryi>
Component: kernel
Assignee: John W. Linville <linville>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.0
CC: akarlsso, bruce.w.allan, jbaron, jesse.brandeburg, larryi
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: i686
OS: Linux
Whiteboard:
Fixed In Version: RHBA-2007-0304
Doc Type: Bug Fix
Last Closed: 2007-05-08 01:49:48 UTC
Type: ---
Attachments:
Description                                  Flags
Boot section from /var/log/messages          none
Output from lspci                            none
Output from dmesg                            none
Test Results of various recommended tests    none
Output from dmidecode                        none

Description Larry R. Irwin 2006-06-07 20:23:31 UTC
Description of problem:
70-100% packet loss unless forced to 100Mb speed
Necessary to also force full duplex.

Version-Release number of selected component (if applicable):
Red Hat ES 4 Update 3 - Stock kernel and components
Dell Dual Xeon server w/2GB RAM w/on-board dual Intel PRO/1000 NIC
(currently with second port disabled in BIOS)
-----
# uname -a
Linux ccasvr.meca.com 2.6.9-34.0.1.ELsmp #1 SMP Wed May 17 17:05:24 EDT 2006 i686 i686 i386 GNU/Linux
-----
# lspci | grep Ethernet
0b:07.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)
-----
# ethtool -h 2>&1 | head -1
ethtool version 1.8
# ethtool -i eth0
driver: e1000
version: 6.1.16-k3-NAPI
firmware-version: N/A
bus-info: 0000:0b:07.0

How reproducible:
1. Boot stock kernel shown above with stock e1000 module.
2. Ping another machine on the same subnet. (70-100% packet loss)
3. Run command:
   ethtool -s eth0 speed 100 duplex full port tp autoneg off
4. Ping another machine on the same subnet. (0% packet loss)
5. (Note that even if you connect to a 10/100 switch, you still need to use the
ethtool command given above, or it will default to 100Mb _half_ duplex.)
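
The steps above could be scripted roughly as follows. This is only a sketch: the interface name, the peer address (a placeholder for any host on the same subnet), the 50% threshold, and the helper names loss_pct and repro are assumptions for illustration, not part of the original report. Only the ethtool command is taken verbatim from step 3.

```shell
#!/bin/sh
# Sketch only: IFACE, PEER and the 50% threshold are illustrative assumptions.
IFACE="${IFACE:-eth0}"
PEER="${PEER:-192.168.1.7}"   # placeholder: any host on the same subnet

# Read ping output on stdin and print the integer packet-loss percentage
# taken from the summary line ("..., 70% packet loss, ...").
loss_pct() {
    sed -n 's/.* \([0-9][0-9]*\)\(\.[0-9]*\)\{0,1\}% packet loss.*/\1/p'
}

# Steps 2-4: ping, force 100Mb/full duplex if loss is high, then ping again.
repro() {
    before=$(ping -c 20 "$PEER" | loss_pct)
    echo "loss at autonegotiated speed: ${before}%"
    if [ "${before:-100}" -ge 50 ]; then
        ethtool -s "$IFACE" speed 100 duplex full port tp autoneg off
        sleep 5   # give the link time to come back up
        after=$(ping -c 20 "$PEER" | loss_pct)
        echo "loss at forced 100Mb/full: ${after}%"
    fi
}

# Nothing touches the network unless explicitly requested: sh repro.sh run
if [ "${1:-}" = "run" ]; then repro; fi
```

Run as root with "sh repro.sh run"; without the argument the script only defines the helpers.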

Additional info:
I have the boot section from the messages attached.
I have lspci and dmesg output also, but could only attach one attachment.

Comment 1 Larry R. Irwin 2006-06-07 20:23:31 UTC
Created attachment 130706 [details]
Boot section from /var/log/messages

Comment 2 Larry R. Irwin 2006-06-07 20:28:34 UTC
Created attachment 130707 [details]
Output from lspci

Comment 3 Larry R. Irwin 2006-06-07 20:29:31 UTC
Created attachment 130708 [details]
Output from dmesg

Comment 4 Jesse Brandeburg 2006-06-14 16:39:55 UTC
Hi, sounds like a weird problem. I have some follow-up questions.

What does 'ethtool eth0' report when you're negotiated at 1Gb?
Please attach the output of 'ethtool -S eth0' after your ping test.
Please attach the output of 'cat /proc/interrupts' after your ping test.

Please also attach the output of dmidecode and make sure your BIOS is up to
date.  We had some mention of a user having lots of packet loss problems on a
Dell server when they had enabled server management using the same IP address.

Do you have remote system management enabled?
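
One way to gather everything requested in this comment in a single pass is sketched below. The output filename and the helper names section and collect are hypothetical; the four commands are the ones named in the comment. It should be run as root, right after the ping test.

```shell
#!/bin/sh
# Sketch only: IFACE and OUT are illustrative assumptions.
IFACE="${IFACE:-eth0}"
OUT="${OUT:-e1000-diag.txt}"   # hypothetical output filename

# Print a header naming the command, then its output (stderr included).
# A failing or missing command is noted instead of aborting the collection.
section() {
    echo "===== $* ====="
    "$@" 2>&1 || echo "(command failed or not available: $1)"
    echo
}

# The four items requested in the comment above.
collect() {
    section ethtool "$IFACE"
    section ethtool -S "$IFACE"
    section cat /proc/interrupts
    section dmidecode
}

# Collect only when explicitly requested: sh diag.sh run
if [ "${1:-}" = "run" ]; then
    collect > "$OUT"
    echo "wrote $OUT"
fi
```

The resulting file can then be attached to the bug as a single attachment.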

Comment 5 Larry R. Irwin 2006-06-14 21:22:55 UTC
Hi,

I do not have remote system management enabled.

As soon as I can coordinate after hours with an on-site IT person I am going 
to do the following:

1) Try John Linville's newest kernel release.
2) Try a back-to-back connection between 2 servers to rule out the switches.
3) Capture output requested by Jesse (above).
4) If possible, provide port stats from managed switch for 1Gb connection.

This issue is also being followed as a Dell Support Incident. It will probably 
be sometime next week to get the info, since I have to coordinate with another 
person for after-hours testing...

Comment 6 Larry R. Irwin 2006-06-20 19:49:58 UTC
More info from testing:
1) The new kernel did not help.
2) Back-to-Back test did not help.
3) Ping testing and "ethtool eth0" output in attachment Test_Results.txt
   "ifconfig eth0" output at various times provided also.
   "dmidecode" output is also attached.
   (I did not capture the output of "ethtool -S eth0"
    or "cat /proc/interrupts...
    If we still need that I can get it...)
4) The switch never reported any errors at any time.

Comment 7 Larry R. Irwin 2006-06-20 19:51:23 UTC
Created attachment 131221 [details]
Test Results of various recommended tests

Comment 8 Larry R. Irwin 2006-06-20 19:52:17 UTC
Created attachment 131222 [details]
Output from dmidecode

Comment 9 Jesse Brandeburg 2006-06-21 23:10:45 UTC
Are these machines connected to any other network?  The behavior you're
describing is common when you have two NICs on the same switch segment and the
packets take an alternate route.  What does tracepath 192.168.1.7 say?

I would still like to see the output of ethtool -S eth0 and cat /proc/interrupts.

Are you able to try the second (currently disabled) port instead?  Have you
tried hooking directly to a Linux box instead?  If you do so you will be able to
tcpdump at both ends and we can see what the packet stream looks like.

Did you run the self test in the driver?
# ethtool -t eth0 offline

Thanks for following up.
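
For the back-to-back tcpdump comparison suggested above, a sketch like the following could show which pings never reach the far end. It assumes text captures of ICMP traffic were saved on each machine (for example with "tcpdump -n -i eth0 icmp > sender.txt") and that the tcpdump output prints "echo request" and "seq N" on each request line; the filenames and the helper names echo_seqs and lost_seqs are hypothetical.

```shell
#!/bin/sh
# Sketch only: compares two tcpdump text captures of the same ping run.

# Print the sequence numbers of the ICMP echo requests found in a capture file.
echo_seqs() {
    sed -n 's/.*echo request.*seq \([0-9][0-9]*\).*/\1/p' "$1" | sort -u
}

# List sequence numbers seen in the sender's capture ($1) but missing from
# the receiver's capture ($2), i.e. requests lost on the way.
lost_seqs() {
    sender=$(mktemp)
    receiver=$(mktemp)
    echo_seqs "$1" > "$sender"
    echo_seqs "$2" > "$receiver"
    comm -23 "$sender" "$receiver"
    rm -f "$sender" "$receiver"
}

# Compare only when two capture files are given: sh lost.sh sender.txt receiver.txt
if [ "$#" -eq 2 ]; then lost_seqs "$1" "$2"; fi
```

Swapping the two arguments and matching on echo replies instead would show losses in the other direction.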

Comment 10 Larry R. Irwin 2006-06-22 14:12:50 UTC
The configuration:
The main office LAN 192.168.1.0/24
Multitech RouteFinder 560 provides an internet gateway at 192.168.1.10
There is a remote office connected via a hardware VPN. LAN 192.168.2.0/24
The remote office uses a Multitech RouteFinder 550 for this purpose.

*No other machines on the network experience any issues with 1Gb connections.*
*The servers (except the Dell) and 45+ of the workstations are running 1Gb.*
This, in combination with the Back-to-Back test we performed, would rule out 
the alternate route possibilities as being the source of the problem.

The main office has 4 servers, 60+ workstations and 15+ printers. The Dell 
2800 Linux server houses medical data and applications. Another Linux server 
(not from us) runs their optical shop. The HP G3 is their accounting server. A 
new Windows 2003 server is being set up to implement Active Directory for 
controlling the 60+ workstations. The remote office on the VPN has about 4 
workstations and 2 printers.

Everything is working on the entire network as long as we don't allow the Dell 
to connect at 1Gb. - But we really should have it connected at 1Gb since the 
medical records data contains many images. - In fact, once we have the 1Gb 
connection working, I would like to have both ports connected and bound to 
provide an alternate path during heavy bandwidth hits.

I have tried the other NIC port with the same results. Currently, it is 
disabled in the BIOS.

I think (the next time I can coordinate testing) that the first thing we will 
try, in the interest of time, is to disable both on-board NICs and put in a 
new NIC. If that works, then we know it is a hardware problem with the
on-board NIC. - Dell can do this since the client has Dell hardware support.

Then, if that doesn't work, I will obtain the things I forgot to pick up for 
Jesse during the last series of tests.

I really appreciate everyone's help.

Comment 11 Larry R. Irwin 2006-06-22 16:05:11 UTC
Dell is going to replace the motherboard on 6/23/06 at 6:30pm.
(the h/w contract requires this approach vs. adding a new NIC)
I will post results on 6/26/06.

Comment 12 Larry R. Irwin 2006-06-26 13:43:19 UTC
Dell swapped out the motherboard on Friday night and when we switched the 
system and the switch to autonegotiate, it locked in at 1Gb and we lost no 
packets pinging other PCs on the LAN! So, it was a hardware problem all 
along. It just wasn't broken enough to make it easy to identify...

Thanks for everyone's participation,
Larry Irwin
CCA Medical
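
A quick way to confirm what the link locked in at after switching back to autonegotiation is to summarize the 'ethtool eth0' output, as sketched below. The helper name link_summary is hypothetical, and the sketch assumes the usual "Speed:", "Duplex:" and "Auto-negotiation:" lines in that output.

```shell
#!/bin/sh
# Sketch only: reads `ethtool <iface>` output on stdin and prints
# "<speed> <duplex> <autoneg>", e.g. "1000Mb/s Full on".
link_summary() {
    awk -F': ' '
        /^[ \t]*Speed:/            { speed = $2 }
        /^[ \t]*Duplex:/           { duplex = $2 }
        /^[ \t]*Auto-negotiation:/ { autoneg = $2 }
        END { print speed, duplex, autoneg }
    '
}

# Query the interface only when explicitly requested: sh link.sh run
if [ "${1:-}" = "run" ]; then ethtool "${IFACE:-eth0}" | link_summary; fi
```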


Comment 13 John W. Linville 2006-06-26 14:10:35 UTC
Closing as NOTABUG since it appears to have been a hardware issue.

Comment 19 Jesse Brandeburg 2006-10-25 21:56:01 UTC
why did this get magically reopened?

Comment 20 Sirius Rayner-Karlsson 2006-10-25 23:01:02 UTC
Good Evening Jesse,

Two customers reported problems that matched this, and the second customer has
confirmed that a driver-level update resolves their problems. The first customer
reported that kernel 2.6.17-rc6 worked flawlessly on their system, while the
RHEL4 kernel did not.

Kind Regards,

Anders Karlsson


Comment 21 John W. Linville 2006-10-26 14:41:42 UTC
Anders, has the "first" customer actually responded re: my test kernels?  Or 
are we still waiting to here?

Comment 22 John W. Linville 2006-10-26 14:48:11 UTC
s/here/hear

Comment 26 Jason Baron 2006-12-04 19:14:39 UTC
Committed in stream U5 build 42.28. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 27 RHEL Program Management 2006-12-14 22:05:32 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 28 Jay Turner 2006-12-18 15:19:54 UTC
QE ack for RHEL5.

Comment 29 Jay Turner 2006-12-18 15:21:12 UTC
Make that RHEL4.5.

Comment 31 Mike Gahagan 2007-04-03 20:28:05 UTC
It looks like the e1000 driver update has resolved this issue.


Comment 33 Red Hat Bugzilla 2007-05-08 01:49:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0304.html