Red Hat Bugzilla – Bug 194408
Intel PRO/1000 w/e1000 fails at 1Gb speed
Last modified: 2007-11-30 17:07:25 EST
Description of problem:
70-100% packet loss unless the NIC is forced to 100Mb speed.
It is also necessary to force full duplex.
Version-Release number of selected component (if applicable):
Red Hat ES 4 Update 3 - Stock kernel and components
Dell Dual Xeon server w/2GB RAM w/on-board dual Intel PRO/1000 NIC
(currently with second port disabled in BIOS)
# uname -a
Linux ccasvr.meca.com 2.6.9-34.0.1.ELsmp #1 SMP Wed May 17 17:05:24 EDT 2006
i686 i686 i386 GNU/Linux
# lspci | grep Ethernet
0b:07.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet
Controller (rev 05)
# ethtool -h 2>&1 | head -1
ethtool version 1.8
# ethtool -i eth0
Steps to Reproduce:
1. Boot the stock kernel shown above with the stock e1000 module.
2. Ping another machine on the same subnet. (70-100% packet loss)
3. Run command:
ethtool -s eth0 speed 100 duplex full port tp autoneg off
4. Ping another machine on the same subnet. (0% packet loss)
5. (Note that even if you connect to a 10/100 switch, you still need to use the
ethtool command given above, or it will default to 100Mb _half_ duplex.)
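On Red Hat systems, the forced settings from step 3 can be made persistent across reboots via the initscripts ETHTOOL_OPTS hook; a config sketch, assuming eth0 and the standard ifcfg layout:

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (fragment)
# ETHTOOL_OPTS is applied by the initscripts each time the interface
# is brought up, so the workaround survives reboots.
DEVICE=eth0
ONBOOT=yes
ETHTOOL_OPTS="speed 100 duplex full port tp autoneg off"
```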
I have attached the boot section from the messages log.
I also have lspci and dmesg output, but could only include one attachment.
Created attachment 130706 [details]
Boot section from /var/log/messages
Created attachment 130707 [details]
Output from lspci
Created attachment 130708 [details]
Output from dmesg
Hi, this sounds like a weird problem; I have some follow-up questions.
What does 'ethtool eth0' report when you're negotiated at 1Gb?
Please attach the output of 'ethtool -S eth0' after your ping test.
Please attach the output of 'cat /proc/interrupts' after your ping test.
Please also attach the output of dmidecode, and make sure your BIOS is up to
date. We had some mention of a user seeing heavy packet loss on a Dell server
when server management was enabled using the same IP address.
Do you have remote system management enabled?
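All of the requested outputs can be gathered in one pass; a minimal sketch (the interface name and the output directory are assumptions, adjust as needed):

```shell
#!/bin/sh
# Collect the diagnostics requested above into one directory.
# /tmp/e1000-diag and eth0 are assumptions; override via $1 and $2.
OUT=${1:-/tmp/e1000-diag}
IF=${2:-eth0}
mkdir -p "$OUT"
for cmd in "ethtool $IF" "ethtool -S $IF" "cat /proc/interrupts" "dmidecode"; do
    # Skip tools that are not installed rather than aborting the run;
    # "|| true" keeps a failing command from stopping the loop.
    if command -v "${cmd%% *}" >/dev/null 2>&1; then
        $cmd > "$OUT/$(echo "$cmd" | tr ' /' '__').txt" 2>&1 || true
    fi
done
ls "$OUT"
```

Each output lands in its own file (e.g. `cat__proc_interrupts.txt`), which makes it easy to attach everything to the bug in one go.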
I do not have remote system management enabled.
As soon as I can coordinate after hours with an on-site IT person, I am going
to do the following:
1) Try John Linville's newest kernel release.
2) Try a back-to-back connection between 2 servers to rule out the switches.
3) Capture output requested by Jesse (above).
4) If possible, provide port stats from managed switch for 1Gb connection.
This issue is also being followed as a Dell Support Incident. It will probably
be sometime next week before I can get the info, since I have to coordinate
with another person for after-hours testing...
More info from testing:
1) The new kernel did not help.
2) Back-to-Back test did not help.
3) Ping testing and "ethtool eth0" output in attachment Test_Results.txt
"ifconfig eth0" output at various times provided also.
"dmidecode" output is also attached.
(I did not capture the output of "ethtool -S eth0"
or "cat /proc/interrupts"...
If we still need that, I can get it.)
4) The switch never reported any errors at any time.
Created attachment 131221 [details]
Test Results of various recommended tests
Created attachment 131222 [details]
Output from dmidecode
Are these machines connected to any other network? The behavior you're
describing is common when you have two NICs on the same switch segment and the
packets take an alternate route. What does tracepath 192.168.1.7 say?
I would still like to see the output of ethtool -S eth0 and cat /proc/interrupts.
Are you able to try the second (currently disabled) port instead? Have you
tried hooking directly to a Linux box instead? If you do so, you will be able
to run tcpdump at both ends, and we can see what the packet stream looks like.
Did you run the self-test in the driver?
# ethtool -t eth0 offline
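The self-test can be wrapped so it degrades gracefully when the tool or interface is absent; a sketch (the default interface name is an assumption, and `e1000_selftest` is a hypothetical helper name):

```shell
# e1000_selftest: run the driver's offline self-test when possible.
# The offline test briefly takes the link down, so run it after hours.
e1000_selftest() {
    IF=${1:-eth0}    # default interface name is an assumption
    if command -v ethtool >/dev/null 2>&1 && [ -e "/sys/class/net/$IF" ]; then
        ethtool -t "$IF" offline
    else
        echo "skipped: ethtool or $IF not available"
    fi
}
# Usage: e1000_selftest eth0
```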
Thanks for following up.
The main office LAN 192.168.1.0/24
A Multitech RouteFinder 560 provides an internet gateway at 192.168.1.10.
There is a remote office connected via a hardware VPN. LAN 192.168.2.0/24
The remote office uses a Multitech RouteFinder 550 for this purpose.
*No other machines on the network experience any issues with 1Gb connections.*
*The servers (except the Dell) and 45+ of the workstations are running 1Gb.*
This, in combination with the Back-to-Back test we performed, would rule out
the alternate route possibilities as being the source of the problem.
The main office has 4 servers, 60+ workstations and 15+ printers. The Dell
2800 Linux server houses medical data and applications. Another Linux server
(not from us) runs their optical shop. The HP G3 is their accounting server. A
new Windows 2003 server is being set up to implement Active Directory for
controlling the 60+ workstations. The remote office on the VPN has about 4
workstations and 2 printers.
Everything is working on the entire network as long as we don't allow the Dell
to connect at 1Gb. - But we really should have it connected at 1Gb since the
medical records data contains many images. - In fact, once we have the 1Gb
connection working, I would like to have both ports connected and bound to
provide an alternate path during heavy bandwidth hits.
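For the bonding plan mentioned above, the RHEL4-era configuration looks roughly like this; a sketch, where the bonding mode, miimon value, and IP address are assumptions:

```shell
# /etc/modprobe.conf (fragment)
alias bond0 bonding
options bond0 mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.1.5     # assumed address
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise ifcfg-eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```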
I have tried the other NIC port with the same results. Currently, it is
disabled in the BIOS.
I think (the next time I can coordinate testing) that the first thing we will
try, in the interest of time, is to disable both on-board NICs and put in a
new NIC. If that works, then we know it is a hardware problem with the
on-board NIC. - Dell can do this since the client has Dell hardware support.
Then, if that doesn't work, I will obtain the things I forgot to pick up for
Jesse during the last series of tests.
I really appreciate everyone's help.
Dell is going to replace the motherboard on 6/23/06 at 6:30pm.
(The hardware contract requires this approach vs. adding a new NIC.)
I will post results on 6/26/06.
Dell swapped out the motherboard on Friday night and when we switched the
system and the switch to autonegotiate, it locked in at 1Gb and we lost no
packets pinging other PCs on the LAN! So, it was a hardware problem all
along. It just wasn't broken enough to make it easy to identify...
Thanks for everyone's participation,
Closing as NOTABUG since it appears to have been a hardware issue.
Why did this get magically reopened?
Good Evening Jesse,
Two customers reported problems that matched this, and the second customer has
confirmed that a driver-level update resolved their problems. The first customer
reported that kernel 2.6.17-rc6 worked flawlessly on their system, while the RHEL4 kernel
Anders, has the "first" customer actually responded re: my test kernels? Or
are we still waiting to hear?
The fix was committed in stream U5 build 42.28. A test kernel with this patch is available.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
QE ack for RHEL5.
Make that RHEL4.5.
It looks like the e1000 driver update has resolved this issue.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.