Bug 194408
| Field | Value |
|---|---|
| Summary | Intel PRO/1000 w/e1000 fails at 1Gb speed |
| Product | Red Hat Enterprise Linux 4 |
| Component | kernel |
| Version | 4.0 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Keywords | Reopened |
| Reporter | Larry R. Irwin <larryi> |
| Assignee | John W. Linville <linville> |
| QA Contact | Brian Brock <bbrock> |
| CC | akarlsso, bruce.w.allan, jbaron, jesse.brandeburg, larryi |
| Fixed In Version | RHBA-2007-0304 |
| Doc Type | Bug Fix |
| Last Closed | 2007-05-08 01:49:48 UTC |
Description
Larry R. Irwin
2006-06-07 20:23:31 UTC
Created attachment 130706 [details]
Boot section from /var/log/messages
Created attachment 130707 [details]
Output from lspci
Created attachment 130708 [details]
Output from dmesg
Hi, sounds like a weird problem. I have some follow-up questions. What does 'ethtool eth0' report when you're negotiated at 1Gb? Please attach the output of 'ethtool -S eth0' after your ping test. Please attach the output of 'cat /proc/interrupts' after your ping test. Please also attach the output of dmidecode and make sure your BIOS is up to date.

We had some mention of a user having lots of packet loss problems on a Dell server when they had enabled server management using the same IP address. Do you have remote system management enabled?

Hi, I do not have remote system management enabled. As soon as I can coordinate after hours with an on-site IT person I am going to do the following:
1) Try John Linville's newest kernel release.
2) Try a back-to-back connection between 2 servers to rule out the switches.
3) Capture output requested by Jesse (above).
4) If possible, provide port stats from managed switch for 1Gb connection.
This issue is also being followed as a Dell Support Incident. It will probably be sometime next week before I can get the info, since I have to coordinate with another person for after-hours testing...

More info from testing:
1) The new kernel did not help.
2) Back-to-back test did not help.
3) Ping testing and "ethtool eth0" output are in attachment Test_Results.txt. "ifconfig eth0" output at various times is provided also. "dmidecode" output is also attached. (I did not capture the output of "ethtool -S eth0" or "cat /proc/interrupts"... If we still need that I can get it...)
4) The switch never reported any errors at any time.

Created attachment 131221 [details]
Test Results of various recommended tests
Created attachment 131222 [details]
Output from dmidecode
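The diagnostics requested earlier in this thread ('ethtool eth0', 'ethtool -S eth0', 'cat /proc/interrupts' and dmidecode) could be gathered in one pass right after a ping test. The following is only a minimal sketch, not something anyone in this report ran: it assumes a Python 3 interpreter is available on the box, and the function names and the output file name `e1000_diagnostics.txt` are hypothetical.

```python
import shutil
import subprocess


def diagnostic_commands(iface="eth0"):
    """Commands asked for in this thread; iface defaults to the eth0 named there."""
    return [
        ["ethtool", iface],
        ["ethtool", "-S", iface],
        ["cat", "/proc/interrupts"],
        ["dmidecode"],
    ]


def collect(iface="eth0", out_path="e1000_diagnostics.txt"):
    """Run each command and write its output to out_path (hypothetical name)."""
    with open(out_path, "w") as out:
        for cmd in diagnostic_commands(iface):
            out.write("### " + " ".join(cmd) + "\n")
            # Skip tools that are not installed instead of aborting the run.
            if shutil.which(cmd[0]) is None:
                out.write("(command not found)\n\n")
                continue
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # keep error text next to normal output
                universal_newlines=True,
            )
            out.write(result.stdout + "\n")
    return out_path


if __name__ == "__main__":
    print("wrote", collect())
```

dmidecode normally needs root, so the script would have to be run as root for that section to be useful; the resulting file can then be attached to the bug.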
Are these machines connected to any other network? The behavior you're describing is common when you have two NICs on the same switch segment and the packets take an alternate route. What does 'tracepath 192.168.1.7' say? I would still like to see the output of 'ethtool -S eth0' and 'cat /proc/interrupts'. Are you able to try the second (currently disabled) port instead? Have you tried hooking directly to a Linux box instead? If you do so you will be able to tcpdump at both ends and we can see what the packet stream looks like. Did you run the self test in the driver?
# ethtool -t eth0 offline

Thanks for following up. The configuration:
The main office LAN is 192.168.1.0/24. A Multitech RouteFinder 560 provides an internet gateway at 192.168.1.10.
There is a remote office connected via a hardware VPN, LAN 192.168.2.0/24. The remote office uses a Multitech RouteFinder 550 for this purpose.
*No other machines on the network experience any issues with 1Gb connections.*
*The servers (except the Dell) and 45+ of the workstations are running 1Gb.*
This, in combination with the back-to-back test we performed, would rule out the alternate route possibilities as being the source of the problem.
The main office has 4 servers, 60+ workstations and 15+ printers. The Dell 2800 Linux server houses medical data and applications. Another Linux server (not from us) runs their optical shop. The HP G3 is their accounting server. A new Windows 2003 server is being set up to implement Active Directory for controlling the 60+ workstations. The remote office on the VPN has about 4 workstations and 2 printers.
Everything is working on the entire network as long as we don't allow the Dell to connect at 1Gb.
- But we really should have it connected at 1Gb since the medical records data contains many images.
- In fact, once we have the 1Gb connection working, I would like to have both ports connected and bonded to provide an alternate path during heavy bandwidth hits.
I have tried the other NIC port with the same results. Currently, it is disabled in the BIOS. I think (the next time I can coordinate testing) that the first thing we will try, in the interest of time, is to disable both on-board NICs and put in a new NIC. If that works, then we know it is a hardware problem with the on-board NIC.
- Dell can do this since the client has Dell hardware support.
Then, if that doesn't work, I will obtain the things I forgot to pick up for Jesse during the last series of tests. I really appreciate everyone's help.

Dell is going to replace the motherboard 6/23/6 at 6:30pm. (The h/w contract requires this approach vs. adding a new NIC.) I will post results on 6/26/6.

Dell swapped out the motherboard on Friday night and when we switched the system and the switch to autonegotiate, it locked in at 1Gb and we lost no packets pinging other PCs on the LAN! So, it was a hardware problem all along. It just wasn't broken enough to make it easy to identify...
Thanks for everyone's participation,
Larry Irwin
CCA Medical

Closing as NOTABUG since it appears to have been a hardware issue.

Why did this get magically reopened?

Good Evening Jesse,
Two customers reported problems that matched this, and the second customer has confirmed a driver-level update has [fixed] their problems. The first customer reported that kernel 2.6.17-rc6 worked flawlessly on their system, while the RHEL4 kernel did not.
Kind Regards,
Anders Karlsson

Anders, has the "first" customer actually responded re: my test kernels? Or are we still waiting to here?

s/here/hear

Committed in stream U5 build 42.28. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

QE ack for RHEL5.

Make that RHEL4.5.

It looks like the e1000 driver update has resolved this issue.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2007-0304.html
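
For anyone verifying the updated kernel, one quick check is whether 'ethtool eth0' reports a 1Gb full-duplex link after autonegotiation. The sketch below is an illustration only: the helper names are hypothetical, and the `SAMPLE` text is a hand-written example of the usual `Key: value` layout of ethtool output, not output captured from the reporter's system.

```python
import re


def parse_link_settings(ethtool_output):
    """Collect 'Key: value' fields from the text printed by `ethtool <iface>`."""
    settings = {}
    for line in ethtool_output.splitlines():
        match = re.match(r"^\s*([^:]+):\s*(.*)$", line)
        # Lines such as "Settings for eth0:" have no value and are skipped.
        if match and match.group(2):
            settings[match.group(1).strip()] = match.group(2).strip()
    return settings


def is_gigabit_full_duplex(settings):
    """True when the link negotiated 1000Mb/s, full duplex."""
    return settings.get("Speed") == "1000Mb/s" and settings.get("Duplex") == "Full"


# Illustrative input only (assumption), not taken from this bug's attachments.
SAMPLE = """Settings for eth0:
\tSpeed: 1000Mb/s
\tDuplex: Full
\tAuto-negotiation: on
\tLink detected: yes
"""

if __name__ == "__main__":
    print(is_gigabit_full_duplex(parse_link_settings(SAMPLE)))
```

A 1Gb link on its own was not enough in this report, since the original symptom was packet loss at 1Gb, so this check would be paired with a ping test against other hosts on the LAN.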