Bug 131433

Summary: e1000 locks up on Dell PowerEdge 2600
Product: Red Hat Enterprise Linux 3
Reporter: Thomas J. Baker <tjb>
Component: kernel
Assignee: John W. Linville <linville>
Status: CLOSED WORKSFORME
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: 3.0
CC: bpeck, jgarzik, john.ronciak, petrides, riel
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2005-03-15 14:48:28 UTC
Type: ---
Bug Depends On: 143331    
Bug Blocks:    
Attachments:
Description                      Flags
e1000-update-5_5_4_k2.patch      none
e1000-5_5_4-k2--rhel3.patch      none

Description Thomas J. Baker 2004-09-01 12:21:38 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2)
Gecko/20040803 Galeon/1.3.17

Description of problem:
The e1000 seems to be locking up a couple of times a day on this
PowerEdge 2600. It was previously running RH 7.3 and had no problems but
was upgraded to RHEL3U2+errata. Console says this:

Uhhuh. NMI received. Dazed and confused, but trying to continue
You probably have a hardware problem with your RAM chips
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
e1000: eth0 NIC Link is Down
e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
NETDEV WATCHDOG: eth0: transmit timed out

System passed all hardware diagnostics and memory tests.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-15.0.4.EL

How reproducible:
Didn't try


Additional info:

Comment 1 Thomas J. Baker 2004-09-01 12:22:01 UTC
[root@intheair tjb]# ethtool -i eth0
driver: e1000
version: 5.2.30.1-k1
firmware-version: N/A
bus-info: 03:01.0
[root@intheair tjb]#


Comment 2 John W. Linville 2004-09-03 14:27:22 UTC
"You probably have a hardware problem with your RAM chips" -- seems
telling even if it is passing diags.  Intel does the upstream
maintenance of this driver, so they probably have a good idea of what
may cause such a problem.  Did you try using a different card?

RHEL3U3 should be available very soon.  It contains an update of the
e1000 driver to version 5.2.52k1.  It would be worth trying again after
the upgrade as well.  Please let me know if the problem persists after
the upgrade to U3.

Comment 3 Ernie Petrides 2004-09-07 20:22:25 UTC
RHEL3 U3 is already available.  The advisory is RHBA-2004:433.
Thomas, as John wrote previously, please let us know how things
go with U3.  Thanks.  -ernie


Comment 4 Thomas J. Baker 2004-09-08 14:28:03 UTC
I installed U3 this morning and got the same errors. The e1000 is on
board this Dell PowerEdge 2600 so we won't be able to try another
e1000. I had a similar problem with a Dell Precision 650 a while back
with Fedora Core 1:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=115877
It was fixed by a later kernel. If it was a problem with the memory, I
would expect other problems with the system but there haven't been any.

Comment 5 John W. Linville 2004-09-08 20:32:10 UTC
All the patches referred to in bug 115877 are already present
(verbatim) in the RHEL3 U3 kernel.

Not sure where to go w/ this...will likely ping the Intel guys...

In the meantime, it might be useful if you could attach the results of
running sysreport on the failing system.  Thanks in advance!

Comment 6 John Ronciak 2004-09-08 22:45:39 UTC
The NMI is a system hardware problem and possibly not related to a
NIC/driver problem.  When NMIs happen in the system, system integrity
can no longer be assured.  Any problems that devices or drivers are
having after an NMI might not really be happening.  Please try to
repro the "netdev watchdog" hangs after the NMI has been fixed.

We can't look at this until the NMI has been corrected.

Comment 7 John W. Linville 2004-09-09 14:00:22 UTC
Thomas, any chance you can get a recreate w/o an NMI?

Comment 8 Thomas J. Baker 2004-09-16 14:41:49 UTC
We installed a 3c59x in this system and we don't have any more NMIs
or ethernet problems. (Note that as I mentioned above we were
previously running RH 7.3 without any e1000 or NMI problems.) Just
today I noticed the same lockups on another system that was just
upgraded to U3:

e1000: eth0: e1000_watchdog: NIC Link is Down
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
e1000: eth0: e1000_watchdog: NIC Link is Down
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
e1000: eth0: e1000_watchdog: NIC Link is Down
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
e1000: eth0: e1000_watchdog: NIC Link is Down
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex

[root@bertha tjb]# ethtool -i eth0
driver: e1000
version: 5.2.52-k3
firmware-version: N/A
bus-info: 04:01.0
[root@bertha tjb]#



Comment 10 Bill Peck 2004-11-03 17:11:40 UTC
Using the latest RHEL3U4 kernel (2.4.21-23.EL) on ia64 I get the
following:

Nov  3 13:09:32 bull1 sshd(pam_unix)[3490]: session opened for user root by (uid=0)
Nov  3 13:11:18 bull1 kernel: ip_tables: (C) 2000-2002 Netfilter core team
Nov  3 13:11:19 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Nov  3 13:11:20 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Nov  3 13:11:23 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Nov  3 13:11:25 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Nov  3 13:11:28 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Nov  3 13:11:34 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Nov  3 13:11:37 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Nov  3 13:11:38 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
 
[root@bull1 network-scripts]# ethtool -i eth1
driver: e1000
version: 5.3.19-k2-NAPI
firmware-version: N/A
bus-info: 1f:01.0
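
Three hosts in this report show three different driver versions, so it can help to collect the `ethtool -i` output in a comparable form. The following is only a minimal Python sketch; the function name `parse_ethtool_info` is made up for illustration, and it assumes the "key: value" layout shown in the output pasted above.

```python
def parse_ethtool_info(text):
    """Parse 'key: value' lines as printed by `ethtool -i` into a dict.

    Hypothetical helper: it assumes the layout seen in this bug report.
    """
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address "1f:01.0" are kept intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


if __name__ == "__main__":
    sample = (
        "driver: e1000\n"
        "version: 5.3.19-k2-NAPI\n"
        "firmware-version: N/A\n"
        "bus-info: 1f:01.0\n"
    )
    print(parse_ethtool_info(sample))
```

Feeding it the output from each affected host gives one dict per machine, which makes the driver version differences easy to list side by side.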


Comment 11 John W. Linville 2004-11-04 21:00:31 UTC
Doesn't look too good...

This is a little out of my hands, since the hardware documentation is
unavailable.  About all I can do is to ping the Intel guys and keep up
with the updates upstream.

I'll put together a patch for RHEL3-U4 to get up-to-date w/ upstream...

Comment 12 John W. Linville 2004-11-04 21:02:16 UTC
Created attachment 106187 [details]
e1000-update-5_5_4_k2.patch

Backport of e1000 driver version 5.5.4-k2 to RHEL3 U4...

I'd love to hear if this helps...

Comment 13 John Ronciak 2004-11-04 21:43:32 UTC
Guys,

The messages you are showing in the last few updates just show the
watchdog routine detecting that link is down.  The PRO/1000 hardware
has a link status change interrupt which normally reports that link is
lost (or has come up, for that matter).  It doesn't look like you are
seeing that.  The message from the first note is:
NETDEV WATCHDOG: eth0: transmit timed out

which would indicate some sort of driver/HW issue.  The last few notes
above do not show the transmits timing out.  Since NMIs were
happening, there is no way to tell what state the actual HW was in.
I've never heard of or seen a case where our adapter (especially a LOM)
would cause an NMI.  Never.  So I asked for this to be repro'd without the
NMI.  Now the only thing the log is showing is that link is coming up
and down for some reason.  A PRO/1000 NIC could be plugged into the
system to see if it is also seeing this issue.  I assume these
messages are being pulled from /var/log/messages?  You could just try
our new drivers without having to port them.  We have standalone
versions on both support.intel.com and at sf.net/projects/e1000.

Since link is coming up and down, I think something is strange with
the network like cabling, switch, etc.

Also, have you guys tried the latest BIOS for the 2600?  We have seen
strange things in the past due to BIOS.  It's worth checking and
updating if needed.
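
The distinction drawn in this comment (NMI vs. transmit timeout vs. plain link up/down) can be checked mechanically against a log. The following is only a rough Python sketch: the category names and the functions `classify` and `summarize` are invented for illustration, and the patterns are taken solely from the message text quoted in this bug, so other driver versions may word things differently.

```python
import re
from collections import Counter

# (category, pattern) pairs; the patterns match the message text quoted
# in this bug report. The category names are illustrative assumptions.
PATTERNS = [
    ("nmi", re.compile(r"NMI received")),
    ("tx_timeout", re.compile(r"NETDEV WATCHDOG: \S+: transmit timed out")),
    ("link_down", re.compile(r"NIC Link is Down")),
    ("link_up", re.compile(r"NIC Link is Up")),
]


def classify(line):
    """Return the category of a log line, or None if it matches nothing."""
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines fall into each category."""
    return Counter(c for c in map(classify, lines) if c is not None)


if __name__ == "__main__":
    sample = [
        "Uhhuh. NMI received. Dazed and confused, but trying to continue",
        "NETDEV WATCHDOG: eth0: transmit timed out",
        "e1000: eth0 NIC Link is Up 100 Mbps Full Duplex",
        "e1000: eth0: e1000_watchdog: NIC Link is Down",
    ]
    print(summarize(sample))
```

A log that shows only `link_up` and `link_down` counts, with no `nmi` or `tx_timeout`, corresponds to the later reports in this bug rather than to the original one.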

Comment 14 John W. Linville 2004-11-08 14:31:31 UTC
Putting this in NEEDINFO until I hear some results of the latest patch...

Comment 15 John W. Linville 2004-11-22 21:17:43 UTC
Created attachment 107241 [details]
e1000-5_5_4-k2--rhel3.patch

I think the last patch was busted -- try this instead...

Comment 16 John W. Linville 2005-01-10 15:20:19 UTC
Any word as to the effectiveness of the above patch?

Comment 17 John W. Linville 2005-03-15 14:48:28 UTC
I'm closing this due to lack of response.

Newer RHEL releases have the 5.6.10.1-k2 e1000 driver.  Please attempt
to recreate the problem with the latest available RHEL3 update and
reopen if the problem persists.