From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2) Gecko/20040803 Galeon/1.3.17

Description of problem:
The e1000 seems to be locking up a couple of times a day on this Power Edge 2600. It was previously running RH 7.3 and had no problems but was upgraded to RHEL3U2+errata. Console says this:

Uhhuh. NMI received. Dazed and confused, but trying to continue
You probably have a hardware problem with your RAM chips
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
e1000: eth0 NIC Link is Down
e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
NETDEV WATCHDOG: eth0: transmit timed out

System passed all hardware diagnostics and memory tests.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-15.0.4.EL

How reproducible:
Didn't try

Additional info:
[root@intheair tjb]# ethtool -i eth0
driver: e1000
version: 5.2.30.1-k1
firmware-version: N/A
bus-info: 03:01.0
[root@intheair tjb]#
"You probably have a hardware problem with your RAM chips" -- seems telling even if it is passing diags. Intel does the upstream maintenance of this driver, so they probably have a good idea of what may cause such a problem. Did you try using a different card? RHEL3U3 should be available very soon. It contains an update of the e1000 driver to version 5.2.52k1. It would be worth trying again after the upgrade as well. Please let me know if the problem persists after the upgrade to U3.
RHEL3 U3 is already available. The advisory is RHBA-2004:433. Thomas, as John wrote previously, please let us know how things go with U3. Thanks. -ernie
I installed U3 this morning and got the same errors. The e1000 is on board this Dell PowerEdge 2600 so we won't be able to try another e1000. I had a similar problem with a Dell Precision 650 a while back with Fedora Core 1: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=115877 It was fixed by a later kernel. If it was a problem with the memory, I would expect other problems with the system but there haven't been any.
All the patches referred to in bug 115877 are already present (verbatim) in the RHEL3 U3 kernel. Not sure where to go w/ this...will likely ping the Intel guys... In the meantime, it might be useful if you could attach the results of running sysreport on the failing system. Thanks in advance!
The NMI is a system hardware problem and possibly not related to a NIC/driver problem. When NMIs happen in the system, system integrity can no longer be assured. Any problems that devices or drivers are having after an NMI might not really be happening. Please try to repro the "netdev watchdog" hangs after the NMI has been fixed. We can't look at this until the NMI has been corrected.
Thomas, any chance you can get a recreate w/o an NMI?
We installed a 3c59x in this system and we don't have any more NMIs or ethernet problems. (Note that as I mentioned above we were previously running RH 7.3 without any e1000 or NMI problems.) Just noticed today the same lockups on another system that was just upgraded to U3:

e1000: eth0: e1000_watchdog: NIC Link is Down
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
e1000: eth0: e1000_watchdog: NIC Link is Down
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
e1000: eth0: e1000_watchdog: NIC Link is Down
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
e1000: eth0: e1000_watchdog: NIC Link is Down
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex

[root@bertha tjb]# ethtool -i eth0
driver: e1000
version: 5.2.52-k3
firmware-version: N/A
bus-info: 04:01.0
[root@bertha tjb]#
Using the latest RHEL3U4 kernel (2.4.21-23.EL) on ia64 I get the following:

Nov 3 13:09:32 bull1 sshd(pam_unix)[3490]: session opened for user root by (uid=0)
Nov 3 13:11:18 bull1 kernel: ip_tables: (C) 2000-2002 Netfilter core team
Nov 3 13:11:19 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Nov 3 13:11:20 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Nov 3 13:11:23 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Nov 3 13:11:25 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Nov 3 13:11:28 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Nov 3 13:11:34 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
Nov 3 13:11:37 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Nov 3 13:11:38 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down

[root@bull1 network-scripts]# ethtool -i eth1
driver: e1000
version: 5.3.19-k2-NAPI
firmware-version: N/A
bus-info: 1f:01.0
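For anyone wanting to quantify the flapping in a log like the one above, here is a minimal Python sketch that pairs each "Link is Down" with the next "Link is Up" on the same interface and reports the gap in seconds. The function names (`link_events`, `down_durations`) are made up for illustration, the syslog timestamps carry no year so the `year=2004` default is an assumption, and `%b` parsing assumes an English/C locale.

```python
import re
from datetime import datetime

# Matches the e1000_watchdog link messages in the syslog format quoted above.
LINK_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: e1000: (\w+): "
    r"e1000_watchdog: NIC Link is (Up|Down)"
)

def link_events(lines, year=2004):
    """Return (timestamp, interface, state) for every link message.

    Syslog lines have no year, so `year` is an assumed value.
    """
    events = []
    for line in lines:
        m = LINK_RE.match(line)
        if m:
            when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((when, m.group(2), m.group(3)))
    return events

def down_durations(events):
    """Seconds between each Down event and the next Up on the same interface."""
    went_down = {}
    durations = []
    for when, iface, state in events:
        if state == "Down":
            went_down[iface] = when
        elif iface in went_down:
            durations.append((when - went_down.pop(iface)).total_seconds())
    return durations

# The log excerpt from the comment above, one message per line.
SAMPLE = [
    "Nov 3 13:09:32 bull1 sshd(pam_unix)[3490]: session opened for user root by (uid=0)",
    "Nov 3 13:11:18 bull1 kernel: ip_tables: (C) 2000-2002 Netfilter core team",
    "Nov 3 13:11:19 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex",
    "Nov 3 13:11:20 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down",
    "Nov 3 13:11:23 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex",
    "Nov 3 13:11:25 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down",
    "Nov 3 13:11:28 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex",
    "Nov 3 13:11:34 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down",
    "Nov 3 13:11:37 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex",
    "Nov 3 13:11:38 bull1 kernel: e1000: eth1: e1000_watchdog: NIC Link is Down",
]

if __name__ == "__main__":
    print(down_durations(link_events(SAMPLE)))  # [3.0, 3.0, 3.0]
```

On this excerpt every completed outage lasts 3 seconds; the final Down has no matching Up in the excerpt, so it is not counted. The sketch only summarizes what the log says and does not by itself point to a cause.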
Doesn't look too good... This is a little out of my hands, since the hardware documentation is unavailable. About all I can do is to ping the Intel guys and keep up with the updates upstream. I'll put together a patch for RHEL3-U4 to get up-to-date w/ upstream...
Created attachment 106187: e1000-update-5_5_4_k2.patch

Backport of e1000 driver version 5.5.4-k2 to RHEL3 U4... I'd love to hear if this helps...
Guys,

The messages you are showing in the last few updates just show the watchdog routine detecting that link is down. The PRO/1000 hardware has a link status change interrupt which normally reports that link is lost (or has come up, for that matter). It doesn't look like you are seeing that.

The message from the first note is:

NETDEV WATCHDOG: eth0: transmit timed out

which would indicate some sort of driver/HW issue. The last few notes above do not show transmits timing out. Since NMIs were happening, there is no way to tell what state the actual HW was in. I've never heard of or seen a case where our adapter (especially a LOM) would cause an NMI. Never. So I asked for this to be repro'd without the NMI.

Now the only thing the log is showing is that link is coming up and down for some reason. A PRO/1000 NIC could be plugged into the system to see if it is also seeing this issue. I assume these messages are being pulled from /var/log/messages?

You could just try our new drivers without having to port them. We have standalone versions on both support.intel.com and at sf.net/projects/e1000.

Since link is coming up and down, I think something is strange with the network, like cabling, switch, etc. Also, have you guys tried the latest BIOS for the 2600? We have seen strange things in the past due to BIOS. It's worth checking and updating if needed.
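To make the distinction above concrete (transmit timeouts versus plain link up/down notices versus NMIs), here is a rough Python sketch that tallies each kind of message in a set of log lines. The patterns are taken from the messages quoted in this report; the `classify` function and category names are hypothetical helpers, not part of any driver or tool.

```python
import re
from collections import Counter

# Patterns based on the kernel messages quoted in this bug report.
# Both the older ("e1000: eth0 NIC Link ...") and newer
# ("e1000: eth0: e1000_watchdog: NIC Link ...") formats are accepted.
PATTERNS = {
    "nmi": re.compile(r"NMI received"),
    "tx_timeout": re.compile(r"NETDEV WATCHDOG: (\w+): transmit timed out"),
    "link_up": re.compile(r"e1000: (\w+):? (?:e1000_watchdog: )?NIC Link is Up"),
    "link_down": re.compile(r"e1000: (\w+):? (?:e1000_watchdog: )?NIC Link is Down"),
}

def classify(lines):
    """Count how many lines fall into each message category."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # each line belongs to at most one category
    return counts

# Console output from the original report, one message per line.
CONSOLE = [
    "Uhhuh. NMI received. Dazed and confused, but trying to continue",
    "You probably have a hardware problem with your RAM chips",
    "NETDEV WATCHDOG: eth0: transmit timed out",
    "e1000: eth0 NIC Link is Up 100 Mbps Full Duplex",
    "e1000: eth0 NIC Link is Down",
    "e1000: eth0 NIC Link is Up 100 Mbps Full Duplex",
    "NETDEV WATCHDOG: eth0: transmit timed out",
    "e1000: eth0 NIC Link is Up 100 Mbps Full Duplex",
    "NETDEV WATCHDOG: eth0: transmit timed out",
    "e1000: eth0 NIC Link is Up 100 Mbps Full Duplex",
    "NETDEV WATCHDOG: eth0: transmit timed out",
    "e1000: eth0 NIC Link is Up 100 Mbps Full Duplex",
    "NETDEV WATCHDOG: eth0: transmit timed out",
]

if __name__ == "__main__":
    print(dict(classify(CONSOLE)))
```

Run against the first note's console output this reports one NMI, five transmit timeouts, five link-up and one link-down message. The later excerpts in this thread would show only link_up/link_down counts, which is the difference being pointed out. To scan a real log, the same function could be fed the lines of /var/log/messages, assuming that is where the messages are collected.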
Putting this in NEEDINFO until I hear some results of the latest patch...
Created attachment 107241: e1000-5_5_4-k2--rhel3.patch

I think the last patch was busted -- try this instead...
Any word as to the effectiveness of the above patch?
I'm closing this due to lack of response. Newer RHEL releases have the 5.6.10.1-k2 e1000 driver. Please attempt to recreate the problem with the latest available RHEL3 update and reopen if the problem persists.