From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Description of problem:
Hi

We have a system with an Intel gigabit network controller PWLA8490T, identified by /sbin/lspci as Intel Corp. 82543GC Gigabit Ethernet Controller (rev 02).

Over the last few days the system has halted several times. The server runs headless, and only today did I manage to capture part of the trace from the kernel panic.

CALL TRACE:
[<c0218c00>] ip_local_deliver_finish [kernel] 0x0 (0xc0c36fe74))
[<f88ca24d>] e1000_reset [e1000] 0x59 (0xc036fe90))
[<f88ca1da>] e1000_down [e1000] 0x5a (0xc36fec0))

Linux detail:
Linux version 2.4.18-14smp (bhcompile.redhat.com) (gcc version 3.2 20020903 (Red Hat Linux 8.0 3.2-7)) #1 SMP Wed Sep 4 11:55:37 EDT 2002

Uname -a = Linux localhost 2.4.18-14smp #1 SMP Wed Sep 4 11:55:37 EDT 2002 i686 athlon i386 GNU/Linux

Hardware: this is an Asus A7M266-D motherboard, 2x Athlon 2000+, 1 GB ECC RAM.

Short history:
This system has been operational since June this year, running RedHat Linux 7.3, first with the RedHat provided drivers and later with the driver provided by Intel (e1000-4.2.17). The system ran stable; however, we had poor network performance.

Last Saturday (October 6th) we upgraded RedHat to the current release. The first crash was on Tuesday. After another crash on Wednesday I replaced the NIC driver with the one provided by Intel (e1000-4.3.15). Only this afternoon did I manage to see the actual trace, and I copied down the first lines before I put in another NIC to make the server operational.

What I really need to know is whether this is a hardware problem, or whether it comes only from the current drivers and/or is related to RedHat 8.0.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. set up a system with the above mentioned specifications
2. boot up
3. run some network services (and probably wait some hours)

Actual Results: system crash

Expected Results: no system crash

Additional info:
I assume you're not running the iANS stuff?
Yes, that is right
Both the Red Hat 8.0 and 4.3.15 drivers have the same problem: if our tx_timeout routine is called to reset the card, we panic because of a bug I introduced. msec_delay hits BUG() when it runs in interrupt context, which is the case during the tx_timeout timer callback. In a couple of weeks we'll have an updated driver on the Intel web site with a fix for this; the fix has already been applied to the 2.4 and 2.5 kernel drivers.

The real question is: why are we getting into tx_timeout at all? There is a known hang with the 82543 when using RxIntDelay, but that's turned off in these drivers. We shouldn't be in tx_timeout.
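To make the failure path concrete, here is a small userspace model of it. This is only a sketch: the model_* functions, the counters, and the 10 ms delay value are hypothetical stand-ins, and the branch in model_msec_delay simply mirrors the msec_delay macro quoted later in this report (BUG() in interrupt context, a schedule_timeout() sleep otherwise).

```c
#include <assert.h>
#include <stdbool.h>

static bool fake_in_interrupt; /* stand-in for the kernel's in_interrupt() */
static int bug_hits;           /* counts would-be BUG() calls */
static int sleeps;             /* counts would-be schedule_timeout() sleeps */

/* Model of the unpatched msec_delay(): it refuses to run in interrupt context. */
static void model_msec_delay(unsigned int ms)
{
    (void)ms;
    if (fake_in_interrupt)
        bug_hits++; /* the real macro calls BUG() here and the kernel panics */
    else
        sleeps++;   /* the real macro sleeps via schedule_timeout() */
}

/* Reset needs a millisecond-scale delay; 10 is an arbitrary value for this model. */
static void model_reset(void) { model_msec_delay(10); }

/* Taking the interface down resets the hardware. */
static void model_down(void) { model_reset(); }

/* The timeout callback runs from a timer, i.e. in interrupt context. */
static void model_tx_timeout(void)
{
    fake_in_interrupt = true;
    model_down();
    fake_in_interrupt = false;
}

/* Example run: the same reset path is fine from process context but not from the timer. */
static void demo(void)
{
    model_down();       /* process context: sleeps, no BUG() */
    assert(bug_hits == 0 && sleeps == 1);
    model_tx_timeout(); /* timer context: would hit BUG() */
    assert(bug_hits == 1 && sleeps == 1);
}
```

The point of the model is that the reset path itself is unchanged between the two calls; only the calling context differs.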
Assigned to arjan, for integrating fixed e1000 into rawhide/8.0 errata. And added CC to Scott Feldman @ Intel in case he wants to pursue further issue of why tx_timeouts are occurring in the first place.
Created attachment 82127: trace
This last attachment is from running the latest kernel release (2.4.18-17.8.0smp). Reverting to the Intel 4.2.17 driver keeps the system from crashing.
We are having the same problem here at LLNL.
Here is another report of the same problem. Might yield some useful information:

From: Jim Garlick <garlick>
To: bwoodard
Subject: bug report - e1000 driver
Date: Tue, 12 Nov 2002 09:26:56 -0800 (PST)

Ben -

We've been seeing a BUG() triggered in the e1000 driver. The call chain is:

e1000_tx_timeout -> e1000_down -> e1000_reset -> e1000_reset_hw -> msec_delay -> BUG()

Under heavy load, this is occasionally triggered and crashes the node.

The attached patch works around the problem by spinning with interrupts off for longer than probably is sociable, but not long enough to trigger an NMI watchdog at least (that was enabled during our testing). It also may mask other problems that really should trigger a BUG(). Ultimately I think a better fix is needed...

Could you report this to RH?

Thanks,
Jim

----------------------
RCS file: /chaos/cvs/kernel-rh/linux/drivers/net/e1000/Attic/e1000_osdep.h,v
retrieving revision 1.1.4.1
retrieving revision 1.1.4.3
diff -u -r1.1.4.1 -r1.1.4.3
--- e1000_osdep.h 29 Oct 2002 00:34:34 -0000 1.1.4.1
+++ e1000_osdep.h 12 Nov 2002 00:54:22 -0000 1.1.4.3
@@ -88,8 +88,8 @@
 #define usec_delay(x) udelay(x)
 #ifndef msec_delay
 #define msec_delay(x) do { if(in_interrupt()) { \
- /* Don't mdelay in interrupt context! */ \
- BUG(); \
+ int i; \
+ for (i = 0; i < (x); i++) udelay(1000); \
 } else { \
 set_current_state(TASK_UNINTERRUPTIBLE); \
 schedule_timeout((x * HZ)/1000); \
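Jim's note asks for a better fix than spinning with interrupts off. One general pattern, sketched here as a userspace model and not taken from the Intel driver or from any fix mentioned in this report, is to have the timeout callback only record that a reset is needed and to perform the sleeping reset later from process context. All names below (struct adapter, reset_pending, slow_reset, process_pending, in_timer_context) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device state for this sketch. */
struct adapter {
    bool reset_pending; /* set from timer context, consumed in process context */
    int resets_done;    /* number of completed resets */
};

static bool in_timer_context; /* stand-in for the kernel's in_interrupt() */

/* Stand-in for a reset that sleeps; it must never run in timer context. */
static void slow_reset(struct adapter *a)
{
    assert(!in_timer_context);
    a->resets_done++;
}

/* Timeout callback: does no slow work, only records the request. */
static void tx_timeout(struct adapter *a)
{
    a->reset_pending = true;
}

/* Runs later in process context and performs any requested reset once. */
static void process_pending(struct adapter *a)
{
    if (a->reset_pending) {
        a->reset_pending = false;
        slow_reset(a);
    }
}
```

With this split the delay inside the reset can sleep as usual, because the reset is no longer reached through the call chain that starts in the timer callback.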
Arjan, can you also do a 7.x errata kernel for this one? If not, could you please tell me when this hits rawhide so that I can grab the changes and merge them with the kernel that we have here? I'll see how effectively we can reproduce the problem here. Hopefully we can provide Scott with an easy way to manifest the problem.
The errata kernel needs to be updated to use the 4.4.12-k1 driver from the 2.4.20-rc2 kernel. This driver has the fix for this bug. The files in drivers/net/e1000 should be a drop-in replacement for the previous driver. The 4.4.12 driver is also available from Intel's support web site.
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/