Bug 75680

Summary: Intel e1000 driver on 2.4.18-14smp (kernel panic)
Product: [Retired] Red Hat Linux
Reporter: Marius Hjelle <marius.hjelle>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED CURRENTRELEASE
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 8.0
CC: scott.feldman, signal, woodard
Hardware: athlon
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:40:02 UTC
Attachments: trace (flags: none)

Description Marius Hjelle 2002-10-10 23:42:27 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Description of problem:
Hi
We have a system with an Intel gigabit network controller, PWLA8490T, 
identified by /sbin/lspci as Intel Corp. 82543GC Gigabit Ethernet Controller 
(rev 02). Over the last few days the system has halted several times. The 
server runs headless, and only today did I manage to capture part of the trace 
from the kernel panic.

CALL TRACE:
[<c0218c00>] ip_local_deliver_finish [kernel] 0x0 (0xc0c36fe74))
[<f88ca24d>] e1000_reset [e1000] 0x59 (0xc036fe90))
[<f88ca1da>] e1000_down [e1000] 0x5a (0xc36fec0))

Linux detail:
Linux version 2.4.18-14smp (bhcompile.redhat.com) (gcc version 3.2 
20020903 (Red Hat Linux 8.0 3.2-7)) #1 SMP Wed Sep 4 11:55:37 EDT 2002

Uname -a = Linux localhost 2.4.18-14smp #1 SMP Wed Sep 4 11:55:37 EDT 2002 i686 
athlon i386 GNU/Linux

Hardware: this is an Asus A7M266-D motherboard, 2x athlon 2000+, 1 GB ecc ram.

Short history: 
This system has been operational since June this year, running Red Hat Linux 
7.3, first with the Red Hat provided drivers and later with the driver provided 
by Intel (e1000-4.2.17). The system ran stably, although network performance 
was poor.

Last Saturday (October 6th) we upgraded Red Hat to the current release; the 
first crash was on Tuesday. After another crash on Wednesday I replaced the NIC 
driver with the one provided by Intel (e1000-4.3.15). Only this afternoon did I 
manage to see the actual trace, and I copied down the first lines before 
swapping in another NIC to make the server operational.

What I really need to know is whether this problem is related to the hardware, 
or whether it comes only from the current drivers and/or in relation to Red Hat 
8.0.


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. set up a system with the above mentioned specifications
2. boot up
3. run some network services (and probably wait some hours)
	

Actual Results:  system crash

Expected Results:  no system crash

Additional info:

Comment 1 Arjan van de Ven 2002-10-11 13:50:50 UTC
I assume you're not running the iANS stuff ?

Comment 2 Marius Hjelle 2002-10-11 14:36:10 UTC
Yes, that is right

Comment 3 Scott Feldman 2002-10-15 19:56:37 UTC
Both the redhat 8.0 and 4.3.15 drivers have the same problem: if we get called 
to run our tx_timeout routine to reset the card, we'll panic because of a bug 
I introduced [BUG() running msec_delay in_interrupt during tx_timeout timer 
call back!]  In a couple of weeks, we'll have an updated driver on the Intel 
web site that has a fix for this; the fix has already been applied to the 2.4 
and 2.5 kernel drivers.

The real question is: why are we getting into tx_timeout????  There is a known 
hang with 82543 when using RxIntDelay, but that's turned off in these 
drivers.  We shouldn't be in tx_timeout.
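
The failure mode described in this comment can be modeled in a few lines of userspace C. This is only an illustrative sketch: every sim_ name is a hypothetical stand-in rather than a kernel or e1000 symbol, BUG() is replaced by a counter so nothing aborts, and the call chain is assumed to be the one reported later in this bug (tx_timeout -> down -> reset -> reset_hw -> msec_delay).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are real kernel symbols. */
bool sim_in_interrupt = false;  /* models the result of in_interrupt() */
int  sim_bug_hits = 0;          /* times the real driver would hit BUG() */
int  sim_sleeps = 0;            /* times the real driver would sleep */

/* Models the msec_delay policy: sleeping is only allowed in process
 * context, so a call from interrupt context is treated as fatal. */
void sim_msec_delay(unsigned int ms)
{
    assert(ms > 0);
    if (sim_in_interrupt) {
        sim_bug_hits++;         /* real driver: BUG(), i.e. kernel panic */
        return;
    }
    sim_sleeps++;               /* real driver: schedule_timeout() */
}

/* The reset path; the 10 ms delay is an arbitrary illustrative value. */
void sim_reset_hw(void) { sim_msec_delay(10); }
void sim_reset(void)    { sim_reset_hw(); }
void sim_down(void)     { sim_reset(); }

/* The transmit watchdog fires from a timer, i.e. in interrupt context. */
void sim_tx_timeout(void)
{
    sim_in_interrupt = true;
    sim_down();
    sim_in_interrupt = false;
}
```

Calling sim_down() directly (process context) only increments sim_sleeps, while calling sim_tx_timeout() increments sim_bug_hits, which corresponds to the point where the real driver panics.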

Comment 4 Jeff Garzik 2002-10-25 18:12:41 UTC
Assigned to arjan, for integrating the fixed e1000 into rawhide/8.0 errata.
Also added CC to Scott Feldman @ Intel in case he wants to pursue the further
issue of why tx_timeouts are occurring in the first place.


Comment 5 Marius Hjelle 2002-10-25 20:13:27 UTC
Created attachment 82127 [details]
trace

Comment 6 Marius Hjelle 2002-10-25 20:19:20 UTC
This last attachment is from running the latest kernel release 
(2.4.18-17.8.0smp). Reverting to the Intel 4.2.17 driver keeps the system from 
crashing.

Comment 7 Ben Woodard 2002-11-13 17:17:16 UTC
We are having the same problem here at LLNL.

Comment 8 Ben Woodard 2002-11-13 17:20:09 UTC
Here is another report of the same problem. It might yield some useful information:

From: Jim Garlick <garlick>
To: bwoodard
Subject: bug report - e1000 driver
Date: Tue, 12 Nov 2002 09:26:56 -0800 (PST)

Ben -

We've been seeing a BUG() triggered in the e1000 driver.  The call chain is:

  e1000_tx_timeout -> e1000_down -> e1000_reset -> e1000_reset_hw
       -> msec_delay -> BUG()

Under heavy load, this is occasionally triggered and crashes the node.

The attached patch works around the problem by spinning with interrupts
off for longer than probably is sociable, but not long enough to trigger
an NMI watchdog at least (that was enabled during our testing).  It also
may mask other problems that really should trigger a BUG().  Ultimately
I think a better fix is needed...

Could you report this to RH?

Thanks,

Jim

----------------------
RCS file: /chaos/cvs/kernel-rh/linux/drivers/net/e1000/Attic/e1000_osdep.h,v
retrieving revision 1.1.4.1
retrieving revision 1.1.4.3
diff -u -r1.1.4.1 -r1.1.4.3
--- e1000_osdep.h       29 Oct 2002 00:34:34 -0000      1.1.4.1
+++ e1000_osdep.h       12 Nov 2002 00:54:22 -0000      1.1.4.3
@@ -88,8 +88,8 @@
 #define usec_delay(x) udelay(x)
 #ifndef msec_delay
 #define msec_delay(x)  do { if(in_interrupt()) { \
-                               /* Don't mdelay in interrupt context! */ \
-                               BUG(); \
+                               int i; \
+                               for (i = 0; i < (x); i++) udelay(1000); \
                        } else { \
                                set_current_state(TASK_UNINTERRUPTIBLE); \
                                schedule_timeout((x * HZ)/1000); \
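
The behavior of the patched macro can be sketched as an ordinary function in userspace C. This is an approximation for illustration, not the driver code: the patched_ and sim_ names are hypothetical, udelay() is replaced by a recorder so nothing actually spins, and SIM_HZ is assumed to be 100 (the usual i386 value for 2.4 kernels, but an assumption here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_HZ 100  /* assumption: timer ticks per second */

struct sim_delay_stats {
    unsigned long busy_wait_usecs;  /* total requested from udelay() */
    unsigned long sleep_jiffies;    /* total passed to schedule_timeout() */
};

/* Stand-in for udelay(): records the request instead of spinning. */
void sim_udelay(struct sim_delay_stats *st, unsigned long usecs)
{
    st->busy_wait_usecs += usecs;
}

/* Function form of the patched msec_delay(x) macro from the diff above. */
void patched_msec_delay(struct sim_delay_stats *st, bool in_interrupt,
                        unsigned int ms)
{
    unsigned int i;

    assert(st != NULL);
    if (in_interrupt) {
        /* Patched branch: spin 1 ms at a time instead of calling BUG(). */
        for (i = 0; i < ms; i++)
            sim_udelay(st, 1000);
    } else {
        /* Unchanged branch: sleep for (ms * HZ) / 1000 jiffies. */
        st->sleep_jiffies += ((unsigned long)ms * SIM_HZ) / 1000;
    }
}
```

With these assumptions, a 10 ms delay requested from interrupt context becomes 10,000 microseconds of busy-waiting, which is the cost the report describes as "longer than probably is sociable"; the same request from process context is converted to jiffies and slept instead.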

Comment 9 Ben Woodard 2002-11-13 19:00:22 UTC
Arjan can you also do a 7.x errata kernel for this one?
If not could you please tell me when this hits rawhide so that I can grab the
changes and merge them with the kernel that we have here?
I'll see how effectively we can reproduce the problem here. Hopefully we can
provide Scott with an easy way to manifest the problem.

Comment 10 Scott Feldman 2002-11-19 02:14:11 UTC
The errata kernel needs to be updated to use the 4.4.12-k1 driver from the 
2.4.20-rc2 kernel.  This driver has the fix for this bug.  The files in 
drivers/net/e1000 should be a drop-in replacement for the previous driver.

The 4.4.12 driver is also available from Intel's support web site.

Comment 11 Bugzilla owner 2004-09-30 15:40:02 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/