102926 – (NET E100) e100 network failure on RedHat9 machines running 2.4.20-18.9smp

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	9
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Jeff Garzik
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-08-22 18:15 UTC by Charles Long
Modified:	2013-07-03 02:14 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-09-30 15:41:27 UTC
Embargoed:

Description Charles Long 2003-08-22 18:15:42 UTC

Description of problem:
When using the e100 ethernet driver with the Intel Ethernet Pro 100 chipset, the
network shuts down but the interface is not brought down.  The interface
continues to be recognized (it is visible in ifconfig, and processes such as NFS
that are in use at the time of failure do not recognize the failure), but new
processes started after the failure cannot access the network: bind, nfs and
autofs fail, and the machine cannot ping or resolve host names.  We cannot
identify conditions that reproduce the failure, but it occurs only on machines
upgraded to Red Hat 9.  The failure is resolved by restarting the network init
script and reloading the failed services (bind, autofs).  The problem ceases when
the card is replaced.  This does not require the existing interface to be
removed or disabled in the BIOS, or its existing configuration to be removed;
all new configurations are moved to eth1 and eth0 is not brought up by the
network scripts.


Version-Release number of selected component (if applicable):
Red Hat 9 with 2.4.20-18.9smp kernel

How reproducible:
Occurrence seems to be random and not dependent on network load, but we have
seen it on several machines with identical hardware.


Steps to Reproduce:
1.
2.
3.
    
Actual results:
network failure.

Expected results:


Additional info:

Comment 1 Charles Long 2003-11-19 01:17:23 UTC

This failure occurs only with Asus CUV4X-DLS motherboards with
integrated ethernet.  Sorry to say, not all of these machines
encounter the problem, even though they are otherwise
identical, including BIOS version.

Comment 2 Rex Dieter 2004-02-01 04:09:46 UTC

I'm seeing a similar type of failure on a server recently upgraded
from rh62.  In my case, it's a Supermicro board, with plain-jane e100
pci adapter, rh9, kernel-2.4.20-28.9(i686), samba/nfs server (with
decent, but not astronomical usage/load).  Server will be running
along nicely, sometimes for hours or even days, and suddenly eth0
(appears to?) drop *all* packets.  No tcp, pings, nada (though I still
see link-lights on the card)
ifdown eth0; rmmod e100; ifup eth0
seems to restore normal operation.
During failure, nothing seems to be logged to /var/log/messages.

ideas?  hints/pointers?  I wouldn't have thought upgrading rh62->rh9
would have left me with a *less* reliable server... )-:
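
For anyone who wants to script the manual workaround from the comment above, here is a minimal sketch, not a fix. The interface and module names come from this report; the gateway address, the variable names and the `RUN` dry-run switch are assumptions made up for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. IFACE and MODULE are taken from this report; GATEWAY is a
# hypothetical address (documentation range) that must be replaced.
IFACE=${IFACE:-eth0}
MODULE=${MODULE:-e100}
GATEWAY=${GATEWAY:-192.0.2.1}
# Dry run by default: commands are echoed. Set RUN to an empty string
# (RUN= ) to execute them for real.
RUN=${RUN-echo}

reset_nic() {
    # Same three steps as the manual workaround: take the interface down,
    # unload the driver module, bring the interface back up (ifup is
    # expected to reload the module).
    $RUN ifdown "$IFACE"
    $RUN rmmod "$MODULE"
    $RUN ifup "$IFACE"
}

check_and_reset() {
    # Reset only when the gateway stops answering pings.
    if ! ping -c 3 -w 10 "$GATEWAY" >/dev/null 2>&1; then
        reset_nic
    fi
}
```

Calling `check_and_reset` periodically (for example from cron, as root, with `RUN=` set) would approximate the manual recovery. It hides the symptom rather than explaining it, so it is worth saving the statistics requested later in this thread before resetting.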

Comment 3 Rex Dieter 2004-02-01 06:05:17 UTC

I've installed/upgraded-to Intel's e100-2.3.33 driver to see if that
helps any.

Comment 4 Scott Feldman 2004-03-04 03:38:49 UTC

Did e100-2.3.33 help?

There is a newer version of e100 at http://sf.net/projects/e1000, 
version 3.0.15.  I'd like to know if that driver fixes Charles' 
up/down interface issue and Rex's packet drop issue.

Comment 5 Rex Dieter 2004-03-04 04:21:10 UTC

e100-2.3.33 didn't help any (same behavior).

Comment 6 Scott Feldman 2004-03-04 16:21:15 UTC

Rex, with 2.3.33, you can dump the NIC stats using ethtool -S eth<X>.
Would you check to see if stat "rx_tco_packets" is non-zero after the
hang?

Also, would you attach the output of lspci -n?

Thanks.
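
A small helper along these lines could pull a single counter out of the statistics dump. It is a sketch that assumes the usual `name: value` layout of `ethtool -S` output; the function name `stat_value` is made up for illustration.

```shell
#!/bin/sh
stat_value() {
    # Reads `ethtool -S` output on stdin and prints the value of the
    # statistic named in $1. Prints nothing if the stat is not present.
    awk -v name="$1" -F': *' '
        { sub(/^[ \t]+/, "", $1) }   # strip the leading indentation
        $1 == name { print $2 }
    '
}
```

Usage after a hang would be `ethtool -S eth0 | stat_value rx_tco_packets`; any non-zero result is the condition asked about above.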

Comment 7 Rex Dieter 2004-03-04 19:25:28 UTC

# lspci -n
00:00.0 Class 0600: 8086:7190 (rev 03)
00:01.0 Class 0604: 8086:7191 (rev 03)
00:07.0 Class 0601: 8086:7110 (rev 02)
00:07.1 Class 0101: 8086:7111 (rev 01)
00:07.2 Class 0c03: 8086:7112 (rev 01)
00:07.3 Class 0680: 8086:7113 (rev 02)
00:0f.0 Class 0100: 1119:011a
00:10.0 Class 0200: 8086:1229 (rev 08)
00:14.0 Class 0104: 1103:0008 (rev 07)
00:14.1 Class 0104: 1103:0008 (rev 07)
01:00.0 Class 0300: 102b:0521 (rev 01)

Comment 8 Rex Dieter 2004-03-04 19:30:30 UTC

I'm trying out e100-2.3.38 + kernel-2.4.20-30.9 now.  Is it worth
trying the developmental e100-3.0.15 driver from sourceforge?

Unfortunately, if I'm not at the console within minutes after the
hang, the machine quickly becomes completely unresponsive even on
the console.  I'll do my best to catch it and do the ethtool -S eth0
when/if it happens again.

Comment 9 Scott Feldman 2004-03-04 21:24:18 UTC

Yes, try the e100-3.0.15 driver.  It's really our focus right now, so 
if this driver has a problem on your system, we'd like to fix the 
problem in that driver.

Also, I asked for the lspci dump to see which pro/100 controller you 
actually had.  You have an 82558 part, which is good because it's 
pretty basic and doesn't have a lot of fancy features.  It should NOT 
stop!  Let's see if 3.0.15 likes it better.
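
To pick the network controller out of an `lspci -n` dump like the one above, a filter on PCI class 0200 (Ethernet controller) is enough. This is a sketch with a made-up function name, `ethernet_ids`; it prints only the slot and the vendor:device pair, and mapping that pair (plus the revision) to a specific chip is left to the reader.

```shell
#!/bin/sh
ethernet_ids() {
    # Reads `lspci -n` output on stdin and prints "slot vendor:device"
    # for every device in PCI class 0200 (Ethernet controller).
    awk '{
        # Older pciutils print "Class 0200:", newer ones print "0200:".
        i = ($2 == "Class") ? 3 : 2
        if ($i == "0200:") print $1, $(i + 1)
    }'
}
```

Run as `lspci -n | ethernet_ids`. On the dump posted in this thread it would select the `00:10.0` entry.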

Comment 10 Bugzilla owner 2004-09-30 15:41:27 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
