Bug 166786 - Dell onboard e1000 stops receiving packets from some hosts after 30 minutes
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 4
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: John W. Linville
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2005-08-25 18:20 UTC by Eric Z. Ayers
Modified: 2007-11-30 22:11 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-09-15 15:01:54 UTC
Type: ---
Embargoed:


Attachments
Output of running 'sysreport' (380.72 KB, application/octet-stream)
2005-08-30 17:15 UTC, Eric Z. Ayers

Description Eric Z. Ayers 2005-08-25 18:20:35 UTC
Description of problem:

I am a long-time Linux user in a corporate environment.  This is a weird problem.

We just got a new Dell PowerEdge 1800 with dual Xeon 3.0 GHz processors
and an onboard Intel e1000 NIC.  The adapter is plugged into a 100Mbit
Cisco switch running full duplex.  I installed Fedora Core 4 on the machine on
Tuesday.

After running for about 30 minutes, the machine can no longer see
some packets.  For the most part, the problem is with packets from outside
of our subnet.  We have a weird subnet with 10 host bits (netmask 255.255.252.0,
i.e. a /22) in our network, if that makes any difference.

1) Rebooting the machine fixes the problem (for about 30 minutes).
2) Some machines that go through the router are still visible (192.168.x.x IP
addresses).
3) Nothing looks "funny" about the routing table.
4) Turning off the onboard e1000 NIC and replacing it with a 3c905 network
card makes the problem go away.
5) The problem occurred with both the stock Fedora Core 4 SMP and uniprocessor
kernels.  I went through hell rebooting repeatedly until all packages could
be downloaded using 'yum'.  Even with the updated kernel the problem persists.
6) I just downloaded Intel's latest ethernet driver, and the problem persists.

ACPI: PCI Interrupt 0000:02:05.0[A] -> GSI 37 (level, low) -> IRQ 201
3c59x: Donald Becker and others. www.scyld.com/network/vortex.html
0000:02:05.0: 3Com PCI 3c905C Tornado at 0xec80. Vers LK1.1.19
Intel(R) PRO/1000 Network Driver - version 6.1.16-NAPI
Copyright (c) 1999-2005 Intel Corporation.
ACPI: PCI Interrupt 0000:03:07.0[A] -> GSI 69 (level, low) -> IRQ 209
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection


This has been going on for about 2 days.

$ sudo /sbin/ethtool eth0
Password:
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: d
        Current message level: 0x00000007 (7)
        Link detected: yes

$ /sbin/ifconfig -a

# This is the 3c905 card - not hooked up at the moment
dev20271  Link encap:Ethernet  HWaddr 00:50:DA:60:1F:2C
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:201 Base address:0xec80

# This is the e1000 nic
eth0      Link encap:Ethernet  HWaddr 00:14:22:0B:62:1E
          inet addr:158.155.4.123  Bcast:158.155.7.255  Mask:255.255.252.0
          inet6 addr: fe80::214:22ff:fe0b:621e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1621932 errors:0 dropped:0 overruns:0 frame:0
          TX packets:855106 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2277444127 (2.1 GiB)  TX bytes:79514599 (75.8 MiB)
          Base address:0xdcc0 Memory:dfbe0000-dfc00000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:40859 errors:0 dropped:0 overruns:0 frame:0
          TX packets:40859 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19204150 (18.3 MiB)  TX bytes:19204150 (18.3 MiB)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

$ netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
158.155.4.0     0.0.0.0         255.255.252.0   U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
0.0.0.0         158.155.4.1     0.0.0.0         UG        0 0          0 eth0
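
As a sanity check on the addressing, here is a minimal sketch using Python's standard ipaddress module (the helper name is_on_link is only illustrative). It takes the eth0 address and netmask from the ifconfig output above and shows that 158.155.2.3, the host used in the tests that follow, is outside the local network, so its traffic has to go through the default gateway 158.155.4.1:

```python
import ipaddress

# eth0 address and netmask exactly as reported by ifconfig above.
iface = ipaddress.ip_interface("158.155.4.123/255.255.252.0")
local_net = iface.network


def is_on_link(addr: str) -> bool:
    """Return True if addr is inside the directly connected network."""
    return ipaddress.ip_address(addr) in local_net


print(local_net)                    # 158.155.4.0/22
print(local_net.broadcast_address)  # 158.155.7.255
print(is_on_link("158.155.4.1"))    # True  (the default gateway)
print(is_on_link("158.155.2.3"))    # False (reached via the gateway)
```

The computed broadcast address agrees with the Bcast value ifconfig reports, so the netmask itself looks consistent.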


Think I'm nuts and it is something in the network?  Look at this. ntp and ping
say the remote isn't responding:

eric@bass2:/bass/home/eric$ /usr/sbin/ntpq -p 158.155.2.3
158.155.2.3: timed out, nothing received
***Request timed out
eric@bass2:/bass/home/eric$ date
Thu Aug 25 14:01:54 EDT 2005
eric@bass2:/bass/home/eric$ ping 158.155.2.3
PING 158.155.2.3 (158.155.2.3) 56(84) bytes of data.

--- 158.155.2.3 ping statistics ---
9 packets transmitted, 0 received, 100% packet loss, time 7999ms

eric@bass2:/bass/home/eric$ date
Thu Aug 25 14:02:08 EDT 2005

But look what tcpdump says on the same machine - the packets are getting to
the remote machine and coming back, but for some reason, the replies are
being ignored.

$ sudo /usr/sbin/tcpdump -n -i eth0 host 158.155.2.3
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
14:01:42.388815 IP 158.155.4.123.32904 > 158.155.2.3.ntp: NTPv2, Reserved, length 12
14:01:42.389531 IP 158.155.2.3.ntp > 158.155.4.123.32904: NTPv2, Reserved, length 20
14:01:47.388353 IP 158.155.4.123.32904 > 158.155.2.3.ntp: NTPv2, Reserved, length 12
14:01:47.388839 IP 158.155.2.3.ntp > 158.155.4.123.32904: NTPv2, Reserved, length 20
14:01:58.309397 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 0
14:01:58.311005 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 0
14:01:59.309148 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 1
14:01:59.309914 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 1
14:02:00.309431 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 2
14:02:00.310200 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 2
14:02:01.309685 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 3
14:02:01.310359 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 3
14:02:02.308975 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 4
14:02:02.309769 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 4
14:02:03.309294 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 5
14:02:03.310055 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 5
14:02:04.309427 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 6
14:02:04.310214 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 6
14:02:05.308736 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 7
14:02:05.309500 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 7
14:02:06.308957 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 8
14:02:06.309660 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 8
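
For longer captures, the pairing of echo requests and replies can be checked mechanically instead of by eye. This is only a rough sketch: it assumes tcpdump lines in the same format as above, and the function name unanswered and the short sample list are made up for illustration.

```python
import re

# Matches lines such as:
# 14:01:58.309397 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 0
ICMP_LINE = re.compile(
    r"IP (?P<src>\S+) > (?P<dst>\S+): icmp \d+: "
    r"echo (?P<kind>request|reply) seq (?P<seq>\d+)"
)


def unanswered(lines, local, remote):
    """Return sequence numbers of echo requests with no reply on the wire."""
    requests, replies = set(), set()
    for line in lines:
        m = ICMP_LINE.search(line)
        if m is None:
            continue  # not an ICMP echo line (e.g. the NTP packets)
        seq = int(m.group("seq"))
        src, dst, kind = m.group("src"), m.group("dst"), m.group("kind")
        if kind == "request" and (src, dst) == (local, remote):
            requests.add(seq)
        elif kind == "reply" and (src, dst) == (remote, local):
            replies.add(seq)
    return sorted(requests - replies)


# A few lines copied from the capture above.
sample = [
    "14:01:58.309397 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 0",
    "14:01:58.311005 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 0",
    "14:01:59.309148 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 1",
    "14:01:59.309914 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 1",
]

print(unanswered(sample, "158.155.4.123", "158.155.2.3"))  # []
```

An empty list means every request seen on the wire had a matching reply, which is what the capture shows even though ping reports 100% loss.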


Version-Release number of selected component (if applicable):
kernel-smp-2.6.12-1.1398_FC4


How reproducible:
Every time.  After each reboot, packets from off-subnet hosts stop arriving after about 30 minutes.

Steps to Reproduce:

1. Reboot and work normally.
2. After about half an hour, replies from hosts outside the subnet are no longer received.

  
Actual results:


Expected results:


Additional info:

I saw another Bugzilla bug on the e1000, but its workaround didn't help:

# BUGZILLA Bug ID 149887 - Workaround for problem with e1000 adapters
# 24 Aug 2005 -EZA
# Linux bass2.compgen.com 2.6.12-1.1398_FC4smp #1 SMP Fri Jul 15 01:30:13 EDT 2005 i686 i686 i386 GNU/Linux
/sbin/ethtool -K eth0 rx off tx off

Comment 1 Eric Z. Ayers 2005-08-25 18:52:03 UTC
The IT guy commented that there is a firewall in between 158.155.2.3 and
158.155.4.1 (the default route) which is common to the systems we are having
trouble with.  Still, it doesn't explain why the problem is not reproducible
when we switch to the 3c905 NIC.

Comment 2 John W. Linville 2005-08-29 18:06:12 UTC
It is difficult to know where to start...please attach the output of running 
"sysreport"...thanks! 

Comment 3 Eric Z. Ayers 2005-08-30 17:15:22 UTC
Created attachment 118257
Output of running 'sysreport'

Comment 4 Eric Z. Ayers 2005-08-30 17:24:45 UTC
FYI, I did just update the kernel to 2.6.12-1.1447_FC4smp - same problem. 
The network went out while I was running 'sysreport' above.

Comment 5 John W. Linville 2005-09-09 12:55:11 UTC
Perhaps there is an auto-negotiation problem?  I have occasionally seen or 
heard of problems like this that go away when a fixed port configuration is 
used. 
 
Could you force the link speed to 1000/Full (or whatever is appropriate) at 
the switch?  For good measure, you should also set ETHTOOL_OPTS 
in /etc/sysconfig/network-scripts/ifcfg-ethX: 
 
   ETHTOOL_OPTS="speed 1000 duplex full autoneg off" 
 
Modify that as appropriate if not using 1000/full, of course. 
 
Could you give that a try and report the results...thanks! 
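
One way to confirm after a reboot that the forced settings actually took effect is to check the ethtool output. The following is a minimal sketch, assuming output in the same "Key: value" layout as the ethtool listing in the description; the function names and the sample text are illustrative, not real output from this machine.

```python
def parse_ethtool(text):
    """Collect the 'Key: value' lines of ethtool output into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            settings[key.strip()] = value.strip()
    return settings


def link_matches(settings, speed, duplex, autoneg):
    """Compare the reported link state with the configuration we forced."""
    return (
        settings.get("Speed") == speed
        and settings.get("Duplex") == duplex
        and settings.get("Auto-negotiation") == autoneg
    )


# Illustrative sample only: what a port forced to 100/full might report.
sample = """Settings for eth0:
        Speed: 100Mb/s
        Duplex: Full
        Auto-negotiation: off
        Link detected: yes
"""

print(link_matches(parse_ethtool(sample), "100Mb/s", "Full", "off"))  # True
```

In practice the text would come from running /sbin/ethtool on the interface; a mismatch would suggest ETHTOOL_OPTS was not applied or the switch port is still negotiating.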

Comment 6 Eric Z. Ayers 2005-09-09 18:08:50 UTC
The machine goes live in about 1 week.  Folks are getting their feet wet now;
the new hardware replaces one of our mainstay machines running RH Linux 7.3.
I'm waiting for a chance to reboot the machine and re-enable the onboard
controller.  I won't have much of an opportunity to do these kinds of tests
after the server goes live.

We've set the port to full duplex, 100Mbit, replaced a cable that was questionable
(we jiggled it and the switch port re-negotiated), and added the
ETHTOOL_OPTS line to the network interface script.

Comment 7 Eric Z. Ayers 2005-09-12 13:07:42 UTC
No joy after that change.  I rebooted this morning after nailing the port to
100Mbit full duplex and adding the ETHTOOL_OPTS line:

$ uptime
 08:57:43 up 31 min,  2 users,  load average: 0.00, 0.02, 0.06

The problem is exhibiting itself again already.

Comment 8 Eric Z. Ayers 2005-09-14 10:44:20 UTC
Thanks for trying to help me resolve this problem.  We have a workaround
(installing a second NIC) and tonight we are taking the server 'live'.  After
7pm EDT or so, I won't be able to screw around with the onboard NIC without
disrupting business.  If there is something else you can think of to try today,
let me know.

Comment 9 John W. Linville 2005-09-15 15:01:54 UTC
Moving this to CANTFIX due to need for continued testing that the reporter 
will be unable to conduct.  Please reopen if this situation changes. 

