Description of problem:

I am a long-time Linux user in a corporate environment. This is a weird problem.

We just got a new Dell PowerEdge 1800 with dual Xeon 3.0 GHz processors and an onboard Intel e1000 NIC. The adapter is plugged into a 100Mbit Cisco switch running full duplex. I installed Fedora Core 4 on the machine on Tuesday.

After running for about 30 minutes or so, the machine can no longer see some packets. For the most part, the problem is with packets from outside of our subnet. We have a weird 10-bit subnet (netmask 255.255.252.0) in our network, if that makes any difference.

1) Rebooting the machine fixes the problem (for about 30 minutes).
2) Some machines that go through the router are still visible (192.168.x.x IP addresses).
3) Nothing looks "funny" about the routing table.
4) Turning off the onboard e1000 NIC and replacing it with a 3c905 network card makes the problem go away.
5) The problem occurred with both the stock Fedora Core 4 SMP and uniprocessor kernels. I went through hell rebooting repeatedly until all packages could be downloaded using 'yum'. Even with the updated kernel the problem persists.
6) I just downloaded Intel's latest ethernet driver, and the problem persists.

ACPI: PCI Interrupt 0000:02:05.0[A] -> GSI 37 (level, low) -> IRQ 201
3c59x: Donald Becker and others. www.scyld.com/network/vortex.html
0000:02:05.0: 3Com PCI 3c905C Tornado at 0xec80. Vers LK1.1.19
Intel(R) PRO/1000 Network Driver - version 6.1.16-NAPI
Copyright (c) 1999-2005 Intel Corporation.
ACPI: PCI Interrupt 0000:03:07.0[A] -> GSI 69 (level, low) -> IRQ 209
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection

This has been going on for about 2 days.

$ sudo /sbin/ethtool eth0
Password:
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: d
        Current message level: 0x00000007 (7)
        Link detected: yes

$ /sbin/ifconfig -a
# This is the 3c905 card - not hooked up at the moment
dev20271  Link encap:Ethernet  HWaddr 00:50:DA:60:1F:2C
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:201 Base address:0xec80

# This is the e1000 nic
eth0      Link encap:Ethernet  HWaddr 00:14:22:0B:62:1E
          inet addr:158.155.4.123  Bcast:158.155.7.255  Mask:255.255.252.0
          inet6 addr: fe80::214:22ff:fe0b:621e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1621932 errors:0 dropped:0 overruns:0 frame:0
          TX packets:855106 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2277444127 (2.1 GiB)  TX bytes:79514599 (75.8 MiB)
          Base address:0xdcc0 Memory:dfbe0000-dfc00000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:40859 errors:0 dropped:0 overruns:0 frame:0
          TX packets:40859 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:19204150 (18.3 MiB)  TX bytes:19204150 (18.3 MiB)

sit0      Link encap:IPv6-in-IPv4
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

$ netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
158.155.4.0     0.0.0.0         255.255.252.0   U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
0.0.0.0         158.155.4.1     0.0.0.0         UG        0 0          0 eth0

Think I'm nuts and it is something in the network? Look at this. ntp and ping say the remote isn't responding:

eric@bass2:/bass/home/eric$ /usr/sbin/ntpq -p 158.155.2.3
158.155.2.3: timed out, nothing received
***Request timed out
eric@bass2:/bass/home/eric$ date
Thu Aug 25 14:01:54 EDT 2005
eric@bass2:/bass/home/eric$ ping 158.155.2.3
PING 158.155.2.3 (158.155.2.3) 56(84) bytes of data.

--- 158.155.2.3 ping statistics ---
9 packets transmitted, 0 received, 100% packet loss, time 7999ms

eric@bass2:/bass/home/eric$ date
Thu Aug 25 14:02:08 EDT 2005

But look what tcpdump says on the same machine: the packets are getting to the remote machine and coming back, but for some reason the replies are being ignored.

$ sudo /usr/sbin/tcpdump -n -i eth0 host 158.155.2.3
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
14:01:42.388815 IP 158.155.4.123.32904 > 158.155.2.3.ntp: NTPv2, Reserved, length 12
14:01:42.389531 IP 158.155.2.3.ntp > 158.155.4.123.32904: NTPv2, Reserved, length 20
14:01:47.388353 IP 158.155.4.123.32904 > 158.155.2.3.ntp: NTPv2, Reserved, length 12
14:01:47.388839 IP 158.155.2.3.ntp > 158.155.4.123.32904: NTPv2, Reserved, length 20
14:01:58.309397 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 0
14:01:58.311005 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 0
14:01:59.309148 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 1
14:01:59.309914 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 1
14:02:00.309431 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 2
14:02:00.310200 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 2
14:02:01.309685 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 3
14:02:01.310359 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 3
14:02:02.308975 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 4
14:02:02.309769 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 4
14:02:03.309294 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 5
14:02:03.310055 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 5
14:02:04.309427 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 6
14:02:04.310214 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 6
14:02:05.308736 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 7
14:02:05.309500 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 7
14:02:06.308957 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 8
14:02:06.309660 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 8

Version-Release number of selected component (if applicable):
kernel-smp-2.6.12-1.1398_FC4

How reproducible:
Every time after I reboot, off-net packets stop after about 30 minutes.

Steps to Reproduce:
1. Just reboot, work normally
2. After about half an hour.

Actual results:

Expected results:

Additional info:
I saw another Bugzilla bug on the e1000, but its workaround didn't help:

# BUGZILLA Bug ID 149887 - Workaround for problem with e1000 adapters
# 24 Aug 2005 -EZA
# Linux bass2.compgen.com 2.6.12-1.1398_FC4smp #1 SMP Fri Jul 15 01:30:13 EDT 2005 i686 i686 i386 GNU/Linux
/sbin/ethtool -K eth0 rx off tx off
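For anyone re-checking the capture in this report, the following is a rough sketch (not part of the original report) that tallies ICMP echo requests and replies from `tcpdump -n` text output. The function name `count_echo` and the embedded sample are illustrative assumptions; only the line format is taken from the capture above. Equal counts on the wire while ping reports 100% loss would be consistent with the replies being dropped after the NIC receives them.

```shell
#!/bin/bash
# Hypothetical helper: reads tcpdump -n text output on stdin and prints
# "<echo requests> <echo replies>" seen in the capture.
count_echo() {
    awk '/echo request/ { req++ }
         /echo reply/   { rep++ }
         END { printf "%d %d\n", req, rep }'
}

# A few lines copied from the capture in the report, used as sample input.
sample='14:01:58.309397 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 0
14:01:58.311005 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 0
14:01:59.309148 IP 158.155.4.123 > 158.155.2.3: icmp 64: echo request seq 1
14:01:59.309914 IP 158.155.2.3 > 158.155.4.123: icmp 64: echo reply seq 1'

printf '%s\n' "$sample" | count_echo
```

On a live system the same function could be fed from a saved capture file instead of the sample string.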
The IT guy commented that there is a firewall between 158.155.2.3 and 158.155.4.1 (the default route), which is common to the systems we are having trouble with. Still, it doesn't explain why the problem is not reproducible when we switch to the 3c905 NIC.
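To make the addressing concrete: the netmask 255.255.252.0 is a /22, which leaves 10 host bits. The sketch below, in plain bash arithmetic, checks which of the addresses from the report share a subnet with eth0 (158.155.4.123). The helper names `ip_to_int` and `same_subnet` are made up for illustration; it only restates the routing table and is not a diagnosis.

```shell
#!/bin/bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeeds (exit 0) when address $1 and address $2 are in the same
# subnet under netmask $3.
same_subnet() {
    local mask
    mask=$(ip_to_int "$3")
    (( ($(ip_to_int "$1") & mask) == ($(ip_to_int "$2") & mask) ))
}

# Addresses taken from the report: eth0, the gateway, and the NTP host.
if same_subnet 158.155.4.123 158.155.4.1 255.255.252.0; then
    echo "158.155.4.1 is on-link"
fi
if ! same_subnet 158.155.4.123 158.155.2.3 255.255.252.0; then
    echo "158.155.2.3 is off-subnet, so traffic goes via the default gateway"
fi
```

With this mask, 158.155.4.123 belongs to 158.155.4.0/22 while 158.155.2.3 belongs to 158.155.0.0/22, so the failing host is one of the off-subnet machines reached through the router and firewall.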
It is difficult to know where to start...please attach the output of running "sysreport"...thanks!
Created attachment 118257: Output of running 'sysreport'
FYI, I did just update the kernel to 2.6.12-1.1447_FC4smp - same problem. The network went out while I was running 'sysreport' above.
Perhaps there is an auto-negotiation problem? I have occasionally seen or heard of problems like this that go away when a fixed port configuration is used. Could you force the link speed to 1000/Full (or whatever is appropriate) at the switch? For good measure, you should also set ETHTOOL_OPTS in /etc/sysconfig/network-scripts/ifcfg-ethX: ETHTOOL_OPTS="speed 1000 duplex full autoneg off" Modify that as appropriate if not using 1000/full, of course. Could you give that a try and report the results...thanks!
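As a sketch of how that line might look for this particular report, assuming the e1000 is eth0 (as in the ifconfig output) and the switch port is fixed at 100Mbit full duplex (the speed the reporter describes), the fragment for /etc/sysconfig/network-scripts/ifcfg-eth0 would be adapted as follows. Only the ETHTOOL_OPTS line is shown; the rest of the file is left as it is.

```shell
# Fragment of /etc/sysconfig/network-scripts/ifcfg-eth0 (assumed file name).
# Forces 100Mbit full duplex with auto-negotiation off, to match a switch
# port that has been fixed to the same settings.
ETHTOOL_OPTS="speed 100 duplex full autoneg off"
```

The setting takes effect the next time the interface is brought up, and running /sbin/ethtool eth0 afterwards should show "Auto-negotiation: off" if it was applied. Both ends need the same fixed settings, otherwise a duplex mismatch is likely.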
The machine goes live in about 1 week. Folks are getting their feet wet now; the new hardware replaces one of our mainstay machines running RH Linux 7.3. I'm waiting for a chance to reboot the machine and re-enable the onboard controller. I won't have much of an opportunity to do these kinds of tests after the server goes live. We've set the port to full duplex, 100Mbit, replaced a cable that was questionable (we jiggled it and the switch port re-negotiated), and added the ETHTOOL_OPTS line to the network interface script.
No joy after that change. I rebooted this morning after nailing the port to 100Mbit full duplex and adding the ETHTOOL_OPTS line:

$ uptime
08:57:43 up 31 min, 2 users, load average: 0.00, 0.02, 0.06

The problem is exhibiting itself again already.
Thanks for trying to help me resolve this problem. We have a workaround (installing a second NIC) and tonight we are taking the server 'live'. After 7pm EDT or so, I won't be able to screw around with the onboard NIC without disrupting business. If there is something else you can think of to try today, let me know.
Moving this to CANTFIX due to need for continued testing that the reporter will be unable to conduct. Please reopen if this situation changes.