From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
The e1000 driver intermittently drops received packets when running with Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01). Packets with the same payload are consistently dropped. The packet drops are particularly evident with RPC, where certain RPC replies and their retransmits (which I believe are identical) are all dropped, resulting in RPC timeouts. In our case this impacted NFS and automounting.

Version-Release number of selected component (if applicable):
kernel-2.4.21-32.0.1.EL

How reproducible:
Always

Steps to Reproduce:
Run

    #!/bin/sh
    while true; do
        rpcinfo -u servername 100000 2
    done

where servername is the name of a server running portmap.

Actual Results:
    program 100000 version 2 ready and waiting
    program 100000 version 2 ready and waiting
    program 100000 version 2 ready and waiting
    program 100000 version 2 ready and waiting
    [...]
    rpcinfo: RPC: Port mapper failure - RPC: Timed out
    program 100000 version 2 ready and waiting
    program 100000 version 2 ready and waiting
    [...]
    rpcinfo: RPC: Port mapper failure - RPC: Timed out
    [etc.]

Expected Results:
    program 100000 version 2 ready and waiting
    program 100000 version 2 ready and waiting
    program 100000 version 2 ready and waiting
    program 100000 version 2 ready and waiting
    [...]

Additional info:
The problem occurs in both RHEL4 running kernel-2.6.9-11.EL and RHEL3 running kernel-2.4.21-32.0.1.EL. Both kernels have e1000 version 5.6.10.1-k2-NAPI. Nothing suspicious is logged to dmesg or /var/log/messages.

We have tried disabling RX checksum offloading using both ethtool and modules.conf/modprobe.conf, and we still saw packet loss.

We have three machines with 82546EB controllers, and they all drop packets. The problem machines are connected to different switches from different manufacturers. Some are running at 100Mbps and others at 1Gbps. All drop packets.

We have other machines running e1000 drivers which do not drop packets. They have different versions of the controller. For example, Intel Corp. 82545GM Gigabit Ethernet Controller (rev 04) works fine with e1000 version 5.6.10.1-k2-NAPI.
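To put a number on the loss rate, the output of the reproduction loop above can be piped through a small counter. This is only a minimal sketch, assuming a POSIX shell and awk; the function name count_timeouts and the iteration count in the usage comment are illustrative and not part of the original report.

```shell
#!/bin/sh
# count_timeouts: read rpcinfo output on stdin and print
# "<timeouts> <successes>", keyed on the two messages quoted in this report.
count_timeouts() {
    awk '
        /RPC: Timed out/    { t++ }    # one failed portmapper query
        /ready and waiting/ { ok++ }   # one successful reply
        END { printf "%d %d\n", t, ok }
    '
}

# Usage (servername and the 1000 iterations are placeholders):
#   i=0
#   while [ "$i" -lt 1000 ]; do
#       rpcinfo -u servername 100000 2 2>&1
#       i=$((i + 1))
#   done | count_timeouts
```

The 2>&1 is there so that error messages written to stderr reach the counter as well.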
Please try the test kernels available here: http://people.redhat.com/linville/kernels/rhel3/ http://people.redhat.com/linville/kernels/rhel4/ Those both have e1000 drivers based on version 6.0.54-k2. Please try to recreate the issue described above with these kernels and post the results here...thanks!
Our RHEL4 boxes exhibit the same problem with the RPC test when running the test kernel.

[root@moorhen tmp]# uname -a
Linux moorhen.ecs.soton.ac.uk 2.6.9-15.2.EL.jwltest.49smp #1 SMP Mon Aug 15 16:21:22 EDT 2005 i686 i686 i386 GNU/Linux

117:
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
118:
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
119:
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
120:
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
121:
rpcinfo: RPC: Port mapper failure - RPC: Timed out
program 100003 version 2 is not available
program 100003 version 3 ready and waiting
Please post the output of running "ethtool -S" for the appropriate interface after conducting your RPC test and experiencing the failures...thanks!
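When collecting the counters requested above, it may help to cut the output down to the receive-side counters that are non-zero. A minimal sketch, assuming "ethtool -S" prints one "name: value" pair per line; the function name nonzero_rx and the interface name eth0 are illustrative, and the exact counter names depend on the driver version.

```shell
#!/bin/sh
# nonzero_rx: read "ethtool -S <iface>" output on stdin and print, as
# name=value, every counter whose name starts with rx_ and whose value
# is non-zero. All other lines (header, tx_ counters, zeros) are skipped.
nonzero_rx() {
    awk -F': *' '
        { name = $1; sub(/^[ \t]+/, "", name) }   # strip leading indentation
        name ~ /^rx_/ && $2 + 0 != 0 { print name "=" $2 }
    '
}

# Usage (eth0 is an assumed interface name):
#   ethtool -S eth0 | nonzero_rx
```

Running it once before and once after the RPC test and comparing the two outputs shows which counters moved during the failures.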
Sorry, but we have now replaced all our 82546EB interfaces and so are no longer able to do tests.
CANTFIX, based on lack of available testing.
Created attachment 147059: uname -a; ethtool eth0; lspci -vvv | grep -A15 Ethernet
We've seen something very similar on our Dell PowerEdge 1855 blade servers. Output from lspci and ethtool is attached.

After upgrading from 2.4.21-40.EL to kernel-smp-2.4.21-47.0.1.EL, we experienced strange network problems. It's a bit tricky to investigate: the problem comes in bursts lasting one to five minutes and, as the machines are at an offsite location, it has often gone away before we get as far as the console. As the machines are in heavy production, it's not very tempting to reboot the servers with the newer kernel again.

The problem looks more or less as described above. The blades lose packets, effectively going off the network for some minutes while under more or less heavy network load. After rolling back to 2.4.21-40.EL, the problem went away.

Ingvar
Comment on attachment 147059: uname -a; ethtool eth0; lspci -vvv | grep -A15 Ethernet

Linux some.where.com 2.4.21-40.EL #1 Thu Feb 2 22:32:00 EST 2006 i686 i686 i386 GNU/Linux

Settings for eth0:
    Supported ports: [ FIBRE ]
    Supported link modes: 1000baseT/Full
    Supports auto-negotiation: Yes
    Advertised link modes: 1000baseT/Full
    Advertised auto-negotiation: Yes
    Speed: 1000Mb/s
    Duplex: Full
    Port: FIBRE
    PHYAD: 0
    Transceiver: internal
    Auto-negotiation: off
    Supports Wake-on: umbg
    Wake-on: d
    Current message level: 0x00000007 (7)
    Link detected: yes

    Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
    Latency: 72 (4250ns min, 4500ns max), cache line size 10
    Interrupt: pin A routed to IRQ 10
    Region 0: I/O ports at ec00 [size=256]
    Region 1: Memory at dfdf0000 (64-bit, non-prefetchable) [size=64K]
    Region 3: Memory at dfde0000 (64-bit, non-prefetchable) [size=64K]
    Expansion ROM at dfe00000 [disabled] [size=1M]
    Capabilities: [50] Power Management version 2
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
        Address: 0000000000000000 Data: 0000
    Capabilities: [68] PCI-X non-bridge device.
        Command: DPERE- ERO- RBC=0 OST=4
        Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-

05:04.0 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)
    Subsystem: Dell: Unknown device 018a
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
    Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
    Latency: 64 (63750ns min), cache line size 10
    Interrupt: pin A routed to IRQ 15
    Region 0: Memory at dfbe0000 (64-bit, non-prefetchable) [size=128K]
    Region 4: I/O ports at dcc0 [size=64]
    Capabilities: [dc] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [e4] PCI-X non-bridge device.
        Command: DPERE- ERO+ RBC=0 OST=0
        Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
    Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
        Address: 0000000000000000 Data: 0000

05:04.1 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)
    Subsystem: Dell: Unknown device 018a
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
    Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
    Latency: 64 (63750ns min), cache line size 10
    Interrupt: pin B routed to IRQ 7
    Region 0: Memory at dfbc0000 (64-bit, non-prefetchable) [size=128K]
    Region 4: I/O ports at dc80 [size=64]
    Capabilities: [dc] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [e4] PCI-X non-bridge device.
        Command: DPERE- ERO+ RBC=0 OST=0
        Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple, DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
    Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
        Address: 0000000000000000 Data: 0000
Ingvar, given that you are using RHEL3 I have to suggest that you use the normal RHEL support channels in order to get this issue resolved to your benefit. That will ensure that the issue you are experiencing receives the appropriate level of attention and support.
This might be the same as the following Intel card bug: https://bugzilla.kernel.org/show_bug.cgi?id=15384

The only way I see to fix it is to blacklist all the E1000 adapters with the broken firmware. A temporary workaround is to disable RX checksum offloading via ethtool.
I meant disable TX checksum offloading...
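For reference, checksum offloading is toggled per direction with "ethtool -K" and inspected with "ethtool -k". The helper below is only a sketch: it prints the command instead of running it, so it can be reviewed first. The function name offload_off_cmd and the interface name eth0 are illustrative, and whether this actually avoids the drops on a given adapter is untested here.

```shell
#!/bin/sh
# offload_off_cmd IFACE DIRECTION
# Print (do not run) the ethtool command that turns off checksum
# offloading in one direction. DIRECTION must be "rx" or "tx".
offload_off_cmd() {
    case "$2" in
        rx|tx) printf 'ethtool -K %s %s off\n' "$1" "$2" ;;
        *)     echo "usage: offload_off_cmd IFACE rx|tx" >&2
               return 1 ;;
    esac
}

# Usage (as root; eth0 is an assumed interface name):
#   offload_off_cmd eth0 tx | sh    # apply the workaround
#   ethtool -k eth0                 # check the resulting offload settings
```

The setting does not survive a reboot or driver reload, so it would need to be reapplied from a boot script if it turns out to help.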