Bug 165676

Summary: e1000 driver with Intel 82546EB controller drops packets
Product: Red Hat Enterprise Linux 3
Reporter: Jon. Hallett <jjh>
Component: kernel
Assignee: John W. Linville <linville>
Status: CLOSED CANTFIX
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 3.0
CC: bjoern, petrides
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-09-26 16:20:33 UTC
Attachments:
  Description: uname -a; ethtool eth0; lspci -vvv | grep -A15 Ethernet
  Flags: none

Description Jon. Hallett 2005-08-11 11:00:22 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

Description of problem:
The e1000 driver intermittently drops received packets when running with Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01).  Packets with the same payload are consistently dropped.

The packet drops are particularly evident with RPC, where certain RPC replies and their retransmits (which I believe are identical) are all dropped, resulting in RPC timeouts.  In our case this impacted NFS and automounting.


Version-Release number of selected component (if applicable):
kernel-2.4.21-32.0.1.EL

How reproducible:
Always

Steps to Reproduce:
Run

#!/bin/sh

while true; do
    rpcinfo -u servername 100000 2
done

where servername is the name of a server running portmap.
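
A bounded variant of the loop above can make the failure rate easier to quote. This is only a sketch: it assumes rpcinfo exits non-zero when the port mapper call times out, and PROBE_CMD is a hypothetical override (not part of rpcinfo) that lets the loop be dry-run without a server.

```shell
#!/bin/sh
# Sketch: run the RPC probe a fixed number of times and count the failures.
# PROBE_CMD is a hypothetical override for dry runs; by default the probe
# is the rpcinfo call from the steps above.

probe() {
    # $1 = server name
    if [ -n "$PROBE_CMD" ]; then
        $PROBE_CMD "$1"
    else
        rpcinfo -u "$1" 100000 2
    fi
}

count_failures() {
    # $1 = server name, $2 = number of iterations
    server=$1
    total=$2
    i=0
    failed=0
    while [ "$i" -lt "$total" ]; do
        i=$((i + 1))
        # Assumption: a timed-out probe returns a non-zero exit status.
        if ! probe "$server" >/dev/null 2>&1; then
            failed=$((failed + 1))
            echo "$(date '+%H:%M:%S') iteration $i: probe failed" >&2
        fi
    done
    echo "$failed/$total probes failed"
}
```

For example, "count_failures servername 1000" prints a single summary line on stdout and a timestamped line on stderr for each failed probe.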

Actual Results:
program 100000 version 2 ready and waiting
program 100000 version 2 ready and waiting
program 100000 version 2 ready and waiting
program 100000 version 2 ready and waiting
[...]
rpcinfo: RPC: Port mapper failure - RPC: Timed out
program 100000 version 2 ready and waiting
program 100000 version 2 ready and waiting
[...]
rpcinfo: RPC: Port mapper failure - RPC: Timed out
[etc.]

Expected Results:
program 100000 version 2 ready and waiting
program 100000 version 2 ready and waiting
program 100000 version 2 ready and waiting
program 100000 version 2 ready and waiting
[...]

Additional info:

The problem occurs in both RHEL4 running kernel-2.6.9-11.EL and RHEL3 running kernel-2.4.21-32.0.1.EL.  Both kernels have e1000 version 5.6.10.1-k2-NAPI.

Nothing suspicious is logged to dmesg or /var/log/messages.

We have tried disabling RX checksum offloading using both ethtool and modules.conf/modprobe.conf, and we still saw packet loss.
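
For readers who want to repeat that check, the two approaches usually look roughly like the sketch below. The interface name eth0 is an assumption, and XsumRX is the e1000 module parameter for receive checksum offload as far as I know for drivers of this era; confirm it with "modinfo e1000" before relying on it. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not from the report.

# Runtime change: turn off RX checksum offload on one interface (needs root).
rx_csum_off() {
    # $1 = interface name, e.g. eth0 (assumed)
    ethtool -K "$1" rx off
}

# Print the option line that would go into /etc/modules.conf (RHEL3) or
# /etc/modprobe.conf (RHEL4). Assumption: a single e1000 port; multi-port
# systems take a comma-separated list of values, one per port.
rx_csum_module_option() {
    echo "options e1000 XsumRX=0"
}
```

The runtime setting can be checked afterwards with "ethtool -k eth0". The module option only takes effect once the driver is reloaded.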

We have three machines with 82546EB controllers, and they all drop packets.

The problem machines are connected to different switches from different manufacturers.  Some are running at 100Mbps and others at 1Gbps.  All drop packets.

We have other machines running e1000 drivers which do not drop packets.  They have different versions of the controller.  For example, Intel Corp. 82545GM Gigabit Ethernet Controller (rev 04) works fine with e1000 version 5.6.10.1-k2-NAPI.

Comment 2 John W. Linville 2005-08-16 19:37:10 UTC
Please try the test kernels available here: 
 
   http://people.redhat.com/linville/kernels/rhel3/ 
   http://people.redhat.com/linville/kernels/rhel4/ 
 
Those both have e1000 drivers based on version 6.0.54-k2.  Please try to 
recreate the issue described above with these kernels and post the results 
here...thanks! 

Comment 3 Jon. Hallett 2005-08-17 09:03:35 UTC
Our RHEL4 boxes exhibit the same problem with the RPC test when running the test
kernel.

[root@moorhen tmp]# uname -a
Linux moorhen.ecs.soton.ac.uk 2.6.9-15.2.EL.jwltest.49smp #1 SMP Mon Aug 15
16:21:22 EDT 2005 i686 i686 i386 GNU/Linux

117:    program 100003 version 2 ready and waiting
        program 100003 version 3 ready and waiting
118:    program 100003 version 2 ready and waiting
        program 100003 version 3 ready and waiting
119:    program 100003 version 2 ready and waiting
        program 100003 version 3 ready and waiting
120:    program 100003 version 2 ready and waiting
        program 100003 version 3 ready and waiting
121:

rpcinfo: RPC: Port mapper failure - RPC: Timed out
program 100003 version 2 is not available
        program 100003 version 3 ready and waiting


Comment 4 John W. Linville 2005-09-09 13:24:08 UTC
Please post the output of running "ethtool -S" for the appropriate interface 
after conducting your RPC test and experiencing the failures...thanks! 
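
To make that output easier to read, a small filter along the following lines can help. It is only a sketch: it assumes the usual "name: value" line format of "ethtool -S", and since counter names differ between driver versions it simply keeps every non-zero counter whose name contains "rx_".

```shell
#!/bin/sh
# Sketch: keep only the non-zero receive-side counters from "ethtool -S"
# output read on stdin. Assumes "name: value" lines.
nonzero_rx_counters() {
    awk -F': *' '
        /rx_/ && $2 + 0 != 0 {
            sub(/^[ \t]+/, "", $1)   # drop leading indentation
            print $1 ": " $2
        }'
}
# Usage (eth0 is an assumed interface name):
#   ethtool -S eth0 | nonzero_rx_counters
```

Running it once before and once after the RPC test, then comparing the two results, shows which receive counters moved during the failures.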

Comment 5 Jon. Hallett 2005-09-12 11:29:41 UTC
Sorry, but we have now replaced all our 82546EB interfaces and so are no longer
able to do tests.


Comment 6 John W. Linville 2005-09-26 16:20:33 UTC
CANTFIX, based on lack of available testing. 

Comment 7 Ingvar Hagelund 2007-01-31 22:28:45 UTC
Created attachment 147059
uname -a; ethtool eth0; lspci -vvv | grep -A15 Ethernet

Comment 8 Ingvar Hagelund 2007-01-31 22:30:20 UTC
We've seen something very similar, on our Dell PowerEdge 1855 Blade servers.
Output from lspci and ethtool attached.

After upgrading from 2.4.21-40.EL to kernel-smp-2.4.21-47.0.1.EL, we experienced
strange network problems. It's a bit tricky to investigate, as the problem comes
in bursts lasting one to five minutes and, as the machines are at an offsite
location, it has often gone away before we reach the console. As the machines
are in heavy production, it's not very tempting to reboot the servers with the
newer kernel again.

The problem looks more or less as described above. The blades lose packets,
effectively going off the net for some minutes while under more or less heavy
network load.

Rolling back to 2.4.21-40.EL, the problem went away.
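
Since the bursts are over before anyone reaches the console, one option is to leave a logger running that records interface statistics at a fixed interval. The following is a sketch under stated assumptions: eth0 and the log path are placeholders, and STATS_CMD is a hypothetical override for dry runs, not an ethtool feature.

```shell
#!/bin/sh
# Sketch: append timestamped "ethtool -S" snapshots to a log file so that a
# short burst of packet loss can be examined after the fact.
# STATS_CMD is a hypothetical override used for dry runs.

snapshot() {
    # $1 = interface name
    echo "=== $(date '+%Y-%m-%d %H:%M:%S') $1"
    if [ -n "$STATS_CMD" ]; then
        $STATS_CMD "$1"
    else
        ethtool -S "$1"
    fi
}

log_snapshots() {
    # $1 = interface, $2 = log file, $3 = interval in seconds,
    # $4 = number of snapshots to take
    n=0
    while [ "$n" -lt "$4" ]; do
        snapshot "$1" >>"$2" 2>&1
        n=$((n + 1))
        # Sleep only between snapshots, not after the last one.
        if [ "$n" -lt "$4" ]; then
            sleep "$3"
        fi
    done
}
```

For example, "log_snapshots eth0 /var/tmp/eth0-stats.log 10 8640" would cover roughly a day at ten-second intervals.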

Ingvar



Comment 9 Ingvar Hagelund 2007-01-31 22:33:12 UTC
Comment on attachment 147059
uname -a; ethtool eth0; lspci -vvv | grep -A15 Ethernet

Linux some.where.com 2.4.21-40.EL #1 Thu Feb 2 22:32:00 EST 2006 i686 i686 i386
GNU/Linux
Settings for eth0:
	Supported ports: [ FIBRE ]
	Supported link modes:	1000baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:	1000baseT/Full 
	Advertised auto-negotiation: Yes
	Speed: 1000Mb/s
	Duplex: Full
	Port: FIBRE
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: off
	Supports Wake-on: umbg
	Wake-on: d
	Current message level: 0x00000007 (7)
	Link detected: yes
	Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
	Latency: 72 (4250ns min, 4500ns max), cache line size 10
	Interrupt: pin A routed to IRQ 10
	Region 0: I/O ports at ec00 [size=256]
	Region 1: Memory at dfdf0000 (64-bit, non-prefetchable) [size=64K]
	Region 3: Memory at dfde0000 (64-bit, non-prefetchable) [size=64K]
	Expansion ROM at dfe00000 [disabled] [size=1M]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/0
Enable-
		Address: 0000000000000000  Data: 0000
	Capabilities: [68] PCI-X non-bridge device.
		Command: DPERE- ERO- RBC=0 OST=4
		Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple,
DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
05:04.0 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet
Controller (rev 03)
	Subsystem: Dell: Unknown device 018a
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
	Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
	Latency: 64 (63750ns min), cache line size 10
	Interrupt: pin A routed to IRQ 15
	Region 0: Memory at dfbe0000 (64-bit, non-prefetchable) [size=128K]
	Region 4: I/O ports at dcc0 [size=64]
	Capabilities: [dc] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [e4] PCI-X non-bridge device.
		Command: DPERE- ERO+ RBC=0 OST=0
		Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple,
DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
	Capabilities: [f0] Message Signalled
Interrupts: 64bit+ Queue=0/0 Enable-
		Address: 0000000000000000  Data: 0000

05:04.1 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet
Controller (rev 03)
	Subsystem: Dell: Unknown device 018a
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
	Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
	Latency: 64 (63750ns min), cache line size 10
	Interrupt: pin B routed to IRQ 7
	Region 0: Memory at dfbc0000 (64-bit, non-prefetchable) [size=128K]
	Region 4: I/O ports at dc80 [size=64]
	Capabilities: [dc] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [e4] PCI-X non-bridge device.
		Command: DPERE- ERO+ RBC=0 OST=0
		Status: Bus=0 Dev=0 Func=0 64bit- 133MHz- SCD- USC-, DC=simple,
DMMRBC=0, DMOST=0, DMCRS=0, RSCEM-
	Capabilities: [f0] Message Signalled
Interrupts: 64bit+ Queue=0/0 Enable-
		Address: 0000000000000000  Data: 0000

Comment 10 John W. Linville 2007-02-01 19:31:24 UTC
Ingvar, given that you are using RHEL3 I have to suggest that you use the 
normal RHEL support channels in order to get this issue resolved to your 
benefit.  That will ensure that the issue you are experiencing receives the 
appropriate level of attention and support.

Comment 11 Björn Jacke 2010-11-08 14:44:12 UTC
This might be the same as the following Intel card bug:

https://bugzilla.kernel.org/show_bug.cgi?id=15384

The only way I see to fix it is to blacklist all the E1000 adapters with the broken firmware. A temporary workaround is to disable RX checksum offloading via ethtool.

Comment 12 Björn Jacke 2010-11-08 14:45:55 UTC
I meant disable TX checksum offloading...
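
For anyone trying the TX workaround, a minimal sketch follows. The interface name eth0 is an assumption, the helper names are made up for illustration, and the parsing assumes "ethtool -k" prints a "tx-checksumming: on|off" line, which may differ between ethtool versions.

```shell
#!/bin/sh
# Sketch: helper names are illustrative only.

# Turn off TX checksum offload on one interface (needs root).
tx_csum_off() {
    # $1 = interface name, e.g. eth0 (assumed)
    ethtool -K "$1" tx off
}

# Read "ethtool -k" output on stdin and print the TX checksumming state.
tx_csum_state() {
    awk -F': *' 'tolower($1) ~ /tx-checksumming/ { print $2; exit }'
}
# Usage:
#   tx_csum_off eth0
#   ethtool -k eth0 | tx_csum_state
```

The setting does not survive a reboot or driver reload, so it would need to be reapplied from a boot script if the workaround turns out to help.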