540413 – e1000e: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang continues RHEL5.4

Summary: e1000e: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang continues RHEL5.4

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.4
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Dean Nelson
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-11-23 11:04 UTC by Vincent S. Cojot
Modified:	2010-01-19 10:51 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	398921
Environment:
Last Closed:	2010-01-18 22:29:37 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Description Vincent S. Cojot 2009-11-23 11:04:02 UTC

+++ This bug was initially created as a clone of Bug #398921 +++

Hello everyone,

I'm running RHEL5.4 x86_64 with the latest updates (as of Nov 18th 2009) and kernel 2.6.18-164.6.1.el5 on several systems. One of the systems is a Sun Ultra 27 workstation (Nehalem Xeon), and it runs into the same >4 GB problems.

When the system was upgraded from 3 GB to 12 GB of RAM, these same errors started to appear repeatedly as soon as the system tapped into some "high" memory areas (I am unable to identify which areas, but they appear to be around 4 GB).

* How to reproduce: On the live system, start a few VMware virtual machines (each running RHEL5) and run something network-intensive, such as 'yum update', on them. Network performance slows to a crawl.

At first I thought my new ECC RAM was damaged, but after days of memtester(1) and swapping out all DIMMs to isolate a fault, I was unable to find a defective DIMM (it's difficult to do so because they are ECC DIMMs).

Here's what I have gathered so far:
- The occurrence rate depends on the Intel chipset being used (the motherboard's e1000 chipset differs from the one on the dual-e1000 PCI-E card I tried in order to isolate the problem). With the on-board e1000, I would get a continuous stream of 'Detected Tx Unit Hang' messages as soon as the problem occurred, whereas with the PCI-E e1000 I would only get a few.

- Other drivers appear to suffer from the same kind of issue (maybe?): when I disabled the on-board e1000 and used a Sun Cassini dual-gigabit card, I was able to reproduce the same network hangs (albeit without error reporting).

- As a workaround (before I try the sf.net versions of the e1000 driver), I now boot the system with 'mem=4000m' and it now runs fine with the dual-e1000 PCI-E card. It's a shame because I really need the 12 GB of RAM. :(

- On a dual-Xeon Dell workstation running RHEL5.4 (same software), with dual E5410 Xeons and a Broadcom Gigabit Ethernet (tg3), I cannot reproduce this problem.
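
A note on the 'mem=4000m' workaround above: 4000 MiB is just under 2^32 bytes, so, assuming the kernel treats the value as a cap on the physical memory it uses, the workaround would keep all RAM below the 4 GiB line that a 32-bit DMA address can reach. The Python sketch below only illustrates that arithmetic. The function names are hypothetical, and it does not inspect the machine's real memory map.

```python
FOUR_GIB = 1 << 32  # first byte address that no longer fits in 32 bits


def mem_param_to_bytes(value: str) -> int:
    """Convert a 'mem='-style size such as '4000m' to bytes.

    Assumes K/M/G suffixes are powers of 1024 and are case-insensitive.
    """
    units = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
    value = value.strip().lower()
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def fits_in_32_bits(limit_bytes: int) -> bool:
    """True if every byte below limit_bytes has a 32-bit address."""
    return limit_bytes <= FOUR_GIB


if __name__ == "__main__":
    for size in ("4000m", "12g"):
        limit = mem_param_to_bytes(size)
        print(size, limit, fits_in_32_bits(limit))
```

This is consistent with the hangs appearing only after the upgrade to 12 GB, but it does not prove that 64-bit DMA addressing is the cause.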

Comment 1 Vincent S. Cojot 2009-11-23 11:05:54 UTC

[root@thorbardin ~]# uname -a
Linux thorbardin.lasthome.solace.krynn 2.6.18-164.6.1.el5 #1 SMP Tue Oct 27 11:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@thorbardin ~]# lspci
00:00.0 Host bridge: Intel Corporation X58 I/O Hub to ESI Port (rev 13)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 13)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 13)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 13)
00:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Registers (rev 13)
00:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 13)
00:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 13)
00:14.3 PIC: Intel Corporation 5520/5500/X58 I/O Hub Throttle Registers (rev 13)
00:1a.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1b.0 Audio device: Intel Corporation 82801JI (ICH10 Family) HD Audio Controller
00:1d.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller
00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
01:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
01:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
03:00.0 VGA compatible controller: nVidia Corporation G94 [Quadro FX 1800] (rev a1)
04:00.0 Fibre Channel: QLogic Corp. QLA2300 64-bit Fibre Channel Adapter (rev 01)
04:04.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
[root@thorbardin ~]# uname -a
Linux thorbardin.lasthome.solace.krynn 2.6.18-164.6.1.el5 #1 SMP Tue Oct 27 11:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@thorbardin ~]# modinfo e1000e|grep vers
filename:       /lib/modules/2.6.18-164.6.1.el5/kernel/drivers/net/e1000e/e1000e.ko
version:        1.0.2-k2
srcversion:     469E6E08131469CFCF43DFF

Comment 2 Vincent S. Cojot 2009-11-23 11:11:50 UTC

Example errors:

Nov 19 10:24:14 thorbardin kernel: eth0: Detected Tx Unit Hang:
Nov 19 10:24:14 thorbardin kernel:   TDH                  <67>
Nov 19 10:24:14 thorbardin kernel:   TDT                  <6e>
Nov 19 10:24:14 thorbardin kernel:   next_to_use          <6e>
Nov 19 10:24:14 thorbardin kernel:   next_to_clean        <67>
Nov 19 10:24:14 thorbardin kernel: buffer_info[next_to_clean]:
Nov 19 10:24:14 thorbardin kernel:   time_stamp           <10c4db1cf>
Nov 19 10:24:14 thorbardin kernel:   next_to_watch        <67>
Nov 19 10:24:14 thorbardin kernel:   jiffies              <10c4dbae0>
Nov 19 10:24:14 thorbardin kernel:   next_to_watch.status <0>
Nov 19 10:24:16 thorbardin kernel: eth0: Detected Tx Unit Hang:
Nov 19 10:24:16 thorbardin kernel:   TDH                  <67>
Nov 19 10:24:16 thorbardin kernel:   TDT                  <6e>
Nov 19 10:24:16 thorbardin kernel:   next_to_use          <6e>
Nov 19 10:24:16 thorbardin kernel:   next_to_clean        <67>
Nov 19 10:24:16 thorbardin kernel: buffer_info[next_to_clean]:
Nov 19 10:24:16 thorbardin kernel:   time_stamp           <10c4db1cf>
Nov 19 10:24:16 thorbardin kernel:   next_to_watch        <67>
Nov 19 10:24:16 thorbardin kernel:   jiffies              <10c4dc2b0>
Nov 19 10:24:16 thorbardin kernel:   next_to_watch.status <0>
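
The dumps above can be read mechanically. The Python sketch below parses one dump and reports how many descriptors sit between TDH and TDT and how long the oldest buffer has been waiting. It assumes the values in angle brackets are hexadecimal (consistent with digits such as 6e), that HZ is 1000 on this kernel, and that the ring has not wrapped between head and tail. The helper names are hypothetical and are not part of the driver.

```python
import re

HZ = 1000  # assumption: timer ticks (jiffies) per second on this kernel

# Matches "name   <hexvalue>" at the end of a syslog line.
FIELD_RE = re.compile(r"(\w[\w.]*)\s+<([0-9a-fA-F]+)>")

SAMPLE = """\
Nov 19 10:24:14 thorbardin kernel: eth0: Detected Tx Unit Hang:
Nov 19 10:24:14 thorbardin kernel:   TDH                  <67>
Nov 19 10:24:14 thorbardin kernel:   TDT                  <6e>
Nov 19 10:24:14 thorbardin kernel:   next_to_use          <6e>
Nov 19 10:24:14 thorbardin kernel:   next_to_clean        <67>
Nov 19 10:24:14 thorbardin kernel: buffer_info[next_to_clean]:
Nov 19 10:24:14 thorbardin kernel:   time_stamp           <10c4db1cf>
Nov 19 10:24:14 thorbardin kernel:   next_to_watch        <67>
Nov 19 10:24:14 thorbardin kernel:   jiffies              <10c4dbae0>
Nov 19 10:24:14 thorbardin kernel:   next_to_watch.status <0>
""".splitlines()


def parse_hang_dump(lines):
    """Return {field name: integer value} for every 'name <hex>' line."""
    fields = {}
    for line in lines:
        match = FIELD_RE.search(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def summarize(fields, hz=HZ):
    """Compute queue depth and stall time (ignores ring wraparound)."""
    stalled = fields["jiffies"] - fields["time_stamp"]
    return {
        "pending_descriptors": fields["TDT"] - fields["TDH"],
        "stalled_jiffies": stalled,
        "stalled_seconds": stalled / hz,
    }


if __name__ == "__main__":
    print(summarize(parse_hang_dump(SAMPLE)))
```

Under those assumptions, both dumps show the same seven descriptors stuck at index 0x67 with an unchanged time_stamp, which suggests the transmit ring made no progress between the two reports.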

Comment 3 Vincent S. Cojot 2009-11-23 12:01:20 UTC

Unfortunately, the driver that came with 5.4 doesn't have the 'ignore_64bit_dma=1' option that recent e1000 drivers from Intel include.

Comment 4 Vincent S. Cojot 2009-12-03 14:49:10 UTC

Additional comment: using 'tso off' improves the situation somewhat, to the point where the system remains reachable through the network even though it's experiencing slowdowns.
I'm now running with 12 GB but with tso off and with the 1.0.15 e1000e driver on both eth0 and eth1 (multi-homed machine with iptables IPv4 filtering on eth1).
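
For anyone wanting to try the same mitigation, here is a minimal Python sketch. It assumes 'tso off' refers to the ethtool offload setting (`ethtool -K <interface> tso off`), which the comment does not state explicitly. The helper names are hypothetical, and the change needs root privileges and does not persist across reboots.

```python
import subprocess


def tso_command(interface, enabled):
    """Build the ethtool command line that turns TSO on or off."""
    return ["ethtool", "-K", interface, "tso", "on" if enabled else "off"]


def set_tso(interface, enabled):
    """Run ethtool; raises CalledProcessError if ethtool reports failure."""
    subprocess.run(tso_command(interface, enabled), check=True)


if __name__ == "__main__":
    # Interface names taken from the comment above (eth0 and eth1).
    for nic in ("eth0", "eth1"):
        print(" ".join(tso_command(nic, enabled=False)))
```

The script only prints the commands; calling `set_tso("eth0", False)` would actually apply the setting.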

Comment 5 Matthew Kent 2009-12-03 23:53:53 UTC

(In reply to comment #4)
> Additional comment: using 'tso off' improves the situation somewhat up to the
> point where the system remains reachable through the network even though it's
> experiencing slowdowns.
> I'm now running with 12 GB but with tso off and with the 1.0.15 e1000e driver on
> both eth0 and eth1 (multi-homed machine with iptables IPv4 filtering on eth1).

I'm seeing the same error here. How is the newer driver working out for you?

Comment 6 Vincent S. Cojot 2009-12-07 15:47:24 UTC

Just a comment: we don't have the system anymore. We'll see if we can reproduce this issue with the next Nehalem box.

Comment 7 Andy Gospodarek 2009-12-07 21:58:40 UTC

If anyone would care to try the e1000e driver we plan to ship with 5.5, they can try my test kernels located here:

http://people.redhat.com/agospoda/#rhel5

I can't say for sure that they will help, but I'll put that offer out there in case anyone wants to try them.

Comment 8 Andy Gospodarek 2009-12-07 22:09:24 UTC

(In reply to comment #4)
> Additional comment: using 'tso off' improves the situation somewhat up to the
> point where the system remains reachable through the network even though it's
> experiencing slowdowns.
> I'm now running with 12 GB but with tso off and with the 1.0.15 e1000e driver on
> both eth0 and eth1 (multi-homed machine with iptables IPv4 filtering on eth1).

So was turning TSO off effective on the RHEL driver, the SourceForge driver, or both?

Comment 9 Vincent S. Cojot 2009-12-08 09:27:41 UTC

Hi Andy,
this would require some re-testing but from what I recall:
- tso was only tested with 1.0.15, not the RHEL driver.
- one of the interfaces (connected to a 100FDX switch and not to a Gbps switch) kept detecting error packets until we turned autoneg back on. The system wasn't losing packets anymore with 1.0.15 and tso off.

When we get another system, I'll test your kernels on it.

Thanks,

Comment 10 Andy Gospodarek 2010-01-18 22:29:37 UTC

Vincent, I don't really have much information that I can use to resolve this, so I'm going to close this with the flag insufficient data.

If you find this system again and can reproduce the problem (or if the system is a reference design we might have here), please reopen the bug and I will check it out.

Comment 11 Vincent S. Cojot 2010-01-19 10:51:52 UTC

Hi Andy, I agree with you.
I am not in a position to reproduce the problem because I no longer have access to a similar system.

The system I had access to is a current reference design from Sun Microsystems: a Sun Ultra 27 workstation with a Xeon W3540 CPU.

If you'd be willing to try to reproduce it in the lab, you might be able to have Sun ship you a machine, because they offer it as a 'try-n-buy' product (keep it for free for 60 days, then send it back or buy it at a discounted rate). Anyone (even individuals) can use the Sun try-n-buy program, so that might be an option.

Best regards,

Vincent
