+++ This bug was initially created as a clone of Bug #398921 +++

Hello everyone,

I'm running RHEL5.4 x86_64 with the latest updates (as of Nov 18th 2009) and kernel 2.6.18-164.6.1.el5 on several systems. One of the systems is a Sun Ultra 27 workstation (Nehalem Xeon) and runs into the same >4GB problems. When the system was upgraded from 3GB to 12GB of RAM, these errors started to appear repeatedly as soon as the system tapped into some "high" memory areas (I am unable to identify which areas, but they appear to be around 4GB).

How to reproduce: on the live system, start a few VMware machines (each with RHEL5) and try something network-intensive such as 'yum update' on them; network performance slows to a crawl.

At first I thought my new ECC RAM was damaged, but after days of memtester(1) and swapping out all DIMMs to isolate a fault, I was unable to find a defective DIMM (it's difficult to do so because they are ECC DIMMs).

Here's what I have gathered so far:

- The occurrence rate depends on the Intel chipset being used (the on-board e1000 chipset differs from that of the dual-e1000 PCI-E card I tried in order to isolate the problem). With the on-board e1000, I would get a continuous stream of 'Detected Tx Unit Hang' messages as soon as the problem occurred, whereas with the PCI-E e1000 I would only get a few.

- Other drivers appear to suffer from the same kind of issue (maybe?). I disabled the on-board e1000 and used a Sun Cassini dual-gigabit card, and I was able to reproduce the same network hangs (albeit without error reporting).

- As a workaround (before I try the sf.net versions of the e1000 driver), I now boot the system with 'mem=4000m' and it runs fine with the dual-e1000 PCI-E card. It's a shame because I really need the 12GB of RAM. :(

- On a dual-Xeon Dell workstation with RHEL5.4 (same software), dual E5410 Xeons, and a Broadcom Gigabit Ethernet adapter (tg3), I cannot reproduce this problem.
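For reference, the 'mem=4000m' workaround above is applied from the boot loader. A sketch of what the GRUB entry looks like on this kernel (the root= path and initrd name are illustrative assumptions; adjust them to match your installation):

```
# /boot/grub/grub.conf -- sketch only; root= and initrd are examples.
title Red Hat Enterprise Linux Server (2.6.18-164.6.1.el5, RAM capped)
        root (hd0,0)
        # mem=4000m keeps usable memory below ~4GB so no network buffers
        # land above the 32-bit DMA boundary -- this masks the bug
        # rather than fixing it.
        kernel /vmlinuz-2.6.18-164.6.1.el5 ro root=/dev/VolGroup00/LogVol00 mem=4000m
        initrd /initrd-2.6.18-164.6.1.el5.img
```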
[root@thorbardin ~]# uname -a
Linux thorbardin.lasthome.solace.krynn 2.6.18-164.6.1.el5 #1 SMP Tue Oct 27 11:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

[root@thorbardin ~]# lspci
00:00.0 Host bridge: Intel Corporation X58 I/O Hub to ESI Port (rev 13)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 13)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 13)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 13)
00:14.0 PIC: Intel Corporation 5520/5500/X58 I/O Hub System Management Registers (rev 13)
00:14.1 PIC: Intel Corporation 5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 13)
00:14.2 PIC: Intel Corporation 5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 13)
00:14.3 PIC: Intel Corporation 5520/5500/X58 I/O Hub Throttle Registers (rev 13)
00:1a.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1b.0 Audio device: Intel Corporation 82801JI (ICH10 Family) HD Audio Controller
00:1d.0 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB Controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller
00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
01:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
01:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
03:00.0 VGA compatible controller: nVidia Corporation G94 [Quadro FX 1800] (rev a1)
04:00.0 Fibre Channel: QLogic Corp. QLA2300 64-bit Fibre Channel Adapter (rev 01)
04:04.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)

[root@thorbardin ~]# modinfo e1000e|grep vers
filename:       /lib/modules/2.6.18-164.6.1.el5/kernel/drivers/net/e1000e/e1000e.ko
version:        1.0.2-k2
srcversion:     469E6E08131469CFCF43DFF
Example errors:

Nov 19 10:24:14 thorbardin kernel: eth0: Detected Tx Unit Hang:
Nov 19 10:24:14 thorbardin kernel: TDH <67>
Nov 19 10:24:14 thorbardin kernel: TDT <6e>
Nov 19 10:24:14 thorbardin kernel: next_to_use <6e>
Nov 19 10:24:14 thorbardin kernel: next_to_clean <67>
Nov 19 10:24:14 thorbardin kernel: buffer_info[next_to_clean]:
Nov 19 10:24:14 thorbardin kernel: time_stamp <10c4db1cf>
Nov 19 10:24:14 thorbardin kernel: next_to_watch <67>
Nov 19 10:24:14 thorbardin kernel: jiffies <10c4dbae0>
Nov 19 10:24:14 thorbardin kernel: next_to_watch.status <0>
Nov 19 10:24:16 thorbardin kernel: eth0: Detected Tx Unit Hang:
Nov 19 10:24:16 thorbardin kernel: TDH <67>
Nov 19 10:24:16 thorbardin kernel: TDT <6e>
Nov 19 10:24:16 thorbardin kernel: next_to_use <6e>
Nov 19 10:24:16 thorbardin kernel: next_to_clean <67>
Nov 19 10:24:16 thorbardin kernel: buffer_info[next_to_clean]:
Nov 19 10:24:16 thorbardin kernel: time_stamp <10c4db1cf>
Nov 19 10:24:16 thorbardin kernel: next_to_watch <67>
Nov 19 10:24:16 thorbardin kernel: jiffies <10c4dc2b0>
Nov 19 10:24:16 thorbardin kernel: next_to_watch.status <0>
Unfortunately, the driver that ships with 5.4 doesn't have the 'ignore_64bit_dma=1' module parameter that recent e1000 drivers from Intel include.
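For what it's worth, with the out-of-tree Intel/sf.net driver that parameter would typically be set via /etc/modprobe.conf, along these lines (a sketch only; the parameter applies to the sf.net e1000 driver mentioned above, not to the in-box RHEL 5.4 driver):

```
# /etc/modprobe.conf -- fragment only; requires the sf.net/Intel e1000
# driver, NOT the driver shipped with RHEL 5.4.
# Per the Intel driver docs, ignore_64bit_dma restricts the driver to
# 32-bit DMA addressing, keeping buffers below the 4GB boundary.
options e1000 ignore_64bit_dma=1
```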
Additional comment: using 'tso off' improves the situation somewhat, up to the point where the system remains reachable through the network even though it's experiencing slowdowns. I'm now running with 12GB, but with tso off and with the 1.0.15 e1000e driver on both eth0 and eth1 (multi-homed machine with iptables IPv4 filtering on eth1).
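A minimal sketch of the 'tso off' workaround for both interfaces (interface names as in this report). Rather than running ethtool directly, this generates the commands so they can be reviewed first and then piped to sh:

```shell
# Print the ethtool invocations that disable TCP Segmentation Offload
# on each interface; pipe the output to sh to actually apply them.
for dev in eth0 eth1; do
    printf 'ethtool -K %s tso off\n' "$dev"
done
```

Note that an ethtool offload setting applied this way does not survive a reboot, so it has to be reapplied at boot (e.g. from rc.local).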
(In reply to comment #4)
> Additional comment: using 'tso off' improves the situation somewhat, up to
> the point where the system remains reachable through the network even though
> it's experiencing slowdowns.
> I'm now running with 12GB, but with tso off and with the 1.0.15 e1000e driver
> on both eth0 and eth1 (multi-homed machine with iptables IPv4 filtering on
> eth1).

I'm seeing the same error here; how is the newer driver working out for you?
Just a comment: we don't have the system anymore. We'll see if we can reproduce this issue with the next Nehalem box.
If anyone would care to try the e1000e driver we plan to ship with 5.5, they can try my test kernels located here: http://people.redhat.com/agospoda/#rhel5 I can't say for sure that they will help, but I'll put that offer out there in case anyone wants to try them.
(In reply to comment #4)
> Additional comment: using 'tso off' improves the situation somewhat, up to
> the point where the system remains reachable through the network even though
> it's experiencing slowdowns.
> I'm now running with 12GB, but with tso off and with the 1.0.15 e1000e driver
> on both eth0 and eth1 (multi-homed machine with iptables IPv4 filtering on
> eth1).

So was turning TSO off only effective with the RHEL driver, the SourceForge driver, or both?
Hi Andy,

This would require some re-testing, but from what I recall:

- tso off was only tested with 1.0.15, not the RHEL driver.
- One of the interfaces (connected to a 100FDX switch rather than a Gbps switch) kept detecting error packets until we turned autoneg back on.

The system wasn't losing packets anymore with 1.0.15 and tso off. When we get another system, I'll test your kernels on it.

Thanks,
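To keep autonegotiation enabled across reboots on a RHEL system, one option (a sketch, assuming the stock initscripts, which pass ETHTOOL_OPTS to 'ethtool -s DEVICE ...' at interface bring-up) is:

```
# /etc/sysconfig/network-scripts/ifcfg-eth1 -- fragment only.
# Applied by the initscripts as 'ethtool -s eth1 autoneg on' each time
# the interface comes up, so the setting persists across reboots.
ETHTOOL_OPTS="autoneg on"
```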
Vincent, I don't really have much information that I can use to resolve this, so I'm going to close this bug as INSUFFICIENT_DATA. If you find this system again and can reproduce the problem (or if the system is a reference design we might have here), please reopen the bug and I will check it out.
Hi Andy,

I agree with you. I am not in a position to reproduce the problem due to the lack of availability of a similar computer system.

The system I had access to is a current reference design from Sun Microsystems: a Sun Ultra 27 workstation with a Xeon W3540 CPU. You might be able to have Sun ship you a machine if you'd be willing to try to reproduce the problem in the labs, because they offer it as a 'try-n-buy' product (keep it for free for 60 days, then either send it back or buy it at a discounted rate). Anyone (even individuals) can use the Sun try-n-buy program, so that might be an option.

Best regards,
Vincent