Bug 194460 - e1000_clean_tx_irq: Detected Tx Unit Hang with Kernel 2.6.16-1.2122_FC5smp and Intel 82573V PCI-Express Ethernet
Status: CLOSED UPSTREAM
Product: Fedora
Classification: Fedora
Component: kernel
Version: 5
Hardware: i686 Linux
Priority: medium
Severity: medium
Assigned To: John W. Linville
QA Contact: Brian Brock
Reported: 2006-06-08 06:58 EDT by wet
Modified: 2007-11-30 17:11 EST
Doc Type: Bug Fix
Last Closed: 2006-07-14 14:43:26 EDT

Attachments
short script to attempt eeprom fix for 82573 (461 bytes, application/octet-stream)
2006-06-14 11:53 EDT, Jesse Brandeburg
patch to fix 6.3.9-k4 TSO (678 bytes, patch)
2006-06-16 13:53 EDT, Jesse Brandeburg

Description wet 2006-06-08 06:58:22 EDT
+++ This bug was initially created as a clone of Bug #182215 +++

Description of problem:

After installing FC5 and running yum update, I encounter problems when I copy
big files (~2 GB) from a Win2K workstation to the FC5 server using scp. The
transfer stops at around 1.5 GB, and in /var/log/messages I see:

Jun  7 15:53:27 titan kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun  7 15:53:27 titan kernel:   Tx Queue             <0>
Jun  7 15:53:27 titan kernel:   TDH                  <15>
Jun  7 15:53:27 titan kernel:   TDT                  <45>
Jun  7 15:53:27 titan kernel:   next_to_use          <45>
Jun  7 15:53:27 titan kernel:   next_to_clean        <15>
Jun  7 15:53:27 titan kernel: buffer_info[next_to_clean]
Jun  7 15:53:27 titan kernel:   time_stamp           <f26f835>
Jun  7 15:53:27 titan kernel:   next_to_watch        <18>
Jun  7 15:53:27 titan kernel:   jiffies              <f26fb14>
Jun  7 15:53:27 titan kernel:   next_to_watch.status <0>
Jun  7 15:53:29 titan kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun  7 15:53:29 titan kernel:   Tx Queue             <0>
Jun  7 15:53:29 titan kernel:   TDH                  <15>
Jun  7 15:53:29 titan kernel:   TDT                  <45>
Jun  7 15:53:29 titan kernel:   next_to_use          <45>
Jun  7 15:53:29 titan kernel:   next_to_clean        <15>
Jun  7 15:53:29 titan kernel: buffer_info[next_to_clean]
Jun  7 15:53:29 titan kernel:   time_stamp           <f26f835>
Jun  7 15:53:29 titan kernel:   next_to_watch        <18>
Jun  7 15:53:29 titan kernel:   jiffies              <f26fd08>
Jun  7 15:53:29 titan kernel:   next_to_watch.status <0>
Jun  7 15:53:31 titan kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun  7 15:53:31 titan kernel:   Tx Queue             <0>
Jun  7 15:53:31 titan kernel:   TDH                  <15>
Jun  7 15:53:31 titan kernel:   TDT                  <45>
Jun  7 15:53:31 titan kernel:   next_to_use          <45>
Jun  7 15:53:31 titan kernel:   next_to_clean        <15>
Jun  7 15:53:31 titan kernel: buffer_info[next_to_clean]
Jun  7 15:53:31 titan kernel:   time_stamp           <f26f835>
Jun  7 15:53:31 titan kernel:   next_to_watch        <18>
Jun  7 15:53:31 titan kernel:   jiffies              <f26fefd>
Jun  7 15:53:31 titan kernel:   next_to_watch.status <0>
Jun  7 15:53:33 titan kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun  7 15:53:33 titan kernel:   Tx Queue             <0>
Jun  7 15:53:33 titan kernel:   TDH                  <15>
Jun  7 15:53:33 titan kernel:   TDT                  <45>
Jun  7 15:53:33 titan kernel:   next_to_use          <45>
Jun  7 15:53:33 titan kernel:   next_to_clean        <15>
Jun  7 15:53:33 titan kernel: buffer_info[next_to_clean]
Jun  7 15:53:33 titan kernel:   time_stamp           <f26f835>
Jun  7 15:53:33 titan kernel:   next_to_watch        <18>
Jun  7 15:53:33 titan kernel:   jiffies              <f2700f1>
Jun  7 15:53:33 titan kernel:   next_to_watch.status <0>
Jun  7 15:53:34 titan kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun  7 15:53:34 titan kernel: br0: port 1(eth0) entering disabled state
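
Note on reading the dump above: time_stamp stays at f26f835 in all four reports while jiffies keeps advancing, so the same descriptor has been pending the whole time. The following is a minimal Python sketch of that arithmetic, using only the values from the log. The tick rate is not stated anywhere in the report; it is estimated here from the syslog timestamps, which is an assumption (they are only accurate to about a second). The function names are made up for illustration.

```python
# Values copied from the four "Detected Tx Unit Hang" reports in the log.
TIME_STAMP = 0xF26F835  # buffer_info[next_to_clean].time_stamp, identical in all reports

# (seconds field of the syslog time 15:53:xx, jiffies value printed by the driver)
REPORTS = [
    (27, 0xF26FB14),
    (29, 0xF26FD08),
    (31, 0xF26FEFD),
    (33, 0xF2700F1),
]


def estimate_hz(reports):
    """Estimate kernel ticks per second from the first and last report.

    Assumption: the syslog timestamps are close enough to the moment the
    driver sampled jiffies for a rough estimate.
    """
    (t0, j0), (t1, j1) = reports[0], reports[-1]
    return (j1 - j0) / (t1 - t0)


def stalled_ticks(reports, time_stamp):
    """Jiffies elapsed between the descriptor's time stamp and each report."""
    return [jiffies - time_stamp for _, jiffies in reports]


hz = estimate_hz(REPORTS)
ticks = stalled_ticks(REPORTS, TIME_STAMP)
for (sec, _), t in zip(REPORTS, ticks):
    print(f"15:53:{sec}: descriptor pending for {t} jiffies (~{t / hz:.1f} s)")
```

With these numbers the estimate comes out at roughly 250 ticks per second, and the descriptor has been pending for 735 jiffies at the first report and 2236 jiffies at the last one, shortly before the NETDEV WATCHDOG timeout.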


Version-Release number of selected component (if applicable):
2.6.16-1.2122_FC5smp #1 SMP Sun May 21 15:18:32 EDT 2006 i686 i686 i386 GNU/Linux

How reproducible:
Whenever I use scp to transfer big files.

Steps to Reproduce:
On Win2k Workstation, run
pscp bigfile root@server:/some/where/

Actual results:
The file transfer stops at around 1.5 GB, and I see the above messages in /var/log/messages.

Expected results:
No error messages, simple copy of a file should work.

Additional info:

Disabling TSO (using "/sbin/ethtool -K eth0 tso off"), as found somewhere on the
web, looks like a workaround; at least I could transfer a few 2 GB files afterwards.
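
For anyone who wants to script this workaround, here is a minimal Python sketch. It reads the current offload settings with "ethtool -k" and only runs the "-K ... tso off" command from above if TSO is reported as on. The function names tso_enabled and disable_tso are made up for this example, and the parser assumes the TSO line is labelled either "tcp segmentation offload" or "tcp-segmentation-offload", which may not cover every ethtool version.

```python
import subprocess


def tso_enabled(ethtool_k_output):
    """Return True/False for the TSO line of `ethtool -k` output, None if not found."""
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        # Assumption: the label is "tcp segmentation offload", with spaces or hyphens.
        if name.strip().replace("-", " ") == "tcp segmentation offload":
            words = value.split()
            return bool(words) and words[0] == "on"
    return None


def disable_tso(iface, ethtool="/sbin/ethtool"):
    """Turn TSO off on iface if it is currently on. Needs root privileges."""
    result = subprocess.run(
        [ethtool, "-k", iface], capture_output=True, text=True, check=True
    )
    if tso_enabled(result.stdout):
        subprocess.run([ethtool, "-K", iface, "tso", "off"], check=True)
        return True
    return False
```

Calling disable_tso("eth0") as root would apply the workaround. The setting does not survive a reboot, so it would have to be reapplied at boot time, for example from an init script.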

[root@titan ~]# lspci
00:00.0 Host bridge: Intel Corporation E7230 Memory Controller Hub (rev 81)
00:01.0 PCI bridge: Intel Corporation E7230 PCI Express Root Port (rev 81)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) Serial ATA Storage Controllers cc=IDE (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
02:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
02:00.1 PIC: Intel Corporation 6700/6702PXH I/OxAPIC Interrupt Controller A (rev 09)
04:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
05:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
0a:00.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)

-----------------------
Comment 1 Jesse Brandeburg 2006-06-08 14:20:35 EDT
We're starting to get a lot of these reports; many of them are related to the
eeprom of the 82573 and can be fixed by updating it.

Please send the output of ethtool -e for both interfaces.
Comment 2 Mustafa Mahudhawala 2006-06-14 02:36:41 EDT
One of our internal test file servers had the same problem yesterday, and it
took the network down along with it (very serious).

Only eth0, i.e. the onboard 82573V, was in use at the time of the problem.
This interface has since been taken down, and the server is currently running
off the other onboard NIC.

# lspci
00:00.0 Host bridge: Intel Corporation E7230 Memory Controller Hub
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) Serial ATA Storage Controller AHCI (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
03:00.0 Ethernet controller: Intel Corporation 82573V Gigabit Ethernet Controller (Copper) (rev 03)
04:04.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
04:05.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)

# lspci -n
00:00.0 Class 0600: 8086:2778
00:1c.0 Class 0604: 8086:27d0 (rev 01)
00:1c.4 Class 0604: 8086:27e0 (rev 01)
00:1c.5 Class 0604: 8086:27e2 (rev 01)
00:1d.0 Class 0c03: 8086:27c8 (rev 01)
00:1d.1 Class 0c03: 8086:27c9 (rev 01)
00:1d.2 Class 0c03: 8086:27ca (rev 01)
00:1d.3 Class 0c03: 8086:27cb (rev 01)
00:1d.7 Class 0c03: 8086:27cc (rev 01)
00:1e.0 Class 0604: 8086:244e (rev e1)
00:1f.0 Class 0601: 8086:27b8 (rev 01)
00:1f.1 Class 0101: 8086:27df (rev 01)
00:1f.2 Class 0106: 8086:27c1 (rev 01)
00:1f.3 Class 0c05: 8086:27da (rev 01)
03:00.0 Class 0200: 8086:108b (rev 03)
04:04.0 Class 0300: 1002:515e (rev 02)
04:05.0 Class 0200: 8086:1076 (rev 05)

ifconfig before taking down the problem NIC:

# cat ifconfig.out 
eth0      Link encap:Ethernet  HWaddr 00:13:20:D6:AD:E3
          inet addr:10.65.6.1  Bcast:10.65.6.255  Mask:255.255.255.0
          inet6 addr: fe80::213:20ff:fed6:ade3/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:84132882 errors:297966072 dropped:297966072 overruns:297966072 frame:0
          TX packets:10677632885 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:75213992657 (70.0 GiB)  TX bytes:854693824469 (795.9 GiB)
          Base address:0x2000 Memory:88100000-88120000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:6082 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6082 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:868538 (848.1 KiB)  TX bytes:868538 (848.1 KiB)

# ethtool -e eth0
Offset          Values
------          ------
0x0000          00 13 20 d6 ad e3 30 0b 46 f7 01 10 ff ff ff ff
0x0010          ff ff ff ff 6b 02 a3 30 86 80 8b 10 86 80 de 80
0x0020          00 00 00 20 14 7e 00 00 00 00 d8 00 00 00 00 27
0x0030          c9 6c 50 31 22 07 0b 04 84 09 00 00 00 c0 06 07
0x0040          08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
0x0050          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0060          00 01 00 40 1c 12 07 40 ff ff ff ff ff ff ff ff
0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 22 57
# ethtool -e eth1
Offset          Values
------          ------
0x0000          00 13 20 d6 ad e4 10 02 ff ff 00 10 ff ff ff ff
0x0010          ff ff ff ff 0b 64 a1 30 86 80 76 10 86 80 84 b2
0x0020          dd 20 22 22 00 00 90 2f 80 23 12 00 20 1e 12 00
0x0030          20 1e 12 00 20 1e 12 00 20 1e 09 00 00 02 00 00
0x0040          0c 00 a6 93 0b 28 00 00 00 04 ff ff ff ff ff ff
0x0050          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 02 06
0x0060          00 01 00 40 1c 12 07 40 ff ff ff ff ff ff ff ff
0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 83 18

# uname -rmpio
2.6.9-34.ELsmp x86_64 x86_64 x86_64 GNU/Linux
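
A quick way to sanity-check an "ethtool -e" dump like the ones above is to decode a few fields and compare them with ifconfig and "lspci -n". The following minimal Python sketch does that for the first two rows of the eth0 dump. The byte offsets used for the PCI IDs (0x1A for the device ID, 0x1C for the vendor ID, both little-endian) are an assumption inferred from this dump lining up with the 8086:108b entry in "lspci -n", not taken from Intel documentation. The helper names are made up for illustration.

```python
# First two rows of the eth0 dump, copied from the `ethtool -e eth0` output.
DUMP_ETH0 = """\
Offset          Values
------          ------
0x0000          00 13 20 d6 ad e3 30 0b 46 f7 01 10 ff ff ff ff
0x0010          ff ff ff ff 6b 02 a3 30 86 80 8b 10 86 80 de 80
"""


def parse_eeprom(text):
    """Collect the data bytes of an `ethtool -e` dump.

    Assumption: rows are contiguous and start at offset 0, as in the dump above.
    """
    data = bytearray()
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("0x"):
            continue  # skip the header and separator rows
        data.extend(int(byte, 16) for byte in fields[1:])
    return bytes(data)


def mac_address(data):
    """The first six bytes hold the MAC address."""
    return ":".join(f"{b:02X}" for b in data[:6])


def le_word(data, offset):
    """Read a 16-bit little-endian word at the given byte offset."""
    return int.from_bytes(data[offset:offset + 2], "little")


eeprom = parse_eeprom(DUMP_ETH0)
print("MAC:", mac_address(eeprom))
print(f"PCI ID: {le_word(eeprom, 0x1C):04x}:{le_word(eeprom, 0x1A):04x}")
```

For the eth0 rows this gives 00:13:20:D6:AD:E3, which matches the HWaddr in the ifconfig output, and 8086:108b, which matches the 03:00.0 entry in "lspci -n". It does not tell you whether the eeprom has the known issue; it only confirms that the dump belongs to the expected adapter.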
Comment 3 Jesse Brandeburg 2006-06-14 11:53:57 EDT
Created attachment 130866 [details]
short script to attempt eeprom fix for 82573

This script will attempt to identify an eeprom on an 82573 with a known issue,
and attempt to update the eeprom.  The machine will have to be rebooted for the
changes to the eeprom to take effect (have to force PCIe link to renegotiate).
Comment 4 Jesse Brandeburg 2006-06-14 11:58:48 EDT
There is a note missing (since the hardware failure) where the original
submitter attached his 82573 eeproms, both of which have the known issue that
can cause TX timeouts when TSO is enabled.

Comment 5 Jesse Brandeburg 2006-06-14 12:03:12 EDT
(In reply to comment #2)
> One of our test (internal) file servers had the same problem yesterday, and it
> took the network down along with it as well (very serious) ..

Why did the network go down?  Didn't the adapter get reset after the tx timeout
and recover?

> Only eth0 i.e. onboard 82573V was in use at the time of the problem.
> Currently this interface has been downed, and the server is currently running off
> the other onboard NIC.

Sorry to hear about the issue. Your 82573 eeprom dump shows that it needs the
eeprom fix that the attached script applies.

You can also try turning off TSO, since it was pretty well broken in 2.6.9
anyway. That should make the problem go away without an eeprom upgrade to
the 82573, IF it is the same issue being reported here.
Comment 6 wet 2006-06-16 10:34:17 EDT
(In reply to comment #3)
> This script will attempt to identify an eeprom on an 82573 with a known issue,
> and attempt to update the eeprom.

It was a little better after running the script and rebooting; however, I still
experience the aborted file transfers.
After disabling TSO again, everything looked fine.
Comment 7 Jesse Brandeburg 2006-06-16 13:53:54 EDT
Created attachment 131062 [details]
patch to fix 6.3.9-k4 TSO

What driver version are you running? Probably 6.3.9-k4, which doesn't have the
TSO workaround that the 82573 needs.

You probably need this patch; later drivers (such as those in Linville's test
kernels) already have this fix.
Comment 8 wet 2006-06-16 16:18:43 EDT
Thanks, I have:
  Intel(R) PRO/1000 Network Driver - version 6.3.9-k4-NAPI
and will try a new kernel or driver in two weeks, when I'm back from vacation.
Until then, the server should work with TSO disabled.


Comment 9 John W. Linville 2006-07-11 10:54:04 EDT
The e1000 driver in current FC5 kernels seems to have the patch from comment 7 
(or its descendant)...  Does this issue still occur w/ current FC5 kernels?
Comment 10 wet 2006-07-14 12:59:27 EDT
(In reply to comment #9)
> The e1000 driver in current FC5 kernels seems to have the patch from comment 7 
> (or its descendant)...  Does this issue still occur w/ current FC5 kernels?

I don't know whether the original issue is really gone, but at least someone in
the new kernel knows about the disable-TSO workaround and I had no problems
sending two 1.8GB files.

titan kernel: Intel(R) PRO/1000 Network Driver - version 7.0.33-k2-NAPI
..
titan kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
titan kernel: e1000: eth0: e1000_watchdog_task: 10/100 speed: disabling TSO
Comment 11 John W. Linville 2006-07-14 14:43:26 EDT
Sounds like things are fixed (at least well enough)...
