+++ This bug was initially created as a clone of Bug #182215 +++

Description of problem:
After install of FC5, yum update I encounter problems when I copy big files (~2GB) from a Win2K Workstation to the FC5 server using scp. Transfer stops around 1.5 GB, in /var/log/messages I see:

Jun 7 15:53:27 titan kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 7 15:53:27 titan kernel: Tx Queue <0>
Jun 7 15:53:27 titan kernel: TDH <15>
Jun 7 15:53:27 titan kernel: TDT <45>
Jun 7 15:53:27 titan kernel: next_to_use <45>
Jun 7 15:53:27 titan kernel: next_to_clean <15>
Jun 7 15:53:27 titan kernel: buffer_info[next_to_clean]
Jun 7 15:53:27 titan kernel: time_stamp <f26f835>
Jun 7 15:53:27 titan kernel: next_to_watch <18>
Jun 7 15:53:27 titan kernel: jiffies <f26fb14>
Jun 7 15:53:27 titan kernel: next_to_watch.status <0>
Jun 7 15:53:29 titan kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 7 15:53:29 titan kernel: Tx Queue <0>
Jun 7 15:53:29 titan kernel: TDH <15>
Jun 7 15:53:29 titan kernel: TDT <45>
Jun 7 15:53:29 titan kernel: next_to_use <45>
Jun 7 15:53:29 titan kernel: next_to_clean <15>
Jun 7 15:53:29 titan kernel: buffer_info[next_to_clean]
Jun 7 15:53:29 titan kernel: time_stamp <f26f835>
Jun 7 15:53:29 titan kernel: next_to_watch <18>
Jun 7 15:53:29 titan kernel: jiffies <f26fd08>
Jun 7 15:53:29 titan kernel: next_to_watch.status <0>
Jun 7 15:53:31 titan kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 7 15:53:31 titan kernel: Tx Queue <0>
Jun 7 15:53:31 titan kernel: TDH <15>
Jun 7 15:53:31 titan kernel: TDT <45>
Jun 7 15:53:31 titan kernel: next_to_use <45>
Jun 7 15:53:31 titan kernel: next_to_clean <15>
Jun 7 15:53:31 titan kernel: buffer_info[next_to_clean]
Jun 7 15:53:31 titan kernel: time_stamp <f26f835>
Jun 7 15:53:31 titan kernel: next_to_watch <18>
Jun 7 15:53:31 titan kernel: jiffies <f26fefd>
Jun 7 15:53:31 titan kernel: next_to_watch.status <0>
Jun 7 15:53:33 titan kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 7 15:53:33 titan kernel: Tx Queue <0>
Jun 7 15:53:33 titan kernel: TDH <15>
Jun 7 15:53:33 titan kernel: TDT <45>
Jun 7 15:53:33 titan kernel: next_to_use <45>
Jun 7 15:53:33 titan kernel: next_to_clean <15>
Jun 7 15:53:33 titan kernel: buffer_info[next_to_clean]
Jun 7 15:53:33 titan kernel: time_stamp <f26f835>
Jun 7 15:53:33 titan kernel: next_to_watch <18>
Jun 7 15:53:33 titan kernel: jiffies <f2700f1>
Jun 7 15:53:33 titan kernel: next_to_watch.status <0>
Jun 7 15:53:34 titan kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 7 15:53:34 titan kernel: br0: port 1(eth0) entering disabled state

Version-Release number of selected component (if applicable):
2.6.16-1.2122_FC5smp #1 SMP Sun May 21 15:18:32 EDT 2006 i686 i686 i386 GNU/Linux

How reproducible:
Whenever I use scp to transfer big files.

Steps to Reproduce:
On Win2k Workstation, run pscp bigfile root@server:/some/where/

Actual results:
Transmission of file stops around ~1.5GB, I see above message in /var/log/messages.

Expected results:
No error messages, simple copy of a file should work.

Additional info:
Disabling TSO (using "/sbin/ethtool -K eth0 tso off") as found somewhere on the web looks like a workaround, at least I could transfer a few 2GB files.
[root@titan ~]# lspci
00:00.0 Host bridge: Intel Corporation E7230 Memory Controller Hub (rev 81)
00:01.0 PCI bridge: Intel Corporation E7230 PCI Express Root Port (rev 81)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) Serial ATA Storage Controllers cc=IDE (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
02:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
02:00.1 PIC: Intel Corporation 6700/6702PXH I/OxAPIC Interrupt Controller A (rev 09)
04:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
05:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
0a:00.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
-----------------------
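For readers trying to interpret the "Detected Tx Unit Hang" dumps in the report above, here is a minimal Python sketch that turns one dump into numbers. It assumes every value in angle brackets is hexadecimal (the time_stamp and jiffies values clearly are; for TDH and TDT this is an assumption), and the ring size of 256 is an assumed default that does not come from the report. The helper names are hypothetical.

```python
import re

# One dump from the report, copied verbatim with the syslog prefix dropped.
SAMPLE_DUMP = """\
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Tx Queue <0>
TDH <15>
TDT <45>
next_to_use <45>
next_to_clean <15>
buffer_info[next_to_clean]
time_stamp <f26f835>
next_to_watch <18>
jiffies <f26fb14>
next_to_watch.status <0>
"""

# Matches lines of the form "name <hexvalue>".
FIELD_RE = re.compile(
    r"^[ \t]*([\w.\[\] ]+?)[ \t]*<([0-9a-fA-F]+)>[ \t]*$", re.MULTILINE
)


def parse_hang_dump(text):
    """Return {field name: int}; values are read as hexadecimal (assumption)."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(text)}


def summarize(fields, ring_size=256):
    """ring_size=256 is an assumed Tx ring size, not taken from the report."""
    return {
        # Descriptors handed to the NIC (tail) but not yet fetched (head).
        "pending_descriptors": (fields["TDT"] - fields["TDH"]) % ring_size,
        # How long the oldest unfinished buffer has been waiting, in jiffies.
        "stuck_for_jiffies": fields["jiffies"] - fields["time_stamp"],
    }


# jiffies values of the four consecutive dumps, logged two seconds apart.
JIFFIES_IN_REPORT = [0xF26FB14, 0xF26FD08, 0xF26FEFD, 0xF2700F1]
deltas = [b - a for a, b in zip(JIFFIES_IN_REPORT, JIFFIES_IN_REPORT[1:])]

if __name__ == "__main__":
    print(summarize(parse_hang_dump(SAMPLE_DUMP)))
    print(deltas)
```

In all four dumps TDH, TDT and time_stamp stay the same while jiffies grows, which is consistent with the transmit unit making no progress at all. The jiffies values advance by about 500 between dumps logged two seconds apart, which suggests (but does not prove) a tick rate of roughly 250 per second on this kernel.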
We're starting to get a lot of these reports; many of them are related to the 82573 eeprom and can be fixed by updating it. Please send the output of ethtool -e for both interfaces.
One of our test (internal) file servers had the same problem yesterday, and it took the network down along with it as well (very serious) ..

Only eth0 i.e. onboard 82573V was in use at the time of the problem. Currently this interface has been downed, and the server is currently running off the other onboard NIC.

# lspci
00:00.0 Host bridge: Intel Corporation E7230 Memory Controller Hub
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) Serial ATA Storage Controller AHCI (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
03:00.0 Ethernet controller: Intel Corporation 82573V Gigabit Ethernet Controller (Copper) (rev 03)
04:04.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
04:05.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)

# lspci -n
00:00.0 Class 0600: 8086:2778
00:1c.0 Class 0604: 8086:27d0 (rev 01)
00:1c.4 Class 0604: 8086:27e0 (rev 01)
00:1c.5 Class 0604: 8086:27e2 (rev 01)
00:1d.0 Class 0c03: 8086:27c8 (rev 01)
00:1d.1 Class 0c03: 8086:27c9 (rev 01)
00:1d.2 Class 0c03: 8086:27ca (rev 01)
00:1d.3 Class 0c03: 8086:27cb (rev 01)
00:1d.7 Class 0c03: 8086:27cc (rev 01)
00:1e.0 Class 0604: 8086:244e (rev e1)
00:1f.0 Class 0601: 8086:27b8 (rev 01)
00:1f.1 Class 0101: 8086:27df (rev 01)
00:1f.2 Class 0106: 8086:27c1 (rev 01)
00:1f.3 Class 0c05: 8086:27da (rev 01)
03:00.0 Class 0200: 8086:108b (rev 03)
04:04.0 Class 0300: 1002:515e (rev 02)
04:05.0 Class 0200: 8086:1076 (rev 05)

ifconfig before taking down the problem NIC:

# cat ifconfig.out
eth0      Link encap:Ethernet  HWaddr 00:13:20:D6:AD:E3
          inet addr:10.65.6.1  Bcast:10.65.6.255  Mask:255.255.255.0
          inet6 addr: fe80::213:20ff:fed6:ade3/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:84132882 errors:297966072 dropped:297966072 overruns:297966072 frame:0
          TX packets:10677632885 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:75213992657 (70.0 GiB)  TX bytes:854693824469 (795.9 GiB)
          Base address:0x2000 Memory:88100000-88120000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:6082 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6082 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:868538 (848.1 KiB)  TX bytes:868538 (848.1 KiB)

# ethtool -e eth0
Offset          Values
------          ------
0x0000          00 13 20 d6 ad e3 30 0b 46 f7 01 10 ff ff ff ff
0x0010          ff ff ff ff 6b 02 a3 30 86 80 8b 10 86 80 de 80
0x0020          00 00 00 20 14 7e 00 00 00 00 d8 00 00 00 00 27
0x0030          c9 6c 50 31 22 07 0b 04 84 09 00 00 00 c0 06 07
0x0040          08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
0x0050          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0060          00 01 00 40 1c 12 07 40 ff ff ff ff ff ff ff ff
0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 22 57

# ethtool -e eth1
Offset          Values
------          ------
0x0000          00 13 20 d6 ad e4 10 02 ff ff 00 10 ff ff ff ff
0x0010          ff ff ff ff 0b 64 a1 30 86 80 76 10 86 80 84 b2
0x0020          dd 20 22 22 00 00 90 2f 80 23 12 00 20 1e 12 00
0x0030          20 1e 12 00 20 1e 12 00 20 1e 09 00 00 02 00 00
0x0040          0c 00 a6 93 0b 28 00 00 00 04 ff ff ff ff ff ff
0x0050          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 02 06
0x0060          00 01 00 40 1c 12 07 40 ff ff ff ff ff ff ff ff
0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 83 18

# uname -rmpio
2.6.9-34.ELsmp x86_64 x86_64 x86_64 GNU/Linux
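To make the `ethtool -e` dumps in this comment easier to cross-check, the following Python sketch parses such a dump into bytes and reads two fields from it. The function names are hypothetical. The MAC address sitting in the first six bytes matches the HWaddr shown by ifconfig; reading the PCI device ID as a little-endian word at byte offset 0x1a is an observation from the two dumps here (0x108b for eth0 and 0x1076 for eth1, matching `lspci -n`), not something taken from a datasheet. The sketch does not reproduce whatever check the attached script uses to detect the known eeprom issue, since that check is not shown in this thread.

```python
# First two rows of the eth0 dump from this comment.
SAMPLE_ETH0 = """\
Offset          Values
------          ------
0x0000          00 13 20 d6 ad e3 30 0b 46 f7 01 10 ff ff ff ff
0x0010          ff ff ff ff 6b 02 a3 30 86 80 8b 10 86 80 de 80
"""


def parse_eeprom_dump(text):
    """Convert `ethtool -e` hex-dump text into a bytes object."""
    data = bytearray()
    for line in text.splitlines():
        parts = line.split()
        # Skip the header, the separator, and blank lines.
        if not parts or not parts[0].startswith("0x"):
            continue
        offset = int(parts[0], 16)
        if offset != len(data):
            raise ValueError(f"unexpected offset {parts[0]} in dump")
        data.extend(int(byte, 16) for byte in parts[1:])
    return bytes(data)


def mac_address(eeprom):
    """Format the first six bytes as a MAC address."""
    return ":".join(f"{byte:02x}" for byte in eeprom[:6])


def le_word(eeprom, byte_offset):
    """Read a 16-bit little-endian word at the given byte offset."""
    return int.from_bytes(eeprom[byte_offset:byte_offset + 2], "little")


if __name__ == "__main__":
    eeprom = parse_eeprom_dump(SAMPLE_ETH0)
    print(mac_address(eeprom))
    print(hex(le_word(eeprom, 0x1A)))  # offset 0x1a: observed, not from a datasheet
```

Saving the full dump before running any eeprom update gives a reference copy to compare against afterwards.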
Created attachment 130866
short script to attempt eeprom fix for 82573

This script will attempt to identify an eeprom on an 82573 with a known issue, and attempt to update the eeprom. The machine will have to be rebooted for the changes to the eeprom to take effect (the PCIe link has to be forced to renegotiate).
There is a note missing (since the hardware failure) where the original submitter attached his 82573 eeproms, both of which have the known issue that can cause TX timeouts when TSO is enabled.
(In reply to comment #2)
> One of our test (internal) file servers had the same problem yesterday, and it
> took the network down along with it as well (very serious) ..

Why did the network go down? Didn't the adapter get reset after the tx timeout and recover?

> Only eth0 i.e. onboard 82573V was in use at the time of the problem.
> Currently this interface has been downed, and the server is currently running off
> the other onboard NIC.

Sorry to hear about the issue. Your 82573 eeprom dump shows that it needs the eeprom fix that the attached script applies. You can also try turning off TSO, since it was pretty well broken in 2.6.9 anyway; that should make the problem go away without an eeprom upgrade to the 82573, IF it is the same issue being reported here.
(In reply to comment #3)
> This script will attempt to identify an eeprom on an 82573 with a known issue,
> and attempt to update the eeprom.

It was a little better after running the script and rebooting; however, I still experience the aborted file transfers. After disabling TSO again, everything looked fine.
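Since the TSO workaround keeps coming up in this thread, here is a small Python sketch for checking and applying it from a script. The `ethtool -K eth0 tso off` invocation is the one quoted in the original report; the helper names are hypothetical, and the sample `ethtool -k` output below is illustrative only and was not captured from any machine in this thread. The pattern accepts both the "tcp segmentation offload" and "tcp-segmentation-offload" spellings because the label differs between ethtool versions.

```python
import re
import subprocess

# Illustrative `ethtool -k eth0` output (an assumption, not from this thread).
ILLUSTRATIVE_OUTPUT = """\
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
"""

TSO_RE = re.compile(
    r"^\s*tcp[- ]segmentation[- ]offload:\s*(on|off)\b", re.MULTILINE
)


def tso_enabled(ethtool_k_output):
    """Return True if the given `ethtool -k` output reports TSO as on."""
    match = TSO_RE.search(ethtool_k_output)
    if match is None:
        raise ValueError("no TSO line found in ethtool output")
    return match.group(1) == "on"


def disable_tso_command(interface):
    """Build the workaround command quoted in the original report."""
    return ["/sbin/ethtool", "-K", interface, "tso", "off"]


def disable_tso(interface):
    """Run the workaround; needs root and an installed ethtool."""
    subprocess.run(disable_tso_command(interface), check=True)


if __name__ == "__main__":
    print(tso_enabled(ILLUSTRATIVE_OUTPUT))
    print(" ".join(disable_tso_command("eth0")))
```

A setting made this way applies to the running system only, so it would need to be reapplied after a reboot or driver reload until a fixed driver is installed.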
Created attachment 131062
patch to fix 6.3.9-k4 TSO

What driver version are you running? Probably 6.3.9-k4, which doesn't have the TSO workaround that the 82573 needs. You probably need this patch; later drivers (like the ones in linville's test kernels) already have this fix.
Thanks, I have: Intel(R) PRO/1000 Network Driver - version 6.3.9-k4-NAPI and will try a new Kernel or driver in two weeks, when I'm back from vacation. Until then, the server should work without TSO.
The e1000 driver in current FC5 kernels seems to have the patch from comment 7 (or its descendant)... Does this issue still occur w/ current FC5 kernels?
(In reply to comment #9)
> The e1000 driver in current FC5 kernels seems to have the patch from comment 7
> (or its descendant)... Does this issue still occur w/ current FC5 kernels?

I don't know whether the original issue is really gone, but at least someone in the new kernel knows about the disable-TSO workaround and I had no problems sending two 1.8GB files.

titan kernel: Intel(R) PRO/1000 Network Driver - version 7.0.33-k2-NAPI
..
titan kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
titan kernel: e1000: eth0: e1000_watchdog_task: 10/100 speed: disabling TSO
Sounds like things are fixed (at least well enough)...