Description of problem:
A lot of data mismatch errors are logged in /var/log/messages on an RX800 S3 with RHEL5.1 and the native e1000 driver 7.3.20-k2:

.. 09:02:55 RX800S3 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
.. 09:02:55 RX800S3 kernel: Tx Queue <0>
.. 09:02:55 RX800S3 kernel: TDH <23>
.. 09:02:55 RX800S3 kernel: TDT <48>
.. 09:02:55 RX800S3 kernel: next_to_use <48>
.. 09:02:55 RX800S3 kernel: next_to_clean <1f>
.. 09:02:55 RX800S3 kernel: buffer_info[next_to_clean]
.. 09:02:55 RX800S3 kernel: time_stamp <101170d9f>
.. 09:02:55 RX800S3 kernel: next_to_watch <25>
.. 09:02:55 RX800S3 kernel: jiffies <101171080>
.. 09:02:55 RX800S3 kernel: next_to_watch.status <0>
.. 09:02:57 RX800S3 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
.. 09:02:57 RX800S3 kernel: Tx Queue <0>
.. 09:02:57 RX800S3 kernel: TDH <23>
.. 09:02:57 RX800S3 kernel: TDT <48>
.. 09:02:57 RX800S3 kernel: next_to_use <48>
.. 09:02:57 RX800S3 kernel: next_to_clean <1f>
.. 09:02:57 RX800S3 kernel: buffer_info[next_to_clean]
.. 09:02:57 RX800S3 kernel: time_stamp <101170d9f>
.. 09:02:57 RX800S3 kernel: next_to_watch <25>
.. 09:02:57 RX800S3 kernel: jiffies <101171274>
.. 09:02:57 RX800S3 kernel: next_to_watch.status <0>
.. 09:02:58 RX800S3 kernel: NETDEV WATCHDOG: eth0: transmit timed out
.. 09:03:02 RX800S3 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

Version-Release number of selected component (if applicable):
2.6.18-53.el5

How reproducible:
Always, with RX800 S3, Intel Pro1000 LAN adapter and RHEL5.1.

Steps to Reproduce:
Run a stress test over NFS.

Additional info:
The error happens only on RX800 S3 systems with RHEL5.1 (32bit, 64bit, with and without XEN) and is reproducible with different Intel Pro1000 LAN adapters in different PCI slots. The same test with the onboard LAN port (Broadcom, tg3) works fine.
Disabling TSO does indeed make the problem go away.

0a:01.0 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)
0a:01.1 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)

# dmesg | grep -i e1000
e1000: 0000:0a:01.0: e1000_probe: (PCI-X:133MHz:64-bit) 00:0e:0c:51:b1:78
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: 0000:0a:01.1: e1000_probe: (PCI-X:133MHz:64-bit) 00:0e:0c:51:b1:79
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

# ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

# ethtool -i eth0
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: N/A
bus-info: 0000:0a:01.0
Created attachment 297607 [details] ifconfig output, ethtool -k,-i outputs and others
Could you check if this issue still reproduces with kernel available at http://people.redhat.com/agospoda/#rhel5 ? That kernel is updated and has some test patches, so would be good to check if that still reproduces it. thanks, Flavio
The issue is still seen with the latest kernel from gospo. Flavio
There are probably still a few bits (watchdog timer stuff) that might be in the rhel5 e1000 driver that are NOT upstream though it was promised they would get there. I don't think it's worth removing since it will cause another bug to appear again, but we could consider removing those bits and retesting. Is there ANY chance we can get this reproduced on a non-customer system?
If this is only seen under load, this patch should apply just fine on RHEL5 and can be used along with new module parameters to work around this issue: http://people.redhat.com/agospoda/rhel4/0019-e1000-add-module-parameter-to-set-transmit-descript.patch Please see the following entry for how to use this new module parameter to try to work around issues with the 82545/6. https://bugzilla.redhat.com/show_bug.cgi?id=334411#c47
My test kernels have been updated to include a patch for this bugzilla. http://people.redhat.com/agospoda/#rhel5 Please test them and report back your results.
Anders, Can you tell me what tuning parameters they used? I'd like to know what they used for TxDescPower and TxDescriptors. Thanks!
I don't think disabling TSO is a guaranteed way to prevent this problem, but if it works for the customer I would say they should continue to do that.
in kernel-2.6.18-118.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
Hello, this is also happening on RHEL 4.7. Is there any available *official* fix? Thanks
There will be a fix for 4.8. See bug 334411
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-0225.html
I just had this again on 5.3, kernel 2.6.18-128.1.16.el5PAE.

TSO is disabled. I had this in the past with Fedora, and disabling TSO:
ethtool -K eth0 tso off
solved the problem. No luck now. However, the system didn't hang this time.

Aug 6 04:10:14 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug 6 04:10:14 localhost kernel: Tx Queue <0>
Aug 6 04:10:14 localhost kernel: TDH <2d>
Aug 6 04:10:14 localhost kernel: TDT <2d>
Aug 6 04:10:14 localhost kernel: next_to_use <2d>
Aug 6 04:10:14 localhost kernel: next_to_clean <d9>
Aug 6 04:10:14 localhost kernel: buffer_info[next_to_clean]
Aug 6 04:10:14 localhost kernel: time_stamp <7aa7f69>
Aug 6 04:10:14 localhost kernel: next_to_watch <d9>
Aug 6 04:10:14 localhost kernel: jiffies <7aa845f>
Aug 6 04:10:14 localhost kernel: next_to_watch.status <1>
Aug 6 04:10:52 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug 6 04:10:52 localhost kernel: Tx Queue <0>
Aug 6 04:10:52 localhost kernel: TDH <42>
Aug 6 04:10:52 localhost kernel: TDT <42>
Aug 6 04:10:52 localhost kernel: next_to_use <42>
Aug 6 04:10:52 localhost kernel: next_to_clean <21>
Aug 6 04:10:52 localhost kernel: buffer_info[next_to_clean]
Aug 6 04:10:52 localhost kernel: time_stamp <7ab0523>
Aug 6 04:10:52 localhost kernel: next_to_watch <24>
Aug 6 04:10:52 localhost kernel: jiffies <7ab0964>
Aug 6 04:10:52 localhost kernel: next_to_watch.status <1>
Aug 6 04:10:54 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug 6 04:10:54 localhost kernel: Tx Queue <0>
Aug 6 04:10:54 localhost kernel: TDH <ca>
Aug 6 04:10:54 localhost kernel: TDT <ca>
Aug 6 04:10:54 localhost kernel: next_to_use <ca>
Aug 6 04:10:54 localhost kernel: next_to_clean <a1>
Aug 6 04:10:54 localhost kernel: buffer_info[next_to_clean]
Aug 6 04:10:54 localhost kernel: time_stamp <7ab0cfa>
Aug 6 04:10:54 localhost kernel: next_to_watch <a4>
Aug 6 04:10:54 localhost kernel: jiffies <7ab11bf>
Aug 6 04:10:54 localhost kernel: next_to_watch.status <1>

Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)
        Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 50
        Memory at dd620000 (64-bit, non-prefetchable) [size=128K]
        I/O ports at 3000 [size=64]
        Capabilities: [dc] Power Management version 2
        Capabilities: [e4] PCI-X non-bridge device
        Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-

ethtool -i eth0
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: N/A
bus-info: 0000:04:02.0

ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

regards,
Giannis
(In reply to comment #38) > I just had this again > on 5.3 kernel 2.6.18-128.1.16.el5PAE > > TSO is disabled. I had this in the past with Fedora > and disabling TSO: > ethtool -K eth0 tso off > solved the problem. No luck now. However the system > didn't hung this time. > > Aug 6 04:10:14 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx > Unit Hang > Aug 6 04:10:14 localhost kernel: Tx Queue <0> > Aug 6 04:10:14 localhost kernel: TDH <2d> > Aug 6 04:10:14 localhost kernel: TDT <2d> > Aug 6 04:10:14 localhost kernel: next_to_use <2d> > Aug 6 04:10:14 localhost kernel: next_to_clean <d9> > Aug 6 04:10:14 localhost kernel: buffer_info[next_to_clean] > Aug 6 04:10:14 localhost kernel: time_stamp <7aa7f69> > Aug 6 04:10:14 localhost kernel: next_to_watch <d9> > Aug 6 04:10:14 localhost kernel: jiffies <7aa845f> > Aug 6 04:10:14 localhost kernel: next_to_watch.status <1> > Aug 6 04:10:52 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx > Unit Hang > Aug 6 04:10:52 localhost kernel: Tx Queue <0> > Aug 6 04:10:52 localhost kernel: TDH <42> > Aug 6 04:10:52 localhost kernel: TDT <42> > Aug 6 04:10:52 localhost kernel: next_to_use <42> > Aug 6 04:10:52 localhost kernel: next_to_clean <21> > Aug 6 04:10:52 localhost kernel: buffer_info[next_to_clean] > Aug 6 04:10:52 localhost kernel: time_stamp <7ab0523> > Aug 6 04:10:52 localhost kernel: next_to_watch <24> > Aug 6 04:10:52 localhost kernel: jiffies <7ab0964> > Aug 6 04:10:52 localhost kernel: next_to_watch.status <1> > Aug 6 04:10:54 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx > Unit Hang > Aug 6 04:10:54 localhost kernel: Tx Queue <0> > Aug 6 04:10:54 localhost kernel: TDH <ca> > Aug 6 04:10:54 localhost kernel: TDT <ca> > Aug 6 04:10:54 localhost kernel: next_to_use <ca> > Aug 6 04:10:54 localhost kernel: next_to_clean <a1> > Aug 6 04:10:54 localhost kernel: buffer_info[next_to_clean] > Aug 6 04:10:54 localhost kernel: time_stamp <7ab0cfa> > Aug 6 04:10:54 localhost kernel: 
next_to_watch <a4>
> Aug 6 04:10:54 localhost kernel: jiffies <7ab11bf>
> Aug 6 04:10:54 localhost kernel: next_to_watch.status <1>
>
> Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev
> 03)
> Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter
> Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 50
> Memory at dd620000 (64-bit, non-prefetchable) [size=128K]
> I/O ports at 3000 [size=64]
> Capabilities: [dc] Power Management version 2
> Capabilities: [e4] PCI-X non-bridge device
> Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0
> Enable-
>

Giannis, there are really only 3 ways to try to work around the known problem with this hardware.

1. Disable TSO. (You have tried this.)

2. Use the new module option described below (usually combined with an increase in the number of ring buffers so that you can keep the same amount of packet memory).

/* Transmit Descriptor Power
 *
 * Valid Range: 6-12
 * This value represents the size-order of each transmit descriptor.
 * The valid size for descriptors would be 2^6 (64) to 2^12 (4096) bytes
 * each. As this value decreases one may want to consider increasing
 * the TxDescriptors value to maintain the same amount of frame memory.
 *
 * Default Value: 12
 */
E1000_PARAM(TxDescPower, "Binary exponential size (2^X) of each transmit descriptor");

3. Effectively disable adaptive interrupt modulation by setting the module option InterruptThrottleRate=8000 for all devices.

/* Interrupt Throttle Rate (interrupts/sec)
 *
 * Valid Range: 100-100000 (0=off, 1=dynamic, 3=dynamic conservative)
 */
E1000_PARAM(InterruptThrottleRate, "Interrupt Throttling Rate");

Unfortunately many of our users have reported that the only method to truly stop seeing these errors is to use a different network adapter.
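The trade-off described for option 2 (smaller descriptors, more of them) can be worked through with a little arithmetic. The following is only a sketch under stated assumptions: it takes the driver comment at face value (frame memory = TxDescriptors * 2^TxDescPower), it assumes a starting TxDescriptors value of 256, which is not stated anywhere in this bug, and it assumes that TxDescPower accepts comma-separated per-port values like other e1000 module options. The helper names are made up for illustration.

```python
# Sketch: keep TxDescriptors * 2**TxDescPower constant when lowering TxDescPower.

DEFAULT_TX_DESC_POWER = 12      # default taken from the driver comment quoted above
ASSUMED_TX_DESCRIPTORS = 256    # assumption: starting ring size, not stated in this bug


def frame_memory_bytes(tx_descriptors, tx_desc_power):
    """Bytes of frame memory under the 2^X sizing rule from the driver comment."""
    if not 6 <= tx_desc_power <= 12:
        raise ValueError("TxDescPower must be in the range 6-12")
    return tx_descriptors * (1 << tx_desc_power)


def descriptors_for_same_memory(old_descriptors, old_power, new_power):
    """TxDescriptors needed at new_power to match the old frame memory (rounded up)."""
    target = frame_memory_bytes(old_descriptors, old_power)
    per_descriptor = frame_memory_bytes(1, new_power)
    return -(-target // per_descriptor)  # ceiling division


def modprobe_options(ports, tx_desc_power, tx_descriptors):
    """Build an options line; per-port comma-separated values are an assumption."""
    def per_port(value):
        return ",".join([str(value)] * ports)
    return ("options e1000 "
            f"TxDescPower={per_port(tx_desc_power)} "
            f"TxDescriptors={per_port(tx_descriptors)}")


if __name__ == "__main__":
    new_power = 11
    new_count = descriptors_for_same_memory(
        ASSUMED_TX_DESCRIPTORS, DEFAULT_TX_DESC_POWER, new_power)
    print(modprobe_options(2, new_power, new_count))
```

Whatever value comes out should still be checked against the TxDescriptors range the driver accepts for the adapter in question; very low TxDescPower values can ask for more descriptors than the ring allows.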
(In reply to comment #38)
> I just had this again
> on 5.3 kernel 2.6.18-128.1.16.el5PAE
>
> TSO is disabled. I had this in the past with Fedora
> and disabling TSO:
> ethtool -K eth0 tso off
> solved the problem. No luck now. However the system
> didn't hung this time.

In the past with Fedora *on this system*? What kind of system is this? Can you please attach lspci output?

> Aug 6 04:10:14 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx
> Unit Hang
> Aug 6 04:10:14 localhost kernel: Tx Queue <0>
> Aug 6 04:10:14 localhost kernel: TDH <2d>
> Aug 6 04:10:14 localhost kernel: TDT <2d>

Since TDH==TDT here, this is a "false hang": the hardware actually completed all available work, and something went wrong in the writeback process. Or, if these messages were all that was in your log (i.e. there was no NETDEV WATCHDOG message), it indicates that your system is for some reason taking an extremely long time to transmit some packets: they sit in the hardware tx ring for longer than two seconds and are eventually completed.

Are you running at 10Mb or 100Mb? Do you have flow control enabled? Can you please send the output of ethtool -S eth0 after one of these messages appears in the log?

> Aug 6 04:10:14 localhost kernel: next_to_use <2d> > Aug 6 04:10:14 localhost kernel: next_to_clean <d9> > Aug 6 04:10:14 localhost kernel: buffer_info[next_to_clean] > Aug 6 04:10:14 localhost kernel: time_stamp <7aa7f69> > Aug 6 04:10:14 localhost kernel: next_to_watch <d9> > Aug 6 04:10:14 localhost kernel: jiffies <7aa845f> > Aug 6 04:10:14 localhost kernel: next_to_watch.status <1> > Aug 6 04:10:52 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx > Unit Hang > Aug 6 04:10:52 localhost kernel: Tx Queue <0> > Aug 6 04:10:52 localhost kernel: TDH <42> > Aug 6 04:10:52 localhost kernel: TDT <42> > Aug 6 04:10:52 localhost kernel: next_to_use <42> > Aug 6 04:10:52 localhost kernel: next_to_clean <21> > Aug 6 04:10:52 localhost kernel: buffer_info[next_to_clean] > Aug 6 04:10:52 localhost kernel: time_stamp <7ab0523> > Aug 6 04:10:52 localhost kernel: next_to_watch <24> > Aug 6 04:10:52 localhost kernel: jiffies <7ab0964> > Aug 6 04:10:52 localhost kernel: next_to_watch.status <1> > Aug 6 04:10:54 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx > Unit Hang > Aug 6 04:10:54 localhost kernel: Tx Queue <0> > Aug 6 04:10:54 localhost kernel: TDH <ca> > Aug 6 04:10:54 localhost kernel: TDT <ca> > Aug 6 04:10:54 localhost kernel: next_to_use <ca> > Aug 6 04:10:54 localhost kernel: next_to_clean <a1> > Aug 6 04:10:54 localhost kernel: buffer_info[next_to_clean] > Aug 6 04:10:54 localhost kernel: time_stamp <7ab0cfa> > Aug 6 04:10:54 localhost kernel: next_to_watch <a4> > Aug 6 04:10:54 localhost kernel: jiffies <7ab11bf> > Aug 6 04:10:54 localhost kernel: next_to_watch.status <1> > > Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev > 03) > Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter > Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 50 > Memory at dd620000 (64-bit, non-prefetchable) [size=128K] > I/O ports at 3000 [size=64] > Capabilities: [dc] Power Management version 2 > 
Capabilities: [e4] PCI-X non-bridge device > Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 > Enable- > > > ethtool -i eth0 > driver: e1000 > version: 7.3.20-k2-NAPI > firmware-version: N/A > bus-info: 0000:04:02.0 > > ethtool -k eth0 > Offload parameters for eth0: > Cannot get device udp large send offload settings: Operation not supported > rx-checksumming: on > tx-checksumming: on > scatter-gather: on > tcp segmentation offload: off > udp fragmentation offload: off > generic segmentation offload: off I mostly agree with what Andy said in his post, but in this case I'm not sure his statements would apply.
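The TDH==TDT check discussed in this comment can be applied mechanically to a "Detected Tx Unit Hang" report. The sketch below pulls the hexadecimal fields out of one report and reports whether the ring was drained and how old the oldest pending buffer was. The function names are made up for illustration, and the HZ value used to convert jiffies to seconds is an assumption (it depends on how the kernel was built), so the result in seconds should be treated as approximate.

```python
import re

# Matches fields such as "TDH <2d>" or "next_to_watch.status <1>"; values are hex.
FIELD_RE = re.compile(r"([A-Za-z_.]+)\s+<([0-9a-fA-F]+)>")

# First report from comment #38, copied from the log above.
SAMPLE = """\
Aug 6 04:10:14 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug 6 04:10:14 localhost kernel: Tx Queue <0>
Aug 6 04:10:14 localhost kernel: TDH <2d>
Aug 6 04:10:14 localhost kernel: TDT <2d>
Aug 6 04:10:14 localhost kernel: next_to_use <2d>
Aug 6 04:10:14 localhost kernel: next_to_clean <d9>
Aug 6 04:10:14 localhost kernel: buffer_info[next_to_clean]
Aug 6 04:10:14 localhost kernel: time_stamp <7aa7f69>
Aug 6 04:10:14 localhost kernel: next_to_watch <d9>
Aug 6 04:10:14 localhost kernel: jiffies <7aa845f>
Aug 6 04:10:14 localhost kernel: next_to_watch.status <1>
"""


def parse_hang_report(text):
    """Return the numeric fields of a single hang report as a dict."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(text)}


def summarize(fields, hz=1000):
    """hz is an assumption; pass the HZ your kernel was actually built with."""
    age = fields["jiffies"] - fields["time_stamp"]
    return {
        # True means the hardware head caught up with the tail (nothing left queued).
        "ring_drained": fields["TDH"] == fields["TDT"],
        "age_jiffies": age,
        "age_seconds": age / hz,
    }


if __name__ == "__main__":
    print(summarize(parse_hang_report(SAMPLE)))
```

Run against the original report at the top of this bug (TDH <23>, TDT <48>), the same check returns ring_drained as False, which is the case where the hardware still had work queued.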
Created attachment 356571 [details] lspci -vvv
I've attached the lspci output.

The system is a dual Xeon(TM) CPU 3.20GHz with 4G RAM, running as an ftp/http mirror with htb enabled. eth0 is connected to a 1 Gigabit port.

The system used to run Fedora. At that time I ran into this problem, http://bugzilla.kernel.org/show_bug.cgi?id=9808, and I solved it by disabling TSO. The system had random cold hangs. I had this again (cold hangs) when I moved to CentOS, and disabling TSO solved it again. This is the first time I've hit this with TSO disabled. This time the system didn't hang. However, it was slow...

Other system options:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 2500
net.ipv4.ip_conntrack_max=131072

This time it happened right after my data transfer from a backup server. I was rsyncing the data back to the server for more than one day at 700-800Mbps with no problem. 10 minutes after the transfer finished and I enabled the services, I had the Tx Unit Hang.

Bear in mind that I've updated to 2.6.18-128.4.1.el5PAE today.

If you need any more info I would be glad to help.
Giannis
Can you paste the output of: # ethtool -a eth0 # ethtool eth0 from any time the system is in use. And also # ethtool -S eth0 after a failure like you have seen in comment #38?
ethtool -a eth0
Pause parameters for eth0:
Autonegotiate: on
RX: on
TX: off

ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
        Link detected: yes

If I have a failure I will post ethtool -S eth0.
Giannis
Hi, I had another Tx Unit Hang this morning on this same machine (5.4). 2.6.18-164.10.1.el5PAE TSO is disabled. I will post all the detail: Jan 13 04:02:20 host kernel: NETDEV WATCHDOG: eth0: transmit timed out Jan 13 04:02:21 host kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang Jan 13 04:02:21 host kernel: Tx Queue <0> Jan 13 04:02:21 host kernel: TDH <e7> Jan 13 04:02:21 host kernel: TDT <e7> Jan 13 04:02:21 host kernel: next_to_use <e7> Jan 13 04:02:21 host kernel: next_to_clean <bd> Jan 13 04:02:21 host kernel: buffer_info[next_to_clean] Jan 13 04:02:21 host kernel: time_stamp <88b1b22> Jan 13 04:02:21 host kernel: next_to_watch <bf> Jan 13 04:02:21 host kernel: jiffies <88b3b60> Jan 13 04:02:21 host kernel: next_to_watch.status <1> Jan 13 04:02:24 host kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX # ethtool -i eth0 driver: e1000 version: 7.3.20-k2-NAPI firmware-version: N/A bus-info: 0000:04:02.0 # ethtool -k eth0 Offload parameters for eth0: Cannot get device udp large send offload settings: Operation not supported rx-checksumming: on tx-checksumming: on scatter-gather: on tcp segmentation offload: off udp fragmentation offload: off generic segmentation offload: off generic-receive-offload: off # ethtool eth0 Settings for eth0: Supported ports: [ TP ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full Supports auto-negotiation: Yes Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Full Advertised auto-negotiation: Yes Speed: 1000Mb/s Duplex: Full Port: Twisted Pair PHYAD: 0 Transceiver: internal Auto-negotiation: on Supports Wake-on: umbg Wake-on: g Current message level: 0x00000007 (7) Link detected: yes # ethtool -a eth0 Pause parameters for eth0: Autonegotiate: on RX: on TX: off # ethtool -S eth0 NIC statistics: rx_packets: 939575758 tx_packets: 1521712731 rx_bytes: 263510394579 tx_bytes: 2171648060330 rx_broadcast: 
293 tx_broadcast: 292 rx_multicast: 0 tx_multicast: 13754 rx_errors: 0 tx_errors: 0 tx_dropped: 0 multicast: 0 collisions: 0 rx_length_errors: 0 rx_over_errors: 0 rx_crc_errors: 0 rx_frame_errors: 0 rx_no_buffer_count: 596322 rx_missed_errors: 302320 tx_aborted_errors: 0 tx_carrier_errors: 0 tx_fifo_errors: 0 tx_heartbeat_errors: 0 tx_window_errors: 0 tx_abort_late_coll: 0 tx_deferred_ok: 0 tx_single_coll_ok: 0 tx_multi_coll_ok: 0 tx_timeout_count: 1 tx_restart_queue: 1933172 rx_long_length_errors: 0 rx_short_length_errors: 0 rx_align_errors: 0 tx_tcp_seg_good: 747 tx_tcp_seg_failed: 0 rx_flow_control_xon: 0 rx_flow_control_xoff: 0 tx_flow_control_xon: 0 tx_flow_control_xoff: 0 rx_long_byte_count: 263510394579 rx_csum_offload_good: 939560507 rx_csum_offload_errors: 6022 rx_header_split: 0 alloc_rx_buff_failed: 0 tx_smbus: 0 rx_smbus: 0 dropped_smbus: 0 04:02.0 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03) Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- Latency: 64 (63750ns min), Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 233 Region 0: Memory at dd620000 (64-bit, non-prefetchable) [size=128K] Region 4: I/O ports at 3000 [size=64] Capabilities: [dc] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=1 PME- Capabilities: [e4] PCI-X non-bridge device Command: DPERE- ERO+ RBC=512 OST=1 Status: Dev=04:02.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz- Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable- Address: 0000000000000000 Data: 0000 best regards, Giannis
Giannis, thank you for posting that detailed information in comment #44 and comment #45. I see that flow control is enabled, but the statistics do not indicate XOFF frames were received. Jesse, that seems to rule out flow-control as the problem, right?
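The flow-control check above amounts to reading four counters out of the ethtool -S output. A minimal sketch of that check follows; the function names are made up for illustration, and the sample text is an excerpt of the statistics posted in comment #45, not the full output.

```python
# Excerpt of the `ethtool -S eth0` output from comment #45.
SAMPLE = """\
NIC statistics:
     tx_timeout_count: 1
     tx_restart_queue: 1933172
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
"""

PAUSE_COUNTERS = (
    "rx_flow_control_xon",
    "rx_flow_control_xoff",
    "tx_flow_control_xon",
    "tx_flow_control_xoff",
)


def parse_ethtool_stats(text):
    """Turn 'name: value' lines into a dict, skipping the header line."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def nonzero_pause_counters(stats):
    """Return only the pause-frame counters that are above zero."""
    return {name: stats[name] for name in PAUSE_COUNTERS if stats.get(name, 0) > 0}


if __name__ == "__main__":
    seen = nonzero_pause_counters(parse_ethtool_stats(SAMPLE))
    print("pause frames seen:", seen if seen else "none")
```

An empty result only says that no XON/XOFF frames were counted; it does not say anything about other causes of a stalled transmit ring.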
Yes, flow control is confirmed not to be the issue.

This seems to be a slightly different report than before, because we are now seeing NETDEV WATCHDOG, which means that transmits were not completed for a long time. The hangs in comment 38 and comment 45 show TDH==TDT, which means that the hardware has finished processing tx packets.

At this stage the only way we can figure out what is going on is to get a full descriptor ring dump from the e1000_dump function (or get a PCI-X bus trace). Typically the driver is not getting the DD bit back from the descriptor writeback in these cases, usually due to some weird race, but usually these issues aren't reported on Intel systems.

I have a prototype patch I made for upstream that I will attach. Andy, not sure if you could build a kernel or driver for Kapetanakis. The patch builds against net-next but I've not done much testing on it. He will need to either load the module with the debug_dump=2 module option or modify the sysfs parameter of the same name at runtime. This will dump a ton to dmesg/syslog, so decreasing the tx/rx descriptors can be helpful in reducing the amount of data dumped (maybe 80/80 or 128/128 using ethtool -G).

By the way, I have a system of that class/vintage in my office, but I don't have 4GB of RAM. The slot that you're in, in that machine, is typically a shared PCI-X slot. Is there a chance you could rearrange the adapters so the Adaptec and the 82546 switch slots? It might make a difference, but I realize this might not be an easy experiment on a production machine.
Andy, I don't know if you're referring to me, but this is a production server (official http/ftp mirror for tons of sites, including fedora, centos etc). Thus I cannot play with custom kernels... In addition, the network interface is onboard. Also, I haven't had another issue in the last 10 days, so we can't be sure whether something makes a difference or not... best rgds Giannis
I was wrong. I had one more yesterday:

Jan 19 04:02:21 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 19 04:02:24 host kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Jan 19 04:05:26 host kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jan 19 04:05:26 host kernel: Tx Queue <0>
Jan 19 04:05:26 host kernel: TDH <3>
Jan 19 04:05:26 host kernel: TDT <3>
Jan 19 04:05:26 host kernel: next_to_use <3>
Jan 19 04:05:26 host kernel: next_to_clean <bb>
Jan 19 04:05:26 host kernel: buffer_info[next_to_clean]
Jan 19 04:05:26 host kernel: time_stamp <27746cb3>
Jan 19 04:05:26 host kernel: next_to_watch <bd>
Jan 19 04:05:26 host kernel: jiffies <27747376>
Jan 19 04:05:26 host kernel: next_to_watch.status <1>

These "NETDEV WATCHDOG: eth0: transmit timed out" messages are quite frequent, at least one every day, but they are not always followed by a Tx Unit Hang:

Jan 20 04:02:20 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 20 04:02:24 host kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
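To see how often a watchdog timeout is accompanied by a Tx Unit Hang report, the two message types can be tallied per day from the syslog. This is only a sketch: the function name is made up, it keys on the "Mon DD" prefix of each syslog line, and the sample lines are the ones quoted in this comment.

```python
from collections import Counter

# Lines quoted in this comment (hang detail lines omitted).
SAMPLE = [
    "Jan 19 04:02:21 host kernel: NETDEV WATCHDOG: eth0: transmit timed out",
    "Jan 19 04:02:24 host kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX",
    "Jan 19 04:05:26 host kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang",
    "Jan 20 04:02:20 host kernel: NETDEV WATCHDOG: eth0: transmit timed out",
    "Jan 20 04:02:24 host kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX",
]


def count_events(lines):
    """Count watchdog timeouts and hang reports per syslog day ('Mon DD')."""
    watchdog = Counter()
    hang = Counter()
    for line in lines:
        day = " ".join(line.split()[:2])
        if "NETDEV WATCHDOG" in line and "transmit timed out" in line:
            watchdog[day] += 1
        elif "Detected Tx Unit Hang" in line:
            hang[day] += 1
    return watchdog, hang


if __name__ == "__main__":
    watchdog, hang = count_events(SAMPLE)
    for day in sorted(set(watchdog) | set(hang)):
        print(day, "watchdog:", watchdog[day], "hang:", hang[day])
```

In practice the lines would come from /var/log/messages rather than a list in the script.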
Created attachment 385761 [details] e1000 dump code upstream proposal (for reference) untested upstream patch.
Giannis, Jesse, Hi, I think it could be related to this: https://bugzilla.redhat.com/show_bug.cgi?id=499355#c8 Maybe it would be good to give it a try. Flavio
Also it might be related to the htb code. I'm running QoS on this machine: https://bugzilla.redhat.com/show_bug.cgi?id=481546
Kapetanakis, is there a chance you can try running with the patch from comment 50? If you need me to build you an e1000 driver from sourceforge with the patch applied I can do that. Should this bug be reopened?
I could do that, but how would you know if it fixes my problem? The system is up for 14 days with no problem...except for the NETDEV WATCHDOG: eth0: transmit timed out message. Anyway, do you want me to try e1000-8.0.16.tar.gz with your patch on it?
I wasn't wanting the sourceforge driver to fix your problem, I (perhaps wrongly) assumed that you would need a "stand alone" driver to replace your redhat e1000. I can actually build you a driver source *for* your kernel, from the redhat sources with my patch applied, if I know exactly what kernel you're running. Also, it would help answer some questions for me if you could attach your /proc/interrupts and your dmesg. One of the things about this 7320 system is that the kernel won't allow interrupt affinity due to some system bug, so irqs are moving every interrupt to the next processor. I have a 7320 system in my office. This is probably not related but worth mentioning because it can encourage some racy code behavior. There are also some test kernels for a backport bug in https://bugzilla.redhat.com/show_bug.cgi?id=499355, but that bug is against RHEL4.
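The interrupt-migration behaviour mentioned above shows up in /proc/interrupts as counts spread roughly evenly across all CPUs for the NIC's IRQ. A small sketch for reading those per-CPU counts follows. The function name is made up, and the sample text is hypothetical, not taken from the attachment on this bug.

```python
# Hypothetical /proc/interrupts excerpt, for illustration only.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  0:    1000000          0          0          0    IO-APIC-edge  timer
233:      25010      24990      25003      24997   IO-APIC-level  eth0
NMI:          0          0          0          0
"""


def irq_counts(text, device):
    """Return {cpu_name: interrupt_count} for the first IRQ line naming device."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()
    for line in lines[1:]:
        tokens = line.split()
        # Everything after the per-CPU counts is the controller type and device names.
        names = [token.rstrip(",") for token in tokens[len(cpus) + 1:]]
        if device in names:
            return dict(zip(cpus, (int(t) for t in tokens[1:len(cpus) + 1])))
    return {}


if __name__ == "__main__":
    counts = irq_counts(SAMPLE, "eth0")
    total = sum(counts.values())
    for cpu, count in counts.items():
        print(f"{cpu}: {count} ({100.0 * count / total:.1f}%)")
```

Counts concentrated on a single CPU would suggest the IRQ is pinned; near-equal counts on every CPU are consistent with the round-robin behaviour described in this comment.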
You're right, we should apply the patch to the running driver... and not to sourceforge's driver. Yes, you can send me the patch to apply to my kernel sources. Best if it only recompiles the e1000 module and not the whole monster. I will attach dmesg and /proc/interrupts. Giannis
Created attachment 389387 [details] /proc/interrupts
Created attachment 389388 [details] dmesg
Sorry forgot to add kernel version: PAE 2.6.18-164.11.1.el5PAE it's on dmesg, but never mind :)