436966 – e1000_clean_tx_irq: Detected Tx Unit Hang - 82546EB

Summary: e1000_clean_tx_irq: Detected Tx Unit Hang - 82546EB

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.1
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	beta
Target Release:	---
Assignee:	Andy Gospodarek
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	KernelPrio5.3

Reported:	2008-03-11 13:15 UTC by Flavio Leitner
Modified:	2018-12-01 15:50 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-01-20 19:42:59 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
ifconfig output, ethtool -k,-i outputs and others (5.58 KB, text/plain) 2008-03-11 13:15 UTC, Flavio Leitner
lspci -vvv (24.79 KB, text/plain) 2009-08-06 19:01 UTC, Kapetanakis Giannis
e1000 dump code upstream proposal (for reference) (11.23 KB, patch) 2010-01-20 19:29 UTC, Jesse Brandeburg
/proc/interrupts (1.50 KB, application/octet-stream) 2010-02-07 15:13 UTC, Kapetanakis Giannis
dmesg (26.36 KB, application/octet-stream) 2010-02-07 15:13 UTC, Kapetanakis Giannis

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:0225	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update	2009-01-20 16:06:24 UTC

Description Flavio Leitner 2008-03-11 13:15:07 UTC

Description of problem:

Many data mismatch errors are logged in /var/log/messages on an RX800 S3 running
RHEL5.1 with the native e1000 driver 7.3.20-k2:

.. 09:02:55 RX800S3 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
.. 09:02:55 RX800S3 kernel: Tx Queue <0>
.. 09:02:55 RX800S3 kernel: TDH <23>
.. 09:02:55 RX800S3 kernel: TDT <48>
.. 09:02:55 RX800S3 kernel: next_to_use <48>
.. 09:02:55 RX800S3 kernel: next_to_clean <1f>
.. 09:02:55 RX800S3 kernel: buffer_info[next_to_clean]
.. 09:02:55 RX800S3 kernel: time_stamp <101170d9f>
.. 09:02:55 RX800S3 kernel: next_to_watch <25>
.. 09:02:55 RX800S3 kernel: jiffies <101171080>
.. 09:02:55 RX800S3 kernel: next_to_watch.status <0>
.. 09:02:57 RX800S3 kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
.. 09:02:57 RX800S3 kernel: Tx Queue <0>
.. 09:02:57 RX800S3 kernel: TDH <23>
.. 09:02:57 RX800S3 kernel: TDT <48>
.. 09:02:57 RX800S3 kernel: next_to_use <48>
.. 09:02:57 RX800S3 kernel: next_to_clean <1f>
.. 09:02:57 RX800S3 kernel: buffer_info[next_to_clean]
.. 09:02:57 RX800S3 kernel: time_stamp <101170d9f>
.. 09:02:57 RX800S3 kernel: next_to_watch <25>
.. 09:02:57 RX800S3 kernel: jiffies <101171274>
.. 09:02:57 RX800S3 kernel: next_to_watch.status <0>
.. 09:02:58 RX800S3 kernel: NETDEV WATCHDOG: eth0: transmit timed out
.. 09:03:02 RX800S3 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

Version-Release number of selected component (if applicable):
2.6.18-53.el5

How reproducible:
Always, with an RX800 S3, an Intel Pro1000 LAN adapter and RHEL5.1.

Steps to Reproduce:
Run a stress test over NFS.

Additional info:
The error happens only on RX800 S3 systems with RHEL5.1 (32bit, 64bit, with and
without XEN) and is reproducible with different Intel Pro1000 LAN-adapters in
different PCI slots.

The same test with the onboard LAN port (Broadcom, tg3) works fine.

Disabling TSO does indeed make the problem go away.

0a:01.0 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)
0a:01.1 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 01)

# dmesg | grep -i e1000
e1000: 0000:0a:01.0: e1000_probe: (PCI-X:133MHz:64-bit) 00:0e:0c:51:b1:78
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: 0000:0a:01.1: e1000_probe: (PCI-X:133MHz:64-bit) 00:0e:0c:51:b1:79
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

# ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

# ethtool -i eth0
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: N/A
bus-info: 0000:0a:01.0
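
Since disabling TSO is the workaround discussed in this report, it can help to confirm the offload state after each change. The following is only a minimal sketch, assuming the `ethtool -k` text layout shown above; the helper name parse_offloads is hypothetical and is not part of ethtool.

```python
def parse_offloads(text):
    """Parse `ethtool -k` text into {feature name: True (on) / False (off)}."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip the header and any "Cannot get ..." diagnostic lines.
        if sep and value in ("on", "off"):
            settings[name.strip()] = (value == "on")
    return settings


# Sample text copied from the ethtool -k output in this report.
sample = """\
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
"""

offloads = parse_offloads(sample)
print("TSO enabled:", offloads["tcp segmentation offload"])
```

In practice the text would come from running `ethtool -k eth0` on the affected host rather than from a pasted string.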

Comment 1 Flavio Leitner 2008-03-11 13:15:08 UTC

Created attachment 297607
ifconfig output, ethtool -k,-i  outputs and others

Comment 3 Flavio Leitner 2008-03-11 13:36:37 UTC

Could you check whether this issue still reproduces with the kernel available at
http://people.redhat.com/agospoda/#rhel5 ?

That kernel is updated and has some test patches, so it would be good to
check whether it still reproduces the issue.

thanks,
Flavio

Comment 5 Flavio Leitner 2008-03-19 14:24:04 UTC

The issue is still seen with the latest kernel from gospo.
Flavio

Comment 6 Andy Gospodarek 2008-03-19 18:28:22 UTC

There are probably still a few bits (watchdog timer stuff) that might be in the
rhel5 e1000 driver that are NOT upstream though it was promised they would get
there.  I don't think it's worth removing since it will cause another bug to
appear again, but we could consider removing those bits and retesting.

Is there ANY chance we can get this reproduced on a non-customer system?

Comment 7 Andy Gospodarek 2008-05-28 15:09:26 UTC

If this is only seen under load, this patch should apply just fine on RHEL5 and
can be used along with new module parameters to work around this issue:

http://people.redhat.com/agospoda/rhel4/0019-e1000-add-module-parameter-to-set-transmit-descript.patch

Please see the following entry for how to use this new module parameter to try
to work around issues with the 82545/6.

https://bugzilla.redhat.com/show_bug.cgi?id=334411#c47

Comment 11 Andy Gospodarek 2008-06-16 13:42:35 UTC

My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.

Comment 13 Andy Gospodarek 2008-06-20 12:27:01 UTC

Anders, 

Can you tell me what tuning parameters they used?  I'd like to know what they
used for TxDescPower and TxDescriptors.

Thanks!

Comment 25 Andy Gospodarek 2008-09-09 21:43:47 UTC

I don't think disabling TSO is a guaranteed way to prevent this problem, but if it works for the customer I would say they should continue to do that.

Comment 29 Don Zickus 2008-10-06 15:55:30 UTC

in kernel-2.6.18-118.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 31 Juan J. Cavallaro 2008-10-20 13:17:39 UTC

Hello, this is also happening on RHEL 4.7. Is there any available *official* fix?

Thanks

Comment 32 Andy Gospodarek 2008-10-20 15:03:21 UTC

There will be a fix for 4.8.  See bug 334411

Comment 34 errata-xmlrpc 2009-01-20 19:42:59 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Comment 38 Kapetanakis Giannis 2009-08-06 09:16:21 UTC

I just had this again
on 5.3, kernel 2.6.18-128.1.16.el5PAE.

TSO is disabled. I had this in the past with Fedora,
and disabling TSO:
ethtool -K eth0 tso off
solved the problem. No luck now. However, the system
didn't hang this time.

Aug  6 04:10:14 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug  6 04:10:14 localhost kernel:   Tx Queue             <0>
Aug  6 04:10:14 localhost kernel:   TDH                  <2d>
Aug  6 04:10:14 localhost kernel:   TDT                  <2d>
Aug  6 04:10:14 localhost kernel:   next_to_use          <2d>
Aug  6 04:10:14 localhost kernel:   next_to_clean        <d9>
Aug  6 04:10:14 localhost kernel: buffer_info[next_to_clean]
Aug  6 04:10:14 localhost kernel:   time_stamp           <7aa7f69>
Aug  6 04:10:14 localhost kernel:   next_to_watch        <d9>
Aug  6 04:10:14 localhost kernel:   jiffies              <7aa845f>
Aug  6 04:10:14 localhost kernel:   next_to_watch.status <1>
Aug  6 04:10:52 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug  6 04:10:52 localhost kernel:   Tx Queue             <0>
Aug  6 04:10:52 localhost kernel:   TDH                  <42>
Aug  6 04:10:52 localhost kernel:   TDT                  <42>
Aug  6 04:10:52 localhost kernel:   next_to_use          <42>
Aug  6 04:10:52 localhost kernel:   next_to_clean        <21>
Aug  6 04:10:52 localhost kernel: buffer_info[next_to_clean]
Aug  6 04:10:52 localhost kernel:   time_stamp           <7ab0523>
Aug  6 04:10:52 localhost kernel:   next_to_watch        <24>
Aug  6 04:10:52 localhost kernel:   jiffies              <7ab0964>
Aug  6 04:10:52 localhost kernel:   next_to_watch.status <1>
Aug  6 04:10:54 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Aug  6 04:10:54 localhost kernel:   Tx Queue             <0>
Aug  6 04:10:54 localhost kernel:   TDH                  <ca>
Aug  6 04:10:54 localhost kernel:   TDT                  <ca>
Aug  6 04:10:54 localhost kernel:   next_to_use          <ca>
Aug  6 04:10:54 localhost kernel:   next_to_clean        <a1>
Aug  6 04:10:54 localhost kernel: buffer_info[next_to_clean]
Aug  6 04:10:54 localhost kernel:   time_stamp           <7ab0cfa>
Aug  6 04:10:54 localhost kernel:   next_to_watch        <a4>
Aug  6 04:10:54 localhost kernel:   jiffies              <7ab11bf>
Aug  6 04:10:54 localhost kernel:   next_to_watch.status <1>

Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)
        Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter
        Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 50
        Memory at dd620000 (64-bit, non-prefetchable) [size=128K]
        I/O ports at 3000 [size=64]
        Capabilities: [dc] Power Management version 2
        Capabilities: [e4] PCI-X non-bridge device
        Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-


ethtool -i eth0
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: N/A
bus-info: 0000:04:02.0

ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

regards,

Giannis

Comment 39 Andy Gospodarek 2009-08-06 11:51:51 UTC

(In reply to comment #38)
> I just had this again
> on 5.3 kernel 2.6.18-128.1.16.el5PAE
> 
> TSO is disabled. I had this in the past with Fedora
> and disabling TSO:  
> ethtool -K eth0 tso off
> solved the problem. No luck now. However the system
> didn't hung this time.

Giannis, there are really only three ways to try to work around the known problem with this hardware.

1.  Disable TSO. (You have tried this.)

2.  Use the new module option described below (usually combined with an increase in the number of ring buffers so that you can keep the same amount of packet memory).

/* Transmit Descriptor Power
 *
 * Valid Range: 6-12
 * This value represents the size-order of each transmit descriptor.
 * The valid size for descriptors would be 2^6 (64) to 2^12 (4096) bytes
 * each.  As this value decreases one may want to consider increasing
 * the TxDescriptors value to maintain the same amount of frame memory.
 *
 * Default Value: 12
 */
E1000_PARAM(TxDescPower, "Binary exponential size (2^X) of each transmit descriptor");

3.  Effectively disable adaptive interrupt modulation, by setting the module option InterruptThrottleRate=8000 for all devices.

/* Interrupt Throttle Rate (interrupts/sec)
 *
 * Valid Range: 100-100000 (0=off, 1=dynamic, 3=dynamic conservative)
 */
E1000_PARAM(InterruptThrottleRate, "Interrupt Throttling Rate");
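
The trade-off described in the TxDescPower comment under option 2 (a smaller per-descriptor size paired with more descriptors to keep the same frame memory) can be worked through as plain arithmetic. This is an illustrative sketch only: the function names are hypothetical, and the starting count of 256 descriptors is an assumption that should be replaced with the TxDescriptors value actually in use.

```python
def descriptor_bytes(tx_desc_power):
    """Size of each transmit descriptor buffer: 2^TxDescPower bytes."""
    if not 6 <= tx_desc_power <= 12:
        raise ValueError("TxDescPower must be in the range 6-12")
    return 1 << tx_desc_power


def frame_memory(tx_descriptors, tx_desc_power):
    """Total transmit frame memory in bytes for a given configuration."""
    return tx_descriptors * descriptor_bytes(tx_desc_power)


def descriptors_for_same_memory(old_descriptors, old_power, new_power):
    """Descriptor count that keeps frame memory constant at new_power."""
    return frame_memory(old_descriptors, old_power) // descriptor_bytes(new_power)


# Assumed starting point: 256 descriptors at the default TxDescPower of 12.
print(frame_memory(256, 12))                     # 1048576 bytes
print(descriptors_for_same_memory(256, 12, 11))  # 512 descriptors
```

Each step down in TxDescPower halves the per-descriptor size, so the descriptor count has to double to hold the same amount of frame memory. The driver's own limit on TxDescriptors still applies and is not modeled here.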

Unfortunately many of our users have reported that the only method to truly stop seeing these errors is to use a different network adapter.

Comment 40 Jesse Brandeburg 2009-08-06 16:13:17 UTC

(In reply to comment #38)
> I just had this again
> on 5.3 kernel 2.6.18-128.1.16.el5PAE
> 
> TSO is disabled. I had this in the past with Fedora
> and disabling TSO:  
> ethtool -K eth0 tso off
> solved the problem. No luck now. However the system
> didn't hung this time.

In the past with Fedora *on this system*?

What kind of system is this?  Can you please attach lspci output?

> Aug  6 04:10:14 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx
> Unit Hang
> Aug  6 04:10:14 localhost kernel:   Tx Queue             <0>
> Aug  6 04:10:14 localhost kernel:   TDH                  <2d>
> Aug  6 04:10:14 localhost kernel:   TDT                  <2d>

Since TDH==TDT here, this is a "false hang": the hardware actually completed all available work, and something went wrong in the writeback process.  Or, if these messages were all that was in your log (i.e. there was no NETDEV WATCHDOG message), this is actually a false hang indicating that your system is for some reason taking an extremely long time to transmit some packets: they sit in the hardware tx ring for longer than two seconds and are eventually completed.
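
The TDH==TDT check and the time a buffer has been waiting can be read straight out of a dump like the one quoted above. The following is a rough sketch, not part of the driver: the helper names are hypothetical, and the conversion to seconds assumes the kernel runs with HZ=1000, which should be verified for the kernel in question.

```python
import re

# Matches "name <hex>" pairs such as "TDH <2d>" or "next_to_watch.status <1>".
FIELD = re.compile(r"(\w+(?:\.\w+)?)\s*<([0-9a-fA-F]+)>")


def parse_hang_dump(lines):
    """Collect the hexadecimal fields of one Tx Unit Hang dump."""
    fields = {}
    for line in lines:
        match = FIELD.search(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def summarize(fields, hz=1000):
    """Return (ring_drained, jiffies_waited, seconds_waited).

    hz=1000 is an assumption; jiffies wraparound is ignored.
    """
    waited = fields["jiffies"] - fields["time_stamp"]
    return fields["TDH"] == fields["TDT"], waited, waited / hz


# First dump from comment 38.
dump = [
    "Aug  6 04:10:14 localhost kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang",
    "Aug  6 04:10:14 localhost kernel:   Tx Queue             <0>",
    "Aug  6 04:10:14 localhost kernel:   TDH                  <2d>",
    "Aug  6 04:10:14 localhost kernel:   TDT                  <2d>",
    "Aug  6 04:10:14 localhost kernel:   next_to_use          <2d>",
    "Aug  6 04:10:14 localhost kernel:   next_to_clean        <d9>",
    "Aug  6 04:10:14 localhost kernel: buffer_info[next_to_clean]",
    "Aug  6 04:10:14 localhost kernel:   time_stamp           <7aa7f69>",
    "Aug  6 04:10:14 localhost kernel:   next_to_watch        <d9>",
    "Aug  6 04:10:14 localhost kernel:   jiffies              <7aa845f>",
    "Aug  6 04:10:14 localhost kernel:   next_to_watch.status <1>",
]

drained, waited, seconds = summarize(parse_hang_dump(dump))
print(f"TDH == TDT: {drained}, waited {waited} jiffies (~{seconds:.2f} s if HZ=1000)")
```

A dump where TDH and TDT differ, as in the original description of this bug, would report the ring as not drained.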

Are you running at 10Mb or 100Mb?  Do you have flow control enabled?  Can you please send the output of ethtool -S eth0 after one of these messages in the log?

I mostly agree with what Andy said in his post, but in this case I'm not sure his statements would apply.

Comment 41 Kapetanakis Giannis 2009-08-06 19:01:57 UTC

Created attachment 356571
lspci -vvv

Comment 42 Kapetanakis Giannis 2009-08-06 19:14:02 UTC

I've attached the lspci output.
The system is a dual Xeon(TM) CPU 3.20GHz with 4G RAM,
running as an ftp/http mirror with htb enabled.

eth0 is connected to a 1 Gigabit port.
The system used to run Fedora. At that time I ran into this
problem as well (http://bugzilla.kernel.org/show_bug.cgi?id=9808)
and I solved it by disabling TSO. The system had random
cold hangs.

I had this again (cold hangs) when I moved to CentOS,
and disabling TSO solved it again.

This is the first time I've run into this with TSO disabled.
This time the system didn't hang. However, it was slow...

Other system options:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216 
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 2500
net.ipv4.ip_conntrack_max=131072

This time it happened right after my data transfer from a backup server.
I was rsyncing the data back to the server for more than one day
at 700-800Mbps with no problem. 10 minutes after the transfer
finished and I enabled the services, I had the Tx Unit Hang.

Bear in mind that I've updated to 2.6.18-128.4.1.el5PAE today.
If you need any more info, I would be glad to help.

Giannis

Comment 43 Andy Gospodarek 2009-08-12 17:53:16 UTC

Can you paste the output of:

# ethtool -a eth0

# ethtool eth0

from any time the system is in use. 

And also

# ethtool -S eth0

after a failure like you have seen in comment #38?

Comment 44 Kapetanakis Giannis 2009-08-13 08:10:05 UTC

ethtool -a eth0
Pause parameters for eth0:
Autonegotiate:  on
RX:             on
TX:             off

ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
        Link detected: yes

If I have a failure, I will post ethtool -S eth0.

Giannis

Comment 45 Kapetanakis Giannis 2010-01-13 10:33:36 UTC

Hi,

I had another Tx Unit Hang this morning on this same machine (5.4).
2.6.18-164.10.1.el5PAE

TSO is disabled. I will post all the details:

Jan 13 04:02:20 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 13 04:02:21 host kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jan 13 04:02:21 host kernel:   Tx Queue             <0>
Jan 13 04:02:21 host kernel:   TDH                  <e7>
Jan 13 04:02:21 host kernel:   TDT                  <e7>
Jan 13 04:02:21 host kernel:   next_to_use          <e7>
Jan 13 04:02:21 host kernel:   next_to_clean        <bd>
Jan 13 04:02:21 host kernel: buffer_info[next_to_clean]
Jan 13 04:02:21 host kernel:   time_stamp           <88b1b22>
Jan 13 04:02:21 host kernel:   next_to_watch        <bf>
Jan 13 04:02:21 host kernel:   jiffies              <88b3b60>
Jan 13 04:02:21 host kernel:   next_to_watch.status <1>
Jan 13 04:02:24 host kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

# ethtool -i eth0
driver: e1000
version: 7.3.20-k2-NAPI
firmware-version: N/A
bus-info: 0000:04:02.0

# ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off

# ethtool eth0
Settings for eth0:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
        Link detected: yes

# ethtool -a eth0
Pause parameters for eth0:
Autonegotiate:  on
RX:             on
TX:             off


# ethtool -S eth0
NIC statistics:
     rx_packets: 939575758
     tx_packets: 1521712731
     rx_bytes: 263510394579
     tx_bytes: 2171648060330
     rx_broadcast: 293
     tx_broadcast: 292
     rx_multicast: 0
     tx_multicast: 13754
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 0
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 596322
     rx_missed_errors: 302320
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 1
     tx_restart_queue: 1933172
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 747
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 263510394579
     rx_csum_offload_good: 939560507
     rx_csum_offload_errors: 6022
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0


04:02.0 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)
        Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (63750ns min), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 233
        Region 0: Memory at dd620000 (64-bit, non-prefetchable) [size=128K]
        Region 4: I/O ports at 3000 [size=64]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [e4] PCI-X non-bridge device
                Command: DPERE- ERO+ RBC=512 OST=1
                Status: Dev=04:02.0 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=16 RSCEM- 266MHz- 533MHz-
        Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000

best regards,

Giannis

Comment 46 Andy Gospodarek 2010-01-20 16:22:03 UTC

Giannis, thank you for posting that detailed information in comment #44 and comment #45.

I see that flow control is enabled, but the statistics do not indicate that any XOFF frames were received.

Jesse, that seems to rule out flow-control as the problem, right?
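
The check described here (flow control negotiated, but no pause frames counted) can be repeated against any `ethtool -S` capture. This is a small sketch under the assumption that the counters keep the names shown in comment #45; the helper names are hypothetical.

```python
FLOW_CONTROL_COUNTERS = (
    "rx_flow_control_xon",
    "rx_flow_control_xoff",
    "tx_flow_control_xon",
    "tx_flow_control_xoff",
)


def parse_nic_stats(text):
    """Parse `ethtool -S` text into {counter name: integer value}."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # The "NIC statistics:" header has no numeric value and is skipped.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def pause_frames_seen(stats):
    """True if any XON/XOFF counter is non-zero."""
    return any(stats.get(name, 0) for name in FLOW_CONTROL_COUNTERS)


# Excerpt of the statistics posted in comment #45.
sample = """\
NIC statistics:
     rx_packets: 939575758
     tx_packets: 1521712731
     tx_timeout_count: 1
     tx_restart_queue: 1933172
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
"""

stats = parse_nic_stats(sample)
print("pause frames seen:", pause_frames_seen(stats))
```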

Comment 47 Jesse Brandeburg 2010-01-20 17:41:13 UTC

Yes, flow control is confirmed not to be the issue. This seems to be a slightly different report than before, because we are now seeing NETDEV WATCHDOG, which means that transmits were not completed for a long time.

The hangs in comment 38 and 45 show that TDH==TDT, which means that the hardware has finished processing tx packets. At this stage the only way we can figure out what is going on is to get a full descriptor ring dump from the e1000_dump function (or get a PCI-X bus trace). Typically the driver is not getting the DD bit back from the descriptor writeback in these cases, usually due to some weird race, but these issues usually aren't reported on Intel systems.
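
For reference, the next_to_watch.status value printed in the dumps can be tested for the DD ("descriptor done") bit. The sketch below assumes that DD is bit 0 of the status byte, which matches the E1000_TXD_STAT_DD definition in the upstream e1000 headers as far as I know; treat the constant as an assumption and check it against the driver source in use.

```python
# Assumed value of the "descriptor done" status bit (bit 0).
E1000_TXD_STAT_DD = 0x01


def descriptor_done(status):
    """True if the DD bit is set in a next_to_watch.status value."""
    return bool(status & E1000_TXD_STAT_DD)


# Status values as printed in this bug: <0> in the original description,
# <1> in the dumps from comment 38 and comment 45.
for label, status in (("description", 0x0), ("comment 38", 0x1)):
    print(label, "DD set:", descriptor_done(status))
```

Note that the status is read when the message is printed, so a set DD bit in the dump does not by itself show that the driver saw it in time.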

I have a prototype patch I made for upstream that I will attach. Andy, I'm not sure if you could build a kernel or driver for Kapetanakis. The patch builds against net-next, but I've not done much testing on it.

He will need to either load the module with the debug_dump=2 module option or modify the sysfs parameter of the same name at runtime.

This will dump a ton to dmesg/syslog, so decreasing the tx/rx descriptors can sometimes help reduce the amount of data dumped (maybe 80/80 or 128/128 using ethtool -G).

By the way, I have a system of that class/vintage in my office, but I don't have 4GB of RAM.

The slot that you're in, in that machine, is typically a shared PCI-X slot. Is there a chance you could rearrange the adapters so the Adaptec and 82546 switch slots? It might make a difference, but I realize this might not be an easy experiment on a production machine.

Comment 48 Kapetanakis Giannis 2010-01-20 18:03:03 UTC

Andy, I don't know if you're referring to me, but this is a production server
(http/ftp official mirror for tons of sites, including fedora, centos etc).

Thus I cannot play with custom kernels...

In addition, the network interface is on board. Also, I haven't had another issue in the last 10 days, so we can't be sure whether something makes a difference or not...

best rgds

Giannis

Comment 49 Kapetanakis Giannis 2010-01-20 18:13:18 UTC

I was wrong.

I had one more yesterday:

Jan 19 04:02:21 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 19 04:02:24 host kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Jan 19 04:05:26 host kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jan 19 04:05:26 host kernel:   Tx Queue             <0>
Jan 19 04:05:26 host kernel:   TDH                  <3>
Jan 19 04:05:26 host kernel:   TDT                  <3>
Jan 19 04:05:26 host kernel:   next_to_use          <3>
Jan 19 04:05:26 host kernel:   next_to_clean        <bb>
Jan 19 04:05:26 host kernel: buffer_info[next_to_clean]
Jan 19 04:05:26 host kernel:   time_stamp           <27746cb3>
Jan 19 04:05:26 host kernel:   next_to_watch        <bd>
Jan 19 04:05:26 host kernel:   jiffies              <27747376>
Jan 19 04:05:26 host kernel:   next_to_watch.status <1>

These "NETDEV WATCHDOG: eth0: transmit timed out" messages
are quite frequent: at least one every day.
But a Tx Unit Hang does not always follow.

Jan 20 04:02:20 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 20 04:02:24 host kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

Comment 50 Jesse Brandeburg 2010-01-20 19:29:42 UTC

Created attachment 385761
e1000 dump code upstream proposal (for reference)

untested upstream patch.

Comment 51 Flavio Leitner 2010-01-20 21:07:56 UTC

Giannis, Jesse,

Hi, I think it could be related to this:
https://bugzilla.redhat.com/show_bug.cgi?id=499355#c8
Maybe it would be good to give it a try.

Flavio

Comment 52 Kapetanakis Giannis 2010-01-21 10:29:51 UTC

Also, it might be related to the htb code. I'm running QoS on this machine:
https://bugzilla.redhat.com/show_bug.cgi?id=481546

Comment 53 Jesse Brandeburg 2010-02-02 17:32:52 UTC

Kapetanakis, is there a chance you can try running with the patch from comment 50?

If you need me to build you an e1000 driver from sourceforge with the patch applied I can do that.  Should this bug be reopened?

Comment 54 Kapetanakis Giannis 2010-02-05 11:01:28 UTC

I could do that, but how would you know if it fixes my problem?
The system has been up for 14 days with no problem, except for the
NETDEV WATCHDOG: eth0: transmit timed out
message.

Anyway, do you want me to try e1000-8.0.16.tar.gz with your patch on it?

Comment 55 Jesse Brandeburg 2010-02-05 17:38:08 UTC

I didn't expect the sourceforge driver to fix your problem; I (perhaps wrongly) assumed that you would need a "stand alone" driver to replace your Red Hat e1000.

I can actually build you a driver source *for* your kernel, from the redhat sources with my patch applied, if I know exactly what kernel you're running.

Also, it would help answer some questions for me if you could attach your /proc/interrupts and your dmesg.

One of the things about this 7320 system is that the kernel won't allow interrupt affinity due to some system bug, so irqs are moving every interrupt to the next processor.  I have a 7320 system in my office.  This is probably not related but worth mentioning because it can encourage some racy code behavior.

There are also some test kernels for a backport bug in https://bugzilla.redhat.com/show_bug.cgi?id=499355, but that bug is against RHEL4.

Comment 56 Kapetanakis Giannis 2010-02-07 15:12:16 UTC

You're right, we should apply the patch to the running driver
and not to sourceforge's driver.

Yes, you can send me the patch to apply to my kernel sources.
It would be best if it only requires recompiling the e1000 module and not the whole monster.

I will attach dmesg and /proc/interrupts

Giannis

Comment 57 Kapetanakis Giannis 2010-02-07 15:13:02 UTC

Created attachment 389387
/proc/interrupts

Comment 58 Kapetanakis Giannis 2010-02-07 15:13:28 UTC

Created attachment 389388
dmesg

Comment 59 Kapetanakis Giannis 2010-02-07 15:17:55 UTC

Sorry, I forgot to add the kernel version:
PAE 2.6.18-164.11.1.el5PAE

It's in the dmesg, but never mind :)
