Bug 398921 - e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang continues on Fedora 8
Status: CLOSED WONTFIX
Product: Fedora
Classification: Fedora
Component: kernel
Version: 8
Hardware: All Linux
Priority: low
Severity: medium
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Duplicates: 249185
Reported: 2007-11-25 20:23 EST by Greg Morgan
Modified: 2010-10-25 07:58 EDT
Doc Type: Bug Fix
Clones: 504873 540413
Last Closed: 2009-01-09 00:20:04 EST

Description Greg Morgan 2007-11-25 20:23:17 EST
Description of problem:
Please see f5 bug 200656, ES bug 248787, f6 bug 219496, and f7 bug 249185.  My
tx hang issues on the same hardware in bug 200656 continue on the F8 kernel.  I
am copying a bunch of wav files from my NFS server to a 400gig USB drive when
the error occurs.  However, simple web surfing can also cause problems.
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang

Version-Release number of selected component (if applicable):
Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
2.6.23.1-42.fc8

How reproducible:
e1000 driver on selected hardware.  The 7.3.15-k2-NAPI Intel kernel driver does
not have this problem.  I have not tried Xen on f8 yet.  The Xen kernel ran the
7.3.15 driver and was a workaround for the problem.

History of the Problem on the ECS AMD Sempron Motherboard.

Intel Driver    Kernel                                TX Hang Issues
7.3.15-k2-NAPI  Fedora 5 kernel ?                     Rock Solid
7.3.??-k2-NAPI  Fedora 6 kernel ?                     Not installed.
7.3.15-k2-NAPI  /boot/vmlinuz-2.6.20-2925.11.fc7xen   Rock Solid
7.3.15-k2-NAPI  /boot/vmlinuz-2.6.20-2925.9.fc7xen    Rock Solid
7.3.20-k2-NAPI  /boot/vmlinuz-2.6.21-1.3228.fc7       TX Issues Encountered
7.3.20-k2-NAPI  /boot/vmlinuz-2.6.22.1-27.fc7         TX Issues Encountered
7.3.20-k2-NAPI  /boot/vmlinuz-2.6.23.1-42.fc8         TX Issues Encountered



I tried the suggestion that Chuck Ebbert of Red Hat posted in bug 249185
comment #3:

"One workaround to try is turning off TSO:

    # ethtool -K eth0 tso off"

The problem still exists with TSO turned off.
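
Whether the setting actually took effect can be confirmed by reading the offload settings back with `ethtool -k` (lower-case k). The helper below is only a sketch: the name `tso_state` is made up for illustration, and it assumes the TSO line is labelled either "tcp segmentation offload" or "tcp-segmentation-offload", which may vary between ethtool versions.

```shell
# Print the TSO state ("on" or "off") found in `ethtool -k` output on stdin.
# tso_state is a hypothetical helper name, not part of ethtool.
tso_state() {
    awk -F': *' '
        # Assumption: the label is spelled with spaces or with hyphens,
        # depending on the ethtool version.
        tolower($1) ~ /^[ \t]*tcp[- ]segmentation[- ]offload$/ {
            print $2
            found = 1
            exit
        }
        END { if (!found) exit 1 }   # non-zero status when no TSO line was seen
    '
}
```

It would be used as `ethtool -k eth0 | tso_state` after running the `ethtool -K eth0 tso off` command above.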

Also in the f5 post the Intel adapter was evaluated for firmware fix problems. 
The Intel adapter did not have these problems either.


I think I picked up a no-name gigabit card that I can try as a workaround, or if
my wife is off her computer, I can try installing the Xen kernel to see if the
7.3.15-k2-NAPI driver is still available.

Please advise.
Comment 1 Greg Morgan 2007-11-25 20:30:10 EST
Oh, here is the full message from the log file:

Nov 25 18:19:03 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Nov 25 18:19:03 mowgli kernel:   Tx Queue             <0>
Nov 25 18:19:03 mowgli kernel:   TDH                  <5a>
Nov 25 18:19:03 mowgli kernel:   TDT                  <5a>
Nov 25 18:19:03 mowgli kernel:   next_to_use          <5a>
Nov 25 18:19:03 mowgli kernel:   next_to_clean        <6e>
Nov 25 18:19:03 mowgli kernel: buffer_info[next_to_clean]
Nov 25 18:19:03 mowgli kernel:   time_stamp           <37cf825>
Nov 25 18:19:03 mowgli kernel:   next_to_watch        <6e>
Nov 25 18:19:03 mowgli kernel:   jiffies              <37d1100>
Nov 25 18:19:03 mowgli kernel:   next_to_watch.status <0>
Nov 25 18:19:05 mowgli kernel: NETDEV WATCHDOG: eth0: transmit timed out
messages.

Bug 249185 was updated with the same information.
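
For readers of the dump above: the gap between `jiffies` and `time_stamp` shows how long the oldest descriptor had been pending when the hang was reported. The following is a minimal sketch of that arithmetic. The function name `hang_age_jiffies` is made up for illustration, and the tick rate is an assumption (CONFIG_HZ=1000), since the report does not state it.

```shell
# Elapsed jiffies between a Tx descriptor's time_stamp and the moment the
# hang was reported. Both arguments are the hex values from the log, without
# the angle brackets. hang_age_jiffies is a hypothetical helper name.
hang_age_jiffies() {
    local time_stamp=$1 jiffies=$2
    # Mask to 32 bits so the result stays correct if a 32-bit jiffies
    # counter wrapped between the two samples.
    echo $(( (0x$jiffies - 0x$time_stamp) & 0xFFFFFFFF ))
}

# Values from the dump above: time_stamp <37cf825>, jiffies <37d1100>.
age=$(hang_age_jiffies 37cf825 37d1100)
hz=1000   # assumption: kernel built with CONFIG_HZ=1000
echo "descriptor pending for $age jiffies (~$(( age * 1000 / hz )) ms at HZ=$hz)"
```

If the kernel was built with a different tick rate, only the `hz` value needs to change.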
Comment 2 Chuck Ebbert 2007-11-26 12:04:24 EST
Can you try adding this line to /etc/modprobe.conf and then rebooting?

options e1000 InterruptThrottleRate=0

Some other workarounds are also at:
http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449
Comment 3 Chuck Ebbert 2007-11-27 18:43:12 EST
Also related to bug 400561
Comment 4 (GalaxyMaster) 2007-11-28 11:21:28 EST
I'd like to point out that at least the Intel ESB2/Gilgal (82563EB) NIC (used,
for instance, on Supermicro motherboards like this one:
http://www.supermicro.com/products/motherboard/Xeon1333/5000V/X7DVL-E.cfm)
requires driver version 7.6.5-NAPI or later.  Although driver versions before
7.6.5-NAPI announce support for 0x8086:0x1096, a system with such a NIC
becomes unreachable over the network 5-10 minutes after boot.

I have also reported the same bug on OpenVZ bugzilla:
http://bugzilla.openvz.org/show_bug.cgi?id=530#c6
Comment 5 Greg Morgan 2007-12-13 00:38:59 EST
I updated bug 248787 Comment #12.  Essentially, the network service died when
trying to copy 346gig to a usb 2.0 drive on the client.  I've never seen that
before with this problem.

I wonder if the reason "# ethtool -K eth0 tso off" did not work for me is
because I have an early generation chip.
lspci
00:09.0 Ethernet controller: Intel Corporation 82544GC Gigabit Ethernet
Controller (Copper) (rev 02)

As per bug 200656 comment #3
1.) I still have a reduced number of interrupts configured in the bios.
2.) This modprobe line has worked in the past:
options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0
I will be happy to try Chuck Ebbert's modprobe.conf setting as per Comment #2
above. Note that I am the one posting the modprobe.conf configuration in the
http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449
link as dr_kludge. ;-)
3.) Swapping cards proved that there were no hardware problems with the Intel
card.  The card that I have swapped with before is a
00:0b.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 02)

As per bug 200656 comment #10 the test driver 7.3.15_tdhdump-NAPI solved the problem.

My statement in bug 200656 Comment #12 was incorrect. I spoke too soon and
closed the bug report.

Some additional information.

ethtool -e eth0
Offset          Values
------          ------
0x0000          00 02 b3 96 09 9b 20 02 ff ff ff ff ff ff ff ff 
0x0010          29 a6 07 47 0b 66 12 11 86 80 0c 10 86 80 04 f2 
0x0020          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0030          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0040          ff db 11 00 11 37 ff ff ff ff ff ff ff ff ff ff 
0x0050          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 
0x0060          fc 00 00 40 0f 10 ff ff ff ff ff ff ff ff ff ff 
0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 76 b9 

lspci -vv
00:09.0 Ethernet controller: Intel Corporation 82544GC Gigabit Ethernet
Controller (Copper) (rev 02)
        Subsystem: Intel Corporation PRO/1000 T Desktop Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (63750ns min), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at eb020000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at eb000000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at c000 [size=32]
        [virtual] Expansion ROM at 58000000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [e4] PCI-X non-bridge device
                Command: DPERE- ERO+ RBC=512 OST=1
                Status: Dev=00:00.0 64bit- 133MHz- SCD- USC- DC=simple
DMMRBC=2048 DMOST=1 DMCRS=8 RSCEM- 266MHz- 533MHz-
        Capabilities: [f0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0
Enable-
                Address: 0000000000000000  Data: 0000

Comment 6 Greg Morgan 2007-12-13 00:49:07 EST
Just cus I posted a bunch of junk in comment #5 above, I'll follow these steps
and report back as time permits.

1.) Try just the /etc/modprobe.conf
alias eth0 e1000
options e1000 InterruptThrottleRate=0
and reboot.

2.) Try the longer
options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0

3.) Try the f8 Xen kernel.  I have not installed it yet.  However, in the prior
Fedora series, the Xen kernel used a driver that fixed the problem while the
non-Xen kernel produced the tx hang messages and performance issues.

4.) Try the f8 kernel-2.6.23.9-85.fc8 test kernel as noted in bug 400561 comment
#23 via
su -c 'yum --enablerepo=updates-testing update kernel'
Comment 7 Greg Morgan 2007-12-13 01:53:09 EST
As per comment #6 I implemented 
1.) Try just the /etc/modprobe.conf
alias eth0 e1000
options e1000 InterruptThrottleRate=0
and reboot.

I performed a yum update (there were three packages to install) and started the
same copy as reported in the initial bug report.  The only change was to copy
directly to an IDE hard drive versus a USB 2.0 hard drive.  The results are:
Dec 12 23:43:33 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Dec 12 23:43:33 mowgli kernel:   Tx Queue             <0>
Dec 12 23:43:33 mowgli kernel:   TDH                  <f>
Dec 12 23:43:33 mowgli kernel:   TDT                  <f>
Dec 12 23:43:33 mowgli kernel:   next_to_use          <f>
Dec 12 23:43:33 mowgli kernel:   next_to_clean        <22>
Dec 12 23:43:33 mowgli kernel: buffer_info[next_to_clean]
Dec 12 23:43:33 mowgli kernel:   time_stamp           <62ee7>
Dec 12 23:43:33 mowgli kernel:   next_to_watch        <22>
Dec 12 23:43:33 mowgli kernel:   jiffies              <635d8>
Dec 12 23:43:33 mowgli kernel:   next_to_watch.status <0>
Comment 8 Greg Morgan 2007-12-13 02:10:22 EST
As per comment #6 I implemented 
2.) Try the longer
options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0

I started the same copy as reported in the initial bug report.  The only change
was to copy directly to an IDE hard drive versus a USB 2.0 hard drive.  The
results are:
Dec 12 23:57:49 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Dec 12 23:57:49 mowgli kernel:   Tx Queue             <0>
Dec 12 23:57:49 mowgli kernel:   TDH                  <a25>
Dec 12 23:57:49 mowgli kernel:   TDT                  <a25>
Dec 12 23:57:49 mowgli kernel:   next_to_use          <a25>
Dec 12 23:57:49 mowgli kernel:   next_to_clean        <a39>
Dec 12 23:57:49 mowgli kernel: buffer_info[next_to_clean]
Dec 12 23:57:49 mowgli kernel:   time_stamp           <ffffad8f>
Dec 12 23:57:49 mowgli kernel:   next_to_watch        <a39>
Dec 12 23:57:49 mowgli kernel:   jiffies              <ffffb888>
Dec 12 23:57:49 mowgli kernel:   next_to_watch.status <0>

Just for grins here's the log file showing all the modprobe options being
implemented correctly.
Dec 12 23:54:22 mowgli kernel: Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
Dec 12 23:54:22 mowgli kernel: Copyright (c) 1999-2006 Intel Corporation.
Dec 12 23:54:22 mowgli kernel: ACPI: PCI Interrupt 0000:00:09.0[A] -> GSI 17 (level, low) -> IRQ 19
Dec 12 23:54:22 mowgli kernel: e1000: 0000:00:09.0: e1000_validate_option: Transmit Descriptors set to 4096
Dec 12 23:54:22 mowgli kernel: e1000: 0000:00:09.0: e1000_validate_option: Receive Descriptors set to 4096
Dec 12 23:54:22 mowgli kernel: e1000: 0000:00:09.0: e1000_validate_option: Checksum Offload Disabled
Dec 12 23:54:22 mowgli kernel: e1000: 0000:00:09.0: e1000_validate_option: Flow Control Enabled
Dec 12 23:54:22 mowgli kernel: e1000: 0000:00:09.0: e1000_validate_option: Transmit Interrupt Delay set to 0
Dec 12 23:54:22 mowgli kernel: e1000: 0000:00:09.0: e1000_validate_option: Receive Interrupt Delay set to 0
Dec 12 23:54:22 mowgli kernel: e1000: 0000:00:09.0: e1000_check_options: Interrupt Throttling Rate (ints/sec) turned off
Dec 12 23:54:22 mowgli kernel: e1000: 0000:00:09.0: e1000_check_copper_options: Using Autonegotiation at 1000 Mbps Full Duplex only
Dec 12 23:54:22 mowgli kernel: e1000: 0000:00:09.0: e1000_probe: (PCI:33MHz:32-bit) 00:02:b3:96:09:9b

Installing Xen kernels next...
Comment 9 Greg Morgan 2007-12-13 02:46:23 EST
This is a new development.

As per comment #6 I implemented 
3.) Try the f8 Xen kernel.  I have not installed it yet.  However, in the prior
Fedora series, the Xen kernel used a driver that fixed the problem while the
non-Xen kernel produced the tx hang messages and performance issues.

by using

yum install xen-libs.i386 kernel-xen.i686   kernel-xen-2.6-doc.noarch
kernel-xen-devel.i686   xen.i386 xen-devel.i386

Also note that the 
options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0
were in effect.

Now this is very interesting.  The Xen kernel has been updated to the same Intel
driver that the stock kernel is using and produces the tx unit hangs.
Dec 13 00:11:30 mowgli kernel: Intel(R) PRO/1000 Network Driver - version
7.3.20-k2-NAPI
****but****
where du -sh reports that I only copied __58M__ and received a tx unit hang
message, the Xen kernel has let me copy __5.2G__ without a single tx unit hang
message.  How does the Xen eth0, peth0, virbr0 combination of drivers prevent
the tx hang messages?  Because now it looks like the driver is a problem in a
stock kernel but Xen shields the problem away from the system in a Xen kernel!?

Let me try another test on this Xen kernel without the modprobe settings.

I am now up to 8.9G during a runtime of 26 minutes before the reboot.


Comment 10 Greg Morgan 2007-12-13 03:00:53 EST
This is a test of comment #6 and comment #9 without the modprobe settings as
shown in log file

Dec 13 00:45:01 mowgli kernel: Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
Dec 13 00:45:01 mowgli kernel: Copyright (c) 1999-2006 Intel Corporation.
Dec 13 00:45:01 mowgli kernel: ACPI: PCI Interrupt 0000:00:09.0[A] -> GSI 17 (level, low) -> IRQ 19
Dec 13 00:45:01 mowgli kernel: e1000: 0000:00:09.0: e1000_probe: (PCI:33MHz:32-bit) 00:02:b3:96:09:9b
Dec 13 00:45:01 mowgli kernel: e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection

I am already into 9 minutes of uptime and 2.2G copied without a tx hang message.

I'll try the "yum --enablerepo=updates-testing update kernel" in the next couple
of days.  For now I'll set grub to use the Xen kernel because it is working with
or without the modprobe settings.


Comment 11 Greg Morgan 2007-12-23 02:46:17 EST
The 2.6.23.9-85.fc8 kernel had been pushed to stable by the time I installed it.

Linux version 2.6.23.9-85.fc8
...
Dec 22 23:28:50 mowgli kernel: Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
...
Dec 22 23:35:23 mowgli kernel: NETDEV WATCHDOG: eth0: transmit timed out
Dec 22 23:35:23 mowgli kernel: NETDEV WATCHDOG: eth0: transmit timed out
Dec 22 23:35:23 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Dec 22 23:35:23 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Dec 22 23:35:23 mowgli kernel:   Tx Queue             <0>
Dec 22 23:35:23 mowgli kernel:   Tx Queue             <0>
Dec 22 23:35:23 mowgli kernel:   TDH                  <b>
Dec 22 23:35:23 mowgli kernel:   TDH                  <b>
Dec 22 23:35:23 mowgli kernel:   TDT                  <b>
Dec 22 23:35:23 mowgli kernel:   TDT                  <b>
Dec 22 23:35:23 mowgli kernel:   next_to_use          <b>
Dec 22 23:35:23 mowgli kernel:   next_to_use          <b>
Dec 22 23:35:23 mowgli kernel:   next_to_clean        <1f>
Dec 22 23:35:23 mowgli kernel:   next_to_clean        <1f>
Dec 22 23:35:23 mowgli kernel: buffer_info[next_to_clean]
Dec 22 23:35:23 mowgli kernel: buffer_info[next_to_clean]
Dec 22 23:35:23 mowgli kernel:   time_stamp           <27c56>
Dec 22 23:35:23 mowgli kernel:   time_stamp           <27c56>
Dec 22 23:35:23 mowgli kernel:   next_to_watch        <1f>
Dec 22 23:35:23 mowgli kernel:   next_to_watch        <1f>
Dec 22 23:35:23 mowgli kernel:   jiffies              <29810>
Dec 22 23:35:23 mowgli kernel:   jiffies              <29810>
Dec 22 23:35:23 mowgli kernel:   next_to_watch.status <0>
Dec 22 23:35:23 mowgli kernel:   next_to_watch.status <0>
Dec 22 23:35:26 mowgli kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Dec 22 23:35:26 mowgli kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

I will try the Xen Kernel again....

Linux version 2.6.21-2952.fc8xen 
...
Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
...

and there were no problems.  I also posted information in bug 400561 comment #26
and bug 400561 comment #27.
Comment 12 Christopher Brown 2008-01-09 12:11:31 EST
*** Bug 249185 has been marked as a duplicate of this bug. ***
Comment 13 Christopher Brown 2008-02-13 19:38:49 EST
Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel?
Comment 14 Greg Morgan 2008-03-11 06:34:25 EDT
In response to Christopher's question in comment #13, comment #9 was the
illuminating development.  With both the Xen kernel and the normal kernel using
the same version of the Intel e1000 driver, the Xen kernel "somehow buffers" the
e1000 and prevents the Tx Unit Hang.  Moreover, if I perform updates with yum
and forget to check whether the kernel (and thus the grub menu) changed, then a
non-Xen kernel will generate the Tx Unit Hang message.  Kernel 2.6.24.3-12.fc8
generated the messages below.  My wife said a web page was freezing on her when
I realized that a non-Xen kernel was being used with this hardware.

Mar 10 07:03:07 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Mar 10 07:03:07 mowgli kernel:   Tx Queue             <0>
Mar 10 07:03:07 mowgli kernel:   Tx Queue             <0>
Mar 10 07:03:07 mowgli kernel:   TDH                  <b0>
Mar 10 07:03:07 mowgli kernel:   TDH                  <b0>
Mar 10 07:03:07 mowgli kernel:   TDT                  <b0>
Mar 10 07:03:07 mowgli kernel:   TDT                  <b0>
Mar 10 07:03:07 mowgli kernel:   next_to_use          <b0>
Mar 10 07:03:07 mowgli kernel:   next_to_use          <b0>
Mar 10 07:03:07 mowgli kernel:   next_to_clean        <6f>
Mar 10 07:03:07 mowgli kernel:   next_to_clean        <6f>
Mar 10 07:03:07 mowgli kernel: buffer_info[next_to_clean]
Mar 10 07:03:07 mowgli kernel: buffer_info[next_to_clean]
Mar 10 07:03:07 mowgli kernel:   time_stamp           <5e90259f>
Mar 10 07:03:07 mowgli kernel:   time_stamp           <5e90259f>
Mar 10 07:03:07 mowgli kernel:   next_to_watch        <6f>
Mar 10 07:03:07 mowgli kernel:   next_to_watch        <6f>
Mar 10 07:03:07 mowgli kernel:   jiffies              <5e904468>
Mar 10 07:03:07 mowgli kernel:   jiffies              <5e904468>
Mar 10 07:03:07 mowgli kernel:   next_to_watch.status <0>
Mar 10 07:03:07 mowgli kernel:   next_to_watch.status <0>
Comment 15 Greg Morgan 2008-03-12 01:55:10 EDT
In response to Christopher's question in comment #13, here's some additional
information on my experience with the Tx Unit Hang issue.  I've posted several
reports, but two that may be of interest are here
http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449
and here
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=200656#c11

In the SF tracker a POC driver was posted as "e1000-7.3.15tdh.tar.gz".  The
"driver may help if TDH==TDT in your tx hang", and from comment #14 above we have
Mar 10 07:03:07 mowgli kernel:   TDH                  <b0>
Mar 10 07:03:07 mowgli kernel:   TDH                  <b0>
Mar 10 07:03:07 mowgli kernel:   TDT                  <b0>
Mar 10 07:03:07 mowgli kernel:   TDT                  <b0>

The driver code is located here 
http://sourceforge.net/tracker/download.php?group_id=42302&atid=447449&file_id=198849&aid=1463045

go_jessie went on to say,
"It is not our final version of the fix, and probably will
only help people that have the signature in their traces of 
TDH                  <cb>
TDT                  <cb>
where TDH equals TDT.

"if your TDH does not equal TDT then it is likely you are
having a hardware problem for some reason or another.

"the TDHclean driver may well have some problems, as it has not been tested
as thoroughly as our production drivers.  It is more of a proof of concept.
 Unfortunately I haven't had time yet to figure out a way to integrate it
into our production code.  Your info is very useful however, as it does
point to some problem in the TDH based clean up code.

"We still don't have any systems here to reproduce this error (i.e. it is
fairly rare, and system dependent)"
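
The TDH == TDT signature described in the quote above can be checked mechanically against a pasted register dump. This is only a sketch: `tx_hang_signature` is a made-up helper name, and it assumes the syslog layout seen in this report, where the register name and its bracketed value are the last two fields on the line.

```shell
# Read a "Detected Tx Unit Hang" register dump on stdin and report whether
# the first TDH and TDT values match. tx_hang_signature is a hypothetical
# helper name; the field layout is assumed from the logs in this report.
tx_hang_signature() {
    awk '
        # The register name is the second-to-last field, the value the last.
        NF >= 2 && $(NF-1) == "TDH" && tdh == "" { tdh = $NF }
        NF >= 2 && $(NF-1) == "TDT" && tdt == "" { tdt = $NF }
        END {
            if (tdh == "" || tdt == "") { print "no TDH/TDT found"; exit 2 }
            if (tdh == tdt) print "TDH == TDT " tdh
            else            print "TDH != TDT " tdh " " tdt
        }
    '
}

# Two lines taken from the comment #14 dump.
tx_hang_signature <<'EOF'
Mar 10 07:03:07 mowgli kernel:   TDH                  <b0>
Mar 10 07:03:07 mowgli kernel:   TDT                  <b0>
EOF
```

A whole log excerpt can be piped in as well, for example `grep 'mowgli kernel' /var/log/messages | tx_hang_signature`; only the first TDH and TDT lines are compared.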

I was puzzled at the thought that a buggy BIOS may be part of the problem
especially since as I understand it, the kernel replaces most all of the single
task BIOS code with multi task kernel code of its own.

Once again, the thing that is most interesting is that since a reboot into Xen
kernel,

   Mar 10 08:20:45 mowgli kernel: Linux version 2.6.21-2952.fc8xen
(kojibuilder@hammer2.fedora.redhat.com) (gcc version 4.1.2 20070925 (Red Hat
4.1.2-33)) #1 SMP Mon Nov 19 07:06:55 EST 2007,

I have had no problems. Both the Xen and non-Xen kernels report the same Intel
driver version, but for some reason the Xen kernel does not produce
the Tx Unit Hang messages.  That is to me the most illuminating new information.

Linux version 2.6.21-2952.fc8xen (kojibuilder@hammer2.fedora.redhat.com) (gcc
version 4.1.2 20070925 (Red Hat 4.1.2-33))
Mar 10 08:20:45 mowgli kernel: Intel(R) PRO/1000 Network Driver - version
7.3.20-k2-NAPI

Linux version 2.6.23.14-107.fc8 (mockbuild@xenbuilder4.fedora.phx.redhat.com)
(gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)
Feb 13 18:50:22 mowgli kernel: Intel(R) PRO/1000 Network Driver - version
7.3.20-k2-NAPI

Comment 16 Jesse Brandeburg 2008-03-12 12:39:27 EDT
we have quite a bit more information about this bug than we did in the past.

we have been able to confirm that on 2.6.18 stock kernels on some AMD systems
the memory the driver allocates for tx_ring->desc using pci_alloc_consistent is
not actually consistent as the linux kernel guarantees it to be.

Using bus analyzers and extra driver debugging we can see that the driver
updates the ->desc memory and then tells the adapter to fetch it.  The adapter
then does a DMA and sees the *previous* version of that memory location.

This has to be a misconfiguration of either the memory controller inside the
processor, or somehow a miscommunication where the Host Bridge does not send the
snoop cycles to the memory controller to let it know there are DMA transactions
going to main memory.

my theory at this point is that xen is either setting up the processor with mtrr
(see cat /proc/mtrr for both kernels) in a better way, or something else that is
similar.

we are having a difficult time getting any technical documentation (and
expertise) for suggesting a fix to this issue for non-xen kernels.
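
The /proc/mtrr comparison suggested above could be scripted roughly as follows. This is a sketch, not a tested procedure: the function names are made up for illustration, and stripping the trailing `count=N` field rests on the assumption that it is a usage counter rather than part of the memory-type layout.

```shell
# Save the running kernel's MTRR table, named after the kernel release.
# All three function names are hypothetical helpers.
mtrr_snapshot() {
    local dir=${1:-.}
    cat /proc/mtrr > "$dir/mtrr.$(uname -r)"
}

# Print a saved table with the trailing "count=N" field removed, so that
# only base, size and caching type are compared (assumption: count is a
# usage counter, not configuration).
mtrr_layout() {
    sed 's/, count=[0-9]*$//' "$1"
}

# Compare two saved snapshots; show a unified diff when the layouts differ.
mtrr_compare() {
    local a b
    a=$(mtrr_layout "$1")
    b=$(mtrr_layout "$2")
    if [ "$a" = "$b" ]; then
        echo "MTRR layout identical"
    else
        echo "MTRR layout differs"
        diff -u "$1" "$2"
        return 1
    fi
}
```

Run `mtrr_snapshot ~` once under each kernel, then pass the two resulting files to `mtrr_compare`.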
Comment 17 Greg Morgan 2008-03-18 07:57:54 EDT
I used find and a small bash script to create two files during a boot of the Xen
and non-Xen kernels.  While using a gvim -d on the two files, I found it
interesting that even cpu MHZ info was slightly different between the two
kernels. I understand that some of these differences come from the drivers that
will be loaded.  However, I'd think that the iomem would report the same system
memory.  Also note the difference of the timer_stats versions.

Is there something more that you would like from these files besides the
information below, or do you have another tool that you'd like me to use?

Xen kernel
./mtrr
reg00: base=0x00000000 (   0MB), size=1024MB: write-back, count=0
reg01: base=0xe0000000 (3584MB), size= 128MB: write-combining, count=0
reg02: base=0xd0000000 (3328MB), size= 128MB: write-combining, count=1
./timer_stats
Timer Stats Version: v0.1

./buddyinfo
Node 0, zone      DMA      1    433    582    474    387    142     63     37     24      1     98
Node 0, zone  HighMem      0      1      0      1      0      1      1      0      0      0      0
./slabinfo
slabinfo - version: 2.1

./iomem
00000000-0009efff : System RAM
  00000000-00000000 : Crash kernel
0009fc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000cefff : Video ROM
000d0000-000d3fff : pnp 00:00
000d8000-000d97ff : Adapter ROM
000d9800-000dbfff : pnp 00:00
000f0000-000fffff : System ROM
00100000-3ffeffff : System RAM
3fff0000-3fff2fff : ACPI Non-volatile Storage
3fff3000-3fffffff : ACPI Tables
50000000-53ffffff : PCI CardBus #02
54000000-57ffffff : PCI CardBus #02
58000000-5801ffff : 0000:00:09.0
d0000000-dfffffff : PCI Bus #01
  d0000000-dfffffff : 0000:01:00.0
e0000000-e7ffffff : 0000:00:00.0
e8000000-e9ffffff : PCI Bus #01
  e8000000-e8ffffff : 0000:01:00.0
  e9000000-e901ffff : 0000:01:00.0
eb000000-eb01ffff : 0000:00:09.0
  eb000000-eb01ffff : e1000
eb020000-eb03ffff : 0000:00:09.0
  eb020000-eb03ffff : e1000
eb040000-eb040fff : 0000:00:0b.0
  eb040000-eb040fff : yenta_socket
eb045000-eb0450ff : 0000:00:10.4
  eb045000-eb0450ff : ehci_hcd
fec00000-fec00fff : reserved
fee00000-fee00fff : reserved
ffff0000-ffffffff : reserved
./ioports




non-Xen
./mtrr
reg00: base=0x00000000 (   0MB), size=1024MB: write-back, count=1
reg01: base=0xe0000000 (3584MB), size= 128MB: write-combining, count=1
reg02: base=0xd0000000 (3328MB), size= 128MB: write-combining, count=1
./timer_stats
Timer Stats Version: v0.2

./buddyinfo
Node 0, zone      DMA      5      5      5      3      5      4      2      2      3      1      1
Node 0, zone   Normal      2     15    446    395    351    121     51     12      9      5    129
Node 0, zone  HighMem      1      1      1      1      1      1      0      0      0      0      0
./slabinfo
slabinfo - version: 2.1

./iomem
00000000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000cefff : Video ROM
000d0000-000d3fff : pnp 00:00
000d8000-000d97ff : Adapter ROM
000d9800-000dbfff : pnp 00:00
000f0000-000fffff : System ROM
00100000-3ffeffff : System RAM
  00400000-0062fa08 : Kernel code
  0062fa09-0074a723 : Kernel data
  00797000-0084ae77 : Kernel bss
3fff0000-3fff2fff : ACPI Non-volatile Storage
3fff3000-3fffffff : ACPI Tables
50000000-53ffffff : PCI CardBus #02
54000000-57ffffff : PCI CardBus #02
58000000-5801ffff : 0000:00:09.0
d0000000-dfffffff : PCI Bus #01
  d0000000-dfffffff : 0000:01:00.0
e0000000-e7ffffff : 0000:00:00.0
e8000000-e9ffffff : PCI Bus #01
  e8000000-e8ffffff : 0000:01:00.0
  e9000000-e901ffff : 0000:01:00.0
eb000000-eb01ffff : 0000:00:09.0
  eb000000-eb01ffff : e1000
eb020000-eb03ffff : 0000:00:09.0
  eb020000-eb03ffff : e1000
eb040000-eb040fff : 0000:00:0b.0
  eb040000-eb040fff : yenta_socket
eb045000-eb0450ff : 0000:00:10.4
  eb045000-eb0450ff : ehci_hcd
fec00000-fec00fff : reserved
fee00000-fee00fff : reserved
fff80000-fffeffff : pnp 00:00
ffff0000-ffffffff : reserved
./ioports




# Essentially, I did this.
cd ~/
touch xen.txt
pushd /proc
find . -maxdepth 1 -type f -exec ~/procinfo.sh '{}' ~/xen.txt \;
popd
# reboot in normal kernel.
touch xen_non.txt
pushd /proc
find . -maxdepth 1 -type f -exec ~/procinfo.sh '{}' ~/xen_non.txt \;
popd



[root@mowgli ~]# more procinfo.sh 
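#!/bin/bash
# Usage: procinfo.sh <proc-file> <output-file>
# Appends the name of <proc-file> to <output-file>, followed by its contents
# unless the file is listed below as too big or as failing on read.
case "$1" in
   # skip big file.
   ./kcore)
      echo "$1" >> "$2"
      ;;
   # stock kernel issues
   # Permission denied
   ./sys/kernel/sched_nr_migrate)
      echo "$1" >> "$2"
      ;;
   # Permission denied
   ./sys/net/ipv4/route/flush)
      echo "$1" >> "$2"
      ;;
   # Permission denied
   ./sys/net/ipv6/route/flush)
      echo "$1" >> "$2"
      ;;
   # Invalid argument
   ./sys/fs/binfmt_misc/register)
      echo "$1" >> "$2"
      ;;
   # Input/output error
   ./sysrq-trigger)
      echo "$1" >> "$2"
      ;;
   #xen issues
   # Device or resource busy
   ./acpi/event)
      echo "$1" >> "$2"
      ;;
   # Invalid argument
   ./xen/privcmd)
      echo "$1" >> "$2"
      ;;
   # everything else: record the name, then the contents.
   *)
      echo "$1" >> "$2"
      cat "$1" >> "$2"
      ;;
esac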
[root@mowgli ~]# 






Comment 18 Bug Zapper 2008-11-26 03:41:21 EST
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 19 Bug Zapper 2009-01-09 00:20:04 EST
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
Comment 20 Thomas Müller 2009-06-09 16:20:02 EDT
It looks like this bug resurfaced :(

My system was working perfectly for over a year. A couple of days ago I updated the kernel to the newest one available for Fedora 10 and rebooted.
The first time I tried to transfer some larger files the system completely locked up and I had to power-cycle it.

From this point on I have not been able to transfer any larger file across my network without
a) getting “e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang”
b) complete lock up

I've tried to use the older kernel that used to work fine, but it has the same problem now.


I've just updated to Fedora 11, but I still have the same problem.

This is 100% reproducible, if I start to transfer some large file.

I've also tried the module option “InterruptThrottleRate=0”, but it made no difference.


Any ideas on how to fix this?
Comment 21 Vincent S. Cojot 2009-11-23 06:18:24 EST
(In reply to comment #20)
> It looks like this bug resurfaced :(

Did you find any workaround yet? The same kind of bug hit me when I upgraded the RAM (3Gb to 12Gb) on a RHEL5.4 machine...
Comment 22 Oliver Schinagl 2009-12-30 15:36:25 EST
Ironically, I stumbled upon it as well. I am running Gentoo, however, so this seems to be a generic kernel thing.

The 3 -> 12Gb upgrade is interesting. I still have a 32-bit system, so I'm using PAE at the moment. I used this install on a Xeon (64-bit capable) with 3 GB using PAE. I have since swapped the motherboard for an AMD Phenom 2 one and also went to 8 GB. I stayed 32-bit and only recompiled my kernel with the correct drivers and CPU architecture. The NIC is the same (PCI e1000). When transferring anything more than, say, 150 MB, it chokes badly. So it appears to be a memory-related thing, maybe? Hard to say, however. For now I used ethtool to disable TCP segmentation offload (TSO); I'll test again sometime to see if it helps.

ethtool -K eth0 tso off
Comment 23 Murz 2010-10-25 07:58:12 EDT
Same problem on fresh kernel and e1000 module on Debian Lenny:

# modinfo e1000|grep ^version
version:        7.3.21-k5-NAPI

# lspci
03:02.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller (rev 05)

# uname -a
Linux hostname 2.6.32-bpo.5-amd64 #1 SMP Sat Sep 18 19:03:14 UTC 2010 x86_64 GNU/Linux

But the problem does not show up often, about once every 2-3 days.

Will "ethtool -K eth0 tso off" solve the problem?
