249185 – Non Xen Kernel generates 'Detected Tx Unit Hang' messages on e1000 driver while Xen enabled Kernels do not.

Keywords:
Status:	CLOSED DUPLICATE of bug 398921
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	7
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-07-22 10:12 UTC by Greg Morgan
Modified:	2008-01-09 17:11 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-01-09 17:11:29 UTC
Type:	---
Embargoed:
Dependent Products:

Description Greg Morgan 2007-07-22 10:12:36 UTC

Description of problem:

Please see f5 bug 200656, ES bug 248787, and f6 bug 219496.  I am working on
bug 249136 concerning USB problems.  The f7 Xen kernel 2.6.20-2925.11.fc7xen has
the USB problem.  When I rebooted into the new 2.6.22.1-27 kernel to test it as
a fix for the USB issues, the Tx hang issues I saw on the same hardware in
bug 200656 reappeared.  However, the 2.6.20-2925.11.fc7xen kernel is rock solid.
I am copying a bunch of wav files from my NFS server to a 400 GB USB drive when
the error occurs.  However, simple web surfing can also trigger it.

Version-Release number of selected component (if applicable):
2.6.22.1-27 kernel

How reproducible:
Switching from the 2.6.22.1-27 kernel to the 2.6.20-2925.11.fc7xen kernel and
then back to the 2.6.22.1-27 kernel will produce the error, fix the error, and
then produce the error again, respectively.

Steps to Reproduce:
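1. Upgrade to the 2.6.22.1-27 kernel on a machine with an e1000 card.
2. Copy a large amount of data over the network (for example, from an NFS
   server), or simply browse the web.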
  
Actual results:

Long delays when copying data or surfing the web.

Expected results:

Steady performance under heavy load.

Additional info:
I just closed the bug on FC5 yesterday thinking that the problem was gone.

Jul 22 02:35:26 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul 22 02:35:26 mowgli kernel:   Tx Queue             <0>
Jul 22 02:35:26 mowgli kernel:   TDH                  <61>
Jul 22 02:35:26 mowgli kernel:   TDT                  <61>
Jul 22 02:35:26 mowgli kernel:   next_to_use          <61>
Jul 22 02:35:26 mowgli kernel:   next_to_clean        <75>
Jul 22 02:35:26 mowgli kernel: buffer_info[next_to_clean]
Jul 22 02:35:26 mowgli kernel:   time_stamp           <1e8878f>
Jul 22 02:35:26 mowgli kernel:   next_to_watch        <75>
Jul 22 02:35:26 mowgli kernel:   jiffies              <1e8a5c0>
Jul 22 02:35:26 mowgli kernel:   next_to_watch.status <0>
Jul 22 02:35:27 mowgli kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jul 22 02:35:30 mowgli kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
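
For anyone reading the dump above, the gap between time_stamp and jiffies can be worked out with a short script. This is only a sketch: it assumes the values in angle brackets are hexadecimal (as entries such as <1e8878f> suggest), and the helper name parse_tx_hang_dump is made up for this illustration. Converting the result to seconds would need the kernel's HZ setting, which this report does not state.

```python
import re

# Matches dump lines such as "... kernel:   jiffies              <1e8a5c0>".
FIELD_RE = re.compile(r"kernel:\s+(.+?)\s+<([0-9a-fA-F]+)>")


def parse_tx_hang_dump(lines):
    """Return {field name: integer value} for each '<hex>' field in the dump.

    Hypothetical helper; assumes the bracketed values are hexadecimal.
    """
    fields = {}
    for line in lines:
        match = FIELD_RE.search(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


# Lines copied from the report above.
dump = """\
Jul 22 02:35:26 mowgli kernel:   Tx Queue             <0>
Jul 22 02:35:26 mowgli kernel:   TDH                  <61>
Jul 22 02:35:26 mowgli kernel:   TDT                  <61>
Jul 22 02:35:26 mowgli kernel:   next_to_use          <61>
Jul 22 02:35:26 mowgli kernel:   next_to_clean        <75>
Jul 22 02:35:26 mowgli kernel: buffer_info[next_to_clean]
Jul 22 02:35:26 mowgli kernel:   time_stamp           <1e8878f>
Jul 22 02:35:26 mowgli kernel:   next_to_watch        <75>
Jul 22 02:35:26 mowgli kernel:   jiffies              <1e8a5c0>
Jul 22 02:35:26 mowgli kernel:   next_to_watch.status <0>
""".splitlines()

fields = parse_tx_hang_dump(dump)

# Jiffies elapsed between the buffer's time stamp and the moment of the dump.
elapsed = fields["jiffies"] - fields["time_stamp"]
print(elapsed)  # 7729
```

The same helper can be pointed at any later dump from the same driver to compare how long the stalled buffer had been waiting in each case.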

Comment 1 Greg Morgan 2007-07-23 01:07:51 UTC

The working driver is
Jul 21 17:31:03 mowgli kernel: input: PC Speaker as /class/input/input3
Jul 21 17:31:03 mowgli kernel: Intel(R) PRO/1000 Network Driver - version 7.3.15-k2-NAPI
Jul 21 17:31:03 mowgli kernel: Copyright (c) 1999-2006 Intel Corporation.

The TX hang driver is
Jul 21 17:22:59 mowgli kernel: Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
Jul 21 17:22:59 mowgli kernel: Copyright (c) 1999-2006 Intel Corporation.

uptime
 18:05:04 up 15:28,...
with no TX hang errors on the 7.3.15 driver while copying 320 GB from an NFS
server with an Intel gigabit card, if that helps.

Comment 2 Greg Morgan 2007-07-23 02:03:49 UTC

OK, so I revised the Summary and am providing one of those handy Time-Life
charts.  ;-)  The real issue here is that the Xen kernels available in my grub
menu do not generate the TX hang error messages, while the regular fc7 kernels
do.  While I was creating the table below between reboots, gvim was very
unresponsive on an NFS-mounted home directory under the TX-hanging kernels.

Intel Driver      Kernel                                 TX Hang Issues
7.3.15-k2-NAPI    /boot/vmlinuz-2.6.20-2925.11.fc7xen    Rock Solid
7.3.15-k2-NAPI    /boot/vmlinuz-2.6.20-2925.9.fc7xen     Rock Solid
7.3.20-k2-NAPI    /boot/vmlinuz-2.6.21-1.3228.fc7        TX Issues Encountered
7.3.20-k2-NAPI    /boot/vmlinuz-2.6.22.1-27.fc7          TX Issues Encountered

Comment 3 Chuck Ebbert 2007-07-23 19:46:55 UTC

One workaround to try is turning off TSO:

    # ethtool -K eth0 tso off
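
To confirm the setting actually took effect before retesting, the offload state can be read back with ethtool -k (lowercase k). A rough sketch follows. The function names are made up for illustration, and the sample text is an illustrative stand-in for ethtool output, not captured from the reporter's machine. The parser accepts both "tcp-segmentation-offload" and "tcp segmentation offload" spellings on the assumption that the label differs between ethtool versions.

```python
import subprocess


def tso_enabled(ethtool_k_output):
    """Return True/False from the TSO line of `ethtool -k` output, or None.

    Hypothetical helper. Spaces in the label are normalised to hyphens so
    that either spelling of the label is recognised (an assumption).
    """
    for line in ethtool_k_output.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().replace(" ", "-") == "tcp-segmentation-offload":
            words = value.split()  # tolerates trailing notes after on/off
            return bool(words) and words[0] == "on"
    return None


def query_tso(interface="eth0"):
    """Run `ethtool -k <interface>` and report the TSO state.

    Not called here; it needs ethtool installed and a real interface.
    """
    result = subprocess.run(
        ["ethtool", "-k", interface],
        capture_output=True, text=True, check=True,
    )
    return tso_enabled(result.stdout)


# Illustrative sample text only.
sample = "Offload parameters for eth0:\ntcp segmentation offload: off\n"
print(tso_enabled(sample))  # False
```

If query_tso("eth0") still reports True after the workaround, the hang is being tested with TSO on and the result says nothing about the workaround.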

Comment 4 Christopher Brown 2007-09-20 11:00:02 UTC

Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug and will try and assist you in resolving it if I can.

There hasn't been much activity on this bug for a while. Could you tell me if
you are still having problems with the latest kernel? Did Chuck's suggestion
work for you?

If the problem no longer exists then please close this bug or I'll do so in a
few days if there is no additional information lodged.

Cheers
Chris

Comment 5 Greg Morgan 2007-10-01 16:11:30 UTC

Chris,

Ack.  I'll check the machine tonight with the ethtool command from comment #3.
Note that this is the same host mentioned in bug 200656.  I did perform some of
the ethtool operations in that bug.  However, I am still using the Xen kernels
at this point with no problems at all.  The Xen kernel's e1000 7.3.15-k2-NAPI
driver is very similar to the 7.3.15tdh code that was provided to me in
bug 200656 comment #10.

Regards,
Greg

Comment 6 Christopher Brown 2007-11-19 15:16:12 UTC

Hi Greg,

Any change using ethtool?

Cheers
Chris

Comment 7 Greg Morgan 2007-11-26 01:27:52 UTC

Chris,  Sorry for the many delays.  The problem still exists after running the
ethtool command.  It also exists in f8, as I reported in bug 398921.  The
7.3.15-k2-NAPI Intel driver fixed the problem, but the 7.3.20-k2-NAPI Intel
driver either regressed or added back a similar problem that produces the
following messages:

Nov 25 18:19:03 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Nov 25 18:19:03 mowgli kernel:   Tx Queue             <0>
Nov 25 18:19:03 mowgli kernel:   TDH                  <5a>
Nov 25 18:19:03 mowgli kernel:   TDT                  <5a>
Nov 25 18:19:03 mowgli kernel:   next_to_use          <5a>
Nov 25 18:19:03 mowgli kernel:   next_to_clean        <6e>
Nov 25 18:19:03 mowgli kernel: buffer_info[next_to_clean]
Nov 25 18:19:03 mowgli kernel:   time_stamp           <37cf825>
Nov 25 18:19:03 mowgli kernel:   next_to_watch        <6e>
Nov 25 18:19:03 mowgli kernel:   jiffies              <37d1100>
Nov 25 18:19:03 mowgli kernel:   next_to_watch.status <0>
Nov 25 18:19:05 mowgli kernel: NETDEV WATCHDOG: eth0: transmit timed out

The problem occurs under load but can also occur during web surfing.

Comment 8 Christopher Brown 2008-01-09 17:11:29 UTC

Hi Greg,

Thanks for the update. I'm closing this as a dupe of bug 398921 then - thanks
for filing that one. 2.6.24 is just around the corner so we could see what that
brings ... :)

*** This bug has been marked as a duplicate of 398921 ***
