Bug 200656

Summary: Revisit 194460 and 182215 via Xen Detected Tx Unit Hang with Kernel 2.6.17-1.2157_FC5xen0
Product: [Fedora] Fedora
Component: kernel-xen
Version: 5
Hardware: All
OS: Linux
Status: CLOSED NEXTRELEASE
Severity: medium
Priority: medium
Reporter: Greg Morgan <drkludge>
Assignee: Xen Maintainance List <xen-maint>
QA Contact: Brian Brock <bbrock>
CC: bstein, jesse.brandeburg, pcfe, saurik
URL: http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449
Doc Type: Bug Fix
Last Closed: 2007-07-21 23:07:24 UTC

Description Greg Morgan 2006-07-30 02:42:05 UTC

Description of problem:
The WAF, Wife Acceptance Factor, causes me to revisit bugs 194460 and 182215. 
The wife and kids have complained about using the newest piece of junk I own. 
The board has at least a Sempron processor with matched memory, and all that
should make a difference.  I come to find out that my name-brand Intel e1000
card is at the center of the problem.

/var/log/messages periodically showed:

Jul 29 15:25:16 mowgli kernel: NETDEV WATCHDOG: peth0: transmit timed out
Jul 29 15:25:16 mowgli kernel: xenbr0: port 2(peth0) entering disabled state
Jul 29 15:25:19 mowgli kernel: e1000: peth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
Jul 29 15:25:19 mowgli kernel: xenbr0: port 2(peth0) entering learning state
Jul 29 15:25:19 mowgli kernel: xenbr0: topology change detected, propagating
Jul 29 15:25:19 mowgli kernel: xenbr0: port 2(peth0) entering forwarding state
Jul 29 15:26:41 mowgli kernel: e1000: peth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jul 29 15:26:41 mowgli kernel:   Tx Queue             <0>
Jul 29 15:26:41 mowgli kernel:   TDH                  <7f>
Jul 29 15:26:41 mowgli kernel:   TDT                  <7f>
Jul 29 15:26:41 mowgli kernel:   next_to_use          <7f>
Jul 29 15:26:41 mowgli kernel:   next_to_clean        <93>
Jul 29 15:26:41 mowgli kernel: buffer_info[next_to_clean]
Jul 29 15:26:41 mowgli kernel:   time_stamp           <2ca4ba>
Jul 29 15:26:41 mowgli kernel:   next_to_watch        <93>
Jul 29 15:26:41 mowgli kernel:   jiffies              <2ca66b>
Jul 29 15:26:41 mowgli kernel:   next_to_watch.status <0>

After Google searches I looked in RH Bugzilla and found the other two bug reports.
I upgraded the kernel to 2.6.17-1.2157_FC5xen0. On reboot I received another
"Detected Tx Unit Hang" while the system was booting.  Things seemed stable until I
used "ping -i 0 -q hostname" from several computers.  From the same host
nothing happened.  After I added four more computers the "Detected Tx Unit Hang"
appeared.  As instructed in the Google search results and bug reports, I ran:
/sbin/ethtool -K eth0 tso off
There have been no "Detected Tx Unit Hang" messages for the last two hours since
that final tso off step.
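
One way to check a claim like "no hang messages for the last two hours" is to count the matching lines in the log. The following is only a minimal sketch: count_tx_hangs is a hypothetical helper name, and it assumes the messages land in /var/log/messages as quoted above, with each message on a single line.

```shell
# Hypothetical helper: count "Detected Tx Unit Hang" lines in a syslog-style file.
# The default path is the one quoted in this report; pass another file to override it.
count_tx_hangs() {
    logfile="${1:-/var/log/messages}"
    # grep -c prints 0 but exits non-zero when nothing matches, so mask the status.
    grep -c 'Detected Tx Unit Hang' "$logfile" || true
}
```

Running count_tx_hangs before and after a test period and comparing the two numbers shows whether new hangs were logged in between.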

So the 2.6.17-1.2157 kernel with the Intel(R) PRO/1000 Network Driver, version
7.0.33-k2-NAPI, was an improvement.  The final requirement was turning tso off.
The one fix alone would have satisfied my desktop users. ;-)

Here are some questions for FC6 since release candidates are in progress.
1.) I think Fedora Core will take the black eye since Intel may not want to
support the older e1000 hardware that generates this error.  It looks like a
udev rule with the KERNEL and PROGRAM match keys and assignment keys needs to be
created.  If one of the unsupported e1000 chips is detected, then "/sbin/ethtool -K
eth0 tso off" needs to be executed. The actual chip could be used as another
name for the hardware.  The other option is to change the /etc/rc.d/init.d/network
script to call ethtool.  In either case, Intel needs to provide a list of
orphaned chips for the rules to be written.

2.) ethtool-3-1.2.1 generates "Cannot get driver information: Operation not
supported" with "ethtool -e eth0" command and the xen0 kernel.  The same ethtool
worked with the prior 2.6.16-1.2080.FC5xen0.  The same version does not work
with the 2.6.17-1.2157_FC5xen0 kernel and the Intel 7.0.33-k2-NAPI driver.  Does
a bug need to be filed against ethtool, kernel, or the e1000 driver?

3.) The Intel fixeep.sh as posted here
https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=130866 is broken because
it relies on the ethtool. With -x turned on the script reports

./fixeep.sh eth0
+ '[' eth0 == '' ']'
++ ethtool -e eth0
Cannot get driver information: Operation not supported
++ grep 0x0010
++ awk '{print $16}'
+ var=
+++ echo
+++ tr 02468ace 13579bdf
++ echo
+ new=
+ '[' == ']'
+ echo your eeprom is up to date, no changes made
your eeprom is up to date, no changes made

So we see that fixeep.sh tries to query the card, cannot read the card, and then
says the eeprom is up to date.  The script gives users a false sense of security.

4.) Based on the information above should this bug report be made a blocker for
FC 6 until the udev rules are written?  You see the poor newbies are going to
think the FC 6 distro is at issue here when it may be an Intel policy to not
support old hardware.  (I do find this Intel policy troubling by the way.  It
appears like the driver author(s) control the life cycle of the hardware.  For
example, my old e100 cards work fine but Intel may not be supporting the
hardware.)  The udev rules would support the poor folks with Intel e1000 LOM. 
Moreover, Fedora/Redhat/Linux will take a bad rap because the e1000 card does
not perform as well with the "tso" turned off.

5.) Is this an issue with a 64-bit processor (the Sempron has some of that in it)
combined with an older Intel chip, or just that the chip has RC82540OEM as the
part number?

6.) I have three Intel cards with RC82540OEM on chip and two Intel cards with
the 82544GC reported in the /var/log/messages.  The two 82544GC cards look
responsive to the fixeep.sh until the 2.6.17-1.2157_FC5 kernel is applied.  Do
you need additional information from me at this point?

7.) My sister-in-law encourages me to buy Intel hardware because she works
there.  Slowly my hardware has been replaced with other vendors' stuff.  Look, if
this is an Intel end-of-life issue, then when I go to update the gigabit adapters
it looks like I should switch to another vendor.  I am willing to do one of two things:
a.) If Intel has an upgrade program and wants to take my three RC82540OEM
adapters back for testing new drivers in exchange for some newer supported
adapters, then I am willing to make the trade. ;-)
b.) I was going to set up my mythTV box with one of the RC82540OEM adapters this
weekend.  However, I can throw a box on a DMZ so that you can play with the
Intel adapter if that will help.

The box has now run for four hours without the TX hang.  I'll see what the WAF
and Kid Acceptance Factor are in actual use and report back later. ;-)

Comment 1 Greg Morgan 2006-07-30 04:00:40 UTC

So much for that.

I rebooted.
Performed the /sbin/ethtool -K eth0 tso off
Brought up thunderbird to read mail from the imap server along with firefox as a
normal user.
I see e1000: peth0: e1000_clean_tx_irq: Detected Tx Unit Hang all over the
place.  The prior report was as root with ping commands going.  Perhaps the post
above fixes a couple of ping commands but does not fix the problem while using
real applications.

There's got to be something else.  Perhaps I need to boot with a live CD and try
the fixeep.sh then reboot and try the tso off.

Comment 2 Greg Morgan 2006-07-30 16:32:13 UTC

I rebooted under 2.6.16-1.2080_FC5 with the 6.3.9-k4-NAPI e1000 driver but
without Xen.  I did this so that I could execute ethtool via fixeep.sh.  

For the RC82540OEM chip adapter I receive this
./fixeep.sh eth0
+ '[' eth0 == '' ']'
++ ethtool -e eth0
++ grep 0x0010
++ awk '{print $16}'
+ var=20
+++ echo 0
+++ tr 02468ace 13579bdf
++ echo 21
+ new=21
+ '[' 20 == 21 ']'
+ echo executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
+ ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
Cannot set EEPROM data: Bad address

The ethtool -e eth0 output provides this
Offset          Values
------          ------
0x0000          00 07 e9 15 0d 59 00 02 ff ff ff ff ff ff ff ff
0x0010          84 a7 08 08 0a 66 2e 00 86 80 00 00 00 00 20 b2
0x0020          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0030          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0040          cf 00 61 78 0b 28 00 00 c8 04 ff ff ff ff ff ff
0x0050          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 02 06
0x0060          e4 01 00 40 04 11 ff ff ff ff ff ff ff ff ff ff
0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 97 fb

Likewise, for the 82544GC chip adapter that does not have the TX hang problem I
receive this
./fixeep.sh eth0
+ '[' eth0 == '' ']'
++ ethtool -e eth0
++ grep 0x0010
++ awk '{print $16}'
+ var=04
+++ tr 02468ace 13579bdf
+++ echo 4
++ echo 05
+ new=05
+ '[' 04 == 05 ']'
+ echo executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x05
executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x05
+ ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x05
Cannot set EEPROM data: Bad address

The ethtool -e eth0 output provides this
Offset          Values
------          ------
0x0000          00 02 b3 96 09 9b 20 02 ff ff ff ff ff ff ff ff
0x0010          29 a6 07 47 0b 66 12 11 86 80 0c 10 86 80 04 f2
0x0020          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0030          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0040          ff db 11 00 11 37 ff ff ff ff ff ff ff ff ff ff
0x0050          ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0060          fc 00 00 40 0f 10 ff ff ff ff ff ff ff ff ff ff
0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff 76 b9

I am unable to eliminate the firmware as a source of my continued tx hang
problems via fixeep.sh.  fixeep.sh produces "Cannot set EEPROM data: Bad
address" error messages.  Please advise.

Comment 3 Greg Morgan 2006-07-31 06:53:22 UTC

Speaking of end-of-life hardware, I have an Intel Netport attached to an hp
laserjet that I got at the thrift store for $19.  Since I didn't need the
parallel port, I shut off parallel hardware in the BIOS and freed up an
interrupt.  That gave eth0 (peth0) its very own interrupt.  eth0 was sharing an
interrupt with the USB 2.0 stuff.  This did not solve the problem, however.
cat /proc/interrupts
           CPU0
  1:       2210        Phys-irq  i8042
  8:          1        Phys-irq  rtc
  9:          1        Phys-irq  acpi
 12:     128698        Phys-irq  i8042
 14:      25275        Phys-irq  ide0
 15:      54702        Phys-irq  ide1
 17:    1055321        Phys-irq  peth0
 18:      41851        Phys-irq  VIA8237
 19:          0        Phys-irq  uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, ehci_hcd:usb5
256:    1167094     Dynamic-irq  timer0
257:          0     Dynamic-irq  resched0
258:          0     Dynamic-irq  callfunc0
259:         85     Dynamic-irq  xenbus
260:          0     Dynamic-irq  console
NMI:          0
LOC:          0
ERR:          0
MIS:          0

I swapped the cards between two computers.  The card that was rock solid in one
computer started failing in the other.  I noticed this pattern in several other
posts, e.g. http://lkml.org/lkml/2005/12/19/144 and
http://www.gatago.com/linux/kernel/14660762.html :

Working System/Card
amd XP 1800+
on a pc133 memory system
512 Meg.

Failing System/Card
AMD Sempron 2600+
400 Front Side Bus
1 gig  memory in two matching 512Meg Dimm # It hurt when I bought it.
= Detected Tx Unit Hang

These guys made me think about the problem in a different light
http://www.2cpu.com/forums/showthread.php?t=75798 . Since ethtool would not work
with the Intel(R) PRO/1000 Network Driver - version 7.0.33-k2-NAPI, I added the
following to my /etc/modprobe.conf settings.  Note the options line is one
contiguous line.  The alias eth0 e1000 line was already in the modprobe.conf file:

...
alias eth0 e1000
#
# Attempt to fix e1000_clean_tx_irq: Detected Tx Unit Hang
# http://www.2cpu.com/forums/showthread.php?t=75798
# http://www.gatago.com/linux/kernel/14660762.html
# http://lkml.org/lkml/2005/12/19/144
# http://support.intel.com/support/network/sb/CS-009209.htm
# http://support.intel.com/support/network/sb/cs-009918.htm
# ftp://download.intel.com/design/network/applnots/ap450.pdf
# http://agenda.clustermonkey.net/index.php/Tuning_Intel_e1000_NICs
# http://downloadmirror.intel.com/df-support/9180/ENG/README.txt
#
options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0
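
Since the options line has to be entered as one contiguous line, it can help to print each option on its own line and confirm that nothing was lost to a line wrap. This is only a minimal sketch: list_e1000_options is a hypothetical helper name, and it assumes a modprobe.conf-style file with the options on a single physical line.

```shell
# Hypothetical helper: print each e1000 module option found in a modprobe.conf-style file.
# The default path is the one named in this report; pass another file to override it.
list_e1000_options() {
    conf="${1:-/etc/modprobe.conf}"
    # Only fields on the same physical line as "options e1000" are printed,
    # so anything that wrapped onto a following line will be missing from the output.
    awk '$1 == "options" && $2 == "e1000" { for (i = 3; i <= NF; i++) print $i }' "$conf"
}
```

For the settings above the output should contain nine entries, from XsumRX=0 through TxIntDelay=0.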

Hopefully this provides some insight into the problem.  The card works great on
an old ecs piece of junk with no errors.  Put the same card in a faster ecs
piece of junk and the "NETDEV WATCHDOG" generates "Detected Tx Unit Hang"
messages as noted in the RxIntDelay notes here
http://support.intel.com/support/network/sb/CS-009209.htm .

The above modprobe settings made the e1000 and the system usable again.  I still
receive a few "Tx Unit Hang" messages but I was able to wget a Berry Linux ISO
while browsing distrowatch.com; read email from the imap server; play full wave
audio from the same imap/nfs server.  The messages appeared about five minutes
apart.  For the WAF, she would think that it was a momentary pause on the web
site not realizing that four TX Unit Hang messages just appeared in
/var/log/messages.

Theory: The faster hardware requires that the e1000 use larger buffers and
stuff.  Can anyone suggest a fix for the driver then or at least an improvement
on the modprobe.conf settings above as a work around?  The Wife Allocated Time,
WAT, has well been spent. ;-)

Comment 4 Auke Kok 2006-07-31 20:08:10 UTC

(In reply to comment #2)
> I rebooted under 2.6.16-1.2080_FC5 with the 6.3.9-k4-NAPI e1000 driver but
> without Xen.  I did this so that I could execute ethtool via fixeep.sh.  
> 
> For the RC82540OEM chip adapter I receive this
> ./fixeep.sh eth0
> + '[' eth0 == '' ']'
> ++ ethtool -e eth0
> ++ grep 0x0010
> ++ awk '{print $16}'
> + var=20
> +++ echo 0
> +++ tr 02468ace 13579bdf
> ++ echo 21
> + new=21
> + '[' 20 == 21 ']'
> + echo executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
> executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
> + ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
> Cannot set EEPROM data: Bad address
> 
> The ethtool -e eth0 output provides this
> Offset          Values
> ------          ------
> 0x0000          00 07 e9 15 0d 59 00 02 ff ff ff ff ff ff ff ff
> 0x0010          84 a7 08 08 0a 66 2e 00 86 80 00 00 00 00 20 b2

This hardware does not need the specific fixup mentioned.

Comment 5 Auke Kok 2006-07-31 20:24:01 UTC

(In reply to comment #0)
> Here's some questions for FC6 since release candidates are in progress.
> 1.) I think Fedora core will take the black eye since Intel may not want to
> support the older e1000 hardware that generates this error.

all of the chipsets mentioned are supported by our driver, there are no plans to
discontinue linux support for them either.


> 3.) The Intel fixeep.sh as posted here
> https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=130866 is broken because
> it relies on the ethtool. With -x turned on the script reports
> 
> ./fixeep.sh eth0
> + '[' eth0 == '' ']'
> ++ ethtool -e eth0
> Cannot get driver information: Operation not supported
> ++ grep 0x0010
> ++ awk '{print $16}'
> + var=
> +++ echo
> +++ tr 02468ace 13579bdf
> ++ echo
> + new=
> + '[' == ']'
> + echo your eeprom is up to date, no changes made
> your eeprom is up to date, no changes made
> 
> So we see that fixeep.sh tries to query the card; cannot read the card; and says
> the eeprom is up to date.  The script reports a false sense of security for users.

the e1000 driver must be loaded before you can use this script. It's indeed not
as user-friendly as it could be, but without loading e1000.ko there is no way to
read the EEPROM :)

> For
> example, my old e100 cards work fine but Intel may not be supporting the
> hardware.

Same as above - all PCI-based e100's are supported by Intel. There is no plan to
discontinue support for certain e100's either.

Comment 6 Greg Morgan 2006-08-01 12:38:21 UTC

Auke, Thank you for the quick reply.

ethtool -e is at the heart of the fixeep.sh script. The 7.0.33-k2-NAPI version
of the e1000 driver and the ethtool-3-1.2.1--version 3--of ethtool look like
they do not work together as noted in the command line output. lsmod shows that
the driver is loaded.  ethtool -e will not produce a table to grep against with
the new driver.  
[root@mowgli ~]# lsmod
...
e1000                 109881  0
...
[root@mowgli ~]# ethtool -e eth0
Cannot get driver information: Operation not supported
[root@mowgli ~]# ethtool -i eth0
Cannot get driver information: Operation not supported


[root@mowgli ~]# ./fixeep.sh eth0
Cannot get driver information: Operation not supported
your eeprom is up to date, no changes made

I understand that the script is a quick hack but without reading the eeprom
table the above output is not correct.
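
To illustrate the point, here is a minimal sketch of a guard that refuses to say "up to date" when nothing could be read. It is not Intel's fixeep.sh: check_eeprom_byte is a hypothetical name, the byte position (field 16 of the 0x0010 row, i.e. offset 0x1e) and the even-to-odd tr mapping are inferred from the -x traces earlier in this report, and it only prints what it would do instead of calling ethtool -E.

```shell
# Hypothetical helper, not part of fixeep.sh: reads `ethtool -e` output on stdin
# and reports on the byte at offset 0x1e without writing anything to the card.
check_eeprom_byte() {
    # Field 16 of the 0x0010 row is the byte at offset 0x1e, as in the traces above.
    var=$(awk '$1 == "0x0010" { print $16 }')
    case "$var" in
        [0-9a-f][0-9a-f]) ;;
        *)
            # Empty or malformed input: say so instead of claiming success.
            echo "could not read EEPROM byte at offset 0x1e" >&2
            return 2
            ;;
    esac
    high=${var%?}   # first hex digit
    low=${var#?}    # second hex digit
    # Map an even low digit to the next odd one, mirroring the tr call in the trace.
    new="$high$(printf '%s' "$low" | tr 02468ace 13579bdf)"
    if [ "$var" = "$new" ]; then
        echo "eeprom byte already 0x$var, no change needed"
    else
        echo "would change offset 0x1e from 0x$var to 0x$new"
    fi
}
```

It would be used as "ethtool -e eth0 | check_eeprom_byte". With the "Operation not supported" failure shown above, stdin is empty and the function returns status 2 with an error message rather than reporting success.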

Comment 7 Jesse Brandeburg 2006-08-21 19:57:38 UTC

(In reply to comment #3)
> I swapped the cards between two computers.  The card that was rock solid in one
> computer started failing in the other.  I noticed this pattern on several other
> posts i.e http://lkml.org/lkml/2005/12/19/144
> http://www.gatago.com/linux/kernel/14660762.html :

probably indicates that the physical chip itself is not to blame.

> Working System/Card
> amd XP 1800+
> on a pc133 memory system
> 512 Meg.
> 
> Failing System/Card
> AMD Sempron 2600+
> 400 Front Side Bus
> 1 gig  memory in two matching 512Meg Dimm # It hurt when I bought it.
> = Detected Tx Unit Hang

are these systems identical in every way besides the processor?  You are most
likely running into a bios problem with how it configures the chipset for
"failing" system.
 
> options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3
> RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0
> 
> Hopefully this provides some insight into the problem.  The card works great on

well your machine doesn't have an 82573, so it doesn't need the eeprom fix.  You
likely need a newer version of ethtool (application) to work correctly with the
eeprom dump, but anyway that is irrelevant to the TX hang discussion here.

> Theory: The faster hardware requires that the e1000 use larger buffers and
> stuff.  Can anyone suggest a fix for the driver then or at least an improvement
> on the modprobe.conf settings above as a work around?  The Wife Allocated
> Time, WAT, has well been spent. ;-)

> Jul 29 15:26:41 mowgli kernel:   TDH                  <7f>
> Jul 29 15:26:41 mowgli kernel:   TDT                  <7f>
> Jul 29 15:26:41 mowgli kernel:   next_to_use          <7f>

From this output you posted in a previous entry to this bug, I can tell that
your hardware is actually not hanging.  The driver is waiting for a bit to be
set that the hardware almost assuredly wrote, but for some reason never shows
up in host memory.
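
As a rough illustration of that reading of the dump, the sketch below pulls TDH, TDT, next_to_use and next_to_clean out of the log text and reports the two comparisons. It is only a sketch: summarize_tx_dump is a hypothetical helper name, the log format is assumed to match the lines quoted in this bug, and the wording of the messages is an interpretation rather than driver output.

```shell
# Hypothetical helper: summarize an e1000 "Detected Tx Unit Hang" dump read from stdin.
summarize_tx_dump() {
    awk '
        # Return the text between "<" and ">" on a log line.
        function val(line,    s) {
            s = line
            sub(/^.*</, "", s)
            sub(/>.*$/, "", s)
            return s
        }
        / TDH +</           { tdh = val($0) }
        / TDT +</           { tdt = val($0) }
        / next_to_use +</   { ntu = val($0) }
        / next_to_clean +</ { ntc = val($0) }
        END {
            if (tdh == "" || tdt == "" || ntu == "" || ntc == "") {
                print "incomplete dump"
                exit 1
            }
            # Appending "" forces string comparison of the hex digits.
            if ((tdh "") == (tdt ""))
                print "TDH == TDT (" tdh "): hardware has fetched every queued descriptor"
            else
                print "TDH (" tdh ") != TDT (" tdt "): hardware still has descriptors to fetch"
            if ((ntc "") != (ntu ""))
                print "next_to_clean (" ntc ") != next_to_use (" ntu "): driver still waiting for completion status"
        }
    '
}
```

Fed the Jul 29 15:26:41 dump from the original report, it prints the "TDH == TDT" line and the "next_to_clean != next_to_use" line, which is the combination described above: the hardware side looks finished while the driver is still waiting.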

We actually see a few of these issues, it is not related to just AMD platforms
but it seems that in particular the VIA KT600 chipsets were very prone to have
this problem.  In almost all cases there is something misconfigured in the
chipset by the BIOS that causes these writes to host memory from the e1000
adapter to disappear.

I have a driver patch that can attempt to work around this issue at the cost of
slightly higher cpu utilization for all transmit clean up, are you interested to
try?

Comment 8 Greg Morgan 2006-09-01 23:16:06 UTC

(In reply to comment #7)
> (In reply to comment #3)
> > I swapped the cards between two computers.  The card that was rock solid in one
> > computer started failing in the other.  I noticed this pattern on several other
> > posts i.e http://lkml.org/lkml/2005/12/19/144
> > http://www.gatago.com/linux/kernel/14660762.html :
> 
> probably indicates that the physical chip itself is not to blame.

OK.  That is good news.

> 
> > Working System/Card
> > amd XP 1800+
> > on a pc133 memory system
> > 512 Meg.

Additional information on working systems.
None of the hardware is pushed.

System kaa
ECS K7S5A Release 11/21/2001 S
Bios 62-1121-001131-00101111-040201-SiS735-K7S5A
AMD Athlon XP 1800+
Blue DIMM slots filled = 512 Meg of PC2100
SDR/DDR CAS Latency SPD
SDR/DDR RAS Active Time 6T
SDR/DDR RAS Precharge Time 4T
Auto Detect DIMM/PCI CLK enabled

System bagheera  (As reported in this bug)
ECS K7S5A
ECS K7S5A Release 10/29/2002 S
Bios 62-1029-001131-00101111-040201-SiS735-K7S5A
AMD Athlon XP 1800+
Blue DIMM slots filled = 512 Meg of PC2100 (Correction: not pc133)
DRAM/CPU 133/133 MHZ
SDR/DDR CAS Latency SPD
SDR/DDR RAS Active Time 6T
SDR/DDR RAS Precharge Time 4T
Auto Detect DIMM/PCI CLK enabled

> > 
> > Failing System/Card
> > AMD Sempron 2600+
> > 400 Front Side Bus
> > 1 gig  memory in two matching 512Meg Dimm # It hurt when I bought it.
> > = Detected Tx Unit Hang
> 

Additional information on failing system.
Hardware is not pushed.

System mowgli
ECS K7FSB
KT600-A Ver:1.1E 09/13/2004
has original Bios of 09/13/2004-KT600-8237-6A6LYE1FC-00
AMD Sempron 2600+
Current FSB Frequency 166MHZ
Current DRAM Frequency 200MHZ
DRAM Timing Auto by SPD
DRAM CAS Latency 2.5
Bank Interleave 2 bank
Precharge to Active (TRP) 5T
Active to Precharge (TRAS) 7T
Active to CMD (TRCD) 5T
Dram Burst Length 4
Dram command Rate 2T command
Write Recovery Time 3T


> are these systems identical in every way besides the processor?  You are most
> likely running into a bios problem with how it configures the chipset for
> "failing" system.


Additional system information above.  Systems kaa and bagheera are almost
identical and have no problems with the current e1000 drivers.  These two
systems are the Athlon XP 1800+ chips.  System mowgli is the Sempron 2600+ with
the problem.  System mowgli is newer and has twice the memory, etc.
 
>  
> > options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3
> > RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0
> > 
> > Hopefully this provides some insight into the problem.  The card works great on
> 
> well your machine doesn't have an 82573, so it doesn't need the eeprom fix.  You
> likely need a newer version of ethtool (application) to work correctly with the


Understood that it is not critical. ethtool on FC 5 is ethtool-3-1.2.1 based on
rpm -q --whatprovides ethtool.  Should this be updated for FC6 or for FC5 for
that matter?  I don't see how to check package versions in the FC6 stuff.
The SF site says that ethtool 4 and 5 were just released on 9/1/2006
http://sourceforge.net/project/showfiles.php?group_id=3242  I know this is not
an Intel question but should a blocker be made for FC6?


> eeprom dump, but anyway that is irrelevant to the TX hang discussion here.
> 
> > Theory: The faster hardware requires that the e1000 use larger buffers and
> > stuff.  Can anyone suggest a fix for the driver then or at least an improvement
> > on the modprobe.conf settings above as a work around?  The Wife Allocated
> > Time, WAT, has well been spent. ;-)
> 
> > Jul 29 15:26:41 mowgli kernel:   TDH                  <7f>
> > Jul 29 15:26:41 mowgli kernel:   TDT                  <7f>
> > Jul 29 15:26:41 mowgli kernel:   next_to_use          <7f>
> 
> From this output you posted in a previous entry to this bug, I can tell that
> your hardware is actually not hanging.  The driver is waiting for a bit to be
> set that the hardware almost assuredly wrote, but for some reason never shows
> up in host memory.


Is this also a Xen problem along with a driver and BIOS problem?
http://wiki.xensource.com/xenwiki/XenFaq#head-4ce9767df34fe1c9cf4f85f7e07cb10110eae9b7
All three computers are running Xen.



> 
> We actually see a few of these issues, it is not related to just AMD platforms
> but it seems that in particular the VIA KT600 chipsets were very prone to have
> this problem.  In almost all cases there is something misconfigured in the
> chipset by the BIOS that causes these writes to host memory from the e1000
> adapter to disappear.
> 
> I have a driver patch that can attempt to work around this issue at the cost of
> slightly higher cpu utilization for all transmit clean up, are you interested to
> try?



Tee Hee "slightly higher CPU utilization" I don't think mowgli works too hard as
a desktop machine.  I'd be happy to try the driver.  

In a controlled way I can try a BIOS update first before trying your new driver
or I can go straight to the driver.  Do you have any preference?

Comment 9 Greg Morgan 2006-09-01 23:45:22 UTC

> Understood that it is not critical. ethtool on FC 5 is ethtool-3-1.2.1 based on
> rpm -q --whatprovides ethtool.  Should this be updated for FC6 or for FC5 for
> that matter?  I don't see how to check in the FC6 stuff about package versions.
> The SF site says that ethtool 4 and 5 were just released on 9/1/2006
> http://sourceforge.net/project/showfiles.php?group_id=3242  I know this is not
> an Intel question but should a blocker be made for FC6?
> 

Bug 205000 was created for this concern.

Comment 10 Greg Morgan 2006-10-15 19:16:30 UTC

Update: Jesse provided me with a driver to test.  The driver has been in use for
48 hours without problems.  I had my son try some of the things he did before
during this time: Battle for Wesnoth; web surfing; and playing full wav audio
from the NFS server.  The version of the driver is 7.3.15_tdhdump-NAPI

The driver was installed as follows:
copy the tar file to /usr/src/redhat/SOURCES/e1000-7.3.15tdh.tar.gz
untar the file.
rpmbuild -ba e1000.spec
rpm -ivh /usr/src/redhat/RPMS/i386/e1000-7.3.15tdh-1.i386.rpm
reboot.

I still had my work around options of 
options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3
RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0
in the /etc/modprobe.conf file.  I will comment these out and see how the stock
options work for this version of the driver.

Comment 11 Greg Morgan 2006-10-18 07:25:39 UTC

Since 10/15/2006 at 12:04 the modprobe options have been removed as noted in
comment #10 . I still do not have any of the TX hang issues reported in /var/log
files.  The 7.3.15_tdhdump-NAPI driver appears to have solved the issues as
reported in this and other bug reports.  Later this week I can try massive copy
of, say, ISO files as an additional test.  Note that even simple web surfing
could generate TX hang issues.  Hence, the multiple ISO file copy at one time
should be no problem at this point.

Thank you for the resolution to this problem and allowing me to participate in
the formation of a solution.

Comment 12 Greg Morgan 2007-07-21 23:07:24 UTC

This problem has been resolved in Fedora 7.  I did not have time to install
Fedora 6 on this hardware configuration so I don't know that Fedora 6 had a
resolution.  Thanks to all the people that assisted me.  Regards, Greg

Comment 13 Jay Freeman 2008-03-30 21:36:01 UTC

Is it possible to obtain the fixed version of this driver from anywhere? Is it 
equivalent to http://people.redhat.com/agospoda/rhel5/e1000-7.3.15tdh.patch?