Bug 200656
Summary: | Revisit 194460 and 182215 via Xen Detected Tx Unit Hang with Kernel 2.6.17-1.2157_FC5xen0 | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Greg Morgan <drkludge> |
Component: | kernel-xen | Assignee: | Xen Maintainance List <xen-maint> |
Status: | CLOSED NEXTRELEASE | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 5 | CC: | bstein, jesse.brandeburg, pcfe, saurik |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | http://sourceforge.net/tracker/index.php?func=detail&aid=1463045&group_id=42302&atid=447449 | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2007-07-21 23:07:24 UTC | Type: | --- |
Description
Greg Morgan
2006-07-30 02:42:05 UTC
So much for that. I rebooted and ran /sbin/ethtool -K eth0 tso off. I then brought up Thunderbird to read mail from the IMAP server, along with Firefox, as a normal user. I see

e1000: peth0: e1000_clean_tx_irq: Detected Tx Unit Hang

all over the place. The prior report was as root with ping commands going. Perhaps the post above fixes a couple of ping commands but does not fix the problem while using real applications. There's got to be something else. Perhaps I need to boot with a live CD and try fixeep.sh, then reboot and try tso off.

I rebooted under 2.6.16-1.2080_FC5 with the 6.3.9-k4-NAPI e1000 driver but without Xen. I did this so that I could execute ethtool via fixeep.sh.

For the RC82540OEM chip adapter I receive this:
./fixeep.sh eth0
+ '[' eth0 == '' ']'
++ ethtool -e eth0
++ grep 0x0010
++ awk '{print $16}'
+ var=20
+++ echo 0
+++ tr 02468ace 13579bdf
++ echo 21
+ new=21
+ '[' 20 == 21 ']'
+ echo executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
+ ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
Cannot set EEPROM data: Bad address

The ethtool -e eth0 output provides this:
Offset  Values
------  ------
0x0000  00 07 e9 15 0d 59 00 02 ff ff ff ff ff ff ff ff
0x0010  84 a7 08 08 0a 66 2e 00 86 80 00 00 00 00 20 b2
0x0020  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0030  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0040  cf 00 61 78 0b 28 00 00 c8 04 ff ff ff ff ff ff
0x0050  ff ff ff ff ff ff ff ff ff ff ff ff ff ff 02 06
0x0060  e4 01 00 40 04 11 ff ff ff ff ff ff ff ff ff ff
0x0070  ff ff ff ff ff ff ff ff ff ff ff ff ff ff 97 fb

Likewise, for the 82544GC chip adapter that does not have the Tx hang problem I receive this:
./fixeep.sh eth0
+ '[' eth0 == '' ']'
++ ethtool -e eth0
++ grep 0x0010
++ awk '{print $16}'
+ var=04
+++ tr 02468ace 13579bdf
+++ echo 4
++ echo 05
+ new=05
+ '[' 04 == 05 ']'
+ echo executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x05
executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x05
+ ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x05
Cannot set EEPROM data: Bad address

The ethtool -e eth0 output provides this:
Offset  Values
------  ------
0x0000  00 02 b3 96 09 9b 20 02 ff ff ff ff ff ff ff ff
0x0010  29 a6 07 47 0b 66 12 11 86 80 0c 10 86 80 04 f2
0x0020  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0030  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0040  ff db 11 00 11 37 ff ff ff ff ff ff ff ff ff ff
0x0050  ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0060  fc 00 00 40 0f 10 ff ff ff ff ff ff ff ff ff ff
0x0070  ff ff ff ff ff ff ff ff ff ff ff ff ff ff 76 b9

I am unable to eliminate the firmware as a source of my continuing Tx hang problems via fixeep.sh. fixeep.sh produces "Cannot set EEPROM data: Bad address" error messages. Please advise.

Speaking of end-of-life hardware, I have an Intel Netport attached to an HP LaserJet that I got at the thrift store for $19. Since I didn't need the parallel port, I shut off the parallel hardware in the BIOS and freed up an interrupt. That gave eth0 (peth0) its very own interrupt. eth0 had been sharing an interrupt with the USB 2.0 stuff. This did not solve the problem, however.

cat /proc/interrupts
           CPU0
  1:       2210  Phys-irq     i8042
  8:          1  Phys-irq     rtc
  9:          1  Phys-irq     acpi
 12:     128698  Phys-irq     i8042
 14:      25275  Phys-irq     ide0
 15:      54702  Phys-irq     ide1
 17:    1055321  Phys-irq     peth0
 18:      41851  Phys-irq     VIA8237
 19:          0  Phys-irq     uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, ehci_hcd:usb5
256:    1167094  Dynamic-irq  timer0
257:          0  Dynamic-irq  resched0
258:          0  Dynamic-irq  callfunc0
259:         85  Dynamic-irq  xenbus
260:          0  Dynamic-irq  console
NMI:          0
LOC:          0
ERR:          0
MIS:          0

I swapped the cards between two computers. The card that was rock solid in one computer started failing in the other.
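A note on the fixeep.sh traces above: they suggest the script takes field 16 of the 0x0010 row of the ethtool -e dump (the byte at offset 0x1e) and sets its lowest bit, so 20 becomes 21 and 04 becomes 05. The following is only a minimal sketch of that arithmetic, not the Intel script. The function names eeprom_byte_1e and set_low_bit are hypothetical, and the sketch never writes to the EEPROM. It also rejects an empty read, since comparing two empty strings would look like "nothing to change".

```shell
#!/bin/sh
# Sketch only: mirrors the arithmetic visible in the fixeep.sh trace.
# Function names are hypothetical; nothing here writes to the EEPROM.

# eeprom_byte_1e: reads an "ethtool -e" style dump on stdin and prints
# the byte at offset 0x1e (field 16 of the row that starts with 0x0010).
eeprom_byte_1e() {
    awk '/^0x0010/ { print $16 }'
}

# set_low_bit HEXBYTE: prints HEXBYTE with its least significant bit set.
# Returns 1 when the input is not exactly two hex digits, so a failed
# (empty) read is reported instead of being treated as "up to date".
set_low_bit() {
    byte=$1
    case $byte in
        [0-9a-fA-F][0-9a-fA-F]) ;;
        *)
            echo "set_low_bit: expected two hex digits, got '$byte'" >&2
            return 1
            ;;
    esac
    printf '%02x\n' $(( 0x$byte | 1 ))
}
```

A possible use is `ethtool -e eth0 | eeprom_byte_1e` followed by `set_low_bit` on the result; when ethtool prints nothing, the second step fails with an error rather than claiming that no change is needed.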
I noticed this pattern in several other posts, e.g. http://lkml.org/lkml/2005/12/19/144 and http://www.gatago.com/linux/kernel/14660762.html :

Working System/Card
amd XP 1800+
on a pc133 memory system
512 Meg.

Failing System/Card
AMD Sempron 2600+
400 Front Side Bus
1 gig memory in two matching 512Meg Dimm # It hurt when I bought it.
= Detected Tx Unit Hang

These guys made me think about the problem in a different light: http://www.2cpu.com/forums/showthread.php?t=75798 . Since ethtool would not work with the Intel(R) PRO/1000 Network Driver - version 7.0.33-k2-NAPI, I added the following to my /etc/modprobe.conf settings. Note that the options line is one contiguous line. The alias eth0 e1000 line was already in the modprobe.conf file:

...
alias eth0 e1000
#
# Attempt to fix e1000_clean_tx_irq: Detected Tx Unit Hang
# http://www.2cpu.com/forums/showthread.php?t=75798
# http://www.gatago.com/linux/kernel/14660762.html
# http://lkml.org/lkml/2005/12/19/144
# http://support.intel.com/support/network/sb/CS-009209.htm
# http://support.intel.com/support/network/sb/cs-009918.htm
# ftp://download.intel.com/design/network/applnots/ap450.pdf
# http://agenda.clustermonkey.net/index.php/Tuning_Intel_e1000_NICs
# http://downloadmirror.intel.com/df-support/9180/ENG/README.txt
#
options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0

Hopefully this provides some insight into the problem. The card works great on an old ecs piece of junk with no errors. Put the same card in a faster ecs piece of junk and the "NETDEV WATCHDOG" generates "Detected Tx Unit Hang" messages, as noted in the RxIntDelay notes here: http://support.intel.com/support/network/sb/CS-009209.htm . The above modprobe settings made the e1000 and the system usable again.
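One way to put a number on "usable again" is to count the hang messages in the system log before and after a change to the modprobe options. This is a minimal sketch under stated assumptions: the function name count_tx_hangs is hypothetical, and the log path passed to it (for example /var/log/messages) is assumed to be where the kernel messages end up on the machine in question.

```shell
#!/bin/sh
# Sketch only: count_tx_hangs is a hypothetical helper, not part of any
# driver or distribution tooling.

# count_tx_hangs LOGFILE: prints how many lines of LOGFILE contain the
# driver's "Detected Tx Unit Hang" message.
count_tx_hangs() {
    if [ ! -r "$1" ]; then
        echo "count_tx_hangs: cannot read '$1'" >&2
        return 1
    fi
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps a zero count from looking like a failure.
    grep -c 'Detected Tx Unit Hang' "$1" || true
}
```

For example, `count_tx_hangs /var/log/messages` could be run once with the stock options and again after a comparable period with the tuned options.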
I still receive a few "Tx Unit Hang" messages, but I was able to wget a Berry Linux ISO while browsing distrowatch.com, reading email from the IMAP server, and playing full wave audio from the same IMAP/NFS server. The messages appeared about five minutes apart. As for the WAF, she would think that it was a momentary pause on the web site, not realizing that four Tx Unit Hang messages had just appeared in /var/log/messages.

Theory: The faster hardware requires that the e1000 use larger buffers and stuff. Can anyone suggest a fix for the driver, or at least an improvement on the modprobe.conf settings above as a workaround? The Wife Allocated Time, WAT, has well been spent. ;-)

(In reply to comment #2)
> I rebooted under 2.6.16-1.2080_FC5 with the 6.3.9-k4-NAPI e1000 driver but
> without Xen. I did this so that I could execute ethtool via fixeep.sh.
>
> For the RC82540OEM chip adapter I receive this
> ./fixeep.sh eth0
> + '[' eth0 == '' ']'
> ++ ethtool -e eth0
> ++ grep 0x0010
> ++ awk '{print $16}'
> + var=20
> +++ echo 0
> +++ tr 02468ace 13579bdf
> ++ echo 21
> + new=21
> + '[' 20 == 21 ']'
> + echo executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
> executing command: ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
> + ethtool -E eth0 magic 0x108c8086 offset 0x1e value 0x21
> Cannot set EEPROM data: Bad address
>
> The ethtool -e eth0 output provides this
> Offset Values
> ------ ------
> 0x0000 00 07 e9 15 0d 59 00 02 ff ff ff ff ff ff ff ff
> 0x0010 84 a7 08 08 0a 66 2e 00 86 80 00 00 00 00 20 b2

This hardware does not need the specific fixup mentioned.

(In reply to comment #0)
> Here's some questions for FC6 since release candidates are in progress.
> 1.) I think Fedora core will take the black eye since Intel may not want to
> support the older e1000 hardware that generates this error.

All of the chipsets mentioned are supported by our driver, and there are no plans to discontinue Linux support for them either.

> 3.) The Intel fixeep.sh as posted here
> https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=130866 is broken because
> it relies on the ethtool. With -x turned on the script reports
>
> ./fixeep.sh eth0
> + '[' eth0 == '' ']'
> ++ ethtool -e eth0
> Cannot get driver information: Operation not supported
> ++ grep 0x0010
> ++ awk '{print $16}'
> + var=
> +++ echo
> +++ tr 02468ace 13579bdf
> ++ echo
> + new=
> + '[' == ']'
> + echo your eeprom is up to date, no changes made
> your eeprom is up to date, no changes made
>
> So we see that fixeep.sh tries to query the card; cannot read the card; and says
> the eeprom is up to date. The script reports a false sense of security for users.

The e1000 driver must be loaded before you can use this script. It's indeed not as user-friendly as it could be, but without loading e1000.ko there is no way to read the EEPROM :)

> For
> example, my old e100 cards work fine but Intel may not be supporting the
> hardware.

Same as above: all PCI-based e100's are supported by Intel. There is no plan to discontinue support for certain e100's either.

Auke,

Thank you for the quick reply. ethtool -e is at the heart of the fixeep.sh script. The 7.0.33-k2-NAPI version of the e1000 driver and ethtool-3-1.2.1 (version 3 of ethtool) look like they do not work together, as shown in the command line output below. lsmod shows that the driver is loaded. ethtool -e will not produce a table to grep against with the new driver.

[root@mowgli ~]# lsmod
...
e1000 109881 0
...
[root@mowgli ~]# ethtool -e eth0
Cannot get driver information: Operation not supported
[root@mowgli ~]# ethtool -i eth0
Cannot get driver information: Operation not supported
[root@mowgli ~]# ./fixeep.sh eth0
Cannot get driver information: Operation not supported
your eeprom is up to date, no changes made

I understand that the script is a quick hack, but without reading the EEPROM table the above output is not correct.

(In reply to comment #3)
> I swapped the cards between two computers. The card that was rock solid in one
> computer started failing in the other. I noticed this pattern on several other
> posts i.e http://lkml.org/lkml/2005/12/19/144
> http://www.gatago.com/linux/kernel/14660762.html :

That probably indicates that the physical chip itself is not to blame.

> Working System/Card
> amd XP 1800+
> on a pc133 memory system
> 512 Meg.
>
> Failing System/Card
> AMD Sempron 2600+
> 400 Front Side Bus
> 1 gig memory in two matching 512Meg Dimm # It hurt when I bought it.
> = Detected Tx Unit Hang

Are these systems identical in every way besides the processor? You are most likely running into a BIOS problem with how it configures the chipset for the "failing" system.

> options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3
> RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0
>
> Hopefully this provides some insight into the problem. The card works great on

Well, your machine doesn't have an 82573, so it doesn't need the EEPROM fix. You likely need a newer version of ethtool (the application) to work correctly with the EEPROM dump, but anyway that is irrelevant to the Tx hang discussion here.

> Theory: The faster hardware requires that the e1000 use larger buffers and
> stuff. Can anyone suggest a fix for the driver then or at least an improvement
> on the modprobe.conf settings above as a work around? The Wife Allocated Time, WAT, has well been spent. ;-)

> Jul 29 15:26:41 mowgli kernel: TDH <7f>
> Jul 29 15:26:41 mowgli kernel: TDT <7f>
> Jul 29 15:26:41 mowgli kernel: next_to_use <7f>

From this output, which you posted in a previous entry to this bug, I can tell that your hardware is actually not hanging. The driver is waiting for a bit to be set that the hardware almost assuredly wrote, but for some reason it never shows up in host memory.

We actually see a few of these issues. It is not related to just AMD platforms, but it seems that the VIA KT600 chipsets in particular were very prone to this problem.
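For readers who want to repeat the check on the register dump quoted above, here is a minimal sketch. It rests on an assumption that is not spelled out in this thread: TDH is the transmit descriptor head that the adapter advances, TDT is the tail that the driver advances, and equal values mean the adapter has consumed every descriptor it was given. The function name ring_state and its output labels are hypothetical.

```shell
#!/bin/sh
# Sketch only: ring_state and its labels are hypothetical. It assumes
# TDH == TDT means the adapter has processed all queued descriptors.

# ring_state: reads kernel log lines on stdin and compares the last
# "TDH <..>" and "TDT <..>" values it finds.
# Prints "drained", "pending", or "incomplete" (exit status 1).
ring_state() {
    awk '
        /TDH[ \t]*</ { tdh = $NF }
        /TDT[ \t]*</ { tdt = $NF }
        END {
            if (tdh == "" || tdt == "") { print "incomplete"; exit 1 }
            gsub(/[<>]/, "", tdh)
            gsub(/[<>]/, "", tdt)
            # Append "" to force a string comparison of the hex digits.
            if ((tdh "") == (tdt "")) print "drained"
            else print "pending"
        }'
}
```

Something like `grep -A 12 'Detected Tx Unit Hang' /var/log/messages | ring_state` could feed it a dump, with the caveat that the number of lines the driver prints after the message may differ between driver versions.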
In almost all cases there is something misconfigured in the chipset by the BIOS that causes these writes to host memory from the e1000 adapter to disappear.

I have a driver patch that can attempt to work around this issue at the cost of slightly higher CPU utilization for all transmit cleanup. Are you interested in trying it?

(In reply to comment #7)
> (In reply to comment #3)
> > I swapped the cards between two computers. The card that was rock solid in one
> > computer started failing in the other. I noticed this pattern on several other
> > posts i.e http://lkml.org/lkml/2005/12/19/144
> > http://www.gatago.com/linux/kernel/14660762.html :
>
> probably indicates that the physical chip itself is not to blame.

OK. That is good news.

> > Working System/Card
> > amd XP 1800+
> > on a pc133 memory system
> > 512 Meg.

Additional information on working systems. None of the hardware is pushed.

System kaa
ECS K7S5A Release 11/21/2001 S
Bios 62-1121-001131-00101111-040201-SiS735-K7S5A
AMD Athlon XP 1800+
Blue DIMM slots filled = 512 Meg of PC2100
SDR/DDR CAS Latency SPD
SDR/DDR RAS Active Time 6T
SDR/DDR RAS Precharge Time 4T
Auto Detect DIMM/PCI CLK enabled

System bagheera (As reported in this bug)
ECS K7S5A
ECS K7S5A Release 10/29/2002 S
Bios 62-1029-001131-00101111-040201-SiS735-K7S5A
AMD Athlon XP 1800+
Blue DIMM slots filled = 512 Meg of PC2100 (Correction: not pc133)
DRAM/CPU 133/133 MHZ
SDR/DDR CAS Latency SPD
SDR/DDR RAS Active Time 6T
SDR/DDR RAS Precharge Time 4T
Auto Detect DIMM/PCI CLK enabled

> >
> > Failing System/Card
> > AMD Sempron 2600+
> > 400 Front Side Bus
> > 1 gig memory in two matching 512Meg Dimm # It hurt when I bought it.
> > = Detected Tx Unit Hang
>

Additional information on failing system. Hardware is not pushed.
System mowgli
ECS K7FSB KT600-A Ver:1.1E 09/13/2004
has original Bios of 09/13/2004-KT600-8237-6A6LYE1FC-00
AMD Sempron 2600+
Current FSB Frequency 166MHZ
Current DRAM Frequency 200MHZ
DRAM Timing Auto by SPD
DRAM CAS Latency 2.5
Bank Interleave 2 bank
Precharge to Active (TRP) 5T
Active to Precharge (TRAS) 7T
Active to CMD (TRCD) 5T
Dram Burst Length 4
Dram command Rate 2T command
Write Recovery Time 3T

> are these systems identical in every way besides the processor? You are most
> likely running into a bios problem with how it configures the chipset for
> "failing" system.

Additional system information above. Systems kaa and bagheera are almost identical and have no problems with the current e1000 drivers. These two systems have the Athlon XP 1800+ chips. System mowgli is the Sempron 2600+ with the problem. System mowgli is newer and has twice the memory, etc.

> > options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3
> > RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0
> >
> > Hopefully this provides some insight into the problem. The card works great on
>
> well your machine doesn't have an 82573, so it doesn't need the eeprom fix. You
> likely need a newer version of ethtool (application) to work correctly with the

Understood that it is not critical. ethtool on FC 5 is ethtool-3-1.2.1 based on rpm -q --whatprovides ethtool. Should this be updated for FC6, or for FC5 for that matter? I don't see how to check package versions in the FC6 stuff. The SF site says that ethtool 4 and 5 were just released on 9/1/2006: http://sourceforge.net/project/showfiles.php?group_id=3242 . I know this is not an Intel question, but should a blocker be made for FC6?

> eeprom dump, but anyway that is irrelevant to the TX hang discussion here.
>
> > Theory: The faster hardware requires that the e1000 use larger buffers and
> > stuff. Can anyone suggest a fix for the driver then or at least an improvement
> > on the modprobe.conf settings above as a work around? The Wife Allocated
> Time, WAT, has well been spent. ;-)
>
> > Jul 29 15:26:41 mowgli kernel: TDH <7f>
> > Jul 29 15:26:41 mowgli kernel: TDT <7f>
> > Jul 29 15:26:41 mowgli kernel: next_to_use <7f>
>
> From this output you posted in a previous entry to this bug, I can tell that
> your hardware is actually not hanging. The driver is waiting for a bit to be
> set that the hardware almost assuredly wrote, but for some reason never shows
> up in host memory.

Is this also a Xen problem along with a driver and BIOS problem? http://wiki.xensource.com/xenwiki/XenFaq#head-4ce9767df34fe1c9cf4f85f7e07cb10110eae9b7 All three computers are running Xen.

>
> We actually see a few of these issues, it is not related to just AMD platforms
> but it seems that in particular the VIA KT600 chipsets were very prone to have
> this problem. In almost all cases there is something misconfigured in the
> chipset by the BIOS that causes these writes to host memory from the e1000
> adapter to disappear.
>
> I have a driver patch that can attempt to work around this issue at the cost of
> slightly higher cpu utilization for all transmit clean up, are you interested to
> try?

Tee hee, "slightly higher CPU utilization". I don't think mowgli works too hard as a desktop machine. I'd be happy to try the driver. In a controlled way, I can try a BIOS update first before trying your new driver, or I can go straight to the driver. Do you have any preference?

> Understood that it is not critical. ethtool on FC 5 is ethtool-3-1.2.1 based on
> rpm -q --whatprovides ethtool. Should this be updated for FC6 or for FC5 for
> that matter? I don't see how to check in the FC6 stuff about package versions.
> SF site says that ethtool 4 and 5 was just released on 9/1/2006
> http://sourceforge.net/project/showfiles.php?group_id=3242 I know this is not
> an Intel question but should a blocker be made for FC6?
>

Bug 205000 was created for this concern.

Update: Jesse provided me with a driver to test. The driver has been in use for 48 hours without problems. During this time I had my son try some of the things he did before: Battle for Wesnoth, web surfing, and playing full wav audio from the NFS server. The version of the driver is 7.3.15_tdhdump-NAPI.

The driver was installed as follows:
cp the tar file to /usr/src/redhat/SOURCES/e1000-7.3.15tdh.tar.gz
untar the file.
rpmbuild -ba e1000.spec
rpm -ivh /usr/src/redhat/RPMS/i386/e1000-7.3.15tdh-1.i386.rpm
reboot.

I still had my workaround options of

options e1000 XsumRX=0 Speed=1000 Duplex=2 InterruptThrottleRate=0 FlowControl=3 RxDescriptors=4096 TxDescriptors=4096 RxIntDelay=0 TxIntDelay=0

in the /etc/modprobe.conf file. I will comment these out and see how the stock options work for this version of the driver.

Since 10/15/2006 at 12:04 the modprobe options have been removed, as noted in comment #10. I still do not have any of the Tx hang issues reported in the /var/log files. The 7.3.15_tdhdump-NAPI driver appears to have solved the issues reported in this and other bug reports. Later this week I can try a massive copy of, say, ISO files as an additional test. Note that even simple web surfing could generate Tx hang issues. Hence, the multiple ISO file copy at one time should be no problem at this point. Thank you for the resolution to this problem and for allowing me to participate in the formation of a solution.

This problem has been resolved in Fedora 7. I did not have time to install Fedora 6 on this hardware configuration, so I don't know whether Fedora 6 had a resolution. Thanks to all the people that assisted me.

Regards,
Greg

Is it possible to obtain the fixed version of this driver from anywhere? Is it equivalent to http://people.redhat.com/agospoda/rhel5/e1000-7.3.15tdh.patch?