+++ This bug was initially created as a clone of Bug #398921 +++

It looks like bug 398921 resurfaced :(

My system was working perfectly for over a year. A couple of days ago I updated the kernel to the newest one available for Fedora 10 and rebooted. The first time I tried to transfer some larger files the system completely locked up and I had to power-cycle it. From this point on I have not been able to transfer any larger file across my network without
a) getting “e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang”
b) complete lock up

I've tried to use the older kernel that used to work fine, but it has the same problem now. I've just updated to Fedora 11, but I still have the same problem. This is 100% reproducible if I start to transfer some large file. I've also tried the module option “InterruptThrottleRate=0”, but it made no difference.

Any ideas on how to fix this?

Current system: kernel-2.6.29.4-167.fc11.i586
Mainboard: Asus P4C800-E Deluxe

/var/log/messages:
Jun 9 21:29:10 linux kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 9 21:29:10 linux kernel: Tx Queue <0>
Jun 9 21:29:10 linux kernel: TDH <d1>
Jun 9 21:29:10 linux kernel: TDT <d6>
Jun 9 21:29:10 linux kernel: next_to_use <d6>
Jun 9 21:29:10 linux kernel: next_to_clean <d1>
Jun 9 21:29:10 linux kernel: buffer_info[next_to_clean]
Jun 9 21:29:10 linux kernel: time_stamp <fffe40b6>
Jun 9 21:29:10 linux kernel: next_to_watch <d2>
Jun 9 21:29:10 linux kernel: jiffies <fffe4570>
Jun 9 21:29:10 linux kernel: next_to_watch.status <0>
Jun 9 21:29:12 linux kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 9 21:29:12 linux kernel: Tx Queue <0>
Jun 9 21:29:12 linux kernel: TDH <d1>
Jun 9 21:29:12 linux kernel: TDT <d6>
Jun 9 21:29:12 linux kernel: next_to_use <d6>
Jun 9 21:29:12 linux kernel: next_to_clean <d1>
Jun 9 21:29:12 linux kernel: buffer_info[next_to_clean]
Jun 9 21:29:12 linux kernel: time_stamp <fffe40b6>
Jun 9 21:29:12 linux kernel: next_to_watch <d2>
Jun 9 21:29:12 linux kernel: jiffies <fffe4d40>
Jun 9 21:29:12 linux kernel: next_to_watch.status <0>
Jun 9 21:29:14 linux kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun 9 21:29:14 linux kernel: Tx Queue <0>
Jun 9 21:29:14 linux kernel: TDH <d1>
Jun 9 21:29:14 linux kernel: TDT <d6>
Jun 9 21:29:14 linux kernel: next_to_use <d6>
Jun 9 21:29:14 linux kernel: next_to_clean <d1>
Jun 9 21:29:14 linux kernel: buffer_info[next_to_clean]
Jun 9 21:29:14 linux kernel: time_stamp <fffe40b6>
Jun 9 21:29:14 linux kernel: next_to_watch <d2>
Jun 9 21:29:14 linux kernel: jiffies <fffe5510>
Jun 9 21:29:14 linux kernel: next_to_watch.status <0>
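The hex fields in each hang report can be read off mechanically. The following is a minimal sketch that decodes the first report: it computes how many descriptors sit between the hardware head (TDH) and tail (TDT), and how old the stuck buffer is. The ring size of 256 and the tick rate of HZ=1000 are assumptions, not values stated in the report, and the helper names are illustrative only.

```python
import re

# Fields of the first "Detected Tx Unit Hang" report, values in hex.
REPORT = """\
TDH <d1>
TDT <d6>
next_to_use <d6>
next_to_clean <d1>
time_stamp <fffe40b6>
next_to_watch <d2>
jiffies <fffe4570>
next_to_watch.status <0>
"""

RING_SIZE = 256  # assumption: number of Tx descriptors in the ring
HZ = 1000        # assumption: kernel tick rate, not given in the report


def parse_report(text):
    """Return {field: int} for every 'name <hex>' pair in a hang report."""
    pairs = re.findall(r"([\w.]+) <([0-9a-f]+)>", text)
    return {name: int(value, 16) for name, value in pairs}


def pending_descriptors(fields, ring_size=RING_SIZE):
    """Descriptors handed to the NIC (tail) but not yet consumed (head)."""
    return (fields["TDT"] - fields["TDH"]) % ring_size


def stall_seconds(fields, hz=HZ):
    """Age of the oldest uncleaned buffer, allowing for 32-bit jiffies wrap."""
    ticks = (fields["jiffies"] - fields["time_stamp"]) & 0xFFFFFFFF
    return ticks / hz


fields = parse_report(REPORT)
print(pending_descriptors(fields))  # 5
print(stall_seconds(fields))        # 1.21
```

In all three reports time_stamp, TDH and TDT are unchanged while jiffies advances, so under these assumptions the same descriptor is still outstanding each time the watchdog fires.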
Hi Thomas,

Yes, "Tx Unit Hang" is something we have seen before. Could you please provide me with the output of the following, so that I know what you have in your system:

lspci -tv
lspci -vvv -xxx
ethtool -i eth0
ethtool -e eth0
cat /proc/cpuinfo

Additional information might help. Can you provide more detail on the transfer, please? Were you sending from or receiving to the problem interface? Were you using ftp, nfs, http, or something else? How large are the files, and about how long does it take for the failure to occur once you start the test?

I have already started trying to reproduce your failure, and when I get this information I will have a more focussed shot at the repro. If I get this info and still can't get a repro within a few hours, I would then like to send you a test driver to gather more information.

Dave
Created attachment 347149 [details] Output of lspci -tv
Created attachment 347151 [details] cpuinfo
Created attachment 347152 [details] Output of ethtool -e eth0
Created attachment 347153 [details] Output of ethtool -i eth0
Created attachment 347154 [details] Output of lspci -vvv -xxx
The system is used as a NAT box and a samba file server, and the Intel NIC is on the internal side of the network.

Copying a file via samba from the system to another machine seems to trigger this immediately if the file size is at least a couple of MB. Copying a file via scp from the system hangs after a few MB. It also happens (after some seconds) when I try to upload a file from a different computer via this box to an ftp server on the internet.

Web browsing and receiving/sending emails through the box seem to work (mostly) fine; however, it also happened when I tried to upload a file (~250kb) via HTTP using a form.

It looks like it's the act of "trying to transfer a not-too-small bunch of data as fast as possible" that's causing this.
Thanks for the information so far. I have not yet been able to see the Transmit Hang that you report. That may be because I have not got exactly your configuration. I'll continue trying.

In the meantime, here's an idea. The issue may be related to TSO, which is enabled by default on this driver/kernel. TSO, or "TCP Segmentation Offload", allows packets larger than the typical 1514-byte Ethernet packet size to be handled by the driver, which is then responsible for their ultimate segmentation before sending to the wire. In our case the segmentation task itself is offloaded to the NIC, so there is an efficiency gain, partly from offloading the segmentation process itself from the CPU, and partly from the reduced number of TX stack traversals per packet. But for maximal performance gain, the NIC must be able to cache at least 2 TX frames (typically 64KB each) for segmentation, and this older silicon might not have a large enough TX FIFO to do that properly, or the driver hasn't properly taken the FIFO size into account. I'm trying to find out.

But there's a good chance that TSO is somehow involved, and it's easy to find out, because we can disable it:

# ethtool -k eth0           # show initial offload capabilities
# ethtool -K eth0 tso off   # disable TSO

Please do this, and see if the problem is resolved. That is not the root cause of the bug, but it might be an acceptable workaround for now, and it will help me a lot to know where to focus. (Maybe we will simply always disable TSO for this part.)

If you do see that the problem is resolved by disabling TSO, then you may also witness a performance drop. In my testing (netperf TCP_STREAM test), I see a TX drop from about 628 Mbps to 480 Mbps when I disable TSO. I was able to regain some of that loss (back to 560 Mbps) by also disabling a related kernel stack feature, GSO ("Generic Segmentation Offload"). If you need to, you could do the same thing, using:

# ethtool -K eth0 gso off   # disable GSO

Dave
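To put rough numbers on the TSO description above, here is a small worked example. It takes the 64KB frame size and the "at least 2 frames" FIFO requirement from the comment; the 1500-byte MTU and the 40 bytes of IPv4 and TCP headers (no options) are assumptions chosen for illustration.

```python
import math

TSO_FRAME_BYTES = 64 * 1024  # "typically 64KB each", from the comment above
FRAMES_TO_CACHE = 2          # "at least 2", from the comment above
MTU = 1500                   # assumption: standard Ethernet MTU
IP_TCP_HEADERS = 40          # assumption: IPv4 + TCP headers, no options
MSS = MTU - IP_TCP_HEADERS   # payload bytes carried by one wire packet


def wire_segments(payload_bytes, mss=MSS):
    """Number of wire packets needed to carry one large TSO frame."""
    return math.ceil(payload_bytes / mss)


segments = wire_segments(TSO_FRAME_BYTES)
fifo_bytes = FRAMES_TO_CACHE * TSO_FRAME_BYTES

print(segments)    # 45
print(fifo_bytes)  # 131072
```

Under these assumptions one trip through the TX stack replaces 45, which is where the efficiency gain comes from, and caching two such frames would need about 128KB of TX FIFO.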
I hate to disappoint you, but it looks like tso is already disabled by default...

After a reboot I get:

# ethtool -k eth0
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

As an additional note: The device is part of a bridge:

# brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.0008543de303       no              eth0
                                                        eth2
                                                        tap_vpn_udp
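When comparing offload settings across several test runs, it can help to reduce the `ethtool -k` output to the list of features that are still on. This is a minimal sketch that parses the text pasted above; the function name is illustrative, and it only looks at lines of the form `name: on` or `name: off`.

```python
# Output of "ethtool -k eth0" as pasted in the comment above.
ETHTOOL_K = """\
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
"""


def parse_offloads(text):
    """Map each offload name to True (on) or False (off)."""
    offloads = {}
    for line in text.splitlines():
        name, sep, state = line.partition(":")
        state = state.strip()
        # Skip the header and the error line, which are not on/off settings.
        if sep and state in ("on", "off"):
            offloads[name.strip()] = state == "on"
    return offloads


offloads = parse_offloads(ETHTOOL_K)
enabled = [name for name, is_on in offloads.items() if is_on]
print(enabled)
```

For the output above this leaves the two checksum offloads, scatter-gather and GSO as the features still enabled.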
Thanks Thomas.

Did you try disabling GSO too? I read that GSO was only enabled by default as of 2.6.26, so it may be a new factor. I really don't know, but it's easy to disable for a quick test.

It's interesting that you have both the RTL-8169 and the 82547EI in the bridge. Again, I don't know how that might be relevant, but could you please try your simplest test again with the 82547EI directly, out of the bridge? What is tap_vpn_udp?

I reread your original note and see that you still had the TX Hang even after you switched back to the older kernel version. That's weird. Can you think of anything else that changed in your configuration when the problem first appeared?

I have now got hold of an 82547EI, so I will try again to repro what you see.
The bridge configuration is a bit misleading because it changed yesterday... Previously the bridge only contained the 82547EI and tap_vpn_udp, which is a tap interface used by openvpn. The RTL-8169 was connected to my cable modem. However, as the 82547EI is currently more or less unusable, I was forced to add another NIC (an RTL-8139). Now the RTL-8139 is connected to the cable modem, and the RTL-8169 was added to the bridge and connected to the internal network instead of the 82547EI.

I've just tried to disable the bridge and connect the 82547EI directly to my network, but the system hung practically immediately when I tried to connect to it. (I couldn't even get a list of files within a directory via samba.) :( Also, disabling GSO made no difference.

I can't remember anything else that changed before this started, sorry. My theory was that the new driver (or specifically 2.6.27.24-170.2.68.fc10.i686) might have changed some default settings that are now stored persistently, but that's pure speculation. I should have some time over the weekend to try some different kernel versions again... maybe I'll notice something.

I also tried to enable debugging on the e1000 driver using "options e1000 debug=16", but I didn't see any additional messages. Do I need a special tool to get that debugging information?

What complicates this, though, is that currently the system more often hangs completely instead of just producing a "Detected Tx Unit Hang", and I then have to power-cycle it... :(
I'm completely lost... I've just tried the vanilla kernels 2.6.26, 2.6.27 and 2.6.29... every one of them failed with the same symptoms. TSO on the e1000 defaults to being disabled on all of them, and I can't activate it either (if I try, I get "Cannot set device tcp segmentation offload settings: Invalid argument").

During all those tests the 82547EI was *not* part of a bridge but directly connected to the network.
Yes, it's odd that the older (back to 2.6.26) vanilla kernels fail. I'm loading up FC11 on my 82547EI now.

I looked in the code to see about ethtool, and made sense of some of your results. It turns out that the 82547EI does not support TSO! (The function e1000_set_tso() in e1000_ethtool.c explicitly refuses to enable it.) Oh well, that's another good reason for me to be running on the right HW. I'm sorry I wasted your time on that aspect of the issue.

A look at the 7.3.21-k3 driver shows that it is also out of sync with our Sourceforge driver. We try to keep the drivers as synchronized as possible, but we could do a better job, and it's possible that one of the more recent changes applied to our SF driver resolves the issue. I notice, for instance, additional locking in the SF driver around use of the function e1000_82547_fifo_workaround(), and that isn't in the in-kernel-tree version.

Could you go to our SF site, https://sourceforge.net/project/showfiles.php?group_id=42302, and download and install the latest standalone driver (e1000-8.0.13)? It's possible that this will do the trick.

I expect to have FC11 loaded on my 82547EI system shortly too, so I will have a good shot at a repro soon.

Dave
I've installed the standalone driver from SF, but it also fails with "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang" :(

$ ethtool -i eth0
driver: e1000
version: 8.0.13
firmware-version: N/A
bus-info: 0000:02:01.0

Perfect way to reproduce this is by scp'ing some file to another system. This always results in an error within 1 to 3 seconds. And most of the time it's "only" the Tx Unit Hang and not the complete lock-up.

I don't know... maybe it's really just dying hardware...
I get the same messages in log files. I'll attach in a sec...
Created attachment 347862 [details] ethtool -e eth1
Created attachment 347863 [details] ethtool -i eth1
Created attachment 347864 [details] lspci -tv
Created attachment 347865 [details] lspci -vvv -xxx
Created attachment 347866 [details] cat /proc/cpuinfo
Created attachment 347867 [details] /var/log/messages
Hi Ben,

You have a very different issue than Thomas. Yes, it's a transmit timeout, and so in some ways similar, but it is a different type, and it's on 82541PI network Si on an AMD-based platform (whereas Thomas's is on 82547EI on an Intel ICH5 platform).

But (Ben) I do think that your issue might match one or more of the other bugs already in the forum, and hopefully one that is fixed in the latest driver release... but I'm getting ahead of myself. Let's continue this thread as another bug, if needed.
Thomas,

Thanks for trying the SF 8.0.13 driver. I got my 82547EI system up under FC11. Here's the essential data from my system, and it's a pretty close match to yours. I'm even still using the stock FC11 driver, which was in your original report.

# uname -a
Linux drgraha1-tan 2.6.29.4-167.fc11.i686.PAE #1 SMP Wed May 27 17:28:22 EDT 2009 i686 i686 i386 GNU/Linux

# ethtool -i eth0
driver: e1000
version: 7.3.21-k3-NAPI
firmware-version: N/A
bus-info: 0000:01:01.0

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

# lspci -tv
-[0000:00]-+-00.0  Intel Corporation 82865G/PE/P DRAM Controller/Host-Hub Interface
           +-02.0  Intel Corporation 82865G Integrated Graphics Controller
           +-03.0-[0000:01]----01.0  Intel Corporation 82547EI Gigabit Ethernet Controller
           +-06.0  Intel Corporation 82865G/PE/P Processor to I/O Memory Interface
           +-1d.0  Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1
           +-1d.1  Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2
           .....

In initial testing, I find that I can reliably TX large files using scp from the SUT, with no hangs or error messages in the system log. I didn't expect the transfer to be so slow, but it may be normal on this older silicon. What transfer rate are you seeing (if you can ever get a file of decent size to TX without a timeout)?

[root@drgraha1-tan work]# scp bigfile root.3.50:/home/drgraha1/work/.
root.3.50's password:
bigfile                    100% 1059MB  18.6MB/s   00:57

You question a possible HW issue. It's possible, but I wonder why it would appear only when you advanced your kernel. That's pretty suspicious. Let's try a quick run of the ethtool test package:

ethtool -t eth0

All tests should pass within 30 seconds, with result 0. If they don't, it likely *is* a Si issue. If they do, then I'm going to prepare that debug driver I mentioned last week, to see if we can pull some more useful info from the system.

Dave
Thomas,

I have attached a debug patch that will collect more information (to the system message log) to the patches section of our e1000 sourceforge site:

http://sourceforge.net/tracker/?func=detail&aid=1460945&group_id=42302&atid=447451

If you are not familiar with applying patches, let me know and I'll provide more detail. If you do apply the patch and manage to get what looks like a good data dump (in /var/log/messages), then just attach the output here, and I'll have a look at it.

Thanks
Dave
I'm currently unable to transfer a file large enough to test the transfer rate, but if I remember correctly, about 18-20MB/s is what I used to get here too.

I've just executed ethtool -t and it doesn't look good :(

The test result is FAIL
The test extra info:
Register test  (offline)         0
Eeprom test    (offline)         0
Interrupt test (offline)         0
Loopback test  (offline)         13
Link test   (on/offline)         0

I've also tried to apply the patch you referred to, but it's against driver version 7.0.33, which doesn't compile for kernel 2.6.29. I tried to adapt it to the latest 8.0.13 driver, but even after coping with removed/renamed defines I can't compile it due to some errors:

make -C /lib/modules/2.6.29.4-167.fc11.i586/build SUBDIRS=/usr/src/e1000-8.0.13/src modules
make[1]: Entering directory `/usr/src/kernels/2.6.29.4-167.fc11.i586'
  CC [M]  /usr/src/e1000-8.0.13/src/e1000_main.o
/usr/src/e1000-8.0.13/src/e1000_main.c: In function 'e1000_dump':
/usr/src/e1000-8.0.13/src/e1000_main.c:3254: error: 'struct e1000_adapter' has no member named 'rx_ps_pages'
/usr/src/e1000-8.0.13/src/e1000_main.c:3261: warning: initialization from incompatible pointer type
/usr/src/e1000-8.0.13/src/e1000_main.c:3275: warning: format '%016llX' expects type 'long long unsigned int', but argument 7 has type 'dma_addr_t'
/usr/src/e1000-8.0.13/src/e1000_main.c:3291: warning: initialization from incompatible pointer type
make[2]: *** [/usr/src/e1000-8.0.13/src/e1000_main.o] Error 1
make[1]: *** [_module_/usr/src/e1000-8.0.13/src] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.29.4-167.fc11.i586'
make: *** [default] Error 2
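Since every self-test is expected to report result 0, the failing entries in the `ethtool -t` output can be picked out automatically. The sketch below parses the result lines quoted above; the regular expression and function name are illustrative and assume each result line has the shape "name (mode) code".

```python
import re

# Result section of "ethtool -t eth0" as reported above.
ETHTOOL_T = """\
The test result is FAIL
The test extra info:
Register test  (offline)         0
Eeprom test    (offline)         0
Interrupt test (offline)         0
Loopback test  (offline)         13
Link test   (on/offline)         0
"""

# One result line: test name, mode in parentheses, numeric result code.
RESULT_LINE = re.compile(r"^(.+?)\s+\(([^)]*)\)\s+(\d+)$", re.MULTILINE)


def failing_tests(text):
    """Return {test name: result code} for every test with a nonzero code."""
    return {name: int(code)
            for name, _mode, code in RESULT_LINE.findall(text)
            if int(code) != 0}


failures = failing_tests(ETHTOOL_T)
print(failures)  # {'Loopback test': 13}
```

Only the loopback test reports a nonzero code here; the register, EEPROM, interrupt and link tests all return 0.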
Hmm, I thought I developed it against the 8.0.13 SF driver. I probably made a mistake. Let me check. And I'll look into the significance of that loopback test failure too. I don't get it on my platform.
Ah, I think I was somewhat confused... I see your comment from yesterday now, sorry for the confusion. I'll apply the patch and get back to you when I had a chance to test it.
Your patch worked fine, but I downloaded the wrong file at first, sorry again. I'll attach the output in a moment...
Created attachment 348154 [details] output of driver with debug patch
Thanks Thomas,

It's a good debug dump, and it shows a real problem. So does the loopback test, which reports a data miscompare (or maybe a timeout; the failure paths aren't too cleanly implemented).

I'll get the most out of the dump file, and will be consulting with a few colleagues. There's still no clear indication of whether this is a driver issue or your HW. I don't expect any quick breakthrough.

I will get back to you later today with a summary of what I've found, and any new plan. I wish I were able to get a repro here, but alas I am not. Are you OK with continuing to help me debug?

Dave
Well, there are a couple of things to try. Neither directly addresses the root cause, but they may work for you:

1) Use module load parameter TxDescPower=9. [You'll find instructions for how to apply this in the README file in the e1000-8.0.13 install directory.] This will reduce the max chunk size of data sent from the host to the NIC on the PCI-X bus. We have had other TX hangs due to silicon errata that this has worked around. I am not sure if the 82547EI is one of the affected parts (I am in contact with others to find out), but it's worth a shot.

2) Disable TX Checksum offload: "ethtool -K eth0 tx off". Apologies if you already tried this one.

Please also capture another couple of TX hang reports like the one you attached before. I might see a significant pattern by looking at a few.
My system is currently working stable with the additional NIC I installed, so it's no longer extremely urgent. However, it only takes a moment to reconfigure everything to be able to test the 82547EI and I'm still interested in the root cause of this, so I'll continue to help you of course :) If the conclusion will be a hw error I'll disable it for good, if it's a driver bug that can be fixed somehow, I'll be more than happy *g* Thanks for your efforts :) Option TxDescPower=9 or disabling TX Checksum offload did not resolve the issue. I think I tried disabling all offload options a few days ago without any success. I'll attach the debugging output in a moment.
Created attachment 348212 [details] output of driver with debug patch
Thomas,

Thanks for the additional testing. I have conferred with colleagues and it seems that you most probably do have an issue with the network silicon, or maybe the platform.

The debug dumps show that the driver is doing what it is supposed to, which is to wait for each TX "descriptor" to be returned by the HW before recycling the descriptor to its available pool. In your case, we can see that the driver is waiting for a descriptor, and we can see that descriptor in the HW cache of the descriptors, but it is not being written back to host memory with a "Done" indication. The driver eventually times out and reports the TX Hang (and in the debug driver case prints all the debug stuff).

That the NIC has a problem ties in with the reported failure of the loopback test, and with the fact that the problem remains when you go back to an older kernel/driver. Also, you are (OK, so far) the only person reporting exactly this issue.

I do notice that you still have TX checksumming enabled (from the dumps), but this is not likely to be related to your issue. Again though, why not disable it (another ethtool -K variant)?

One of my colleagues suggested that this may be a temperature-related issue. It is summer now, after all, so it's possible. Some of these older parts are sensitive to temperature. But that's about all I've been able to come up with.
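The recycle-and-timeout behaviour described above can be illustrated with a toy model. This is a simplified sketch, not the e1000 driver's code: the class and field names are hypothetical, the one-second timeout in ticks is an assumption, and the ring never checks for overflow.

```python
from dataclasses import dataclass

HANG_TIMEOUT_TICKS = 1000  # assumption: roughly one second at HZ=1000


@dataclass
class TxDescriptor:
    time_stamp: int = 0  # tick at which the driver queued this buffer
    done: bool = False   # set when the NIC writes the descriptor back


class TxRing:
    """Toy transmit ring; overflow handling is deliberately left out."""

    def __init__(self, size):
        self.desc = [TxDescriptor() for _ in range(size)]
        self.next_to_use = 0    # next slot the driver will fill
        self.next_to_clean = 0  # oldest slot not yet recycled

    def queue(self, now):
        """Hand one descriptor to the (simulated) NIC."""
        d = self.desc[self.next_to_use]
        d.time_stamp, d.done = now, False
        self.next_to_use = (self.next_to_use + 1) % len(self.desc)

    def clean(self, now):
        """Recycle completed descriptors; return True if a hang is detected."""
        while self.next_to_clean != self.next_to_use:
            d = self.desc[self.next_to_clean]
            if not d.done:
                # Still waiting on the NIC: a hang only once the wait is too long.
                return now - d.time_stamp > HANG_TIMEOUT_TICKS
            self.next_to_clean = (self.next_to_clean + 1) % len(self.desc)
        return False


ring = TxRing(8)
for _ in range(3):
    ring.queue(now=100)
ring.desc[0].done = True            # the NIC completes only the first one
hung_early = ring.clean(now=200)    # descriptor 1 pending, but not for long
hung_late = ring.clean(now=1500)    # descriptor 1 still pending after timeout
print(hung_early, hung_late, ring.next_to_clean)  # False True 1
```

In this model, as in the description above, the driver itself behaves correctly: the hang is reported only because a descriptor the hardware was given never comes back marked done.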
To be sure, I've just disabled all offloading settings:

# ethtool -k eth0
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off

However, the problem is still there. :( I'll attach the dump in a moment, just so it's saved with all the other ones...

I will then consider this NIC physically damaged and won't use it again until told otherwise (or maybe I'll try again next winter ;))

Thank you for your quick responses and your efforts. :)

I do have one more question though: Reading your explanation I can see how this TX Hang happens, but do you have any idea why my system also often locks up completely and then needs a power-cycle to get up again when I try to use this NIC?
Created attachment 348533 [details] output of driver with debug patch
Thanks Thomas,

I've looked at the new dump. All offloads are clearly disabled. Again, it simply looks like the NIC has stopped transmitting data. The simplest of the dumps to analyze is actually the second one.

Td[desc]  [address 63:0  ] ntw     TXDESCRIPTOR FIFO
Tc[0x000] 00000000353DE202 0 --  T7000: 353DE202|353DE202 8B00002A|8B00002A
Tc[0x001] 0000000035A3B8E6 1 --  T7010: 35A3B8E6|35A3B8E6 8B00004A|8B00004A
Tc[0x002] 0000000035A3B8EE 2 --  T7020: 35A3B8EE|35A3B8EE 8B000042|8B000042
Tc[0x003] 0000000035A3B8EE 3 --  T7030: 35A3B8EE|35A3B8EE 8B000042|8B000042
Tc[0x004] 0000322200000000 6 --  T7040: 00000000|00000000 21000000|21000000
Td[0x005] 0000000035A3B93E 5 --  T7050: 35A3B93E|35A3B93E 22100042|22100042
Td[0x006] 00000000348E2000 6 --  T7060: 348E2000|348E2000 AB100015|AB100015
Tc[0x007] 0000322200000000 A --  T7070: 00000000|00000000 21000000|21000000
Td[0x008] 0000000035A3B93E 8 --  T7080: 35A3B93E|35A3B93E 22100042|22100042
Td[0x009] 00000000348E2015 9 --  T7090: 348E2015|348E2015 22100200|22100200
Td[0x00A] 00000000348E2215 A --  T70A0: 348E2215|348E2215 AB100118|AB100118
Tc[0x00B] 0000322200000000 D NTC T70B0: 359D7345 (TXD FIFO DATA IS STALE !!!)
Td[0x00C] 0000000035A3B93E C
Td[0x00D] 00000000348E232D D
Tc[0x00E] 0000000035A3A0E2 E
Tc[0x00F] 0000000000000000 0 NTU

I've condensed the essential part of the dump above. We see that the NIC's TX Descriptor FIFO doesn't contain the TX descriptor that the driver is waiting to see completed. The FIFO lines up for the most part, up to element 00A, but 00B is not showing. We can see that the driver had properly informed the NIC that it *should* fetch this descriptor, so it should be in the Descriptor FIFO. Either the NIC DMA RX engine was hung, or the read by the NIC failed.

Why does the system sometimes hang? That's a pretty important question. We can guess. If the NIC is failing to read from host memory in this sample dump, there's something flaky with the Device/Host DMA, and that could have any number of consequences. A PCI bus error may be involved, which might cause an NMI. Or possibly, if a Receive Descriptor is corrupted by this issue, the RX DMA of packet data will be misdirected, and the NIC could then scribble to pretty much anywhere in host memory.

Yes, this is conjecture, but it does fit with the dumps. If I had a repro locally I'd certainly chase it down further.
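The "lines up until 00B" observation can be checked mechanically on a few rows of the condensed dump. The sketch below rests on an assumption drawn only from the rows shown: that the first FIFO dword mirrors the low 32 bits of the descriptor's first quadword. That holds for rows 000 through 00A and fails at 00B. The regular expression and function name are illustrative, not part of the debug patch.

```python
import re

# Four rows copied from the condensed dump above.
DUMP = """\
Td[0x009] 00000000348E2015 9 -- T7090: 348E2015|348E2015 22100200|22100200
Td[0x00A] 00000000348E2215 A -- T70A0: 348E2215|348E2215 AB100118|AB100118
Tc[0x00B] 0000322200000000 D NTC T70B0: 359D7345 (TXD FIFO DATA IS STALE !!!)
Td[0x00C] 0000000035A3B93E C
"""

# index, 64-bit descriptor word, ntw column, then optionally flag, FIFO
# offset and the first FIFO dword.
ROW = re.compile(
    r"T[cd]\[0x([0-9A-F]+)\]\s+([0-9A-F]{16})\s+\S+"
    r"(?:\s+\S+\s+T[0-9A-F]+:\s+([0-9A-F]{8}))?"
)


def first_stale(dump):
    """Index of the first row whose FIFO dword differs from the descriptor."""
    for line in dump.splitlines():
        m = ROW.match(line)
        if m is None or m.group(3) is None:
            continue  # rows without FIFO data cannot be compared
        index = int(m.group(1), 16)
        low_dword = int(m.group(2), 16) & 0xFFFFFFFF
        if int(m.group(3), 16) != low_dword:
            return index
    return None


stale = first_stale(DUMP)
print(hex(stale))  # 0xb
```

Under that assumption the first mismatch is at index 0x00B, the same descriptor the dump flags as NTC with stale FIFO data.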
I've got the same problem in my PC, too (same processor, same card). It may have something to do with the memory size. I'm running 4GB of RAM. Whenever I remove 2GB, the card works well.
To wrap up this issue, I should note that I never did find a driver issue, and I believe that this is not a SW bug but rather a HW issue. I sent Thomas an "Intel PRO/1000 XT Server Adapter" card to replace the failing interface. Thomas was up and running again with this replacement on 7/31/09. I am closing this issue.

Hi "vxworks". The 4GB/2GB aspect of this is interesting. Your issue may be the same as Thomas's, but a lot of issues look very similar in their original manifestation. Could you please file a new bug? I'll then get to it and be able to dedicate my attention to your symptoms. Thanks.
There is a known issue with some systems and PCI adapters, that can usually be fixed by applying the patch that removes DMA to/from addresses >=4GB. I'm working on a quick patch to add a module parameter for the > 4GB thing.
As additional information: My system only has 2GB RAM.