Bug 504873 - e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 11
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 513462
Reported: 2009-06-09 20:36 UTC by Thomas Müller
Modified: 2009-08-17 19:15 UTC (History)

Fixed In Version:
Clone Of: 398921
Environment:
Last Closed: 2009-08-17 17:41:42 UTC
Type: ---
Embargoed:


Attachments
Output of lspci -tv (1.03 KB, text/plain), 2009-06-10 04:23 UTC, Thomas Müller
cpuinfo (1.10 KB, text/plain), 2009-06-10 04:37 UTC, Thomas Müller
Output of ethtool -e eth0 (3.59 KB, text/plain), 2009-06-10 04:38 UTC, Thomas Müller
Output of ethtool -i eth0 (83 bytes, text/plain), 2009-06-10 04:39 UTC, Thomas Müller
Output of lspci -vvv -xxx (23.19 KB, text/plain), 2009-06-10 04:39 UTC, Thomas Müller
ethtool -e eth1 (487 bytes, application/octet-stream), 2009-06-14 21:40 UTC, ben thompson
ethtool -i eth1 (84 bytes, application/octet-stream), 2009-06-14 21:40 UTC, ben thompson
lspci -tv (1.62 KB, application/octet-stream), 2009-06-14 21:41 UTC, ben thompson
lspci -vvv -xxx (35.66 KB, application/octet-stream), 2009-06-14 21:41 UTC, ben thompson
cat /proc/cpuinfo (1.44 KB, application/octet-stream), 2009-06-14 21:42 UTC, ben thompson
/var/log/messages (97.57 KB, application/octet-stream), 2009-06-14 21:43 UTC, ben thompson
output of driver with debug patch (213.38 KB, text/plain), 2009-06-16 19:36 UTC, Thomas Müller
output of driver with debug patch (565.68 KB, text/plain), 2009-06-17 06:18 UTC, Thomas Müller
output of driver with debug patch (283.75 KB, text/plain), 2009-06-18 17:48 UTC, Thomas Müller

Description Thomas Müller 2009-06-09 20:36:04 UTC
+++ This bug was initially created as a clone of Bug #398921 +++

It looks like bug 398921 resurfaced :(

My system was working perfectly for over a year. A couple of days ago I updated
the kernel to the newest one available for Fedora 10 and rebooted.
The first time I tried to transfer some larger files the system completely
locked up and I had to power-cycle it.

From this point on I have not been able to transfer any larger file across my
network without
a) getting “e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang”
b) complete lock up

I've tried to use the older kernel that used to work fine, but it has the same
problem now.


I've just updated to Fedora 11, but I still have the same problem.

This is 100% reproducible if I start to transfer some large file.

I've also tried the module option “InterruptThrottleRate=0”, but it made no
difference.


Any ideas on how to fix this?


Current system:
kernel-2.6.29.4-167.fc11.i586
Mainboard: Asus P4C800-E Deluxe

/var/log/messages:
Jun  9 21:29:10 linux kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun  9 21:29:10 linux kernel:  Tx Queue             <0>
Jun  9 21:29:10 linux kernel:  TDH                  <d1>
Jun  9 21:29:10 linux kernel:  TDT                  <d6>
Jun  9 21:29:10 linux kernel:  next_to_use          <d6>
Jun  9 21:29:10 linux kernel:  next_to_clean        <d1>
Jun  9 21:29:10 linux kernel: buffer_info[next_to_clean]
Jun  9 21:29:10 linux kernel:  time_stamp           <fffe40b6>
Jun  9 21:29:10 linux kernel:  next_to_watch        <d2>
Jun  9 21:29:10 linux kernel:  jiffies              <fffe4570>
Jun  9 21:29:10 linux kernel:  next_to_watch.status <0>
Jun  9 21:29:12 linux kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun  9 21:29:12 linux kernel:  Tx Queue             <0>
Jun  9 21:29:12 linux kernel:  TDH                  <d1>
Jun  9 21:29:12 linux kernel:  TDT                  <d6>
Jun  9 21:29:12 linux kernel:  next_to_use          <d6>
Jun  9 21:29:12 linux kernel:  next_to_clean        <d1>
Jun  9 21:29:12 linux kernel: buffer_info[next_to_clean]
Jun  9 21:29:12 linux kernel:  time_stamp           <fffe40b6>
Jun  9 21:29:12 linux kernel:  next_to_watch        <d2>
Jun  9 21:29:12 linux kernel:  jiffies              <fffe4d40>
Jun  9 21:29:12 linux kernel:  next_to_watch.status <0>
Jun  9 21:29:14 linux kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Jun  9 21:29:14 linux kernel:  Tx Queue             <0>
Jun  9 21:29:14 linux kernel:  TDH                  <d1>
Jun  9 21:29:14 linux kernel:  TDT                  <d6>
Jun  9 21:29:14 linux kernel:  next_to_use          <d6>
Jun  9 21:29:14 linux kernel:  next_to_clean        <d1>
Jun  9 21:29:14 linux kernel: buffer_info[next_to_clean]
Jun  9 21:29:14 linux kernel:  time_stamp           <fffe40b6>
Jun  9 21:29:14 linux kernel:  next_to_watch        <d2>
Jun  9 21:29:14 linux kernel:  jiffies              <fffe5510>
Jun  9 21:29:14 linux kernel:  next_to_watch.status <0>
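
The fields in the reports above can be read mechanically. The following is a minimal Python sketch, not part of the driver, that extracts the hex values and computes how long the oldest descriptor has been pending (jiffies minus time_stamp) and how many descriptors sit between TDH and TDT. The helper names and the ring size of 256 are assumptions made for illustration.

```python
import re

# Matches "<name>   <hexvalue>" at the end of a report line, e.g. " TDH   <d1>".
FIELD = re.compile(r"(\S+)\s+<([0-9a-fA-F]+)>")

def parse_hang_report(lines):
    """Collect the hex fields of one 'Detected Tx Unit Hang' report."""
    fields = {}
    for line in lines:
        m = FIELD.search(line)
        if m:
            fields[m.group(1)] = int(m.group(2), 16)
    return fields

def stalled_jiffies(fields):
    """Age of the oldest pending buffer, safe across 32-bit jiffies wraparound."""
    return (fields["jiffies"] - fields["time_stamp"]) & 0xFFFFFFFF

def pending_descriptors(fields, ring_size=256):
    """Descriptors handed to the NIC but not yet fetched (ring_size is assumed)."""
    return (fields["TDT"] - fields["TDH"]) % ring_size

REPORT = """\
Jun  9 21:29:10 linux kernel:  TDH                  <d1>
Jun  9 21:29:10 linux kernel:  TDT                  <d6>
Jun  9 21:29:10 linux kernel:  time_stamp           <fffe40b6>
Jun  9 21:29:10 linux kernel:  jiffies              <fffe4570>
""".splitlines()

fields = parse_hang_report(REPORT)
print(stalled_jiffies(fields))      # 1210
print(pending_descriptors(fields))  # 5
```

For the three reports above this gives 1210, 3210 and 5210 jiffies against the same time_stamp, so the same buffer stays stuck while TDH and TDT never move. If the kernel ticks at 1000 Hz (an assumption about the build), that corresponds to roughly 1.2 s, 3.2 s and 5.2 s, which lines up with the two-second spacing of the log timestamps.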

Comment 1 dave graham 2009-06-09 23:44:42 UTC
Hi Thomas,

Yes, "Tx Unit Hang" is something we have seen before.

Could you provide me with output from the following please, so that I know what you have in your system.

lspci -tv
lspci -vvv -xxx
ethtool -i eth0
ethtool -e eth0
cat /proc/cpuinfo

Additional information might help.

Can you provide more detail on the transfer, please? Were you sending from or receiving on the problem interface? Were you using ftp, nfs, http, or something else? How large are the files, and about how long does it take for the failure to occur when you start the test?

I have already started trying to reproduce your failure, and when I get this information will have a more focussed shot at the repro. If I get this info and still can't get a repro within a few hours, I would then like to send you a test driver to gather more information.

Dave

Comment 2 Thomas Müller 2009-06-10 04:23:47 UTC
Created attachment 347149 [details]
Output of lspci -tv

Comment 3 Thomas Müller 2009-06-10 04:37:53 UTC
Created attachment 347151 [details]
cpuinfo

Comment 4 Thomas Müller 2009-06-10 04:38:50 UTC
Created attachment 347152 [details]
Output of ethtool -e eth0

Comment 5 Thomas Müller 2009-06-10 04:39:22 UTC
Created attachment 347153 [details]
Output of ethtool -i eth0

Comment 6 Thomas Müller 2009-06-10 04:39:59 UTC
Created attachment 347154 [details]
Output of lspci -vvv -xxx

Comment 7 Thomas Müller 2009-06-10 04:56:48 UTC
The system is used as a NAT box and a samba file server and the Intel NIC is at the internal side of the network.

Copying a file via samba from the system to another seems to immediately trigger this if the file size is at least a couple of MB.
Copying a file via scp from the system hangs after a few MB.
It also happens (after some seconds) when I try to upload a file from a different computer via this box to an FTP server on the internet.

Web browsing and receiving/sending emails through the box seems to work (mostly) fine; however, it also happened when I tried to upload a file (~250kb) via HTTP using a form.

It looks like it's the act of "trying to transfer a not-too-small bunch of data as fast as possible" that's causing this.

Comment 8 dave graham 2009-06-10 19:22:53 UTC
Thanks for the information so far. I have not yet been able to see the Transmit Hang that you report. That may be because I have not got exactly your configuration. I'll continue trying. In the meantime, here's an idea.

The issue may be related to TSO, which is enabled by default on this driver/kernel. TSO, or "TCP Segmentation Offload", allows packets larger than the typical 1514-byte Ethernet frame to be handed to the driver, which is then responsible for their ultimate segmentation before sending to the wire. In our case the segmentation task itself is offloaded to the NIC, so there is an efficiency gain, partly from offloading the segmentation process itself from the CPU, and partly from the reduced number of TX stack traversals per packet. But for maximal performance gain, the NIC must be able to cache at least two TX frames (typically 64KB each) for segmentation, and this older silicon might not have a large enough TX FIFO to do that properly, or the driver hasn't properly taken the FIFO size into account. I'm trying to find out. But there's a good chance that TSO is somehow involved, and it's easy to find out, because we can disable it.

# ethtool -k eth0           # show initial offload capabilities
# ethtool -K eth0 tso off   # disable TSO

Please do this, and see if the problem is resolved. That would not be the root cause of the bug, but it might be an acceptable workaround for now, and it will help me a lot to know where to focus. (Maybe we will simply always disable TSO for this part.)

If you do see that the problem is resolved by disabling TSO, then you may also see a performance drop. In my testing (netperf TCP_STREAM test), I see a TX drop from about 628 Mbps to 480 Mbps when I disable TSO. I was able to regain some of that loss (back to 560 Mbps) by also disabling a related kernel stack feature, GSO ("Generic Segmentation Offload"). If you need to, you could do the same thing using:

# ethtool -K eth0 gso off   # disable GSO

Dave
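
To put rough numbers on the comment above, here is a small Python sketch of the arithmetic. It assumes a TCP payload of 1460 bytes per segment (a 1500-byte MTU minus 40 bytes of IPv4 and TCP headers without options); that value and the helper names are illustrative and do not come from the driver.

```python
def segment_count(payload_bytes, mss=1460):
    """Number of TCP segments one large send is cut into (ceiling division)."""
    return -(-payload_bytes // mss)

def percent_drop(before_mbps, after_mbps):
    """Relative throughput loss in percent, rounded to one decimal place."""
    return round((before_mbps - after_mbps) / before_mbps * 100, 1)

print(segment_count(64 * 1024))  # 45 segments for one 64KB TSO frame
print(percent_drop(628, 480))    # 23.6 (TSO off)
print(percent_drop(628, 560))    # 10.8 (TSO and GSO off)
```

Under these assumptions a single 64KB frame saves 44 extra trips through the TX stack, which is where the efficiency gain described above would come from.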

Comment 9 Thomas Müller 2009-06-11 12:20:41 UTC
I hate to disappoint you, but it looks like tso is already disabled by default...

After a reboot I get:
# ethtool -k eth0
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off


As an additional note: The device is part of a bridge:
# brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.0008543de303       no              eth0
                                                        eth2
                                                        tap_vpn_udp
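
When comparing offload settings across several reboots or kernels, it can help to reduce the `ethtool -k` text to a list of enabled features. The sketch below is one hedged way to do that in Python; it only parses text in the format shown above, and the function name is made up for this example.

```python
def parse_offloads(text):
    """Map each 'name: on|off' line of `ethtool -k` output to a bool."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip the header and error lines, which do not end in on/off.
        if sep and value in ("on", "off"):
            settings[name.strip()] = (value == "on")
    return settings

OUTPUT = """\
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
"""

enabled = sorted(name for name, on in parse_offloads(OUTPUT).items() if on)
print(enabled)
```

For the output above, the enabled set is GSO, both checksum offloads and scatter-gather, with TSO already off.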

Comment 10 dave graham 2009-06-12 00:06:41 UTC
Thanks Thomas.
Did you try disabling GSO too? I read that GSO was only enabled by default in 2.6.26, so it may be a new factor. I really don't know, but it's easy to disable for a quick test. It's interesting that you have both the RTL-8169 and the 82547EI in the bridge. Again, I don't know how that might be relevant, but could you please try your simplest test again with the 82547EI directly, out of the bridge? And what is tap_vpn_udp?

I reread your original note and see that you still had the TX Hang even after you switched back to the older kernel version. That's weird. Can you think of anything else that changed in your configuration when the problem first appeared?

I have now got hold of an 82547EI, so will try again to repro what you see.

Comment 11 Thomas Müller 2009-06-12 05:54:53 UTC
The bridge configuration is a bit misleading because it changed yesterday...
Previously the bridge only contained the 82547EI and tap_vpn_udp, which is a tap interface used by OpenVPN.
The RTL-8169 was connected to my cable modem.

However, as the 82547EI is currently more or less unusable, I was forced to add another NIC (an RTL-8139).
Now the RTL-8139 is connected to the cable modem and the RTL-8169 was added to the bridge and connected to the internal network instead of the 82547EI.


I've just tried to disable the bridge and directly connect the 82547EI to my network, but the system hung practically immediately when I tried to connect to it. (Couldn't even get a list of files within a directory via samba) :(
Also, disabling GSO made no difference.

I can't remember anything else that changed before this started, sorry. My theory was that the new driver (or specifically 2.6.27.24-170.2.68.fc10.i686) might have changed some default settings that are now stored persistently, but that's pure speculation.

I should have some time over the weekend to try again some different kernel versions... maybe I'll notice something.
I also tried to enable debugging on the e1000 driver using "options e1000 debug=16", but I didn't see any additional messages. Do I need a special tool to get that debugging information?

What complicates this though is that currently the system more often completely hangs instead of just producing a "Detected Tx Unit Hang" and I have to power-cycle it then... :(

Comment 12 Thomas Müller 2009-06-12 20:42:01 UTC
I'm completely lost...

I've just tried the vanilla kernels 2.6.26, 2.6.27 and 2.6.29... every one of them failed with the same symptoms.
TSO of the e1000 defaults to being disabled on all of them and I can't activate it either (if I try I get "Cannot set device tcp segmentation offload settings: Invalid argument").

During all those tests the 82547EI was *not* part of a bridge but directly connected to the network.

Comment 13 dave graham 2009-06-12 22:11:59 UTC
Yes, it's odd that the older (back to 2.6.26) vanilla kernels fail.

I'm loading up FC11 on my 82547EI now.
I looked in the code to see about ethtool, and made sense of some of your results. It turns out that the 82547EI does not support TSO! (Function e1000_set_tso() in e1000_ethtool.c explicitly refuses to enable it.) Oh well, that's another good reason for me to be running on the right HW. I'm sorry I wasted your time on that aspect of the issue.

A look at the 7.3.21-k3 driver shows that it is also out of sync with respect to our SourceForge driver. We try to keep the drivers as synchronized as possible, but we could do a better job, and it's possible that one of the more recent changes applied to our SF driver resolves the issue. I notice, for instance, additional locking in the SF driver around use of the function e1000_82547_fifo_workaround(), and that isn't in the in-kernel-tree version. Could you go to our SF site and download and install the latest standalone driver (e1000-8.0.13) from https://sourceforge.net/project/showfiles.php?group_id=42302? It's possible that this would do the trick.

I expect to have FC11 loaded on my 82547EI system shortly too, so will have a good shot at repro soon.

Dave

Comment 14 Thomas Müller 2009-06-13 07:53:56 UTC
I've installed the standalone driver from SF, but it also fails with "e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang" :(

$ ethtool -i eth0
driver: e1000
version: 8.0.13
firmware-version: N/A
bus-info: 0000:02:01.0


Perfect way to reproduce this is by scp'ing some file to another system. This always results in an error within 1 to 3 seconds. And most of the time it's "only" the Tx Unit Hang and not the complete lock-up.

I don't know... maybe it's really just dying hardware...

Comment 15 ben thompson 2009-06-14 21:39:38 UTC
I get the same messages in log files. I'll attach in a sec...

Comment 16 ben thompson 2009-06-14 21:40:21 UTC
Created attachment 347862 [details]
ethtool -e eth1

Comment 17 ben thompson 2009-06-14 21:40:49 UTC
Created attachment 347863 [details]
ethtool -i eth1

Comment 18 ben thompson 2009-06-14 21:41:18 UTC
Created attachment 347864 [details]
lspci -tv

Comment 19 ben thompson 2009-06-14 21:41:44 UTC
Created attachment 347865 [details]
lspci -vvv -xxx

Comment 20 ben thompson 2009-06-14 21:42:13 UTC
Created attachment 347866 [details]
cat /proc/cpuinfo

Comment 21 ben thompson 2009-06-14 21:43:27 UTC
Created attachment 347867 [details]
/var/log/messages

Comment 22 dave graham 2009-06-15 17:54:21 UTC
Hi Ben - You have a very different issue than Thomas. Yes, it's a transmit timeout, and so in some ways similar, but it is a different type, and it's on 82541PI network Si on an AMD-based platform (whereas Thomas's is on 82547EI on an Intel ICH5 platform). But (Ben) I do think that your issue might match one or more of the other bugs already in the forum, and hopefully one that is fixed in the latest driver release... but I'm getting ahead of myself. Let's continue this thread as another bug, if needed.

Comment 23 dave graham 2009-06-15 18:07:46 UTC
Thomas, 
Thanks for trying the SF 8.0.13 driver. 

I got my 82547EI system up under FC11. Here's the essential data from my system, and its a pretty close match to yours. I'm even still using the stock FC11 driver, which was in your original report.

#uname -a
Linux drgraha1-tan 2.6.29.4-167.fc11.i686.PAE #1 SMP Wed May 27 17:28:22 EDT 2009 i686 i686 i386 GNU/Linux

#ethtool -i eth0
driver: e1000
version: 7.3.21-k3-NAPI
firmware-version: N/A
bus-info: 0000:01:01.0

#ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

#lspci -tv
-[0000:00]-+-00.0  Intel Corporation 82865G/PE/P DRAM Controller/Host-Hub Interface
           +-02.0  Intel Corporation 82865G Integrated Graphics Controller
           +-03.0-[0000:01]----01.0  Intel Corporation 82547EI Gigabit Ethernet Controller
           +-06.0  Intel Corporation 82865G/PE/P Processor to I/O Memory Interface
           +-1d.0  Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1
           +-1d.1  Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2
          .....

In initial testing, I find that I can reliably TX large files using scp from the SUT, with no hangs or error messages in the system log. I didn't expect the transfer to be so slow, but it may be normal on this older silicon. What transfer rate are you seeing (if you can ever get a file of decent size to TX without timeout)?

[root@drgraha1-tan work]# scp bigfile root.3.50:/home/drgraha1/work/.
root.3.50's password: 
bigfile                           100% 1059MB  18.6MB/s   00:57  

You question a possible HW issue. It's possible, but I wonder why it would appear only when you advanced your kernel. That's pretty suspicious. Let's try a quick run of the ethtool test package:

	ethtool -t eth0

All tests should pass within 30 seconds, with result 0. If they don't, it likely *is* a Si issue. If they do, then I'm going to prepare that debug driver I mentioned last week, to see if we can pull some more useful info from the system.

Dave

Comment 24 dave graham 2009-06-15 21:50:54 UTC
Thomas,
I have attached a debug patch that will collect more information (to the system message log) to the patches section of our e1000 SourceForge site.

http://sourceforge.net/tracker/?func=detail&aid=1460945&group_id=42302&atid=447451

If you are not familiar with applying patches, let me know and I'll provide more detail. 

If you do apply the patch and manage to get what looks like a good data dump (in /var/log/messages), then just attach the output here, and I'll have a look at it.

Thanks
Dave

Comment 25 Thomas Müller 2009-06-16 18:48:32 UTC
I'm currently unable to transfer a file large enough to test the transfer rate, but if I remember correctly about 18-20MB/s is what I used to get here too.

I've just executed ethtool -t and it doesn't look very good :(
The test result is FAIL
The test extra info:
Register test  (offline)         0
Eeprom test    (offline)         0
Interrupt test (offline)         0
Loopback test  (offline)         13
Link test   (on/offline)         0


I've also tried to apply the patch you referred to, but it's against driver version 7.0.33 which doesn't compile for kernel 2.6.29.
I tried to adapt it to the latest 8.0.13 driver, but even after coping with removed/renamed defines I can't compile it due to some errors:
make -C /lib/modules/2.6.29.4-167.fc11.i586/build SUBDIRS=/usr/src/e1000-8.0.13/src modules
make[1]: Entering directory `/usr/src/kernels/2.6.29.4-167.fc11.i586'
  CC [M]  /usr/src/e1000-8.0.13/src/e1000_main.o
/usr/src/e1000-8.0.13/src/e1000_main.c: In function 'e1000_dump':
/usr/src/e1000-8.0.13/src/e1000_main.c:3254: error: 'struct e1000_adapter' has no member named 'rx_ps_pages'
/usr/src/e1000-8.0.13/src/e1000_main.c:3261: warning: initialization from incompatible pointer type
/usr/src/e1000-8.0.13/src/e1000_main.c:3275: warning: format '%016llX' expects type 'long long unsigned int', but argument 7 has type 'dma_addr_t'
/usr/src/e1000-8.0.13/src/e1000_main.c:3291: warning: initialization from incompatible pointer type
make[2]: *** [/usr/src/e1000-8.0.13/src/e1000_main.o] Error 1
make[1]: *** [_module_/usr/src/e1000-8.0.13/src] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.29.4-167.fc11.i586'
make: *** [default] Error 2

Comment 26 dave graham 2009-06-16 18:56:13 UTC
Hmm, I thought I developed it against the 8.0.13 SF driver. I probably made a mistake. Let me check. And I'll look into the significance of that loopback test failure too. I don't get it on my platform.

Comment 27 Thomas Müller 2009-06-16 19:06:19 UTC
Ah, I think I was somewhat confused... I see your comment from yesterday now, sorry for the confusion.

I'll apply the patch and get back to you when I had a chance to test it.

Comment 28 Thomas Müller 2009-06-16 19:33:47 UTC
Your patch worked fine, but I downloaded the wrong file at first, sorry again.

I'll attach the output in a moment...

Comment 29 Thomas Müller 2009-06-16 19:36:00 UTC
Created attachment 348154 [details]
output of driver with debug patch

Comment 30 dave graham 2009-06-16 20:26:55 UTC
Thanks Thomas. It's a good debug dump, and shows a real problem. As does the loopback test, which reports a data miscompare (or maybe a timeout; the failure paths aren't too cleanly implemented). I'll get the most out of the dump file, and will be consulting with a few colleagues. There's still no clear indication of whether this is a driver issue or your HW.

I don't expect any quick breakthrough. I will get back to you later today with a summary of what I've found, and any new plan. I wish I was able to get a repro here, but alas I am not. Are you OK with continuing to help me debug?

Dave

Comment 31 dave graham 2009-06-16 21:59:48 UTC
Well, there are a couple of things to try. Neither directly addresses the root cause, but they may work for you:

1) Use module load parameter TxDescPower=9
[You'll find instructions for how to apply this in the README file in the e1000-8.0.13 install directory]. This will reduce the max chunk size of data sent from the host to the NIC on the PCI-X bus. We have had other TXHangs due to silicon errata that this has worked around. I am not sure if the 82547EI is one of the affected parts (I am in contact with others to find out), but it's worth a shot.

2) Disable TX Checksum offload: "ethtool -K eth0 tx off". Apologies if you already tried this one.

Please also capture another couple of TXHang reports like the one you attached before. I might see a significant pattern by looking at a few.
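
A rough way to see what option 1 changes: if TxDescPower is read as the base-2 logarithm of the largest chunk one TX descriptor may carry (so 12 means 4096 bytes and 9 means 512 bytes, which is an assumption based on how the driver README describes the parameter), then a buffer needs more descriptors as the value drops. The Python sketch below only does that arithmetic and ignores any extra descriptors the driver may add for hardware workarounds.

```python
def descriptors_needed(buffer_len, tx_desc_power=12):
    """Descriptors needed for one buffer if each carries at most 2**power bytes."""
    max_per_desc = 1 << tx_desc_power
    return -(-buffer_len // max_per_desc)  # ceiling division

print(descriptors_needed(1514, 12))  # 1
print(descriptors_needed(1514, 9))   # 3
```

Under this reading, a full-size 1514-byte frame is fetched by the NIC in three smaller pieces instead of one.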

Comment 32 Thomas Müller 2009-06-17 06:17:38 UTC
My system is currently stable with the additional NIC I installed, so it's no longer extremely urgent. However, it only takes a moment to reconfigure everything to be able to test the 82547EI and I'm still interested in the root cause of this, so I'll continue to help you of course :)
If the conclusion will be a hw error I'll disable it for good, if it's a driver bug that can be fixed somehow, I'll be more than happy *g*
Thanks for your efforts :)

Option TxDescPower=9 or disabling TX Checksum offload did not resolve the issue.
I think I tried disabling all offload options a few days ago without any success.

I'll attach the debugging output in a moment.

Comment 33 Thomas Müller 2009-06-17 06:18:40 UTC
Created attachment 348212 [details]
output of driver with debug patch

Comment 34 dave graham 2009-06-17 22:06:53 UTC
Thomas, 
Thanks for the additional testing. I have conferred with colleagues and it seems that you most probably do have an issue with the network silicon, or maybe the platform. The debug dumps show that the driver is doing what it is supposed to, which is to wait on a return of each TX "descriptor" from the HW before recycling the descriptor to its available pool. In your case, we can see that the driver is waiting for a descriptor, and we can see that descriptor in the HW cache of the descriptors, but it is not being written back to host memory with a "Done" indication. The driver eventually times out and reports the TX Hang (and in the debug driver case prints all the debug stuff).
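
The wait-for-writeback behaviour described above can be modelled in a few lines. This Python sketch is a simplified illustration, not the driver's code: the class and function names are invented, the "done" bit value of 0x1 is assumed, and the real driver checks the last descriptor of each packet rather than every slot.

```python
from dataclasses import dataclass

DD = 0x1  # "descriptor done" status bit (value assumed for this sketch)

@dataclass
class TxSlot:
    status: int = 0      # written back by the NIC when the send completes
    time_stamp: int = 0  # jiffies when the driver queued the buffer

def clean_tx(ring, next_to_clean, next_to_use, now, timeout):
    """Reclaim completed slots; flag a hang if the oldest pending one is overdue."""
    i = next_to_clean
    while i != next_to_use and ring[i].status & DD:
        ring[i] = TxSlot()            # recycle the slot into the free pool
        i = (i + 1) % len(ring)
    hung = i != next_to_use and now - ring[i].time_stamp > timeout
    return i, hung

ring = [TxSlot() for _ in range(8)]
for slot in range(5):                 # driver queues slots 0..4 at jiffies=100
    ring[slot] = TxSlot(status=0, time_stamp=100)
ring[0].status = ring[1].status = DD  # NIC writes back only the first two

ntc, hung = clean_tx(ring, 0, 5, now=150, timeout=1000)
print(ntc, hung)   # 2 False
ntc, hung = clean_tx(ring, ntc, 5, now=2000, timeout=1000)
print(ntc, hung)   # 2 True
```

In this model, cleaning stops at the first slot the NIC never marks as done, and once that slot's age exceeds the timeout the hang is reported, which matches a next_to_watch.status of 0 paired with an unchanged time_stamp in the log.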

That the NIC has a problem ties in with the reported failure of the loopback test, and with the fact that the problem remains when you restore an older kernel/driver. Also, you are (OK, so far) the only person to be reporting exactly this issue.

I do notice that you still have TX Checksumming enabled (from the dumps), but this is not likely to be related to your issue. Again though, why not disable it (another ethtool -K variant)?

One of my colleagues suggested that this may be a temperature-related issue. It is summer now after all, so it's possible. Some of these older parts are sensitive to temperature.

But that's about all I've been able to come up with.

Comment 35 Thomas Müller 2009-06-18 17:46:55 UTC
To be sure, I've just disabled all offloading settings:
# ethtool -k eth0
Offload parameters for eth0:
Cannot get device flags: Operation not supported
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off

However, the problem is still there. :(
I'll attach the dump in a moment, just so it's saved with all the other ones...


I will then consider this NIC as physically damaged and won't use it again until told otherwise (or maybe I'll try again next winter ;))
Thank you for your quick responses and your efforts. :)


I'd have one more question though:
Reading your explanation I can see how this TX Hang happens, but do you have any idea why my system also often locks up completely and then needs a power-cycle to get up again when I try to use this NIC?

Comment 36 Thomas Müller 2009-06-18 17:48:11 UTC
Created attachment 348533 [details]
output of driver with debug patch

Comment 37 dave graham 2009-06-18 23:30:20 UTC
Thanks Thomas,
I've looked at the new dump. All offloads are clearly disabled. Again, it simply looks like the NIC has stopped transmitting data. The simplest of the dumps to analyze is actually the second one.

Td[desc]  [address 63:0  ] ntw      TXDESCRIPTOR FIFO
Tc[0x000] 00000000353DE202   0   -- T7000: 353DE202|353DE202 8B00002A|8B00002A
Tc[0x001] 0000000035A3B8E6   1   -- T7010: 35A3B8E6|35A3B8E6 8B00004A|8B00004A
Tc[0x002] 0000000035A3B8EE   2   -- T7020: 35A3B8EE|35A3B8EE 8B000042|8B000042
Tc[0x003] 0000000035A3B8EE   3   -- T7030: 35A3B8EE|35A3B8EE 8B000042|8B000042
Tc[0x004] 0000322200000000   6   -- T7040: 00000000|00000000 21000000|21000000
Td[0x005] 0000000035A3B93E   5   -- T7050: 35A3B93E|35A3B93E 22100042|22100042
Td[0x006] 00000000348E2000   6   -- T7060: 348E2000|348E2000 AB100015|AB100015
Tc[0x007] 0000322200000000   A   -- T7070: 00000000|00000000 21000000|21000000
Td[0x008] 0000000035A3B93E   8   -- T7080: 35A3B93E|35A3B93E 22100042|22100042
Td[0x009] 00000000348E2015   9   -- T7090: 348E2015|348E2015 22100200|22100200
Td[0x00A] 00000000348E2215   A   -- T70A0: 348E2215|348E2215 AB100118|AB100118
Tc[0x00B] 0000322200000000   D NTC  T70B0: 359D7345 (TXD FIFO DATA IS STALE !!!)
Td[0x00C] 0000000035A3B93E   C                      
Td[0x00D] 00000000348E232D   D
Tc[0x00E] 0000000035A3A0E2   E
Tc[0x00F] 0000000000000000   0 NTU


I've condensed the essential part of the dump above. We see that the NIC's TX Descriptor FIFO doesn't contain the TX descriptor that the driver is waiting to see completed. The FIFO lines up for the most part, up to element 00A, but 00B is not showing. We can see that the driver had properly informed the NIC that it *should* fetch this descriptor, so it should be in the Descriptor FIFO. Either the NIC DMA RX engine was hung, or the read by the NIC failed.
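
The comparison described above can be done mechanically on the condensed dump. This Python sketch assumes the column layout shown there (ring index, 64-bit first descriptor word, next_to_watch, then the FIFO copy) and assumes that the first FIFO dword mirrors the low 32 bits of the descriptor word; both are readings of this particular debug output, and the function name is invented.

```python
import re

ROW = re.compile(
    r"^T[cd]\[0x(?P<idx>[0-9A-F]+)\]\s+(?P<addr>[0-9A-F]{16})\s+\S+"
    r"(?:.*?T[0-9A-F]{4}:\s+(?P<fifo>[0-9A-F]{8}))?"
)

def first_stale(rows):
    """Index of the first descriptor whose FIFO copy is missing or differs."""
    for row in rows:
        m = ROW.match(row)
        if not m:
            continue
        low32 = m.group("addr")[-8:]   # low 32 bits of the descriptor word
        if m.group("fifo") != low32:
            return int(m.group("idx"), 16)
    return None

DUMP = """\
Td[0x009] 00000000348E2015   9   -- T7090: 348E2015|348E2015 22100200|22100200
Td[0x00A] 00000000348E2215   A   -- T70A0: 348E2215|348E2215 AB100118|AB100118
Tc[0x00B] 0000322200000000   D NTC  T70B0: 359D7345 (TXD FIFO DATA IS STALE !!!)
Td[0x00C] 0000000035A3B93E   C
""".splitlines()

print(hex(first_stale(DUMP)))  # 0xb
```

On these rows the first mismatch is element 00B, the same descriptor the driver marks as next to clean.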

Why does the system sometimes hang? That's a pretty important question. We can guess. If the NIC is failing to read from host memory in this sample dump, there's something flaky with the Device/Host DMA, and that could have any number of consequences. A PCI bus error may be involved, which might cause an NMI. Or possibly, if a Receive Descriptor is corrupted by this issue, the RX DMA of packet data will be misdirected, and the NIC could then scribble to pretty much anywhere in host memory.

Yes this is conjecture, but does fit with the dumps. If I had a repro locally I'd certainly chase it down further.

Comment 38 vxworks 2009-08-14 23:24:21 UTC
I've got the same problem in my pc, too (same processor, same card). It may have something to do with the memory size.

I'm running 4GB RAM. Whenever I remove 2GB, the card works well.

Comment 39 dave graham 2009-08-17 17:41:42 UTC
To wrap up this issue, I should note that I never did find a driver issue. I believe that this is not a SW bug but rather a HW issue, and I sent Thomas an "Intel PRO/1000 XT Server Adapter" card to replace the failing interface. Thomas was up and running again with this replacement on 7/31/09. I am closing this issue.

Hi "vxworks". The 4GB/2GB aspect of this is interesting. Your issue may be the same as Thomas's, but a lot of issues look very similar in their original manifestation. Could you please file a new bug? I'll then get to it and be able to dedicate my attention to your symptoms. Thanks.

Comment 40 Jesse Brandeburg 2009-08-17 19:06:24 UTC
There is a known issue with some systems and PCI adapters that can usually be fixed by applying the patch that removes DMA to/from addresses >=4GB.

I'm working on a quick patch to add a module parameter for the > 4GB thing.
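
As a small illustration of the boundary being discussed, the Python sketch below checks whether a bus address is reachable with a 32-bit DMA mask. The helper name is invented, the first sample address is taken from the descriptor dump in comment 37, and the second is purely hypothetical.

```python
def fits_dma_mask(bus_addr, mask_bits=32):
    """True if the address can be reached with a DMA mask of mask_bits bits."""
    return bus_addr < (1 << mask_bits)

print(fits_dma_mask(0x35A3B93E))    # True: below 4GB
print(fits_dma_mask(0x120000000))   # False: hypothetical buffer above 4GB
```

Restricting the device to a 32-bit mask means buffers like the second one would never be handed to the NIC directly.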

Comment 41 Thomas Müller 2009-08-17 19:15:40 UTC
As additional information: My system only has 2GB RAM.

