Bug 400561
Summary: | e1000 erratic ping latency | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Warren Togami <wtogami>
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint>
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | medium | Priority: | medium
Version: | 8 | CC: | auke-jan.h.kok, jesse.brandeburg, thelaw269
Hardware: | All | OS: | Linux
Fixed In Version: | 2.6.23.9-85.fc8 | Doc Type: | Bug Fix
Last Closed: | 2007-12-20 19:56:21 UTC | |

Description
Warren Togami
2007-11-27 06:00:35 UTC
Another data point... A week or two ago, I was not experiencing this e1000 problem with these same kernel versions. My current theory is that *something* in userspace, either in F8 updates or something I might have changed manually, is triggering an e1000 bug.

I only noticed it today because directly plugging in a thin client to boot from my laptop suddenly became REALLY SLOW in transferring data over the tftp and NFS protocols. It is not 100% certain that this behavior is related to the e1000 erratic ping latency above, but seemingly both problems began around the same time. I am getting a USB ethernet adapter in order to test in another way.

I spent a few hours testing various kernels, always doing a cold boot. Several boots in a row, e1000 had the erratic ping problem. But then, on several subsequent boots with the very same kernels, I was not able to reproduce the erratic ping problem, although the dismally slow tftp and NFS performance persisted. I am very perplexed about what might have changed here...

Apparently fixed by:
http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=commitdiff_plain;h=f2fa3114919fa195f800a04a5e57156c0f67fff4
and it says that it fixes the bad checksum problem too.

http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=commitdiff_plain;h=f2fa3114919fa195f800a04a5e57156c0f67fff4
Tried this patch against 2.6.23.9 in F-8. Unfortunately, this fixes neither the erratic ping latency nor the slow tftp/nfs mentioned above.
64 bytes from 172.31.16.1: icmp_seq=1 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=2 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=3 ttl=64 time=53.6 ms
64 bytes from 172.31.16.1: icmp_seq=4 ttl=64 time=1.14 ms
64 bytes from 172.31.16.1: icmp_seq=5 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=6 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=7 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=8 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=9 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=10 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=11 ttl=64 time=1.05 ms
64 bytes from 172.31.16.1: icmp_seq=12 ttl=64 time=128 ms

Curious that it shows the same numbers, 1000 ms and 128 ms, repeatedly.

Grr!!! At the office, plugging my same laptop into the same thin client works without any tftp/nfs slowness. Plugging into the office 100mbit full-duplex network also has no ping latency or slowness issues. Tested both -41 and my custom build of 2.6.23.9 with the above patch, both of which failed at home.

(Tested the same ethernet cables at home, both with an RJ45 to RJ45 tester and by plugging into another Thinkpad T41 laptop with e1000. Works without issue on that other laptop plugged into the same devices.)

Is your system using the e1000e or e1000 driver? If e1000, what is the PCI ID of the adapter?

e1000

lspci:
02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller

lspci -vn
02:00.0 0200: 8086:109a
Subsystem: 17aa:2001
Flags: bus master, fast devsel, latency 0, IRQ 2297
Memory at ee000000 (32-bit, non-prefetchable) [size=128K]
I/O ports at 2000 [size=32]
Capabilities: [c8] Power Management version 2
Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
Capabilities: [e0] Express Endpoint IRQ 0

Okay, back to the default debugging on this part: please attach ethtool -e and lspci -vv output.
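
For anyone collecting similar data: the repeating 1000 ms and 128 ms round-trip times in the ping output quoted above can be tallied with a short script. The following is only a minimal sketch. The sample text is copied from this report, while the helper names (`parse_rtts`, `tally`) and the 100 ms "slow" threshold are assumptions made for illustration.

```python
import re
from collections import Counter

# Ping lines copied from the output quoted in this report.
PING_OUTPUT = """\
64 bytes from 172.31.16.1: icmp_seq=1 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=2 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=3 ttl=64 time=53.6 ms
64 bytes from 172.31.16.1: icmp_seq=4 ttl=64 time=1.14 ms
64 bytes from 172.31.16.1: icmp_seq=5 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=6 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=7 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=8 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=9 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=10 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=11 ttl=64 time=1.05 ms
64 bytes from 172.31.16.1: icmp_seq=12 ttl=64 time=128 ms
"""

# Matches the sequence number and round-trip time of one reply line.
TIME_RE = re.compile(r"icmp_seq=(\d+) .*?time=([\d.]+) ms")


def parse_rtts(text):
    """Return a list of (icmp_seq, rtt_ms) tuples found in ping output."""
    return [(int(seq), float(rtt)) for seq, rtt in TIME_RE.findall(text)]


def tally(rtts, slow_ms=100.0):
    """Count how often each round-trip time at or above slow_ms recurs.

    The 100 ms default is an arbitrary illustrative threshold.
    """
    return Counter(rtt for _, rtt in rtts if rtt >= slow_ms)


rtts = parse_rtts(PING_OUTPUT)
counts = tally(rtts)
for rtt, n in counts.most_common():
    print(f"{rtt:g} ms seen {n} times out of {len(rtts)} replies")
```

Identical slow values recurring this often are consistent with a fixed delay somewhere in the path rather than ordinary network jitter, but the script only counts; it does not identify the cause.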
(In reply to comment #6)
> e1000
> lspci:
> 02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller

That patch was for e1000e; can you apply this one as well? It makes the 82573 use the e1000e driver.
http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=commitdiff_plain;h=b3637100199b2679cd2f39d47a0061a3398cd3ca

Intel people, is this one going in 2.6.24?

No, since 2.6.24 is late in the RC cycle right now and we're removing support for 82573 from e1000 in 2.6.25 (by moving it to e1000e). I'm not inclined to push this patch this late in the cycle for a piece of hardware which it will not support for longer than a single cycle anyway.

Created attachment 272011
backport of e1000e patch to e1000

Completely untested backport, not even compiled yet.

Re: #10: that should work just fine for e1000.

The patch against e1000e seems to be working fine on my 82573 at home. When I booted the F8 system with the new kernel, moving from e1000 in the old kernel to e1000e in the new one, it automatically loaded the e1000e driver instead of e1000, probably due to udev, and ignored the contents of modprobe.conf. Device renaming named it eth0, probably because the MAC address matched the definition in /etc/sysconfig/network-scripts/ifcfg-eth0. This indicates that a possible future migration from e1000 to e1000e would work for most users. It is unclear what exactly a leftover "alias eth0 e1000" in modprobe.conf would affect. I am trying your backport to e1000 next.

Tried the patch in Comment #10 and booted with e1000. Ping latency on my home network, and tftp/nfs speed to my thin client, both appear to be good. But then again, it was working fine with my thin client this morning without this patch as well, and all of these kernels were working fine on e1000 prior to Monday too. So I cannot know for sure if this patch is actually helping. =(

Intel, could the patch in Comment #10 be pushed to 2.6.24? Does the higher power usage after the patch only happen when you are actually plugged in? (Not when unplugged from Ethernet and using wireless.)

Reply to comment #14: I'm OK with that patch being pushed to 2.6.24 if that's desirable, but I do not know if it will be accepted in the first place (it's new code and we're late in the rc cycle, and we're going to have to remove it in 2.6.25 anyway).

Power consumption of the device is higher with L1 ASPM disabled at *all* times because the feature saves power on the pci-e link level, not the PHY link level.

(In reply to comment #15)
> I'm OK with that patch being pushed to 2.6.24 if that's desirable, but I do not
> know if it will be accepted in the first place (it's new code and we're late in
> the rc cycle, and we're going to have to remove it in 2.6.25 anyway)
>
You're not removing it, just moving it to e1000e.

> Power consumption of the device is higher with L1 ASPM disabled at *all* times
> because the feature saves power on the pci-e link level, not the PHY link
> level.
Sadness. Will it be impossible to find a solution that doesn't require
disabling L1 ASPM?

We've been able to come up with various workarounds, but none of them fixes what is broken, and the problem always seems to resurface after each one. Disabling the feature is by far the best solution for everyone.

We *are* going to remove all pci-e adapter support from e1000 in 2.6.25, including this workaround. e1000 in 2.6.25 will no longer support 82573, so logically this 82573-specific fix will be removed as well.

I'm not sure, but I believe that even if we disable L1 support in the driver, the adapter will still go into L1 in the D3 state. So if you have a kernel or a driver that is putting the device into D3 during "no use" cases, then I think it might still save PCIe power. I was unable to glean an exact answer about what happens when the driver disables L1 and the device is then put into D3. The power loss of staying in L0 in any case is not very much, but is non-zero.

Has this patch been included in the e1000 driver available from http://sourceforge.net/projects/e1000/ ? If not, will it be included? Although I've recompiled the kernel on a test system to verify that this fixed the problem (it appears to, thank you!), I'd like to avoid doing that on all systems that have this problem; it is much easier to just compile the e1000 package and install it. I would try to patch it myself, but the file structure looks very different and I didn't want to monkey it up.

Patch is not in the e1000 driver - see comment #9. This patch is also not in any of our e1000.sf.net releases. It will be included in time, but not for several more weeks.

Fix in 2.6.23.9-75 and up.

kernel-2.6.23.9-85.fc8 has been pushed to the Fedora 8 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update kernel'

FYI Chuck Ebbert had me try some modprobe settings in bug 398921.
What was very interesting are the test results found in bug 398921 comment #9 and bug 398921 comment #10. The Xen kernel removed the tx unit hang messages while using the same Intel driver version in both kernels! The stock kernel generates the tx unit hang messages. Warren Togami, if you have time, it would be interesting to see if the Xen kernel did anything for you.

kernel-2.6.23.9-85.fc8 has been pushed to the Fedora 8 stable repository. If problems still persist, please make note of it in this bug report.

As you wished...
Linux version 2.6.23.9-85.fc8
...
Dec 22 23:28:50 mowgli kernel: Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
...
Dec 22 23:35:23 mowgli kernel: NETDEV WATCHDOG: eth0: transmit timed out
Dec 22 23:35:23 mowgli kernel: NETDEV WATCHDOG: eth0: transmit timed out
Dec 22 23:35:23 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Dec 22 23:35:23 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Dec 22 23:35:23 mowgli kernel: Tx Queue <0>
Dec 22 23:35:23 mowgli kernel: Tx Queue <0>
Dec 22 23:35:23 mowgli kernel: TDH <b>
Dec 22 23:35:23 mowgli kernel: TDH <b>
Dec 22 23:35:23 mowgli kernel: TDT <b>
Dec 22 23:35:23 mowgli kernel: TDT <b>
Dec 22 23:35:23 mowgli kernel: next_to_use <b>
Dec 22 23:35:23 mowgli kernel: next_to_use <b>
Dec 22 23:35:23 mowgli kernel: next_to_clean <1f>
Dec 22 23:35:23 mowgli kernel: next_to_clean <1f>
Dec 22 23:35:23 mowgli kernel: buffer_info[next_to_clean]
Dec 22 23:35:23 mowgli kernel: buffer_info[next_to_clean]
Dec 22 23:35:23 mowgli kernel: time_stamp <27c56>
Dec 22 23:35:23 mowgli kernel: time_stamp <27c56>
Dec 22 23:35:23 mowgli kernel: next_to_watch <1f>
Dec 22 23:35:23 mowgli kernel: next_to_watch <1f>
Dec 22 23:35:23 mowgli kernel: jiffies <29810>
Dec 22 23:35:23 mowgli kernel: jiffies <29810>
Dec 22 23:35:23 mowgli kernel: next_to_watch.status <0>
Dec 22 23:35:23 mowgli kernel: next_to_watch.status <0>
Dec 22 23:35:26 mowgli kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Dec 22 23:35:26 mowgli kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

I will try the Xen kernel again
Linux version 2.6.21-2952.fc8xen
...
Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
...

As I reported in bug 398921 comment #9, both the Xen and regular kernels are now using the same driver. However, I do not see the Tx Unit Hang messages when using the Xen kernel.

I was under the impression that the issue was related to the Xen kernel having an older working e1000 driver while the non-Xen kernels used a later driver. Perhaps this may have been true or not? Now, since both kernels have the same 7.3.20-k2-NAPI e1000 driver, why does the Xen kernel produce no errors while the stock kernel produces a boatload of 'em and performance suffers?

In the case of the Xen kernel, the copy command that produced errors allows me to copy 15G with no errors.
du -sh
15G .

What would you like me to do next?

From bug 398921
History of the Problem on the ECS AMD Sempron Motherboard.

Intel Driver | Kernel | TX Hang Issues
---|---|---
7.3.15-k2-NAPI | Fedora 5 kernel ? | Rock Solid
7.3.??-k2-NAPI | Fedora 6 kernel ? | Not installed.
7.3.15-k2-NAPI | /boot/vmlinuz-2.6.20-2925.11.fc7xen | Rock Solid
7.3.15-k2-NAPI | /boot/vmlinuz-2.6.20-2925.9.fc7xen | Rock Solid
7.3.20-k2-NAPI | /boot/vmlinuz-2.6.21-1.3228.fc7 | TX Issues Encountered
7.3.20-k2-NAPI | /boot/vmlinuz-2.6.22.1-27.fc7 | TX Issues Encountered
7.3.20-k2-NAPI | /boot/vmlinuz-2.6.23.1-42.fc8 | TX Issues Encountered
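
When comparing kernels as in the history above, it can help to pull the numbers out of a "Detected Tx Unit Hang" dump instead of reading it by eye. The sketch below is a rough illustration, not part of any Intel or Fedora tooling. The log excerpt is abridged from this report with the repeated copy of each line omitted; the function names are hypothetical; and it assumes the bracketed values are hexadecimal (suggested by values such as `<b>` and `<1f>`). Converting the jiffies difference to seconds additionally assumes a tick rate of HZ=1000, which the report does not state.

```python
import re

# Abridged from the kernel log in this report; duplicate lines omitted.
LOG_EXCERPT = """\
Dec 22 23:35:23 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Dec 22 23:35:23 mowgli kernel: Tx Queue <0>
Dec 22 23:35:23 mowgli kernel: TDH <b>
Dec 22 23:35:23 mowgli kernel: TDT <b>
Dec 22 23:35:23 mowgli kernel: next_to_use <b>
Dec 22 23:35:23 mowgli kernel: next_to_clean <1f>
Dec 22 23:35:23 mowgli kernel: buffer_info[next_to_clean]
Dec 22 23:35:23 mowgli kernel: time_stamp <27c56>
Dec 22 23:35:23 mowgli kernel: next_to_watch <1f>
Dec 22 23:35:23 mowgli kernel: jiffies <29810>
Dec 22 23:35:23 mowgli kernel: next_to_watch.status <0>
"""

# A dump field looks like "kernel: <name> <hexvalue>".
FIELD_RE = re.compile(r"kernel:\s+([A-Za-z_. ]+?)\s+<([0-9a-fA-F]+)>")


def count_hangs(text):
    """Count log lines that report a Tx unit hang."""
    return sum("Detected Tx Unit Hang" in line for line in text.splitlines())


def parse_hang_dump(text):
    """Map each dump field name to its value, read as hexadecimal (assumed)."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(text)}


def stalled_jiffies(fields):
    """Jiffies elapsed between the buffer's time_stamp and the dump."""
    return fields["jiffies"] - fields["time_stamp"]


fields = parse_hang_dump(LOG_EXCERPT)
delta = stalled_jiffies(fields)
HZ = 1000  # assumption: the kernel tick rate is not given in the report
print(f"hang reports: {count_hangs(LOG_EXCERPT)}")
print(f"TDH == TDT: {fields['TDH'] == fields['TDT']}")
print(f"stalled for {delta} jiffies (about {delta / HZ:.1f} s if HZ={HZ})")
```

Running the same script over logs from the stock and Xen kernels would give a like-for-like count of hang reports, which is the comparison the table above makes by hand.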