Bug 400561

Summary: e1000 erratic ping latency
Product: [Fedora] Fedora
Reporter: Warren Togami <wtogami>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Docs Contact:
Priority: medium
Version: 8
CC: auke-jan.h.kok, jesse.brandeburg, thelaw269
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: 2.6.23.9-85.fc8
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2007-12-20 19:56:21 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                          Flags
backport of e1000e patch to e1000    none

Description Warren Togami 2007-11-27 06:00:35 UTC
kernel-2.6.23.1-41.fc8
kernel-2.6.23.1-42.fc8
kernel-2.6.23.1-49.fc8
kernel-2.6.23.8-61.fc8
kernel-2.6.24-0.43.rc3.git1.fc9
Lenovo Thinkpad T60 x86_64

02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
02:00.0 0200: 8086:109a
        Subsystem: 17aa:2001
        Flags: bus master, fast devsel, latency 0, IRQ 2297
        Memory at ee000000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 2000 [size=32]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Capabilities: [e0] Express Endpoint IRQ 0

[root@newcaprica ~]# ping 172.31.16.1
PING 172.31.16.1 (172.31.16.1) 56(84) bytes of data.
64 bytes from 172.31.16.1: icmp_seq=1 ttl=64 time=226 ms
64 bytes from 172.31.16.1: icmp_seq=2 ttl=64 time=26.8 ms
64 bytes from 172.31.16.1: icmp_seq=3 ttl=64 time=2.09 ms
64 bytes from 172.31.16.1: icmp_seq=4 ttl=64 time=1.36 ms
64 bytes from 172.31.16.1: icmp_seq=5 ttl=64 time=227 ms
64 bytes from 172.31.16.1: icmp_seq=6 ttl=64 time=0.912 ms
64 bytes from 172.31.16.1: icmp_seq=7 ttl=64 time=223 ms
64 bytes from 172.31.16.1: icmp_seq=8 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=9 ttl=64 time=229 ms
64 bytes from 172.31.16.1: icmp_seq=10 ttl=64 time=1.02 ms
64 bytes from 172.31.16.1: icmp_seq=11 ttl=64 time=230 ms
64 bytes from 172.31.16.1: icmp_seq=12 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=13 ttl=64 time=1.84 ms
64 bytes from 172.31.16.1: icmp_seq=14 ttl=64 time=1.13 ms
64 bytes from 172.31.16.1: icmp_seq=15 ttl=64 time=231 ms
64 bytes from 172.31.16.1: icmp_seq=16 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=17 ttl=64 time=0.979 ms
64 bytes from 172.31.16.1: icmp_seq=18 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=19 ttl=64 time=232 ms
64 bytes from 172.31.16.1: icmp_seq=20 ttl=64 time=1001 ms
64 bytes from 172.31.16.1: icmp_seq=21 ttl=64 time=1.09 ms

I intermittently see erratic ping latency (one hop away) like above when plugged
into a Linksys WRT54G or directly into another box's ethernet port.  The
erratic ping latency is similar to what I see people complaining about in Bug
#241783.  I see this behavior in all of the above kernel versions.

Interestingly, ping times went back to normal (1-2ms) when I began a data transfer
over that Ethernet adapter at around 600KB/sec.  When the transfer completed and the
card became idle again, the ping times returned to the above erratic behavior.
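
The latency pattern above can be summarized mechanically. The following is only a minimal sketch: the bucket thresholds (10 ms and 500 ms) and the helper names `parse_rtts`, `bucket`, and `summarize` are illustrative assumptions, not anything from the driver or from ping itself. The sample lines are copied from the ping output in this description.

```python
import re
from collections import Counter

# A few lines copied from the ping output in the description above.
SAMPLE = """\
64 bytes from 172.31.16.1: icmp_seq=1 ttl=64 time=226 ms
64 bytes from 172.31.16.1: icmp_seq=2 ttl=64 time=26.8 ms
64 bytes from 172.31.16.1: icmp_seq=3 ttl=64 time=2.09 ms
64 bytes from 172.31.16.1: icmp_seq=4 ttl=64 time=1.36 ms
64 bytes from 172.31.16.1: icmp_seq=5 ttl=64 time=227 ms
64 bytes from 172.31.16.1: icmp_seq=6 ttl=64 time=0.912 ms
64 bytes from 172.31.16.1: icmp_seq=7 ttl=64 time=223 ms
64 bytes from 172.31.16.1: icmp_seq=8 ttl=64 time=1000 ms
"""

# Matches the icmp_seq and time fields of one ping reply line.
RTT_RE = re.compile(r"icmp_seq=(\d+) .*time=([\d.]+) ms")


def parse_rtts(text):
    """Return a list of (icmp_seq, rtt_ms) tuples found in ping output."""
    return [(int(m.group(1)), float(m.group(2))) for m in RTT_RE.finditer(text)]


def bucket(rtt_ms):
    """Classify one round-trip time. Thresholds are illustrative assumptions."""
    if rtt_ms < 10:
        return "normal"
    if rtt_ms < 500:
        return "delayed"
    return "stalled"


def summarize(text):
    """Count how many replies fall into each latency bucket."""
    return Counter(bucket(rtt) for _, rtt in parse_rtts(text))


if __name__ == "__main__":
    for name, count in summarize(SAMPLE).items():
        print(f"{name}: {count}")
```

Feeding a longer capture through `summarize` makes it easier to compare an idle link against one that is carrying a transfer, which is the contrast described above.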

Comment 1 Warren Togami 2007-11-27 06:11:09 UTC
Another data point...
A week or two ago, I was not experiencing this e1000 problem with these same
kernel versions.  My current theory is that *something* in userspace, either in
F8 updates or something I might have changed manually, is triggering an e1000 bug. 

I only noticed it today because directly plugging in a thin client to boot from my
laptop suddenly became REALLY SLOW in transferring data over the tftp and NFS
protocols.  It is not 100% certain that this behavior is related to the e1000
erratic ping latency above, but both problems seemingly began around the same
time.  I am getting a USB ethernet adapter in order to test in another way.

I spent a few hours testing various kernels, always doing a cold boot.  On several
boots in a row, e1000 had the erratic ping problem.  But then on several subsequent
boots with the very same kernels, I was not able to reproduce the erratic ping
problem.  The dismally slow tftp and NFS performance persisted, however.

I am very perplexed about what might have changed here...

Comment 2 Chuck Ebbert 2007-11-27 23:28:50 UTC
apparently fixed by:
http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=commitdiff_plain;h=f2fa3114919fa195f800a04a5e57156c0f67fff4

and it says that fixes the bad checksum problem too.

Comment 3 Warren Togami 2007-11-28 02:44:06 UTC
http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=commitdiff_plain;h=f2fa3114919fa195f800a04a5e57156c0f67fff4
Tried this patch against 2.6.23.9 in F-8.  Unfortunately, this fixes neither the
erratic ping latency nor the slow tftp/nfs mentioned above.

64 bytes from 172.31.16.1: icmp_seq=1 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=2 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=3 ttl=64 time=53.6 ms
64 bytes from 172.31.16.1: icmp_seq=4 ttl=64 time=1.14 ms
64 bytes from 172.31.16.1: icmp_seq=5 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=6 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=7 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=8 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=9 ttl=64 time=1000 ms
64 bytes from 172.31.16.1: icmp_seq=10 ttl=64 time=128 ms
64 bytes from 172.31.16.1: icmp_seq=11 ttl=64 time=1.05 ms
64 bytes from 172.31.16.1: icmp_seq=12 ttl=64 time=128 ms

Curious that it has the same numbers 1000ms and 128ms repeatedly.

Comment 4 Warren Togami 2007-11-28 16:05:06 UTC
Grr!!!

At the office, plugging my same laptop into the same thin client works without
any tftp/nfs slowness.  Plugging into the office 100mbit full-duplex network
also has no ping latency or slowness issues.  Tested both -41 and my custom
build of 2.6.23.9 with the above patch, both of which failed at home.

(Tested the same ethernet cables at home, both with an RJ45-to-RJ45 tester and by
plugging into another Thinkpad T41 laptop with e1000.  They work without issue on
that other laptop plugged into the same devices.)

Comment 5 Chuck Ebbert 2007-11-28 20:51:49 UTC
Is your system using the e1000e or e1000 driver?
If e1000, what is the PCI ID of the adapter?


Comment 6 Warren Togami 2007-11-28 21:18:59 UTC
e1000
lspci:
02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller

lspci -vn
02:00.0 0200: 8086:109a
        Subsystem: 17aa:2001
        Flags: bus master, fast devsel, latency 0, IRQ 2297
        Memory at ee000000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 2000 [size=32]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
        Capabilities: [e0] Express Endpoint IRQ 0

Comment 7 Jesse Brandeburg 2007-11-28 21:32:16 UTC
Okay, back to the default debugging on this part: please attach ethtool -e and
lspci -vv output.



Comment 8 Chuck Ebbert 2007-11-28 22:06:30 UTC
(In reply to comment #6)
> e1000
> lspci:
> 02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller

That patch was for e1000e; can you apply this one as well? It makes the 82573
use the e1000e driver.

http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=commitdiff_plain;h=b3637100199b2679cd2f39d47a0061a3398cd3ca

Intel people, is this one going in 2.6.24?


Comment 9 Auke Kok 2007-11-28 23:30:26 UTC
No, since 2.6.24 is late in the RC cycle right now and we're removing support for
82573 from e1000 in 2.6.25 (by moving it to e1000e). I'm not inclined to push
this patch this late in the cycle for a piece of hardware which e1000 will not
support for longer than a single cycle anyway.


Comment 10 Chuck Ebbert 2007-11-29 00:01:15 UTC
Created attachment 272011 [details]
backport of e1000e patch to e1000

Completely untested backport, not even compiled yet.

Comment 11 Auke Kok 2007-11-29 00:14:15 UTC
Re: #10: that should work just fine for e1000.

Comment 12 Warren Togami 2007-11-29 03:09:29 UTC
The patch against e1000e seems to be working fine on my 82573 at home.  I booted
the F8 system first with a kernel using e1000 and then with the new kernel using
e1000e.  The new kernel automatically loaded the e1000e driver instead of e1000,
probably due to udev, and ignored the contents of modprobe.conf.  Device renaming
named the interface eth0, probably because the MAC address matched the definition
in /etc/sysconfig/network-scripts/ifcfg-eth0.  This indicates that a possible
future migration from e1000 to e1000e would work for most users.  It is unclear
what exactly a leftover "alias eth0 e1000" in modprobe.conf would impact.

I am trying your backport to e1000 next.

Comment 13 Warren Togami 2007-11-29 03:47:40 UTC
Tried the patch in Comment #10 and booted with e1000.  Ping latency on my home
network, and tftp/nfs speed to my thin client both appear to be good.  But then
again, it was working fine with my thin client this morning without this patch
as well, and all of these kernels were working fine on e1000 prior to Monday
too.  So I cannot know for sure if this patch is actually helping. =(


Comment 14 Warren Togami 2007-11-29 03:52:50 UTC
Intel,

Could the patch in Comment #10 be pushed to 2.6.24?

Does the higher power usage after the patch only happen when you are actually
plugged in?  (Not when unplugged from Ethernet and using wireless.)

Comment 15 Auke Kok 2007-11-29 18:21:28 UTC
reply to comment #14:

I'm OK with that patch being pushed to 2.6.24 if that's desirable, but I do not
know if it will be accepted in the first place (it's new code and we're late in
the rc cycle, and we're going to have to remove it in 2.6.25 anyway)

Power consumption of the device is higher with L1 ASPM disabled at *all* times
because the feature saves power on the pci-e link level, not the PHY link level. 
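
For readers unfamiliar with the register involved: in the PCI Express capability, the ASPM Control field occupies bits 1:0 of the Link Control register (00 = disabled, 01 = L0s, 10 = L1, 11 = L0s and L1). The sketch below only decodes and clears that field on a plain integer; it is an illustration of the bit layout, not the patch attached to this bug, and the function names and example register values are hypothetical.

```python
# ASPM Control is bits 1:0 of the PCIe Link Control register.
ASPM_MASK = 0x3
ASPM_L1_BIT = 0x2

ASPM_STATES = {
    0b00: "disabled",
    0b01: "L0s",
    0b10: "L1",
    0b11: "L0s and L1",
}


def aspm_state(link_control):
    """Decode the ASPM Control field of a 16-bit Link Control value."""
    return ASPM_STATES[link_control & ASPM_MASK]


def clear_l1(link_control):
    """Return the Link Control value with the L1 entry bit cleared."""
    return link_control & ~ASPM_L1_BIT & 0xFFFF


if __name__ == "__main__":
    # 0x0042 is a made-up example value with only L1 enabled in the ASPM field.
    before = 0x0042
    after = clear_l1(before)
    print(f"before: {before:#06x} ({aspm_state(before)})")
    print(f"after:  {after:#06x} ({aspm_state(after)})")
```

Clearing only the L1 bit leaves L0s untouched, which matches the point made above that the cost is paid on the PCIe link rather than at the PHY.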

Comment 16 Chuck Ebbert 2007-11-29 18:27:04 UTC
(In reply to comment #15)
> I'm OK with that patch being pushed to 2.6.24 if that's desirable, but I do not
> know if it will be accepted in the first place (it's new code and we're late in
> the rc cycle, and we're going to have to remove it in 2.6.25 anyway)
> 

You're not removing it, just moving it to e1000e.




Comment 17 Warren Togami 2007-11-29 18:29:18 UTC
> Power consumption of the device is higher with L1 ASPM disabled at *all* times
> because the feature saves power on the pci-e link level, not the PHY link
> level. 

Sadness.  Will it be impossible to find a solution that doesn't require
disabling L1 ASPM?

Comment 18 Auke Kok 2007-11-29 18:48:48 UTC
We've been able to come up with various workarounds, but none of them fix what
is broken, and the problem always seems to resurface after each one of those.
Disabling the feature is by far the best solution for everyone.

we *are* going to remove all pci-e adapter support from e1000 in 2.6.25,
including this workaround. e1000 in 2.6.25 will no longer support 82573, so
logically this 82573-specific fix will be removed as well.

Comment 19 Jesse Brandeburg 2007-11-29 23:46:50 UTC
I'm not sure, but I believe that even if we disable L1 support in the driver,
the adapter will go into L1 in the D3 state.  So if you have a kernel or a
driver that is putting the device into D3 during "no use" cases, then I think it
might still save PCIe power.

I was unable to glean an exact answer of what happens when the driver disables
L1, and then the device is put into D3.  The power loss of staying in L0 in any
case is not very much, but is non-zero.

Comment 20 Lawrence Lauderdale 2007-12-03 15:20:47 UTC
Has this patch been included in the e1000 driver available from
http://sourceforge.net/projects/e1000/ ? If not, will it be included? Although
I've recompiled the kernel on a test system to verify that this fixed the
problem (it appears to, thank you!), I'd like to avoid doing that on all systems
that have this problem; it is much easier to just compile the e1000 package and
install it. I would try to patch it myself, but the file structure looks very
different and I didn't want to monkey it up.

Comment 21 Auke Kok 2007-12-03 16:52:19 UTC
Patch is not in the e1000 driver - see comment #9.

This patch is also not in any of our e1000.sf.net releases. It will in time be
included but not for several more weeks.

Comment 22 Chuck Ebbert 2007-12-07 21:05:12 UTC
Fix in 2.6.23.9-75 and up.

Comment 23 Fedora Update System 2007-12-12 20:00:11 UTC
kernel-2.6.23.9-85.fc8 has been pushed to the Fedora 8 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update kernel'

Comment 24 Greg Morgan 2007-12-13 08:09:16 UTC
FYI, Chuck Ebbert had me try some modprobe settings in bug 398921.  What was very
interesting are the test results as found in bug 398921 comment #9 and bug
398921 comment #10.  The Xen kernel removed the tx unit hang messages while
using the same Intel driver version in both kernels!  The stock kernel generates
the tx unit hang messages. 

Warren Togami, if you have time, it would be interesting to see if the Xen kernel
did anything for you.

Comment 25 Fedora Update System 2007-12-20 19:56:12 UTC
kernel-2.6.23.9-85.fc8 has been pushed to the Fedora 8 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 26 Greg Morgan 2007-12-23 06:50:08 UTC
As you wished...

Linux version 2.6.23.9-85.fc8
...
Dec 22 23:28:50 mowgli kernel: Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
...
Dec 22 23:35:23 mowgli kernel: NETDEV WATCHDOG: eth0: transmit timed out
Dec 22 23:35:23 mowgli kernel: NETDEV WATCHDOG: eth0: transmit timed out
Dec 22 23:35:23 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Dec 22 23:35:23 mowgli kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Dec 22 23:35:23 mowgli kernel:   Tx Queue             <0>
Dec 22 23:35:23 mowgli kernel:   Tx Queue             <0>
Dec 22 23:35:23 mowgli kernel:   TDH                  <b>
Dec 22 23:35:23 mowgli kernel:   TDH                  <b>
Dec 22 23:35:23 mowgli kernel:   TDT                  <b>
Dec 22 23:35:23 mowgli kernel:   TDT                  <b>
Dec 22 23:35:23 mowgli kernel:   next_to_use          <b>
Dec 22 23:35:23 mowgli kernel:   next_to_use          <b>
Dec 22 23:35:23 mowgli kernel:   next_to_clean        <1f>
Dec 22 23:35:23 mowgli kernel:   next_to_clean        <1f>
Dec 22 23:35:23 mowgli kernel: buffer_info[next_to_clean]
Dec 22 23:35:23 mowgli kernel: buffer_info[next_to_clean]
Dec 22 23:35:23 mowgli kernel:   time_stamp           <27c56>
Dec 22 23:35:23 mowgli kernel:   time_stamp           <27c56>
Dec 22 23:35:23 mowgli kernel:   next_to_watch        <1f>
Dec 22 23:35:23 mowgli kernel:   next_to_watch        <1f>
Dec 22 23:35:23 mowgli kernel:   jiffies              <29810>
Dec 22 23:35:23 mowgli kernel:   jiffies              <29810>
Dec 22 23:35:23 mowgli kernel:   next_to_watch.status <0>
Dec 22 23:35:23 mowgli kernel:   next_to_watch.status <0>
Dec 22 23:35:26 mowgli kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Dec 22 23:35:26 mowgli kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

I will try the Xen Kernel again
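
A rough way to read the hang dump above is to subtract time_stamp from jiffies, both of which are printed in hex. This sketch assumes that time_stamp is the jiffies value recorded when the stuck buffer was queued, and it assumes a tick rate of HZ=1000 for the seconds conversion; neither is stated in this report, so treat the result as an estimate. The function names are hypothetical.

```python
def stalled_jiffies(jiffies_hex, time_stamp_hex):
    """Return how many jiffies elapsed between the two hex values from the dump."""
    return int(jiffies_hex, 16) - int(time_stamp_hex, 16)


def jiffies_to_seconds(delta, hz):
    """Convert a jiffies delta to seconds for a given tick rate."""
    return delta / hz


if __name__ == "__main__":
    # Values copied from the "jiffies" and "time_stamp" lines in the log above.
    delta = stalled_jiffies("29810", "27c56")
    # HZ=1000 is an assumption about this kernel's configuration.
    assumed_hz = 1000
    print(f"{delta} jiffies, about {jiffies_to_seconds(delta, assumed_hz):.1f} s at HZ={assumed_hz}")
```

If the kernel was built with a different HZ, only the second argument to `jiffies_to_seconds` needs to change.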

Comment 27 Greg Morgan 2007-12-23 07:45:47 UTC
Linux version 2.6.21-2952.fc8xen 
...
Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
...
As I reported in bug 398921 comment #9, both the xen and regular kernels are now
using the same driver.  However, I do not see the Tx Unit Hang messages when
using the xen kernel.  

I was under the impression that the issue was related to the Xen kernel having
an older working e1000 driver while the non-Xen kernels used a later driver. 
Perhaps this may have been true or not?  Now that both kernels have the same
7.3.20-k2-NAPI e1000 driver, why does the Xen kernel produce no errors while the
stock kernel produces a boat load of 'em and performance suffers?  In the case
of the Xen kernel, the copy command that produced errors allows me to copy 15G
with no errors.
du -sh
15G     .

What would you like me to do next?

From bug 398921

History of the Problem on the ECS AMD Sempron Motherboard.

Intel Driver     Kernel                                 TX Hang Issues
7.3.15-k2-NAPI   Fedora 5 kernel ?                      Rock Solid
7.3.??-k2-NAPI   Fedora 6 kernel ?                      Not installed.
7.3.15-k2-NAPI   /boot/vmlinuz-2.6.20-2925.11.fc7xen    Rock Solid
7.3.15-k2-NAPI   /boot/vmlinuz-2.6.20-2925.9.fc7xen     Rock Solid
7.3.20-k2-NAPI   /boot/vmlinuz-2.6.21-1.3228.fc7        TX Issues Encountered
7.3.20-k2-NAPI   /boot/vmlinuz-2.6.22.1-27.fc7          TX Issues Encountered
7.3.20-k2-NAPI   /boot/vmlinuz-2.6.23.1-42.fc8          TX Issues Encountered