Description of problem:
Network connections using the r8169 driver are not reliable. It may work for some, but it is unusable for us. We switched from Realtek's r8168 driver - which worked fine - to the built-in r8169 of the 5.4 beta kernels, but have since hit the issue described at http://patchwork.kernel.org/patch/37934/:

The 8169 chip only generates MSI interrupts when all enabled event sources are quiescent and one or more sources transition to active. If not all of the active events are acknowledged, or a new event becomes active while the existing ones are cleared in the handler, we will not see a new interrupt. The current interrupt handler masks off the Rx and Tx events once the NAPI handler has been scheduled, which opens a race window in which we can get another Rx or Tx event that is never ACK'ed, stopping all activity until the link is reset (ifconfig down/up). Fix this by always ACK'ing all event sources and looping in the handler until all sources are quiescent.

Version-Release number of selected component (if applicable):
kernel-2.6.18-160.el5

How reproducible:
Just let a box run for many hours and move some data over the net. In our case the dead link happens about once a week.

Steps to Reproduce:
1. Configure an r8169 network adapter.
2. Start working with it.
3. Work for many hours.

Actual results:
Jul 28 23:23:35 nx-08 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jul 28 23:23:35 nx-08 kernel: r8169: eth0: link up
Jul 28 23:25:46 nx-08 kernel: r8169: eth0: link down
Jul 28 23:25:49 nx-08 kernel: r8169: eth0: link up
Jul 28 23:25:52 nx-08 kernel: r8169: eth0: link down
Jul 28 23:25:57 nx-08 kernel: r8169: eth0: link up

Expected results:
The link should stay up and keep working.

Additional info:
It seems to affect only certain chips on certain hardware, and link speed seems to have an effect too. If I understand the description correctly, that is no surprise. I have backported a patch from 2.6.30.3 which hopefully fixes it. I'm rebuilding RPMs now to test it; if it runs I'll post it here.
It may take some days of running to verify that it really works.
Created attachment 355605 [details]
avoid dead link on r8169

The patched kernel works, but I cannot yet confirm that the bug is gone, because it doesn't happen very often in my case.
I'd like to confirm that I haven't seen any errors with this patch after moving some TB of data through my test box. Another computer, which had shown errors almost daily, has also not shown any errors since installing the new kernel 4 days ago. This was a real show-stopper for us on all RTL8168 NICs, which are widely used on Atom-based systems these days.
Packages are located at:
http://people.redhat.com/ivecera/rhel-5-ivtest/

Simon, could you please test them?
Any chance you could post an i686 build there? The boxes in question are Atom N270 based, and we run them on 32-bit (I'm not even sure they could run x86_64). Regards, Simon
No problem Simon, I will post it ASAP.
Simon, i686 packages are also there. Could you please test them?
Ivan, it doesn't seem to work. I tried to make the link stall by sending a large amount of data through it. While the speed is usually as expected, the transfer stops after some activity and resumes later. The time needed to transfer 10 GB of data is about 3 times higher than it should be. I tested with 2.6.18-162.el5, 2.6.18-162.el5.ivtest.1 and 2.6.18-160, and they all show the same issue, while 2.6.18-160.invoca1.el5 performs fine.

Actual results (running 2.6.18-162.el5.ivtest.1):
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 677.99 seconds, 15.8 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 512.643 seconds, 20.9 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 478.893 seconds, 22.4 MB/s

Expected results (running 2.6.18-160.invoca1.el5):
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 146.771 seconds, 73.2 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 146.289 seconds, 73.4 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 143.643 seconds, 74.8 MB/s
I know the bug I posted was not about performance but about a dead link, but the two seem related: with my patch in 2.6.18-160.invoca1.el5, both the speed and the "dead link" behavior are fine.
I can confirm that both issues are related. I just got a call from one of our users, on whose machine I had installed the 2.6.18-162.el5.ivtest.1 kernel; the following logs showed up today:

Aug 17 08:08:13 dhcp-1-149 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 17 08:08:13 dhcp-1-149 kernel: r8169: eth0: link up
Aug 17 08:11:26 dhcp-1-149 dhclient: DHCPREQUEST on eth0 to 192.168.1.10 port 67
Aug 17 08:18:31 dhcp-1-149 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 17 08:18:31 dhcp-1-149 kernel: r8169: eth0: link up
Aug 17 13:17:18 dhcp-1-149 dhclient: DHCPREQUEST on eth0 to 192.168.1.10 port 67
Aug 17 15:03:01 dhcp-1-149 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 17 15:03:01 dhcp-1-149 kernel: r8169: eth0: link up

BTW, my description of the "dead link" is not always accurate: sometimes the link only slows down rather than going completely dead. Maybe that is what happened in my tests shown above. Note that exactly the same happens with the unpatched 2.6.18-162.el5.
Simon, there are new packages (2.6.18-164...) at:
http://people.redhat.com/ivecera/rhel-5-ivtest/

Could you please test them?
Hi Ivan, 2.6.18-164.el5.ivtest.1 works fine. It shows exactly the same behavior as my own patched 2.6.18-160.invoca1.el5.

[root@client140 ~]# uname -a
Linux client140.bi.corp.invoca.ch 2.6.18-164.el5.ivtest.1 #1 SMP Mon Aug 24 11:18:49 EDT 2009 i686 i686 i386 GNU/Linux
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 146.228 seconds, 73.4 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 149.719 seconds, 71.7 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 148.641 seconds, 72.2 MB/s

I hope this patch will make it into 5.4 as well as the current 5.3.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
in kernel-2.6.18-169.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
kernel-2.6.18-169.el5 performs well in my tests:

[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 148.176 seconds, 72.5 MB/s
*** Bug 521132 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help resolve the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
I can confirm this bug on Fedora 18 testing. I fixed it by adding clocksource=acpi_pm to the kernel command line in the GRUB bootloader. For me it seems to have been an AMD PowerNow timing issue between power management and Linux. Lance