514589 – r8169 stopping all activity until the link is reset

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.4
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Ivan Vecera
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	521132
Depends On:
Blocks:	529366 533192

Reported:	2009-07-29 18:39 UTC by Simon Matter
Modified:	2014-07-01 13:39 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-03-30 06:53:44 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
avoid dead link on r8169 (3.62 KB, patch) 2009-07-29 20:13 UTC, Simon Matter

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2010:0178	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update	2010-03-29 12:18:21 UTC

Description Simon Matter 2009-07-29 18:39:28 UTC

Description of problem:

Network connections which use the r8169 driver are not reliable. It may work for some, but it is unusable for us. We switched from Realtek's r8168 driver, which worked fine, to the built-in r8169 of the 5.4 beta kernels and have had this issue ever since, as described here: http://patchwork.kernel.org/patch/37934/

The 8169 chip only generates MSI interrupts when all enabled event
sources are quiescent and one or more sources transition to active. If
not all of the active events are acknowledged, or a new event becomes
active while the existing ones are cleared in the handler, we will not
see a new interrupt.

The current interrupt handler masks off the Rx and Tx events once the
NAPI handler has been scheduled, which opens a race window in which we
can get another Rx or Tx event and never ACK'ing it, stopping all
activity until the link is reset (ifconfig down/up). Fix this by always
ACK'ing all event sources, and loop in the handler until we have all
sources quiescent.
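
The race described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not the driver code or the attached patch: the class `ToyNic`, the bit values and both handler functions are hypothetical names invented for the illustration. The model raises an interrupt only when an event arrives while the status register is fully quiescent, and it compares a handler that acknowledges a single snapshot with one that loops until nothing is pending.

```python
RX_EVENT = 0x01  # bit values are arbitrary for this sketch
TX_EVENT = 0x04


class ToyNic:
    """Status register that interrupts only on a quiescent -> active edge."""

    def __init__(self, handler, racing_events=()):
        self.status = 0
        self.interrupts = 0
        self.handler = handler
        # Events that will arrive while the handler runs (the race window).
        self.racing_events = list(racing_events)

    def raise_event(self, bit):
        was_quiescent = self.status == 0
        self.status |= bit
        if was_quiescent:
            # Only this transition produces an interrupt.
            self.interrupts += 1
            self.handler(self)

    def event_during_handler(self):
        # Deliver one queued event in the middle of the handler, if any.
        if self.racing_events:
            self.raise_event(self.racing_events.pop(0))

    def ack(self, bits):
        self.status &= ~bits


def single_pass_handler(nic):
    """Acknowledge only what was seen at entry; a late event stays pending."""
    seen = nic.status
    nic.event_during_handler()
    nic.ack(seen)


def looping_handler(nic):
    """Keep acknowledging until every source is quiescent again."""
    while nic.status:
        seen = nic.status
        nic.event_during_handler()
        nic.ack(seen)


def interrupts_delivered(handler):
    """Send four Rx events; a Tx event races with the first handler run."""
    nic = ToyNic(handler, racing_events=[TX_EVENT])
    for _ in range(4):
        nic.raise_event(RX_EVENT)
    return nic.interrupts


if __name__ == "__main__":
    print("single pass:", interrupts_delivered(single_pass_handler))
    print("looping:    ", interrupts_delivered(looping_handler))
```

In this model the single-pass handler leaves the Tx bit set, so the register never returns to the quiescent state and no further interrupt is generated, which corresponds to the stall until the link is reset. The looping handler drains the late event and later Rx events interrupt normally.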

Version-Release number of selected component (if applicable):
kernel-2.6.18-160.el5

How reproducible:
Just let a box run for many hours and move some data over the net. In our case the dead link happens about once a week.

Steps to Reproduce:
1. configure an r8169 network adapter
2. start working with it
3. work for many hours

Actual results:
Jul 28 23:23:35 nx-08 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jul 28 23:23:35 nx-08 kernel: r8169: eth0: link up
Jul 28 23:25:46 nx-08 kernel: r8169: eth0: link down
Jul 28 23:25:49 nx-08 kernel: r8169: eth0: link up
Jul 28 23:25:52 nx-08 kernel: r8169: eth0: link down
Jul 28 23:25:57 nx-08 kernel: r8169: eth0: link up

Expected results:
It should work.

Additional info:
It seems to affect only certain chips on certain hardware and link speed seems to have an effect too. If I understand correctly then it's no surprise.

I have backported a patch from 2.6.30.3 which hopefully fixes it. I'm just rebuilding RPMs to test it. If it runs I'll post it here. Verifying that it really works may take some days.

Comment 1 Simon Matter 2009-07-29 20:13:37 UTC

Created attachment 355605
avoid dead link on r8169

The patched kernel works but I can not yet confirm that the bug is gone because it doesn't happen very often in my case.

Comment 2 Simon Matter 2009-08-03 15:39:42 UTC

I'd like to confirm that I didn't see any error again with this patch after moving some TB of data through it on my test box. Also, another computer which had shown errors almost daily has not shown any errors since the new kernel was installed 4 days ago. That was a real show stopper for us on all RTL8168 NICs, which are widely used on Atom-based systems these days.

Comment 3 Ivan Vecera 2009-08-13 18:30:13 UTC

Packages are located at:
http://people.redhat.com/ivecera/rhel-5-ivtest/

Simon, could you please test them?

Comment 4 Simon Matter 2009-08-13 19:43:36 UTC

Any chance you could post an i686 build there?
The boxes in question are Atom N270 based and we run them on 32bit (I'm not even sure they could run x86_64).

Regards,
Simon

Comment 5 Ivan Vecera 2009-08-14 07:52:49 UTC

No problem Simon, I will post it ASAP.

Comment 6 Ivan Vecera 2009-08-14 10:55:29 UTC

Simon, i686 packages are also there. Could you please test them?

Comment 7 Simon Matter 2009-08-17 09:44:05 UTC

Ivan, it doesn't seem to work. I tried to make the link stop by sending a large amount of data through it. While the speed is usually as expected, the transfer stops after some activity and resumes later. The time needed to transfer 10G of data is ~3 times higher than it should be.
I have tested with 2.6.18-162.el5, 2.6.18-162.el5.ivtest.1 and 2.6.18-160 and they all show the same issue, while 2.6.18-160.invoca1.el5 performs fine.

Actual results:
(running 2.6.18-162.el5.ivtest.1)
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 677.99 seconds, 15.8 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 512.643 seconds, 20.9 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 478.893 seconds, 22.4 MB/s


Expected results:
(running 2.6.18-160.invoca1.el5)
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 146.771 seconds, 73.2 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 146.289 seconds, 73.4 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 143.643 seconds, 74.8 MB/s
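
The comparison between the two kernels can be checked directly from the dd summary lines above. The following is a minimal sketch, assuming the summary format shown in this comment; the helper names (`parse_dd_summary`, `throughput_mb_s`, `mean_seconds`) are made up for the example. It recomputes the throughput from bytes and seconds (dd counts 1 MB as 10^6 bytes here) and the ratio of the mean transfer times.

```python
import re

# Matches the summary line dd prints, e.g.
# "10737418240 bytes (11 GB) copied, 677.99 seconds, 15.8 MB/s"
DD_SUMMARY = re.compile(r"(\d+) bytes \(.*?\) copied, ([\d.]+) seconds")


def parse_dd_summary(line):
    """Return (bytes, seconds) from one dd summary line."""
    match = DD_SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a dd summary line: {line!r}")
    return int(match.group(1)), float(match.group(2))


def throughput_mb_s(line):
    """Throughput in MB/s, with 1 MB = 10**6 bytes as dd reports it."""
    nbytes, seconds = parse_dd_summary(line)
    return nbytes / seconds / 1e6


def mean_seconds(lines):
    """Mean elapsed time over several dd runs."""
    times = [parse_dd_summary(line)[1] for line in lines]
    return sum(times) / len(times)


# Summary lines copied from the runs on 2.6.18-162.el5.ivtest.1.
SLOW_RUNS = [
    "10737418240 bytes (11 GB) copied, 677.99 seconds, 15.8 MB/s",
    "10737418240 bytes (11 GB) copied, 512.643 seconds, 20.9 MB/s",
    "10737418240 bytes (11 GB) copied, 478.893 seconds, 22.4 MB/s",
]

# Summary lines copied from the runs on 2.6.18-160.invoca1.el5.
FAST_RUNS = [
    "10737418240 bytes (11 GB) copied, 146.771 seconds, 73.2 MB/s",
    "10737418240 bytes (11 GB) copied, 146.289 seconds, 73.4 MB/s",
    "10737418240 bytes (11 GB) copied, 143.643 seconds, 74.8 MB/s",
]

if __name__ == "__main__":
    for line in SLOW_RUNS + FAST_RUNS:
        print(f"{throughput_mb_s(line):.1f} MB/s")
    ratio = mean_seconds(SLOW_RUNS) / mean_seconds(FAST_RUNS)
    print(f"mean slowdown: {ratio:.2f}x")
```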

Comment 8 Simon Matter 2009-08-17 09:47:14 UTC

I know the bug I posted was not about performance but about a dead link. But the two seem related, because with my patch in 2.6.18-160.invoca1.el5 both the speed and the "dead link" problem are fine.

Comment 9 Simon Matter 2009-08-17 13:40:11 UTC

I can confirm that both issues are related. Just got a call from one of our users where I installed the 2.6.18-162.el5.ivtest.1 kernel and the following logs showed up today:

Aug 17 08:08:13 dhcp-1-149 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 17 08:08:13 dhcp-1-149 kernel: r8169: eth0: link up
Aug 17 08:11:26 dhcp-1-149 dhclient: DHCPREQUEST on eth0 to 192.168.1.10 port 67
Aug 17 08:18:31 dhcp-1-149 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 17 08:18:31 dhcp-1-149 kernel: r8169: eth0: link up
Aug 17 13:17:18 dhcp-1-149 dhclient: DHCPREQUEST on eth0 to 192.168.1.10 port 67
Aug 17 15:03:01 dhcp-1-149 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 17 15:03:01 dhcp-1-149 kernel: r8169: eth0: link up

BTW, my description of the "dead link" is not always correct. Sometimes the link slows down but does not go completely dead. Maybe that's what happened in my tests shown above. Note that exactly the same happens with the unpatched 2.6.18-162.el5.
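
For anyone who wants to check their own machines for the same symptom, the watchdog messages can be pulled out of a syslog file with a short script. This is a sketch only: the regular expression assumes the traditional syslog line layout shown in the log excerpt above, and the function name `find_tx_timeouts` is invented for the example.

```python
import re

# Traditional syslog layout: "Mon DD HH:MM:SS host kernel: message".
WATCHDOG = re.compile(
    r"^(\w{3} +\d+ [\d:]{8}) (\S+) kernel: "
    r"NETDEV WATCHDOG: (\S+): transmit timed out"
)


def find_tx_timeouts(lines):
    """Return (timestamp, host, interface) for each watchdog timeout line."""
    hits = []
    for line in lines:
        match = WATCHDOG.match(line)
        if match:
            hits.append((match.group(1), match.group(2), match.group(3)))
    return hits


# Log lines copied from the excerpt in this comment.
SAMPLE_LOG = """\
Aug 17 08:08:13 dhcp-1-149 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 17 08:08:13 dhcp-1-149 kernel: r8169: eth0: link up
Aug 17 08:11:26 dhcp-1-149 dhclient: DHCPREQUEST on eth0 to 192.168.1.10 port 67
Aug 17 08:18:31 dhcp-1-149 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 17 08:18:31 dhcp-1-149 kernel: r8169: eth0: link up
Aug 17 13:17:18 dhcp-1-149 dhclient: DHCPREQUEST on eth0 to 192.168.1.10 port 67
Aug 17 15:03:01 dhcp-1-149 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 17 15:03:01 dhcp-1-149 kernel: r8169: eth0: link up
""".splitlines()

if __name__ == "__main__":
    for timestamp, host, interface in find_tx_timeouts(SAMPLE_LOG):
        print(f"{timestamp} {host} {interface}: transmit timed out")
```

To scan a real log, pass an open file object instead of `SAMPLE_LOG`; the path of the log file depends on the syslog configuration of the system.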

Comment 10 Ivan Vecera 2009-08-24 20:11:53 UTC

Simon, there are new packages (2.6.18-164...) at:
http://people.redhat.com/ivecera/rhel-5-ivtest/

Could you please test them?

Comment 11 Simon Matter 2009-08-25 07:58:07 UTC

Hi Ivan, 2.6.18-164.el5.ivtest.1 works fine. It shows exactly the same behavior as my own patched 2.6.18-160.invoca1.el5.

[root@client140 ~]# uname -a
Linux client140.bi.corp.invoca.ch 2.6.18-164.el5.ivtest.1 #1 SMP Mon Aug 24 11:18:49 EDT 2009 i686 i686 i386 GNU/Linux

[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 146.228 seconds, 73.4 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 149.719 seconds, 71.7 MB/s
[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 148.641 seconds, 72.2 MB/s

I hope this patch will make it into 5.4 as well as current 5.3.

Comment 12 RHEL Program Management 2009-08-27 06:53:23 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 16 RHEL Program Management 2009-09-25 17:37:01 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 17 Don Zickus 2009-10-13 16:09:11 UTC

in kernel-2.6.18-169.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 19 Simon Matter 2009-10-13 18:11:47 UTC

kernel-2.6.18-169.el5 performs well in my tests:

[root@client140 ~]# dd if=/dev/zero bs=1024k count=10240 > /dev/tcp/delta64/7777
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 148.176 seconds, 72.5 MB/s

Comment 22 Ivan Vecera 2009-11-24 19:01:49 UTC

*** Bug 521132 has been marked as a duplicate of this bug. ***

Comment 26 errata-xmlrpc 2010-03-30 06:53:44 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

Comment 27 Lance Lassetter 2013-05-24 13:54:46 UTC

I can confirm this bug on Fedora 18 testing. I fixed it by adding the following to the GRUB boot loader configuration: clocksource=acpi_pm

For me it seems to have been a timing issue involving AMD PowerNow power management under Linux.

Lance
