Bug 666866 - Heavy load on ath5k wireless device makes system unresponsive
Summary: Heavy load on ath5k wireless device makes system unresponsive
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Stanislaw Gruszka
QA Contact: Desktop QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-01-03 14:35 UTC by Simon Matter
Modified: 2013-01-10 08:15 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-21 10:29:12 UTC
Target Upstream Version:
Embargoed:


Attachments
ath5k: disable ASPM L0s for all cards (1.38 KB, patch)
2011-01-03 14:35 UTC, Simon Matter
lspci unpatched (16.43 KB, text/plain)
2011-02-17 13:01 UTC, Simon Matter
lspci patched (16.42 KB, text/plain)
2011-02-17 13:01 UTC, Simon Matter
/0001-ath5k-disable-ASPM-L0s-for-all-cards.patch (4.60 KB, text/plain)
2011-05-04 14:23 UTC, Stanislaw Gruszka
test_ath5k_aspm.patch (1.87 KB, text/plain)
2011-05-04 14:32 UTC, Stanislaw Gruszka


Links
System: Red Hat Product Errata
ID: RHSA-2011:1065
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update
Last Updated: 2011-07-21 09:21:37 UTC

Description Simon Matter 2011-01-03 14:35:25 UTC
Created attachment 471488
ath5k: disable ASPM L0s for all cards

Description of problem:
Heavy load on PCIe-based wireless devices driven by ath5k makes the system unresponsive. The kernel does not actually crash; it floods the kernel log with messages like the ones below, and power-cycling the machine seems to be the only way out.

KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1193)
recvmsg bug: copied E1096BA0 seq E1097148

Version-Release number of selected component (if applicable):
kernel-2.6.18-238.el5.i686
(this is an EL5 development kernel taken from http://people.redhat.com/jwilson/el5/)

How reproducible:
Just use the wireless device and put some heavy traffic on it. After some seconds or minutes the problem shows up.

Steps to Reproduce:
1. Run something like rsync over the wireless link
Actual results:
The current TCP connection stalls and the kernel log is flooded with messages like this:

KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1193)
recvmsg bug: copied E1096BA0 seq E1097148

Expected results:
No errors should show up and rsync should finish without problems.

Additional info:
After searching for a long time without success, I came across http://wireless.kernel.org/en/users/Drivers/ath5k and saw this: "Disable ASPM L0s for all cards: This fixes problems with PCI-E cards, especially on Acer Aspire One."
I then backported the patch to the RHEL5 kernel, and it works fine for me.

Comment 1 RHEL Program Management 2011-02-01 16:59:46 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 11 Michal Schmidt 2011-02-17 11:56:06 UTC
Interesting. I thought in RHEL5 we did not enable ASPM at all.

Simon, would you attach the output of "lspci -nnvvv" with the unpatched kernel? Thanks.

Comment 12 Simon Matter 2011-02-17 13:00:47 UTC
OK, I did it with both kernels. The unpatched one is http://people.redhat.com/jwilson/el5/238.el5/i686/kernel-2.6.18-238.el5.i686.rpm and the patched one is my own build with the mentioned patch. See here:

[simix@wurro ~]$ diff 2.6.18-238.el5 2.6.18-238.invoca1.el5 -Nau
--- 2.6.18-238.el5      2011-02-17 13:58:56.000000000 +0100
+++ 2.6.18-238.invoca1.el5      2011-02-17 13:58:56.000000000 +0100
@@ -142,7 +142,7 @@
                Device: MaxPayload 128 bytes, MaxReadReq 128 bytes
                Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s L1, Port 3
                Link: Latency L0s <256ns, L1 <4us
-               Link: ASPM L0s L1 Enabled RCB 64 bytes CommClk+ ExtSynch-
+               Link: ASPM L1 Enabled RCB 64 bytes CommClk+ ExtSynch-
                Link: Speed 2.5Gb/s, Width x1
                Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug+ Surpise+
                Slot: Number 2, PowerLimit 6.500000
@@ -332,7 +332,7 @@
                Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
                Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s L1, Port 0
                Link: Latency L0s <512ns, L1 <64us
-               Link: ASPM L0s L1 Enabled RCB 128 bytes CommClk+ ExtSynch-
+               Link: ASPM L1 Enabled RCB 128 bytes CommClk+ ExtSynch-
                Link: Speed 2.5Gb/s, Width x1
        Capabilities: [90] MSI-X: Enable- Mask- TabSize=1
                Vector table: BAR=0 offset=00000000

Regards,
Simon
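
The check shown in the diff above can also be automated. The following is a minimal sketch, assuming the older lspci output format seen in this report ("Link: ASPM ... Enabled"); newer lspci versions label the line differently, so the pattern would need adjusting there. The function name `devices_with_l0s` and the sample text are hypothetical, made up for illustration.

```python
import re

# Matches the link-control line in the format shown in the diff above, e.g.
#   "Link: ASPM L0s L1 Enabled RCB 64 bytes CommClk+ ExtSynch-"
# It does not match the "Link: Supported Speed ..." capability line.
ASPM_LINE = re.compile(r"^\s+Link: ASPM (?P<states>.+?) Enabled\b")


def devices_with_l0s(lspci_text):
    """Return the header lines of devices whose link has ASPM L0s enabled."""
    hits = []
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            current = line  # unindented lines start a new device entry
            continue
        m = ASPM_LINE.match(line)
        if m and "L0s" in m.group("states").split():
            hits.append(current)
    return hits


# Made-up sample in the same layout as "lspci -nnvvv" output.
sample = """\
00:1c.2 PCI bridge [0604]: Example bridge
\t\tLink: Supported Speed 2.5Gb/s, Width x1, ASPM L0s L1, Port 3
\t\tLink: ASPM L0s L1 Enabled RCB 64 bytes CommClk+ ExtSynch-
03:00.0 Ethernet controller [0200]: Example wireless card
\t\tLink: ASPM L1 Enabled RCB 128 bytes CommClk+ ExtSynch-
"""

print(devices_with_l0s(sample))
```

Run against the saved "lspci unpatched" and "lspci patched" outputs, a script like this should list the affected devices only for the unpatched kernel.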

Comment 13 Simon Matter 2011-02-17 13:01:21 UTC
Created attachment 479316
lspci unpatched

Comment 14 Simon Matter 2011-02-17 13:01:45 UTC
Created attachment 479317
lspci patched

Comment 15 Michal Schmidt 2011-02-17 13:45:32 UTC
Matthew Garrett tells me that the RHEL5 kernel never enables ASPM. In this case the BIOS must have done it.

So a patch like the proposed one is indeed necessary. I'd only suggest a cleanup. Instead of copying the code from e1000e, it should be put into a common function. Stanislaw will do that.

Comment 16 Stanislaw Gruszka 2011-02-18 07:32:56 UTC
If ASPM can be enabled by BIOS, should we disable it explicitly also in RHEL5 on the same drivers we did it in RHEL6 (i.e. ath5k, ath9k, iwlwifi, r8169, aacraid ...)? Matthew, your opinion?

Comment 17 Matthew Garrett 2011-02-18 13:08:20 UTC
In general the BIOS won't set up ASPM modes that break - this case seems to be an exception. I think we could get away with just doing ath5k.

Comment 23 Stanislaw Gruszka 2011-05-04 14:23:19 UTC
Created attachment 496804
/0001-ath5k-disable-ASPM-L0s-for-all-cards.patch

This patch is slightly different from the one that was proposed. It checks whether the device is a PCIe one, and adds the comments that were in the upstream commit.
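
For readers unfamiliar with what "disable ASPM L0s" means at the register level, here is a minimal user-space sketch. It is not the attached patch, which is kernel C; it only illustrates the two steps described above on a copy of PCI configuration space: find the PCI Express capability (the "is this a PCIe device" check), then clear bit 0 of its Link Control register. The offsets follow the standard PCI layout; the function names, the `ASPM_L0S`/`ASPM_L1` names, and the demo buffer are hypothetical, and the buffer is assumed to hold a full 256 bytes.

```python
import struct

PCI_STATUS = 0x06            # status register offset
PCI_STATUS_CAP_LIST = 0x10   # status bit: a capability list is present
PCI_CAPABILITY_LIST = 0x34   # offset of the pointer to the first capability
PCI_CAP_ID_EXP = 0x10        # capability ID of PCI Express
PCI_EXP_LNKCTL = 0x10        # Link Control offset inside the PCIe capability
ASPM_L0S = 0x1               # Link Control bit 0: L0s enabled
ASPM_L1 = 0x2                # Link Control bit 1: L1 enabled


def find_pcie_cap(cfg):
    """Return the offset of the PCIe capability, or None if there is none."""
    (status,) = struct.unpack_from("<H", cfg, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()
    while pos and pos not in seen:  # guard against a looping list
        seen.add(pos)
        if cfg[pos] == PCI_CAP_ID_EXP:
            return pos
        pos = cfg[pos + 1] & 0xFC   # follow the "next capability" pointer
    return None


def clear_l0s(cfg):
    """Return a copy of cfg with ASPM L0s cleared; unchanged if not PCIe."""
    cfg = bytearray(cfg)
    pos = find_pcie_cap(cfg)
    if pos is None:
        return bytes(cfg)
    (lnkctl,) = struct.unpack_from("<H", cfg, pos + PCI_EXP_LNKCTL)
    struct.pack_into("<H", cfg, pos + PCI_EXP_LNKCTL, lnkctl & ~ASPM_L0S & 0xFFFF)
    return bytes(cfg)


def make_demo_config():
    """Build a made-up config space with a PCIe capability at 0x60."""
    cfg = bytearray(256)
    struct.pack_into("<H", cfg, PCI_STATUS, PCI_STATUS_CAP_LIST)
    cfg[PCI_CAPABILITY_LIST] = 0x40
    cfg[0x40], cfg[0x41] = 0x01, 0x60            # some other capability, next at 0x60
    cfg[0x60], cfg[0x61] = PCI_CAP_ID_EXP, 0x00  # PCIe capability, end of list
    # L0s and L1 both enabled, plus the common-clock bit (0x40)
    struct.pack_into("<H", cfg, 0x60 + PCI_EXP_LNKCTL, 0x0040 | ASPM_L0S | ASPM_L1)
    return bytes(cfg)


before = make_demo_config()
after = clear_l0s(before)
print(hex(struct.unpack_from("<H", before, 0x70)[0]),
      hex(struct.unpack_from("<H", after, 0x70)[0]))
```

In the demo, the Link Control value goes from L0s and L1 enabled to L1 only, which corresponds to the "ASPM L0s L1 Enabled" to "ASPM L1 Enabled" change in the lspci diff from comment 12.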

Comment 24 Stanislaw Gruszka 2011-05-04 14:30:05 UTC
(In reply to comment #15)
> So a patch like the proposed one is indeed necessary. I'd only suggest a
> cleanup. Instead of copying the code from e1000e, it should be put into a
> common function. Stanislaw will do that.

Hmm, I said that, but I changed my mind; now I think it's not worth doing so ...

Comment 25 Stanislaw Gruszka 2011-05-04 14:32:06 UTC
Created attachment 496807
test_ath5k_aspm.patch

As my BIOS does not enable ASPM, I was using this for testing.

Comment 27 Jarod Wilson 2011-05-13 22:19:11 UTC
Patch(es) available in kernel-2.6.18-261.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 29 Simon Matter 2011-05-16 06:47:36 UTC
Somehow kernel-2.6.18-261.el5 is not available in http://people.redhat.com/jwilson/el5, maybe a missing sync?

Comment 30 Simon Matter 2011-05-20 04:32:43 UTC
I can confirm that kernel-2.6.18-261.el5 works for me as expected.

Comment 32 Vasiliy Sharapov 2011-06-27 18:21:52 UTC
Couldn't reproduce the bug. We have a machine with the same PCI device but a different subsystem. The bug does not appear in this configuration. Our device configuration doesn't have
Capabilities: [60] Express Legacy Endpoint IRQ 0
And high volume transfers (8GiB) over various protocols (scp, http (wget), and rsync) are all fine. I never got kernel activity in dmesg throughout this.

Setting verified to SanityOnly and Customer.

Comment 33 errata-xmlrpc 2011-07-21 10:29:12 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

