Created attachment 471488 [details]
ath5k: disable ASPM L0s for all cards

Description of problem:
Heavy load on PCIe-based wireless devices driven by ath5k makes the system unresponsive. The kernel does not actually crash, but it floods the kernel log with messages like the ones below, and power-cycling the machine seems to be the only way out.

KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1193)
recvmsg bug: copied E1096BA0 seq E1097148

Version-Release number of selected component (if applicable):
kernel-2.6.18-238.el5.i686 (this is an EL5 development kernel taken from http://people.redhat.com/jwilson/el5/)

How reproducible:
Just use the wireless device and put heavy traffic on it. After some seconds or minutes the problem shows up.

Steps to Reproduce:
1. Run something like rsync over the wireless link

Actual results:
The current TCP connection stalls and the kernel log is flooded with messages like this:

KERNEL: assertion (flags & MSG_PEEK) failed at net/ipv4/tcp.c (1193)
recvmsg bug: copied E1096BA0 seq E1097148

Expected results:
No errors should show up and rsync should finish without problems.

Additional info:
After a long and unsuccessful search I came across http://wireless.kernel.org/en/users/Drivers/ath5k and saw this:

"Disable ASPM L0s for all cards: This fixes problems with PCI-E cards, especially on Acer Aspire One."

I then backported the patch to the RHEL5 kernel and it works fine for me.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Interesting. I thought in RHEL5 we did not enable ASPM at all. Simon, would you attach the output of "lspci -nnvvv" with the unpatched kernel? Thanks.
OK, I did it with both kernels. The unpatched one is http://people.redhat.com/jwilson/el5/238.el5/i686/kernel-2.6.18-238.el5.i686.rpm. The patched one is my own build with the mentioned patch. See here:

[simix@wurro ~]$ diff 2.6.18-238.el5 2.6.18-238.invoca1.el5 -Nau
--- 2.6.18-238.el5 2011-02-17 13:58:56.000000000 +0100
+++ 2.6.18-238.invoca1.el5 2011-02-17 13:58:56.000000000 +0100
@@ -142,7 +142,7 @@
         Device: MaxPayload 128 bytes, MaxReadReq 128 bytes
         Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s L1, Port 3
         Link: Latency L0s <256ns, L1 <4us
-        Link: ASPM L0s L1 Enabled RCB 64 bytes CommClk+ ExtSynch-
+        Link: ASPM L1 Enabled RCB 64 bytes CommClk+ ExtSynch-
         Link: Speed 2.5Gb/s, Width x1
         Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug+ Surpise+
         Slot: Number 2, PowerLimit 6.500000
@@ -332,7 +332,7 @@
         Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
         Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s L1, Port 0
         Link: Latency L0s <512ns, L1 <64us
-        Link: ASPM L0s L1 Enabled RCB 128 bytes CommClk+ ExtSynch-
+        Link: ASPM L1 Enabled RCB 128 bytes CommClk+ ExtSynch-
         Link: Speed 2.5Gb/s, Width x1
     Capabilities: [90] MSI-X: Enable- Mask- TabSize=1
         Vector table: BAR=0 offset=00000000

Regards,
Simon
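For anyone who wants to check a machine for the same condition without diffing by hand, here is a minimal sketch that pulls the enabled ASPM states out of "lspci -vv" text. It assumes the older pciutils output format quoted in this comment ("Link: ASPM ... RCB ..."); newer pciutils releases print this line differently, so the pattern would need adjusting there. The function name and the sample text are illustrative only.

```python
import re

# Matches the link-control line as printed in the lspci output quoted above,
# e.g. "Link: ASPM L0s L1 Enabled RCB 64 bytes CommClk+ ExtSynch-".
# The "Link: Supported ..." line does not match because of the fixed prefix.
ASPM_CTL = re.compile(r"^\s*Link: ASPM (.+?) RCB ")


def enabled_aspm_states(lspci_text):
    """Return one set of enabled ASPM states per link-control line found."""
    results = []
    for line in lspci_text.splitlines():
        match = ASPM_CTL.match(line)
        if not match:
            continue
        words = match.group(1).split()
        # "Disabled" and "Enabled" are dropped, leaving only the state names.
        results.append({w for w in words if w in ("L0s", "L1")})
    return results


# Illustrative sample; the first three lines are taken from the diff above.
sample = """\
        Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s L1, Port 3
        Link: Latency L0s <256ns, L1 <4us
        Link: ASPM L0s L1 Enabled RCB 64 bytes CommClk+ ExtSynch-
        Link: ASPM L1 Enabled RCB 128 bytes CommClk+ ExtSynch-
        Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
"""
states = enabled_aspm_states(sample)
for index, state in enumerate(states):
    print(index, sorted(state) or "disabled")
```

Any entry that still contains "L0s" on the ath5k device or its upstream port would indicate that the link is in the configuration this bug is about.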
Created attachment 479316 [details] lspci unpatched
Created attachment 479317 [details] lspci patched
Matthew Garrett tells me that the RHEL5 kernel never enables ASPM. In this case the BIOS must have done it. So a patch like the proposed one is indeed necessary. I'd only suggest a cleanup. Instead of copying the code from e1000e, it should be put into a common function. Stanislaw will do that.
If ASPM can be enabled by BIOS, should we disable it explicitly also in RHEL5 on the same drivers we did it in RHEL6 (i.e. ath5k, ath9k, iwlwifi, r8169, aacraid ...)? Matthew, your opinion?
In general the BIOS won't set up ASPM modes that break - this case seems to be an exception. I think we could get away with just doing ath5k.
Created attachment 496804 [details]
/0001-ath5k-disable-ASPM-L0s-for-all-cards.patch

This is slightly different from the patch that was originally proposed. It checks whether the device is a PCIe device, and it adds the comments that were in the upstream commit.
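The attachment itself is not quoted in this thread, so the following is only a sketch of the register arithmetic such a patch has to perform, not the patch. It assumes the fix clears the L0s bit of the ASPM Control field in the PCI Express Link Control register (offset 0x10 inside the PCIe capability structure) and leaves everything else alone. The constant and function names are hypothetical, and the sample values 0x43 and 0x4b are inferred from the lspci flags in the earlier diff (ASPM L0s L1, RCB 64 or 128 bytes, CommClk+, ExtSynch-), assuming all other bits are zero.

```python
# Bit layout of the 16-bit PCI Express Link Control register.
LNKCTL_ASPM_L0S = 0x0001      # ASPM Control bit 0: L0s entry enabled
LNKCTL_ASPM_L1 = 0x0002       # ASPM Control bit 1: L1 entry enabled
LNKCTL_RCB_128 = 0x0008       # Read Completion Boundary: set means 128 bytes
LNKCTL_COMMON_CLOCK = 0x0040  # Common Clock Configuration (lspci "CommClk+")


def disable_l0s(lnkctl):
    """Return the register value with only the L0s enable bit cleared."""
    return lnkctl & ~LNKCTL_ASPM_L0S & 0xFFFF


def describe_aspm(lnkctl):
    """Render the ASPM Control field the way the quoted lspci output does."""
    states = []
    if lnkctl & LNKCTL_ASPM_L0S:
        states.append("L0s")
    if lnkctl & LNKCTL_ASPM_L1:
        states.append("L1")
    return (" ".join(states) + " Enabled") if states else "Disabled"


# Assumed register values for the two links shown in the lspci diff.
root_port = LNKCTL_ASPM_L0S | LNKCTL_ASPM_L1 | LNKCTL_COMMON_CLOCK  # 0x43
endpoint = root_port | LNKCTL_RCB_128                               # 0x4b

for value in (root_port, endpoint):
    patched = disable_l0s(value)
    print(hex(value), describe_aspm(value), "->", hex(patched), describe_aspm(patched))
```

In a driver, the same mask would be applied with a config-space read, modify, and write on the device, and the lspci diff suggests on its upstream port as well, since L1 stays enabled on both ends.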
(In reply to comment #15)
> So a patch like the proposed one is indeed necessary. I'd only suggest a
> cleanup. Instead of copying the code from e1000e, it should be put into a
> common function. Stanislaw will do that.

Hmm, I said that, but I have changed my mind; now I think it's not worth doing so ...
Created attachment 496807 [details]
test_ath5k_aspm.patch

As my BIOS does not enable ASPM, I was using this for testing.
Patch(es) available in kernel-2.6.18-261.el5.
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
Somehow kernel-2.6.18-261.el5 is not available in http://people.redhat.com/jwilson/el5, maybe a missing sync?
I can confirm that kernel-2.6.18-261.el5 works for me as expected.
Couldn't reproduce the bug. We have a machine with the same PCI device but a different subsystem, and the bug does not appear in this configuration. Our device configuration doesn't have this capability:

Capabilities: [60] Express Legacy Endpoint IRQ 0

High-volume transfers (8 GiB) over various protocols (scp, HTTP via wget, and rsync) all complete fine. I never saw any kernel messages in dmesg throughout this.

Setting verified to SanityOnly and Customer.
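Whether a given card can be affected depends on it exposing a PCI Express capability at all. A minimal sketch for checking that from one device's "lspci -vv" block follows; the function name is illustrative, and the pattern assumes the "Capabilities: [xx] Express ..." wording quoted in this thread.

```python
import re

# Matches e.g. "Capabilities: [60] Express Legacy Endpoint IRQ 0".
EXPRESS_CAP = re.compile(r"^\s*Capabilities: \[[0-9a-fA-F]+\] Express\b")


def has_express_capability(device_block):
    """Return True if any line of the lspci device block lists a PCIe capability."""
    return any(EXPRESS_CAP.match(line) for line in device_block.splitlines())


with_pcie = "    Capabilities: [60] Express Legacy Endpoint IRQ 0\n"
without_pcie = "    Capabilities: [90] MSI-X: Enable- Mask- TabSize=1\n"
print(has_express_capability(with_pcie), has_express_capability(without_pcie))
```

A device for which this returns False has no Link Control register, so ASPM L0s cannot be enabled on it.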
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html