Bug 795141 - [ath5k] network gets stuck after several hours with 3.2.3+ kernel
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: John W. Linville
QA Contact: Fedora Extras Quality Assurance
Reported: 2012-02-19 17:40 UTC by Göran Uddeborg
Modified: 2012-10-29 15:53 UTC

Last Closed: 2012-10-29 15:53:52 UTC

Description Göran Uddeborg 2012-02-19 17:40:09 UTC
Description of problem:
I use an Atheros card in access point mode.  This worked without problems for a long time using the 3.1.7-1 kernel.  But shortly after upgrading to 3.2.3-2 the clients could no longer see the networks served by the host.  A look in messages showed:

  Feb 17 05:47:33 mimmi kernel: [324048.332647] ath5k phy0: gain calibration timeout (2462MHz)
  Feb 17 05:47:33 mimmi kernel: [324048.660453] ath5k phy0: calibration of channel 11 failed
  Feb 17 05:47:34 mimmi kernel: [324049.104485] ath5k phy0: gain calibration timeout (2462MHz)
  Feb 17 05:47:34 mimmi kernel: [324049.847293] ath5k phy0: gain calibration timeout (2462MHz)

And then the last message was repeated once or twice per second.
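
To quantify how often the message repeats, the log can be tallied per device and frequency. The following is a minimal sketch, not part of the original report: the function name `count_gain_timeouts` is hypothetical, and the regular expression is an assumption based only on the message formats quoted in this bug (with and without a colon after "ath5k").

```python
import re
from collections import Counter

# Matches both "ath5k phy0: gain calibration timeout (2462MHz)" and
# "ath5k: phy0: gain calibration timeout (2412MHz)" as quoted in this bug.
TIMEOUT_RE = re.compile(r"ath5k:? (phy\d+): gain calibration timeout \((\d+)MHz\)")


def count_gain_timeouts(lines):
    """Return a Counter keyed by (phy name, frequency in MHz)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# The four log lines from the description above.
sample = [
    "Feb 17 05:47:33 mimmi kernel: [324048.332647] ath5k phy0: gain calibration timeout (2462MHz)",
    "Feb 17 05:47:33 mimmi kernel: [324048.660453] ath5k phy0: calibration of channel 11 failed",
    "Feb 17 05:47:34 mimmi kernel: [324049.104485] ath5k phy0: gain calibration timeout (2462MHz)",
    "Feb 17 05:47:34 mimmi kernel: [324049.847293] ath5k phy0: gain calibration timeout (2462MHz)",
]

for (phy, mhz), n in count_gain_timeouts(sample).items():
    print(f"{phy} {mhz} MHz: {n} timeouts")
```

Feeding the function the lines of a real log file (for example an open file object) should work the same way, since it only iterates over strings.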

I also noticed that a kworker process was high in "top" while this was happening.

Rebooting the machine makes the problem go away for a while, but after some hours or a day or so it starts happening again.

I tried to do rmmod on the ath5k module and its immediate dependencies, and then loading it again, but I couldn't get the network to work again.  The only fix I have found is to reboot.

Between those two kernels I also ran 3.1.9-1 for a very short while.  I didn't see any problems with that kernel, but I used it for so short a time it might just have been luck.

I found bug 785951 which seemed very similar.  And that should be fixed in 3.2.5-3, so I tried to upgrade.  That gave me 3.2.6-3 which worked fine for 33 hours, but this morning the problem reappeared.


Version-Release number of selected component (if applicable):
kernel-3.2.6-3.fc16.x86_64
hostapd-0.7.3-2.fc15.x86_64


How reproducible:
It takes a while after a reboot before it happens.  I haven't figured out any pattern for how long the delay is.


Additional information:
05:06.0 Ethernet controller: Atheros Communications Inc. Atheros AR5001X+ Wireless Network Adapter (rev 01)

Comment 1 John W. Linville 2012-02-20 15:02:56 UTC
FWIW, the kernel in question is using compat-wireless-3.3-rc1-2 with the following patches applied:

Patch50101: mac80211-fix-debugfs-key-station-symlink.patch
Patch50102: brcmsmac-fix-tx-queue-flush-infinite-loop.patch
Patch50103: mac80211-Use-the-right-headroom-size-for-mesh-mgmt-f.patch
Patch50105: b43-add-option-to-avoid-duplicating-device-support-w.patch
Patch50106: mac80211-update-oper_channel-on-ibss-join.patch
Patch50107: mac80211-set-bss_conf.idle-when-vif-is-connected.patch
Patch50108: iwlwifi-fix-PCI-E-transport-inta-race.patch
Patch50109: bcma-Fix-mem-leak-in-bcma_bus_scan.patch
Patch50110: rt2800lib-fix-wrong-128dBm-when-signal-is-stronger-t.patch
Patch50111: iwlwifi-make-Tx-aggregation-enabled-on-ra-be-at-DEBU.patch
Patch50112: ssb-fix-cardbus-slot-in-hostmode.patch
Patch50113: iwlwifi-don-t-mess-up-QoS-counters-with-non-QoS-fram.patch
Patch50114: mac80211-timeout-a-single-frame-in-the-rx-reorder-bu.patch
Patch50115: ath9k-use-WARN_ON_ONCE-in-ath_rc_get_highest_rix.patch
Patch50116: mwifiex-handle-association-failure-case-correctly.patch
Patch50117: ath9k-Fix-kernel-panic-during-driver-initilization.patch
Patch50118: mwifiex-add-NULL-checks-in-driver-unload-path.patch
Patch50119: ath9k-fix-a-WEP-crypto-related-regression.patch
Patch50120: ath9k_hw-fix-a-RTS-CTS-timeout-regression.patch
Patch50121: bcma-don-t-fail-for-bad-SPROM-CRC.patch
Patch50122: zd1211rw-firmware-needs-duration_id-set-to-zero-for-.patch
Patch50123: mac80211-Fix-a-rwlock-bad-magic-bug.patch
Patch50124: rtlwifi-Modify-rtl_pci_init-to-return-0-on-success.patch

With that said, the "ath5k phy0: gain calibration timeout" issue has been around for a long time.  Is it possible that some other environmental condition has changed (e.g. the AP was moved somewhere that gets hotter)?

Comment 2 Göran Uddeborg 2012-02-20 20:33:55 UTC
The machine is standing where it has stood for many years.  It's winter here, and the room temperature hasn't changed noticeably.  So I can't see that anything in the environment has changed.

The host is used for some day-to-day purposes, both in some server roles, like serving as an AP, and as a desktop machine, so the usage pattern is not completely stable.  And I have done some system changes, like updating mysql packages.  But nothing that I can possibly imagine would affect the wireless code.

It did start very shortly after my switch to the 3.2.3-2 kernel, so I strongly suspect that to be the cause.  But if it would be of any help, I could reboot with the 3.1.7-1 kernel again, and make sure it doesn't happen with that.

(I hope there aren't any serious, remotely exploitable security fixes since 3.1.7-1.  This host has to be reachable from the outside.  I can trust its users, but it needs to resist attacks from the outside.)

"Sure" to a certain extent of course, since this doesn't happen immediately on reboot but with various delays.  As of this writing, the machine has been up 37 hours with the 3.2.6-3 kernel since last reboot, and so far it is working.  It remains to be seen for how long.  So if I try an older kernel again, I wouldn't really ever be completely sure the problem wasn't about to happen soon.

Comment 3 Nick Kossifidis 2012-02-20 21:08:04 UTC
Can you test this out?

https://dev.openwrt.org/browser/trunk/package/mac80211/patches/441-ath5k_no_agc_recalibration.patch?rev=30624

It seems that some cards/platforms don't like periodic gain calibration. So far we have had reports from embedded MIPS platforms (the same card worked OK on mipsel); this is the first on x86_64 (x86 seems to be OK with the same card, which doesn't make sense). What is different about this "gain calibration timeout" is that it happens on the same channel, not during a scan as in the other reports. Your report is also the first to indicate a calibration failure; the other reports indicated that the error occurred mostly during reset.

Initially it seemed to be related to stopping the tx queues (that's why you get resets on the same channel: the queues got stuck and the driver tries to reset the card), but it now seems we have to drop gain calibration from the periodic calibration. I'm waiting to get my hands on some docs from Atheros to confirm this, but it seems that gain calibration is only needed after a phy reset after all, not even after a fast channel switch (which was one of the reasons I added it there), so the above patch should do the trick. I'm still not sure about dropping the code for stopping the tx queues completely, since AR5210 by design does gain calibration periodically (no dynamic I/Q calibration there).

Comment 4 Nick Kossifidis 2012-02-20 21:14:05 UTC
Keep in mind that we have been trying to debug this "gain calibration timeout" thing for a long time and we still don't have a way to deal with it; it seems to be almost non-deterministic! The patch that added it to the periodic calibration was an experiment to see how it would work. I'm sorry it had the opposite result...

Comment 5 Göran Uddeborg 2012-02-20 22:05:16 UTC
I will build a kernel with your patch.  I'll be back when I've done it and have tried it for a little while.

Comment 6 Nick Kossifidis 2012-02-20 22:18:19 UTC
Try also this one, it removes the queue stopping...
https://dev.openwrt.org/browser/trunk/package/mac80211/patches/440-ath5k_calibrate_no_queue_stop.patch?rev=30623

BTW both patches come from Felix Fietkau from OpenWRT

Comment 7 Göran Uddeborg 2012-02-21 13:36:01 UTC
These patches are not made for the 3.2.6 kernel, are they?  I tried to rebuild the RPMs after adding the patches, but the build failed when it tried to apply the first one.  And looking at them manually, I can't find the code they try to patch.  The 440-ath5k_calibrate_no_queue_stop patch tries to patch a function called ath5k_calibrate_work, which doesn't even exist in the 3.2.6-3 sources.  The 441-ath5k_no_agc_recalibration patch modifies ath5k_hw_phy_calibrate, which does exist but looks very different from what the patch assumes.  I looked around a bit in case the code had just been moved around, but I couldn't find it.
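
A quick way to run the same check before attempting a build is to test whether the functions a patch touches occur in the target source at all. The sketch below is only an illustration under stated assumptions: `missing_symbols` is a hypothetical helper, and `source_text` is a made-up stand-in, not the real kernel source.

```python
import re


def missing_symbols(symbols, source_text):
    """Return the symbols that do not occur as whole words in source_text."""
    return [
        name
        for name in symbols
        if not re.search(r"\b" + re.escape(name) + r"\b", source_text)
    ]


# Stand-in for the driver source; in practice this would be read from the
# unpacked source tree. The content here is invented for the example.
source_text = "int ath5k_hw_phy_calibrate(void) { return 0; }\n"

patched_functions = ["ath5k_calibrate_work", "ath5k_hw_phy_calibrate"]
print(missing_symbols(patched_functions, source_text))
```

A symbol reported as missing suggests the patch targets a different code base; a symbol that is present can still differ in content, as was the case for ath5k_hw_phy_calibrate here.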

I haven't rebuilt the kernel RPMs before, and the SPEC file did look a bit different from that of a typical package.  So maybe I'm missing something here.  I could use a little hint on what it could be.

By the way, the 3.2.6-3 kernel does appear much less affected by this issue than the 3.2.3-2 kernel did.  The latter worked maybe a day, but with the former I've only seen this happen once so far.  And the second boot has now worked fine for more than 54 hours.

Or maybe there is something in the environment that changes after all, although I don't know what it would be.  I'd better go M-x phases-of-moon ...

Comment 8 John W. Linville 2012-02-21 13:40:30 UTC
Göran,

Patching the f16 kernel for wireless has some pitfalls.  I'll put together a test kernel for you sometime today.

Nick, thanks for the attention!

Comment 9 John W. Linville 2012-02-21 15:19:21 UTC
Test kernels building here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3807351

When that build completes, please attempt to recreate this issue with them and report the result here...thanks!

Comment 10 Göran Uddeborg 2012-02-21 17:40:50 UTC
Thanks.  I've rebooted with 3.2.7-1.bz795141.1 now.  It works fine initially, but I guess that doesn't say all that much.  As I've mentioned, we haven't figured out any triggers, so the best we can do is to use it as much as possible.

I'll let you know as soon as I notice anything unusual.

Comment 11 Joachim Frieben 2012-03-03 20:48:07 UTC
Issue also present for kernel-3.2.9-1.fc16.x86_64. The scratch build kernel-3.2.7-1.bz795141.1.fc16 is gone though :o(

Comment 12 Göran Uddeborg 2012-03-04 15:44:13 UTC
I didn't save the binary RPM, but in case you want to rebuild it yourself, I've put the source RPM at ftp://ftp.uddeborg.se/pub/kernel-3.2.7-1.bz795141.1.fc16.src.rpm

I haven't had any problems with the 3.2.7-1.bz795141.1 kernel since I started using it on the 21st.  Since I don't really know what triggered the problem, I cannot be completely sure, but it certainly looks as if the problem is gone in it.

Comment 13 Dave Jones 2012-03-22 16:39:14 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 14 Dave Jones 2012-03-22 16:44:22 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 15 Dave Jones 2012-03-22 16:52:47 UTC
[mass update]
kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository.
Please retest with this update.

Comment 16 Göran Uddeborg 2012-03-22 22:32:21 UTC
It's not fixed, unfortunately.  It took just over 20 minutes after booting 3.3.0-4.fc16.x86_64 before it started spewing "gain calibration timeout" messages.

Comment 17 Tuxtard 2012-03-30 10:12:53 UTC
Bug is still present.

kernel: 3.3.0-4.fc16.i686.PAE

Message:
Mar 30 09:10:13 localhost kernel: [81231.154051] ath5k phy0: gain calibration timeout (2447MHz)

Comment 18 Andy Shevchenko 2012-05-14 17:41:37 UTC
3.3.5-2.fc16 has the same bug.
The same bug is also open for F-15 as bug #638943.

Comment 19 Tuxtard 2012-05-14 19:21:55 UTC
Confirmed on 3.3.5-2.fc16

Comment 20 Andy Shevchenko 2012-05-17 11:18:08 UTC
Today I took the ath5k from linux-next and backported it to 3.3.5-2.fc16. It didn't help. So, apparently something is still broken in ath5k and/or mac80211/cfg80211.
Actually, if you run gkrellm you can easily see the high CPU load. I turned on debugging in the module and got messages like these:
...
[316350.536370] udevd[31746]: renamed network interface wlan0 to ath0
[316350.536680] cfg80211: Regulatory domain changed to country: UA
[316350.536685] cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[316350.536690] cfg80211:   (2402000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm)
[316350.818870] ath5k: phy0: (ath5k_start:2613): mode 2
[316350.818875] ath5k: phy0: (ath5k_stop_locked:2573): invalid 0
[316350.819588] ath5k: phy0: (ath5k_reset:2746): resetting
[316350.821101] ath5k: phy0: (ath5k_hw_set_opmode:878): mode 2
[316351.168310] ath5k: phy0: gain calibration timeout (2412MHz)
[316351.168437] ath5k: phy0: (ath5k_rx_start:1092): cachelsz 64 rx_bufsize 2368
[316351.168482] ath5k: phy0: (ath5k_hw_set_opmode:878): mode 2
[316351.168488] ath5k: phy0: (ath5k_update_bssid_mask_and_opmode:522): mode setup opmode 2 (UNKNOWN)
[316351.168495] ath5k: phy0: (ath5k_update_bssid_mask_and_opmode:541): RX filter 0x0
[316351.168670] ath5k: phy0: (ath5k_rfkill_disable:42): rfkill disable (gpio:0 polarity:0)
[316351.519785] ADDRCONF(NETDEV_UP): ath0: link is not ready
[316357.113990] net_ratelimit: 18 callbacks suppressed
[316357.114023] ath5k: phy0: (ath5k_chan_set:437): channel set, resetting (2412 -> 2412 MHz)
[316357.114028] ath5k: phy0: (ath5k_reset:2746): resetting
[316357.115478] ath5k: phy0: (ath5k_hw_set_opmode:878): mode 2
[316357.462525] ath5k: phy0: gain calibration timeout (2412MHz)
[316357.462654] ath5k: phy0: (ath5k_rx_start:1092): cachelsz 64 rx_bufsize 2368
[316357.462667] ath5k: phy0: (ath5k_hw_set_opmode:878): mode 2
[316357.462673] ath5k: phy0: (ath5k_update_bssid_mask_and_opmode:522): mode setup opmode 2 (UNKNOWN)
[316357.462681] ath5k: phy0: (ath5k_update_bssid_mask_and_opmode:541): RX filter 0x17
[316357.587084] ath5k: phy0: (ath5k_chan_set:437): channel set, resetting (2412 -> 2417 MHz)
[316357.587092] ath5k: phy0: (ath5k_reset:2746): resetting
[316362.201914] net_ratelimit: 73 callbacks suppressed

It seems the kernel keeps trying to reset the hardware, without any success.
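
The reset loop in the debug output above can be summarised with a short script. This is a minimal sketch, assuming the dmesg-style line format shown in this comment; the function name `summarize` and the dictionary keys are hypothetical choices made for the example.

```python
import re

# "[316357.114028] ath5k: phy0: (ath5k_reset:2746): resetting"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\] ath5k: (phy\d+): (.*)$")
CHAN_RE = re.compile(r"channel set, resetting \((\d+) -> (\d+) MHz\)")


def summarize(lines):
    """Count resets and gain calibration timeouts, and list channel changes."""
    summary = {"resets": 0, "gain_timeouts": 0, "channel_changes": []}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an ath5k debug line (e.g. net_ratelimit)
        message = match.group(3)
        if "(ath5k_reset:" in message and message.endswith("resetting"):
            summary["resets"] += 1
        elif message.startswith("gain calibration timeout"):
            summary["gain_timeouts"] += 1
        else:
            chan = CHAN_RE.search(message)
            if chan:
                summary["channel_changes"].append(
                    (int(chan.group(1)), int(chan.group(2)))
                )
    return summary


# A few lines copied from the debug output above.
sample = [
    "[316357.114023] ath5k: phy0: (ath5k_chan_set:437): channel set, resetting (2412 -> 2412 MHz)",
    "[316357.114028] ath5k: phy0: (ath5k_reset:2746): resetting",
    "[316357.462525] ath5k: phy0: gain calibration timeout (2412MHz)",
    "[316357.587084] ath5k: phy0: (ath5k_chan_set:437): channel set, resetting (2412 -> 2417 MHz)",
    "[316357.587092] ath5k: phy0: (ath5k_reset:2746): resetting",
    "[316362.201914] net_ratelimit: 73 callbacks suppressed",
]

print(summarize(sample))
```

Note that net_ratelimit suppresses many messages in this log, so counts obtained this way are a lower bound on what actually happened.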

Comment 21 Andy Shevchenko 2012-05-17 11:56:25 UTC
According to
https://bbs.archlinux.org/viewtopic.php?pid=1087622
the new driver (two patches are mentioned there) should work.

I just did 
modprobe -r ath5k
pm-suspend
... (power on)
modprobe ath5k

Will see...

Comment 22 Andy Shevchenko 2012-05-24 04:29:16 UTC
(In reply to comment #21)
> According to
> https://bbs.archlinux.org/viewtopic.php?pid=1087622
> the new driver (two patches are mentioned there) should work.
> 
> I just did 
> modprobe -r ath5k
> pm-suspend
> ... (power on)
> modprobe ath5k

No problems from then until now.
The laptop is turned on 24x7.

So I think the driver version from the pre-3.5 Linux kernel (probably those two patches) solves the issue.

Comment 23 Göran Uddeborg 2012-06-02 11:47:46 UTC
Andy, I'm a bit confused by the different version references in the comments here and in the forum thread.  Is this a fix that is going into the upstream kernel?  If so, was it included in 3.4, or will we have to wait for 3.5?

Comment 24 Andy Shevchenko 2012-06-02 11:55:35 UTC
(In reply to comment #23)
> Andy, I'm a bit confused by different version references in the comments
> here and in the forum thread.  Is this a fix that is going into the
> upstream kernel?  If so, was it included in 3.4, or will we have to wait
> for 3.5?
I used the version that will be part of v3.5. However, if I remember correctly, the mentioned patches were also part of v3.4-rcX, but I didn't check.

Comment 25 Göran Uddeborg 2012-06-02 20:48:37 UTC
I see, thanks!  I'll try Fedora v3.4 kernels when they show up and see if they help.

Comment 26 A. Bleasby 2012-06-06 12:39:01 UTC
Same symptoms here for some considerable time. Has also required a power-cycle
of a Virgin Super Hub whenever the fault occurred. Having applied the patches
from comment 21 to a

3.3.7-1.fc17.x86_64

kernel the wireless has now been stable for 3 days (usually went down several
times per day - often when wireless activity from neighbours increased).

Comment 27 Dave Jones 2012-10-23 15:28:11 UTC
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).

Comment 28 Göran Uddeborg 2012-10-23 18:02:15 UTC
This problem finally disappeared when I upgraded to a 3.5.1 kernel.  It is the F17 kernel though (kernel-3.5.1-1.fc17.x86_64) in case that actually matters.  As far as I'm concerned, this can be closed.

Comment 29 John W. Linville 2012-10-29 15:53:52 UTC
Closing on basis of comment 28.

