Description of problem:
I use an Atheros card in access point mode. This worked without problems for a long time using the 3.1.7-1 kernel. But shortly after upgrading to 3.2.3-2 the clients could no longer see the networks served by the host. A look in messages showed

Feb 17 05:47:33 mimmi kernel: [324048.332647] ath5k phy0: gain calibration timeout (2462MHz)
Feb 17 05:47:33 mimmi kernel: [324048.660453] ath5k phy0: calibration of channel 11 failed
Feb 17 05:47:34 mimmi kernel: [324049.104485] ath5k phy0: gain calibration timeout (2462MHz)
Feb 17 05:47:34 mimmi kernel: [324049.847293] ath5k phy0: gain calibration timeout (2462MHz)

And then the last message was repeated once or twice per second. I also noticed that a kworker process was high in "top" while this was happening.

Rebooting the machine makes the problem go away for a while, but after some hours or a day or so it starts happening again. I tried to do rmmod on the ath5k module and its immediate dependencies, and then load it again, but I couldn't get the network to work again. The only fix I have found is to reboot.

Between those two kernels I also ran 3.1.9-1 for a very short while. I didn't see any problems with that kernel, but I used it for so short a time that it might just have been luck.

I found bug 785951, which seemed very similar. That should be fixed in 3.2.5-3, so I tried to upgrade. That gave me 3.2.6-3, which worked fine for 33 hours, but this morning the problem reappeared.

Version-Release number of selected component (if applicable):
kernel-3.2.6-3.fc16.x86_64
hostapd-0.7.3-2.fc15.x86_64

How reproducible:
It takes a while after a reboot before it happens. I haven't figured out any pattern in how long the delay is.

Additional information:
05:06.0 Ethernet controller: Atheros Communications Inc. Atheros AR5001X+ Wireless Network Adapter (rev 01)
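Since the failure shows up only after a variable delay, it can help to pull the first occurrence and the per-frequency counts out of the log mechanically. The following is a minimal sketch, not part of the original report: the function name `summarize_timeouts` is made up for illustration, and the sample lines are the ones quoted in the description. The bracketed number is the kernel timestamp in seconds since boot, so the first match gives a rough "time until failure".

```python
import re
from collections import Counter

# Matches both "ath5k phy0: ..." and "ath5k: phy0: ..." message styles.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+ath5k:? (?P<phy>phy\d+): "
    r"gain calibration timeout \((?P<mhz>\d+)MHz\)"
)


def summarize_timeouts(lines):
    """Return (uptime of first timeout in seconds or None, Counter per MHz).

    Hypothetical helper for illustration; it only looks at
    "gain calibration timeout" messages and ignores everything else.
    """
    first = None
    per_freq = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue
        if first is None:
            first = float(match.group("uptime"))
        per_freq[int(match.group("mhz"))] += 1
    return first, per_freq


# Sample lines copied from the description above.
SAMPLE = [
    "Feb 17 05:47:33 mimmi kernel: [324048.332647] ath5k phy0: gain calibration timeout (2462MHz)",
    "Feb 17 05:47:33 mimmi kernel: [324048.660453] ath5k phy0: calibration of channel 11 failed",
    "Feb 17 05:47:34 mimmi kernel: [324049.104485] ath5k phy0: gain calibration timeout (2462MHz)",
    "Feb 17 05:47:34 mimmi kernel: [324049.847293] ath5k phy0: gain calibration timeout (2462MHz)",
]

if __name__ == "__main__":
    first, per_freq = summarize_timeouts(SAMPLE)
    print("first timeout at uptime (s):", first)
    print("timeouts per frequency (MHz):", dict(per_freq))
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the log location (for example /var/log/messages) is an assumption that depends on the syslog configuration.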
FWIW, the kernel in question is using compat-wireless-3.3-rc1-2 with the following patches applied:

Patch50101: mac80211-fix-debugfs-key-station-symlink.patch
Patch50102: brcmsmac-fix-tx-queue-flush-infinite-loop.patch
Patch50103: mac80211-Use-the-right-headroom-size-for-mesh-mgmt-f.patch
Patch50105: b43-add-option-to-avoid-duplicating-device-support-w.patch
Patch50106: mac80211-update-oper_channel-on-ibss-join.patch
Patch50107: mac80211-set-bss_conf.idle-when-vif-is-connected.patch
Patch50108: iwlwifi-fix-PCI-E-transport-inta-race.patch
Patch50109: bcma-Fix-mem-leak-in-bcma_bus_scan.patch
Patch50110: rt2800lib-fix-wrong-128dBm-when-signal-is-stronger-t.patch
Patch50111: iwlwifi-make-Tx-aggregation-enabled-on-ra-be-at-DEBU.patch
Patch50112: ssb-fix-cardbus-slot-in-hostmode.patch
Patch50113: iwlwifi-don-t-mess-up-QoS-counters-with-non-QoS-fram.patch
Patch50114: mac80211-timeout-a-single-frame-in-the-rx-reorder-bu.patch
Patch50115: ath9k-use-WARN_ON_ONCE-in-ath_rc_get_highest_rix.patch
Patch50116: mwifiex-handle-association-failure-case-correctly.patch
Patch50117: ath9k-Fix-kernel-panic-during-driver-initilization.patch
Patch50118: mwifiex-add-NULL-checks-in-driver-unload-path.patch
Patch50119: ath9k-fix-a-WEP-crypto-related-regression.patch
Patch50120: ath9k_hw-fix-a-RTS-CTS-timeout-regression.patch
Patch50121: bcma-don-t-fail-for-bad-SPROM-CRC.patch
Patch50122: zd1211rw-firmware-needs-duration_id-set-to-zero-for-.patch
Patch50123: mac80211-Fix-a-rwlock-bad-magic-bug.patch
Patch50124: rtlwifi-Modify-rtl_pci_init-to-return-0-on-success.patch

With that said, the "ath5k phy0: gain calibration timeout" issue has been around for a long time. Is it possible that some other environmental condition has changed (i.e. moved the AP somewhere that gets hotter)?
The machine is standing where it has stood for many years. It's winter here, and the room temperature hasn't changed noticeably. So I can't see that anything in the environment has changed.

The host is used for some day-to-day purposes, both in some server roles, like serving as an AP, and as a desktop machine, so the usage pattern is not completely stable. And I have done some system changes, like updating mysql packages. But nothing that I can possibly imagine would affect the wireless code.

It did start very shortly after my switch to the 3.2.3-2 kernel, so I strongly suspect that to be the cause. But if it would be of any help, I could reboot with the 3.1.7-1 kernel again, and make sure it doesn't happen with that. (I hope there haven't been any fixes for serious, remotely exploitable security holes since 3.1.7-1. This host has to be reachable from the outside. I can trust its users, but it needs to resist attacks from the outside.)

"Sure" only to a certain extent, of course, since this doesn't happen immediately on reboot but with various delays. As of this writing, the machine has been up 37 hours with the 3.2.6-3 kernel since the last reboot, and so far it is working. It remains to be seen for how long. So if I try an older kernel again, I would never really be completely sure the problem wasn't about to happen soon.
Can you test this out?

https://dev.openwrt.org/browser/trunk/package/mac80211/patches/441-ath5k_no_agc_recalibration.patch?rev=30624

It seems that some cards/platforms don't like periodic gain calibration. So far we have reports from embedded mips platforms (the same card worked ok on mipsel); this is the first on x86_64 (x86 seems to be ok with the same card, it doesn't make sense). What is different about this "gain calibration timeout" is that it happens on the same channel, not during a scan as in the other reports. Your report is also the first to indicate a calibration failure; the other reports indicated the error was mostly during reset.

Initially it seemed that it had to do with stopping the tx queues (that's why you have resets on the same channel: the queues got stuck and the driver tries to reset the card), but it seems we have to drop gain calibration from the periodic calibration. I'm waiting to get my hands on some docs from Atheros to confirm this, but it seems that gain calibration is only needed after phy reset after all, not even after fast channel switch (which was one of the reasons I added it there), so the above patch should do the trick.

I'm still not sure about dropping the code for stopping the tx queues completely, since AR5210 by design does gain calibration periodically (no dynamic I/Q calibration there).
Keep in mind that we have been trying to debug this "gain calibration timeout" thing for a long time and we still don't have a way to deal with it; it seems almost non-deterministic! The patch that added it to the periodic calibration was an experiment to see how it would work. I'm sorry it had the opposite result...
I will build a kernel with your patch. I'll be back when I've done it and have tried it for a little while.
Try also this one, it removes the queue stopping... https://dev.openwrt.org/browser/trunk/package/mac80211/patches/440-ath5k_calibrate_no_queue_stop.patch?rev=30623 BTW both patches come from Felix Fietkau from OpenWRT
These patches are not made for the 3.2.6 kernel, are they? I tried to rebuild the RPMs after adding the patches, but the build failed when it tried to apply the first one. And looking at them manually, I can't find the code they try to patch.

The 440-ath5k_calibrate_no_queue_stop patch tries to patch a function called ath5k_calibrate_work, which doesn't even exist in the 3.2.6-3 sources. The 441-ath5k_no_agc_recalibration patch modifies ath5k_hw_phy_calibrate, which does exist but looks very different from what the patch assumes. I looked around a bit in case the code had just been moved around, but I couldn't find it.

I haven't rebuilt the kernel RPMs before, and the SPEC file did look a bit different from the typical package. So maybe I'm missing something here. I could use a little hint on what it could be.

By the way, the 3.2.6-3 kernel does appear much less affected by this issue than the 3.2.3-2 kernel was. The latter worked for maybe a day at a time, but with the former I've only seen this happen once so far. And the second boot has now worked fine for more than 54 hours. Or maybe there is something in the environment that changes after all, although I don't know what it would be. I'd better go M-x phases-of-moon ...
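A quick way to check whether a patch targets a different code base is to look for the functions it touches in the unpacked source tree before attempting a build. The sketch below is only an illustration under stated assumptions: `missing_identifiers` is a hypothetical helper, it does a crude substring search over .c and .h files, and the demo builds a tiny synthetic tree rather than a real kernel source directory.

```python
import os
import tempfile


def missing_identifiers(source_dir, identifiers):
    """Return the identifiers that appear in no .c or .h file under source_dir.

    Crude substring search, intended only as a first sanity check.
    """
    remaining = set(identifiers)
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            if not name.endswith((".c", ".h")):
                continue
            path = os.path.join(root, name)
            with open(path, errors="replace") as handle:
                text = handle.read()
            remaining = {ident for ident in remaining if ident not in text}
            if not remaining:
                return []
    return sorted(remaining)


def demo():
    """Run the check on a synthetic one-file tree (not real kernel code)."""
    with tempfile.TemporaryDirectory() as tree:
        with open(os.path.join(tree, "phy.c"), "w") as handle:
            # Placeholder body; only the function name matters here.
            handle.write("int ath5k_hw_phy_calibrate(void)\n{\n\treturn 0;\n}\n")
        return missing_identifiers(
            tree, ["ath5k_calibrate_work", "ath5k_hw_phy_calibrate"]
        )


if __name__ == "__main__":
    print(demo())
```

Pointed at the real driver directory of the unpacked kernel sources (the exact path depends on how the source RPM was prepared), a non-empty result suggests the patch was written against a newer or older version of the driver.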
Göran, Patching the f16 kernel for wireless has some pitfalls. I'll put together a test kernel for you sometime today. Nick, thanks for the attention!
Test kernels building here: http://koji.fedoraproject.org/koji/taskinfo?taskID=3807351 When that build completes, please attempt to recreate this issue with them and report the result here...thanks!
Thanks. I've rebooted with 3.2.7-1.bz795141.1 now. It works fine initially, but I guess that doesn't say all that much. As I've mentioned, we haven't figured out any triggers, so the best we can do is to use it as much as possible. I'll let you know as soon as I notice anything unusual.
Issue also present for kernel-3.2.9-1.fc16.x86_64. The scratch build kernel-3.2.7-1.bz795141.1.fc16 is gone though :o(
I didn't save the binary RPM, but in case you want to rebuild it yourself, I've put the source RPM at ftp://ftp.uddeborg.se/pub/kernel-3.2.7-1.bz795141.1.fc16.src.rpm

I haven't had any problems with the 3.2.7-1.bz795141.1 kernel since I started using it on the 21st. Since I don't really know what triggered the problem, I cannot be completely sure, but it certainly looks as if the problem is gone in it.
[mass update] kernel-3.3.0-4.fc16 has been pushed to the Fedora 16 stable repository. Please retest with this update.
It's not fixed, unfortunately. It took just over 20 minutes after booting 3.3.0-4.fc16.x86_64 before it started spewing "gain calibration timeout" messages.
Bug is still present.
kernel: 3.3.0-4.fc16.i686.PAE
Message:
Mar 30 09:10:13 localhost kernel: [81231.154051] ath5k phy0: gain calibration timeout (2447MHz)
3.3.5-2.fc16 has the same bug. The same bug is also open for F-15 as bug #638943.
Confirmed on 3.3.5-2.fc16
Today I took the ath5k from linux-next and backported it to 3.3.5-2.fc16. It doesn't help. So, apparently something is still broken in ath5k and/or mac80211/cfg80211. If you run gkrellm you can easily see the high CPU load.

I turned on debugging in the module and got messages like these:
...
[316350.536370] udevd[31746]: renamed network interface wlan0 to ath0
[316350.536680] cfg80211: Regulatory domain changed to country: UA
[316350.536685] cfg80211: (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
[316350.536690] cfg80211: (2402000 KHz - 2482000 KHz @ 40000 KHz), (N/A, 2000 mBm)
[316350.818870] ath5k: phy0: (ath5k_start:2613): mode 2
[316350.818875] ath5k: phy0: (ath5k_stop_locked:2573): invalid 0
[316350.819588] ath5k: phy0: (ath5k_reset:2746): resetting
[316350.821101] ath5k: phy0: (ath5k_hw_set_opmode:878): mode 2
[316351.168310] ath5k: phy0: gain calibration timeout (2412MHz)
[316351.168437] ath5k: phy0: (ath5k_rx_start:1092): cachelsz 64 rx_bufsize 2368
[316351.168482] ath5k: phy0: (ath5k_hw_set_opmode:878): mode 2
[316351.168488] ath5k: phy0: (ath5k_update_bssid_mask_and_opmode:522): mode setup opmode 2 (UNKNOWN)
[316351.168495] ath5k: phy0: (ath5k_update_bssid_mask_and_opmode:541): RX filter 0x0
[316351.168670] ath5k: phy0: (ath5k_rfkill_disable:42): rfkill disable (gpio:0 polarity:0)
[316351.519785] ADDRCONF(NETDEV_UP): ath0: link is not ready
[316357.113990] net_ratelimit: 18 callbacks suppressed
[316357.114023] ath5k: phy0: (ath5k_chan_set:437): channel set, resetting (2412 -> 2412 MHz)
[316357.114028] ath5k: phy0: (ath5k_reset:2746): resetting
[316357.115478] ath5k: phy0: (ath5k_hw_set_opmode:878): mode 2
[316357.462525] ath5k: phy0: gain calibration timeout (2412MHz)
[316357.462654] ath5k: phy0: (ath5k_rx_start:1092): cachelsz 64 rx_bufsize 2368
[316357.462667] ath5k: phy0: (ath5k_hw_set_opmode:878): mode 2
[316357.462673] ath5k: phy0: (ath5k_update_bssid_mask_and_opmode:522): mode setup opmode 2 (UNKNOWN)
[316357.462681] ath5k: phy0: (ath5k_update_bssid_mask_and_opmode:541): RX filter 0x17
[316357.587084] ath5k: phy0: (ath5k_chan_set:437): channel set, resetting (2412 -> 2417 MHz)
[316357.587092] ath5k: phy0: (ath5k_reset:2746): resetting
[316362.201914] net_ratelimit: 73 callbacks suppressed

It seems the kernel keeps trying to reset the hardware, without any success.
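To put numbers on the reset loop seen in a debug log like the one above, the resets, the gain calibration timeouts and the channel changes can be counted separately. This is a minimal sketch under stated assumptions: `reset_stats` is a hypothetical helper, the patterns are derived only from the message formats quoted in this comment, and the sample is a subset of those lines.

```python
import re

# Patterns derived from the debug messages quoted in this comment.
RESET_RE = re.compile(r"\(ath5k_reset:\d+\): resetting")
TIMEOUT_RE = re.compile(r"gain calibration timeout \((\d+)MHz\)")
CHAN_SET_RE = re.compile(r"channel set, resetting \((\d+) -> (\d+) MHz\)")


def reset_stats(lines):
    """Count resets and timeouts, and collect (old, new) channel changes in MHz."""
    stats = {"resets": 0, "timeouts": 0, "channel_changes": []}
    for line in lines:
        if RESET_RE.search(line):
            stats["resets"] += 1
        if TIMEOUT_RE.search(line):
            stats["timeouts"] += 1
        match = CHAN_SET_RE.search(line)
        if match:
            stats["channel_changes"].append(
                (int(match.group(1)), int(match.group(2)))
            )
    return stats


# Subset of the log lines quoted above.
SAMPLE_LOG = [
    "[316350.819588] ath5k: phy0: (ath5k_reset:2746): resetting",
    "[316351.168310] ath5k: phy0: gain calibration timeout (2412MHz)",
    "[316357.114023] ath5k: phy0: (ath5k_chan_set:437): channel set, resetting (2412 -> 2412 MHz)",
    "[316357.114028] ath5k: phy0: (ath5k_reset:2746): resetting",
    "[316357.462525] ath5k: phy0: gain calibration timeout (2412MHz)",
    "[316357.587084] ath5k: phy0: (ath5k_chan_set:437): channel set, resetting (2412 -> 2417 MHz)",
    "[316357.587092] ath5k: phy0: (ath5k_reset:2746): resetting",
]

if __name__ == "__main__":
    print(reset_stats(SAMPLE_LOG))
```

Note that net_ratelimit suppresses many messages in the log above, so counts taken from the text are a lower bound on what the driver actually did.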
According to
https://bbs.archlinux.org/viewtopic.php?pid=1087622
the new driver (two patches are mentioned there) should work.

I just did
modprobe -r ath5k
pm-suspend
... (power on)
modprobe ath5k

Will see...
(In reply to comment #21)
> Accordingly to
> https://bbs.archlinux.org/viewtopic.php?pid=1087622
> the new driver (two patches are mentioned there) should work.
>
> I just did
> modprobe -r ath5k
> pm-suspend
> ... (power on)
> modprobe ath5k

No problems from that time until now. The laptop is turned on 24x7. So I think the version from the pre-3.5 Linux kernel (probably those two patches) solves the issue.
Andy, I'm a bit confused by the different version references in the comments here and in the forum thread. Is this a fix that is going into the upstream kernel? If so, was it included in 3.4, or will we have to wait for 3.5?
(In reply to comment #23)
> Andy, I'm a bit confused by different version references in the comments
> here and in the forum thread. Is this a fix that is going into the
> upstreams kernel? If so, was it included in 3.4, or will we have to wait
> for 3.5?

I used the version that will be a part of the v3.5. However, if I remember correctly the mentioned patches were also a part of the v3.4-rcX, but I didn't check it.
I see, thanks! I'll try Fedora v3.4 kernels when they show up and see if they help.
Same symptoms here for some considerable time. It has also required a power-cycle of a Virgin Super Hub whenever the fault occurred. After applying the patches from comment 21 to a 3.3.7-1.fc17.x86_64 kernel, the wireless has now been stable for 3 days (it usually went down several times per day, often when wireless activity from neighbours increased).
# Mass update to all open bugs. Kernel 3.6.2-1.fc16 has just been pushed to updates. This update is a significant rebase from the previous version. Please retest with this kernel, and let us know if your problem has been fixed. In the event that you have upgraded to a newer release and the bug you reported is still present, please change the version field to the newest release you have encountered the issue with. Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered. If you are not the original bug reporter and you still experience this bug, please file a new report, as it is possible that you may be seeing a different problem. (Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).
This problem finally disappeared when I upgraded to a 3.5.1 kernel. It is the F17 kernel though (kernel-3.5.1-1.fc17.x86_64) in case that actually matters. As far as I'm concerned, this can be closed.
Closing on basis of comment 28.