Bug 671514 - going from kernel-2.6.35.9-64.fc14.x86_64 to 2.6.35.10-74.fc14 breaks ath9k ap
Summary: going from kernel-2.6.35.9-64.fc14.x86_64 to 2.6.35.10-74.fc14 breaks ath9k ap
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 14
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: Stanislaw Gruszka
QA Contact: Fedora Extras Quality Assurance
Reported: 2011-01-21 18:18 UTC by Trever Adams
Modified: 2011-05-25 14:16 UTC (History)

Fixed In Version: kernel-2.6.35.11-87.fc14
Doc Type: Bug Fix
Last Closed: 2011-05-25 14:16:56 UTC


Attachments
build config (5.11 KB, application/octet-stream)
2011-01-21 18:18 UTC, Trever Adams
runtime configuration for hostapd (41.04 KB, application/octet-stream)
2011-02-04 20:56 UTC, Trever Adams

Description Trever Adams 2011-01-21 18:18:27 UTC
Created attachment 474651 [details]
build config

Description of problem:
After the kernel update named in the summary, hostapd reports that it is working, but Windows clients complain and are unable to connect to the AP. The AP uses an ath9k-based card.

Version-Release number of selected component (if applicable):
2.6.35.10-74.fc14

How reproducible:
Every time

Actual hostapd config (not the build which is attached) can be provided if requested.

Comment 1 Trever Adams 2011-02-04 20:55:22 UTC
This is quite a serious bug and a huge regression.

Card data:

06:05.0 Network controller: Atheros Communications Inc. AR922X Wireless Network Adapter (rev 01)
	Subsystem: D-Link System Inc DWA-552 802.11n Xtreme N Desktop Adapter (rev A2)
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 168, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 20
	Region 0: Memory at fbff0000 (32-bit, non-prefetchable) [size=64K]
	Capabilities: [44] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=100mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Kernel driver in use: ath9k
	Kernel modules: ath9k

Comment 2 Trever Adams 2011-02-04 20:56:54 UTC
Created attachment 477114 [details]
runtime configuration for hostapd

This is exactly what I use, except the passphrase and SSID have been changed.

Comment 3 Trever Adams 2011-02-13 01:36:40 UTC
Kernel kernel-2.6.35.11-83.fc14.x86_64 still has this bug.

Other devices show that the EXACT same configuration yields WEP instead of WPA under this kernel. It should be WPA/WPA2, and it was working before. If I disable all 802.11n-related features, I get a functioning 802.11g access point.

Scan for neighboring BSSes prior to enabling 40 MHz channel
Scan requested (ret=0) - scan timeout 10 seconds
Interface initialization will be completed in a callback
nl80211: Event message available
nl80211: Scan trigger
RTM_NEWLINK: operstate=0 ifi_flags=0x1002 ()
RTM_NEWLINK, IFLA_IFNAME: Interface 'mon.wlan0' added
Unknown event 5
RTM_NEWLINK: operstate=0 ifi_flags=0x11043 ([UP][RUNNING][LOWER_UP])
RTM_NEWLINK, IFLA_IFNAME: Interface 'mon.wlan0' added
Unknown event 5
RTM_NEWLINK: operstate=0 ifi_flags=0x11043 ([UP][RUNNING][LOWER_UP])
RTM_NEWLINK, IFLA_IFNAME: Interface 'wlan0' added
Unknown event 5
nl80211: Event message available
nl80211: New scan results available
Received scan results (0 BSSes)
40 MHz affected channel range: [2412,2462] MHz
Completing interface initialization
Mode: IEEE 802.11g  Channel: 4  Frequency: 2427 MHz
nl80211: Failed to set channel (freq=2427): -22 (Invalid argument)
Could not set channel for kernel driver
RTM_NEWLINK: operstate=0 ifi_flags=0x11043 ([UP][RUNNING][LOWER_UP])
RTM_NEWLINK, IFLA_IFNAME: Interface 'wlan0' added
Unknown event 5
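
For readers decoding the log above, here is a minimal Python sketch of the arithmetic behind the reported values. It is not taken from hostapd or the kernel, and the function names are hypothetical. It assumes the usual 2.4 GHz channel numbering (center frequency 2407 + 5 * channel MHz for channels 1 to 13), a secondary HT40 channel 20 MHz above or below the primary, and an affected range of 25 MHz either side of the midpoint of the two, which is my understanding of how hostapd derives that range.

```python
import errno


def chan_to_freq_2g4(channel: int) -> int:
    """Center frequency in MHz of a 2.4 GHz channel (1-13)."""
    if not 1 <= channel <= 13:
        raise ValueError(f"not a 2.4 GHz channel in 1-13: {channel}")
    return 2407 + 5 * channel


def ht40_secondary_freq(primary_freq: int, above: bool) -> int:
    """Secondary channel center: 20 MHz above (HT40+) or below (HT40-)."""
    return primary_freq + (20 if above else -20)


def affected_range(primary_freq: int, secondary_freq: int) -> tuple:
    """Assumed 40 MHz overlap-scan range: midpoint of both centers +/- 25 MHz."""
    center = (primary_freq + secondary_freq) // 2
    return center - 25, center + 25


pri = chan_to_freq_2g4(4)                    # primary channel from the log
sec = ht40_secondary_freq(pri, above=True)   # assumption: HT40+ was configured
print(pri, sec, affected_range(pri, sec))
print(errno.errorcode[22])                   # symbolic name of error 22
```

With primary channel 4 and the secondary channel above it, the sketch gives 2427 MHz, 2447 MHz and a range of [2412, 2462] MHz, which is consistent with the log. Error -22 is EINVAL ("Invalid argument") on Linux, so the kernel rejected the requested channel setting rather than failing to reach the hardware.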

Comment 4 Trever Adams 2011-03-07 10:03:16 UTC
This seems to have started around the time of http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=commitdiff;h=b97dc7392d3023c515d6fb0adfa31547ca3915b8

This is the only patch or change in the package that could have caused this problem. I could be wrong. I am not sure that everything that ends up in the package is showing up in f14/master.

Comment 5 Stanislaw Gruszka 2011-03-08 07:56:08 UTC
It's very unlikely that the above patch causes the breakage.

We have a bunch of other wireless patches between .35.9 and .35.10:

> 3acc1eff9aab5dc224463d6fdc1fc98912f97a6f cfg80211: fix extension channel checks to initiate communication
> a664846f7362bd6b9cf6f1e407dd86daa963da97 mac80211: delete AddBA response timer
> a171652459e33373b16fe9d48e200aaa359010b8 mac80211: don't sanitize invalid rates
> 1416ef9a67f443b26bb9b9498ca14351616eedac mac80211: Fix ibss station got expired immediately
> 042a3f996b48464386e2b1ab6ecafc3be8a295c1 mac80211: reset probe send counter upon connection timer reset
> afa9194b3aee8846357236026905fb221c1d4564 mac80211: clear txflags for ps-filtered frames
> 0fe79ffa47b60778e71fc8b2dcfc30eb9681db30 mac80211: use correct station flags lock
> b732c91f60cd067a497515b82af31950361367f4 mac80211: disable beacon monitor while going offchannel
> 6f9867ece8be2ed323b085181d1ffa4424ab4ec6 mac80211: send last 3/5 probe requests as unicast
> 4311061afad44eff734ca645ee1e2833eb19c33d mac80211: make the beacon monitor available externally
> f4fa1333cd6577a63c532bc59586ac3b80f56c7e mac80211: reset connection idle when going offchannel
> 82f7a2e95e8f41d56633822a649ff3621fc7a540 mac80211: add helper for reseting the connection monitor
> b9676bfde50b87df6c2a214a46e7382ddd13a075 mac80211: Fix signal strength average initialization for CQM events
> 96303095042fabda86dbadf7fea4ec3ad122a0f6 mac80211: fix offchannel assumption upon association
> ce838da05df6b16b9d2ac4cf4df1e86e6c9da16f mac80211: fix channel assumption for association done work
> 12e3edfbbf42a116afc5e491930c5a1e56f4edb2 cfg80211: fix regression on processing country IEs
> b4c656d2584875fc74bbda31687eeac9dbb03d41 cfg80211: fix locking
> 8c344624d82e34d805d9250b942158f652e9ac75 cfg80211: fix BSS double-unlinking

Perhaps you could build a vanilla 2.6.35.y kernel and revert these patches one by one to see which one causes the breakage (hint: you don't need to install the whole kernel every time; just rebuild and install the modules, then reload the cfg80211 and mac80211 modules before testing). Let me know if you can do this or want some more help with it. Otherwise I will try to reproduce the problem locally myself, or review each patch to find a possible bug.

Comment 6 Trever Adams 2011-03-08 11:34:58 UTC
I guess I missed all of those using git web. This is a production machine, so I could only do this at night, so it may take a while. If you can help me out (maybe show me how to use git to do this) I would love it. I have to admit my understanding of some of this terminology (the descriptions of the patches) is a bit vague.

The following seem like good candidates for checking first:

3acc1eff9aab5dc224463d6fdc1fc98912f97a6f cfg80211: fix extension channel checks to initiate communication
a664846f7362bd6b9cf6f1e407dd86daa963da97 mac80211: delete AddBA response timer
a171652459e33373b16fe9d48e200aaa359010b8 mac80211: don't sanitize invalid rates
1416ef9a67f443b26bb9b9498ca14351616eedac mac80211: Fix ibss station got expired immediately
0fe79ffa47b60778e71fc8b2dcfc30eb9681db30 mac80211: use correct station flags lock
6f9867ece8be2ed323b085181d1ffa4424ab4ec6 mac80211: send last 3/5 probe requests as unicast*
12e3edfbbf42a116afc5e491930c5a1e56f4edb2 cfg80211: fix regression on processing country IEs

Comment 7 Stanislaw Gruszka 2011-03-08 15:21:54 UTC
(In reply to comment #6)
> I guess I missed all of those using git web.
Perhaps you looked at the Fedora changelog, not upstream.

> If you can help me out
> (maybe show me how to use git to do this) I would love it.
I described how to install a kernel from git here:
https://bugzilla.redhat.com/show_bug.cgi?id=640612#c37

You have to clone this kernel:
http://git.kernel.org/?p=linux/kernel/git/longterm/linux-2.6.35.y.git;a=summary

Reverting is simply "git revert COMMIT_NUMBER". Remember to revert the latest commits first so that you do not break dependencies; otherwise the automatic revert will fail. In that case you will need to edit the files by hand (use "git diff") and then commit the changes (with "git commit -a"). If your commit is wrong, you can use reset to go back to the previous commit: "git reset --hard HEAD~1". If you get lost, there is lots of documentation on using git on the internet.

Comment 8 Trever Adams 2011-03-09 14:32:05 UTC
All right, I will do a binary search type attack on this, one run this morning before I have to bring it back up.

The clone was actually from git://git.kernel.org/pub/scm/linux/kernel/git/longterm/linux-2.6.35.y.git.

Thank you. (I am assuming that you gave me the commits in sorted order.)
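
For readers unfamiliar with the binary-search idea mentioned above, here is a minimal sketch in Python. It is a simplified model, not how git bisect is implemented: the commit labels and the is_bad() test are hypothetical, the list is assumed to be ordered oldest to newest with the last entry known to be bad, and every commit after the first bad one is assumed to be bad as well.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, that the last one is
    known to be bad, and that badness never goes away once introduced.
    """
    lo, hi = 0, len(commits) - 1     # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid                 # culprit is at mid or earlier
        else:
            lo = mid + 1             # culprit is after mid
    return commits[lo]


# 18 hypothetical commits; pretend the regression arrived with "c7".
history = [f"c{i}" for i in range(1, 19)]
tested = []


def is_bad(commit: str) -> bool:
    """Stand-in for 'build, boot and try to connect a client'."""
    tested.append(commit)
    return int(commit[1:]) >= 7


print(first_bad(history, is_bad), "found after testing", tested)
```

Each test halves the remaining range, so a list of this size needs only a handful of builds. The price is that every test has to be trustworthy: one wrong "good" or "bad" answer sends the search into the wrong half.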

Comment 9 Stanislaw Gruszka 2011-03-09 15:06:07 UTC
You can get them with something like:

git log --pretty=oneline v2.6.35.9..HEAD -- net/mac80211/ net/wireless/ drivers/net/wireless/ath/ath9k

Newest are on top, older on bottom.

I suggest checking the current HEAD, i.e. v2.6.35.11, first. After confirming that it is really broken,
revert the commits obtained from "git log", starting from there.

Comment 10 Trever Adams 2011-03-09 17:42:14 UTC
git bisect start '--' 'net/mac80211/' 'net/wireless/' 'drivers/net/wireless/ath/ath9k'
# good: [512ac859f60d61374783e276f8fb7861a9d1d0b9] Linux 2.6.35.9
git bisect good 512ac859f60d61374783e276f8fb7861a9d1d0b9
# bad: [787a4575ad364d69416615881345eae389882588] Release 2.6.35.10
git bisect bad 787a4575ad364d69416615881345eae389882588
# good: [4311061afad44eff734ca645ee1e2833eb19c33d] mac80211: make the beacon monitor available externally
git bisect good 4311061afad44eff734ca645ee1e2833eb19c33d
# good: [afa9194b3aee8846357236026905fb221c1d4564] mac80211: clear txflags for ps-filtered frames
git bisect good afa9194b3aee8846357236026905fb221c1d4564
# good: [1416ef9a67f443b26bb9b9498ca14351616eedac] mac80211: Fix ibss station got expired immediately
git bisect good 1416ef9a67f443b26bb9b9498ca14351616eedac
# good: [a664846f7362bd6b9cf6f1e407dd86daa963da97] mac80211: delete AddBA response timer
git bisect good a664846f7362bd6b9cf6f1e407dd86daa963da97
# good: [3acc1eff9aab5dc224463d6fdc1fc98912f97a6f] cfg80211: fix extension channel checks to initiate communication
git bisect good 3acc1eff9aab5dc224463d6fdc1fc98912f97a6f


That left:
commit 3acc1eff9aab5dc224463d6fdc1fc98912f97a6f
Author: Luis R. Rodriguez <lrodriguez@atheros.com>
Date:   Fri Nov 12 16:31:23 2010 -0800

    cfg80211: fix extension channel checks to initiate communication
    
    commit 9236d838c920e90708570d9bbd7bb82d30a38130 upstream.


Which seems to be working. This leaves the Red Hat/Fedora-specific patches and the following:
git log --pretty=oneline 3acc1eff9aab5dc224463d6fdc1fc98912f97a6f..v2.6.35.10
787a4575ad364d69416615881345eae389882588 Release 2.6.35.10
eb015d662b1831e61ce9610da234565f8349311b Fix pktcdvd ioctl dev_minor range check
6efba844d4ad8783c46ee2477344bfd87f737325 Un-inline get_pipe_info() helper functi
23076bfefeb871c22c40f8abad5b5ea7065a265f Export 'get_pipe_info()' to other users
1e6aa82072d3c8267011461d17fbb25c247a3214 Rename 'pipe_info()' to 'get_pipe_info(
22d2fae507439362ff0efe22fe3196bb9b6e4ce4 nmi: fix clock comparator revalidation
fe590f17c1b2f22e5ce66d091bfb21fe5b59cac0 r8169: fix checksum broken
6cf6d548763e7e751a3fe77d084999bd11587834 r8169: (re)init phy on resume
d4114d4e08bba51c56d77ee3a3741e98ed107dc5 r8169: fix rx checksum offload

Which do not seem to be related.

Comment 11 Trever Adams 2011-03-09 17:49:12 UTC
kernel-2.6.35.11-83.fc14.x86_64 is dead in the water: WEP is listed instead of WPA/WPA2, and it doesn't work, even with WEP.

With all the other kernels (except maybe some I compiled) I do not seem to get the following messages, which I now get with the 11-83 kernel listed above:

cat /var/log/messages | grep -i kern | grep -v Invisible | grep alg
Mar  9 06:58:37 HighCountry kernel: [    0.702276] alg: No test for stdrng (krng)
Mar  9 06:58:44 HighCountry kernel: [   23.122034] alg: No test for cipher_null (cipher_null-generic)
Mar  9 06:58:44 HighCountry kernel: [   23.122106] alg: No test for ecb(cipher_null) (ecb-cipher_null)
Mar  9 06:58:44 HighCountry kernel: [   23.122216] alg: No test for digest_null (digest_null-generic)
Mar  9 06:58:44 HighCountry kernel: [   23.122279] alg: No test for compress_null (compress_null-generic)
Mar  9 08:04:26 HighCountry kernel: [    0.699250] alg: No test for stdrng (krng)
Mar  9 08:04:34 HighCountry kernel: [   22.543812] alg: No test for cipher_null (cipher_null-generic)
Mar  9 08:04:34 HighCountry kernel: [   22.544298] alg: No test for ecb(cipher_null) (ecb-cipher_null)
Mar  9 08:04:34 HighCountry kernel: [   22.544386] alg: No test for digest_null (digest_null-generic)
Mar  9 08:04:34 HighCountry kernel: [   22.544461] alg: No test for compress_null (compress_null-generic)
Mar  9 10:34:07 HighCountry kernel: [    0.699250] alg: No test for stdrng (krng)
Mar  9 10:34:15 HighCountry kernel: [   22.571713] alg: No test for cipher_null (cipher_null-generic)
Mar  9 10:34:15 HighCountry kernel: [   22.571779] alg: No test for ecb(cipher_null) (ecb-cipher_null)
Mar  9 10:34:15 HighCountry kernel: [   22.571839] alg: No test for digest_null (digest_null-generic)
Mar  9 10:34:15 HighCountry kernel: [   22.571899] alg: No test for compress_null (compress_null-generic)
Mar  9 10:44:07 HighCountry kernel: [    0.695045] alg: No test for stdrng (krng)
Mar  9 10:44:15 HighCountry kernel: [   23.363817] alg: No test for cipher_null (cipher_null-generic)
Mar  9 10:44:15 HighCountry kernel: [   23.363858] alg: No test for ecb(cipher_null) (ecb-cipher_null)
Mar  9 10:44:15 HighCountry kernel: [   23.364227] alg: No test for digest_null (digest_null-generic)
Mar  9 10:44:15 HighCountry kernel: [   23.364346] alg: No test for compress_null (compress_null-generic)

Comment 12 Stanislaw Gruszka 2011-03-10 15:00:42 UTC
So vanilla 2.6.35.10 works but Fedora 2.6.35.10-74.fc14 does not, correct?

Does vanilla 2.6.35.11 work as well? If not, I would try reverting my patch:
> f776b89f14c57f89a21110b8e78d9dafe71ae80d mac80211: fix hard lockup in sta_addba_resp_timer_expired
But I really do not see how this could break things.

The "alg: No test for" messages are OK; we enabled the CONFIG_CRYPTO_MANAGER_TESTS option in Fedora. Is there anything else suspicious in the logs?

Comment 13 Trever Adams 2011-03-10 20:14:20 UTC
I noticed there are three ath9k modules. I was only unloading the one actually named that, mac80211 and cfg80211. Is it possible I needed to actually install the entire kernel (not just the modules) and reboot between each bisect?

I will review the logs later, I am short on time at the moment. It is good to know that alg stuff is normal.

I will try 2.6.35.11 vanilla tonight.

Thank you. Is it possible some other change in the kernel is causing memory corruption (just an odd thought I had a few hours ago)?

Comment 14 Stanislaw Gruszka 2011-03-11 08:53:21 UTC
(In reply to comment #13)
> I noticed there are three ath9k modules. I was only unloading the one actually
> named that, mac80211 and cfg80211. 
Oops, sorry, I forgot about that. I thought the ath9k_* modules would depend on the *80211 modules and hence need to be unloaded as well.

> Is it possible I needed to actually install
> the entire kernel (not just the modules) and reboot between each bisect?
If you are doing a bisection, you have to reinstall the whole kernel. Before, I was talking about reverting commits; in that case you don't have to reinstall it, as long as you know which modules a commit touches.

> Is it possible some other change in the kernel is causing memory
> corruption (just an odd thought I had a few hours ago)?
That is possible, but at this point I would rule it out.

Comment 15 Trever Adams 2011-03-11 13:38:29 UTC
Alright, I will try to get everything up and going and redo the tests as I can over the next week.

Comment 16 Trever Adams 2011-03-11 22:01:48 UTC
Vanilla 2.6.35.11 is not functioning, showing the same symptoms as mentioned here. Beginning bisection and reboots as I was able to schedule downtime (no one is using the system today).

Comment 17 Trever Adams 2011-03-11 23:09:51 UTC
Between
3acc1eff9aab5dc224463d6fdc1fc98912f97a6f cfg80211: fix extension channel checks 
a664846f7362bd6b9cf6f1e407dd86daa963da97 mac80211: delete AddBA response timer

Something broke. hostapd has the user space checks listed below. I have no problem, so either something is broken in hostapd in regard to the patch below, or the patch below is broken. Or, some patch between these two which wasn't caught in my bisect (git bisect start -- net/mac80211/ net/wireless/ drivers/net/wireless/ath/ath9k) caused the problem.


3acc1eff9aab5dc224463d6fdc1fc98912f97a6f is the first bad commit
commit 3acc1eff9aab5dc224463d6fdc1fc98912f97a6f
Author: Luis R. Rodriguez <lrodriguez@atheros.com>
Date:   Fri Nov 12 16:31:23 2010 -0800

    cfg80211: fix extension channel checks to initiate communication
    
    commit 9236d838c920e90708570d9bbd7bb82d30a38130 upstream.
    
    When operating in a mode that initiates communication and using
    HT40 we should fail if we cannot use both primary and secondary
    channels to initiate communication. Our current ht40 allowmap
    only covers STA mode of operation, for beaconing modes we need
    a check on the fly as the mode of operation is dynamic and
    there other flags other than disable which we should read
    to check if we can initiate communication.
    
    Do not allow for initiating communication if our secondary HT40
    channel has is either disabled, has a passive scan flag, a
    no-ibss flag or is a radar channel. Userspace now has similar
    checks but this is also needed in-kernel.
    
    Reported-by: Jouni Malinen <jouni.malinen@atheros.com>
    Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    Signed-off-by: Andi Kleen <ak@linux.intel.com>

:040000 040000 f700ffb4477c8a0940d7976bab5616e934e6d7fa 229b8df02bf7d67c2cfafef76b6d2eae862fab81 M	net
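
The rule described in the commit message above can be sketched as follows. This is an illustrative Python model, not the kernel's C code: the Channel class, its field names and can_beacon_on_secondary() are hypothetical, the secondary channel is assumed to sit 20 MHz above (HT40+) or below (HT40-) the primary, and treating a channel missing from the table as unusable is my own assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Channel:
    """Hypothetical model of a channel and its regulatory flags."""
    center_freq: int            # MHz
    disabled: bool = False
    passive_scan: bool = False
    no_ibss: bool = False
    radar: bool = False


def can_beacon_on_secondary(primary_freq: int, ht40: str,
                            channels: Dict[int, Channel]) -> bool:
    """Return True if beaconing may start with the given HT40 setting.

    ht40 is "none", "plus" (secondary above) or "minus" (secondary below).
    """
    if ht40 == "none":
        return True             # no secondary channel to check
    if ht40 not in ("plus", "minus"):
        raise ValueError(f"unknown HT40 setting: {ht40!r}")
    offset = 20 if ht40 == "plus" else -20
    secondary = channels.get(primary_freq + offset)
    if secondary is None:
        return False            # assumption: an unknown channel is unusable
    # Refuse if any of the four flags named in the commit message is set.
    return not (secondary.disabled or secondary.passive_scan
                or secondary.no_ibss or secondary.radar)


# Channels 1-11 of the 2.4 GHz band, all without restrictions.
band = {2407 + 5 * n: Channel(2407 + 5 * n) for n in range(1, 12)}
print(can_beacon_on_secondary(2427, "plus", band))    # secondary at 2447 MHz
print(can_beacon_on_secondary(2427, "minus", band))   # secondary at 2407 MHz
band[2447].passive_scan = True
print(can_beacon_on_secondary(2427, "plus", band))
```

In this model an HT40+ access point on channel 4 is allowed only while the channel at 2447 MHz carries none of the four flags. A check of this kind that wrongly returns "not allowed" would produce the failure seen in this bug, where 802.11g works and the 40 MHz setup is rejected.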

Comment 18 Stanislaw Gruszka 2011-03-14 14:29:25 UTC
You can confirm that the "cfg80211: fix extension channel checks to initiate communication" patch causes the breakage by reverting it:

git bisect reset
git revert 3acc1eff9aa

Comment 19 Stanislaw Gruszka 2011-03-15 13:14:41 UTC
There is an additional fix for 3acc1eff9, currently upstream but not in 2.6.35 stable:

http://git.kernel.org/linus/09a02fdb919876c01e8f05960750a418b3f7fa48

It probably fixes your problem. Could you apply it and test (click on "raw" in the above link, download it as test.patch, run "patch -p1 < test.patch" in the kernel source dir, then recompile and reinstall)?

Comment 20 Trever Adams 2011-03-15 20:01:08 UTC
Stanislaw, thank you. This indeed solves the problem. I do not know if this still allows the full speed or not. I have not tried with an 802.11n capable device where I can see actual speed. I will be able to do tests tonight. Other than that, I know this solves the problem.

Comment 21 Trever Adams 2011-03-15 20:01:45 UTC
This was with 2.6.35.11 (with only the patch Stanislaw suggested).

Comment 22 Trever Adams 2011-03-16 16:43:04 UTC
The patch in question fixed the problem and I just verified that it causes no new problems. If this patch (the one Stanislaw suggested) is integrated, this will solve and close this bug.

Thank you.

Comment 23 Stanislaw Gruszka 2011-03-17 07:40:11 UTC
Thank you for the bisection and testing. I wish all bug reporters were so helpful.

Comment 24 Trever Adams 2011-03-17 17:01:20 UTC
Thank you and you are welcome. I am glad you were able to help me help solve this problem. It was most enjoyable!

Comment 25 Trever Adams 2011-03-21 18:40:08 UTC
Is it possible to see this fix pushed to F14 mainline or are there problems?

Comment 26 Stanislaw Gruszka 2011-03-22 08:06:48 UTC
I'm sorry for the delay. I posted the patch: http://lists.fedoraproject.org/pipermail/kernel/2011-March/003051.html . Now it's up to the Fedora kernel maintainers (who are very busy people) to apply it.

Comment 28 Trever Adams 2011-04-03 07:48:22 UTC
I have been running this kernel for a few days now. It is as rock solid as the kernels were for me before the problem and seems to be a mite more solid than the mainline kernel. I would love to see this pushed to stable unless there are problems. Thank you to all involved.

Comment 29 Trever Adams 2011-04-21 18:34:33 UTC
I am satisfied that this is fixed in released kernel: kernel-2.6.35.12-88.fc14.x86_64. Thank you to everyone involved for fixing this so quickly and so well. I will leave it to others to close, in case there are problems I am not aware of.

