Bug 927191 - NetworkManager causes spurious wireless disassociations, fails to ever associate and connect correctly
Summary: NetworkManager causes spurious wireless disassociations, fails to ever associ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 18
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1016707
TreeView+ depends on / blocked
 
Reported: 2013-03-25 10:16 UTC by Linus Torvalds
Modified: 2013-11-20 18:55 UTC (History)
15 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1016707 (view as bug list)
Environment:
Last Closed: 2013-11-20 18:55:48 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
wpa_supplicant.log for failure cases (69.78 KB, text/plain)
2013-03-25 10:55 UTC, Linus Torvalds
no flags Details
working trace from wpa_supplicant --ddKt (24.00 KB, text/plain)
2013-03-25 16:00 UTC, Linus Torvalds
no flags Details
nonworking NM -> wpa_supplicant trace with -ddKt (37.11 KB, text/plain)
2013-03-26 02:05 UTC, Linus Torvalds
no flags Details
failing wpa_supplicant log (-Dnl80211) (180.94 KB, text/plain)
2013-03-26 15:19 UTC, Linus Torvalds
no flags Details
Same log, working, and without -Dnl80211 (56.00 KB, text/plain)
2013-03-26 15:24 UTC, Linus Torvalds
no flags Details
Mitchener: output from lspci (1.58 KB, text/plain)
2013-04-16 16:57 UTC, Garrett Mitchener
no flags Details
Mitchener: output from lsusb (548 bytes, text/plain)
2013-04-16 16:59 UTC, Garrett Mitchener
no flags Details
output of lsmod (4.39 KB, text/plain)
2013-06-08 20:22 UTC, Garrett Mitchener
no flags Details
output of lspci (1.58 KB, text/plain)
2013-06-08 20:23 UTC, Garrett Mitchener
no flags Details
output of dmesg (107.80 KB, text/plain)
2013-06-08 20:23 UTC, Garrett Mitchener
no flags Details
/var/log/messages (1.09 MB, text/plain)
2013-06-08 20:24 UTC, Garrett Mitchener
no flags Details
copy of /sys/kernel/debug/ieee80211/phy0/ath9k/xmit (520 bytes, text/plain)
2013-06-08 20:25 UTC, Garrett Mitchener
no flags Details
copy of /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/<station address>/rc_stats (520 bytes, text/plain)
2013-06-08 20:26 UTC, Garrett Mitchener
no flags Details
copy of /var/log/wpa_supplicant.log (281.74 KB, text/x-log)
2013-06-08 20:27 UTC, Garrett Mitchener
no flags Details
rc_stats after the network connection stabilized (520 bytes, text/plain)
2013-06-08 20:27 UTC, Garrett Mitchener
no flags Details
wpa_supplicant.log after the network connection stabilized (281.74 KB, text/x-log)
2013-06-08 20:28 UTC, Garrett Mitchener
no flags Details
xmit after the network connection stabilized (1.33 KB, text/plain)
2013-06-08 20:28 UTC, Garrett Mitchener
no flags Details
output of lsmod (4.46 KB, text/plain)
2013-06-21 20:50 UTC, Garrett Mitchener
no flags Details
output of dmesg (250.44 KB, text/plain)
2013-06-21 20:50 UTC, Garrett Mitchener
no flags Details
/var/log/messages (750.19 KB, text/plain)
2013-06-21 20:51 UTC, Garrett Mitchener
no flags Details
copy of /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/<station address>/rc_stats (1.00 KB, text/plain)
2013-06-21 20:52 UTC, Garrett Mitchener
no flags Details
copy of /sys/kernel/debug/ieee80211/phy0/ath9k/xmit (1.33 KB, text/plain)
2013-06-21 20:53 UTC, Garrett Mitchener
no flags Details
/var/log/messages from 2013-08-22, my Asus Zenbook (18.74 KB, text/plain)
2013-08-22 18:32 UTC, Garrett Mitchener
no flags Details

Description Linus Torvalds 2013-03-25 10:16:12 UTC
Description of problem:

On *some* networks, NM seems to have problems ever associating. The machine works fine on other wireless networks, so I'm not sure what triggers the problem. For example, I can connect to my Android phone hotspot (WPA2) fine, but when I try to connect to the particular network in the hotel room I am right now, it never successfully completes. It keeps disassociating, and asking for the WPA password over and over again.

And I know the wireless network works fine. My phone can connect to it, and I can even connect to it on the same laptop if I kill NetworkManager and then just do it manually with wpa_supplicant.

Version-Release number of selected component (if applicable):

NetworkManager-0.9.8.0-1.fc18.x86_64

How reproducible:

100% reproducible on that particular network (essid: "lodgecondo", password "condo000", WPA2/psk). But it works on my home network, and with my phone hotspot. 

Steps to Reproduce:
1. Find a network that shows this issue
2. Try to connect to it
3. Run around naked with a chainsaw to let out your frustrations
  
Actual results:

This is from dmesg:
  wlan0: authenticate with 50:46:5d:02:85:08
  wlan0: send auth to 50:46:5d:02:85:08 (try 1/3)
  wlan0: authenticated
  wlan0: associate with 50:46:5d:02:85:08 (try 1/3)
  wlan0: RX AssocResp from 50:46:5d:02:85:08 (capab=0x411 status=0 aid=16)
  wlan0: associated
  wlan0: disassociating from 50:46:5d:02:85:08 by local choice (reason=3)
  cfg80211: Calling CRDA to update world regulatory domain
  cfg80211: World regulatory domain updated:
  cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
  cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (2457000 KHz - 2482000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211: Calling CRDA for country: US
  cfg80211: Regulatory domain changed to country: US
  cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
  cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2700 mBm)
  cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 1700 mBm)
  cfg80211:   (5250000 KHz - 5330000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (5490000 KHz - 5600000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (5650000 KHz - 5710000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 3000 mBm)
  cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 4000 mBm)

Rinse and repeat. Notice the "wlan0: disassociating from..".

Expected results:

This is dmesg with just wpa_supplicant run by hand:

  wlan0: authenticate with 50:46:5d:02:85:08
  wlan0: send auth to 50:46:5d:02:85:08 (try 1/3)
  wlan0: authenticated
  wlan0: associate with 50:46:5d:02:85:08 (try 1/3)
  wlan0: RX AssocResp from 50:46:5d:02:85:08 (capab=0x411 status=0 aid=16)
  wlan0: associated
  IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready

The wireless works fine, I'm doing this bug-report with that wireless network right now. Just not with NetworkManager running.

Additional info:

This is a ChromeBook pixel with the ath9k chipset, with up-to-date Fedora18 as of today (except I obviously have my own kernel - but I checked, and it fails the same way with the currenf Fedora kernel too).

Comment 1 Linus Torvalds 2013-03-25 10:55:22 UTC
Created attachment 715950 [details]
wpa_supplicant.log for failure cases

This is the wpa_supplicant.log file with extra debug output as documented at

  https://live.gnome.org/NetworkManager/Debugging

for a failed event (well, several failures, as far as I can tell.

Does this make anybody go "oh, I know what's up?"

Comment 2 Linus Torvalds 2013-03-25 16:00:51 UTC
Created attachment 716083 [details]
working trace from wpa_supplicant --ddKt

It's cut off in the middle of a line because apparently wpa_supplicant ends up buffering standard output to a file, but I suspect nobody cares. It has successfully authenticated already and finished the connection.

This comes with the PSK key visible, this is not some super-secure installation, it's a rental condo.

Comment 3 Linus Torvalds 2013-03-26 02:05:58 UTC
Created attachment 716290 [details]
nonworking NM -> wpa_supplicant trace with -ddKt

Ok, I couldn't get NetworkManager or dbus debugging to work, so I brute-forced it by renaming "wpa_supplicant" as "wpa_supplicant.real" and then have just a shell-script wpa_supplicant that added the -ddKt flags and then ran the real one.

That worked, and here's the wpa_supplicant log with key data and timestamps for the non-working NetworkManager case.

It definitely uses the right passphrase, and the resulting binary PSK key seems to be identical too. Any ideas?

Comment 4 Robert Shade 2013-03-26 03:21:43 UTC
Observations:

In the working trace, wpa_supplicant is using WEXT, whereas the other is using nl80211.

For some reason, when under nl80211, wpa_supplicant doesn't get a response for its EAPOL-Key 2/4 and the connection times out.

Comment 5 Dan Williams 2013-03-26 14:00:49 UTC
As on my reply to your private mail thread, the WEXT vs. nl80211 thing could well be an issue here, we need to figure out why the supplicant/kernel aren't connecting correctly with nl80211.

If you haven't already, make sure you move NetworkManager out of the way so that systemd doesn't respawn it after you've stopped it (annoying, I know).  Then run your successful supplicant test from comment #2, but when spawning the supplicant, use:

wpa_supplicant -dddtuK -Dnl80211 -C /home/torvalds/wpa_conf.txt -i wlan0

and lets see if that works; that'll help point the blame finger in the right direction.

Comment 6 Linus Torvalds 2013-03-26 15:19:35 UTC
Created attachment 716576 [details]
failing wpa_supplicant log (-Dnl80211)

Yup, it seems to be related to -Dnl80211. This log was created with

wpa_supplicant -dddtuK -Dnl80211 -i wlan0 -c wpa_conf.txt > nl80211.log

Comment 7 Linus Torvalds 2013-03-26 15:24:39 UTC
Created attachment 716586 [details]
Same log, working, and without -Dnl80211

This should be very similar to the 716083 attachement, but I made it right after the failing case with the exact same flags except for dropping the -Dnl80211 part. Just in case the extra -d and -u flags make a difference.

Comment 8 Garrett Mitchener 2013-04-10 13:56:41 UTC
I'm seeing this bug too.  I'm using Fedora 18 with an Asus netbook.  The campus wireless uses WPA2.  The network connection works with this machine and Windows XP, Fedora 17 (before I upgraded to F18), and Ubuntu 12.10.

Comment 9 Dan Williams 2013-04-15 20:14:28 UTC
(In reply to comment #8)
> I'm seeing this bug too.  I'm using Fedora 18 with an Asus netbook.  The
> campus wireless uses WPA2.  The network connection works with this machine
> and Windows XP, Fedora 17 (before I upgraded to F18), and Ubuntu 12.10.

Garrett, what kind of wifi card do you have?  Running 'lspci' and 'lsusb' would certainly tell us; you could attach the output of those commands here.

Comment 10 Garrett Mitchener 2013-04-16 16:57:48 UTC
Created attachment 736457 [details]
Mitchener: output from lspci

Output from lspci on my asus EEEPC netbook

Comment 11 Garrett Mitchener 2013-04-16 16:59:06 UTC
Created attachment 736458 [details]
Mitchener: output from lsusb

Output from lsusb on my asus EEEPC netbook

Comment 12 Sujith 2013-06-05 17:32:50 UTC
(In reply to Garrett Mitchener from comment #8)
> I'm seeing this bug too.  I'm using Fedora 18 with an Asus netbook.  The
> campus wireless uses WPA2.  The network connection works with this machine
> and Windows XP, Fedora 17 (before I upgraded to F18), and Ubuntu 12.10.

Garrett, are you still seeing this issue ?

Comment 13 Dan Williams 2013-06-05 17:53:04 UTC
So Garrett's machine has ath9k, which isn't as mature as other drivers, so that could be the cause of the problems, also the nl80211/wext thing.  Logs from 'dmesg' would be helpful to see if the device simply loses sync with the AP or what.

Comment 14 Sujith 2013-06-05 19:24:50 UTC
Yes, I'd like ath9k logs along with some debugfs statistics to debug this issue.

Comment 15 Sujith 2013-06-05 19:29:28 UTC
Ok, I got a copy of the packet capture that Linus sent to Jouni and here are some observations:

In the success case, MSG 2/4 is sent at MCS 3.
In the failure case, MSG 2/4 is sent at MCS 13.

From the wpa_s logs, MSG 2/4 appears to cause the timeout and eventual disconnect.

This is a problem with the ath9k rate control, where it sends EAPOL packets at really high rates even in very congested environments for the reason that it starts from the *highest* rate and then starts ramping down based on the packet error rate.

But, this will cause authentication timeouts because the algorithm needs time to come down to lower rates and a failed EAPOL packet breaks things.

minstrel doesn't have this problem because it starts from the lowest rate and goes upward. I am not sure about the difference between nl80211/WEXT and why that would make a difference in rate selection.

This has to be verified - so if someone who can reproduce this uses minstrel to check if the issue is still seen, that'd give us a clue.

Comment 16 Jouni Malinen 2013-06-05 23:37:23 UTC
(In reply to Sujith from comment #15)
> Ok, I got a copy of the packet capture that Linus sent to Jouni and here are
> some observations:
> 
> In the success case, MSG 2/4 is sent at MCS 3.
> In the failure case, MSG 2/4 is sent at MCS 13.

Ah, great. That makes things easier to understand. Based on this, I'm now easily able to reproduce the issue and get capture files that look very similar to the one from Linus.

> This is a problem with the ath9k rate control, where it sends EAPOL packets
> at really high rates even in very congested environments for the reason that
> it starts from the *highest* rate and then starts ramping down based on the
> packet error rate.
> 
> But, this will cause authentication timeouts because the algorithm needs
> time to come down to lower rates and a failed EAPOL packet breaks things.

Yes, that needs to be fixed. The behavior seemed to be to try two first data frames after association with MCS 13 and only then drop to MCS 3. Guess how many times the AP that Linus tried to use transmitted EAPOL-Key 1/4.. Just twice. The last success case worked due to another bug (in mac80211, I'd assume) that leaked a data frame that should not have been there. That was enough to convince rate control to drop TX rate for the second (and final with this AP) EAPOL-Key msg 2/4 attempt.

I did not check what rates were included in the retransmit sequence, but apparently not low enough to make this work.

This seems to match quite well with the description of WPA2-Personal working with phone hotspot or home network and not with a hotel network. I'd assume the first two have much better signal strength while some hotels I've been to would be in quite the opposite end of the scale.. ;-) (The one in the capture file was not really that bad, but it was not good enough to get anything like MCS 13 going through.)

> minstrel doesn't have this problem because it starts from the lowest rate
> and goes upward. I am not sure about the difference between nl80211/WEXT and
> why that would make a difference in rate selection.

I'd assume the nl80211 vs. WEXT part is unrelated.

> This has to be verified - so if someone who can reproduce this uses minstrel
> to check if the issue is still seen, that'd give us a clue.

I can confirm that this resolves the issue that I'm seeing. For others who would like to test this, disabling CONFIG_ATH9K_RATE_CONTROL=y allows this to be worked around until ath9k rate control can be improved or the default rate control algorithm for ath9k gets changed.

Comment 17 Jouni Malinen 2013-06-06 00:00:19 UTC
(In reply to Jouni Malinen from comment #16)
> The last success case worked due to another bug (in mac80211, I'd
> assume) that leaked a data frame that should not have been there. That was
> enough to convince rate control to drop TX rate for the second (and final
> with this AP) EAPOL-Key msg 2/4 attempt.

> I'd assume the nl80211 vs. WEXT part is unrelated.

Actually, I think I need to take that back.. That "leaked data frame" is something that I think was recently identified as a bug with WEXT. So that bug may actually be enough to explain why the WEXT case was more likely to work here since there may be extra data frames before the 4-way handshake has been completed and with nl80211, those are dropped before given to the driver (which is the correct behavior).

Comment 18 Garrett Mitchener 2013-06-06 00:58:58 UTC
(In reply to Jouni Malinen from comment #16)
 
> I can confirm that this resolves the issue that I'm seeing. For others who
> would like to test this, disabling CONFIG_ATH9K_RATE_CONTROL=y allows this
> to be worked around until ath9k rate control can be improved or the default
> rate control algorithm for ath9k gets changed.

Is this an option for building the ath9k driver ...?

I'll try to get some logs & such next time I'm on campus.  What should I try to get from debugfs?

Comment 19 Sujith 2013-06-06 01:43:05 UTC
(In reply to Jouni Malinen from comment #17)
> Actually, I think I need to take that back.. That "leaked data frame" is
> something that I think was recently identified as a bug with WEXT. So that
> bug may actually be enough to explain why the WEXT case was more likely to
> work here since there may be extra data frames before the 4-way handshake
> has been completed and with nl80211, those are dropped before given to the
> driver (which is the correct behavior).

Yes, frames 1257, 1268 and 1277 are sent before the handshake and they force the rate control to use lower rates based on rising packet error rate.

They are NullFunc frames with the Power Management bit set, btw.

Comment 20 Sujith 2013-06-06 01:49:35 UTC
(In reply to Garrett Mitchener from comment #18)
> Is this an option for building the ath9k driver ...?

Yes, if CONFIG_ATH9K_RATE_CONTROL is not enabled in the kernel config minstrel is chosen.

> I'll try to get some logs & such next time I'm on campus.  What should I try
> to get from debugfs?

/sys/kernel/debug/ieee80211/phy#/xmit
/sys/kernel/debug/ieee80211/phy#/netdev:wlan#/stations/<associated_ap>/rc_stats

Comment 21 Sujith 2013-06-06 01:56:52 UTC
For getting the debugfs stats, please use the default ath9k rate control so we know if excessive retries are happening.

Comment 22 Linus Torvalds 2013-06-06 02:33:10 UTC
Should I just make CONFIG_ATH9K_RATE_CONTROL have a "depends on BROKEN" or at least make it "default n" in the meantime.

Right now it defaults to on, and is clearly not working well. I don't have access to a problematic network at the moment for testing, but I had issues at the hotel during my recent trip to Japan (sitting at Narita airport right now on the way back), being forced to use my *phone* for internet (USB tethering!) because the ath9k driver was unusable.

I've changed my personal config to not have CONFIG_ATH9K_RATE_CONTROL enabled, but since the problem only shows up in certain situations, I don't know if I can verify that it fixes it. The wireless network in the United lounge at Narita doesn't have the problem, for example..

Comment 23 Sujith 2013-06-06 02:38:09 UTC
It can be changed to "default n". Do you want me to send a patch or would you prefer to just change it in mainline ?

Comment 24 Linus Torvalds 2013-06-06 02:42:29 UTC
(In reply to Sujith from comment #23)
> It can be changed to "default n". Do you want me to send a patch or would
> you prefer to just change it in mainline ?

So the problem I see with just changing it to "default n" is that it won't fix people who already have it set because of legacy reasons.

Which is why making it "depends on BROKEN" would actually be better.

Alternatively, we could rename the option (so that old configs having the broken default don't trigger) and make the new name have "default n". Hmm?

Comment 25 Sujith 2013-06-06 02:50:06 UTC
> So the problem I see with just changing it to "default n" is that it won't
> fix people who already have it set because of legacy reasons.
> 
> Which is why making it "depends on BROKEN" would actually be better.
> 
> Alternatively, we could rename the option (so that old configs having the
> broken default don't trigger) and make the new name have "default n". Hmm?

That's a better approach - we can use CONFIG_ATH9K_LEGACY_RATE_CONTROL and set  it "n" by default.

Comment 26 Sujith 2013-06-06 03:14:35 UTC
Patch posted: https://patchwork.kernel.org/patch/2676971/

Comment 27 Sujith 2013-06-06 03:16:05 UTC
Also, this hasn't made it to 3.10 yet: https://patchwork.kernel.org/patch/2646451/

So, for now "iw dev wlan# set power_save off" can be used to improve link stability.

Comment 28 John W. Linville 2013-06-06 19:59:34 UTC
I merged the patches in question to the wireless.git tree and sent them along to Dave Miller a few minutes ago.

Comment 29 Garrett Mitchener 2013-06-08 20:21:31 UTC
I updated my netbook today (2013-06-08) I tried connecting to the campus network again, and the network connection is a lot more stable than it has been.  There have been a few updates to NetworkManager and the kernel, so maybe one of them helped.  Also, our IT crew has been doing some work on the campus wireless network so maybe they changed something.

Anyway, the connection worked for a just few minutes, then started going up and down again, and finally stabilized.  I'm attaching various log files...

Comment 30 Garrett Mitchener 2013-06-08 20:22:40 UTC
Created attachment 758623 [details]
output of lsmod

Comment 31 Garrett Mitchener 2013-06-08 20:23:04 UTC
Created attachment 758624 [details]
output of lspci

Comment 32 Garrett Mitchener 2013-06-08 20:23:38 UTC
Created attachment 758625 [details]
output of dmesg

Comment 33 Garrett Mitchener 2013-06-08 20:24:04 UTC
Created attachment 758626 [details]
/var/log/messages

Comment 34 Garrett Mitchener 2013-06-08 20:25:03 UTC
Created attachment 758627 [details]
copy of /sys/kernel/debug/ieee80211/phy0/ath9k/xmit

Comment 35 Garrett Mitchener 2013-06-08 20:26:33 UTC
Created attachment 758628 [details]
copy of /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/<station address>/rc_stats

Comment 36 Garrett Mitchener 2013-06-08 20:27:10 UTC
Created attachment 758629 [details]
copy of /var/log/wpa_supplicant.log

Comment 37 Garrett Mitchener 2013-06-08 20:27:51 UTC
Created attachment 758630 [details]
rc_stats after the network connection stabilized

Comment 38 Garrett Mitchener 2013-06-08 20:28:27 UTC
Created attachment 758631 [details]
wpa_supplicant.log after the network connection stabilized

Comment 39 Garrett Mitchener 2013-06-08 20:28:56 UTC
Created attachment 758632 [details]
xmit after the network connection stabilized

Comment 40 Garrett Mitchener 2013-06-08 20:34:45 UTC
By the way, the rc_stats one is hard to get because the subdirectory in /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/ corresponding to the hex address of the station keeps appearing and disappearing when the connection is unstable.

Comment 41 Sujith 2013-06-12 07:34:26 UTC
(In reply to Garrett Mitchener from comment #40)
> By the way, the rc_stats one is hard to get because the subdirectory in
> /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/ corresponding to the
> hex address of the station keeps appearing and disappearing when the
> connection is unstable.

rc_stats shows that ath9k RC is still being used, please switch to minstrel.

Comment 42 Garrett Mitchener 2013-06-21 20:47:03 UTC
I recompiled the kernel configured with ATH9K_RATE_CONTROL=no, and it didn't help.  I'm about to attach the log files.  I can still connect to the problem network if I do 'systemctl stop NetworkManager' and then run wpa_supplicant from a script, so I'm not convinced this is (just) a kernel driver problem.

Comment 43 Garrett Mitchener 2013-06-21 20:50:03 UTC
Created attachment 763954 [details]
output of lsmod

kernel 3.9.5, with Fedora 18 patches, ATH9K_RATE_CONTROL=no

Comment 44 Garrett Mitchener 2013-06-21 20:50:52 UTC
Created attachment 763955 [details]
output of dmesg

kernel 3.9.5, with Fedora 18 patches, ATH9K_RATE_CONTROL=no

Comment 45 Garrett Mitchener 2013-06-21 20:51:49 UTC
Created attachment 763956 [details]
/var/log/messages

kernel 3.9.5, with Fedora 18 patches, ATH9K_RATE_CONTROL=no

Comment 46 Garrett Mitchener 2013-06-21 20:52:41 UTC
Created attachment 763967 [details]
copy of /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/<station address>/rc_stats

kernel 3.9.5, with Fedora 18 patches, ATH9K_RATE_CONTROL=no

Comment 47 Garrett Mitchener 2013-06-21 20:53:13 UTC
Created attachment 763968 [details]
copy of /sys/kernel/debug/ieee80211/phy0/ath9k/xmit

kernel 3.9.5, with Fedora 18 patches, ATH9K_RATE_CONTROL=no

Comment 48 Sujith 2013-06-24 13:06:16 UTC
Is "swerdna" the problematic network ?

Comment 49 Garrett Mitchener 2013-06-24 13:17:02 UTC
The one I have the most trouble with is 'cofc-secure', the main campus network at the College of Charleston.  It's WPA2 enterprise, with individual ids and passwords.

'swerdna' is one of the access points at my church, and I do sometimes have trouble there.  I think it's WPA with one password.  I'll collect some data next time I'm there.

Comment 50 Sujith 2013-06-25 07:42:07 UTC
This appears to be a problem with NM. The following snippet appears many times in the log:

Jun 21 12:23:54 falkor NetworkManager[1309]: <info> Activation (wlan0/wireless): access point 'cofc-secure' has security, but secrets are required.
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> (wlan0): device state change: config -> need-auth (reason 'none') [50 60 0]
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> Activation (wlan0) Stage 2 of 5 (Device Configure) complete.
Jun 21 12:23:54 falkor NetworkManager[1309]: <warn> No agents were available for this request.
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> (wlan0): device state change: need-auth -> failed (reason 'no-secrets') [60 120 7]
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> Marking connection 'cofc-secure' invalid.
Jun 21 12:23:54 falkor NetworkManager[1309]: <warn> Activation (wlan0) failed for connection 'cofc-secure'
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> (wlan0): device state change: failed -> disconnected (reason 'none') [120 30 0]
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> (wlan0): deactivating device (reason 'none') [0]

Comment 51 Garrett Mitchener 2013-07-30 17:28:31 UTC
It's July 30 and I'm still seeing this problem with a fully updated Fedora 19 netbook.

The College of Charleston is hosting Flock to Fedora in just a couple of weeks and I can't connect to our campus network...


What more can I do to help debug this?

Comment 52 Dan Williams 2013-07-30 17:33:25 UTC
When NM says "no agents were available for this request", that means nothing is listening for password requests to show you the dialog box.

On Fedora 18, GNOME Shell does this, but it has a bug where if you restart NetworkManager for any reason, it does not reconnect to NetworkManager to listen for these requests.  So if you've restarted NM anywhere, you also have to restart the shell with "Alt+F2" then type 'r' and hit return.  Then it will reconnect to NM and start providing secrets.

Comment 53 Garrett Mitchener 2013-07-30 18:26:53 UTC
No, that's not the problem, I'm seeing something else.  It happens on a fresh boot, fresh log in, I get the gnome password dialog, type in my password, and it just won't connect.  It repeatedly asks for my password, and yes I'm typing it correctly.

Comment 54 Justin P. Mattock 2013-08-05 00:30:57 UTC
maybe this is the same.. 
https://bugzilla.kernel.org/show_bug.cgi?id=60690
but could be wrong

Comment 55 Garrett Mitchener 2013-08-12 15:08:34 UTC
I think this bug might actually be a problem in wpa_supplicant

Possibly related:

bug #950518 here

This next one is an old bug by date, but it's not marked fixed in the w1.fi bugzilla, so it might still be present: wpa_supplicant responds to all requests for identity, not just the ones relevant to this machine: http://w1.fi/bugz/show_bug.cgi?id=455

In working with our network engineers, we found that my laptop was generating lots of EAP requests, and it was enough to trigger some intrusion detection mechanism that deauthenticated it.  So that might be a source of problems.

Comment 56 Garrett Mitchener 2013-08-12 15:12:17 UTC
Also possibly related, #981679 but that one was apparently resolved in fedora 19, where my bug was not

Comment 57 Garrett Mitchener 2013-08-12 15:13:41 UTC
minor correction, let's get the automagic link to bug #981679 this time

Comment 58 Justin P. Mattock 2013-08-12 15:14:20 UTC
tough to say if this is wpa_supplicant(s) issue. should kernel developers break userspace? 

in anycase the bisect can be done killing nm-applet(NM) and manually using wpa_supplicant. good kernel will connect reliably, bad kernel connects once then never connects until you rmmod your module and reload it.

Comment 59 Garrett Mitchener 2013-08-12 15:31:11 UTC
Possibly related:

bug #929217


This one may be related or may be another problem inside anaconda:

bug #892896

Comment 60 Garrett Mitchener 2013-08-12 15:33:14 UTC
I'm seeing this on one notebook with an ath9k wireless card, and another with intel wireless, and another with something else (don't have access to that one right now).  I don't think it's (just) a kernel problem.

Comment 61 Garrett Mitchener 2013-08-12 16:26:26 UTC
Additional info: The glitch http://w1.fi/bugz/show_bug.cgi?id=455 shows up when I start wireshark without any special settings: Unless you change a setting, wireshark puts the wifi device into promiscuous mode, which triggers the messy behavior where wpa_supplicant replies to requests for identity intended for other machines, which results in lots of EAP packets, which triggers the intrusion detection mechanism.  Now that I figured that out, I'm running wireshark in not-promiscuous mode, trying to get better packet captures and track down the problem.  So maybe it's not in wpa_supplicant.  *grr*

Comment 62 Justin P. Mattock 2013-08-12 16:51:32 UTC
I guess a good rule is:
https://lkml.org/lkml/2012/12/23/75
from the master!!

anyways that's the once reliable connect, then after that no connection. 
to see this you need to :

kill NM:
service network-manager stop && killall nm-applet
then
wpa_supplicant -Dwext -I wlan* -c /donde/esta

use CTRL-C to start and stop wpa_supplicant there you will see a reliable connection and an unreliable connection depending on the kernel version.

on the current mainline here I get no connection at all until I apply my patch.
if I
git reset --hard 6565ec then I can toggle wpa_supplicant on and off

Comment 63 Dan Williams 2013-08-12 17:44:04 UTC
Ok, so this seems to be a kernel issue, right?

Comment 64 Justin P. Mattock 2013-08-12 19:43:28 UTC
unless the author of the changes had decided to change the API _knowing_ that wpa_supplicant will break and informed them of the changes, then I think its safe to say its a kernel issue.

Comment 65 Garrett Mitchener 2013-08-19 14:28:50 UTC
Still not sure where my particular wifi problems are coming from: I'm using a different computer, an Asus UX31A which has Intel centrino wifi.  I'm using Fedora 19.  I have to use kernel 3.9.5 because of a bug in 3.19.  Using Gnome and NetworkManager, this notebook connects to the campus network (WPA2 enterprise) more or less okay, but then every few minutes, it disconnects and requests my password again.

I managed to configure the problematic network manually, running wpa_supplicant without NetworkManager, and it seemed a lot more stable, but I haven't run it long enough to know for sure.  That would indicate that my problem is caused by some problem between NetworkManager and my campus wifi network, and is probably not just a kernel issue, although it may have something to do with signal strength as reported by the kernel.

Possibly related:

bug #864906

bug #975244 (another ath9k weak signal report)

bug #929217 (another ath9k weak signal report)

bug #490493 (disconnections caused by automatic scanning for BSSIDS, and old bug report, but with recent comments indicating that it hasn't been resolved)

http://askubuntu.com/questions/224999/ubuntu-wifi-roaming-problem-how-do-i-set-wireless-sensitivity-on-ubuntu

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1057582

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1080634

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1091610

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1144720

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1154559

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1157601

One of the symptoms I sometimes see that relates to several of the above: I sign on to our network using NetworkManager, the icon in Gnome shows high strength, then immediately it roams to another access point with a much weaker signal.

I'm working with campus network engineers-- there may be some other intrusion detection feature or load-balancing feature that's complicating things.


Maybe related but not as promising / informative:

bug #982429 (maybe a problem with where passwords are stored rather than my problem)

bug #675226 (problem with NM and IPv6, which I believe we have on campus)

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1174259 (about WPA personal)

bug #965791 https://lkml.org/lkml/2013/5/22/115 (another WPA2 network, but this one is LEAP, mine is PEAP; intel wifi; something about where secrets are stored; supposedly fixed in kernel-3.9.4, so it shouldn't be affecting me, but just in case)

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1176359

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1182182



(Whatever is going on here must be pretty subtle.  It seems to have a long history of being discovered, reported, not fixed, and put on the back burner, rediscovered, reported again, not fixed, put on the back burner.....)

Comment 66 Garrett Mitchener 2013-08-19 14:44:56 UTC
By the way, we have a *lot* of access points around here, and NM seems to be responsible for a lot of roaming between them.  Is NM possibly searching for the best bandwidth and somehow ends up preferring distant access points that have less load but too weak of a signal?

Comment 67 Justin P. Mattock 2013-08-19 15:27:20 UTC
over here the problem is temporarily fixed with using my patch from:
https://bugzilla.kernel.org/show_bug.cgi?id=60690

machine runs flawlessly with wireless. hopefully the experts come in and review their code.

Comment 68 Justin P. Mattock 2013-08-19 16:18:49 UTC
should be fixed on the current Mainline or rawhid fedora 18
main thread here:
https://bugzilla.redhat.com/show_bug.cgi?id=981445

Comment 69 Garrett Mitchener 2013-08-19 19:11:47 UTC
In reply to comment 67 and comment 68:

I'm stuck with kernel 3.9.5 for the moment because of bug #989763

I'll try the rawhide kernel (link in that bug report, supposedly fixes it) when I get a chance and see if it fixes this bug too.

Comment 70 Garrett Mitchener 2013-08-22 18:32:33 UTC
Created attachment 789310 [details]
/var/log/messages from 2013-08-22, my Asus Zenbook

Part of /var/log/messages: I booted my Asus Zenbook UX31A and signed into Gnome.  The dialog window appeared asking me to confirm my wifi password (WPA2-PEAP-MSCHAPv2).  I clicked Ok, but it couldn't connect to the wifi.  Then I stopped NetworkManager and ran wpa_supplicant manually, and was able to connect.

Comment 71 Garrett Mitchener 2013-09-03 22:10:37 UTC
So far no luck with trying a 3.10 kernel -- until the graphics bug is fixed I can't tell if it fixes NetworkManager

FYI: I tried some Ubuntu live images.  The same problem is present in Ubuntu 13.04, except that the reconnection after deauthentication is smoother.  However, I see no deauthentications at all with Ubuntu 12.10 live, and the network connection seems to be completely steady.  So something happened to introduce this bug in between those versions of NetworkManager and/or the kernel?

Comment 72 Garrett Mitchener 2013-09-23 17:02:37 UTC
Trying fedora 19 with kernel 3.11.1-200 -- Connection is slightly more stable at times, but I had trouble getting it to connect, and it still deauthenticates and disconnects, asks for my password again, same problem...

Still no problems running wpa_supplicant manually with no NetworkManager

Comment 73 Dan Williams 2013-09-26 18:20:53 UTC
Aug 22 14:10:09 falkor NetworkManager[584]: <info> (wlp2s0): supplicant interface state: associating -> associated
Aug 22 14:10:33 falkor NetworkManager[584]: <warn> Activation (wlp2s0/wireless): association took too long.
Aug 22 14:10:33 falkor kernel: [  120.077269] wlp2s0: deauthenticating from 00:0f:7d:3b:ed:23 by local choice (reason=3)

At this point, we need the supplicant logs to see what's going on here.  Can you:

mv /usr/sbin/wpa_supplicant /
killall -TERM wpa_supplicant
/wpa_supplicant -dddtu

and then reconnect to the network and wait for the problem to occur.  Then please attach the supplicant output.  You can get things back to normal by "mv /wpa_supplicant /usr/sbin/" when you're done.

(There may be private information in these logs, so feel free to sanitize them or send them directly to my email address if you'd prefer them not to be public.)

Comment 74 Dan Williams 2013-09-27 17:18:38 UTC
Garrett: this is a wpa_supplicant bug:

1380298605.254235: RSN: PMK from pre-auth - hexdump(len=32): [REMOVED]
1380298605.254245: RSN: Replace PMKSA entry for the current AP
1380298605.254247: nl80211: Delete PMKID for 00:0f:7d:xx:xx:xx
1380298605.254345: wlp2s0: RSN: PMKSA cache entry free_cb: 00:0f:7d:xx:xx:xx reason=1
1380298605.254346: RSN: Flush PMKSA cache entry for 00:0f:7d:xx:xx:xx
1380298605.254347: nl80211: Delete PMKID for 00:0f:7d:xx:xx:xx
1380298605.254396: wlp2s0: RSN: PMKSA cache entry free_cb: 00:0f:7d:xx:xx:xx reason=0
1380298605.254397: wlp2s0: RSN: removed current PMKSA entry
1380298605.254400: wlp2s0: Request to deauthenticate - bssid=00:0f:7d:xx:xx:xx pending_bssid=00:00:00:00:00:00 reason=1 state=COMPLETED
1380298605.254401: wpa_driver_nl80211_deauthenticate(addr=00:0f:7d:xx:xx:xx reason_code=1)
1380298605.260269: wlp2s0: Event DEAUTH (12) received
1380298605.260274: wlp2s0: Deauthentication notification
1380298605.260275: wlp2s0:  * reason 1 (locally generated)
1380298605.260276: Deauthentication frame IE(s) - hexdump(len=0): [NULL]
1380298605.260280: wlp2s0: CTRL-EVENT-DISCONNECTED bssid=00:0f:7d:xx:xx:xx reason=1 locally_generated=1

and happens when you have a WPA Enterprise network with many access points, that supports RSN Preauthentication, which yours does.  The supplicant mishandles the PMKSA cache and flushes the entry for the current access point, which triggers a disconnect.  What specific version of wpa_supplicant are you using?

Comment 75 Garrett Mitchener 2013-09-27 18:32:06 UTC
That's with:

wpa_supplicant-2.0-3.fc19.x86_64

Comment 76 Dan Williams 2013-10-08 14:44:17 UTC
(In reply to Garrett Mitchener from comment #75)
> That's with:
> 
> wpa_supplicant-2.0-3.fc19.x86_64

So there's really two bugs in this one here.  First is the bug that  Linus is describing, which could be fixed already; his report was about WPA-PSK networks.  Second is your bug, which we've already identified as a supplicant bug, but that is specific to WPA-Enterprise networks (which use PMK caching).  So best I think is to clone this bug for wpa_supplicant; use the "Clone this bug" link at the top-right of this page and set the component to wpa_supplicant.  Thanks!

We'll leave this bug open and try to see if Linus' bug has been fixed by now.

Comment 77 Dan Williams 2013-10-08 14:48:24 UTC
(In reply to Dan Williams from comment #76)
> (In reply to Garrett Mitchener from comment #75)
> > That's with:
> > 
> > wpa_supplicant-2.0-3.fc19.x86_64
> 
> So there's really two bugs in this one here.  First is the bug that  Linus
> is describing, which could be fixed already; his report was about WPA-PSK
> networks.  Second is your bug, which we've already identified as a
> supplicant bug, but that is specific to WPA-Enterprise networks (which use
> PMK caching).  So best I think is to clone this bug for wpa_supplicant; use
> the "Clone this bug" link at the top-right of this page and set the
> component to wpa_supplicant.  Thanks!

Since I'm here, I'll just clone it.  New supplicant bug is bug #1016707.

Comment 78 Dan Williams 2013-11-20 18:55:48 UTC
I'm going to go ahead and close this bug, since the original problem was an issue with ath9k rate control and has since been fixed in the kernel.  NetworkManager asking the supplicant to use nl80211 instead of WEXT is the right thing to do, since WEXT is effectively dead, and any issues with nl80211 should get fixed kernel side.  Lastly, Garrett's problem is a separate supplicant issue with PMKSA cache eviction, which is being tracked in another bug.


Note You need to log in before you can comment on or make changes to this bug.