Bug 1016707

Summary: wpa_supplicant-2.0-3.fc19 still has PMK caching problems
Product: [Fedora] Fedora Reporter: Dan Williams <dcbw>
Component: wpa_supplicantAssignee: Dan Williams <dcbw>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 19CC: daniel, dcbw, garrett.mitchener, jasonmcarman, j, linville, robert.shade, sgruszka, sujith, torvalds
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: wpa_supplicant-2.0-8.fc19 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 927191
: 1032758 (view as bug list) Environment:
Last Closed: 2013-12-17 19:05:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 927191    
Bug Blocks:    

Description Dan Williams 2013-10-08 14:47:41 UTC
+++ This bug was initially created as a clone of Bug #927191 +++

Description of problem:

On *some* networks, NM seems to have problems ever associating. The machine works fine on other wireless networks, so I'm not sure what triggers the problem. For example, I can connect to my Android phone hotspot (WPA2) fine, but when I try to connect to the particular network in the hotel room I am right now, it never successfully completes. It keeps disassociating, and asking for the WPA password over and over again.

And I know the wireless network works fine. My phone can connect to it, and I can even connect to it on the same laptop if I kill NetworkManager and then just do it manually with wpa_supplicant.

Version-Release number of selected component (if applicable):

NetworkManager-0.9.8.0-1.fc18.x86_64

How reproducible:

100% reproducible on that particular network (essid: "lodgecondo", password "condo000", WPA2/psk). But it works on my home network, and with my phone hotspot. 

Steps to Reproduce:
1. Find a network that shows this issue
2. Try to connect to it
3. Run around naked with a chainsaw to let out your frustrations
  
Actual results:

This is from dmesg:
  wlan0: authenticate with 50:46:5d:02:85:08
  wlan0: send auth to 50:46:5d:02:85:08 (try 1/3)
  wlan0: authenticated
  wlan0: associate with 50:46:5d:02:85:08 (try 1/3)
  wlan0: RX AssocResp from 50:46:5d:02:85:08 (capab=0x411 status=0 aid=16)
  wlan0: associated
  wlan0: disassociating from 50:46:5d:02:85:08 by local choice (reason=3)
  cfg80211: Calling CRDA to update world regulatory domain
  cfg80211: World regulatory domain updated:
  cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
  cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (2457000 KHz - 2482000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211: Calling CRDA for country: US
  cfg80211: Regulatory domain changed to country: US
  cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
  cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2700 mBm)
  cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 1700 mBm)
  cfg80211:   (5250000 KHz - 5330000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (5490000 KHz - 5600000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (5650000 KHz - 5710000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
  cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 3000 mBm)
  cfg80211:   (57240000 KHz - 63720000 KHz @ 2160000 KHz), (N/A, 4000 mBm)

Rinse and repeat. Notice the "wlan0: disassociating from..".

Expected results:

This is dmesg with just wpa_supplicant run by hand:

  wlan0: authenticate with 50:46:5d:02:85:08
  wlan0: send auth to 50:46:5d:02:85:08 (try 1/3)
  wlan0: authenticated
  wlan0: associate with 50:46:5d:02:85:08 (try 1/3)
  wlan0: RX AssocResp from 50:46:5d:02:85:08 (capab=0x411 status=0 aid=16)
  wlan0: associated
  IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready

The wireless works fine, I'm doing this bug-report with that wireless network right now. Just not with NetworkManager running.

Additional info:

This is a ChromeBook pixel with the ath9k chipset, with up-to-date Fedora18 as of today (except I obviously have my own kernel - but I checked, and it fails the same way with the currenf Fedora kernel too).

--- Additional comment from Linus Torvalds on 2013-03-25 05:55:22 CDT ---

This is the wpa_supplicant.log file with extra debug output as documented at

  https://live.gnome.org/NetworkManager/Debugging

for a failed event (well, several failures, as far as I can tell.

Does this make anybody go "oh, I know what's up?"

--- Additional comment from Linus Torvalds on 2013-03-25 11:00:51 CDT ---

It's cut off in the middle of a line because apparently wpa_supplicant ends up buffering standard output to a file, but I suspect nobody cares. It has successfully authenticated already and finished the connection.

This comes with the PSK key visible, this is not some super-secure installation, it's a rental condo.

--- Additional comment from Linus Torvalds on 2013-03-25 21:05:58 CDT ---

Ok, I couldn't get NetworkManager or dbus debugging to work, so I brute-forced it by renaming "wpa_supplicant" as "wpa_supplicant.real" and then have just a shell-script wpa_supplicant that added the -ddKt flags and then ran the real one.

That worked, and here's the wpa_supplicant log with key data and timestamps for the non-working NetworkManager case.

It definitely uses the right passphrase, and the resulting binary PSK key seems to be identical too. Any ideas?

--- Additional comment from Robert Shade on 2013-03-25 22:21:43 CDT ---

Observations:

In the working trace, wpa_supplicant is using WEXT, whereas the other is using nl80211.

For some reason, when under nl80211, wpa_supplicant doesn't get a response for its EAPOL-Key 2/4 and the connection times out.

--- Additional comment from Dan Williams on 2013-03-26 09:00:49 CDT ---

As on my reply to your private mail thread, the WEXT vs. nl80211 thing could well be an issue here, we need to figure out why the supplicant/kernel aren't connecting correctly with nl80211.

If you haven't already, make sure you move NetworkManager out of the way so that systemd doesn't respawn it after you've stopped it (annoying, I know).  Then run your successful supplicant test from comment #2, but when spawning the supplicant, use:

wpa_supplicant -dddtuK -Dnl80211 -C /home/torvalds/wpa_conf.txt -i wlan0

and lets see if that works; that'll help point the blame finger in the right direction.

--- Additional comment from Linus Torvalds on 2013-03-26 10:19:35 CDT ---

Yup, it seems to be related to -Dnl80211. This log was created with

wpa_supplicant -dddtuK -Dnl80211 -i wlan0 -c wpa_conf.txt > nl80211.log

--- Additional comment from Linus Torvalds on 2013-03-26 10:24:39 CDT ---

This should be very similar to the 716083 attachement, but I made it right after the failing case with the exact same flags except for dropping the -Dnl80211 part. Just in case the extra -d and -u flags make a difference.

--- Additional comment from Garrett Mitchener on 2013-04-10 08:56:41 CDT ---

I'm seeing this bug too.  I'm using Fedora 18 with an Asus netbook.  The campus wireless uses WPA2.  The network connection works with this machine and Windows XP, Fedora 17 (before I upgraded to F18), and Ubuntu 12.10.

--- Additional comment from Dan Williams on 2013-04-15 15:14:28 CDT ---

(In reply to comment #8)
> I'm seeing this bug too.  I'm using Fedora 18 with an Asus netbook.  The
> campus wireless uses WPA2.  The network connection works with this machine
> and Windows XP, Fedora 17 (before I upgraded to F18), and Ubuntu 12.10.

Garrett, what kind of wifi card do you have?  Running 'lspci' and 'lsusb' would certainly tell us; you could attach the output of those commands here.

--- Additional comment from Garrett Mitchener on 2013-04-16 11:57:48 CDT ---

Output from lspci on my asus EEEPC netbook

--- Additional comment from Garrett Mitchener on 2013-04-16 11:59:06 CDT ---

Output from lsusb on my asus EEEPC netbook

--- Additional comment from Sujith on 2013-06-05 12:32:50 CDT ---

(In reply to Garrett Mitchener from comment #8)
> I'm seeing this bug too.  I'm using Fedora 18 with an Asus netbook.  The
> campus wireless uses WPA2.  The network connection works with this machine
> and Windows XP, Fedora 17 (before I upgraded to F18), and Ubuntu 12.10.

Garrett, are you still seeing this issue ?

--- Additional comment from Dan Williams on 2013-06-05 12:53:04 CDT ---

So Garrett's machine has ath9k, which isn't as mature as other drivers, so that could be the cause of the problems, also the nl80211/wext thing.  Logs from 'dmesg' would be helpful to see if the device simply loses sync with the AP or what.

--- Additional comment from Sujith on 2013-06-05 14:24:50 CDT ---

Yes, I'd like ath9k logs along with some debugfs statistics to debug this issue.

--- Additional comment from Sujith on 2013-06-05 14:29:28 CDT ---

Ok, I got a copy of the packet capture that Linus sent to Jouni and here are some observations:

In the success case, MSG 2/4 is sent at MCS 3.
In the failure case, MSG 2/4 is sent at MCS 13.

From the wpa_s logs, MSG 2/4 appears to cause the timeout and eventual disconnect.

This is a problem with the ath9k rate control, where it sends EAPOL packets at really high rates even in very congested environments for the reason that it starts from the *highest* rate and then starts ramping down based on the packet error rate.

But, this will cause authentication timeouts because the algorithm needs time to come down to lower rates and a failed EAPOL packet breaks things.

minstrel doesn't have this problem because it starts from the lowest rate and goes upward. I am not sure about the difference between nl80211/WEXT and why that would make a difference in rate selection.

This has to be verified - so if someone who can reproduce this uses minstrel to check if the issue is still seen, that'd give us a clue.

--- Additional comment from Jouni Malinen on 2013-06-05 18:37:23 CDT ---

(In reply to Sujith from comment #15)
> Ok, I got a copy of the packet capture that Linus sent to Jouni and here are
> some observations:
> 
> In the success case, MSG 2/4 is sent at MCS 3.
> In the failure case, MSG 2/4 is sent at MCS 13.

Ah, great. That makes things easier to understand. Based on this, I'm now easily able to reproduce the issue and get capture files that look very similar to the one from Linus.

> This is a problem with the ath9k rate control, where it sends EAPOL packets
> at really high rates even in very congested environments for the reason that
> it starts from the *highest* rate and then starts ramping down based on the
> packet error rate.
> 
> But, this will cause authentication timeouts because the algorithm needs
> time to come down to lower rates and a failed EAPOL packet breaks things.

Yes, that needs to be fixed. The behavior seemed to be to try two first data frames after association with MCS 13 and only then drop to MCS 3. Guess how many times the AP that Linus tried to use transmitted EAPOL-Key 1/4.. Just twice. The last success case worked due to another bug (in mac80211, I'd assume) that leaked a data frame that should not have been there. That was enough to convince rate control to drop TX rate for the second (and final with this AP) EAPOL-Key msg 2/4 attempt.

I did not check what rates were included in the retransmit sequence, but apparently not low enough to make this work.

This seems to match quite well with the description of WPA2-Personal working with phone hotspot or home network and not with a hotel network. I'd assume the first two have much better signal strength while some hotels I've been to would be in quite the opposite end of the scale.. ;-) (The one in the capture file was not really that bad, but it was not good enough to get anything like MCS 13 going through.)

> minstrel doesn't have this problem because it starts from the lowest rate
> and goes upward. I am not sure about the difference between nl80211/WEXT and
> why that would make a difference in rate selection.

I'd assume the nl80211 vs. WEXT part is unrelated.

> This has to be verified - so if someone who can reproduce this uses minstrel
> to check if the issue is still seen, that'd give us a clue.

I can confirm that this resolves the issue that I'm seeing. For others who would like to test this, disabling CONFIG_ATH9K_RATE_CONTROL=y allows this to be worked around until ath9k rate control can be improved or the default rate control algorithm for ath9k gets changed.

--- Additional comment from Jouni Malinen on 2013-06-05 19:00:19 CDT ---

(In reply to Jouni Malinen from comment #16)
> The last success case worked due to another bug (in mac80211, I'd
> assume) that leaked a data frame that should not have been there. That was
> enough to convince rate control to drop TX rate for the second (and final
> with this AP) EAPOL-Key msg 2/4 attempt.

> I'd assume the nl80211 vs. WEXT part is unrelated.

Actually, I think I need to take that back.. That "leaked data frame" is something that I think was recently identified as a bug with WEXT. So that bug may actually be enough to explain why the WEXT case was more likely to work here since there may be extra data frames before the 4-way handshake has been completed and with nl80211, those are dropped before given to the driver (which is the correct behavior).

--- Additional comment from Garrett Mitchener on 2013-06-05 19:58:58 CDT ---

(In reply to Jouni Malinen from comment #16)
 
> I can confirm that this resolves the issue that I'm seeing. For others who
> would like to test this, disabling CONFIG_ATH9K_RATE_CONTROL=y allows this
> to be worked around until ath9k rate control can be improved or the default
> rate control algorithm for ath9k gets changed.

Is this an option for building the ath9k driver ...?

I'll try to get some logs & such next time I'm on campus.  What should I try to get from debugfs?

--- Additional comment from Sujith on 2013-06-05 20:43:05 CDT ---

(In reply to Jouni Malinen from comment #17)
> Actually, I think I need to take that back.. That "leaked data frame" is
> something that I think was recently identified as a bug with WEXT. So that
> bug may actually be enough to explain why the WEXT case was more likely to
> work here since there may be extra data frames before the 4-way handshake
> has been completed and with nl80211, those are dropped before given to the
> driver (which is the correct behavior).

Yes, frames 1257, 1268 and 1277 are sent before the handshake and they force the rate control to use lower rates based on rising packet error rate.

They are NullFunc frames with the Power Management bit set, btw.

--- Additional comment from Sujith on 2013-06-05 20:49:35 CDT ---

(In reply to Garrett Mitchener from comment #18)
> Is this an option for building the ath9k driver ...?

Yes, if CONFIG_ATH9K_RATE_CONTROL is not enabled in the kernel config minstrel is chosen.

> I'll try to get some logs & such next time I'm on campus.  What should I try
> to get from debugfs?

/sys/kernel/debug/ieee80211/phy#/xmit
/sys/kernel/debug/ieee80211/phy#/netdev:wlan#/stations/<associated_ap>/rc_stats

--- Additional comment from Sujith on 2013-06-05 20:56:52 CDT ---

For getting the debugfs stats, please use the default ath9k rate control so we know if excessive retries are happening.

--- Additional comment from Linus Torvalds on 2013-06-05 21:33:10 CDT ---

Should I just make CONFIG_ATH9K_RATE_CONTROL have a "depends on BROKEN" or at least make it "default n" in the meantime.

Right now it defaults to on, and is clearly not working well. I don't have access to a problematic network at the moment for testing, but I had issues at the hotel during my recent trip to Japan (sitting at Narita airport right now on the way back), being forced to use my *phone* for internet (USB tethering!) because the ath9k driver was unusable.

I've changed my personal config to not have CONFIG_ATH9K_RATE_CONTROL enabled, but since the problem only shows up in certain situations, I don't know if I can verify that it fixes it. The wireless network in the United lounge at Narita doesn't have the problem, for example..

--- Additional comment from Sujith on 2013-06-05 21:38:09 CDT ---

It can be changed to "default n". Do you want me to send a patch or would you prefer to just change it in mainline ?

--- Additional comment from Linus Torvalds on 2013-06-05 21:42:29 CDT ---

(In reply to Sujith from comment #23)
> It can be changed to "default n". Do you want me to send a patch or would
> you prefer to just change it in mainline ?

So the problem I see with just changing it to "default n" is that it won't fix people who already have it set because of legacy reasons.

Which is why making it "depends on BROKEN" would actually be better.

Alternatively, we could rename the option (so that old configs having the broken default don't trigger) and make the new name have "default n". Hmm?

--- Additional comment from Sujith on 2013-06-05 21:50:06 CDT ---

> So the problem I see with just changing it to "default n" is that it won't
> fix people who already have it set because of legacy reasons.
> 
> Which is why making it "depends on BROKEN" would actually be better.
> 
> Alternatively, we could rename the option (so that old configs having the
> broken default don't trigger) and make the new name have "default n". Hmm?

That's a better approach - we can use CONFIG_ATH9K_LEGACY_RATE_CONTROL and set  it "n" by default.

--- Additional comment from Sujith on 2013-06-05 22:14:35 CDT ---

Patch posted: https://patchwork.kernel.org/patch/2676971/

--- Additional comment from Sujith on 2013-06-05 22:16:05 CDT ---

Also, this hasn't made it to 3.10 yet: https://patchwork.kernel.org/patch/2646451/

So, for now "iw dev wlan# set power_save off" can be used to improve link stability.

--- Additional comment from John W. Linville on 2013-06-06 14:59:34 CDT ---

I merged the patches in question to the wireless.git tree and sent them along to Dave Miller a few minutes ago.

--- Additional comment from Garrett Mitchener on 2013-06-08 15:21:31 CDT ---

I updated my netbook today (2013-06-08) I tried connecting to the campus network again, and the network connection is a lot more stable than it has been.  There have been a few updates to NetworkManager and the kernel, so maybe one of them helped.  Also, our IT crew has been doing some work on the campus wireless network so maybe they changed something.

Anyway, the connection worked for a just few minutes, then started going up and down again, and finally stabilized.  I'm attaching various log files...

--- Additional comment from Garrett Mitchener on 2013-06-08 15:22:40 CDT ---



--- Additional comment from Garrett Mitchener on 2013-06-08 15:23:04 CDT ---



--- Additional comment from Garrett Mitchener on 2013-06-08 15:23:38 CDT ---



--- Additional comment from Garrett Mitchener on 2013-06-08 15:24:04 CDT ---



--- Additional comment from Garrett Mitchener on 2013-06-08 15:25:03 CDT ---



--- Additional comment from Garrett Mitchener on 2013-06-08 15:26:33 CDT ---



--- Additional comment from Garrett Mitchener on 2013-06-08 15:27:10 CDT ---



--- Additional comment from Garrett Mitchener on 2013-06-08 15:27:51 CDT ---



--- Additional comment from Garrett Mitchener on 2013-06-08 15:28:27 CDT ---



--- Additional comment from Garrett Mitchener on 2013-06-08 15:28:56 CDT ---



--- Additional comment from Garrett Mitchener on 2013-06-08 15:34:45 CDT ---

By the way, the rc_stats one is hard to get because the subdirectory in /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/ corresponding to the hex address of the station keeps appearing and disappearing when the connection is unstable.

--- Additional comment from Sujith on 2013-06-12 02:34:26 CDT ---

(In reply to Garrett Mitchener from comment #40)
> By the way, the rc_stats one is hard to get because the subdirectory in
> /sys/kernel/debug/ieee80211/phy0/netdev:wlan0/stations/ corresponding to the
> hex address of the station keeps appearing and disappearing when the
> connection is unstable.

rc_stats shows that ath9k RC is still being used, please switch to minstrel.

--- Additional comment from Garrett Mitchener on 2013-06-21 15:47:03 CDT ---

I recompiled the kernel configured with ATH9K_RATE_CONTROL=no, and it didn't help.  I'm about to attach the log files.  I can still connect to the problem network if I do 'systemctl stop NetworkManager' and then run wpa_supplicant from a script, so I'm not convinced this is (just) a kernel driver problem.

--- Additional comment from Garrett Mitchener on 2013-06-21 15:50:03 CDT ---

kernel 3.9.5, with Fedora 18 patches, ATH9K_RATE_CONTROL=no

--- Additional comment from Garrett Mitchener on 2013-06-21 15:50:52 CDT ---

kernel 3.9.5, with Fedora 18 patches, ATH9K_RATE_CONTROL=no

--- Additional comment from Garrett Mitchener on 2013-06-21 15:51:49 CDT ---

kernel 3.9.5, with Fedora 18 patches, ATH9K_RATE_CONTROL=no

--- Additional comment from Garrett Mitchener on 2013-06-21 15:52:41 CDT ---

kernel 3.9.5, with Fedora 18 patches, ATH9K_RATE_CONTROL=no

--- Additional comment from Garrett Mitchener on 2013-06-21 15:53:13 CDT ---

kernel 3.9.5, with Fedora 18 patches, ATH9K_RATE_CONTROL=no

--- Additional comment from Sujith on 2013-06-24 08:06:16 CDT ---

Is "swerdna" the problematic network ?

--- Additional comment from Garrett Mitchener on 2013-06-24 08:17:02 CDT ---

The one I have the most trouble with is 'cofc-secure', the main campus network at the College of Charleston.  It's WPA2 enterprise, with individual ids and passwords.

'swerdna' is one of the access points at my church, and I do sometimes have trouble there.  I think it's WPA with one password.  I'll collect some data next time I'm there.

--- Additional comment from Sujith on 2013-06-25 02:42:07 CDT ---

This appears to be a problem with NM. The following snippet appears many times in the log:

Jun 21 12:23:54 falkor NetworkManager[1309]: <info> Activation (wlan0/wireless): access point 'cofc-secure' has security, but secrets are required.
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> (wlan0): device state change: config -> need-auth (reason 'none') [50 60 0]
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> Activation (wlan0) Stage 2 of 5 (Device Configure) complete.
Jun 21 12:23:54 falkor NetworkManager[1309]: <warn> No agents were available for this request.
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> (wlan0): device state change: need-auth -> failed (reason 'no-secrets') [60 120 7]
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> Marking connection 'cofc-secure' invalid.
Jun 21 12:23:54 falkor NetworkManager[1309]: <warn> Activation (wlan0) failed for connection 'cofc-secure'
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> (wlan0): device state change: failed -> disconnected (reason 'none') [120 30 0]
Jun 21 12:23:54 falkor NetworkManager[1309]: <info> (wlan0): deactivating device (reason 'none') [0]

--- Additional comment from Garrett Mitchener on 2013-07-30 12:28:31 CDT ---

It's July 30 and I'm still seeing this problem with a fully updated Fedora 19 netbook.

The College of Charleston is hosting Flock to Fedora in just a couple of weeks and I can't connect to our campus network...


What more can I do to help debug this?

--- Additional comment from Dan Williams on 2013-07-30 12:33:25 CDT ---

When NM says "no agents were available for this request", that means nothing is listening for password requests to show you the dialog box.

On Fedora 18, GNOME Shell does this, but it has a bug where if you restart NetworkManager for any reason, it does not reconnect to NetworkManager to listen for these requests.  So if you've restarted NM anywhere, you also have to restart the shell with "Alt+F2" then type 'r' and hit return.  Then it will reconnect to NM and start providing secrets.

--- Additional comment from Garrett Mitchener on 2013-07-30 13:26:53 CDT ---

No, that's not the problem, I'm seeing something else.  It happens on a fresh boot, fresh log in, I get the gnome password dialog, type in my password, and it just won't connect.  It repeatedly asks for my password, and yes I'm typing it correctly.

--- Additional comment from Justin P. Mattock on 2013-08-04 19:30:57 CDT ---

maybe this is the same.. 
https://bugzilla.kernel.org/show_bug.cgi?id=60690
but could be wrong

--- Additional comment from Garrett Mitchener on 2013-08-12 10:08:34 CDT ---

I think this bug might actually be a problem in wpa_supplicant

Possibly related:

bug #950518 here

This next one is an old bug by date, but it's not marked fixed in the w1.fi bugzilla, so it might still be present: wpa_supplicant responds to all requests for identity, not just the ones relevant to this machine: http://w1.fi/bugz/show_bug.cgi?id=455

In working with our network engineers, we found that my laptop was generating lots of EAP requests, and it was enough to trigger some intrusion detection mechanism that deauthenticated it.  So that might be a source of problems.

--- Additional comment from Garrett Mitchener on 2013-08-12 10:12:17 CDT ---

Also possibly related, #981679 but that one was apparently resolved in fedora 19, where my bug was not

--- Additional comment from Garrett Mitchener on 2013-08-12 10:13:41 CDT ---

minor correction, let's get the automagic link to bug #981679 this time

--- Additional comment from Justin P. Mattock on 2013-08-12 10:14:20 CDT ---

tough to say if this is wpa_supplicant(s) issue. should kernel developers break userspace? 

in anycase the bisect can be done killing nm-applet(NM) and manually using wpa_supplicant. good kernel will connect reliably, bad kernel connects once then never connects until you rmmod your module and reload it.

--- Additional comment from Garrett Mitchener on 2013-08-12 10:31:11 CDT ---

Possibly related:

bug #929217


This one may be related or may be another problem inside anaconda:

bug #892896

--- Additional comment from Garrett Mitchener on 2013-08-12 10:33:14 CDT ---

I'm seeing this on one notebook with an ath9k wireless card, and another with intel wireless, and another with something else (don't have access to that one right now).  I don't think it's (just) a kernel problem.

--- Additional comment from Garrett Mitchener on 2013-08-12 11:26:26 CDT ---

Additional info: The glitch http://w1.fi/bugz/show_bug.cgi?id=455 shows up when I start wireshark without any special settings: Unless you change a setting, wireshark puts the wifi device into promiscuous mode, which triggers the messy behavior where wpa_supplicant replies to requests for identity intended for other machines, which results in lots of EAP packets, which triggers the intrusion detection mechanism.  Now that I figured that out, I'm running wireshark in not-promiscuous mode, trying to get better packet captures and track down the problem.  So maybe it's not in wpa_supplicant.  *grr*

--- Additional comment from Justin P. Mattock on 2013-08-12 11:51:32 CDT ---

I guess a good rule is:
https://lkml.org/lkml/2012/12/23/75
from the master!!

anyways that's the once reliable connect, then after that no connection. 
to see this you need to :

kill NM:
service network-manager stop && killall nm-applet
then
wpa_supplicant -Dwext -I wlan* -c /donde/esta

use CTRL-C to start and stop wpa_supplicant there you will see a reliable connection and an unreliable connection depending on the kernel version.

on the current mainline here I get no connection at all until I apply my patch.
if I
git reset --hard 6565ec then I can toggle wpa_supplicant on and off

--- Additional comment from Dan Williams on 2013-08-12 12:44:04 CDT ---

Ok, so this seems to be a kernel issue, right?

--- Additional comment from Justin P. Mattock on 2013-08-12 14:43:28 CDT ---

unless the author of the changes had decided to change the API _knowing_ that wpa_supplicant will break and informed them of the changes, then I think its safe to say its a kernel issue.

--- Additional comment from Garrett Mitchener on 2013-08-19 09:28:50 CDT ---

Still not sure where my particular wifi problems are coming from: I'm using a different computer, an Asus UX31A which has Intel centrino wifi.  I'm using Fedora 19.  I have to use kernel 3.9.5 because of a bug in 3.19.  Using Gnome and NetworkManager, this notebook connects to the campus network (WPA2 enterprise) more or less okay, but then every few minutes, it disconnects and requests my password again.

I managed to configure the problematic network manually, running wpa_supplicant without NetworkManager, and it seemed a lot more stable, but I haven't run it long enough to know for sure.  That would indicate that my problem is caused by some problem between NetworkManager and my campus wifi network, and is probably not just a kernel issue, although it may have something to do with signal strength as reported by the kernel.

Possibly related:

bug #864906

bug #975244 (another ath9k weak signal report)

bug #929217 (another ath9k weak signal report)

bug #490493 (disconnections caused by automatic scanning for BSSIDS, and old bug report, but with recent comments indicating that it hasn't been resolved)

http://askubuntu.com/questions/224999/ubuntu-wifi-roaming-problem-how-do-i-set-wireless-sensitivity-on-ubuntu

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1057582

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1080634

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1091610

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1144720

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1154559

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1157601

One of the symptoms I sometimes see that relates to several of the above: I sign on to our network using NetworkManager, the icon in Gnome shows high strength, then immediately it roams to another access point with a much weaker signal.

I'm working with campus network engineers-- there may be some other intrusion detection feature or load-balancing feature that's complicating things.


Maybe related but not as promising / informative:

bug #982429 (maybe a problem with where passwords are stored rather than my problem)

bug #675226 (problem with NM and IPv6, which I believe we have on campus)

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1174259 (about WPA personal)

bug #965791 https://lkml.org/lkml/2013/5/22/115 (another WPA2 network, but this one is LEAP, mine is PEAP; intel wifi; something about where secrets are stored; supposedly fixed in kernel-3.9.4, so it shouldn't be affecting me, but just in case)

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1176359

https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1182182



(Whatever is going on here must be pretty subtle.  It seems to have a long history of being discovered, reported, not fixed, and put on the back burner, rediscovered, reported again, not fixed, put on the back burner.....)

--- Additional comment from Garrett Mitchener on 2013-08-19 09:44:56 CDT ---

By the way, we have a *lot* of access points around here, and NM seems to be responsible for a lot of roaming between them.  Is NM possibly searching for the best bandwidth and somehow ends up preferring distant access points that have less load but too weak of a signal?

--- Additional comment from Justin P. Mattock on 2013-08-19 10:27:20 CDT ---

over here the problem is temporarily fixed with using my patch from:
https://bugzilla.kernel.org/show_bug.cgi?id=60690

machine runs flawlessly with wireless. hopefully the experts come in and review their code.

--- Additional comment from Justin P. Mattock on 2013-08-19 11:18:49 CDT ---

should be fixed on the current Mainline or rawhid fedora 18
main thread here:
https://bugzilla.redhat.com/show_bug.cgi?id=981445

--- Additional comment from Garrett Mitchener on 2013-08-19 14:11:47 CDT ---

In reply to comment 67 and comment 68:

I'm stuck with kernel 3.9.5 for the moment because of bug #989763

I'll try the rawhide kernel (link in that bug report, supposedly fixes it) when I get a chance and see if it fixes this bug too.

--- Additional comment from Garrett Mitchener on 2013-08-22 13:32:33 CDT ---

Part of /var/log/messages: I booted my Asus Zenbook UX31A and signed into Gnome.  The dialog window appeared asking me to confirm my wifi password (WPA2-PEAP-MSCHAPv2).  I clicked Ok, but it couldn't connect to the wifi.  Then I stopped NetworkManager and ran wpa_supplicant manually, and was able to connect.

--- Additional comment from Garrett Mitchener on 2013-09-03 17:10:37 CDT ---

So far no luck with trying a 3.10 kernel -- until the graphics bug is fixed I can't tell if it fixes NetworkManager

FYI: I tried some Ubuntu live images.  The same problem is present in Ubuntu 13.04, except that the reconnection after deauthentication is smoother.  However, I see no deauthentications at all with Ubuntu 12.10 live, and the network connection seems to be completely steady.  So something happened to introduce this bug in between those versions of NetworkManager and/or the kernel?

--- Additional comment from Garrett Mitchener on 2013-09-23 12:02:37 CDT ---

Trying fedora 19 with kernel 3.11.1-200 -- Connection is slightly more stable at times, but I had trouble getting it to connect, and it still deauthenticates and disconnects, asks for my password again, same problem...

Still no problems running wpa_supplicant manually with no NetworkManager

--- Additional comment from Dan Williams on 2013-09-26 13:20:53 CDT ---

Aug 22 14:10:09 falkor NetworkManager[584]: <info> (wlp2s0): supplicant interface state: associating -> associated
Aug 22 14:10:33 falkor NetworkManager[584]: <warn> Activation (wlp2s0/wireless): association took too long.
Aug 22 14:10:33 falkor kernel: [  120.077269] wlp2s0: deauthenticating from 00:0f:7d:3b:ed:23 by local choice (reason=3)

At this point, we need the supplicant logs to see what's going on here.  Can you:

mv /usr/sbin/wpa_supplicant /
killall -TERM wpa_supplicant
/wpa_supplicant -dddtu

and then reconnect to the network and wait for the problem to occur.  Then please attach the supplicant output.  You can get things back to normal by "mv /wpa_supplicant /usr/sbin/" when you're done.

(There may be private information in these logs, so feel free to sanitize them or send them directly to my email address if you'd prefer them not to be public.)

--- Additional comment from Dan Williams on 2013-09-27 12:18:38 CDT ---

Garrett: this is a wpa_supplicant bug:

1380298605.254235: RSN: PMK from pre-auth - hexdump(len=32): [REMOVED]
1380298605.254245: RSN: Replace PMKSA entry for the current AP
1380298605.254247: nl80211: Delete PMKID for 00:0f:7d:xx:xx:xx
1380298605.254345: wlp2s0: RSN: PMKSA cache entry free_cb: 00:0f:7d:xx:xx:xx reason=1
1380298605.254346: RSN: Flush PMKSA cache entry for 00:0f:7d:xx:xx:xx
1380298605.254347: nl80211: Delete PMKID for 00:0f:7d:xx:xx:xx
1380298605.254396: wlp2s0: RSN: PMKSA cache entry free_cb: 00:0f:7d:xx:xx:xx reason=0
1380298605.254397: wlp2s0: RSN: removed current PMKSA entry
1380298605.254400: wlp2s0: Request to deauthenticate - bssid=00:0f:7d:xx:xx:xx pending_bssid=00:00:00:00:00:00 reason=1 state=COMPLETED
1380298605.254401: wpa_driver_nl80211_deauthenticate(addr=00:0f:7d:xx:xx:xx reason_code=1)
1380298605.260269: wlp2s0: Event DEAUTH (12) received
1380298605.260274: wlp2s0: Deauthentication notification
1380298605.260275: wlp2s0:  * reason 1 (locally generated)
1380298605.260276: Deauthentication frame IE(s) - hexdump(len=0): [NULL]
1380298605.260280: wlp2s0: CTRL-EVENT-DISCONNECTED bssid=00:0f:7d:xx:xx:xx reason=1 locally_generated=1

and happens when you have a WPA Enterprise network with many access points, that supports RSN Preauthentication, which yours does.  The supplicant mishandles the PMKSA cache and flushes the entry for the current access point, which triggers a disconnect.  What specific version of wpa_supplicant are you using?

--- Additional comment from Garrett Mitchener on 2013-09-27 13:32:06 CDT ---

That's with:

wpa_supplicant-2.0-3.fc19.x86_64

--- Additional comment from Dan Williams on 2013-10-08 09:44:17 CDT ---

(In reply to Garrett Mitchener from comment #75)
> That's with:
> 
> wpa_supplicant-2.0-3.fc19.x86_64

So there's really two bugs in this one here.  First is the bug that  Linus is describing, which could be fixed already; his report was about WPA-PSK networks.  Second is your bug, which we've already identified as a supplicant bug, but that is specific to WPA-Enterprise networks (which use PMK caching).  So best I think is to clone this bug for wpa_supplicant; use the "Clone this bug" link at the top-right of this page and set the component to wpa_supplicant.  Thanks!

We'll leave this bug open and try to see if Linus' bug has been fixed by now.

Comment 1 Jason Carman 2013-10-17 14:29:26 UTC
I've encountered this bug (or couple of bugs) recently in Fedora 19.  This happens both at home using WPA/WPA2 (confirmed switching to WEP solves the problem) as well as using the wireless at the college which I attend which I believe uses WPA Enterprise.

Using the following on a laptop:
supplicant package - wpa_supplicant-2.0-3.fc19.x86_64
kernel version - 3.11.4-201.fc19.x86_64
wireless nic -
03:00.0 Network controller: Broadcom Corporation BCM4313 802.11bgn Wireless Network Adapter (rev 01)
	Subsystem: Hewlett-Packard Company Device 1795
	Kernel driver in use: wl

Logs:
http://paste.fedoraproject.org/46354/13815989/
http://paste.fedoraproject.org/47486/13820187/

Comment 2 Garrett Mitchener 2013-10-17 14:34:15 UTC
I've posted logs at the other bug #927191 -- it was determined that there were two separate problems there, so we duplicated that report to here

Seems to be a common problem at colleges

Comment 3 Fedora Update System 2013-11-18 22:20:00 UTC
wpa_supplicant-2.0-8.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/wpa_supplicant-2.0-8.fc19

Comment 4 Fedora Update System 2013-11-18 22:22:11 UTC
wpa_supplicant-2.0-8.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/wpa_supplicant-2.0-8.fc20

Comment 5 Garrett Mitchener 2013-11-20 19:29:50 UTC
So far so good -- I've been using wpa_supplicant-2.0-8.fc19 for a few hours now and there is no sign of needless deauthentications, the connection is stable!  YAY!  Thanks Dan!

Comment 6 Fedora Update System 2013-11-21 04:39:23 UTC
Package wpa_supplicant-2.0-8.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing wpa_supplicant-2.0-8.fc19'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-21779/wpa_supplicant-2.0-8.fc19
then log in and leave karma (feedback).

Comment 7 Garrett Mitchener 2013-11-22 17:44:09 UTC
Update: So far, the above package is working well.

The troublesome wifi connection is still a bit "off" (sometimes asking for passwords twice in a row) but the constant deauthenticating is gone and the connection is essentially stable once established.  YAY!  Thank you!

I suspect (not sure) that the remaining annoying behavior may be something to do with resuming from suspend, and maybe the bugs where gnome-shell doesn't pass credentials back and forth to NetworkManager.

Comment 8 Jason Carman 2013-11-23 05:33:18 UTC
I'm still getting an Associated -> Disconnected loop, however I'm also getting a warning that my kernel is tainted (broadcom wl driver) that may be part of the issue.  I'll see if I can do some tinkering/troubleshooting tomorrow to further test and possibly fix the problem on my end and post feed back as soon as I can.  It's the end of the semester, so things have gotten kind of nuts.  Worst comes to worst I can try a reinstall after the course work for this semester is over (I need the Fedora installation as I'm packaging for Pidora for a current class).

Comment 9 Fedora Update System 2013-11-24 03:58:29 UTC
wpa_supplicant-2.0-8.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 10 Fedora Update System 2013-12-17 19:05:23 UTC
wpa_supplicant-2.0-8.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.