Bug 771542

Summary: wpa_supplicant + band steering = NO NETWORK FOR YOU
Product: [Fedora] Fedora
Reporter: James Ralston <ralston>
Component: wpa_supplicant
Assignee: John Greene <jogreene>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 16
CC: brovvnout+rh, dcbw, jogreene, kalq28
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2013-02-13 10:07:02 EST

Description James Ralston 2012-01-03 21:09:10 EST
Recently, I upgraded my laptop to a Dell Latitude E6420, which came with a spiffy Intel Corporation Centrino Advanced-N 6205 AGN wireless network controller, which is driven by the iwlwifi/iwlagn driver:

iwlwifi 0000:03:00.0: enabling device (0000 -> 0002)
iwlwifi 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
iwlwifi 0000:03:00.0: pci_resource_len = 0x00002000
iwlwifi 0000:03:00.0: pci_resource_base = ffffc90006870000
iwlwifi 0000:03:00.0: HW Revision ID = 0x34
iwlwifi 0000:03:00.0: Detected Intel(R) Centrino(R) Advanced-N 6205 AGN, REV=0xB0
iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S
iwlwifi 0000:03:00.0: device EEPROM VER=0x715, CALIB=0x6
iwlwifi 0000:03:00.0: Device SKU: 0X1f0
iwlwifi 0000:03:00.0: Valid Tx ant: 0X3, Valid Rx ant: 0X3
iwlwifi 0000:03:00.0: Tunable channels: 13 802.11bg, 24 802.11a channels
iwlwifi 0000:03:00.0: loaded firmware version build 42301

The Centrino Advanced-N 6205 AGN can speak 802.11b, 802.11g, 802.11n, 802.11a, carrier pigeon, Morse code, and smoke signals.

The access points (APs) that my employers have deployed utilize what they refer to as "band steering". Specifically, since our 2.4GHz bands are overly congested, if an AP sees a MAC address on air at 5.0GHz, and the station with that MAC address attempts to associate on a 2.4GHz band with a signal strength that would put it within range of the AP if it were to use the 5.0GHz band instead, the AP denies the association. The reasoning the AP employs is that any driver that isn't completely brain-damaged will retry different bands until it tries a 5.0GHz band, at which point the AP will permit the association.
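The AP-side rule described above can be sketched roughly as follows. This is only a toy model of the behavior as I understand it: the class name, the method names, and the -65 dBm threshold are all made up for illustration, and real vendors each implement steering their own way.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: how strong a 2.4 GHz signal must be before the AP
# assumes the station would also be within range on 5 GHz.
STEER_RSSI_DBM = -65


@dataclass
class SteeringAP:
    """Toy model of an AP that steers dual-band stations to 5 GHz."""

    seen_on_5ghz: set = field(default_factory=set)

    def observe_probe(self, mac: str, freq_mhz: int) -> None:
        # Any probe heard on a 5 GHz channel marks the station as dual-band.
        if freq_mhz >= 5000:
            self.seen_on_5ghz.add(mac)

    def allow_association(self, mac: str, freq_mhz: int, rssi_dbm: int) -> bool:
        if freq_mhz >= 5000:
            return True  # 5 GHz associations are always permitted
        dual_band = mac in self.seen_on_5ghz
        close_enough = rssi_dbm >= STEER_RSSI_DBM
        # Deny 2.4 GHz only when the station could be using 5 GHz instead.
        return not (dual_band and close_enough)


if __name__ == "__main__":
    ap = SteeringAP()
    mac = "00:11:22:33:44:55"
    print(ap.allow_association(mac, 2437, -50))  # station not yet seen on 5 GHz
    ap.observe_probe(mac, 5180)                  # the scan gives it away
    print(ap.allow_association(mac, 2437, -50))  # strong 2.4 GHz signal
    print(ap.allow_association(mac, 5180, -60))  # 5 GHz attempt
```

In this model, a station that probes every channel before associating marks itself as 5 GHz-capable, which is exactly what makes its first 2.4 GHz attempt fail.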

Enter wpa_supplicant.

After considerable effort (which involved bludgeoning NetworkManager over the head with a shovel while yelling "I know you and wpa_supplicant are BFFs, but STOP INSTANTLY RESTARTING IT WHEN I DELIBERATELY KILL IT!"), I was able to run wpa_supplicant in debugging mode. And from that I see that whenever wpa_supplicant runs, it probes every single channel on every single AP it can find. This action immediately lets every AP within range know that my station can speak 5.0GHz, even if I had just changed my MAC address to something that no AP had seen before.

Then, inevitably, the very first channel that wpa_supplicant attempts to use is a 2.4GHz channel. And when that attempt is denied (AP: "I know that you can speak 5.0GHz to me, because I JUST SAW YOU DO IT, and furthermore, you're effing 15 FEET from me (I can see you RIGHT THERE), so GET THE HELL OFF the overloaded 2.4GHz range and go to 5.0GHz"), rather than retrying a different channel, wpa_supplicant just gives up and exits.

This is akin to a person attempting to enter a room that has about 30 doors, and upon finding the first door he tries to be locked, concludes "WELL THERE CLEARLY ISN'T ANY WAY INTO THIS ROOM", and curls up in the corner to starve to death.

Now, according to the sample wpa_supplicant.conf file provided in /usr/share/doc/wpa_supplicant-0.7.3, if I set "ap_scan=0" in /etc/wpa_supplicant/wpa_supplicant.conf, wpa_supplicant will just let the driver handle AP scanning/selection. Which would solve this problem as long as the driver isn't as brain-damaged as wpa_supplicant is.

But alas, there's no way to tell, because wpa_supplicant seems to cheerfully ignore the ap_scan setting. As a matter of fact, wpa_supplicant seems to ignore *every* setting in the /etc/wpa_supplicant/wpa_supplicant.conf file, including the one that names the directory where it should create sockets for wpa_cli/wpa_gui. (Which, for now, one has to manually create; see bug 748153.)

I also tried manually setting a channel via the "iwconfig freq" command. wpa_supplicant appears to override it. And I can't bypass wpa_supplicant entirely, because NetworkManager insists, INSISTS on using it, and provides no way to NOT use it.

Finally, I skimmed the sample config file to see if there was any option to affect wpa_supplicant's retry behavior, or to lock wpa_supplicant to using channels at a certain frequency. I couldn't find anything.

So, in summary, wpa_supplicant cannot cope with band steering, wpa_supplicant ignores the option that tells it to let the driver perform AP scanning/selection, and NetworkManager insists on using wpa_supplicant. So if anyone is attempting to associate with a network that uses band steering, and the first channel that wpa_supplicant attempts to use is denied by the AP, then you cannot get onto the network.

The optimal solution for this problem would be for wpa_supplicant to retry different channels if its first association attempt is denied. If that cannot be easily implemented, then wpa_supplicant should at least honor ap_scan=0 in the /etc/wpa_supplicant/wpa_supplicant.conf file.
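As a rough illustration of the retry behavior being asked for, here is a minimal sketch. It is not how wpa_supplicant is structured internally; `associate_with_fallback`, `steering_ap`, and the (bssid, freq_mhz) tuples are hypothetical stand-ins.

```python
from typing import Callable, Iterable, Optional, Tuple

Candidate = Tuple[str, int]  # (bssid, freq_mhz); a made-up representation


def associate_with_fallback(
    candidates: Iterable[Candidate],
    try_associate: Callable[[str, int], bool],
) -> Optional[Candidate]:
    """Try each candidate in order; return the first one that associates."""
    for bssid, freq_mhz in candidates:
        if try_associate(bssid, freq_mhz):
            return (bssid, freq_mhz)
        # Association denied: move on to the next candidate, do not give up.
    return None


def steering_ap(bssid: str, freq_mhz: int) -> bool:
    """Stand-in for a band-steering AP: deny 2.4 GHz, accept 5 GHz."""
    return freq_mhz >= 5000


if __name__ == "__main__":
    scan_results = [
        ("00:00:00:00:00:01", 2412),
        ("00:00:00:00:00:01", 2437),
        ("00:00:00:00:00:02", 5180),
    ]
    print(associate_with_fallback(scan_results, steering_ap))
```

The point is only that a denied first attempt should lead to the next door being tried, and that failure is reported only once every candidate has been refused.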


Comment 1 James Ralston 2012-03-20 19:31:26 EDT
Ok, I see why my attempts to control wpa_supplicant's behavior via its config file weren't working: when the -u option is specified, wpa_supplicant never even looks at the config file; it gets all of its configuration information via NM via DBus.

So I'm not really sure whom to blame here. If wpa_supplicant doesn't expose a DBus knob to adjust freq_list, then I can't fault NM for not exposing a way to set it via the GUI configuration.

This wouldn't be quite as frustrating if wireless networking were falling over because the driver for my particular hardware was spotty. But I'm using the iwlwifi/iwlagn driver, which is one of the better ones. No, the reason why networking is falling over is because the userland tools (NM, wpa_supplicant) don't know how to deal with band steering, *AND* they don't expose any knobs to allow me to deal with band steering manually.

Comment 2 John Greene 2012-10-02 12:41:53 EDT
Well, I feel your pain: I have the same scars. I just went through this with another customer using the iwlwifi/iwlagn driver and the supplicant.

I can demystify a couple of things for you, as I've been up to my eyes in much of this for the last few months (on SUSE, mind you: translate as needed). Here we go.

We had a 6300 Intel card: close to what you have but check your mileage!

They had a wpa_supplicant 0.7.3 base that we helped them customize for a touch screen. The product was deployed in the type of highly roamable networks that employ band steering as you describe. They were NOT using NetworkManager, but rather had employed a simpler home-brew app/scripts solution that switched between wired Ethernet and Wi-Fi as it detected connectivity. They built their kernel without NM: that's how it got removed. Even if you get it out, you still have issues with band steering: more on that soon.

I know that the Intel 6300 card and its kernel driver don't have the smarts to perform scanning *and* roaming without the assistance of wpa_supplicant. Roaming support just isn't in the Intel kernel driver: it can scan (older driver versions ping the associated AP to ensure it's still there, which creates other issues with spontaneous disconnects, but I digress). However, it won't initiate roaming without the supplicant requesting it.

So, band steering. I can confirm you are correct that large networks actively deny association to 2.4 GHz devices, in hopes of driving them to 5 GHz to avoid congestion. On top of that, devices aren't aware of it, so they often do the wrong thing. And it's not a standardized thing: each network has its own special sauce for how to do it, and the poor supplicant has no idea it's going on. The supplicant isn't that innocent either: it uses blacklisting to track the APs it has tried and failed to connect to, but the notion of band steering isn't factored into that logic. It just marches merrily on down the 2.4 GHz channels looking for a victim, and band hopping isn't considered as often as it should be. ap_scan won't help you: it mostly just says who is driving the scan/roam bus, not whether it's a 2.4 GHz or 5 GHz bus.

So here is the best advice I found for this problem, given that much code change is needed before the supplicant can support band steering as it needs to:

1. Ask IT to turn off band steering: it's a problem for many devices. Perhaps you get laughed out of the office, but likely they are getting whined at by others too, if they admit it. So failing that:

2. I wish the supplicant had a band limit keyword, like Band24G or Band5G or whatever, but we aren't there. The closest I found in the current code was to use the freq_list keyword in the wpa_supplicant.conf file. It's awkward but already there: limited testing said it was workable for avoiding band steering. The general idea would be to specify an SSID and limit the freq_list to the A band channels yourself, at a higher priority than G.
Details from docs of wpa_supplicant:

# freq_list: Array of allowed frequencies
# Space-separated list of frequencies in MHz to allow for selecting the BSS. If
# set, scan results that do not match any of the specified frequencies are not
# considered when selecting a BSS.
So in wpa_supplicant.conf it would look something like this:

network={
    # A band channels: these are only a few for example
    freq_list=5180 5240 5745 5825
    # priority default is 0: bigger numbers are tried first
    # (the value 1 here is just an example)
    priority=1
    <...other required stuff...>
}

# same ssid, no freq_list, so 2.4 GHz is allowed as second priority
network={
    <...other required stuff...>
}
Have you tried this?  Might be a good enough (yes, not preferred) workaround if you have only a few or single sites.

Comment 3 Dan Williams 2012-10-03 16:28:26 EDT
I also dearly wish the supplicant had a band config value. However, upstream doesn't really want to add one, and suggests using freq_list instead. Which, while it works, means you have to list every possible 2.4 or 5 GHz frequency. Oh well, just means a bit more work for us.
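Listing every frequency by hand is tedious, so here is a small sketch that generates a freq_list line from channel numbers. The channel-to-frequency arithmetic is the standard 802.11 mapping; the particular set of 24 5 GHz channels is an assumption (it matches the "24 802.11a channels" count in the kernel log above, but what is actually allowed depends on your regulatory domain), and the function names are made up.

```python
def channel_to_freq_mhz(channel: int) -> int:
    """Map an 802.11 channel number to its center frequency in MHz."""
    if 1 <= channel <= 13:
        return 2407 + 5 * channel
    if channel == 14:
        return 2484
    if 36 <= channel <= 165:
        return 5000 + 5 * channel
    raise ValueError(f"unsupported channel: {channel}")


# Assumed set of 20 MHz 5 GHz channels; adjust for your regulatory domain.
CHANNELS_5GHZ = (
    list(range(36, 65, 4))       # 36 .. 64
    + list(range(100, 141, 4))   # 100 .. 140
    + list(range(149, 166, 4))   # 149 .. 165
)


def freq_list_line(channels) -> str:
    """Build a freq_list= line for a wpa_supplicant.conf network block."""
    freqs = sorted(channel_to_freq_mhz(ch) for ch in channels)
    return "freq_list=" + " ".join(str(f) for f in freqs)


if __name__ == "__main__":
    print(freq_list_line(CHANNELS_5GHZ))
```

The four frequencies in the example from comment 2 (5180, 5240, 5745, 5825) correspond to channels 36, 48, 149, and 165 under this mapping.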

In any case, there have been some fixes to the supplicant in the last year upstream specifically for band steering; I believe there have also been some changes to the mac80211 stack.  We'd need to identify which (if any) of these patches are candidates for backports.

And to James: the supplicant is not braindead; it's vendors who did not implement any standard way to indicate band preferences, which means the supplicant is unable to get the necessary information to make a decision. The driver is no better at this; it doesn't have the necessary information either, it just makes a different decision than the supplicant. The driver's decision is completely wrong for other cases, but apparently not for band steering in your instance.

This is certainly fixable in the supplicant in a way that works for everyone, whereas ap_scan=0 is *NOT* workable for everyone, and will never, ever be a selectable option. The supplicant must be fixed to work for all cases with ap_scan=1 (except in the case of hidden networks).

ap_scan=0 is actually a complete hack to allow very, very old hardware that handled scanning and AP selection in firmware (which is even *less* flexible than doing that in the supplicant) to mostly work in a modern world.

Comment 4 James Ralston 2012-10-18 20:54:23 EDT
In response to comment 2:

Yes, the config file gives us the ability to direct wpa_supplicant to specific frequencies for specific ssids.

But the problem (as I explained in comment 1) is that when NetworkManager launches wpa_supplicant, it supplies the -u flag, which causes wpa_supplicant to TOTALLY IGNORE the wpa_supplicant.conf file provided by the -c flag. As in, it never even opens the file. (I proved this by temporarily replacing /usr/sbin/wpa_supplicant with an strace wrapper and then combing through the strace output.)

Really, I think that's the bug here: if both the -c and -u flags are specified, wpa_supplicant should still take its initial configuration from the configuration file specified by -c, THEN enable the DBus control interface.

Comment 5 James Ralston 2012-10-18 21:20:08 EDT
In response to comment 3:

I'd still argue that wpa_supplicant's scanning behavior is braindead. Here's the behavior I see: 

1. Try a pile of 2.4 GHz channels. Fail to authenticate to every one.
2. Try a 5 GHz channel. Successfully authenticate the very first time.
3. Wait between 30-120 seconds.
4. Disconnect from the 5 GHz channel.
5. Go to step #1.

Meaning, wpa_supplicant won't stay connected to a 5 GHz channel. Every 30-120 seconds, it disconnects and tries to associate with a pile of 2.4 GHz channels, even though it never manages to associate with any of them, and always associates with any 5 GHz channel it tries the first time, every time.

This behavior renders the wireless interface unusable.

My guess is that wpa_supplicant keeps trying to associate to 2.4 GHz channels because it notices that they have a stronger signal, and that's the only criterion wpa_supplicant is using to trigger a rescan. That's not a bug per se, but that strategy fails miserably when the APs are performing band steering. wpa_supplicant needs to factor in past association success rate before it decides to dump a perfectly adequate channel in favor of attempting to associate to channels with better signal but 0% recent association success rates.
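To make the suggestion concrete, here is a minimal sketch of a roam decision that weighs association history alongside signal strength. Everything in it is hypothetical: the `Candidate` fields, the `should_roam` function, and both thresholds are invented for illustration and are not taken from wpa_supplicant.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    bssid: str
    freq_mhz: int
    signal_dbm: int
    attempts: int = 0    # recent association attempts
    successes: int = 0   # how many of those succeeded

    @property
    def success_rate(self) -> float:
        # No history yet: be optimistic so that new APs still get tried.
        if self.attempts == 0:
            return 1.0
        return self.successes / self.attempts


def should_roam(
    current: Candidate,
    candidate: Candidate,
    min_gain_db: int = 10,
    min_success_rate: float = 0.2,
) -> bool:
    """Roam only if the candidate is clearly stronger AND has not been
    consistently refusing us."""
    if candidate.success_rate < min_success_rate:
        return False
    return candidate.signal_dbm - current.signal_dbm >= min_gain_db


if __name__ == "__main__":
    current = Candidate("00:00:00:00:00:02", 5180, -70, attempts=3, successes=3)
    steered = Candidate("00:00:00:00:00:01", 2437, -50, attempts=5, successes=0)
    print(should_roam(current, steered))
```

With a rule like this, the stronger 2.4 GHz channel that has refused every recent attempt no longer pulls the station off a working 5 GHz association.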

Comment 6 Fedora End Of Life 2013-01-16 09:07:01 EST
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 

Comment 7 Fedora End Of Life 2013-02-13 10:07:07 EST
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.