Bug 1441418

Summary: [NMCI] WPA2 PEAP connection sometimes weirdly fails
Product: Red Hat Enterprise Linux 7 Reporter: Vladimir Benes <vbenes>
Component: NetworkManagerAssignee: Beniamino Galvani <bgalvani>
Status: CLOSED DUPLICATE QA Contact: Desktop QE <desktop-qa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.4CC: aloughla, atragler, bgalvani, fgiudici, lrintel, rkhan, sukulkar, thaller, vbenes
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-06 16:05:29 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
journal trace log
none
log from successful run
none
[PATCH] nl80211: Fix race condition in detecting MAC change
none
[PATCH] device: delay bringing wifi interfaces up after MAC change none

Description Vladimir Benes 2017-04-11 21:51:20 UTC
Description of problem:
I see occasional fail in test that is creating WPA2 PEAP profile via editor and connecting it.
 

Version-Release number of selected component (if applicable):


How reproducible:
It's quite random but reproduceable (at least once in 20 attempts) 

Steps to Reproduce:
    Scenario: nmcli - wifi-sec - configure and connect WPA2-PEAP profile
    * Add a new connection of type "wifi" and options "ifname wlan0 con-name qe-wpa2-enterprise autoconnect off ssid qe-wpa2-enterprise"
    * Check ifcfg-name file created for connection "qe-wpa2-enterprise"
    * Open editor for connection "qe-wpa2-enterprise"
    * Set a property named "802-11-wireless-security.key-mgmt" to "wpa-eap" in editor
    * Set a property named "802-1x.eap" to "peap" in editor
    * Set a property named "802-1x.identity" to "Bill Smith" in editor
    * Set a property named "802-1x.ca-cert" to "file:///tmp/certs/eaptest_ca_cert.pem" in editor
    * Set a property named "802-1x.phase2-auth" to "gtc" in editor
    * Set a property named "802-1x.password" to "testing123" in editor
    * Save in editor
    * No error appeared in editor
    * Check value saved message showed in editor
    * Quit editor
    * Bring up connection "qe-wpa2-enterprise"
    Then "qe-wpa2-enterprise" is visible with command "iw dev wlan0 link"
    Then "\*\s+qe-wpa2-enterprise" is visible with command "nmcli -f IN-USE,SSID device wifi list"
    Then "inet 10." is visible with command "ip a s wlan0"


Actual results:
fail sometimes

Expected results:
no fail

Additional info:
Bigger timeouts between steps are not helpful. 
Not really sure if this isn't caused by infrastructure.

line 536:
dub 11 17:41:36 wlan-r2s10.wlan.rhts.eng.bos.redhat.com NetworkManager[2966]: <debug> [1491946896.6142] device[0x55fbb63c9180] (wlan0): wifi-scan: scanning requested but not allowed at this time

This is weird. I don't think it's present when it passes

Comment 1 Vladimir Benes 2017-04-11 21:51:51 UTC
Created attachment 1270957 [details]
journal trace log

Comment 2 Vladimir Benes 2017-04-11 21:54:47 UTC
Created attachment 1270958 [details]
log from successful run

Comment 3 Vladimir Benes 2017-04-12 08:27:21 UTC
I can see a bunch of warnings too:
(process:13115): GLib-GIO-WARNING **: gdbusobjectmanagerclient.c:1585: Processing InterfaceRemoved signal for path /org/freedesktop/NetworkManager/IP6Config/35 but no object proxy exists

(process:13115): libnm-WARNING **: no object known for /org/freedesktop/NetworkManager/AccessPoint/610

Comment 4 Thomas Haller 2017-04-13 13:07:41 UTC
the glib warnings are a separate issue.


In the attached logfile (and in 3 similar cases from beaker tests), I always see a timeout with: "device (wlan0): Activation: (wifi) association took too long" message.

»···»···/* Connection failed; either driver problems, the encryption key is
»···»··· * wrong, or the passwords or certificates were wrong.
»···»··· */
»···»···_LOGW (LOGD_DEVICE | LOGD_WIFI,
»···»···       "Activation: (wifi) association took too long");

so, we ask for new secrets (which then fails).


But the issue is that the first association attempt fails. Unknown reason so far.

Comment 5 Beniamino Galvani 2017-04-21 09:13:54 UTC
I noticed something similar, sometimes the wifi activation gets
stuck. Looking at wpa_supplicant logs, it seems that the randomization
of the scan MAC address causes problems with wpa_supplicant, that
loses the authentication event and makes the connection time out.

On a successful run:

 nl80211: Drv Event 37 (NL80211_CMD_AUTHENTICATE) received for wlp4s0
 nl80211: MLME event 37 (NL80211_CMD_AUTHENTICATE) on wlp4s0(02:7a:1f:ab:00:cd) A1=02:7a:1f:ab:00:cd A2=f4:f2:6d:1a:59:96
 nl80211: Authenticate event

Instead on failure:

 nl80211: Drv Event 37 (NL80211_CMD_AUTHENTICATE) received for wlp4s0
 nl80211: MLME event 37 (NL80211_CMD_AUTHENTICATE) on wlp4s0(b2:bf:62:b9:fa:88) A1=02:7a:1f:ab:00:cd A2=f4:f2:6d:1a:59:96
 nl80211: wlp4s0: Ignore MLME frame event for foreign address

b2:bf:62:b9:fa:88 is the scan MAC address set by NM:

 device (wlp4s0): set-hw-addr: set MAC address to B2:BF:62:B9:FA:88 (scanning)

I'm further investigating this.

Comment 6 Thomas Haller 2017-04-21 11:34:35 UTC
when I added set-hw-addr, I noticed that older versions of supplicants failed to properly reload when the MAC address changes externally.

Newer versions seem fixed. The version in rhel-7 back then seemed not entirely correct to me at that time, but later I was unable to reproduce the issue.
Maybe it is the same issue. Which supplicant version were you using? See https://bugzilla.redhat.com/show_bug.cgi?id=1374756#c7

Comment 7 Vladimir Benes 2017-04-21 11:48:42 UTC
both
wpa_supplicant-2.0-21.el7_3.x86_64
wpa_supplicant-2.6-3.el7.x86_64

Comment 8 Beniamino Galvani 2017-04-21 11:55:59 UTC
I'm using wpa_supplicant compiled from git master. wpa_supplicant
should detect a change in the MAC address after this commit:

https://w1.fi/cgit/hostap/commit/?id=3e0272ca00ce1df35b45e7d739dd7e935f13fd84

However, there is a race condition here:

https://w1.fi/cgit/hostap/tree/src/drivers/driver_nl80211.c?id=8befe8a993be661492eae73e2bc76b62c7a9d9f6#n999

When wpa_s receives the down event, it is possible that we have
already brought the interface up again and so the down-up cycle is
ignored.

Comment 10 Beniamino Galvani 2017-04-24 15:31:10 UTC
Created attachment 1273634 [details]
[PATCH] nl80211: Fix race condition in detecting MAC change

A patch to fix the race in wpa_supplicant.

Comment 11 Thomas Haller 2017-04-24 15:37:52 UTC
(In reply to Beniamino Galvani from comment #10)
> Created attachment 1273634 [details]
> [PATCH] nl80211: Fix race condition in detecting MAC change
> 
> A patch to fix the race in wpa_supplicant.

lgtm

Comment 12 Beniamino Galvani 2017-04-24 16:18:59 UTC
Created attachment 1273645 [details]
[PATCH] device: delay bringing wifi interfaces up after MAC change

Until wpa_supplicant gets patched, this is the only workaround I found for NM.

Comment 13 Thomas Haller 2017-04-24 16:33:29 UTC
(In reply to Beniamino Galvani from comment #12)
> Created attachment 1273645 [details]
> [PATCH] device: delay bringing wifi interfaces up after MAC change
> 
> Until wpa_supplicant gets patched, this is the only workaround I found for
> NM.

lgtm

Comment 17 Beniamino Galvani 2017-05-11 08:55:22 UTC
It turns out that the the output is buffered, and so timestamps are not reliable. I captured a dump of the traffic and there were missing EAPOL frames from the AP which caused an authentication timeout. But now I'm not able to reproduce the issue anymore.

Comment 18 Beniamino Galvani 2017-06-06 16:05:29 UTC
The root cause seems the same of bug 1449301, closing as duplicate.

*** This bug has been marked as a duplicate of bug 1449301 ***