This service will be undergoing maintenance at 00:00 UTC, 2016-08-01. It is expected to last about 1 hours

Bug 1025371

Summary: [abrt] NetworkManager-0.9.9.0-14.git20131003.fc20: _g_log_abort: Process /usr/sbin/NetworkManager was killed by signal 6 (SIGABRT)
Product: [Fedora] Fedora Reporter: Philipp Dreimann <philipp>
Component: NetworkManagerAssignee: Dan Williams <dcbw>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 20CC: cmaser, danw, dcbw, jklimes, thaller, vonbrand
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
URL: https://retrace.fedoraproject.org/faf/reports/bthash/7abbe5f0439f22beb187e572a69df5cd6bb70ee8
Whiteboard: abrt_hash:c18be0bb4e7fa3dac71c9dfeb88c3d43365d7909
Fixed In Version: NetworkManager-0.9.9.0-19.git20131003.fc20 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-11-25 22:58:27 EST Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Attachments:
Description Flags
File: backtrace
none
File: cgroup
none
File: core_backtrace
none
File: dso_list
none
File: environ
none
File: limits
none
File: maps
none
File: open_fds
none
File: proc_pid_status
none
File: var_log_messages
none
[PATCH] core: fix crash when connecting to wifi (rh #1025371)
none
[PATCH] core: cleanup handling of AP in nm-device-wifi and fix crash
none
[PATCH v2 1/2] core: cleanup handling of AP in nm-device-wifi and fix crash
none
[PATCH v2 2/2] wifi: always assume can_scan_ssid and use ap_scan=1 for infra mode none

Description Philipp Dreimann 2013-10-31 10:49:25 EDT
Version-Release number of selected component:
NetworkManager-0.9.9.0-14.git20131003.fc20

Additional info:
reporter:       libreport-2.1.9
backtrace_rating: 4
cmdline:        /usr/sbin/NetworkManager --no-daemon
crash_function: _g_log_abort
executable:     /usr/sbin/NetworkManager
kernel:         3.11.6-301.fc20.x86_64
runlevel:       N 5
type:           CCpp
uid:            0

Truncated backtrace:
Thread no. 1 (4 frames)
 #2 _g_log_abort at gmessages.c:255
 #5 handle_ip_config_timeout at devices/nm-device-wifi.c:3097
 #6 act_stage4_ip6_config_timeout at devices/nm-device-wifi.c:3168
 #7 nm_device_activate_ip6_config_timeout at devices/nm-device.c:3708
Comment 1 Philipp Dreimann 2013-10-31 10:49:33 EDT
Created attachment 817946 [details]
File: backtrace
Comment 2 Philipp Dreimann 2013-10-31 10:49:36 EDT
Created attachment 817947 [details]
File: cgroup
Comment 3 Philipp Dreimann 2013-10-31 10:49:40 EDT
Created attachment 817948 [details]
File: core_backtrace
Comment 4 Philipp Dreimann 2013-10-31 10:49:43 EDT
Created attachment 817949 [details]
File: dso_list
Comment 5 Philipp Dreimann 2013-10-31 10:49:46 EDT
Created attachment 817950 [details]
File: environ
Comment 6 Philipp Dreimann 2013-10-31 10:49:49 EDT
Created attachment 817951 [details]
File: limits
Comment 7 Philipp Dreimann 2013-10-31 10:49:52 EDT
Created attachment 817952 [details]
File: maps
Comment 8 Philipp Dreimann 2013-10-31 10:49:56 EDT
Created attachment 817953 [details]
File: open_fds
Comment 9 Philipp Dreimann 2013-10-31 10:49:59 EDT
Created attachment 817954 [details]
File: proc_pid_status
Comment 10 Philipp Dreimann 2013-10-31 10:50:02 EDT
Created attachment 817955 [details]
File: var_log_messages
Comment 11 Thomas Haller 2013-11-01 05:46:01 EDT
Created attachment 818211 [details]
[PATCH] core: fix crash when connecting to wifi (rh #1025371)

Hi,

I looked into it, but I am unsure what caused the crash. What do you think about the attached patch?
Comment 12 Thomas Haller 2013-11-01 05:47:51 EDT
(In reply to Thomas Haller from comment #11)
> [...] What do you think about the attached patch?


"you" in the sense: "dear general audience, whoever reads this" :)
Comment 13 Thomas Haller 2013-11-01 16:00:56 EDT
Following annotations are by dcbw how to properly handle APs. Will come up with a new patch soon...


deactivate(): on deactivate/unavailable, remove all "fake" APs from the scan list
∘ merge_scanned_ap(): just use priv->current_ap?
∘ if device roams to non-fake AP, remove fake AP from list?
∘ link_timeout_cb(): just use current_ap as long as fakes are removed?  first connect the fake AP should be current due to act_stage1_prepare()
∘ supplicant_iface_state_cb(): use connection instead of AP
∘ supplicant_connection_timeout_cb(): use connection instead of AP
∘ act_stage1_prepare(): same behavior, use activation AP
∘ act_stage2_config(): 
∘ nm_supplicant_config_add_setting_wireless(): remove is_broadcast argument and use ap_scan=1 for everything
∘ build_supplicant_config(): pass frequency into here instead of NMAccessPoint, pass that on to nm_supplicant_config_add_setting_wireless()
∘ handle_ip_config_timeout(): use the connection only
∘ activation_success_handler(): use current_ap
∘ activation_failure_handler(): leave fake AP removal to deactivate


dcbw: "act_stage2_config() was the only one I wasn't quite sure about"
Comment 14 Thomas Haller 2013-11-04 07:34:36 EST
Created attachment 819142 [details]
[PATCH] core: cleanup handling of AP in nm-device-wifi and fix crash

Some notes in comments #13 I did not really understand. Please review.
Comment 15 Dan Winship 2013-11-05 13:39:11 EST
In several places, dcbw said "use connection instead of AP", meaning, call nm_device_get_connection(), and get the relevant data from there rather than from the AP. (eg, in supplicant_iface_state_cb(), you can get the ssid from the connection). I'm not totally sure why he suggested this, but you didn't do it...

I'm pretty sure deactivate() should still be using get_activation_ap(); there's no way to add additional fake APs beyond the one you initially connect to, so it's only the original one that could be fake, and it's possible that that's not the one you're connected to currently.

act_stage2_config() ought to be able to use priv->current_ap, since that gets set at the end of act_stage1_prepare().

The comment in nm_supplicant_config_add_setting_wireless() is sort of weird now.
Comment 16 Jirka Klimes 2013-11-06 04:44:01 EST
(In reply to Dan Winship from comment #15)
> In several places, dcbw said "use connection instead of AP", meaning, call
> nm_device_get_connection(), and get the relevant data from there rather than
> from the AP. (eg, in supplicant_iface_state_cb(), you can get the ssid from
> the connection). I'm not totally sure why he suggested this, but you didn't
> do it...
> 
Maybe because it is sometimes easier to get e.g. ssid from the connection, and it is guaranteed to be there (otherwise the connection would not be valid).

> The comment in nm_supplicant_config_add_setting_wireless() is sort of weird
> now.
You mean the comment about the ap_scan == 1? Yeah, the previous comment is now misleading.
Comment 17 Jirka Klimes 2013-11-06 04:52:26 EST
*** Bug 854073 has been marked as a duplicate of this bug. ***
Comment 18 Dan Williams 2013-11-06 12:40:42 EST
I should have clarified my notes around nm_supplicant_config_add_setting_wireless().

We should use ap_scan=1 *except* for AP/IBSS/AdHoc, where ap_scan=2 is required.  ap_scan for "infra" mode is all historical and was for old, crappy, and proprietary drivers that we should really stop hacking stuff for.  Those drivers did not support probe-scanning for hidden APs and thus the supplicant just had to send all the config to the driver and hope things worked.

All relevant and non-crappy drivers these days support at least one SSID probe and thus is_broadcast affecting ap_scan should no longer be something we support.  If you have an old, crappy WEXT/proprietary/staging driver, and you use hidden APs, you're doing it wrong.

So, in short, we must keep the ap_scan=2 logic for AP+AdHoc, but we can remove the is_broadcast and has_scan_capa_ssid arguments and the code where they change ap_scan.
Comment 19 Thomas Haller 2013-11-06 13:45:28 EST
Created attachment 820662 [details]
[PATCH v2 1/2] core: cleanup handling of AP in nm-device-wifi and fix crash
Comment 20 Thomas Haller 2013-11-06 13:46:05 EST
Created attachment 820663 [details]
[PATCH v2 2/2] wifi: always assume can_scan_ssid and use ap_scan=1 for infra mode
Comment 21 Thomas Haller 2013-11-06 13:52:56 EST
comment #19 reworks the patch from comment #14, the handling of the AP list.

(In reply to Dan Winship from comment #15)
> I'm pretty sure deactivate() should still be using get_activation_ap();
> there's no way to add additional fake APs beyond the one you initially
> connect to, so it's only the original one that could be fake, and it's
> possible that that's not the one you're connected to currently.

I don't think so. I made sure, that a fake AP can only be the current_ap. Whenever we set the current_ap to another AP, the previous AP will be removed (iff it was fake). Thus, either get_activation_ap() returns the current_ap, or it is not fake.


I also removed get_activation_ap() altogether, as it's no longer used.

The change in activation_success_handler is interesting, because in line 3235 the AP gets removed, but it is still kept as current_ap. This is the only case, where current_ap is not in ap_list.


The patch for setting scan_ap=1 was extracted to comment #20
Comment 22 Thomas Haller 2013-11-08 12:20:06 EST
Push a workaround to upstream master:

http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=788eed99de51cd35adeb6585379b5e920c79b3f3

This patch tries to avoid any crashes due to roaming AP.

Still, we will cleanup AP handling soon, revert this commit and push the patches that are discuessed here.
Comment 23 Fedora Update System 2013-11-20 09:31:51 EST
NetworkManager-0.9.9.0-19.git20131003.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/NetworkManager-0.9.9.0-19.git20131003.fc20
Comment 24 Horst H. von Brand 2013-11-20 13:03:41 EST
No idea. Happily typing away at LibreOffice at the moment.

reporter:       libreport-2.1.9
backtrace_rating: 4
cmdline:        /usr/sbin/NetworkManager --no-daemon
crash_function: _g_log_abort
executable:     /usr/sbin/NetworkManager
kernel:         3.11.8-300.fc20.x86_64
package:        NetworkManager-0.9.9.0-18.git20131003.fc20
reason:         Process /usr/sbin/NetworkManager was killed by signal 6 (SIGABRT)
runlevel:       N 5
type:           CCpp
uid:            0
Comment 25 Thomas Haller 2013-11-22 08:08:28 EST
The previous patches from comment #19 and comment #20 do no longer apply on current master. Pushed rebased version to branch th/rh1025371_wifi_ap_rework
Comment 26 Dan Winship 2013-11-22 13:42:45 EST
I don't see anything obviously wrong with the patches, but I'm not very familiar with that code.
Comment 27 Fedora Update System 2013-11-23 22:41:40 EST
Package NetworkManager-0.9.9.0-19.git20131003.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing NetworkManager-0.9.9.0-19.git20131003.fc20'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2013-21973/NetworkManager-0.9.9.0-19.git20131003.fc20
then log in and leave karma (feedback).
Comment 28 Fedora Update System 2013-11-25 22:58:27 EST
NetworkManager-0.9.9.0-19.git20131003.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 29 Dan Williams 2013-11-27 19:53:08 EST
(In reply to Thomas Haller from comment #25)
> The previous patches from comment #19 and comment #20 do no longer apply on
> current master. Pushed rebased version to branch th/rh1025371_wifi_ap_rework

In set_current_ap(), I would completely skip the whole dbus path stuff, that's really old and no longer useful.  Instead, since you added the bits that check whether new == old and return, we don't need anything about the dbus paths.  So we can get rid of 'old_path' and always call g_object_notify() since the property will always have changed at the end of that function.

The remove_access_point() changes don't look quite right.  I think that ties things too closely together, since now remove_access_point() calls set_current_ap(), which calls remove_access_point(), all gated by a variable.  That's pretty error prone.

I think instead, we should just not call set_current_ap() from remove_access_point().  That means we don't need the 'remove_old_fake' parameter to set_current_ap(); a fake AP should simply always be removed if it used to be the current AP but no longer is current.

So, that leaves us with the following callers of remove_access_point() that expect the current AP to be removed too:

remove_all_aps() - called by:
   device_state_changed() & dispose() - bring back set_current_ap(NULL) like before
   act_stage1_prepare() - current AP isn't set yet when remove_all_aps() is called, so nothing to clear

cull_scan_list() - current AP should always be in the scan list, so it should never be removed here

link_timeout_cb() - just call set_current_ap(NULL)?  This will remove the AP from the scan list now

activation_success_handler() - call set_current_ap (tmp_ap)?, which will remove the fake from the list now

Next up, I would say we should have a new constraint; that remove_access_point() should g_warn_if_fail (access point in list);  I can't think of a good reason to call remove_access_point() if that AP isn't in the list.  This means we should call set_current_ap(NULL) *after* removing the current access point where we need to do that, not before like the previous code did.

Didn't have time to more closely review the other changes, but I'll try to do that soon.  Thanks!  This is looking like a great cleanup actually.
Comment 30 Thomas Haller 2013-12-02 13:58:07 EST
(In reply to Dan Williams from comment #29)

done, and pushed 2 !fixup commits. Please re-review.
Comment 31 Dan Williams 2013-12-02 18:54:07 EST
> core: cleanup handling of AP in nm-device-wifi and fix crash

You can use nm_device_get_connection() in supplicant_iface_state_cb() instead of manually getting the ActRequest and then the connection.  Saves a few LoC.


That's all I see from the patch review; I'd like to put the new code through some tests with AdHoc, hidden AP, etc to make sure it does the same things as before.  But the existing branch is already a great cleanup.

Future stuff we can do is to start watching the supplicant's CurrentBSS property and just get rid of all the find_active_ap() type stuff and AP matching from card attributes, since the supplicant knows what we're connected to.
Comment 32 Thomas Haller 2013-12-03 05:26:37 EST
(In reply to Dan Williams from comment #31)
> > core: cleanup handling of AP in nm-device-wifi and fix crash
> 
> You can use nm_device_get_connection() in supplicant_iface_state_cb()
> instead of manually getting the ActRequest and then the connection.  Saves a
> few LoC.

Done, !fixup pushed.


> That's all I see from the patch review; I'd like to put the new code through
> some tests with AdHoc, hidden AP, etc to make sure it does the same things
> as before.  But the existing branch is already a great cleanup.
> 
> Future stuff we can do is to start watching the supplicant's CurrentBSS
> property and just get rid of all the find_active_ap() type stuff and AP
> matching from card attributes, since the supplicant knows what we're
> connected to.

Can you please elaborate on what you mean exactly? This sounds like an independent change, maybe we first get these commits done?
Comment 33 Dan Williams 2013-12-03 11:39:25 EST
(In reply to Thomas Haller from comment #32)
> (In reply to Dan Williams from comment #31)
> > > core: cleanup handling of AP in nm-device-wifi and fix crash
> > 
> > You can use nm_device_get_connection() in supplicant_iface_state_cb()
> > instead of manually getting the ActRequest and then the connection.  Saves a
> > few LoC.
> 
> Done, !fixup pushed.
> 
> 
> > That's all I see from the patch review; I'd like to put the new code through
> > some tests with AdHoc, hidden AP, etc to make sure it does the same things
> > as before.  But the existing branch is already a great cleanup.
> > 
> > Future stuff we can do is to start watching the supplicant's CurrentBSS
> > property and just get rid of all the find_active_ap() type stuff and AP
> > matching from card attributes, since the supplicant knows what we're
> > connected to.
> 
> Can you please elaborate on what you mean exactly? This sounds like an
> independent change, maybe we first get these commits done?

Yes sorry, this would be an independent change for the future and should not block these patches.  I'm just brainstorming here, as it involves a lot of the current code and I was thinking about how to simplify it in the future.
Comment 34 Dan Williams 2013-12-03 17:53:42 EST
One more comment; with impl_device_get_access_points(), I believe any time an AP is added to the scan list, it must have been exported to D-Bus, so perhaps we should g_assert (dbus_path) instead of the if (dbus_path)/g_ptr_array_add() ?
Comment 35 Thomas Haller 2013-12-04 06:04:59 EST
(In reply to Dan Williams from comment #34)
> One more comment; with impl_device_get_access_points(), I believe any time
> an AP is added to the scan list, it must have been exported to D-Bus, so
> perhaps we should g_assert (dbus_path) instead of the if
> (dbus_path)/g_ptr_array_add() ?

You are right, all APs from priv->ap_list are always exported on DBUS. I undid that change completely (without g_assert) and repushed.
Comment 36 Dan Williams 2013-12-05 10:24:17 EST
Just for the record, these are the things I'm trying to test:

∘ start adhoc, make sure fake goes away on disconnect - FAIL (not regression)
∘ connect to hidden, ensure fake goes away on connect - PASS
∘ disabling wifi kills AP list - PASS
∘ for 2-node adhoc network, ensure priv->current_ap gets updated with new BSSID if card's BSSID coalesces
∘ start AP hotspot, ensure it goes away on disconnect
∘ if AP got turned off, ensure it gets removed from the scan list on deactivate
∘ on activation, ensure logged SSID is correct
∘ ensure connecting with wrong WEP key fails with "could not get IP configuration" message and asks for new secrets
Comment 37 Dan Williams 2013-12-05 12:25:41 EST
One other change, to fix the adhoc-goes-away-on-deactivate thing, I think we should do something like this in set_current_ap():


if (old_ap) {
	if (nm_ap_get_fake (old_ap) || adhoc (old_ap) || ap_mode (old_ap)) {
		remove_access_point (self, old_ap);
		nm_device_recheck_available_connections (NM_DEVICE (self));
	}

"fake" covers both IBSS and hidden SSID, but will be cleared if the AP was found in the scan list.  After a successful connection, the AP will be found in the scan list, even if it's the IBSS or Hotspot-AP  NM created, then the fake flag will get cleared, and this check won't hit.
Comment 38 Dan Williams 2013-12-05 12:47:28 EST
The current branch also fails for the "if AP got turned off, ensure it gets removed from the scan list on deactivate" test; admittedly the old code wasn't very clear here.

One way to fix that is to change link_timeout_cb() and set_current_ap() to do something like http://bigw.org/~dan/remove-dead-ap.patch or if you can think of a cleaner way we should do that instead.
Comment 39 Dan Williams 2013-12-05 12:50:49 EST
On second though I actually don't like having set_current_ap() return a value like that (since most of the time the return value is success/fail, but here it's not), so perhaps we relax the check in remove_access_point() that warns if the AP to be removed is not in the list, and then for link_timeout_cb(), do:

remove_access_point (self, gone_ap);
set_current_ap (self, NULL);

(otherwise, if priv->current_ap was fake, IBSS, or AP mode, set_current_ap() would try to remove it, and remove_access_point() would print a warning because it was not in the list)
Comment 40 Thomas Haller 2013-12-06 07:34:10 EST
Apparently, you did not checkout at the newest !fixup from 12/04.
Anyway, I pushed yet two other fixups to the branch.


(In reply to Dan Williams from comment #37)
> One other change, to fix the adhoc-goes-away-on-deactivate thing, I think we
> should do something like this in set_current_ap():

done

(In reply to Dan Williams from comment #39)

yeah, possible too. How do you like, adding a parameter force_remove_old_ap to set_current_ap? Only downside, now we have two boolean parameters, which is slightly confusing (but still acceptable, I'd say)
Comment 41 Dan Williams 2013-12-12 00:07:08 EST
(In reply to Thomas Haller from comment #40)
> Apparently, you did not checkout at the newest !fixup from 12/04.
> Anyway, I pushed yet two other fixups to the branch.
> 
> 
> (In reply to Dan Williams from comment #37)
> > One other change, to fix the adhoc-goes-away-on-deactivate thing, I think we
> > should do something like this in set_current_ap():
> 
> done
> 
> (In reply to Dan Williams from comment #39)
> 
> yeah, possible too. How do you like, adding a parameter force_remove_old_ap
> to set_current_ap? Only downside, now we have two boolean parameters, which
> is slightly confusing (but still acceptable, I'd say)

I think the extra parameter is OK for now; we could later remove the recheck_available_connections parameter from set_current_ap() by doing something clever elsewhere.

I'm fine with the branch as it is now, squash and merge whenever you like.  Thanks!