Bug 1371126 - layer 2-only device is taken down when NetworkManager stops
Summary: layer 2-only device is taken down when NetworkManager stops
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: NetworkManager
Version: 7.4
Hardware: Unspecified
OS: Unspecified
high
medium
Target Milestone: rc
: 7.4
Assignee: Thomas Haller
QA Contact: Desktop QE
URL:
Whiteboard:
: 1455126 (view as bug list)
Depends On: 1452062
Blocks: 1390667
TreeView+ depends on / blocked
 
Reported: 2016-08-29 12:08 UTC by Thomas Haller
Modified: 2017-08-01 09:17 UTC (History)
15 users (show)

Fixed In Version: NetworkManager-1.4.0-13.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1390667 (view as bug list)
Environment:
Last Closed: 2017-08-01 09:17:07 UTC


Attachments (Terms of Use)
test script for creating devices and stopping NetworkManager (2.80 KB, text/plain)
2016-09-22 10:05 UTC, Thomas Haller
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:2299 normal SHIPPED_LIVE Moderate: NetworkManager and libnl3 security, bug fix and enhancement update 2017-08-01 12:40:28 UTC

Description Thomas Haller 2016-08-29 12:08:51 UTC
IF=enp0s25
nmcli con add type ethernet con-name t-$IF ifname $IF autoconnect no connection.master i-bond0 connection.slave-type bond
nmcli con add type bond con-name t-bond0 ifname i-bond0 autoconnect no ipv4.method disabled ipv6.method ignore

nmcli connection up t-$IF


killall NetworkManager



notice: 

manager: (i-bond0): removing device (allow_unmanage 1, managed 1)
device (i-bond0): state change: activated -> deactivating (reason 'unmanaged') [100 110 3]


dues to both ip states being 4 (IP_FAIL)


  if (priv->ip4_state != IP_DONE && priv->ip6_state != IP_DONE)

in https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/devices/nm-device.c?id=1.4.0#n3515


See also 
https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=25aaaab3b7ee677f494e8b3b3c310294137626a1
which already wants to leave layer-2 only up.

Comment 2 Thomas Haller 2016-08-29 16:55:32 UTC
please review: th/device-ip-state-for-disabled-rh1371126

Comment 3 Dan Williams 2016-08-29 21:04:40 UTC
Wouldn't you need to update nm-modem.c too for the same reason in nm_modem_stage3_ip[4|6]_config_start()?  It overrides act_stage3_ip6_config_start() and would need the same fixes.

I think there are two cases here:

1) when an IP method has been requested (with or without may-fail).  Here, we should set IP_FAIL and may-fail will figure out whether the entire activation should fail.  And we could rename RETURN_STOP to RETURN_FAIL to make it clearer that this is a hard failure when success was expected.

2) when the IP method has been explicitly disabled (with disable/ignore).  Here I think your patch is correct and this should not set IP_FAIL.  But we need to update nm-modem.c too.

NMDeviceModem overrides the NMDevice::act_stage3_ip4_config_start() with its own logic.  That gets to nm_modem_stage3_ip[4|6]_config_start() where the same logic needs to be applied for ignore/disabled.

Now, the other use of STOP in this function doesn't indicate failure so we should replace that with SUCCESS too.  What happens here is that NMModem decides what IP methods to ask the modem to use based on modem capabilities, IP method, and may-fail.  If the modem doesn't support IPv6 and ipv6.may-fail=true then NM will request only IPv4 and we'll get priv->ip6_method = NM_MODEM_IP_METHOD_UNKNOWN and should treat that as ipv6.method=ignore and thus return SUCCESS.  nm_device_activate_stage3_ip6_start() still gets called even though the modem doesn't support IPv6, because the modem capabilities don't get pushed up to NMDevice.

So in summary, I think with a little rework we can clarify how IP failure and disable works nicely:

- RETURN_FAIL will set IP_FAIL and will *always* indicate that the IP method was requested and failed, including if the device doesn't support the method.  This would require switching nm-modem.c::nm_modem_stage3_ip[4|6]_config_start() to return RETURN_FAIL in the 'default' switch case, which is supposed to indicate that the modem doesn't support the IP method.

- If the IP method was disabled/ignored that results in RETURN_SUCCESS and IP_DONE because nothing failed, like your patch does for nm-device.c, but we also need to do it in nm-modem.c where the code checks disabled/ignored and currently returns STOP.

Does that all sound reasonable?

Comment 4 Dan Williams 2016-08-29 21:06:43 UTC
(In reply to Dan Williams from comment #3)
> Now, the other use of STOP in this function doesn't indicate failure so we
> should replace that with SUCCESS too.  What happens here is that NMModem
> decides what IP methods to ask the modem to use based on modem capabilities,
> IP method, and may-fail.  If the modem doesn't support IPv6 and
> ipv6.may-fail=true then NM will request only IPv4 and we'll get
> priv->ip6_method = NM_MODEM_IP_METHOD_UNKNOWN and should treat that as
> ipv6.method=ignore and thus return SUCCESS. 
> nm_device_activate_stage3_ip6_start() still gets called even though the
> modem doesn't support IPv6, because the modem capabilities don't get pushed
> up to NMDevice.

Grr, I changed my mind after writing this so please ignore this section.  The "In summary" and everything after reflects my opinion.

Comment 5 Thomas Haller 2016-08-31 18:13:06 UTC
(In reply to Dan Williams from comment #4)
> Grr, I changed my mind after writing this so please ignore this section. 
> The "In summary" and everything after reflects my opinion.

I think I addressed you comment. Reworked and repushed to th/device-ip-state-for-disabled-rh1371126

Comment 6 Beniamino Galvani 2016-09-02 06:18:27 UTC
th/device-ip-state-for-disabled-rh1371126 LGTM

Comment 8 Thomas Haller 2016-09-22 10:05:45 UTC
Created attachment 1203646 [details]
test script for creating devices and stopping NetworkManager

strange... testing on NetworkManager 1.4.0-9.el7, I now cannot reproduce the original issue -- although the fix from comment 7 is not in rhel-7.3

Note also, the patch from comment 7 introduced a regression, see https://bugzilla.gnome.org/show_bug.cgi?id=771579

Comment 9 Thomas Haller 2016-09-27 16:41:52 UTC
(In reply to Thomas Haller from comment #8)
> Created attachment 1203646 [details]
> test script for creating devices and stopping NetworkManager
> 
> strange... testing on NetworkManager 1.4.0-9.el7, I now cannot reproduce the
> original issue -- although the fix from comment 7 is not in rhel-7.3
> 
> Note also, the patch from comment 7 introduced a regression, see
> https://bugzilla.gnome.org/show_bug.cgi?id=771579

I was unable to reproduce it, because 
 - unmanaged_on_quit() returned FALSE,
   - because nm_platform_link_can_assume() returned FALSE,
     - because it didn't find an IPv6 address in my test. If you set 
       ipv6.method=ignore, it is quite possible that you get an IPv6 address 
       assigned. In which case, the original issue is reproducible.

See also bug 1378418

Comment 13 Thomas Haller 2016-10-13 13:57:24 UTC
@NetworkManager-team: please review https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=th%2Fno-unmanage-on-quit-rh1371126

Comment 14 Dan Williams 2016-10-13 20:56:38 UTC
+	if (!NM_IN_STRSET (nm_utils_get_ip_config_method (connection, NM_TYPE_SETTING_IP4_CONFIG),
+	                   NM_SETTING_IP6_CONFIG_METHOD_SHARED))

Shouldn't that be IP4_CONFIG_METHOD_SHARED?  Not that it makes a difference operationally, but should probably be changed here.

I'm a bit worried about other types of devices that aren't wifi and require external daemons though, like ADSL, Bluetooth, WWAN, etc.  We can't pick up and assume that configuration on restart.  How are those now handled?

Comment 15 Thomas Haller 2016-10-14 04:28:35 UTC
(In reply to Dan Williams from comment #14)
> +	if (!NM_IN_STRSET (nm_utils_get_ip_config_method (connection,
> NM_TYPE_SETTING_IP4_CONFIG),
> +	                   NM_SETTING_IP6_CONFIG_METHOD_SHARED))
> 
> Shouldn't that be IP4_CONFIG_METHOD_SHARED?  Not that it makes a difference
> operationally, but should probably be changed here.
> 
> I'm a bit worried about other types of devices that aren't wifi and require
> external daemons though, like ADSL, Bluetooth, WWAN, etc.  We can't pick up
> and assume that configuration on restart.  How are those now handled?

you are right on both accounts. How about v2? (repushed).

Comment 16 Thomas Haller 2016-10-14 10:02:14 UTC
new scratch build with version 2: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11908088

Comment 17 Beniamino Galvani 2016-10-26 12:03:34 UTC
v2 LGTM

Comment 20 Thomas Haller 2016-10-31 13:20:42 UTC
scratch-build v3 https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11985134

this is the patch that was eventually merged upstream.

Comment 23 Vladimir Benes 2017-06-02 08:12:33 UTC
This is still not working well for bond. reverting to assigned

Comment 24 Vladimir Benes 2017-06-02 08:12:57 UTC
*** Bug 1455126 has been marked as a duplicate of this bug. ***

Comment 25 Thomas Haller 2017-06-06 14:24:18 UTC
This bug failed QA due to CI failure from bug 1455126.

However, the test bond_leave_L2_only_up_when_going_down is no longer failing.

Instead, failing is the test before it (vlan_over_no_L3_bond_restart_persistence, bug 1452062.

Setting Depends on rh 1452062.

Comment 26 Vladimir Benes 2017-06-06 14:33:17 UTC
Bringing this back to ON_QA as the test is not failing anymore. (Side effect of some other fix?)

Comment 27 Thomas Haller 2017-06-11 21:05:43 UTC
(In reply to Thomas Haller from comment #25)
> This bug failed QA due to CI failure from bug 1455126.
> 
> However, the test bond_leave_L2_only_up_when_going_down is no longer failing.
> 
> Instead, failing is the test before it
> (vlan_over_no_L3_bond_restart_persistence, bug 1452062.
> 
> Setting Depends on rh 1452062.

bug 1452062 seems to be fixed now too.

This issue is likely fixed for good. If not, please open a new bug instead of reopening, because this issue is rather unspecific.

Comment 28 errata-xmlrpc 2017-08-01 09:17:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2299


Note You need to log in before you can comment on or make changes to this bug.