Bug 1548265 - Bond and slaves status are wrong when one of the slaves has the host connection
Summary: Bond and slaves status are wrong when one of the slaves has the host connection
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: NetworkManager
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
: ---
Assignee: Beniamino Galvani
QA Contact: Desktop QE
URL:
Whiteboard:
: 1473952 1547372 1554628 (view as bug list)
Depends On:
Blocks: ovirt-node-ng-43-el76-platform 1556666 1570522
TreeView+ depends on / blocked
 
Reported: 2018-02-23 03:26 UTC by Huijuan Zhao
Modified: 2018-10-30 11:13 UTC (History)
26 users (show)

Fixed In Version: NetworkManager-1.10.2-14.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1570522 (view as bug list)
Environment:
Last Closed: 2018-10-30 11:11:28 UTC
Target Upstream Version:


Attachments (Terms of Use)
Screenshot of bond configuration (150.26 KB, image/png)
2018-02-23 03:26 UTC, Huijuan Zhao
no flags Details
All logs when both slaves have dhcp IP (11.38 MB, application/x-gzip)
2018-02-23 03:29 UTC, Huijuan Zhao
no flags Details
All logs when only one slave NIC has dhcp IP (11.42 MB, application/x-gzip)
2018-02-23 03:30 UTC, Huijuan Zhao
no flags Details
[PATCH] manager: allow internal reactivation if the master changed (3.93 KB, patch)
2018-03-06 18:01 UTC, Beniamino Galvani
no flags Details | Diff


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3207 None None None 2018-10-30 11:13:25 UTC

Description Huijuan Zhao 2018-02-23 03:26:25 UTC
Created attachment 1399675 [details]
Screenshot of bond configuration

Description of problem:
Create a bond over two slaves in Cockpit; if one of the slaves has the host connection, the status of the bond and the slaves is wrong.
1. If both slaves are "ON" and have DHCP IPs, the bond over the two slaves cannot get a DHCP IP, and the two slaves stay "ON" with their DHCP IPs.
2. If only the primary slave is "ON" and has a DHCP IP, the bond can get a DHCP IP, but the primary slave stays "ON" with its DHCP IP.

Expected results: Only the bond should get a DHCP IP; the slaves should automatically have NO DHCP IP.


Version-Release number of selected component (if applicable):
redhat-virtualization-host-4.1-20180218.0
imgbased-0.9.54-0.1.el7ev.noarch
cockpit-ws-160-2.el7.x86_64
cockpit-ovirt-dashboard-0.10.10-0.el7ev.noarch
glib-networking-2.50.0-1.el7.x86_64
libvirt-daemon-driver-network-3.9.0-12.el7.x86_64


How reproducible:
100%
Regression bug
Keywords: cockpit -> networking


Steps to Reproduce:

Scenario 1:
1. Install redhat-virtualization-host-4.1-20180218.0
2. Enable("ON") NIC1(eno1, can get dhcp ip: 10.73.130.225), connect host via cockpit:
   https://10.73.130.225:9090/network
3. Enter the networking page, enable ("ON") NIC2 (eno2, can get dhcp ip: 10.73.128.246)
4. Create bond0(Active Backup) over the above two slave NICs(eno1 is primary, eno2)

Actual results:
1. After step 4, bond0 can NOT get a DHCP IP, but the two slaves are still connected with their previous DHCP IPs

Expected results:
1. After step 4, bond0 should get a DHCP IP, and the two slaves should automatically have NO DHCP IP.

Scenario 2:
1. Install redhat-virtualization-host-4.1-20180218.0
2. Enable("ON") NIC1(eno1, can get dhcp ip: 10.73.130.225), connect host via cockpit:
   https://10.73.130.225:9090/network
3. Enter the networking page, check that NIC2 (eno2) is "OFF" with no DHCP IP
4. Create bond0(Active Backup) over the above two slave NICs(eno1 is primary, eno2)

Actual results:
1. After step 4, bond0 can get a DHCP IP, but the primary slave (eno1) is still connected with its previous DHCP IP

Expected results:
1. After step 4, the primary slave (eno1) should have NO DHCP IP; instead, bond0 should become the connected interface.


Additional info:
No such issue in RHEL 7.4; there, the test results match the expected results above.

Comment 2 Huijuan Zhao 2018-02-23 03:29:00 UTC
Created attachment 1399678 [details]
All logs when both slaves have dhcp IP

Comment 3 Huijuan Zhao 2018-02-23 03:30:18 UTC
Created attachment 1399679 [details]
All logs when only one slave NIC has dhcp IP

Comment 4 dguo 2018-02-23 09:05:13 UTC
This issue blocks adding the RHVH host to the engine:
"""
Failed to configure management network on host dguo-intel. Host dguo-intel has an invalid bond interface (bond0 contains less than 2 active slaves) for the management network configuration.
"""
Test version:
rhvm-4.1.10.1-0.1.el7
rhvh-4.1-0.20180218.0(el7.5)

Comment 5 Marius Vollmer 2018-02-28 10:09:14 UTC
Using redhat-virtualization-host-4.1-20180218.0 is difficult for me.  Do you have a machine that I can log into?  (I guess I would also need a way to interact with the console, so that I can fix the network config after I break it.)

However, I think I can reproduce this on Fedora 27 and RHEL 7.5.  My current hunch is that a change in NetworkManager is responsible for the new behavior.

What is your version of NetworkManager?

What happens when you reboot?

Comment 6 Marius Vollmer 2018-02-28 14:20:42 UTC
Here is a reproducer with nmcli:

# nmcli dev con eth1
# nmcli con show "System eth1"
connection.id:                          System eth1
connection.uuid:                        9c92fad9-6ecb-3e6c-eb4d-8a47c6f50c04
connection.stable-id:                   --
connection.type:                        802-3-ethernet
connection.interface-name:              eth1
connection.autoconnect:                 no
connection.autoconnect-priority:        0
connection.autoconnect-retries:         -1 (default)
connection.auth-retries:                -1
connection.timestamp:                   1519826154
connection.read-only:                   no
connection.permissions:                 --
connection.zone:                        --
connection.master:                      --
connection.slave-type:                  --
connection.autoconnect-slaves:          -1 (default)
connection.secondaries:                 --
connection.gateway-ping-timeout:        0
connection.metered:                     unknown
connection.lldp:                        default
802-3-ethernet.port:                    --
802-3-ethernet.speed:                   0
802-3-ethernet.duplex:                  --
802-3-ethernet.auto-negotiate:          no
802-3-ethernet.mac-address:             --
802-3-ethernet.cloned-mac-address:      --
802-3-ethernet.generate-mac-address-mask:--
802-3-ethernet.mac-address-blacklist:   --
802-3-ethernet.mtu:                     auto
802-3-ethernet.s390-subchannels:        --
802-3-ethernet.s390-nettype:            --
802-3-ethernet.s390-options:            --
802-3-ethernet.wake-on-lan:             default
802-3-ethernet.wake-on-lan-password:    --
ipv4.method:                            disabled
ipv4.dns:                               --
ipv4.dns-search:                        --
ipv4.dns-options:                       ""
ipv4.dns-priority:                      0
ipv4.addresses:                         --
ipv4.gateway:                           --
ipv4.routes:                            --
ipv4.route-metric:                      -1
ipv4.route-table:                       0 (unspec)
ipv4.ignore-auto-routes:                no
ipv4.ignore-auto-dns:                   no
ipv4.dhcp-client-id:                    --
ipv4.dhcp-timeout:                      0 (default)
ipv4.dhcp-send-hostname:                yes
ipv4.dhcp-hostname:                     --
ipv4.dhcp-fqdn:                         --
ipv4.never-default:                     no
ipv4.may-fail:                          yes
ipv4.dad-timeout:                       -1 (default)
ipv6.method:                            ignore
ipv6.dns:                               --
ipv6.dns-search:                        --
ipv6.dns-options:                       ""
ipv6.dns-priority:                      0
ipv6.addresses:                         --
ipv6.gateway:                           --
ipv6.routes:                            --
ipv6.route-metric:                      -1
ipv6.route-table:                       0 (unspec)
ipv6.ignore-auto-routes:                no
ipv6.ignore-auto-dns:                   no
ipv6.never-default:                     no
ipv6.may-fail:                          yes
ipv6.ip6-privacy:                       -1 (unknown)
ipv6.addr-gen-mode:                     stable-privacy
ipv6.dhcp-send-hostname:                yes
ipv6.dhcp-hostname:                     --
ipv6.token:                             --
proxy.method:                           none
proxy.browser-only:                     no
proxy.pac-url:                          --
proxy.pac-script:                       --
GENERAL.NAME:                           System eth1
GENERAL.UUID:                           9c92fad9-6ecb-3e6c-eb4d-8a47c6f50c04
GENERAL.DEVICES:                        eth1
GENERAL.STATE:                          activated
GENERAL.DEFAULT:                        no
GENERAL.DEFAULT6:                       no
GENERAL.SPEC-OBJECT:                    --
GENERAL.VPN:                            no
GENERAL.DBUS-PATH:                      /org/freedesktop/NetworkManager/ActiveCo
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/Settings
GENERAL.ZONE:                           --
GENERAL.MASTER-PATH:                    --
IP4.GATEWAY:                            --
IP6.ADDRESS[1]:                         fd00:111:112:0:5054:1ff:fe00:1/64
IP6.ADDRESS[2]:                         fe80::5054:1ff:fe00:1/64
IP6.GATEWAY:                            fe80::5054:ff:fee3:6eab
IP6.ROUTE[1]:                           dst = ff00::/8, nh = ::, mt = 256, table
IP6.ROUTE[2]:                           dst = fe80::/64, nh = ::, mt = 256
IP6.ROUTE[3]:                           dst = ::/0, nh = fe80::5054:ff:fee3:6eab
IP6.ROUTE[4]:                           dst = fd00:111:112::/64, nh = ::, mt = 2

# nmcli con mod "System eth1" slave-type bond master bond0 autoconnect yes
# nmcli con add type bond ifname bond0 con-name bond0 mode active-backup autoconnect yes connection.autoconnect-slaves 1

On RHEL 7.5 with NetworkManager-1.10.2-12.el7.x86_64, bond0 activates, but never gets an IP address.  eth1 is still active with the same IP4.* and IP6.* values as shown above.
Bringing bond0 up again explicitly will then enslave eth1.

# nmcli dev
DEVICE      TYPE      STATE                                  CONNECTION  
eth0        ethernet  connected                              System eth0 
eth1        ethernet  connected                              System eth1 
bond0       bond      connecting (getting IP configuration)  bond0       

# nmcli dev con bond0
# nmcli dev
DEVICE      TYPE      STATE      CONNECTION  
eth0        ethernet  connected  System eth0 
bond0       bond      connected  bond0       
eth1        ethernet  connected  System eth1 

On RHEL 7.4 with NetworkManager-1.8.0-11.el7_4.x86_64, bond0 activates fully and gets an IP address.  eth1 is immediately enslaved.

# nmcli dev
DEVICE      TYPE      STATE      CONNECTION  
bond0       bond      connected  bond0       
eth0        ethernet  connected  System eth0 
eth1        ethernet  connected  System eth1

Comment 7 Huijuan Zhao 2018-03-01 02:28:59 UTC
(In reply to Marius Vollmer from comment #5)
> Using redhat-virtualization-host-4.1-20180218.0 is difficult for me.  Do you
> have a machine that I can log into?  (I guess I would also need a way to
> interact with the console, so that I can fix the network config after I
> break it.)
> 
> However, I think I can reproduce this on Fedora 27 and RHEL 7.5.  My current
> hunch is that a change in NetworkManager is responsible for the new behavior.
> 
> What is your version of NetworkManager?
> 
> What happens when you reboot?

1. This issue also exists in redhat-virtualization-host-4.2-20180218.0. I have sent the machine info to you via email; please review.

2. # rpm -qa | grep NetworkManager
NetworkManager-libnm-1.10.2-11.el7.x86_64
NetworkManager-team-1.10.2-11.el7.x86_64
NetworkManager-tui-1.10.2-11.el7.x86_64
NetworkManager-config-server-1.10.2-11.el7.noarch
NetworkManager-1.10.2-11.el7.x86_64

3. 
3.1 After creating bond0 over eno1 and eno2, bond0 cannot get a dhcp IP.
3.2 Then reboot the machine; after the reboot, bond0 gets a dhcp IP successfully, and eno1 and eno2 release their dhcp IPs automatically. (This is the expected result.)
3.3 Then delete bond0; eno1 and eno2 get dhcp IPs successfully.
3.4 Create bond0 again over eno1 and eno2, and the issue is back: bond0 cannot get a dhcp IP.

Comment 8 Marius Vollmer 2018-03-01 07:23:55 UTC
Just to keep a record: We have a test in Cockpit that is supposed to catch this regression.  It unfortunately didn't, since it no longer uses DHCP for the interface that is enslaved.

When making a bond, Cockpit copies the ipv4 and ipv6 settings from the primary slave to the bond.  The slave had a "manual" ipv4.method, and that was enough to give the bond the expected IP address, although it wasn't fully working, just as in the reproducer in comment 6.

I will fix the test so that it fails as expected.

Comment 9 Marius Vollmer 2018-03-01 07:24:55 UTC
In my testing, it didn't matter if the slave had the host connection or not.

Comment 10 Marius Vollmer 2018-03-05 07:52:49 UTC
*** Bug 1547372 has been marked as a duplicate of this bug. ***

Comment 11 Marius Vollmer 2018-03-05 09:11:18 UTC
Here is the pull request that adds a new test for this scenario to Cockpit:

    https://github.com/cockpit-project/cockpit/pull/8736

The new test fails on rhel-7-5 and passes on rhel-7-4, as expected.

Here is a pull request with a pull request with a potential workaround:

    https://github.com/cockpit-project/cockpit/pull/8769

This makes rhel-7-5 pass as well.  (But it might break other things, so this is not a slam dunk yet.)

Comment 12 Marius Vollmer 2018-03-05 09:14:00 UTC
> Here is a pull request with a pull request 

Upps. :)

Comment 13 Beniamino Galvani 2018-03-05 20:48:01 UTC
(In reply to Marius Vollmer from comment #6)
> Here is a reproducer with nmcli:
>
> # nmcli dev con eth1
> # nmcli con show "System eth1"
...
> # nmcli con mod "System eth1" slave-type bond master bond0 autoconnect yes
> # nmcli con add type bond ifname bond0 con-name bond0 mode active-backup
> autoconnect yes connection.autoconnect-slaves 1
>
> On RHEL 7.5 with NetworkManager-1.10.2-12.el7.x86_64, bond0 activates, but
> never gets an IP address.  eth1 is still active with the same IP4.* and
> IP6.* values as shown above.
> Bringin bond0 up again explicitly will then enslave eth1.

I think this is the expected behavior: any change to an active connection
doesn't have any effect until the connection is reactivated... this is
true for all properties and in my opinion there is no reason why
'master' should behave differently (actually a couple of properties are
applied automatically - zone and metered - but they don't tear the
interface down). Reactivating the connection automatically when
something changes leads to race conditions; for example if you do
this while the connection is active:

 nmcli con mod "System eth1" slave-type bond master bond0
 nmcli con up "System eth1"

the second command would fail in some cases because the
auto-reconnection triggered by the first command would interrupt the
user activation.

I consider the behavior that was in NM 1.8 buggy, and that bug was
caught by CI and fixed as bz1450219.

What do you think? Probably we can try to revert the change to keep
backwards compatibility, but this would introduce again the race
condition.

Comment 14 Marius Vollmer 2018-03-06 08:43:51 UTC
(In reply to Beniamino Galvani from comment #13)

> Reactivating the connection automatically when
> something changes leads to race conditions; for example if you do
> this while the connection is active:
> 
>  nmcli con mod "System eth1" slave-type bond master bond0
>  nmcli con up "System eth1"
> 
> the second command would fail in some cases because the
> auto-reconnection triggered by the first command would interrupt the
> user activation.

I think this is not the issue here.  Cockpit does not assume that simply changing connection settings will re-activate a connection.  Cockpit does this explicitly (and it also knows that renaming an interface needs an explicit disconnection as well, for example).


However, for creating bonds Cockpit relies on these assumptions:

 - Creating a new bond connection with "autoconnect: yes" will immediately activate the new connection.

 - Activating a bond connection with "autoconnect-slaves: yes" will immediately (re-) activate all the slave connections, in an order determined by NM that ensures that the bond will receive the same IP address consistently, now and on every future activation.

Are these valid assumptions?  I am not sure if this is documented anywhere.  Is it?

Cockpit especially and explicitly does not want to have anything to do with the order in which the slaves are activated for a new bond.


The change in NM behavior that we observe now is this:

 - Activating a bond connection with "autoconnect-slaves: yes" will immediately activate slave connections that were inactive initially.

Is this intended behavior?  Is this actually what is happening?  It sure looks like it to me, but I have been dramatically confused before. :-)


Activating only a subset of slaves leads to obvious problems.  For example, the bond and one of the not-reactivated slaves might have the same MAC and might thus get the same IP.  See bug 1547372.

Comment 15 Marius Vollmer 2018-03-06 08:45:58 UTC
(In reply to Marius Vollmer from comment #14)

> The change in NM behavior that we observe now is this:
> 
>  - Activating a bond connection with "autoconnect-slaves: yes" will
> immediately activate slave connections that were inactive initially.

I should have added: "but it does not reactivate slave connections that were initially active".

Comment 16 Marius Vollmer 2018-03-06 08:50:33 UTC
(In reply to Marius Vollmer from comment #15)
> (In reply to Marius Vollmer from comment #14)
> 
> > The change in NM behavior that we observe now is this:
> > 
> >  - Activating a bond connection with "autoconnect-slaves: yes" will
> > immediately activate slave connections that were inactive initially.
> 
> I should have added: "but it does not reactivate slave connections that were
> initially active".

Actually, it is:

  - Creating a new bond connection with "autoconnect: yes" and
    "autoconnect-slaves: yes" will immediately activate the new bond connection
    and slave connections that were initially inactive, but it does not
    reactivate slave connections that were initially active.

Explicitly activating a bond works as before; it is just the behavior at creation time that has changed.  Which makes it even more inconsistent, no?

(Sorry for taking so many tries to get the details right.)

Comment 17 Thomas Haller 2018-03-06 09:14:12 UTC
I think it's like this:


if you activate a master (bond) connection with connection.autoconnect-slaves=yes, it will activate all matching slave profiles that:
  (1) are not currently active otherwise, and
  (2) find a device that is not currently activated with another connection.

Note that issuing `nmcli connection up "$BOND_MASTER"` for a master which is currently active means first bringing the master down, which also brings all its slaves down. (3) A slave connection cannot be active without a master.


In the example from comment 6, "System eth1" is already active, but not in the function of being a slave connection, because at the point when "System eth1" was activated, it was not yet a slave. Because "System eth1" is active, it cannot get autoactivated by the master (1).
This seemingly contradicts (3), because "System eth1" is active without a master. Indeed, the profile itself has in the meantime become a slave connection, but the configuration that is activated on the device is still not a slave configuration as long as it stays up.
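As a minimal sketch, the two autoconnect-slaves rules above can be modeled in a few lines of Python. The names and data shapes here are my own for illustration; this is not NetworkManager code:

```python
# Hypothetical model of the autoconnect-slaves rules described above;
# illustrative only, not actual NetworkManager internals.

def slaves_to_autoactivate(slave_profiles, active_profiles, busy_devices):
    """Return the slave profiles that would be autoactivated with the master.

    slave_profiles:  maps a slave profile name to the device it binds to
    active_profiles: names of profiles that are currently active otherwise
    busy_devices:    devices currently activated with another connection
    """
    chosen = []
    for name, dev in slave_profiles.items():
        if name in active_profiles:   # rule (1): already active otherwise
            continue
        if dev in busy_devices:       # rule (2): device is occupied
            continue
        chosen.append(name)
    return chosen
```

In the comment 6 scenario, "System eth1" is already active on eth1, so it fails rule (1) and is skipped, which is exactly why the bond comes up with only part of its slaves.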



>  - Activating a bond connection with "autoconnect-slaves: yes" will 
> immediately (re-) activate all the slave connections, in an order determined
> by NM that ensures that the bond will receive the same IP address 
> consistently, now and on every future activation.

Yes, NetworkManager tries to autoactivate slaves in an order that leads to a predictable MAC address. But that only applies to profiles that can currently be activated (that is, satisfy (1) and (2)).


> Are these valid assumptions?  I am not sure if this is documented anywhere.  
> Is it?

I don't think it's documented explicitly. The behavior ultimately should make sense, and what I said above makes sense to me, doesn't it? At least, if you consider that changing a profile does not take effect immediately. So making a non-slave profile a slave doesn't have immediate effects on the activated device. What we do have are tests for such behavior (or at least, we have many tests, and if this particular scenario is not covered, we should add a test).

Comment 18 Marius Vollmer 2018-03-06 10:49:55 UTC
(In reply to Thomas Haller from comment #17)
> I think it's like this:
> 
> if you activate a master (bond) connection with
> connection.autoconnect-slaves=yes, it will activate all matching slaves
> profiles that:
>   (1) are not currently active otherwise
>   (2) that find a device that is not currently activated with another
> connection.

The NM versions in RHEL 7.4 and 7.5 behave differently.  The one in 7.4 seems to activate all slaves unconditionally, while the one in 7.5 only activates the ones that satisfy (1) and (2).

This is the "regression" we are looking at.  Can you confirm that there was a change in behavior?

> Yes, NetworkManager tries to autoactivate slaves in an order that leads to a
> predictable MAC address. But that only applies to profiles that can
> currently be activated (that is, satisfy (1) and (2)).

If you take only a subset of the slaves, you don't get a predictable MAC address.

Comment 19 Beniamino Galvani 2018-03-06 17:57:28 UTC
(In reply to Marius Vollmer from comment #18)
> (In reply to Thomas Haller from comment #17)
> > I think it's like this:
> > 
> > if you activate a master (bond) connection with
> > connection.autoconnect-slaves=yes, it will activate all matching slaves
> > profiles that:
> >   (1) are not currently active otherwise
> >   (2) that find a device that is not currently activated with another
> > connection.
> 
> The NM versions in RHEL 7.4 and 7.5 behave differently.  The one in 7.4
> seems to activate all slaves unconditionally, while the one in 7.5 only
> activates the ones that satisfy (1) and (2).
> 
> This is the "regression" we are looking at.  Can you confirm that there was
> a change in behavior?

Yes, there was a change in behavior. We ensured that a connection started by the user is never preempted by an internal activation. This is necessary to avoid race conditions, as explained in bug 1450219. Probably we can restore the old behavior of always reconnecting slaves even if they are already active with a different master; I'll attach the patch.

Comment 20 Beniamino Galvani 2018-03-06 18:01:14 UTC
Created attachment 1404969 [details]
[PATCH] manager: allow internal reactivation if the master changed

Comment 21 Marius Vollmer 2018-03-07 07:04:05 UTC
(In reply to Beniamino Galvani from comment #19)
> (In reply to Marius Vollmer from comment #18)
>
> > This is the "regression" we are looking at.  Can you confirm that there was
> > a change in behavior?
> 
> Yes, there was a change in behavior.

Right, thanks for the confirmation!  Can you say which version of NM first had the new behavior?

From looking at our tests, it seems to have changed somewhere between 1.8.0 and 1.8.2.  Does that make sense?

Comment 22 Beniamino Galvani 2018-03-07 07:54:19 UTC
See: 

https://bugzilla.redhat.com/show_bug.cgi?id=1450219#c7

 $ git tag --contains 2236c3c728c49d2ebd68e83f1096b5180b2f41dd
 1.8.2
 1.8.3-dev
 1.8.4
 1.8.5-dev
 1.8.6
 1.8.7-dev

Yes, 1.8.2 is the first affected version.

Comment 23 Thomas Haller 2018-03-07 08:20:25 UTC
(In reply to Beniamino Galvani from comment #20)
> Created attachment 1404969 [details]
> [PATCH] manager: allow internal reactivation if the master changed

I am not convinced about this. What I described in comment 17 is how it should be and how it makes sense (to me) in the larger scheme. It is at least consistent, and I didn't need to explain parts like "except, if the profile of an activated connection changed in the meantime to become a slave, then instead...".
This patch is a hack to behave differently in a specific situation.

Here it is called a "regression". But I'd call the old behavior a bug that was fixed.

Yes, there is something like bug compatibility, and NetworkManager should try hard not to break existing applications and setups. But is it really so important to stick to the broken behavior? If at all, I'd rather apply this patch only downstream for RHEL-7.5 (because 1.10+ and 1.8.2+ were already released with, IMHO, the better behavior; reverting it back to fix a "regression" compared to 1.8.0 only changes behavior again).


Why is this any different from other cases where you modify a connection? If you modify a connection's interface-name, the changes do not take effect right away (as Marius said in comment 14). Making a non-slave profile a slave is also a large and fundamental change to the profile. Why should that be treated specially?

Comment 24 Marius Vollmer 2018-03-07 08:33:02 UTC
Here is a test script to reproduce the symptoms of bug 1547372.  You need the whole Cockpit CI infrastructure to actually run it, or you can look here:

    https://github.com/cockpit-project/cockpit/pull/8776

The script:

    def testReproducer(self):
        m = self.machine

        print(m.execute("rpmquery NetworkManager"))

        iface1 = self.add_iface()
        con1 = self.iface_con_id(iface1)
        mac1 = m.execute("cat /sys/class/net/%s/address" % iface1).strip()

        iface2 = self.add_iface()
        con2 = self.iface_con_id(iface2)

        m.execute("nmcli con up %s" % con1)
        m.execute("nmcli con down %s" % con2)

        m.execute("nmcli con mod '%s' slave-type bond master bond0 autoconnect yes" % con1)
        m.execute("nmcli con mod '%s' slave-type bond master bond0 autoconnect yes" % con2)
        m.execute("nmcli con add type bond ifname bond0 con-name bond0 mode active-backup autoconnect yes connection.autoconnect-slaves 1 ethernet.cloned-mac-address %s" % mac1)

        print(m.execute("sleep 3; ip addr"))

By looking at the output of "ip addr", you can see that with recent versions of NM, iface1 does not become a slave of bond0, and both iface1 and bond0 have the same IP address.

Summary of the test results:

centos-7: works, NetworkManager-1.8.0-11.el7_4.x86_64
rhel-7-4: works, NetworkManager-1.8.0-11.el7_4.x86_64
rhel-7: works, NetworkManager-1.8.0-11.el7_4.x86_64
rhel-atomic: works, NetworkManager-1.8.0-11.el7_4.x86_64

fedora-26: broken, NetworkManager-1.8.2-4.fc26.x86_64
fedora-atomic: broken, NetworkManager-1.8.6-1.fc27.x86_64
rhel-7-5: broken, NetworkManager-1.10.2-13.el7.x86_64
fedora-27: broken, NetworkManager-1.8.6-1.fc27.x86_64
fedora-i386: broken, NetworkManager-1.8.6-1.fc27.i686

Comment 25 Marius Vollmer 2018-03-07 08:36:27 UTC
(In reply to Beniamino Galvani from comment #22)

> Yes, 1.8.2 is the first affected version.

Nice, I like it when all the facts match up. :-)

Comment 26 Beniamino Galvani 2018-03-07 08:52:09 UTC
(In reply to Thomas Haller from comment #23)
> (In reply to Beniamino Galvani from comment #20)
> > Created attachment 1404969 [details]
> > [PATCH] manager: allow internal reactivation if the master changed
> 
> I am not convinced about this. What I described in comment 17 is how it
> should be and how it makes sense (to me) in the larger scheme. It is at
> least consistent, and I didn't need to explain parts like "except, if the
> profile of an activated connection changed in the meantime to become a
> slave, then instead...".
> This patch is a hack to behave differently in a specific situation.

The patch just achieves the goal of guaranteeing that when you activate a master connection with autoconnect-slaves=1, all slaves get activated. This is simple enough to explain (in my opinion) and makes sense too. Moreover, it is what NM did until 1.8, when we inadvertently changed it. The implementation perhaps looks like a hack, but the behavior certainly seems reasonable (and in fact users were relying on it).

> Here it is called a "regression". But I'd call the old behavior a bug that
> was fixed.

We have two requirements:

 (1) don't autoactivate connections that are already activated by user
 (2) activate all slaves when the master connects

Depending on whether we consider both real requirements and depending on their priorities, the old behavior can be seen either as a bug or as a feature.

> Yes, there is something like bug compatiblity, and NetworkManager should try
> hard not to break existing applications and setups. But is this really so
> important to stick to the broken behavior?

If Cockpit is fine in implementing a workaround for 1.10, I am fine with leaving things as they currently are, to avoid changing behavior again.

> If at all, I'd rather only apply
> this patch downstream for RHEL-7.5 (because 1.10+ and 1.8.2+ was already
> released with IMHO the better behavior. So reverting it back to fix a
> "regression" compared to 1.8.0 only changes behavior again.
> 
> 
> Why is this any different from other cases where you modify a connection? If
> you modify a connection's interface-name, the changes do not take effect
> right away (as Marius said in comment 14). Making a non-slave profile a
> slave, is also a large and fundamental change to the profile. Why should
> that be treated special?

As explained above, this is needed to honor the autoconnect-slaves property. When the master gets activated there can be two cases:

 (1) the master property of the slave connection wasn't changed: then, since the master is down, the slave is too and we can just activate it
 (2) the master property was changed: then, the slave can be active with the (new) master down. In this case the patch is needed
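Sketched in Python, the decision in these two cases looks roughly like the predicate below. The function and argument names are hypothetical; the real patch is C code inside NetworkManager's manager:

```python
# Rough sketch of the two-case decision described above; illustrative only,
# not the actual NetworkManager patch.

def may_internally_reactivate(profile_active, active_master, configured_master):
    """Decide whether NM may (re)activate a slave profile when its master
    connection activates with autoconnect-slaves=1.

    Case (1): the profile is not active -> plain internal activation.
    Case (2): the profile is active, but its configured master differs from
    the master it was activated with -> the patch allows reactivation.
    """
    if not profile_active:
        return True                            # case (1)
    return configured_master != active_master  # case (2)
```

With this exception, a profile that was turned into a slave while it was up (the comment 6 reproducer) is no longer skipped when its new master activates.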

Comment 27 Marius Vollmer 2018-03-07 09:08:10 UTC
(In reply to Thomas Haller from comment #23)

> What I described in comment 17 is how it
> should be and how it makes sense (to me) in the larger scheme.

Activating only a subset of the slaves when activating a bond does not make sense to me.  "nmcli con up bond0" doesn't do it.  It jeopardizes the predictable MAC guarantee, and causes the symptoms that people report here as blocker bugs.

NM spontaneously brings the system into a nonsensical run-time state and leaves it there forever, although the persistent configuration is sane.

> Why is this any different from other cases where you modify a connection?

This is about what happens when a newly created bond connection is activated automatically by NM.  (You call that "internal activation", right?)

It would be fine if NM wouldn't automatically activate new bond connections, but it does. And since it does, it should do it fully, as if the user had explicitly activated it.

Comment 28 Martin Pitt 2018-03-07 09:27:27 UTC
(In reply to Thomas Haller from comment #23)

> I am not convinced about this. What I described in comment 17 is how it
> should be and how it makes sense (to me) in the larger scheme. It is at
> least consistent,

IMHO it's not. It leads to bringing up a bond with only part of its slaves, which can (and does) easily cause IP conflicts and such. IMHO, consistent behaviours for the case that a newly defined bond has currently active slaves are to either (1) not bring up the new bond at all, or (2) forcefully tear down the current interfaces and enslave them. In either case the system is in a consistent state afterwards.

However, (2) bears the possibility of severing the ssh (or cockpit or whatever) link that you are currently using to administer your machine, so I understand it makes sense to be cautious and never automatically tear down active devices/connections. I suppose something like that was the reason for the behaviour change?

Thus it seems to me that (1) might be the solution which is both safe and consistent - i. e. if a newly created bond/bridge/etc. defines slaves that are active, don't bring it up at all automatically. An explicit request to bring them up should then tear down existing active slaves, of course.

Comment 29 Marius Vollmer 2018-03-07 09:59:08 UTC
(In reply to Martin Pitt from comment #28)

> However, (2) bears the possibility of severing the ssh (or cockpit or
> whatever) link that you are currently using to administer your machine, so I
> understand it makes sense to be cautious and never automatically tear down
> active devices/connections. I suppose something like that was the reason for
> the behaviour change?

NetworkManager has checkpoints to cope with this.

We did some work some time ago to get the predictable MAC guarantee, and Cockpit also takes other measures that make it very likely that enslaving an active interface into a bond will "seamlessly" give the bond the same IP addresses as the interface had before, and no TCP connections will be severed.  (And the bond will also get that same IP on each future activation of the bond.)

Comment 30 Marius Vollmer 2018-03-07 13:11:39 UTC
Here is the PR again with the workaround that we consider putting into Cockpit:

    https://github.com/cockpit-project/cockpit/pull/8769

I did some work on it so that it breaks less of our existing tests, and I plan to get it ready to be merged.

Comment 31 Thomas Haller 2018-03-07 14:20:13 UTC
> The patch just achieves the goal of guaranteeing that when you activate a
> master connection with autoconnect-slaves=1 all slaves gets activated. This is 
> simple enough to explain (in my opinion) and makes sense too. Moreover, it is 
> what NM did until 1.8, when we inadvertently changed it. The implementation 
> perhaps look like a hack, but the behavior certainly seems reasonable (and in 
> fact users were relying on it).

This is a different interpretation of connection.autoconnect-slaves, and `man nm-settings` isn't specific about what is supposed to happen.

connection.autoconnect=yes obviously only has effect for connections that are currently not active. I interpreted connection.autoconnect-slaves=yes the same. See comment 17. You now say, that connection.autoconnect-slaves should always forcefully activate all slaves. I see that that can make some sense, but it also seems dangerous, for example:

  - have non-slave profile "eth1" active.
  - have slave-profile "bond0-eth1" slave inactive
  - activate a master profile "bond0" with connection.autoconnect-slaves=yes
    (either explicitly via `nmcli con up`, or by just creating a new "bond0"
    profile and letting the master profile autoconnect).

With the new interpretation, this would forcefully activate bond0-eth1.
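The scenario above can be spelled out with plain nmcli commands. This is a sketch, not a verbatim reproducer from the bug report: the device name eth1 and the `connection.autoconnect no` settings are illustrative assumptions, added only to keep the steps deterministic.

```shell
# Assumption: a spare ethernet device eth1; all names are illustrative.

# A non-slave profile "eth1" is active on the device:
nmcli con add type ethernet ifname eth1 con-name eth1
nmcli con up eth1

# A master profile "bond0" with autoconnect-slaves enabled,
# plus an inactive slave profile "bond0-eth1" for the same device:
nmcli con add type bond ifname bond0 con-name bond0 \
    connection.autoconnect-slaves yes connection.autoconnect no
nmcli con add type ethernet ifname eth1 con-name bond0-eth1 \
    master bond0 connection.autoconnect no

# Under the new interpretation, activating the master forcefully
# activates bond0-eth1, stealing the device from the active
# standalone "eth1" profile:
nmcli con up bond0
```

Under the pre-patch (1.8-and-later) behavior, the last command would instead bring up bond0 with eth1 left on its standalone profile, i.e. a bond missing one of its slaves.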



I don't agree with the notion that a full master-slave setup requires *all* slaves to connect. If the slave is already busy with another profile (in the reproducer case, it's actually the same profile, but with an out-of-date non-slave configuration), then does the user really always want to steal the slave?


I think there is a point for both. If cockpit wants this forceful behavior, shouldn't it instead make sure that the slave candidates are disconnected first?


(ps, the nm-settings man-page should be fixed to clarify on the autoconnect-slaves behavior).



Ok, I don't disagree with the interpretation of autoconnect-slaves, I just didn't think it should work that way. If everybody thinks that's better...

Comment 33 Marius Vollmer 2018-03-14 12:00:54 UTC
Our pull request with a possible workaround in Cockpit is (pretty much) ready to be merged:

    https://github.com/cockpit-project/cockpit/pull/8769

So I am confident that we have a working workaround.

What should we do with that PR? :-)  Merge?

Comment 34 Huijuan Zhao 2018-03-20 01:26:57 UTC
*** Bug 1554628 has been marked as a duplicate of this bug. ***

Comment 35 Huijuan Zhao 2018-03-20 01:28:51 UTC
Bridge also has the same issue, see Bug 1554628.

Comment 36 Thomas Haller 2018-03-20 15:33:35 UTC
proposed CI tests: https://github.com/NetworkManager/NetworkManager-ci/pull/158

Comment 37 Beniamino Galvani 2018-03-24 09:44:44 UTC
*** Bug 1473952 has been marked as a duplicate of this bug. ***

Comment 38 Ryan Barry 2018-03-27 10:26:17 UTC
Beniamino - 

Is this going to make 7.5? I see a scratch build, but the flags are 7.6

Comment 39 Beniamino Galvani 2018-03-28 06:59:25 UTC
It's too late for inclusion in 7.5 and the patch from comment 20 needs improvements. We can target the first z-stream.

Comment 40 Beniamino Galvani 2018-03-28 16:51:03 UTC
Pushed a possible fix to branch bg/autoconnect-slaves-rh1548265.

Comment 41 Dan Kenigsberg 2018-04-01 15:47:11 UTC
"No such issue in rhel-7.4." ==> marking as a regression.

Comment 43 Thomas Haller 2018-04-04 10:39:10 UTC
(In reply to Beniamino Galvani from comment #40)
> Pushed a possible fix to branch bg/autoconnect-slaves-rh1548265.

> Don't make it a
> construct property that must be always specified, so that we don't
> require every caller to set a reason even when it's not necessary.

can we make it a construct-property, and require the caller to set the correct reason? Note how currently reasons "AUTOCONNECT" and "USER_REQUEST" are unused. I think they should be used, and the caller should be forced to provide the correct reason.

Comment 44 Beniamino Galvani 2018-04-06 09:01:55 UTC
(In reply to Thomas Haller from comment #43)
> (In reply to Beniamino Galvani from comment #40)
> > Pushed a possible fix to branch bg/autoconnect-slaves-rh1548265.
> 
> > Don't make it a
> > construct property that must be always specified, so that we don't
> > require every callers to set a reason even when it's not necessary.
> 
> can we make it a construct-property, and require the caller to set the
> correct reason? Note how currently reasons "AUTOCONNECT" and "USER_REQUEST"
> are unused. I think they should be used, and the caller should be forced to
> provide the correct reason.

Updated branch bg/autoconnect-slaves-rh1548265.

Comment 45 Thomas Haller 2018-04-06 09:45:39 UTC
     if (   success
-        && nm_auth_subject_is_internal (nm_active_connection_get_subject (active))) {
+        && nm_auth_subject_is_internal (nm_active_connection_get_subject (active))
+        && nm_active_connection_get_activation_reason (active) != NM_ACTIVATION_REASON_AUTOCONNECT_SLAVES) {


shouldn't it be 
  get_activation_reason(active) == REASON_AUTOCONNECT
?

Arguably, the effect is the same, because the only other reason is REASON_USER, for which !is_internal()


Pushed a fixup, rest lgtm

Comment 46 Beniamino Galvani 2018-04-06 12:23:04 UTC
(In reply to Thomas Haller from comment #45)
>      if (   success
> -        && nm_auth_subject_is_internal (nm_active_connection_get_subject
> (active))) {
> +        && nm_auth_subject_is_internal (nm_active_connection_get_subject
> (active))
> +        && nm_active_connection_get_activation_reason (active) !=
> NM_ACTIVATION_REASON_AUTOCONNECT_SLAVES) {
> 
> 
> shouldn't it be 
>   get_activation_reason(active) == REASON_AUTOCONNECT
> ?
> 
> Arguably, the effect is the same, because the only other reason is
> REASON_USER, for which !is_internal()

Yes, it is equivalent.

Comment 50 Thomas Haller 2018-04-12 14:01:00 UTC
(In reply to Beniamino Galvani from comment #49)
> Branch merged upstream to master:
> 
>  
> https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/
> ?id=7a063a91c7f8db8f78cab124bf9ba80a242be233
> 
> and nm-1-10 branch:
> 
>  
> https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?h=nm-1-
> 10&id=58df222cbb02281268469b122eb430efaf14fac6

should probably also include https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=786adf969c297fc2e830b43cc9e9b1a154d43216

Comment 57 errata-xmlrpc 2018-10-30 11:11:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3207

