Bug 1444109 - [cockpit] - Creating an active-backup bond with a primary slave that has the active connection leads to a situation in which the secondary slave is activated and enslaved first
Summary: [cockpit] - Creating an active-backup bond with a primary slave that has the active connection leads to a situation in which the secondary slave is activated and enslaved first
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: cockpit
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Marius Vollmer
QA Contact: qe-baseos-daemons
URL:
Whiteboard:
Depends On: 1472965
Blocks:
 
Reported: 2017-04-20 15:11 UTC by Michael Burman
Modified: 2021-01-15 07:34 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-15 07:34:24 UTC
Target Upstream Version:
Embargoed:


Attachments
NM log in debug (25.57 KB, application/x-gzip), attached 2017-04-20 15:29 UTC by Michael Burman

Description Michael Burman 2017-04-20 15:11:26 UTC
Description of problem:
[cockpit] - Creating an active-backup bond with a primary slave that has the active connection leads to a situation in which the secondary slave is activated and enslaved first.

cockpit version 135 includes some fixes for the bond creation scenario via the 'network' tab in cockpit, but it now seems that if an active-backup bond is created and the primary slave is the host's active connection interface, we end up with the wrong IP for the bond: the bond gets the MAC address of the secondary slave.
From the NM logs it looks like the primary slave is deactivated (as it should be, because it has bootproto=dhcp), the secondary slave is activated and enslaved first, and only then is the primary slave activated and enslaved. Because the primary slave is activated second, the bond acquires its IP via the wrong MAC address.

A simple re-apply of the same bond from this state (via cockpit, after it has picked up the new IP) fixes it, as the primary slave then comes up first and we get the desired IP and MAC address.
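
A quick way to check which slave's MAC the bond actually picked up and which slave is currently active (assuming the bond is named bond1, as in the logs below):

grep -E 'Currently Active Slave|Slave Interface|Permanent HW addr' /proc/net/bonding/bond1
ip link show bond1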

Version-Release number of selected component (if applicable):


How reproducible:
100% 

Steps to Reproduce:
1. Create a bond in active-backup mode via cockpit 135-4.
For example:
enp4s0 - has the active host connection; set it as the primary slave of the bond.
enp6s0 - the secondary slave.

Actual results:
enp6s0 is activated and enslaved first. The bond gets the wrong MAC address and IP, not those of the expected primary slave.

Expected results:
If a primary slave is set for an active-backup bond, the bond should get its MAC address and IP from that slave; it should be activated and enslaved first.

Additional info:
Will attach NM log soon.

Comment 2 Michael Burman 2017-04-20 15:27:41 UTC
NM log, not in debug - 

Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.6457] device (enp6s0): state change: prepare -> config (reason 'none') [40 50 0]
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.6496] device (enp4s0): state change: disconnected -> prepare (reason 'none') [30 40 0]
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.6511] device (enp4s0): state change: prepare -> config (reason 'none') [40 50 0]
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.7426] device (bond1): state change: config -> ip-config (reason 'none') [50 70 0]
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.7431] device (bond1): IPv4 config waiting until carrier is on
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.7432] device (bond1): IPv6 config waiting until carrier is on
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.7666] device (enp6s0): state change: config -> ip-config (reason 'none') [50 70 0]
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.9734] device (bond1): enslaved bond slave enp6s0
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.9735] device (enp6s0): Activation: connection 'enp6s0' enslaved, continuing activation
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.9737] device (bond1): IPv4 config waiting until carrier is on
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.9737] device (bond1): IPv6 config waiting until carrier is on
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.9745] device (enp6s0): state change: ip-config -> secondaries (reason 'none') [70 90 0]
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.9750] device (enp6s0): state change: secondaries -> activated (reason 'none') [90 100 0]
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.9811] device (enp6s0): Activation: successful, device activated.
Apr 20 16:13:11 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693991.9825] device (enp4s0): state change: config -> ip-config (reason 'none') [50 70 0]
Apr 20 16:13:12 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693992.1786] device (bond1): enslaved bond slave enp4s0
Apr 20 16:13:12 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693992.1787] device (enp4s0): Activation: connection 'enp4s0' enslaved, continuing activation
Apr 20 16:13:12 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693992.1787] device (bond1): IPv4 config waiting until carrier is on
Apr 20 16:13:12 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693992.1788] device (bond1): IPv6 config waiting until carrier is on
Apr 20 16:13:12 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693992.1796] device (enp4s0): state change: ip-config -> secondaries (reason 'none') [70 90 0]
Apr 20 16:13:12 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693992.1802] device (enp4s0): state change: secondaries -> activated (reason 'none') [90 100 0]
Apr 20 16:13:12 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693992.1845] device (enp4s0): Activation: successful, device activated.
Apr 20 16:13:15 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693995.1962] device (enp6s0): link connected
Apr 20 16:13:15 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693995.2022] device (bond1): link connected
Apr 20 16:13:15 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693995.2027] dhcp4 (bond1): activation: beginning transaction (timeout in 45 seconds)
Apr 20 16:13:15 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693995.2066] dhcp4 (bond1): dhclient started with pid 3238
Apr 20 16:13:15 orchid-vds2.qa.lab.tlv.redhat.com dhclient[3238]: DHCPDISCOVER on bond1 to 255.255.255.255 port 67 interval 6 (xid=0x9d9f109)
Apr 20 16:13:15 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492693995.3887] device (enp4s0): link connected
Apr 20 16:13:21 orchid-vds2.qa.lab.tlv.redhat.com dhclient[3238]: DHCPDISCOVER on bond1 to 255.255.255.255 port 67 interval 11 (xid=0x9d9f109)
Apr 20 16:13:32 orchid-vds2.qa.lab.tlv.redhat.com dhclient[3238]: DHCPDISCOVER on bond1 to 255.255.255.255 port 67 interval 16 (xid=0x9d9f109)
Apr 20 16:13:48 orchid-vds2.qa.lab.tlv.redhat.com dhclient[3238]: DHCPDISCOVER on bond1 to 255.255.255.255 port 67 interval 16 (xid=0x9d9f109)
Apr 20 16:13:49 orchid-vds2.qa.lab.tlv.redhat.com dhclient[3238]: DHCPREQUEST on bond1 to 255.255.255.255 port 67 (xid=0x9d9f109)
Apr 20 16:13:49 orchid-vds2.qa.lab.tlv.redhat.com dhclient[3238]: DHCPOFFER from 10.35.128.254
Apr 20 16:13:49 orchid-vds2.qa.lab.tlv.redhat.com dhclient[3238]: DHCPACK from 10.35.128.254 (xid=0x9d9f109)
Apr 20 16:13:49 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492694029.0959] dhcp4 (bond1):   address 10.35.128.227
Apr 20 16:13:49 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492694029.0963] dhcp4 (bond1):   plen 24 (255.255.255.0)
Apr 20 16:13:49 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492694029.0963] dhcp4 (bond1):   gateway 10.35.128.254
Apr 20 16:13:49 orchid-vds2.qa.lab.tlv.redhat.com NetworkManager[976]: <info>  [1492694029.0963] dhcp4 (bond1):   server identifier 10.35.28.1

Comment 3 Michael Burman 2017-04-20 15:29:49 UTC
Created attachment 1273019 [details]
NM log in debug

Comment 4 Michael Burman 2017-04-20 15:34:43 UTC
Version - 
cockpit-dashboard-135-4.el7.x86_64
cockpit-ovirt-dashboard-0.10.7-0.0.17.el7ev.noarch
cockpit-storaged-135-4.el7.noarch
cockpit-bridge-135-4.el7.x86_64
cockpit-ws-135-4.el7.x86_64
cockpit-system-135-4.el7.noarch

NetworkManager-libnm-1.4.0-19.el7_3.x86_64
NetworkManager-team-1.4.0-19.el7_3.x86_64
NetworkManager-config-server-1.4.0-19.el7_3.x86_64
NetworkManager-1.4.0-19.el7_3.x86_64
NetworkManager-tui-1.4.0-19.el7_3.x86_64

Comment 5 Marius Vollmer 2017-05-03 07:36:09 UTC
The idea was that Cockpit does not activate any slave, but leaves it to NM to determine the order.  NM itself is changing its behavior in this area, too, and I don't really know which NM version is shipped where and how each behaves.

Could you try this with just "nmcli"?

I'll assign this to NetworkManager.  Please assign back if you think I need to provide more evidence that Cockpit is "doing the right thing".

Comment 6 Michael Burman 2017-05-03 07:52:12 UTC
(In reply to Marius Vollmer from comment #5)
> The idea was that Cockpit does not activate any slave, but leaves it to NM
> to determine the order.  NM itself is changing its behavior in this area,
> too, and I don't really know which NM version is shipped where and how
> each behaves.
> 
> Could you try this with just "nmcli"?
> 
> I'll assign this to NetworkManager.  Please assign back if you think I need
> to provide more evidence that Cockpit is "doing the right thing".

If using nmcli, you must explicitly specify that the primary slave comes up and is enslaved first. If not, the secondary slave can come up before the primary and we end up with the situation described in this bug.
If the correct order is not enforced, we may end up with the wrong IP and MAC for the bond.
I don't think NetworkManager knows the correct order in which the slaves should come up unless it is specified explicitly in the nmcli commands.

For example, this is how I do it with nmcli:

[root@orchid-vds2 ~]# nmcli connection show 
NAME           UUID                                  TYPE            DEVICE 
System enp4s0  c81d9f81-beea-4b64-9568-631dc4a8e44e  802-3-ethernet  enp4s0 
virbr0         43b12d22-67be-420c-ac88-b4d7c4765caf  bridge          virbr0 
enp6s0         73127947-780e-408e-b3b9-a0955bee2b5d  802-3-ethernet  --     
ens1f0         fc6850dc-9b81-4371-b71e-6af577dacc63  802-3-ethernet  --     
ens1f1         bb874038-8edb-4827-9e0f-af12d0d14b51  802-3-ethernet  --     

[root@orchid-vds2 ~]# nmcli connection add type bond con-name bond1 ifname bond1 mode active-backup primary enp4s0; \
> nmcli connection modify id bond1 ipv4.method auto ipv6.method ignore; \
> nmcli con mod uuid c81d9f81-beea-4b64-9568-631dc4a8e44e ipv4.method disabled ipv6.method ignore; \
> nmcli connection modify uuid c81d9f81-beea-4b64-9568-631dc4a8e44e connection.slave-type bond connection.master bond1 connection.autoconnect yes; \
> nmcli connection modify id enp6s0 connection.slave-type bond connection.master bond1 connection.autoconnect yes; \
> nmcli con down uuid c81d9f81-beea-4b64-9568-631dc4a8e44e; \
> nmcli con up uuid c81d9f81-beea-4b64-9568-631dc4a8e44e; \
> nmcli con down id enp6s0; \
> nmcli con up id enp6s0; \
> nmcli con up id bond1

Connection 'bond1' (4b9d349e-4aa0-4ff4-a5e9-992024491030) successfully added.
Connection 'System enp4s0' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/0)
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/5)
Connection 'enp6s0' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/4)
Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/6)
Connection successfully activated (master waiting for slaves) (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/7)

- If I don't tell NM that the 'System enp4s0' connection (the primary slave) should come up first, I may get the wrong MAC and IP for the bond, with 'enp6s0' coming up before 'enp4s0'. This is exactly what happens when creating an active-backup bond via cockpit. The order in which the slaves come up and are enslaved to the bond matters.

Comment 7 Edward Haas 2017-07-19 06:04:22 UTC
NM 1.8 introduced a configuration parameter to control the order in which the slaves come up: slaves-order=name
But it needs to be set explicitly, as vdsm does: https://gerrit.ovirt.org/#/c/78362/3/static/etc/NetworkManager/conf.d/vdsm.conf
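
The linked vdsm.conf boils down to a NetworkManager drop-in roughly like this (a sketch only; see the gerrit change above for the actual file contents):

# e.g. /etc/NetworkManager/conf.d/vdsm.conf
[main]
slaves-order=name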

What I find strange here is that the existing behaviour should not have changed (the order was based on the iface index, but I guess you reused the same machines, so why would that change?).

This may have also hit us on another BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1463218

Comment 8 Francesco Giudici 2017-07-19 17:00:37 UTC
Let me add a little bit of context here:
Some changes were applied to cockpit and NM in order to get coherent behavior between creating the bond and restarting the machine.
As you know, the bond's MAC is taken from the first enslaved interface.
What happened in the past was that on bond creation cockpit determined the MAC by the order in which it activated the slaves, but on reboot it was NetworkManager that picked the activation order.

This was addressed with two actions:
1) cockpit was changed to let NM pick the order of the enslaved interfaces: to do this, it activates the master interface last
2) NM was patched to allow activating devices based on their name (the option has to be selected in the config file, otherwise the activation order is based on the ifindex... but it should have landed as the default config for RHEV)

This makes the activation (enslave) order, and therefore the MAC address, consistent between creation and boot.

There is no special enslave order for active-backup mode (the only mode in which you can specify the primary device).

If a specific MAC is required, it can now be specified on the master by changing the 802-3-ethernet.cloned-mac-address property. With nmcli this would be:

nmcli connection modify BOND_CONN_NAME 802-3-ethernet.cloned-mac-address DESIRED_MAC
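
For example, to pin a bond connection named bond1 to a fixed address (the MAC below is only an illustrative placeholder):

nmcli connection modify bond1 802-3-ethernet.cloned-mac-address 52:54:00:12:34:56
nmcli connection up bond1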

The issue suggested by Edward is not related to this one; it seems instead to point to a NetworkManager bug.

I would just close this bug. Anyway, the only other option that you may want is to set the MAC of the primary interface in cockpit, leveraging the cloned-mac-address property. For this reason, reassigning to cockpit, but I think it could be closed.

Comment 9 Michael Burman 2017-07-20 07:38:12 UTC
(In reply to Francesco Giudici from comment #8)
> Let me add a little bit of context here:
> Some changes were applied to cockpit and NM in order to get coherent
> behavior between creating the bond and restarting the machine.
> As you know, the bond's MAC is taken from the first enslaved interface.
> What happened in the past was that on bond creation cockpit determined the
> MAC by the order in which it activated the slaves, but on reboot it was
> NetworkManager that picked the activation order.
> 
> This was addressed with two actions:
> 1) cockpit was changed to let NM pick the order of the enslaved
> interfaces: to do this, it activates the master interface last
> 2) NM was patched to allow activating devices based on their name (the
> option has to be selected in the config file, otherwise the activation
> order is based on the ifindex... but it should have landed as the default
> config for RHEV)
> 
> This makes the activation (enslave) order, and therefore the MAC address,
> consistent between creation and boot.
> 
> There is no special enslave order for active-backup mode (the only mode in
> which you can specify the primary device).
> 
> If a specific MAC is required, it can now be specified on the master by
> changing the 802-3-ethernet.cloned-mac-address property. With nmcli this
> would be:
> 
> nmcli connection modify BOND_CONN_NAME 802-3-ethernet.cloned-mac-address DESIRED_MAC
> 
> The issue suggested by Edward is not related to this one; it seems instead
> to point to a NetworkManager bug.
> 
> I would just close this bug. Anyway, the only other option that you may
> want is to set the MAC of the primary interface in cockpit, leveraging the
> cloned-mac-address property. For this reason, reassigning to cockpit, but
> I think it could be closed.

I really don't think this bug should be closed, as it is still a bug.
I believe it now depends on BZ 1472965.

This bug can be closed only after it has been tested and verified on the latest cockpit version and the latest RHEL 7.4 version with the fix for the NM bond setup bug mentioned above.

Comment 10 Michael Burman 2017-08-29 09:03:53 UTC
I'm affected by this bug 100% of the time when creating a bond with mode=1.
It does not matter which cockpit version I'm using (tested the latest, 141).
Every time I try to create a mode=1 bond and choose a primary slave, I end up with the wrong IP (that of the second slave) and lose the connection.

Since cockpit version 141 still has no option to set the MAC of the primary interface (leveraging the cloned-mac-address property), I am always affected by this bug and it's not possible to work this way.
What is the status of this report? Is the cockpit team going to fix it, e.g. by adding an option to set the MAC of the primary interface?

Comment 11 Edward Haas 2017-08-29 11:20:34 UTC
VDSM places a configuration file under NetworkManager/conf.d to set the correct slave order.
But this takes effect only after VDSM is installed and the host is rebooted (or NM is restarted).

https://github.com/oVirt/vdsm/blob/master/static/etc/NetworkManager/conf.d/vdsm.conf

The problem I think we face here is that the default NM slave order is still in effect at the stage where cockpit is used. After VDSM is installed and the host rebooted, the order used is the one defined by vdsm.conf (slaves-order=name).

The only workaround I can think of is to install VDSM, reboot the host (or restart the NM service) and then create the bond using cockpit (a manual sketch of this is given below).
Another option is to create the bond using RHV instead of cockpit.
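
A rough manual equivalent of the install-VDSM workaround, for anyone who prefers not to install VDSM (the drop-in file name below is illustrative; VDSM ships its own file, linked above):

printf '[main]\nslaves-order=name\n' > /etc/NetworkManager/conf.d/99-slaves-order.conf
systemctl restart NetworkManager
# then create the active-backup bond from the cockpit network tab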

A proposed solution is for cockpit to deploy the required NM configuration (and restart the NM service) on request.

Comment 13 RHEL Program Management 2021-01-15 07:34:24 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

