Bug 2099945 - [OVN] bonding fails after active-backup fail-over and reboot, kargs static IP
Summary: [OVN] bonding fails after active-backup fail-over and reboot, kargs static IP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Jaime Caamaño Ruiz
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On:
Blocks: 2103899
 
Reported: 2022-06-22 04:45 UTC by Ross Brattain
Modified: 2023-01-17 19:50 UTC (History)
1 user

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:50:08 UTC
Target Upstream Version:
Embargoed:
rbrattai: needinfo-


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 3203 0 None Merged Bug 2099945: configure-ovs: clone inactive autoconnect slaves 2022-07-20 14:24:25 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:50:25 UTC

Description Ross Brattain 2022-06-22 04:45:25 UTC
Description of problem:

vSphere UPI static IP active-backup bonding using kargs

NetworkManager enters an infinite link flap loop after active-backup primary slave link is restored.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. disconnect link of primary slave (ens192) using vSphere console
2. wait for link to fail over to backup slave (ens224).
3. reboot
4. re-connect old primary slave (ens192).

Actual results:

NetworkManager enters an infinite loop of link flaps.  Network connectivity to the node is lost.

Once old primary slave (ens192) is re-disconnected NetworkManager link flap stops and network recovers.


Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2206] device (bond0): assigned bond port ens192
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2207] device (ens192): Activation: connection 'ens192' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2213] policy: auto-activating connection 'ens224-slave-ovs-clone' (1d428f7f-4ff6-42fd-ba2c-6831bd40544d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2214] policy: auto-activating connection 'ens256-slave-ovs-clone' (2f26484b-4855-4a68-b7cc-407f27d53546)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2267] device (ens192): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2272] device (ens224): Activation: starting connection 'ens224-slave-ovs-clone' (1d428f7f-4ff6-42fd-ba2c-6831bd40544d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2273] device (ens256): Activation: starting connection 'ens256-slave-ovs-clone' (2f26484b-4855-4a68-b7cc-407f27d53546)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2275] device (bond0): disconnecting for new activation request.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2276] device (bond0): state change: ip-config -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2282] device (ens224): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2284] device (ens256): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2309] device (ens192): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2311] device (bond0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.2596] device (bond0): released bond slave ens192
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <warn>  [1655849684.2676] device (ens192): queue-state[activated] reason:none, id:452056]: replace previously queued state change
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3073] device (bond0): Activation: starting connection 'ovs-if-phys0' (7ac68a21-25ac-47c4-a8ca-70a83181965d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3095] device (ens192): state change: secondaries -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3099] device (bond0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3243] device (bond0): set-hw-addr: set-cloned MAC address to 00:50:56:AC:59:95 (00:50:56:AC:59:95)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3245] device (bond0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3294] device (ens224): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3297] device (ens256): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3300] device (bond0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3304] device (bond0): Activation: connection 'ovs-if-phys0' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3306] device (ens192): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3338] device (ens224): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3639] device (bond0): assigned bond port ens224
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3640] device (ens224): Activation: connection 'ens224-slave-ovs-clone' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3642] device (ens256): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3903] device (bond0): assigned bond port ens256
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3903] device (ens256): Activation: connection 'ens256-slave-ovs-clone' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3906] device (bond0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3910] policy: auto-activating connection 'ens192' (2377489e-02ef-45db-bac0-06585c6c4fff)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3912] device (bond0): carrier: link connected
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3931] device (ens224): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3934] device (ens256): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3939] device (ens192): Activation: starting connection 'ens192' (2377489e-02ef-45db-bac0-06585c6c4fff)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3940] device (bond0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3943] device (bond0): disconnecting for new activation request.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3943] device (bond0): state change: secondaries -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3944] device (bond0): releasing ovs interface bond0
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3946] device (bond0): released from master device bond0
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3948] device (ens192): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3953] device (ens224): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3954] device (ens224): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3957] device (ens224): Activation: successful, device activated.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3963] device (ens256): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3965] device (ens256): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.3967] device (ens256): Activation: successful, device activated.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.4014] device (bond0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.4349] device (bond0): released bond slave ens224
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.4779] device (bond0): released bond slave ens256
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.4781] device (bond0): set-hw-addr: set MAC address to 00:50:56:AC:59:95 (restore)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.4863] device (bond0): set-hw-addr: reset MAC address to D2:09:18:BB:91:74 (deactivate)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.4866] device (bond0): Activation: starting connection 'bond0' (ef21706f-9968-419c-877d-f3c80a098daf)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.4925] device (ens224): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.4930] device (ens256): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.4935] device (bond0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.5079] device (bond0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.5097] device (ens192): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.5099] device (bond0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.5101] device (ens224): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.5109] device (ens256): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.5120] device (ens192): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info>  [1655849684.5380] device (bond0): assigned bond port ens192



Expected results:

No link flaps, no network disconnection.


Additional info:

Unable to reproduce yet on IPI libvirt baremetal.

Comment 2 Ross Brattain 2022-06-22 05:05:13 UTC
Does not reproduce on RHEL8.6 worker with {ens192,ens224,bond0}.nmconnection

Comment 4 Jaime Caamaño Ruiz 2022-06-22 08:21:00 UTC
From the configure-ovs perspective, this might be because only active slave profiles are cloned. Since ens192 was down on reboot, it had no active profile, so no profile was cloned for it. Then when it is reconnected, the original profile activates, which in turn activates the original bond profile instead of the clone made for ovn-k, and so on...

I am not sure about the loop itself, though, or whether it is worth looking into, since things are not working as we expect anyway.

This would work if there were no reboot involved.

I guess we ought to clone all slave profiles, active or not. We could filter instead on whether autoconnect is set.

I don't consider this a regression. We probably introduced this issue a while ago, when we started cloning the profiles to solve another set of bonding issues, and this scenario might not have worked before that either.
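The filtering proposed above — selecting bond slave profiles by their autoconnect flag rather than by active state — could be sketched as below. The `nmcli` field names are real, but the wrapper functions and the canned sample input are illustrative assumptions, not the actual configure-ovs code:

```shell
#!/usr/bin/env bash
# Sketch: pick bond slave profiles to clone by the autoconnect flag instead of
# active state, so a slave whose link is down at boot (e.g. ens192) still gets
# an ovs clone. In a live script the input would come from:
#   nmcli -g NAME,UUID,AUTOCONNECT,SLAVE c show
# Here a canned sample stands in for that query.
sample_profiles() {
  cat <<'EOF'
ens192:2377489e-02ef-45db-bac0-06585c6c4fff:yes:bond0
ens224:1d428f7f-4ff6-42fd-ba2c-6831bd40544d:yes:bond0
ens256:2f26484b-4855-4a68-b7cc-407f27d53546:yes:bond0
some-other-conn:aaaaaaaa-0000-0000-0000-000000000000:no:
EOF
}

clone_candidates() {
  local master="$1"
  while IFS=: read -r name uuid autoconnect slave_of; do
    # Clone every profile slaved to the bond with autoconnect enabled,
    # whether or not it is currently active.
    if [ "$slave_of" = "$master" ] && [ "$autoconnect" = "yes" ]; then
      echo "$name"
    fi
  done < <(sample_profiles)
}

clone_candidates bond0
```

With the sample input this lists all three slave profiles, including ens192, whose link was down.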

Comment 5 Jaime Caamaño Ruiz 2022-06-22 10:56:26 UTC
@rbrattai

Ross, as time allows, please try out this tentative improvement: https://github.com/openshift/machine-config-operator/pull/3203

Comment 8 Ross Brattain 2022-07-18 00:14:28 UTC
@jcaamano

So the link flapping is fixed, but we are still switching MACs and potentially causing issues.

I'm not sure switching MACs is acceptable, because it can change the DHCP lease, which changes the node IP, which in turn breaks etcd with EtcdCertSignerControllerDegraded.

 EtcdCertSignerControllerDegraded: [x509: certificate is valid for 192.168.123.58, fd2e:6f44:5dd8::47, not 192.168.123.68, x509: certificate is valid for ::1, 127.0.0.1, 192.168.123.58, ::1, fd2e:6f44:5dd8::47, not 192.168.123.68]...

We could require that all bonding slave MACs have the same IP via DHCP configuration, but that might not work in practice.
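For illustration, a DHCP setup that hands the same address to either slave MAC of a bond could look like the dnsmasq fragment below. The MACs and IP are taken from the logs in this comment; the fragment itself is an assumption about the lab setup, not part of the fix:

```
# dnsmasq: give the same lease to either bond slave MAC of master-0-2, so a
# fail-over that changes the bond's active MAC keeps the node IP stable.
dhcp-host=52:54:00:9e:6b:cf,52:54:00:fb:79:00,192.168.123.58,master-0-2
```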


libvirt IPI RHCOS active-backup fail_over_mac=0

Initial state after install.

master-0-2 192.168.123.58

Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: 3: enp5s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]:     link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: 4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]:     link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff permaddr 52:54:00:fb:79:00 promiscuity 0 minmtu 68 maxmtu 65535
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: 5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP group default qlen 1000
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]:     link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: 7: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]:     link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535

enp5s0 MAC == bond0 MAC == br-ex MAC

After disconnecting enp5s0 and rebooting, the node comes up with the same config, but then tears it all down and selects enp6s0, which has a different MAC, 52:54:00:fb:79:00.

This causes the node IP to change due to a new DHCP lease.

Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]: 3: enp5s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]:     link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]: 4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]:     link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]: 11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]:     link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 01:57:19 master-0-2 configure-ovs.sh[3051]: 3: enp5s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Jul 14 01:57:19 master-0-2 configure-ovs.sh[3051]:     link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 01:57:19 master-0-2 configure-ovs.sh[3051]: 4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
Jul 14 01:57:19 master-0-2 configure-ovs.sh[3051]:     link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 01:57:20 master-0-2 configure-ovs.sh[3051]: 11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
Jul 14 01:57:20 master-0-2 configure-ovs.sh[3051]:     link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: 3: enp5s0: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc fq_codel master bond0 state DOWN group default qlen 1000
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]:     link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff permaddr 52:54:00:9e:6b:cf promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: 4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]:     link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: 11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP group default qlen 1000
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]:     link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: 12: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]:     link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535


Then we re-connect enp5s0, disconnect enp6s0, and reboot.  This time we hit a weird case where the bond0 MAC is random and there is a mismatch.

I have no idea where the bond0 MAC e2:88:e2:6b:53:d4 comes from, but bond0 is down anyway.

Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]: 3: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]:     link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]: 4: enp6s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]:     link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]: 11: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]:     link/ether e2:88:e2:6b:53:d4 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]: 3: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]:     link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]: 4: enp6s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]:     link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]: 11: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]:     link/ether e2:88:e2:6b:53:d4 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: 3: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP group default qlen 1000
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]:     link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: 4: enp6s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]:     link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: 11: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]:     link/ether e2:88:e2:6b:53:d4 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: 19: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]:     link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535


Full logs  http://file.rdu.redhat.com/~rbrattai/logs/bz2099945-bond0-mac-changes.tar.xz

Comment 9 Jaime Caamaño Ruiz 2022-07-18 17:40:16 UTC
@rbrattai honestly I don't think this is an issue. I would expect these configurations to be set up by fixing the MAC address, or using a DHCP client ID, on the bond itself, rather than relying on the specific MAC of an undetermined slave.

Really, when someone configures a static MAC assignment on the DHCP server for a bond, how do they choose which MAC will be used? There is no guarantee of which slave will be enslaved first, or of which slave's link will be up or down on a given boot. The only robust options are to configure a fixed MAC on the bond itself or to use some other mechanism such as a DHCP client ID.
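A hypothetical NetworkManager keyfile along those lines — a fixed cloned MAC and a MAC-based DHCP client ID on the bond itself — might look like the sketch below. The MAC value is reused from this bug's logs; the profile is an assumption for illustration, not the configure-ovs generated one:

```ini
# /etc/NetworkManager/system-connections/bond0.nmconnection (sketch)
[connection]
id=bond0
type=bond
interface-name=bond0

[bond]
mode=active-backup
miimon=100

[ethernet]
# Pin the bond MAC so fail-over between slaves does not change it.
cloned-mac-address=52:54:00:9e:6b:cf

[ipv4]
method=auto
# Alternatively, identify the DHCP lease by client id instead of by MAC.
dhcp-client-id=mac
```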

Comment 10 Ross Brattain 2022-07-18 19:24:10 UTC
I agree IP reservations are required.

The docs specify this in two places.

https://docs.openshift.com/container-platform/4.10/installing/installing_bare_metal/installing-bare-metal-network-customizations.html#installation-network-user-infra_installing-bare-metal-network-customizations

"It is recommended to use a DHCP server for long-term management of the cluster machines. Ensure that the DHCP server is configured to provide persistent IP addresses, DNS server information, and hostnames to the cluster machines."

https://docs.openshift.com/container-platform/4.10/installing/installing_bare_metal_ipi/ipi-install-prerequisites.html#network-requirements-reserving-ip-addresses_ipi-install-prerequisites


Reserving IP addresses for nodes with the DHCP server

For the baremetal network, a network administrator must reserve a number of IP addresses, including:

    Two unique virtual IP addresses.

        One virtual IP address for the API endpoint.

        One virtual IP address for the wildcard ingress endpoint.

    One IP address for the provisioner node.

    One IP address for each control plane (master) node.

    One IP address for each worker node, if applicable.




So by implication all interfaces used for a bond should have the same IP reservation.

Our testing DHCP setups need adjustment for this bonding case, but the requirement is clear.

Verified on 4.12.0-0.nightly-2022-07-12-164246

Comment 13 errata-xmlrpc 2023-01-17 19:50:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399

