Description of problem:
vSphere UPI static-IP active-backup bonding configured via kargs: NetworkManager enters an infinite link-flap loop after the link of the active-backup primary slave is restored.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Disconnect the link of the primary slave (ens192) using the vSphere console.
2. Wait for the link to fail over to the backup slave (ens224).
3. Reboot.
4. Re-connect the old primary slave (ens192).

Actual results:
NetworkManager enters an infinite loop of link flaps and network connectivity to the node is lost. Once the old primary slave (ens192) is disconnected again, the link flapping stops and the network recovers.

Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2206] device (bond0): assigned bond port ens192
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2207] device (ens192): Activation: connection 'ens192' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2213] policy: auto-activating connection 'ens224-slave-ovs-clone' (1d428f7f-4ff6-42fd-ba2c-6831bd40544d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2214] policy: auto-activating connection 'ens256-slave-ovs-clone' (2f26484b-4855-4a68-b7cc-407f27d53546)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2267] device (ens192): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2272] device (ens224): Activation: starting connection 'ens224-slave-ovs-clone' (1d428f7f-4ff6-42fd-ba2c-6831bd40544d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2273] device (ens256): Activation: starting connection 'ens256-slave-ovs-clone' (2f26484b-4855-4a68-b7cc-407f27d53546)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2275] device (bond0): disconnecting for new activation request.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2276] device (bond0): state change: ip-config -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2282] device (ens224): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2284] device (ens256): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2309] device (ens192): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2311] device (bond0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2596] device (bond0): released bond slave ens192
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <warn> [1655849684.2676] device (ens192): queue-state[activated, reason:none, id:452056]: replace previously queued state change
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3073] device (bond0): Activation: starting connection 'ovs-if-phys0' (7ac68a21-25ac-47c4-a8ca-70a83181965d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3095] device (ens192): state change: secondaries -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3099] device (bond0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3243] device (bond0): set-hw-addr: set-cloned MAC address to 00:50:56:AC:59:95 (00:50:56:AC:59:95)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3245] device (bond0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3294] device (ens224): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3297] device (ens256): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3300] device (bond0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3304] device (bond0): Activation: connection 'ovs-if-phys0' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3306] device (ens192): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3338] device (ens224): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3639] device (bond0): assigned bond port ens224
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3640] device (ens224): Activation: connection 'ens224-slave-ovs-clone' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3642] device (ens256): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3903] device (bond0): assigned bond port ens256
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3903] device (ens256): Activation: connection 'ens256-slave-ovs-clone' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3906] device (bond0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3910] policy: auto-activating connection 'ens192' (2377489e-02ef-45db-bac0-06585c6c4fff)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3912] device (bond0): carrier: link connected
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3931] device (ens224): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3934] device (ens256): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3939] device (ens192): Activation: starting connection 'ens192' (2377489e-02ef-45db-bac0-06585c6c4fff)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3940] device (bond0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3943] device (bond0): disconnecting for new activation request.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3943] device (bond0): state change: secondaries -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3944] device (bond0): releasing ovs interface bond0
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3946] device (bond0): released from master device bond0
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3948] device (ens192): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3953] device (ens224): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3954] device (ens224): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3957] device (ens224): Activation: successful, device activated.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3963] device (ens256): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3965] device (ens256): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3967] device (ens256): Activation: successful, device activated.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4014] device (bond0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4349] device (bond0): released bond slave ens224
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4779] device (bond0): released bond slave ens256
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4781] device (bond0): set-hw-addr: set MAC address to 00:50:56:AC:59:95 (restore)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4863] device (bond0): set-hw-addr: reset MAC address to D2:09:18:BB:91:74 (deactivate)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4866] device (bond0): Activation: starting connection 'bond0' (ef21706f-9968-419c-877d-f3c80a098daf)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4925] device (ens224): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4930] device (ens256): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4935] device (bond0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5079] device (bond0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5097] device (ens192): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5099] device (bond0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5101] device (ens224): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5109] device (ens256): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5120] device (ens192): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5380] device (bond0): assigned bond port ens192

Expected results:
No link flaps, no network disconnection.

Additional info:
Unable to reproduce yet on IPI libvirt baremetal.
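For context, the kargs-based bond setup in this scenario follows the dracut cmdline syntax (`bond=` and `ip=`). A hypothetical example with placeholder addresses, not the actual kargs from this cluster:

```text
bond=bond0:ens192,ens224:mode=active-backup,miimon=100
ip=192.0.2.10::192.0.2.1:255.255.255.0:compute-1:bond0:none
```

The `ip=...:bond0:none` form assigns the static address to the bond itself, with ens192 and ens224 as its slaves.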
Does not reproduce on a RHEL 8.6 worker with {ens192,ens224,bond0}.nmconnection profiles.
From the configure-ovs perspective, this might be due to the fact that only active slave profiles are cloned. Since ens192 was down on reboot, it had no active profile, so no profile was cloned for it. Then when it is reconnected, the original profile activates, which in turn activates the original bond profile instead of the clone made for ovn-k, and so on. I am not sure about the loop itself, or whether it is worth looking into, since at this point things are already not working as we expect anyway. This should work if there were no reboot. I guess we ought to clone all slave profiles, active or not; we could filter instead by whether autoconnect is set. I don't consider this a regression. We probably introduced this issue when we started cloning the profiles a while ago, and that was done to solve another set of bonding issues that might have left this scenario broken as well.
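A minimal sketch of the suggested selection change, assuming profiles are listed as `NAME:AUTOCONNECT` pairs (e.g. via `nmcli -g NAME,AUTOCONNECT connection show`). The helper name and the filtering logic are illustrative, not configure-ovs's actual code:

```shell
# Illustrative helper: pick clone candidates by autoconnect instead of by
# active state, so a slave profile whose link was down at boot (and thus
# had no active profile) is still cloned.
# stdin: NAME:AUTOCONNECT lines, one per profile
clone_candidates() {
  awk -F: '$2 == "yes" { print $1 }'
}

# ens192's link is down (profile not active) but autoconnect=yes, so it is
# still selected; a profile with autoconnect=no is skipped.
printf 'ens192:yes\nens224:yes\nens256:no\n' | clone_candidates
# prints:
# ens192
# ens224
```

Filtering on autoconnect rather than active state covers the reboot-with-link-down case described above.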
@rbrattai Ross, as time allows, please try out this tentative improvement: https://github.com/openshift/machine-config-operator/pull/3203
@jcaamano So the link flapping is fixed, but we are still switching MACs and potentially causing issues. I'm not sure if switching MACs is acceptable, because it can change DHCP leases, which changes IPs, which breaks etcd with EtcdCertSignerControllerDegraded:

EtcdCertSignerControllerDegraded: [x509: certificate is valid for 192.168.123.58, fd2e:6f44:5dd8::47, not 192.168.123.68, x509: certificate is valid for ::1, 127.0.0.1, 192.168.123.58, ::1, fd2e:6f44:5dd8::47, not 192.168.123.68]...

We could require that all bonding slave MACs get the same IP via DHCP configuration, but that might not work in practice.

libvirt IPI RHCOS active-backup fail_over_mac=0

Initial state after install, master-0-2 192.168.123.58:

Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: 3: enp5s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: 4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff permaddr 52:54:00:fb:79:00 promiscuity 0 minmtu 68 maxmtu 65535
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: 5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP group default qlen 1000
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: 7: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
Jul 12 22:23:28 master-0-2 configure-ovs.sh[3765]: link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535

enp5s0 MAC == bond0 MAC == br-ex MAC

After disconnecting enp5s0 and rebooting, we come back up with the same config but then tear it all down and select enp6s0, which has a different MAC, 52:54:00:fb:79:00. This causes the node IP to change due to a new DHCP lease.

Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]: 3: enp5s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]: link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]: 4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]: link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]: 11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
Jul 14 01:57:19 master-0-2 configure-ovs.sh[2982]: link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 01:57:19 master-0-2 configure-ovs.sh[3051]: 3: enp5s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Jul 14 01:57:19 master-0-2 configure-ovs.sh[3051]: link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 01:57:19 master-0-2 configure-ovs.sh[3051]: 4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
Jul 14 01:57:19 master-0-2 configure-ovs.sh[3051]: link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 01:57:20 master-0-2 configure-ovs.sh[3051]: 11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
Jul 14 01:57:20 master-0-2 configure-ovs.sh[3051]: link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: 3: enp5s0: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc fq_codel master bond0 state DOWN group default qlen 1000
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff permaddr 52:54:00:9e:6b:cf promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: 4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP group default qlen 1000
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: 11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP group default qlen 1000
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: 12: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
Jul 14 01:58:36 master-0-2 configure-ovs.sh[4960]: link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535

Then we re-connect enp5s0, disconnect enp6s0, and reboot. This time we get a weird case where the bond0 MAC is random and there is a mismatch. I have no idea where the bond0 MAC e2:88:e2:6b:53:d4 comes from, but bond0 is down anyway.

Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]: 3: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]: link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]: 4: enp6s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]: link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]: 11: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
Jul 14 04:32:03 master-0-2 configure-ovs.sh[3183]: link/ether e2:88:e2:6b:53:d4 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]: 3: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]: link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]: 4: enp6s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]: link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]: 11: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
Jul 14 04:33:08 master-0-2 configure-ovs.sh[3289]: link/ether e2:88:e2:6b:53:d4 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: 3: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ovs-system state UP group default qlen 1000
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: 4: enp6s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: link/ether 52:54:00:fb:79:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: 11: bond0: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: link/ether e2:88:e2:6b:53:d4 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: 19: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
Jul 14 04:44:25 master-0-2 configure-ovs.sh[7842]: link/ether 52:54:00:9e:6b:cf brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535

Full logs: http://file.rdu.redhat.com/~rbrattai/logs/bz2099945-bond0-mac-changes.tar.xz
@rbrattai Honestly, I think this is not an issue. I would expect these configurations to be set up by configuring a fixed MAC address or a DHCP client id on the bond itself, rather than relying on the specific MAC of an undetermined slave. When someone configures a static MAC assignment on the DHCP server for a bond, how do they choose which MAC will be used? There is no guarantee of which slave will be enslaved first, or which slave's link will be available or unavailable on a given boot. The only reliable options are to configure a fixed MAC on the bond itself or to use some other mechanism, such as a DHCP client id.
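Both options can be expressed with nmcli on the bond profile; a sketch, assuming the profile is named 'bond0' (the MAC and client-id values are placeholders):

```
# Pin a fixed MAC on the bond so the DHCP lease no longer depends on
# which slave happens to be enslaved first
nmcli connection modify bond0 ethernet.cloned-mac-address 52:54:00:00:00:01

# Or leave the MAC dynamic and identify to the DHCP server by client id
nmcli connection modify bond0 ipv4.dhcp-client-id bond0-compute-1

nmcli connection up bond0
```

Either way, the lease key becomes a property of the bond connection rather than of an undetermined slave NIC.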
I agree IP reservations are required. The docs specify this in two places.

https://docs.openshift.com/container-platform/4.10/installing/installing_bare_metal/installing-bare-metal-network-customizations.html#installation-network-user-infra_installing-bare-metal-network-customizations

"It is recommended to use a DHCP server for long-term management of the cluster machines. Ensure that the DHCP server is configured to provide persistent IP addresses, DNS server information, and hostnames to the cluster machines."

https://docs.openshift.com/container-platform/4.10/installing/installing_bare_metal_ipi/ipi-install-prerequisites.html#network-requirements-reserving-ip-addresses_ipi-install-prerequisites

"Reserving IP addresses for nodes with the DHCP server
For the baremetal network, a network administrator must reserve a number of IP addresses, including:
Two unique virtual IP addresses.
One virtual IP address for the API endpoint.
One virtual IP address for the wildcard ingress endpoint.
One IP address for the provisioner node.
One IP address for each control plane (master) node.
One IP address for each worker node, if applicable."

So by implication, all interfaces used for a bond should have the same IP reservation. Our testing DHCP setups need adjustment for this bonding case, but the requirement is clear.

Verified on 4.12.0-0.nightly-2022-07-12-164246
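For DHCP servers that support listing several MACs in one host entry, the "same reservation for all bond slaves" requirement is a one-line change. A dnsmasq sketch, reusing the slave MACs and node IP from the log excerpt above purely for illustration:

```
# dnsmasq.conf: either slave MAC of master-0-2's bond gets the same lease
dhcp-host=52:54:00:9e:6b:cf,52:54:00:fb:79:00,192.168.123.58
```

Other DHCP servers (e.g. ISC dhcpd) would need one host block per MAC pointing at the same fixed address.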
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399