Created attachment 1889504 [details]
workers ovs-configuration logs

Description of problem:

This issue is similar to BZ 2055433. This issue is *not* happening in 4.10.16.

After installing OCP 4.10, while the PAO profile is being applied, the workers applying the new mc fail to obtain an IP and get disconnected from the cluster. The issue seems to be with ovs-configuration. This installation is on baremetal, using IPI.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.17   True        False         99m     Cluster version is 4.10.17

How reproducible:
All the time in 4.10.17

Steps to Reproduce:
1. Install OCP 4.10.17 on baremetal
2. Install PAO and apply a profile; while the workers are applying the new mc, some of them will not come up again.

Actual results:
Some nodes stay NotReady and never rejoin the cluster:

$ oc get nodes
NAME       STATUS                        ROLES    AGE    VERSION
master-0   Ready                         master   100m   v1.23.5+3afdacb
master-1   Ready                         master   99m    v1.23.5+3afdacb
master-2   Ready                         master   99m    v1.23.5+3afdacb
worker-0   NotReady,SchedulingDisabled   worker   68m    v1.23.5+3afdacb
worker-1   Ready                         worker   67m    v1.23.5+3afdacb
worker-2   NotReady,SchedulingDisabled   worker   66m    v1.23.5+3afdacb
worker-3   NotReady,SchedulingDisabled   worker   69m    v1.23.5+3afdacb

$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.17   True        False         False      82m
baremetal                                  4.10.17   True        False         False      118m
cloud-controller-manager                   4.10.17   True        False         False      121m
cloud-credential                           4.10.17   True        False         False      144m
cluster-autoscaler                         4.10.17   True        False         False      118m
config-operator                            4.10.17   True        False         False      119m
console                                    4.10.17   True        False         False      87m
csi-snapshot-controller                    4.10.17   True        False         False      118m
dns                                        4.10.17   True        True          False      118m    DNS "default" reports Progressing=True: "Have 4 available node-resolver pods, want 7."
etcd                                       4.10.17   True        False         False      116m
image-registry                             4.10.17   True        False         False      109m
ingress                                    4.10.17   True        False         True       74m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-77bb87bf75-vs82w" cannot be scheduled: 0/7 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
insights                                   4.10.17   True        False         False      12s
kube-apiserver                             4.10.17   True        False         False      114m
kube-controller-manager                    4.10.17   True        False         False      115m
kube-scheduler                             4.10.17   True        False         False      115m
kube-storage-version-migrator              4.10.17   True        False         False      118m
machine-api                                4.10.17   True        False         False      115m
machine-approver                           4.10.17   True        False         False      118m
machine-config                             4.10.17   False       False         True       64m     Cluster not available for [{operator 4.10.17}]
marketplace                                4.10.17   True        False         False      118m
monitoring                                 4.10.17   False       True          True       58m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.10.17   True        True          True       119m    DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2022-06-13T16:57:02Z...
node-tuning                                4.10.17   True        False         False      118m
openshift-apiserver                        4.10.17   True        False         False      112m
openshift-controller-manager               4.10.17   True        False         False      114m
openshift-samples                          4.10.17   True        False         False      105m
operator-lifecycle-manager                 4.10.17   True        False         False      118m
operator-lifecycle-manager-catalog         4.10.17   True        False         False      118m
operator-lifecycle-manager-packageserver   4.10.17   True        False         False      112m
service-ca                                 4.10.17   True        False         False      119m
storage                                    4.10.17   True        False         False      119m

$ oc -n openshift-ovn-kubernetes get pod -o wide
NAME                   READY   STATUS    RESTARTS       AGE    IP              NODE       NOMINATED NODE   READINESS GATES
ovnkube-master-4wpzl   6/6     Running   6 (121m ago)   122m   192.168.52.12   master-1   <none>           <none>
ovnkube-master-cl5zd   6/6     Running   1 (111m ago)   122m   192.168.52.11   master-0   <none>           <none>
ovnkube-master-nmcxg   6/6     Running   6 (121m ago)   122m   192.168.52.13   master-2   <none>           <none>
ovnkube-node-4jkb6     5/5     Running   0              89m    192.168.52.22   worker-2   <none>           <none>
ovnkube-node-5njr7     5/5     Running   0              122m   192.168.52.12   master-1   <none>           <none>
ovnkube-node-7dbwq     5/5     Running   0              92m    192.168.52.23   worker-3   <none>           <none>
ovnkube-node-9962r     5/5     Running   0              91m    192.168.52.20   worker-0   <none>           <none>
ovnkube-node-jwcch     5/5     Running   0              122m   192.168.52.13   master-2   <none>           <none>
ovnkube-node-q2g9m     5/5     Running   0              122m   192.168.52.11   master-0   <none>           <none>
ovnkube-node-v2dww     5/5     Running   0              90m    192.168.52.21   worker-1   <none>           <none>

The 3 workers in "NotReady" status are unreachable from anywhere on the baremetal network:

$ for N in {0..3}; do echo worker-$N ===; ping -c2 -w0.1 worker-${N}; done
worker-0 ===
PING worker-0.cluster5.dfwt5g.lab (192.168.52.20) 56(84) bytes of data.
From provisioner.cluster5.dfwt5g.lab (192.168.52.10) icmp_seq=1 Destination Host Unreachable
From provisioner.cluster5.dfwt5g.lab (192.168.52.10) icmp_seq=2 Destination Host Unreachable

--- worker-0.cluster5.dfwt5g.lab ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1050ms
pipe 2
worker-1 ===
PING worker-1.cluster5.dfwt5g.lab (192.168.52.21) 56(84) bytes of data.
64 bytes from worker-1.cluster5.dfwt5g.lab (192.168.52.21): icmp_seq=1 ttl=64 time=0.871 ms
64 bytes from worker-1.cluster5.dfwt5g.lab (192.168.52.21): icmp_seq=2 ttl=64 time=0.183 ms

--- worker-1.cluster5.dfwt5g.lab ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.183/0.527/0.871/0.344 ms
worker-2 ===
PING worker-2.cluster5.dfwt5g.lab (192.168.52.22) 56(84) bytes of data.
From provisioner.cluster5.dfwt5g.lab (192.168.52.10) icmp_seq=1 Destination Host Unreachable
From provisioner.cluster5.dfwt5g.lab (192.168.52.10) icmp_seq=2 Destination Host Unreachable

--- worker-2.cluster5.dfwt5g.lab ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1017ms
pipe 2
worker-3 ===
PING worker-3.cluster5.dfwt5g.lab (192.168.52.23) 56(84) bytes of data.
From provisioner.cluster5.dfwt5g.lab (192.168.52.10) icmp_seq=1 Destination Host Unreachable
From provisioner.cluster5.dfwt5g.lab (192.168.52.10) icmp_seq=2 Destination Host Unreachable

--- worker-3.cluster5.dfwt5g.lab ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1064ms
pipe 2

The only way to connect to them is through their provisioning IPs.
Obtaining the provisioning IPs:

$ metal_pod=$(oc -n openshift-machine-api get pod -l baremetal.openshift.io/cluster-baremetal-operator=metal3-state -o json | jq -r .items[].metadata.name)
$ oc -n openshift-machine-api logs ${metal_pod} metal3-dnsmasq | awk '/DHCPOFFER/ {print $4}' | sort -u
172.22.1.161
172.22.1.252
172.22.3.227
172.22.4.131
172.22.4.67
172.22.7.243

They correspond as follows:
172.22.1.161  # master-1
172.22.1.252  # worker-0
172.22.3.227  # master-0
172.22.4.131  # worker-3
172.22.4.67   # worker-1
172.22.7.243  # worker-2

We will query the workers, as they are the ones having trouble:
172.22.4.67   # worker-1 OK
172.22.1.252  # worker-0 KO
172.22.7.243  # worker-2 KO
172.22.4.131  # worker-3 KO

Getting the IPs on the worker nodes:

$ for ip in 172.22.4.67 172.22.1.252 172.22.7.243 172.22.4.131; do echo ${ip} ===; ssh core@${ip} ip -4 -o a; echo; done
172.22.4.67 ===
1: lo    inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
4: eno5np0    inet 172.22.4.67/21 brd 172.22.7.255 scope global dynamic noprefixroute eno5np0\ valid_lft 67sec preferred_lft 67sec
14: enp1s0f4u4    inet 16.1.15.2/30 brd 16.1.15.3 scope global dynamic noprefixroute enp1s0f4u4\ valid_lft 8635196sec preferred_lft 8635196sec
19: br-ex    inet 192.168.52.21/24 brd 192.168.52.255 scope global dynamic noprefixroute br-ex\ valid_lft 6096sec preferred_lft 6096sec
19: br-ex    inet 192.168.52.18/32 scope global br-ex\ valid_lft forever preferred_lft forever
22: ovn-k8s-mp0    inet 10.129.2.2/23 brd 10.129.3.255 scope global ovn-k8s-mp0\ valid_lft forever preferred_lft forever

172.22.1.252 ===
1: lo    inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
5: eno5np0    inet 172.22.1.252/21 brd 172.22.7.255 scope global dynamic noprefixroute eno5np0\ valid_lft 117sec preferred_lft 117sec
14: enp1s0f4u4    inet 16.1.15.2/30 brd 16.1.15.3 scope global dynamic noprefixroute enp1s0f4u4\ valid_lft 8636550sec preferred_lft 8636550sec

172.22.7.243 ===
1: lo    inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
3: eno5np0    inet 172.22.7.243/21 brd 172.22.7.255 scope global dynamic noprefixroute eno5np0\ valid_lft 112sec preferred_lft 112sec

172.22.4.131 ===
1: lo    inet 127.0.0.1/8 scope host lo\ valid_lft forever preferred_lft forever
5: eno5np0    inet 172.22.4.131/21 brd 172.22.7.255 scope global dynamic noprefixroute eno5np0\ valid_lft 85sec preferred_lft 85sec
14: enp1s0f4u4    inet 16.1.15.2/30 brd 16.1.15.3 scope global dynamic noprefixroute enp1s0f4u4\ valid_lft 8636575sec preferred_lft 8636575sec

Getting the status of ovs-configuration:

$ for ip in 172.22.4.67 172.22.1.252 172.22.7.243 172.22.4.131; do echo ${ip} ===; ssh core@${ip} 2>/dev/null systemctl status ovs-configuration; echo; done
172.22.4.67 ===
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2022-06-13 16:37:29 UTC; 1h 20min ago
 Main PID: 4739 (code=exited, status=0/SUCCESS)
      CPU: 1.427s

Jun 13 16:37:31 worker-1 configure-ovs.sh[4739]: 16.1.15.0/30 dev enp1s0f4u4 proto kernel scope link src 16.1.15.2 metric 102
Jun 13 16:37:31 worker-1 configure-ovs.sh[4739]: 172.22.0.0/21 dev eno5np0 proto kernel scope link src 172.22.4.67 metric 100
Jun 13 16:37:31 worker-1 configure-ovs.sh[4739]: 192.168.52.0/24 dev br-ex proto kernel scope link src 192.168.52.21 metric 48
Jun 13 16:37:31 worker-1 configure-ovs.sh[4739]: 192.168.53.0/24 dev bond0.350 proto kernel scope link src 192.168.53.21 metric 400
Jun 13 16:37:31 worker-1 configure-ovs.sh[4739]: + ip -6 route show
Jun 13 16:37:31 worker-1 configure-ovs.sh[4739]: ::1 dev lo proto kernel metric 256 pref medium
Jun 13 16:37:31 worker-1 configure-ovs.sh[4739]: fe80::/64 dev br-ex proto kernel metric 48 pref medium
Jun 13 16:37:31 worker-1 configure-ovs.sh[4739]: fe80::/64 dev eno5np0 proto kernel metric 100 pref medium
Jun 13 16:37:31 worker-1 configure-ovs.sh[4739]: fe80::/64 dev enp1s0f4u4 proto kernel metric 102 pref medium
Jun 13 16:37:31 worker-1 configure-ovs.sh[4739]: + exit 0

172.22.1.252 ===
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2022-06-13 17:01:47 UTC; 56min ago
  Process: 7624 ExecStart=/usr/local/bin/configure-ovs.sh OVNKubernetes (code=exited, status=1/FAILURE)
 Main PID: 7624 (code=exited, status=1/FAILURE)
      CPU: 1.000s

Jun 13 17:01:49 localhost configure-ovs.sh[7624]: vlan protocol 802.1Q id 350 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jun 13 17:01:49 localhost configure-ovs.sh[7624]: + ip route show
Jun 13 17:01:49 localhost configure-ovs.sh[7624]: 16.1.15.0/30 dev enp1s0f4u4 proto kernel scope link src 16.1.15.2 metric 102
Jun 13 17:01:49 localhost configure-ovs.sh[7624]: 172.22.0.0/21 dev eno5np0 proto kernel scope link src 172.22.1.252 metric 100
Jun 13 17:01:49 localhost configure-ovs.sh[7624]: + ip -6 route show
Jun 13 17:01:49 localhost configure-ovs.sh[7624]: ::1 dev lo proto kernel metric 256 pref medium
Jun 13 17:01:49 localhost configure-ovs.sh[7624]: fe80::/64 dev eno5np0 proto kernel metric 100 pref medium
Jun 13 17:01:49 localhost configure-ovs.sh[7624]: fe80::/64 dev enp1s0f4u4 proto kernel metric 102 pref medium
Jun 13 17:01:49 localhost configure-ovs.sh[7624]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
Jun 13 17:01:49 localhost configure-ovs.sh[7624]: + exit 1

172.22.7.243 ===
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2022-06-13 17:01:42 UTC; 56min ago
  Process: 7402 ExecStart=/usr/local/bin/configure-ovs.sh OVNKubernetes (code=exited, status=1/FAILURE)
 Main PID: 7402 (code=exited, status=1/FAILURE)
      CPU: 1.003s

Jun 13 17:01:43 localhost configure-ovs.sh[7402]: 25: bond0.350@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
Jun 13 17:01:43 localhost configure-ovs.sh[7402]: link/ether b8:83:03:91:c5:11 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535
Jun 13 17:01:43 localhost configure-ovs.sh[7402]: vlan protocol 802.1Q id 350 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jun 13 17:01:43 localhost configure-ovs.sh[7402]: + ip route show
Jun 13 17:01:43 localhost configure-ovs.sh[7402]: 172.22.0.0/21 dev eno5np0 proto kernel scope link src 172.22.7.243 metric 100
Jun 13 17:01:43 localhost configure-ovs.sh[7402]: + ip -6 route show
Jun 13 17:01:43 localhost configure-ovs.sh[7402]: ::1 dev lo proto kernel metric 256 pref medium
Jun 13 17:01:43 localhost configure-ovs.sh[7402]: fe80::/64 dev eno5np0 proto kernel metric 100 pref medium
Jun 13 17:01:43 localhost configure-ovs.sh[7402]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
Jun 13 17:01:43 localhost configure-ovs.sh[7402]: + exit 1

172.22.4.131 ===
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2022-06-13 17:02:13 UTC; 55min ago
  Process: 7632 ExecStart=/usr/local/bin/configure-ovs.sh OVNKubernetes (code=exited, status=1/FAILURE)
 Main PID: 7632 (code=exited, status=1/FAILURE)
      CPU: 1.024s

Jun 13 17:02:14 localhost configure-ovs.sh[7632]: vlan protocol 802.1Q id 350 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
Jun 13 17:02:14 localhost configure-ovs.sh[7632]: + ip route show
Jun 13 17:02:14 localhost configure-ovs.sh[7632]: 16.1.15.0/30 dev enp1s0f4u4 proto kernel scope link src 16.1.15.2 metric 102
Jun 13 17:02:14 localhost configure-ovs.sh[7632]: 172.22.0.0/21 dev eno5np0 proto kernel scope link src 172.22.4.131 metric 100
Jun 13 17:02:14 localhost configure-ovs.sh[7632]: + ip -6 route show
Jun 13 17:02:14 localhost configure-ovs.sh[7632]: ::1 dev lo proto kernel metric 256 pref medium
Jun 13 17:02:14 localhost configure-ovs.sh[7632]: fe80::/64 dev eno5np0 proto kernel metric 100 pref medium
Jun 13 17:02:14 localhost configure-ovs.sh[7632]: fe80::/64 dev enp1s0f4u4 proto kernel metric 102 pref medium
Jun 13 17:02:14 localhost configure-ovs.sh[7632]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
Jun 13 17:02:14 localhost configure-ovs.sh[7632]: + exit 1

Number of reboots:

$ for ip in 172.22.4.67 172.22.1.252 172.22.7.243 172.22.4.131; do echo ${ip} ===; ssh core@${ip} 2>/dev/null last; echo; done
172.22.4.67 ===
reboot   system boot  4.18.0-305.49.1. Mon Jun 13 16:35   still running
reboot   system boot  4.18.0-305.45.1. Mon Jun 13 16:30 - 16:33  (00:03)

wtmp begins Mon Jun 13 16:30:09 2022

172.22.1.252 ===
core     pts/0        172.22.0.1       Mon Jun 13 17:51 - 17:52  (00:00)
reboot   system boot  4.18.0-305.49.1. Mon Jun 13 16:58   still running
reboot   system boot  4.18.0-305.49.1. Mon Jun 13 16:34 - 16:56  (00:21)
reboot   system boot  4.18.0-305.45.1. Mon Jun 13 16:29 - 16:32  (00:03)

wtmp begins Mon Jun 13 16:29:29 2022

172.22.7.243 ===
core     pts/0        172.22.0.1       Mon Jun 13 17:52 - 17:55  (00:02)
reboot   system boot  4.18.0-305.49.1. Mon Jun 13 16:58   still running
reboot   system boot  4.18.0-305.49.1. Mon Jun 13 16:36 - 16:55  (00:18)
reboot   system boot  4.18.0-305.45.1. Mon Jun 13 16:31 - 16:34  (00:03)

wtmp begins Mon Jun 13 16:31:10 2022

172.22.4.131 ===
core     pts/0        172.22.0.1       Mon Jun 13 17:52 - 17:52  (00:00)
reboot   system boot  4.18.0-305.49.1. Mon Jun 13 16:58   still running
reboot   system boot  4.18.0-305.49.1. Mon Jun 13 16:34 - 16:56  (00:22)
reboot   system boot  4.18.0-305.45.1. Mon Jun 13 16:29 - 16:32  (00:03)

wtmp begins Mon Jun 13 16:29:05 2022

Similarly to BZ 2055433, the nodes with more reboots are not getting an IP on br-ex.

Expected results:
All the nodes should get their expected IPs and stay in the cluster.

Additional info:
Logs are pulled through this job: https://www.distributed-ci.io/jobs/5d933fc9-b018-44e3-8eb7-683741c3f3c9/jobStates
However, the logs related to the workers will not be available there, as the workers are not reachable at their expected IPs; I pulled the ovs-configuration logs and I'm including them as an attachment.
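For anyone reproducing this: since the affected workers are only reachable over the provisioning network, the per-worker ovs-configuration logs can be pulled with something along these lines (a sketch on my side; the unit name is the one shown above, the output file naming is just illustrative):

$ for ip in 172.22.1.252 172.22.7.243 172.22.4.131; do
    ssh core@${ip} 2>/dev/null sudo journalctl -b -u ovs-configuration > ovs-configuration_${ip}.log   # current-boot logs of the failing unit
  done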
Looking into ovs-configuration_172.22.1.252.log:

configure-ovs initially configures the network correctly. Then there is a reboot. On reboot, configure-ovs removes the configuration it previously created so that it can rebuild and apply it again, as expected. For that, it removes the profiles it previously created from the filesystem, does a `nmcli connection reload`, and waits for the device it previously managed (in this case bond0) to be activated with whatever pre-existing network configuration was on the node before proceeding.

At this stage, the bond0 device never gets its IP configuration within the 60-second timeout and remains in state "connecting (getting IP configuration)". The profiles activating bond0 and its slaves are the expected ones, and the same ones that configured the network correctly before configure-ovs did anything:

Jun 13 17:00:32 localhost configure-ovs.sh[7624]: bond0:bond:connecting (getting IP configuration):none:none:/org/freedesktop/NetworkManager/Devices/20:bond0:75ac1a13-dbce-36e4-8ecb-c6ed6fce5322:/org/freedesktop/NetworkManager/ActiveConnection/20
Jun 13 17:00:32 localhost configure-ovs.sh[7624]: ens1f0:ethernet:connected:limited:limited:/org/freedesktop/NetworkManager/Devices/11:ens1f0:22f4a3bf-b99a-38ae-91a8-17796391e6aa:/org/freedesktop/NetworkManager/ActiveConnection/22
Jun 13 17:00:32 localhost configure-ovs.sh[7624]: ens1f1:ethernet:connected:limited:limited:/org/freedesktop/NetworkManager/Devices/12:ens1f1:1ffa4bbd-a16e-3ee8-8cde-582cd94ea8be:/org/freedesktop/NetworkManager/ActiveConnection/21

Somehow, DHCP never completes for bond0, and this is the last log from NM in that regard:

Jun 13 16:59:25 localhost NetworkManager[4726]: <info>  [1655139565.9936] dhcp4 (bond0): activation: beginning transaction (no timeout)

I do notice that the MAC address of bond0 changed during this process, from b8:83:03:91:c5:c8 when it worked to b8:83:03:91:c5:c9 when it didn't. I guess these MAC addresses correspond to the slave NICs. In any case, configure-ovs is behaving as expected.

@tonyg
- Is there any static assignment on the DHCP server done by MAC? If so, does it consider both slave NIC MAC addresses, since either of them could be assigned to the bond? Or should the bond0 profile be configured with `cloned-mac-address=b8:83:03:91:c5:c8`?
- Otherwise, can you provide both the DHCP server logs and NM debug logs?
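For reference, a minimal sketch of the two options above, assuming the bond profile is a NetworkManager connection literally named "bond0" (the connection name is an assumption on my side; adjust to the actual profile):

$ nmcli connection modify bond0 802-3-ethernet.cloned-mac-address b8:83:03:91:c5:c8   # pin the MAC that DHCP sees on bond0
$ nmcli connection up bond0
$ nmcli general logging level TRACE domains ALL      # turn on NM debug logging before the next attempt
$ journalctl -b -u NetworkManager > nm-debug.log     # collect the NM debug logs afterwards

The cloned-mac-address route should keep a single-MAC static DHCP assignment working regardless of which slave the bond picks; the logging commands are only there to capture the DHCP transaction in more detail.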
Jaime,

Thank you for the information. To answer your question: yes, we have a static assignment for each MAC, and we did not consider having both MAC addresses in the DHCP configuration. We are using dnsmasq, and this was the setup for a cluster:

dhcp-host=b8:83:03:91:c5:f8,192.168.12.11,master-0.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:92:c0:40,192.168.12.12,master-1.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:91:c5:34,192.168.12.13,master-2.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:8e:1e:10,192.168.12.20,worker-0.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:91:c5:20,192.168.12.21,worker-1.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:8e:0e:dc,192.168.12.22,worker-2.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:92:c0:48,192.168.12.23,worker-3.cluster1.dfwt5g.lab

I made a test in another cluster and confirmed your findings. So then I updated the DHCP config to include both MAC addresses:

dhcp-host=b8:83:03:91:c5:f8,b8:83:03:91:c5:f9,192.168.12.11,master-0.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:92:c0:40,b8:83:03:92:c0:41,192.168.12.12,master-1.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:91:c5:34,b8:83:03:91:c5:35,192.168.12.13,master-2.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:8e:1e:10,b8:83:03:8e:1e:11,192.168.12.20,worker-0.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:91:c5:20,b8:83:03:91:c5:21,192.168.12.21,worker-1.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:8e:0e:dc,b8:83:03:8e:0e:dd,192.168.12.22,worker-2.cluster1.dfwt5g.lab
dhcp-host=b8:83:03:92:c0:48,b8:83:03:92:c0:49,192.168.12.23,worker-3.cluster1.dfwt5g.lab

Then I rebooted the nodes. They are accessible after the reboot, with the baremetal IP assigned, but we observed that bond0 has the MAC of the second slave interface while br-ex has the MAC of the first slave interface. Is this expected? Do you see any problem with that?

[kni.dfwt5g.lab ~]$ for x in {0..3} ; do echo "===== worker-$x =====" ; ssh core@master-$x "ip a s bond0" 2>/dev/null ; done
===== worker-0 =====
15: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether b8:83:03:91:c5:f9 brd ff:ff:ff:ff:ff:ff
===== worker-1 =====
15: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether b8:83:03:92:c0:41 brd ff:ff:ff:ff:ff:ff
===== worker-2 =====
15: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether b8:83:03:91:c5:35 brd ff:ff:ff:ff:ff:ff
===== worker-3 =====

[kni.dfwt5g.lab ~]$ for x in {0..3} ; do echo "===== worker-$x =====" ; ssh core@master-$x "ip a s br-ex" 2>/dev/null ; done
===== worker-0 =====
19: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether b8:83:03:91:c5:f8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.12.11/24 brd 192.168.12.255 scope global dynamic noprefixroute br-ex
       valid_lft 3911sec preferred_lft 3911sec
    inet6 fe80::ba83:3ff:fe91:c5f8/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
===== worker-1 =====
19: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether b8:83:03:92:c0:40 brd ff:ff:ff:ff:ff:ff
    inet 192.168.12.12/24 brd 192.168.12.255 scope global dynamic noprefixroute br-ex
       valid_lft 7130sec preferred_lft 7130sec
    inet6 fe80::ba83:3ff:fe92:c040/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
===== worker-2 =====
19: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether b8:83:03:91:c5:34 brd ff:ff:ff:ff:ff:ff
    inet 192.168.12.13/24 brd 192.168.12.255 scope global dynamic noprefixroute br-ex
       valid_lft 7190sec preferred_lft 7190sec
    inet 192.168.12.16/32 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::ba83:3ff:fe91:c534/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
===== worker-3 =====

I'm running a full deployment again with this setup in case you want us to attach logs.
I redeployed the cluster using OCP 4.10.17, and I see that the behavior above does not happen on all the nodes (also, my echo command above was not showing the right server names, apologies about that). The deployment completed successfully, and only the workers get rebooted after applying the PAO profile. Mostly the master nodes show a different MAC between bond0 and br-ex, while most of the workers have matching MACs, except one worker (worker-2):

[kni.dfwt5g.lab ~]$ for x in {0..2} ; do echo "===== master-$x =====" ; ssh core@master-$x "ip a s br-ex | grep -B1 ether ; ip a s bond0" 2>/dev/null ; done
===== master-0 =====
19: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether b8:83:03:91:c5:f8 brd ff:ff:ff:ff:ff:ff
15: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether b8:83:03:91:c5:f9 brd ff:ff:ff:ff:ff:ff
===== master-1 =====
19: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether b8:83:03:92:c0:40 brd ff:ff:ff:ff:ff:ff
15: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether b8:83:03:92:c0:41 brd ff:ff:ff:ff:ff:ff
===== master-2 =====
19: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether b8:83:03:91:c5:34 brd ff:ff:ff:ff:ff:ff
15: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether b8:83:03:91:c5:35 brd ff:ff:ff:ff:ff:ff

[kni.dfwt5g.lab ~]$ for x in {0..3} ; do echo "===== worker-$x =====" ; ssh core@worker-$x "ip a s br-ex | grep -B1 ether ; ip a s bond0" 2>/dev/null ; done
===== worker-0 =====
27: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether b8:83:03:8e:1e:11 brd ff:ff:ff:ff:ff:ff
25: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether b8:83:03:8e:1e:11 brd ff:ff:ff:ff:ff:ff
===== worker-1 =====
27: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether b8:83:03:91:c5:21 brd ff:ff:ff:ff:ff:ff
25: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether b8:83:03:91:c5:21 brd ff:ff:ff:ff:ff:ff
===== worker-2 =====
27: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether b8:83:03:8e:0e:dd brd ff:ff:ff:ff:ff:ff
25: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether b8:83:03:8e:0e:dc brd ff:ff:ff:ff:ff:ff
===== worker-3 =====
27: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether b8:83:03:92:c0:49 brd ff:ff:ff:ff:ff:ff
25: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether b8:83:03:92:c0:49 brd ff:ff:ff:ff:ff:ff

I'm redeploying using 4.10.18 to see if the results are consistent. FYI, both bond0 slaves and bond0 use autoconnect-priority=99.
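In case it helps narrow this down, the bond's failover-MAC policy and currently active slave can be checked on each node with something like the following (a diagnostic sketch on my side; bond0/br-ex are the device names used above, and the nmcli connection name is assumed to match the device):

$ cat /sys/class/net/bond0/bonding/fail_over_mac   # e.g. "none 0" vs "active 1"
$ cat /sys/class/net/bond0/bonding/active_slave    # slave whose MAC bond0 currently carries
$ nmcli --get-values bond.options connection show bond0
$ ip -br link show | grep -E 'bond0|br-ex'         # quick MAC comparison between bond0 and br-ex

As a general bonding behavior, with fail_over_mac=0 (none) the bond and its slaves share one MAC that stays fixed, while with fail_over_mac=1 (active) the bond takes on the MAC of whichever slave is currently active, which could explain why bond0 and br-ex end up with different MACs on some nodes.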
*** Bug 2092129 has been marked as a duplicate of this bug. ***
Tested with fail_over_mac=1 on libvirt IPI:

Jun 27 14:20:00 master-0-0.qe.lab.redhat.com configure-ovs.sh[2100]: ++ nmcli --get-values bond.options conn show a661c652-5ada-3efd-9deb-f73f9d08a896
Jun 27 14:20:00 master-0-0.qe.lab.redhat.com configure-ovs.sh[1965]: + bond_opts=mode=active-backup,fail_over_mac=1,miimon=100
Jun 27 14:20:00 master-0-0.qe.lab.redhat.com configure-ovs.sh[1965]: + '[' -n mode=active-backup,fail_over_mac=1,miimon=100 ']'
Jun 27 14:20:00 master-0-0.qe.lab.redhat.com configure-ovs.sh[1965]: + extra_phys_args+=(bond.options "${bond_opts}")
Jun 27 14:20:00 master-0-0.qe.lab.redhat.com configure-ovs.sh[1965]: + MODE_REGEX='(^|,)mode=active-backup(,|$)'
Jun 27 14:20:00 master-0-0.qe.lab.redhat.com configure-ovs.sh[1965]: + MAC_REGEX='(^|,)fail_over_mac=(1|active|2|follow)(,|$)'
Jun 27 14:20:00 master-0-0.qe.lab.redhat.com configure-ovs.sh[1965]: + [[ mode=active-backup,fail_over_mac=1,miimon=100 =~ (^|,)mode=active-backup(,|$) ]]
Jun 27 14:20:00 master-0-0.qe.lab.redhat.com configure-ovs.sh[1965]: + [[ mode=active-backup,fail_over_mac=1,miimon=100 =~ (^|,)fail_over_mac=(1|active|2|follow)(,|$) ]]
Jun 27 14:20:00 master-0-0.qe.lab.redhat.com configure-ovs.sh[1965]: + clone_mac=0
Jun 27 14:20:00 master-0-0.qe.lab.redhat.com configure-ovs.sh[1965]: + '[' '!' 0 = 0 ']'

MAC changes after disconnecting primary slave link:

3: enp5s0: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc fq_codel master bond0 state DOWN mode DEFAULT group default qlen 1000\    link/ether 52:54:00:b8:80:b8 brd ff:ff:ff:ff:ff:ff
4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP mode DEFAULT group default qlen 1000\    link/ether 52:54:00:f9:a6:75 brd ff:ff:ff:ff:ff:ff
11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\    link/ether 52:54:00:f9:a6:75 brd ff:ff:ff:ff:ff:ff
12: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 52:54:00:b8:80:b8 brd ff:ff:ff:ff:ff:ff
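For reference, a bond profile that those regexes match (active-backup with a fail_over_mac policy of 1/active/2/follow, so configure-ovs sets clone_mac=0 and skips MAC cloning) could be created roughly like this; a sketch, with illustrative connection and interface names:

$ nmcli connection add type bond con-name bond0 ifname bond0 \
      bond.options "mode=active-backup,miimon=100,fail_over_mac=active" ipv4.method auto
$ nmcli connection add type ethernet con-name bond0-port-enp5s0 ifname enp5s0 master bond0
$ nmcli connection add type ethernet con-name bond0-port-enp6s0 ifname enp6s0 master bond0

With this policy the kernel keeps bond0's MAC in sync with the currently active slave, which is the MAC change observed above when the primary slave link goes down.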
RHEL 8 vSphere DHCP, fail_over_mac=0: slaves have the same MAC; bond0 MAC does not change after disconnecting the primary slave.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens192: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc mq master bond0 state DOWN mode DEFAULT group default qlen 1000\    link/ether 00:50:56:ac:c4:28 brd ff:ff:ff:ff:ff:ff
3: ens224: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000\    link/ether 00:50:56:ac:c4:28 brd ff:ff:ff:ff:ff:ff permaddr 00:50:56:ac:4f:59
4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether b2:e5:ad:9f:40:23 brd ff:ff:ff:ff:ff:ff
5: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 72:b2:77:c5:fa:30 brd ff:ff:ff:ff:ff:ff
6: ovn-k8s-mp0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 5e:76:3e:aa:50:98 brd ff:ff:ff:ff:ff:ff
7: br-int: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether 72:11:f6:47:1c:20 brd ff:ff:ff:ff:ff:ff
10: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\    link/ether 00:50:56:ac:c4:28 brd ff:ff:ff:ff:ff:ff
11: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 00:50:56:ac:c4:28 brd ff:ff:ff:ff:ff:ff

RHCOS UPI vSphere static-ip, fail_over_mac=0:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens192: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc mq master bond0 state DOWN mode DEFAULT group default qlen 1000\    link/ether 00:50:56:ac:04:8f brd ff:ff:ff:ff:ff:ff
3: ens224: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000\    link/ether 00:50:56:ac:04:8f brd ff:ff:ff:ff:ff:ff permaddr 00:50:56:ac:41:cd
4: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether 7e:e6:ed:fa:10:5c brd ff:ff:ff:ff:ff:ff
5: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether ee:11:ba:56:ca:40 brd ff:ff:ff:ff:ff:ff
6: ovn-k8s-mp0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether f6:a2:c2:c4:fb:f4 brd ff:ff:ff:ff:ff:ff
7: br-int: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether 86:2f:27:81:0f:20 brd ff:ff:ff:ff:ff:ff
10: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\    link/ether 00:50:56:ac:04:8f brd ff:ff:ff:ff:ff:ff
11: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 00:50:56:ac:04:8f brd ff:ff:ff:ff:ff:ff
RHCOS libvirt IPI DHCP, fail_over_mac=0: bond0 and br-ex keep working after the primary slave is disconnected.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: enp5s0: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc fq_codel master bond0 state DOWN mode DEFAULT group default qlen 1000\    link/ether 52:54:00:9b:66:d0 brd ff:ff:ff:ff:ff:ff
4: enp6s0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc fq_codel master bond0 state UP mode DEFAULT group default qlen 1000\    link/ether 52:54:00:9b:66:d0 brd ff:ff:ff:ff:ff:ff permaddr 52:54:00:59:28:45
5: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether f6:a3:26:98:3d:e6 brd ff:ff:ff:ff:ff:ff
6: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether ca:fe:4b:2a:a5:78 brd ff:ff:ff:ff:ff:ff
7: ovn-k8s-mp0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 16:cc:80:a7:b4:6d brd ff:ff:ff:ff:ff:ff
8: br-int: <BROADCAST,MULTICAST> mtu 1400 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether da:59:37:b1:5c:af brd ff:ff:ff:ff:ff:ff
11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\    link/ether 52:54:00:9b:66:d0 brd ff:ff:ff:ff:ff:ff
12: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\    link/ether 52:54:00:9b:66:d0 brd ff:ff:ff:ff:ff:ff
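For completeness, the primary-slave link failure used in these libvirt tests can be emulated from the hypervisor and checked on the node with something like this (sketch; the domain name and device are illustrative placeholders):

$ virsh domif-setlink <domain> <primary-slave-device-or-mac> down                 # drop the primary slave's link at the hypervisor
$ ssh core@<node> "ip -br link show | grep -E 'bond0|br-ex'; grep 'Currently Active Slave' /proc/net/bonding/bond0"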
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069