Description of problem:

While attempting to add a new RHEL 8.5 node to an existing OCP 4.9.28 cluster using the scaleup Ansible playbook, the node failed to report Ready. The debug status of the kubelet service collected by Ansible shows:

E0504 11:46:35.931390 9709 kubelet.go:2360] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?"

Rebooting the node resolved the issue.

Logs from a sosreport taken prior to the reboot indicate that the configure-ovs.sh script run by the ovs-configuration service failed to configure the network. The connection file name contains spaces, and basename received it as several operands:

May 04 11:35:24 XXXXXXXX configure-ovs.sh[3668]: + for file in "${files[@]}"
May 04 11:35:24 XXXXXXXX configure-ovs.sh[3668]: ++ basename /etc/NetworkManager/systemConnectionsMerged/pteam0 slave 1-slave-ovs-clone.nmconnection
May 04 11:35:24 XXXXXXXX configure-ovs.sh[3668]: basename: extra operand ‘1-slave-ovs-clone.nmconnection’
May 04 11:35:24 XXXXXXXX configure-ovs.sh[3668]: Try 'basename --help' for more information.
May 04 11:35:24 XXXXXXXX configure-ovs.sh[3668]: + file=
May 04 11:35:24 XXXXXXXX systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=1/FAILURE
May 04 11:35:24 XXXXXXXX systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
May 04 11:35:24 XXXXXXXX systemd[1]: Failed to start Configures OVS with proper host networking configuration.
May 04 11:35:24 XXXXXXXX systemd[1]: ovs-configuration.service: Consumed 1.783s CPU time

Version-Release number of selected component (if applicable):
Red Hat OpenShift Container Platform 4.9.28
Red Hat Enterprise Linux 8.5

How reproducible:
Unknown

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
The initial configure-ovs.sh run succeeds and sets up the host network configuration so that the node can be added to the cluster.

Additional info:
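The `basename: extra operand` error in the log above is what GNU basename prints when it receives more than two operands. A minimal sketch of how that can happen, assuming the path was expanded without quotes (this is an illustration, not the actual configure-ovs.sh code; the variable names are hypothetical and the path is modeled on the one in the log):

```shell
#!/usr/bin/env bash
# Sketch only: not the actual configure-ovs.sh code.
# Hypothetical path modeled on the file name in the failing log.
path='/etc/NetworkManager/systemConnectionsMerged/pteam0 slave 1-slave-ovs-clone.nmconnection'

# Unquoted expansion: the shell splits the path on spaces, so basename
# receives three operands and exits non-zero ("extra operand" with GNU coreutils).
unquoted_rc=0
unquoted_name=$(basename $path 2>/dev/null) || unquoted_rc=$?

# Quoted expansion: basename receives a single operand and returns the full name.
quoted_name=$(basename "$path")

printf 'unquoted: rc=%s name=[%s]\n' "$unquoted_rc" "$unquoted_name"
printf 'quoted:   name=[%s]\n' "$quoted_name"
```

With the quoted form, the file name keeps its spaces and the command exits successfully, which matches the empty `file=` assignment and the service failure seen only in the unquoted case.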
Scale succeeded with https://github.com/openshift/machine-config-operator/pull/3188

4.9.0-0.ci.test-2022-06-14-141650-ci-ln-rl16bwt-latest

o49v23-xq6ss-rhel-0 Ready worker 57m v1.22.8+f34b40c 172.31.249.178 172.31.249.178 Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-372.9.1.el8.x86_64 cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
o49v23-xq6ss-rhel-1 Ready worker 57m v1.22.8+f34b40c 172.31.249.122 172.31.249.122 Red Hat Enterprise Linux 8.4 (Ootpa) 4.18.0-372.9.1.el8.x86_64 cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8

o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + ip route show
o49v23-xq6ss-rhel-1 configure-ovs.sh[1890]: default via 172.31.248.1 dev br-ex proto dhcp metric 49
o49v23-xq6ss-rhel-1 configure-ovs.sh[1890]: 172.31.248.0/23 dev br-ex proto kernel scope link src 172.31.249.122 metric 49
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + ip -6 route show
o49v23-xq6ss-rhel-1 configure-ovs.sh[1891]: ::1 dev lo proto kernel metric 256 pref medium
o49v23-xq6ss-rhel-1 configure-ovs.sh[1891]: fe80::/64 dev br-ex proto kernel metric 1024 pref medium
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + exit 0
o49v23-xq6ss-rhel-1 systemd[1]: ovs-configuration.service: Succeeded.
o49v23-xq6ss-rhel-1 systemd[1]: Started Configures OVS with proper host networking configuration.
o49v23-xq6ss-rhel-1 systemd[1]: ovs-configuration.service: Consumed 1.533s CPU time

o49v23-xq6ss-rhel-0 configure-ovs.sh[1361]: + ip route show
o49v23-xq6ss-rhel-0 configure-ovs.sh[1894]: default via 172.31.248.1 dev br-ex proto dhcp metric 49
o49v23-xq6ss-rhel-0 configure-ovs.sh[1894]: 172.31.248.0/23 dev br-ex proto kernel scope link src 172.31.249.178 metric 49
o49v23-xq6ss-rhel-0 configure-ovs.sh[1361]: + ip -6 route show
o49v23-xq6ss-rhel-0 configure-ovs.sh[1895]: ::1 dev lo proto kernel metric 256 pref medium
o49v23-xq6ss-rhel-0 configure-ovs.sh[1895]: fe80::/64 dev br-ex proto kernel metric 1024 pref medium
o49v23-xq6ss-rhel-0 configure-ovs.sh[1361]: + exit 0
o49v23-xq6ss-rhel-0 systemd[1]: ovs-configuration.service: Succeeded.
o49v23-xq6ss-rhel-0 systemd[1]: Started Configures OVS with proper host networking configuration.
o49v23-xq6ss-rhel-0 systemd[1]: ovs-configuration.service: Consumed 1.321s CPU time
Logs with basename:

o49v23-xq6ss-rhel-1 configure-ovs.sh[1830]: ++ basename /etc/NetworkManager/systemConnectionsMerged/br-ex
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=br-ex
o49v23-xq6ss-rhel-1 configure-ovs.sh[1833]: ++ basename /etc/NetworkManager/systemConnectionsMerged/br-ex.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=br-ex.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1838]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-if-br-ex
o49v23-xq6ss-rhel-1 configure-ovs.sh[1842]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-if-br-ex.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1847]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-port-br-ex
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-port-br-ex
o49v23-xq6ss-rhel-1 configure-ovs.sh[1850]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-port-br-ex.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-port-br-ex.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1853]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-if-phys0
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-if-phys0
o49v23-xq6ss-rhel-1 configure-ovs.sh[1855]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-if-phys0.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-if-phys0.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1859]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-port-phys0
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-port-phys0
o49v23-xq6ss-rhel-1 configure-ovs.sh[1862]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-port-phys0.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-port-phys0.nmconnection
Re-tested with https://github.com/openshift/machine-config-operator/pull/3254 for BZ 2108538

vSphere UPI RHCOS active_backup fail_over_mac=0
/etc/NetworkManager/systemConnectionsMerged/ens192 test .nmconnection
/etc/NetworkManager/systemConnectionsMerged/ens224 test .nmconnection
/etc/NetworkManager/systemConnectionsMerged/ens256 test .nmconnection

libvirt IPI RHCOS DHCP active_backup fail_over_mac=0
/etc/NetworkManager/systemConnectionsMerged/bond0 test .nmconnection
/etc/NetworkManager/systemConnectionsMerged/enp5s0 test .nmconnection
/etc/NetworkManager/systemConnectionsMerged/enp6s0 test .nmconnection

BZ 2108538, PR 3254 is needed to make sure bond0 MAC == br-ex MAC for vSphere.

Spaces in file names work.
Spaces in NetworkManager IDs do not work; that depends on BZ 2104386.
Rebooting after a link failure is also risky; that depends on BZ 2103899.
Correction in last comment: "vSphere UPI RHCOS active_backup fail_over_mac=0" should be "RHEL8 vSphere DHCP active_backup fail_over_mac=0"
I should also note that we now set autoconnect-priority=99 in the slave .nmconnection files. See BZ 2055433 comment 1 and BZ 2089943 comment 9.
Verified. PR 3254 is in 4.9.0-0.nightly-2022-07-26-141848
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.9.45 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5879