Bug 2087021 - configure-ovs.sh fails, blocking new RHEL node from being scaled up on cluster without manual reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.z
Assignee: Periyasamy Palanisamy
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On: 2088519
Blocks:
Reported: 2022-05-17 06:57 UTC by Paul Webster
Modified: 2022-08-09 14:01 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2088519 (view as bug list)
Environment:
Last Closed: 2022-08-09 14:00:58 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 3160 | 0 | None | Merged | [release-4.9] Bug 2089763: configure-ovs: avoid restarting NetworkManager | 2022-07-26 08:50:30 UTC
Github | openshift machine-config-operator pull 3183 | 0 | None | Merged | [release-4.9] Bug 2089763: configure-ovs: persist profiles after auto-connect has been set | 2022-07-25 10:56:27 UTC
Github | openshift machine-config-operator pull 3188 | 0 | None | Merged | [release-4.9] Bug 2098099: configure-ovs: clone connection to avoid selinux problems | 2022-07-25 10:56:29 UTC
Red Hat Product Errata | RHSA-2022:5879 | 0 | None | None | None | 2022-08-09 14:01:27 UTC

Description Paul Webster 2022-05-17 06:57:15 UTC
Description of problem:

During an attempt to add a new RHEL 8.5 node to an existing OCP 4.9.28 cluster using the scaleup Ansible playbook, the node failed to reach the Ready state. The debug status of the kubelet service collected by Ansible shows:

E0504 11:46:35.931390    9709 kubelet.go:2360] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?"

Rebooting the node resolved the issue.
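
A minimal sketch for confirming the symptom in the kubelet message above before rebooting. The helper name has_cni_config is hypothetical, and the list of file extensions it checks (.conf, .conflist, .json) is an assumption rather than something stated in this report.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helper; not part of any OpenShift tooling.
# Returns 0 if the directory holds at least one CNI configuration file.
# Assumption: configuration files end in .conf, .conflist, or .json.
has_cni_config() {
  local dir=$1 f
  for f in "$dir"/*.conf "$dir"/*.conflist "$dir"/*.json; do
    # An unmatched glob stays literal, so test that the file really exists.
    [ -f "$f" ] && return 0
  done
  return 1
}

# Directory taken from the kubelet error message.
cni_dir=/etc/kubernetes/cni/net.d
if has_cni_config "$cni_dir"; then
  echo "CNI configuration present in $cni_dir"
else
  echo "no CNI configuration in $cni_dir"
fi
```

If the directory is empty, the next place to look is probably the ovs-configuration service log, as the sosreport excerpt in this description suggests.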

Logs from a sosreport taken prior to the reboot indicate that the configure-ovs.sh script, run by the ovs-configuration service, failed to configure the network:

May 04 11:35:24 XXXXXXXX configure-ovs.sh[3668]: + for file in "${files[@]}"
May 04 11:35:24 XXXXXXXX configure-ovs.sh[3668]: ++ basename /etc/NetworkManager/systemConnectionsMerged/pteam0 slave 1-slave-ovs-clone.nmconnection
May 04 11:35:24 XXXXXXXX configure-ovs.sh[3668]: basename: extra operand ‘1-slave-ovs-clone.nmconnection’
May 04 11:35:24 XXXXXXXX configure-ovs.sh[3668]: Try 'basename --help' for more information.
May 04 11:35:24 XXXXXXXX configure-ovs.sh[3668]: + file=
May 04 11:35:24 XXXXXXXX systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=1/FAILURE
May 04 11:35:24 XXXXXXXX systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
May 04 11:35:24 XXXXXXXX systemd[1]: Failed to start Configures OVS with proper host networking configuration.
May 04 11:35:24 XXXXXXXX systemd[1]: ovs-configuration.service: Consumed 1.783s CPU time
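
The trace above is consistent with the path being expanded without quotes, so that the spaces in the profile file name split it into several basename operands. The following is an illustrative sketch of that effect, not the actual configure-ovs.sh code; the helper count_args and the variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the actual configure-ovs.sh code.

# Path shaped like the one in the log above (file name containing spaces).
path='/etc/NetworkManager/systemConnectionsMerged/pteam0 slave 1-slave-ovs-clone.nmconnection'

# Prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted expansion is word-split on whitespace, so a command such as
# basename would receive several operands instead of one path.
unquoted_count=$(count_args $path)

# Quoted expansion passes the path as a single argument.
quoted_count=$(count_args "$path")

# With quoting, basename returns the full file name, spaces included.
name=$(basename "$path")

# Pure-shell alternative that avoids the external command entirely.
name_alt=${path##*/}

echo "unquoted: $unquoted_count argument(s), quoted: $quoted_count argument(s)"
echo "name: $name"
```

The unquoted expansion yields three arguments, which matches the "extra operand" complaint from basename in the log; the quoted form yields one.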

Version-Release number of selected component (if applicable):

Red Hat OpenShift Container Platform 4.9.28
Red Hat Enterprise Linux 8.5

How reproducible:

Unknown

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:

The initial configure-ovs.sh run succeeds and sets up the host network configuration so that the node can be added to the cluster.

Additional info:

Comment 8 Ross Brattain 2022-06-15 00:22:29 UTC
Scale succeeded with https://github.com/openshift/machine-config-operator/pull/3188

4.9.0-0.ci.test-2022-06-14-141650-ci-ln-rl16bwt-latest


o49v23-xq6ss-rhel-0   Ready    worker   57m   v1.22.8+f34b40c   172.31.249.178   172.31.249.178   Red Hat Enterprise Linux 8.4 (Ootpa)                           4.18.0-372.9.1.el8.x86_64      cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8
o49v23-xq6ss-rhel-1   Ready    worker   57m   v1.22.8+f34b40c   172.31.249.122   172.31.249.122   Red Hat Enterprise Linux 8.4 (Ootpa)                           4.18.0-372.9.1.el8.x86_64      cri-o://1.22.5-3.rhaos4.9.gitb6d3a87.el8


o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + ip route show
o49v23-xq6ss-rhel-1 configure-ovs.sh[1890]: default via 172.31.248.1 dev br-ex proto dhcp metric 49
o49v23-xq6ss-rhel-1 configure-ovs.sh[1890]: 172.31.248.0/23 dev br-ex proto kernel scope link src 172.31.249.122 metric 49
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + ip -6 route show
o49v23-xq6ss-rhel-1 configure-ovs.sh[1891]: ::1 dev lo proto kernel metric 256 pref medium
o49v23-xq6ss-rhel-1 configure-ovs.sh[1891]: fe80::/64 dev br-ex proto kernel metric 1024 pref medium
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + exit 0
o49v23-xq6ss-rhel-1 systemd[1]: ovs-configuration.service: Succeeded.
o49v23-xq6ss-rhel-1 systemd[1]: Started Configures OVS with proper host networking configuration.
o49v23-xq6ss-rhel-1 systemd[1]: ovs-configuration.service: Consumed 1.533s CPU time


o49v23-xq6ss-rhel-0 configure-ovs.sh[1361]: + ip route show
o49v23-xq6ss-rhel-0 configure-ovs.sh[1894]: default via 172.31.248.1 dev br-ex proto dhcp metric 49
o49v23-xq6ss-rhel-0 configure-ovs.sh[1894]: 172.31.248.0/23 dev br-ex proto kernel scope link src 172.31.249.178 metric 49
o49v23-xq6ss-rhel-0 configure-ovs.sh[1361]: + ip -6 route show
o49v23-xq6ss-rhel-0 configure-ovs.sh[1895]: ::1 dev lo proto kernel metric 256 pref medium
o49v23-xq6ss-rhel-0 configure-ovs.sh[1895]: fe80::/64 dev br-ex proto kernel metric 1024 pref medium
o49v23-xq6ss-rhel-0 configure-ovs.sh[1361]: + exit 0
o49v23-xq6ss-rhel-0 systemd[1]: ovs-configuration.service: Succeeded.
o49v23-xq6ss-rhel-0 systemd[1]: Started Configures OVS with proper host networking configuration.
o49v23-xq6ss-rhel-0 systemd[1]: ovs-configuration.service: Consumed 1.321s CPU time

Comment 9 Ross Brattain 2022-06-15 00:24:51 UTC
Logs with basename:

o49v23-xq6ss-rhel-1 configure-ovs.sh[1830]: ++ basename /etc/NetworkManager/systemConnectionsMerged/br-ex
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=br-ex

o49v23-xq6ss-rhel-1 configure-ovs.sh[1833]: ++ basename /etc/NetworkManager/systemConnectionsMerged/br-ex.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=br-ex.nmconnection

o49v23-xq6ss-rhel-1 configure-ovs.sh[1838]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-if-br-ex

o49v23-xq6ss-rhel-1 configure-ovs.sh[1842]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-if-br-ex.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-if-br-ex.nmconnection

o49v23-xq6ss-rhel-1 configure-ovs.sh[1847]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-port-br-ex
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-port-br-ex

o49v23-xq6ss-rhel-1 configure-ovs.sh[1850]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-port-br-ex.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-port-br-ex.nmconnection

o49v23-xq6ss-rhel-1 configure-ovs.sh[1853]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-if-phys0
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-if-phys0

o49v23-xq6ss-rhel-1 configure-ovs.sh[1855]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-if-phys0.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-if-phys0.nmconnection

o49v23-xq6ss-rhel-1 configure-ovs.sh[1859]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-port-phys0
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-port-phys0

o49v23-xq6ss-rhel-1 configure-ovs.sh[1862]: ++ basename /etc/NetworkManager/systemConnectionsMerged/ovs-port-phys0.nmconnection
o49v23-xq6ss-rhel-1 configure-ovs.sh[1357]: + file=ovs-port-phys0.nmconnection

Comment 22 Ross Brattain 2022-07-25 15:51:23 UTC
Re-tested with https://github.com/openshift/machine-config-operator/pull/3254 for BZ 2108538


vSphere UPI RHCOS active_backup fail_over_mac=0

/etc/NetworkManager/systemConnectionsMerged/ens192     test .nmconnection
/etc/NetworkManager/systemConnectionsMerged/ens224     test .nmconnection
/etc/NetworkManager/systemConnectionsMerged/ens256     test .nmconnection


libvirt IPI RHCOS DHCP active_backup fail_over_mac=0

/etc/NetworkManager/systemConnectionsMerged/bond0 test .nmconnection
/etc/NetworkManager/systemConnectionsMerged/enp5s0 test .nmconnection
/etc/NetworkManager/systemConnectionsMerged/enp6s0 test .nmconnection


PR 3254 (BZ 2108538) is needed to make sure bond0 MAC == br-ex MAC on vSphere.


Spaces in file names work.

Spaces in NetworkManager ids do not work; that depends on BZ 2104386.

Rebooting after link failure is also risky; that depends on BZ 2103899.
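
For the file-name case tested in this comment, a hedged sketch of a loop that keeps names with runs of spaces intact. The function list_profile_names is hypothetical and is not taken from configure-ovs.sh; it only illustrates the quoting pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from configure-ovs.sh.
# Prints the base name of every regular file in the given directory,
# one per line, keeping embedded spaces intact.
list_profile_names() {
  local dir=$1 file
  local -a files
  shopt -s nullglob            # an empty directory yields an empty array
  files=("$dir"/*)
  shopt -u nullglob
  for file in "${files[@]}"; do
    [ -f "$file" ] || continue
    # Parameter expansion strips the directory part without word splitting.
    printf '%s\n' "${file##*/}"
  done
}

# Example against a scratch directory with names like those listed above.
demo_dir=$(mktemp -d)
touch "$demo_dir/ens192     test .nmconnection" "$demo_dir/bond0 test .nmconnection"
list_profile_names "$demo_dir"
rm -rf "$demo_dir"
```

Collecting the glob into an array and expanding it as "${files[@]}" keeps each file name as one word regardless of how many spaces it contains.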

Comment 23 Ross Brattain 2022-07-25 15:53:18 UTC
Correction in last comment: "vSphere UPI RHCOS active_backup fail_over_mac=0" should be "RHEL8 vSphere DHCP active_backup fail_over_mac=0"

Comment 24 Ross Brattain 2022-07-25 16:15:07 UTC
I should also note that we now set 

autoconnect-priority=99

in the slave .nmconnection files.

See BZ 2055433 comment 1 and BZ 2089943 comment 9
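
A small sketch for checking that setting in a profile file. It assumes the key lives in the [connection] section of a NetworkManager keyfile; the function name get_autoconnect_priority and the sample profile contents (id, master, and so on) are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper for inspecting a NetworkManager keyfile.
# Prints the autoconnect-priority value from the [connection] section,
# or nothing if the key is absent there.
get_autoconnect_priority() {
  awk -F= '
    /^\[/ { in_conn = ($0 == "[connection]"); next }   # track current section
    in_conn && $1 == "autoconnect-priority" { print $2; exit }
  ' "$1"
}

# Sample profile; every value here is made up for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[connection]
id=example-slave-ovs-clone
type=ethernet
autoconnect-priority=99
master=bond0
slave-type=bond

[ethernet]
EOF

echo "autoconnect-priority: $(get_autoconnect_priority "$sample")"
rm -f "$sample"
```

On a real node the function would be pointed at the slave profile files under the NetworkManager connections directory instead of a scratch file.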

Comment 26 Ross Brattain 2022-07-27 23:34:51 UTC
Verified.

PR 3254 is in 4.9.0-0.nightly-2022-07-26-141848

Comment 29 errata-xmlrpc 2022-08-09 14:00:58 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.9.45 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5879

