Description of problem:

After installation of OCP 4.10, while the workers are attempting to join the cluster, some of them never become fully available due to problems with ovs-configuration. This installation is on baremetal, using IPI.

Version-Release number of selected component (if applicable):
4.10.0-rc.2 2022-02-11

How reproducible:
Quite frequently

Steps to Reproduce:
1. Install OCP 4.10.0-rc2 on baremetal
2. Wait for the workers to attempt to join the cluster

Actual results:

$ oc get nodes
NAME       STATUS                     ROLES    AGE    VERSION
master-0   Ready                      master   105m   v1.23.3+2e8bad7
master-1   Ready                      master   104m   v1.23.3+2e8bad7
master-2   Ready                      master   103m   v1.23.3+2e8bad7
worker-0   Ready,SchedulingDisabled   worker   73m    v1.23.3+2e8bad7
worker-1   Ready,SchedulingDisabled   worker   73m    v1.23.3+2e8bad7
worker-2   Ready,SchedulingDisabled   worker   73m    v1.23.3+2e8bad7
worker-3   Ready                      worker   70m    v1.23.3+2e8bad7

$ oc get co
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.10.0-rc.2   True   False   False   65m
baremetal   4.10.0-rc.2   True   False   False   100m
cloud-controller-manager   4.10.0-rc.2   True   False   False   105m
cloud-credential   4.10.0-rc.2   True   False   False   126m
cluster-autoscaler   4.10.0-rc.2   True   False   False   100m
config-operator   4.10.0-rc.2   True   False   False   101m
console   4.10.0-rc.2   True   False   False   70m
csi-snapshot-controller   4.10.0-rc.2   True   False   False   100m
dns   4.10.0-rc.2   True   True   True   100m   DNS default is degraded
etcd   4.10.0-rc.2   True   False   False   99m
image-registry   4.10.0-rc.2   True   False   False   92m
ingress   4.10.0-rc.2   True   False   True   60m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-86d7dcd577-52hsx" cannot be scheduled: 0/7 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) were unschedulable. Make sure you have sufficient worker nodes.)
insights   4.10.0-rc.2   True   False   False   94m
kube-apiserver   4.10.0-rc.2   True   False   False   96m
kube-controller-manager   4.10.0-rc.2   True   False   False   98m
kube-scheduler   4.10.0-rc.2   True   False   False   98m
kube-storage-version-migrator   4.10.0-rc.2   True   False   False   101m
machine-api   4.10.0-rc.2   True   False   False   97m
machine-approver   4.10.0-rc.2   True   False   False   100m
machine-config   4.10.0-rc.2   True   False   False   94m
marketplace   4.10.0-rc.2   True   False   False   100m
monitoring   4.10.0-rc.2   False   True   True   45m   Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network   4.10.0-rc.2   True   True   True   101m   DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-kgs92 is in CrashLoopBackOff State...
node-tuning   4.10.0-rc.2   True   False   False   60m
openshift-apiserver   4.10.0-rc.2   True   False   False   94m
openshift-controller-manager   4.10.0-rc.2   True   False   False   98m
openshift-samples   4.10.0-rc.2   True   False   False   87m
operator-lifecycle-manager   4.10.0-rc.2   True   False   False   100m
operator-lifecycle-manager-catalog   4.10.0-rc.2   True   False   False   100m
operator-lifecycle-manager-packageserver   4.10.0-rc.2   True   False   False   94m
service-ca   4.10.0-rc.2   True   False   False   101m
storage   4.10.0-rc.2   True   False   False   101m

$ oc -n openshift-ovn-kubernetes get pod -o wide
NAME   READY   STATUS   RESTARTS   AGE   IP   NODE   NOMINATED NODE   READINESS GATES
ovnkube-master-5gghf   6/6   Running   6 (103m ago)   105m   192.168.42.13   master-2   <none>   <none>
ovnkube-master-6ktmn   6/6   Running   6 (103m ago)   105m   192.168.42.12   master-1   <none>   <none>
ovnkube-master-zzjx4   6/6   Running   1 (95m ago)   105m   192.168.42.11   master-0   <none>   <none>
ovnkube-node-6xpdx   5/5   Running   0   72m   192.168.42.23   worker-3   <none>   <none>
ovnkube-node-h58gq   5/5   Running   0   105m   192.168.42.12   master-1   <none>   <none>
ovnkube-node-kgs92   4/5   CrashLoopBackOff   20 (54s ago)   75m   192.168.42.22   worker-2   <none>   <none>
ovnkube-node-pn4bs   5/5   Running   0   105m   192.168.42.11   master-0   <none>   <none>
ovnkube-node-xw8vg   4/5   CrashLoopBackOff   20 (12s ago)   75m   192.168.42.21   worker-1   <none>   <none>
ovnkube-node-z22z4   5/5   Running   0   105m   192.168.42.13   master-2   <none>   <none>
ovnkube-node-zcf98   4/5   CrashLoopBackOff   20 (2m42s ago)   75m   192.168.42.20   worker-0   <none>   <none>

Checking the status of ovs-configuration on the worker nodes:

[kni.dfwt5g.lab ~]$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} systemctl | grep fail; done
0 ====
● NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online
● ovs-configuration.service loaded failed failed Configures OVS with proper host networking configuration
1 ====
● NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online
● ovs-configuration.service loaded failed failed Configures OVS with proper host networking configuration
2 ====
● NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online
● ovs-configuration.service loaded failed failed Configures OVS with proper host networking configuration
3 ====
● NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online

The worker nodes with SchedulingDisabled do not show br-ex:

$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} ip -4 -o a; done
0 ====
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eno1    inet 172.22.1.12/21 brd 172.22.7.255 scope global dynamic noprefixroute eno1\       valid_lft 80sec preferred_lft 80sec
14: enp1s0f4u4    inet 16.1.15.2/30 brd 16.1.15.3 scope global dynamic noprefixroute enp1s0f4u4\       valid_lft 8637340sec preferred_lft 8637340sec
20: ovn-k8s-mp0    inet 10.129.2.2/23 brd 10.129.3.255 scope global ovn-k8s-mp0\       valid_lft forever preferred_lft forever
25: bond0    inet 192.168.42.20/24 brd 192.168.42.255 scope global dynamic noprefixroute bond0\       valid_lft 4726sec preferred_lft 4726sec
26: bond0.340    inet 192.168.43.79/24 brd 192.168.43.255 scope global dynamic noprefixroute bond0.340\       valid_lft 4726sec preferred_lft 4726sec
1 ====
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eno1    inet 172.22.4.17/21 brd 172.22.7.255 scope global dynamic noprefixroute eno1\       valid_lft 113sec preferred_lft 113sec
14: enp1s0f4u4    inet 16.1.15.2/30 brd 16.1.15.3 scope global dynamic noprefixroute enp1s0f4u4\       valid_lft 8637517sec preferred_lft 8637517sec
20: ovn-k8s-mp0    inet 10.131.0.2/23 brd 10.131.1.255 scope global ovn-k8s-mp0\       valid_lft forever preferred_lft forever
25: bond0    inet 192.168.42.21/24 brd 192.168.42.255 scope global dynamic noprefixroute bond0\       valid_lft 4907sec preferred_lft 4907sec
26: bond0.340    inet 192.168.43.46/24 brd 192.168.43.255 scope global dynamic noprefixroute bond0.340\       valid_lft 4906sec preferred_lft 4906sec
2 ====
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eno1    inet 172.22.1.117/21 brd 172.22.7.255 scope global dynamic noprefixroute eno1\       valid_lft 106sec preferred_lft 106sec
18: ovn-k8s-mp0    inet 10.128.2.2/23 brd 10.128.3.255 scope global ovn-k8s-mp0\       valid_lft forever preferred_lft forever
24: bond0    inet 192.168.42.22/24 brd 192.168.42.255 scope global dynamic noprefixroute bond0\       valid_lft 4802sec preferred_lft 4802sec
25: bond0.340    inet 192.168.43.80/24 brd 192.168.43.255 scope global dynamic noprefixroute bond0.340\       valid_lft 4802sec preferred_lft 4802sec
3 ====
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eno1    inet 172.22.5.62/21 brd 172.22.7.255 scope global dynamic noprefixroute eno1\       valid_lft 96sec preferred_lft 96sec
14: enp1s0f4u4    inet 16.1.15.2/30 brd 16.1.15.3 scope global dynamic noprefixroute enp1s0f4u4\       valid_lft 8636416sec preferred_lft 8636416sec
16: bond0.340    inet 192.168.43.81/24 brd 192.168.43.255 scope global dynamic noprefixroute bond0.340\       valid_lft 3713sec preferred_lft 3713sec
22: br-ex    inet 192.168.42.23/24 brd 192.168.42.255 scope global dynamic noprefixroute br-ex\       valid_lft 3710sec preferred_lft 3710sec
22: br-ex    inet 192.168.42.18/32 scope global br-ex\       valid_lft forever preferred_lft forever
25: ovn-k8s-mp0    inet 10.130.2.2/23 brd 10.130.3.255 scope global ovn-k8s-mp0\       valid_lft forever preferred_lft forever

The workers lacking the bridge (br-ex) have an additional reboot:

$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} journalctl -u ovs-configuration.service --list-boots; done
0 ====
-2 51f57a67ac2c4529ac9c144b0a8aae4a Wed 2022-02-16 18:47:33 UTC—Wed 2022-02-16 18:53:05 UTC
-1 85489437ec80423e8be86b508f5f528a Wed 2022-02-16 18:54:34 UTC—Wed 2022-02-16 19:12:13 UTC
 0 6738ce8717e747d5bd3618759e1614bc Wed 2022-02-16 19:13:43 UTC—Wed 2022-02-16 20:05:56 UTC
1 ====
-2 3d252c2c3aed4699ac7a5738d1624585 Wed 2022-02-16 18:47:54 UTC—Wed 2022-02-16 18:53:47 UTC
-1 7bab2cdbc42447a5926222cd3c7ac349 Wed 2022-02-16 18:55:16 UTC—Wed 2022-02-16 19:13:41 UTC
 0 38bfd9ae284d4db183dd11594353cc4b Wed 2022-02-16 19:16:43 UTC—Wed 2022-02-16 20:05:56 UTC
2 ====
-2 94f5c2a85ecb412aac9d830123553f53 Wed 2022-02-16 18:47:53 UTC—Wed 2022-02-16 18:53:41 UTC
-1 e0138b4297d14547b98e6db7fc59682a Wed 2022-02-16 18:55:24 UTC—Wed 2022-02-16 19:13:07 UTC
 0 b23b690ccf4c4326b357ab1efa22fb0f Wed 2022-02-16 19:14:58 UTC—Wed 2022-02-16 20:05:57 UTC
3 ====
-1 3129a8e13bc84d339a39ae60311811ed Wed 2022-02-16 18:49:35 UTC—Wed 2022-02-16 18:55:01 UTC
 0 3a70997b4977460d984d4085646549fb Wed 2022-02-16 18:58:20 UTC—Wed 2022-02-16 20:05:57 UTC

On the previous boot (-1) configure-ovs found an interface (bond0) to convert to a bridge, so the bridge was presumably in place; on the current boot it finds none:

$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} journalctl -u ovs-configuration.service -b -2 | grep -A1 convert_to; done
0 ====
1 ====
2 ====
3 ====
Data from the specified boot (-2) is not available: No such boot ID in journal

$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} journalctl -u ovs-configuration.service -b -1 | grep -A1 convert_to; done
0 ====
Feb 16 18:56:01 worker-0 configure-ovs.sh[4769]: + convert_to_bridge bond0 br-ex phys0
Feb 16 18:56:01 worker-0 configure-ovs.sh[4769]: + iface=bond0
1 ====
Feb 16 18:56:43 worker-1 configure-ovs.sh[4792]: + convert_to_bridge bond0 br-ex phys0
Feb 16 18:56:43 worker-1 configure-ovs.sh[4792]: + iface=bond0
2 ====
Feb 16 18:56:51 worker-2 configure-ovs.sh[4552]: + convert_to_bridge bond0 br-ex phys0
Feb 16 18:56:51 worker-2 configure-ovs.sh[4552]: + iface=bond0
3 ====

$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} journalctl -u ovs-configuration.service -b 0 | grep -A1 convert_to; done
0 ====
Feb 16 19:17:09 worker-0 configure-ovs.sh[7361]: + convert_to_bridge '' br-ex phys0
Feb 16 19:17:09 worker-0 configure-ovs.sh[7361]: + iface=
1 ====
Feb 16 19:20:09 worker-1 configure-ovs.sh[7362]: + convert_to_bridge '' br-ex phys0
Feb 16 19:20:09 worker-1 configure-ovs.sh[7362]: + iface=
2 ====
Feb 16 19:17:24 worker-2 configure-ovs.sh[7173]: + convert_to_bridge '' br-ex phys0
Feb 16 19:17:24 worker-2 configure-ovs.sh[7173]: + iface=
3 ====
Feb 16 18:59:47 worker-3 configure-ovs.sh[4526]: + convert_to_bridge bond0 br-ex phys0
Feb 16 18:59:47 worker-3 configure-ovs.sh[4526]: + iface=bond0

Expected results:

All nodes should be available in the cluster.

Additional info:

If I restart the ovs-configuration service on any of the affected nodes, it creates the bridge (br-ex) without issues and the node finally becomes available in the cluster:

[core@worker-0 ~]$ sudo systemctl restart ovs-configuration.service
[core@worker-0 ~]$ ip -4 -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eno1    inet 172.22.1.12/21 brd 172.22.7.255 scope global dynamic noprefixroute eno1\       valid_lft 99sec preferred_lft 99sec
14: enp1s0f4u4    inet 16.1.15.2/30 brd 16.1.15.3 scope global dynamic noprefixroute enp1s0f4u4\       valid_lft 8630111sec preferred_lft 8630111sec
20: ovn-k8s-mp0    inet 10.129.2.2/23 brd 10.129.3.255 scope global ovn-k8s-mp0\       valid_lft forever preferred_lft forever
26: bond0.340    inet 192.168.43.79/24 brd 192.168.43.255 scope global dynamic noprefixroute bond0.340\       valid_lft 6723sec preferred_lft 6723sec
29: br-ex    inet 192.168.42.20/24 brd 192.168.42.255 scope global dynamic noprefixroute br-ex\       valid_lft 6724sec preferred_lft 6724sec

$ systemctl status ovs-configuration.service
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2022-02-16 21:51:01 UTC; 11min ago
 Main PID: 400088 (code=exited, status=0/SUCCESS)
      CPU: 2.136s

$ oc get nodes
NAME       STATUS                     ROLES    AGE     VERSION
master-0   Ready                      master   3h33m   v1.23.3+2e8bad7
master-1   Ready                      master   3h32m   v1.23.3+2e8bad7
master-2   Ready                      master   3h31m   v1.23.3+2e8bad7
worker-0   Ready                      worker   3h1m    v1.23.3+2e8bad7
worker-1   Ready,SchedulingDisabled   worker   3h1m    v1.23.3+2e8bad7
worker-2   Ready,SchedulingDisabled   worker   3h1m    v1.23.3+2e8bad7
worker-3   Ready                      worker   178m    v1.23.3+2e8bad7

More logs, including journal, console and must-gather, are available here: https://www.distributed-ci.io/jobs/5ff87c65-5645-4ddd-9222-cb59c07d4ac2/files (I was not able to upload the must-gather as it is larger than 20MB).
This happens in 4.10 IPI installs with custom network configuration.

Nodes boot with the ip=dhcp karg, which basically generates a DHCP NetworkManager (NM) profile that can activate any interface:

/run/NetworkManager/system-connections/default_connection.nmconnection

However, the node has custom NM profiles:

/etc/NetworkManager/system-connections/bond0.nmconnection
/etc/NetworkManager/system-connections/bond0.340.nmconnection
/etc/NetworkManager/system-connections/ens1f0.nmconnection
/etc/NetworkManager/system-connections/ens1f1.nmconnection
/etc/NetworkManager/system-connections/ens3f0.nmconnection
/etc/NetworkManager/system-connections/ens3f1.nmconnection

Assuming for example that ens1f0 is a bond0 slave, on an initial boot ens1f0 is configured through the corresponding custom profile:

ens1f0:22f4a3bf-b99a-38ae-91a8-17796391e6aa:802-3-ethernet:1645037701:Wed Feb 16 18\:55\:01 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/5:yes:ens1f0:activated:/org/freedesktop/NetworkManager/ActiveConnection/4:bond:/etc/NetworkManager/system-connections/ens1f0.nmconnection

But when configure-ovs resets networking on a subsequent reboot, ens1f0 is activated with the default profile:

Wired Connection:487ba214-7cab-4496-9a8f-015638080094:802-3-ethernet:1645038907:Wed Feb 16 19\:15\:07 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/1:yes:ens1f0:activating:/org/freedesktop/NetworkManager/ActiveConnection/7::/run/NetworkManager/system-connections/default_connection.nmconnection

The consequence is that the bond0 slaves are not activated through the intended custom profiles and are not slaved to bond0; therefore bond0 does not resolve DHCP properly, and configure-ovs does not find an appropriate default gateway interface after the network reset.

The solution is to deploy the custom NM profiles with a higher autoconnect-priority than the default (which is 0), so that the intended interfaces are always activated through those custom profiles instead of the default one.

This might seem like a regression in 4.10, because configure-ovs now resets the network, which was not done in previous releases, but it is actually uncovering a probably incorrect network configuration of the nodes.
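For reference, a minimal sketch of what a bond slave profile with a raised priority could look like; the interface name and the value 99 are only illustrative (a verified configuration of this shape appears later in this bug):

# /etc/NetworkManager/system-connections/ens1f0.nmconnection (illustrative)
[connection]
id=ens1f0
type=ethernet
interface-name=ens1f0
master=bond0
slave-type=bond
autoconnect=true
# higher than the default profile's priority of 0, so this profile wins
autoconnect-priority=99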
Thanks for the explanation, Jaime. It makes sense: we do not set autoconnect-priority in our custom network configuration, so it takes the default. We haven't observed this in OCP < 4.10. We will set a higher priority and see if it happens again.
@manrodri Any updates? Can we close this bug?
@jcaamano We haven't seen many occurrences since we added the autoconnect-priority to our custom manifests, but today one of the workers had this issue again. It was a cluster with OpenShift 4.10.0 RC 2022-03-02.

The full journal log can be found here: https://www.distributed-ci.io/files/50b1094f-a9b2-4a5d-b926-78d095e9f0d8 or attached as journal-worker-1-2022Mar08.log

The behavior is the same as described initially in this BZ: the node started correctly on the first boot, but on the second boot the ovs-configuration.service failed.

$ cat journal-worker-1-2022Mar08.log | grep configure-ovs.sh | grep -A1 convert_to
Mar 08 06:56:09 worker-1 configure-ovs.sh[4778]: + convert_to_bridge bond0 br-ex phys0
Mar 08 06:56:09 worker-1 configure-ovs.sh[4778]: + iface=bond0
--
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + convert_to_bridge '' br-ex phys0
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + iface=

$ cat journal-worker-1-2022Mar08.log | grep configure-ovs.sh | less
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + convert_to_bridge '' br-ex phys0
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + iface=
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + bridge_name=br-ex
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + port_name=phys0
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + ovs_port=ovs-port-br-ex
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + ovs_interface=ovs-if-br-ex
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + default_port_name=ovs-port-phys0
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + bridge_interface_name=ovs-if-phys0
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + '[' '' = br-ex ']'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + '[' -z '' ']'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + echo 'ERROR: Unable to find default gateway interface'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: ERROR: Unable to find default gateway interface
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + exit 1
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + handle_exit
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + e=1
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + '[' 1 -eq 0 ']'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + echo 'ERROR: configure-ovs exited with error: 1'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: ERROR: configure-ovs exited with error: 1
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + print_state
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + echo 'Current device, connection, interface and routing state:'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: Current device, connection, interface and routing state:

Also FYI, we included autoconnect-priority=99 in all the following files; do you see a problem with that?

/etc/NetworkManager/system-connections/bond0.nmconnection
/etc/NetworkManager/system-connections/bond0.340.nmconnection
/etc/NetworkManager/system-connections/ens1f0.nmconnection
/etc/NetworkManager/system-connections/ens1f1.nmconnection
/etc/NetworkManager/system-connections/ens3f0.nmconnection
/etc/NetworkManager/system-connections/ens3f1.nmconnection

Please let me know if we can provide more details. I'll keep checking more carefully for new occurrences.
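For reference, one way to check on an affected node which profile each device actually activated with, and what priority NetworkManager sees for each profile, could be something like the following (field names as accepted by `nmcli connection show`; the FILENAME field needs a reasonably recent NetworkManager):

nmcli -t -f NAME,DEVICE,FILENAME connection show --active
nmcli -t -f NAME,AUTOCONNECT,AUTOCONNECT-PRIORITY connection show

If a bond slave shows up backed by /run/NetworkManager/system-connections/default_connection.nmconnection instead of its custom keyfile, the priority fix is not taking effect for that interface.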
> Mar 08 07:13:46 worker-1 configure-ovs.sh[7337]: + nmcli network off
> Mar 08 07:13:46 worker-1 configure-ovs.sh[7337]: + systemctl restart NetworkManager
> Mar 08 07:13:46 worker-1 configure-ovs.sh[7337]: + nmcli network on

> 1. Why does nm-online -s -t 60 terminate before bond0 connection has finished activation?

It's almost always most useful to collect full level=TRACE logs and to attach the full logs. Anyway.

If you first do `nmcli network off` and then restart NM, NM will start with networking disabled. That means it has nothing to do and will very quickly reach "startup-complete" (which already unblocks `nm-online -s`). That something happens afterwards (`nmcli network on`) which causes additional work for NM does not undo the fact that NM already completed startup.

You'll also see a log line with "audit:", which is about `nmcli network on`. If that happens *after* the "startup complete" line, then it's too late.

> So in my case it is waiting until the default dhcp timeout of 45 secs.

Yes, there is a race here. If you manage to sneak the `nmcli network on` in before "startup complete" happens, then NM will wait longer.

I am not clear about what is done here. `nm-online -s` does not make much sense aside from being an implementation detail of NetworkManager-wait-online.service. Is there a script that restarts NetworkManager? If you restart NetworkManager, NetworkManager-wait-online.service probably completes (or fails) right away, so it doesn't really matter what `nm-online -s` says afterwards.

Also, restarting NetworkManager is usually not the right thing to do. Why is that done?
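To make the race concrete, the quoted journal lines correspond to a sequence like the following (a condensed sketch of what the trace shows, not the full configure-ovs.sh logic):

# Networking is switched off before the restart...
nmcli network off
systemctl restart NetworkManager
# ...so NM starts with nothing to do and reaches "startup complete" almost
# immediately. If "nmcli network on" only lands after that point, a later
# "nm-online -s -t 60" can return before bond0 has finished activating.
nmcli network on
nm-online -s -t 60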
> - wait for all autoconnect profiles to be active

OK. Seems cumbersome. And the script has races:

- `nmcli network off` does not wait for all devices to be disconnected, so it is not guaranteed that everything is down during the next `systemctl (re)start`.
- the race already discussed.

> - wait for all autoconnect profiles to be active

NM could add API so this race could be avoided. On the D-Bus API there is currently a boolean flag `busctl get-property org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager Startup`, which starts with TRUE and then switches to FALSE (never to switch back). Internally, that is basically tied to the notion of NetworkManager being busy doing something. If that busy flag were exposed, you could do `nmcli network on` and wait for the busy flag to go away. But it's not clear whether that is really the right solution, and it would only be a future addition.

@bgalvani, WDYT?
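For illustration, the "wait for all autoconnect profiles to be active" step could be approximated today with a polling loop like the one below. This is a rough sketch with an assumed helper name, not anything NM or configure-ovs ships, and it inherits the cumbersomeness mentioned above (e.g. a profile with autoconnect=yes but no matching device would never report "activated"):

# Poll until every profile with autoconnect=yes reports GENERAL.STATE=activated,
# or give up after a timeout (default 60s). Requires bash.
wait_for_autoconnect_profiles() {
  local timeout="${1:-60}" start now conn autoconnect state pending
  start=$(date +%s)
  while true; do
    pending=0
    # NAME and AUTOCONNECT are standard fields of `nmcli connection show`;
    # note this simple parsing breaks if a profile name contains a colon.
    while IFS=: read -r conn autoconnect; do
      [ "$autoconnect" = "yes" ] || continue
      state=$(nmcli -g GENERAL.STATE connection show "$conn" 2>/dev/null)
      [ "$state" = "activated" ] || pending=1
    done < <(nmcli -t -f NAME,AUTOCONNECT connection show)
    [ "$pending" -eq 0 ] && return 0
    now=$(date +%s)
    [ $((now - start)) -ge "$timeout" ] && return 1
    sleep 1
  done
}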
> We want a procedure to reload NM after making profile configuration changes (either through nmcli or directly on disk) on a running host so that it behaves as on boot

Can you share the script/code that you have for that?
Verified on 4.11.0-0.nightly-2022-05-20-213928, IPI baremetal dual-stack bonding, with `autoconnect-priority=99`:

/etc/NetworkManager/system-connections/enp6s0.nmconnection:
[connection]
id=enp6s0
type=ethernet
interface-name=enp6s0
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99

/etc/NetworkManager/system-connections/enp5s0.nmconnection:
[connection]
id=enp5s0
type=ethernet
interface-name=enp5s0
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99

/etc/NetworkManager/system-connections/bond0.nmconnection:
[connection]
id=bond0
type=bond
interface-name=bond0
autoconnect=true
connection.autoconnect-slaves=1
autoconnect-priority=99

[bond]
mode=802.3ad
miimon=100

[ipv4]
method=auto
dhcp-timeout=2147483647

[ipv6]
method=auto
dhcp-timeout=2147483647

May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + for connection in $(nmcli -g NAME c | grep -- "$MANAGED_NM_CONN_SUFFIX")
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + local conn=enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + local active_state=
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + '[' '' '!=' activated ']'
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 1 to bring up connection enp5s0-slave-ovs-clone'
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Attempt 1 to bring up connection enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + nmcli conn up enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Error: Connection activation failed: Unknown reason
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Hint: use 'journalctl -xe NM_CONNECTION=a9b3f226-7435-47bd-a431-d106c8fd8f39 + NM_DEVICE=enp5s0' to get more details.
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + s=4
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + sleep 5
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 2 to bring up connection enp5s0-slave-ovs-clone'
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: Attempt 2 to bring up connection enp5s0-slave-ovs-clone
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + nmcli conn up enp5s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/20)
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + s=0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + break
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Brought up connection enp5s0-slave-ovs-clone successfully'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Brought up connection enp5s0-slave-ovs-clone successfully
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod enp5s0-slave-ovs-clone connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + for connection in $(nmcli -g NAME c | grep -- "$MANAGED_NM_CONN_SUFFIX")
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' activated '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Connection enp6s0-slave-ovs-clone already activated'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection enp6s0-slave-ovs-clone already activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod enp6s0-slave-ovs-clone connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' activated '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Connection ovs-if-phys0 already activated'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection ovs-if-phys0 already activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod ovs-if-phys0 connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' '' '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 1 to bring up connection ovs-if-br-ex'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Attempt 1 to bring up connection ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli conn up ovs-if-br-ex
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/21)
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + s=0
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + break
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + echo 'Brought up connection ovs-if-br-ex successfully'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: Brought up connection ovs-if-br-ex successfully
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + nmcli c mod ovs-if-br-ex connection.autoconnect yes
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' -f /etc/ovnk/extra_bridge ']'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + handle_exit
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + e=0
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + ip route show
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: default via 192.168.123.1 dev br-ex proto dhcp metric 48
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: 192.168.123.0/24 dev br-ex proto kernel scope link src 192.168.123.97 metric 48
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + ip -6 route show
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: ::1 dev lo proto kernel metric 256 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fd2e:6f44:5dd8::4d dev br-ex proto kernel metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fd2e:6f44:5dd8::/64 dev br-ex proto ra metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fe80::/64 dev br-ex proto kernel metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: default via fe80::5054:ff:fe97:2978 dev br-ex proto ra metric 48 pref medium
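For context, the retry behaviour visible in the trace above follows a pattern roughly like the one below. This is a simplified reconstruction from the xtrace output, not the actual configure-ovs.sh source; failure handling after the tenth attempt is omitted:

# Reconstructed sketch: activate a connection, retrying up to 10 times.
activate_nm_conn() {
  local conn="$1" s=0 i active_state
  active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
  if [ "$active_state" != "activated" ]; then
    for i in {1..10}; do
      echo "Attempt $i to bring up connection $conn"
      nmcli conn up "$conn" && s=0 || s=$?
      [ "$s" -eq 0 ] && break
      sleep 5
    done
    [ "$s" -eq 0 ] && echo "Brought up connection $conn successfully"
  else
    echo "Connection $conn already activated"
  fi
  # ensure the connection autoconnects on subsequent boots
  nmcli c mod "$conn" connection.autoconnect yes
}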
*** Bug 2089943 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069