Bug 2055433 - Unable to create br-ex as gateway is not found
Summary: Unable to create br-ex as gateway is not found
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jaime Caamaño Ruiz
QA Contact: Ross Brattain
URL:
Whiteboard:
Duplicates: 2089943
Depends On:
Blocks: 2071574
 
Reported: 2022-02-16 22:29 UTC by tonyg
Modified: 2022-08-10 10:50 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A race condition during NetworkManager restart sometimes prevents DHCP resolution from completing successfully when setting up the br-ex bridge on node boot with OVN-Kubernetes. Consequence: The node is sometimes in NotReady state after boot. Fix: NetworkManager is no longer restarted when setting up br-ex, preventing the race condition. Result: The node should not be in NotReady state after boot.
Clone Of:
Clones: 2091495
Environment:
Last Closed: 2022-08-10 10:50:22 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 3006 0 None Merged Bug 2055433: configure-ovs: set networking on before restarting NetworkManager 2022-06-03 21:51:12 UTC
Github openshift machine-config-operator pull 3120 0 None Merged Bug 2078866: configure-ovs: avoid restarting NetworkManager 2022-06-03 21:51:14 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:50:35 UTC

Description tonyg 2022-02-16 22:29:28 UTC
Description of problem:

After installation of OCP 4.10 and during the time when workers are attempting to join the cluster, some of the workers won't be able to become fully available due to problems with ovs-configuration.

This installation is on baremetal, using IPI.

Version-Release number of selected component (if applicable):

4.10.0-rc.2 2022-02-11


How reproducible:

Quite frequently


Steps to Reproduce:
1. Install OCP 4.10.0-rc2 on baremetal
2. Wait for the workers to attempt to join the cluster


Actual results:

$ oc get nodes
NAME       STATUS                     ROLES    AGE    VERSION
master-0   Ready                      master   105m   v1.23.3+2e8bad7
master-1   Ready                      master   104m   v1.23.3+2e8bad7
master-2   Ready                      master   103m   v1.23.3+2e8bad7
worker-0   Ready,SchedulingDisabled   worker   73m    v1.23.3+2e8bad7
worker-1   Ready,SchedulingDisabled   worker   73m    v1.23.3+2e8bad7
worker-2   Ready,SchedulingDisabled   worker   73m    v1.23.3+2e8bad7
worker-3   Ready                      worker   70m    v1.23.3+2e8bad7

$ oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.10.0-rc.2   True        False         False      65m     
baremetal                                  4.10.0-rc.2   True        False         False      100m    
cloud-controller-manager                   4.10.0-rc.2   True        False         False      105m    
cloud-credential                           4.10.0-rc.2   True        False         False      126m    
cluster-autoscaler                         4.10.0-rc.2   True        False         False      100m    
config-operator                            4.10.0-rc.2   True        False         False      101m    
console                                    4.10.0-rc.2   True        False         False      70m     
csi-snapshot-controller                    4.10.0-rc.2   True        False         False      100m    
dns                                        4.10.0-rc.2   True        True          True       100m    DNS default is degraded
etcd                                       4.10.0-rc.2   True        False         False      99m     
image-registry                             4.10.0-rc.2   True        False         False      92m     
ingress                                    4.10.0-rc.2   True        False         True       60m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-86d7dcd577-52hsx" cannot be scheduled: 0/7 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) were unschedulable. Make sure you have sufficient worker nodes.)
insights                                   4.10.0-rc.2   True        False         False      94m     
kube-apiserver                             4.10.0-rc.2   True        False         False      96m     
kube-controller-manager                    4.10.0-rc.2   True        False         False      98m     
kube-scheduler                             4.10.0-rc.2   True        False         False      98m     
kube-storage-version-migrator              4.10.0-rc.2   True        False         False      101m    
machine-api                                4.10.0-rc.2   True        False         False      97m     
machine-approver                           4.10.0-rc.2   True        False         False      100m    
machine-config                             4.10.0-rc.2   True        False         False      94m     
marketplace                                4.10.0-rc.2   True        False         False      100m    
monitoring                                 4.10.0-rc.2   False       True          True       45m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
network                                    4.10.0-rc.2   True        True          True       101m    DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - pod ovnkube-node-kgs92 is in CrashLoopBackOff State...
node-tuning                                4.10.0-rc.2   True        False         False      60m     
openshift-apiserver                        4.10.0-rc.2   True        False         False      94m     
openshift-controller-manager               4.10.0-rc.2   True        False         False      98m     
openshift-samples                          4.10.0-rc.2   True        False         False      87m     
operator-lifecycle-manager                 4.10.0-rc.2   True        False         False      100m    
operator-lifecycle-manager-catalog         4.10.0-rc.2   True        False         False      100m    
operator-lifecycle-manager-packageserver   4.10.0-rc.2   True        False         False      94m     
service-ca                                 4.10.0-rc.2   True        False         False      101m    
storage                                    4.10.0-rc.2   True        False         False      101m    

$ oc -n openshift-ovn-kubernetes get pod -o wide
NAME                   READY   STATUS             RESTARTS         AGE    IP              NODE       NOMINATED NODE   READINESS GATES
ovnkube-master-5gghf   6/6     Running            6 (103m ago)     105m   192.168.42.13   master-2   <none>           <none>
ovnkube-master-6ktmn   6/6     Running            6 (103m ago)     105m   192.168.42.12   master-1   <none>           <none>
ovnkube-master-zzjx4   6/6     Running            1 (95m ago)      105m   192.168.42.11   master-0   <none>           <none>
ovnkube-node-6xpdx     5/5     Running            0                72m    192.168.42.23   worker-3   <none>           <none>
ovnkube-node-h58gq     5/5     Running            0                105m   192.168.42.12   master-1   <none>           <none>
ovnkube-node-kgs92     4/5     CrashLoopBackOff   20 (54s ago)     75m    192.168.42.22   worker-2   <none>           <none>
ovnkube-node-pn4bs     5/5     Running            0                105m   192.168.42.11   master-0   <none>           <none>
ovnkube-node-xw8vg     4/5     CrashLoopBackOff   20 (12s ago)     75m    192.168.42.21   worker-1   <none>           <none>
ovnkube-node-z22z4     5/5     Running            0                105m   192.168.42.13   master-2   <none>           <none>
ovnkube-node-zcf98     4/5     CrashLoopBackOff   20 (2m42s ago)   75m    192.168.42.20   worker-0   <none>           <none>

Checking the worker nodes status of ovs-configuration:

[kni.dfwt5g.lab ~]$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} systemctl|grep fail; done
0 ====
● NetworkManager-wait-online.service                                                                                                                              loaded failed failed    Network Manager Wait Online
● ovs-configuration.service                                                                                                                                       loaded failed failed    Configures OVS with proper host networking configuration
1 ====
● NetworkManager-wait-online.service                                                                                                                              loaded failed failed    Network Manager Wait Online
● ovs-configuration.service                                                                                                                                       loaded failed failed    Configures OVS with proper host networking configuration
2 ====
● NetworkManager-wait-online.service                                                                                                                              loaded failed failed    Network Manager Wait Online
● ovs-configuration.service                                                                                                                                       loaded failed failed    Configures OVS with proper host networking configuration
3 ====
● NetworkManager-wait-online.service                                                                                                                                           loaded failed failed    Network Manager Wait Online


The worker nodes with SchedulingDisabled do not show br-ex:

$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} ip -4 -o a; done
0 ====
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eno1    inet 172.22.1.12/21 brd 172.22.7.255 scope global dynamic noprefixroute eno1\       valid_lft 80sec preferred_lft 80sec
14: enp1s0f4u4    inet 16.1.15.2/30 brd 16.1.15.3 scope global dynamic noprefixroute enp1s0f4u4\       valid_lft 8637340sec preferred_lft 8637340sec
20: ovn-k8s-mp0    inet 10.129.2.2/23 brd 10.129.3.255 scope global ovn-k8s-mp0\       valid_lft forever preferred_lft forever
25: bond0    inet 192.168.42.20/24 brd 192.168.42.255 scope global dynamic noprefixroute bond0\       valid_lft 4726sec preferred_lft 4726sec
26: bond0.340    inet 192.168.43.79/24 brd 192.168.43.255 scope global dynamic noprefixroute bond0.340\       valid_lft 4726sec preferred_lft 4726sec
1 ====
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eno1    inet 172.22.4.17/21 brd 172.22.7.255 scope global dynamic noprefixroute eno1\       valid_lft 113sec preferred_lft 113sec
14: enp1s0f4u4    inet 16.1.15.2/30 brd 16.1.15.3 scope global dynamic noprefixroute enp1s0f4u4\       valid_lft 8637517sec preferred_lft 8637517sec
20: ovn-k8s-mp0    inet 10.131.0.2/23 brd 10.131.1.255 scope global ovn-k8s-mp0\       valid_lft forever preferred_lft forever
25: bond0    inet 192.168.42.21/24 brd 192.168.42.255 scope global dynamic noprefixroute bond0\       valid_lft 4907sec preferred_lft 4907sec
26: bond0.340    inet 192.168.43.46/24 brd 192.168.43.255 scope global dynamic noprefixroute bond0.340\       valid_lft 4906sec preferred_lft 4906sec
2 ====
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eno1    inet 172.22.1.117/21 brd 172.22.7.255 scope global dynamic noprefixroute eno1\       valid_lft 106sec preferred_lft 106sec
18: ovn-k8s-mp0    inet 10.128.2.2/23 brd 10.128.3.255 scope global ovn-k8s-mp0\       valid_lft forever preferred_lft forever
24: bond0    inet 192.168.42.22/24 brd 192.168.42.255 scope global dynamic noprefixroute bond0\       valid_lft 4802sec preferred_lft 4802sec
25: bond0.340    inet 192.168.43.80/24 brd 192.168.43.255 scope global dynamic noprefixroute bond0.340\       valid_lft 4802sec preferred_lft 4802sec
3 ====
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eno1    inet 172.22.5.62/21 brd 172.22.7.255 scope global dynamic noprefixroute eno1\       valid_lft 96sec preferred_lft 96sec
14: enp1s0f4u4    inet 16.1.15.2/30 brd 16.1.15.3 scope global dynamic noprefixroute enp1s0f4u4\       valid_lft 8636416sec preferred_lft 8636416sec
16: bond0.340    inet 192.168.43.81/24 brd 192.168.43.255 scope global dynamic noprefixroute bond0.340\       valid_lft 3713sec preferred_lft 3713sec
22: br-ex    inet 192.168.42.23/24 brd 192.168.42.255 scope global dynamic noprefixroute br-ex\       valid_lft 3710sec preferred_lft 3710sec
22: br-ex    inet 192.168.42.18/32 scope global br-ex\       valid_lft forever preferred_lft forever
25: ovn-k8s-mp0    inet 10.130.2.2/23 brd 10.130.3.255 scope global ovn-k8s-mp0\       valid_lft forever preferred_lft forever


The workers lacking the bridge (br-ex) have an additional reboot:

$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} journalctl -u ovs-configuration.service --list-boots; done
0 ====
-2 51f57a67ac2c4529ac9c144b0a8aae4a Wed 2022-02-16 18:47:33 UTC—Wed 2022-02-16 18:53:05 UTC
-1 85489437ec80423e8be86b508f5f528a Wed 2022-02-16 18:54:34 UTC—Wed 2022-02-16 19:12:13 UTC
 0 6738ce8717e747d5bd3618759e1614bc Wed 2022-02-16 19:13:43 UTC—Wed 2022-02-16 20:05:56 UTC
1 ====
-2 3d252c2c3aed4699ac7a5738d1624585 Wed 2022-02-16 18:47:54 UTC—Wed 2022-02-16 18:53:47 UTC
-1 7bab2cdbc42447a5926222cd3c7ac349 Wed 2022-02-16 18:55:16 UTC—Wed 2022-02-16 19:13:41 UTC
 0 38bfd9ae284d4db183dd11594353cc4b Wed 2022-02-16 19:16:43 UTC—Wed 2022-02-16 20:05:56 UTC
2 ====
-2 94f5c2a85ecb412aac9d830123553f53 Wed 2022-02-16 18:47:53 UTC—Wed 2022-02-16 18:53:41 UTC
-1 e0138b4297d14547b98e6db7fc59682a Wed 2022-02-16 18:55:24 UTC—Wed 2022-02-16 19:13:07 UTC
 0 b23b690ccf4c4326b357ab1efa22fb0f Wed 2022-02-16 19:14:58 UTC—Wed 2022-02-16 20:05:57 UTC
3 ====
-1 3129a8e13bc84d339a39ae60311811ed Wed 2022-02-16 18:49:35 UTC—Wed 2022-02-16 18:55:01 UTC
 0 3a70997b4977460d984d4085646549fb Wed 2022-02-16 18:58:20 UTC—Wed 2022-02-16 20:05:57 UTC

The logs from the previous boot suggest that the bridge had been set up correctly then, as an interface was found to convert to a bridge:

$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} journalctl -u ovs-configuration.service -b -2 | grep -A1 convert_to; done
0 ====
1 ====
2 ====
3 ====
Data from the specified boot (-2) is not available: No such boot ID in journal

$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} journalctl -u ovs-configuration.service -b -1 | grep -A1 convert_to; done
0 ====
Feb 16 18:56:01 worker-0 configure-ovs.sh[4769]: + convert_to_bridge bond0 br-ex phys0
Feb 16 18:56:01 worker-0 configure-ovs.sh[4769]: + iface=bond0
1 ====
Feb 16 18:56:43 worker-1 configure-ovs.sh[4792]: + convert_to_bridge bond0 br-ex phys0
Feb 16 18:56:43 worker-1 configure-ovs.sh[4792]: + iface=bond0
2 ====
Feb 16 18:56:51 worker-2 configure-ovs.sh[4552]: + convert_to_bridge bond0 br-ex phys0
Feb 16 18:56:51 worker-2 configure-ovs.sh[4552]: + iface=bond0
3 ====


$ for n in {0..3}; do echo $n ====; ssh core@worker-${n} journalctl -u ovs-configuration.service -b 0 | grep -A1 convert_to; done
0 ====
Feb 16 19:17:09 worker-0 configure-ovs.sh[7361]: + convert_to_bridge '' br-ex phys0
Feb 16 19:17:09 worker-0 configure-ovs.sh[7361]: + iface=
1 ====
Feb 16 19:20:09 worker-1 configure-ovs.sh[7362]: + convert_to_bridge '' br-ex phys0
Feb 16 19:20:09 worker-1 configure-ovs.sh[7362]: + iface=
2 ====
Feb 16 19:17:24 worker-2 configure-ovs.sh[7173]: + convert_to_bridge '' br-ex phys0
Feb 16 19:17:24 worker-2 configure-ovs.sh[7173]: + iface=
3 ====
Feb 16 18:59:47 worker-3 configure-ovs.sh[4526]: + convert_to_bridge bond0 br-ex phys0
Feb 16 18:59:47 worker-3 configure-ovs.sh[4526]: + iface=bond0



Expected results:

All nodes should be available in the cluster.


Additional info:

If I restart the ovs-configuration service on any affected node, it is able to create the bridge (br-ex) without issues and the node finally becomes available in the cluster:

[core@worker-0 ~]$ sudo systemctl restart ovs-configuration.service 
[core@worker-0 ~]$ ip -4 -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
2: eno1    inet 172.22.1.12/21 brd 172.22.7.255 scope global dynamic noprefixroute eno1\       valid_lft 99sec preferred_lft 99sec
14: enp1s0f4u4    inet 16.1.15.2/30 brd 16.1.15.3 scope global dynamic noprefixroute enp1s0f4u4\       valid_lft 8630111sec preferred_lft 8630111sec
20: ovn-k8s-mp0    inet 10.129.2.2/23 brd 10.129.3.255 scope global ovn-k8s-mp0\       valid_lft forever preferred_lft forever
26: bond0.340    inet 192.168.43.79/24 brd 192.168.43.255 scope global dynamic noprefixroute bond0.340\       valid_lft 6723sec preferred_lft 6723sec
29: br-ex    inet 192.168.42.20/24 brd 192.168.42.255 scope global dynamic noprefixroute br-ex\       valid_lft 6724sec preferred_lft 6724sec

$ systemctl status ovs-configuration.service 
● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2022-02-16 21:51:01 UTC; 11min ago
 Main PID: 400088 (code=exited, status=0/SUCCESS)
      CPU: 2.136s


$ oc get nodes
NAME       STATUS                     ROLES    AGE     VERSION
master-0   Ready                      master   3h33m   v1.23.3+2e8bad7
master-1   Ready                      master   3h32m   v1.23.3+2e8bad7
master-2   Ready                      master   3h31m   v1.23.3+2e8bad7
worker-0   Ready                      worker   3h1m    v1.23.3+2e8bad7
worker-1   Ready,SchedulingDisabled   worker   3h1m    v1.23.3+2e8bad7
worker-2   Ready,SchedulingDisabled   worker   3h1m    v1.23.3+2e8bad7
worker-3   Ready                      worker   178m    v1.23.3+2e8bad7



More logs, including journal, console and must-gather are available here: https://www.distributed-ci.io/jobs/5ff87c65-5645-4ddd-9222-cb59c07d4ac2/files

(I was not able to upload the must-gather as it is larger than 20MB.)

Comment 1 Jaime Caamaño Ruiz 2022-02-21 17:33:18 UTC
This happens in 4.10 IPI installs with custom network configuration.


Nodes boot with the ip=dhcp kernel argument (karg), which generates a default DHCP NetworkManager (NM) profile that can activate any interface:

/run/NetworkManager/system-connections/default_connection.nmconnection


However, the node has custom NM profiles:

/etc/NetworkManager/system-connections/bond0.nmconnection
/etc/NetworkManager/system-connections/bond0.340.nmconnection
/etc/NetworkManager/system-connections/ens1f0.nmconnection
/etc/NetworkManager/system-connections/ens1f1.nmconnection
/etc/NetworkManager/system-connections/ens3f0.nmconnection
/etc/NetworkManager/system-connections/ens3f1.nmconnection


Assuming for example that ens1f0 is a bond0 slave, on an initial boot ens1f0 is configured through the corresponding custom profile:

ens1f0:22f4a3bf-b99a-38ae-91a8-17796391e6aa:802-3-ethernet:1645037701:Wed Feb 16 18\:55\:01 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/5:yes:ens1f0:activated:/org/freedesktop/NetworkManager/ActiveConnection/4:bond:/etc/NetworkManager/system-connections/ens1f0.nmconnection


But when configure-ovs resets networking on a subsequent reboot, ens1f0 is activated with the default profile:

Wired Connection:487ba214-7cab-4496-9a8f-015638080094:802-3-ethernet:1645038907:Wed Feb 16 19\:15\:07 2022:yes:0:no:/org/freedesktop/NetworkManager/Settings/1:yes:ens1f0:activating:/org/freedesktop/NetworkManager/ActiveConnection/7::/run/NetworkManager/system-connections/default_connection.nmconnection


The consequence is that the bond0 slaves are not activated through the intended custom profiles and are not enslaved to bond0; as a result, bond0 does not complete DHCP resolution and configure-ovs cannot find an appropriate default gateway interface after the network reset.
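
For illustration only, and as a simplified stand-in for the actual configure-ovs.sh logic (not the real script), the kind of lookup that ends up empty is roughly:

# Hypothetical sketch: pick the interface that carries the default route.
# With bond0 lacking a DHCP lease there is no default route, so iface stays
# empty and the later convert_to_bridge '' call fails.
iface=$(ip route show default | awk '{ print $5; exit }')
if [ -z "$iface" ]; then
  echo "ERROR: Unable to find default gateway interface"
  exit 1
fi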


The solution is to deploy the custom NM profiles with higher autoconnect-priority than the default (which is 0) so that the intended interfaces are always activated through those custom profiles instead of the default one.
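
For example (a minimal illustration using the profile names from this report, not a required procedure), the priority can be raised on an existing profile with nmcli:

$ nmcli connection modify bond0 connection.autoconnect-priority 99

or, when editing the keyfiles directly, by adding autoconnect-priority=99 to the [connection] section and running `nmcli connection reload` afterwards.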


This might seem like a regression in 4.10, because configure-ovs now resets the network (which was not done in previous releases), but it actually uncovers a probably incorrect network configuration of the nodes.

Comment 2 Manuel Rodriguez 2022-02-21 18:56:44 UTC
Thanks for the explanation Jaime, 

It makes sense: we do not set autoconnect-priority in our custom network configuration, so it takes the default. We haven't observed this in OCP < 4.10.
We will set a higher priority and see if it happens again.

Comment 3 Jaime Caamaño Ruiz 2022-03-07 12:32:42 UTC
@manrodri any updates, can we close this bug?

Comment 4 Manuel Rodriguez 2022-03-09 02:28:48 UTC
@jcaamano we haven't seen many occurrences since we added the autoconnect-priority in our custom manifests,
but today one of the workers had this issue. It was a cluster with OpenShift 4.10.0 RC 2022-03-02.

The full journal log can be found here: https://www.distributed-ci.io/files/50b1094f-a9b2-4a5d-b926-78d095e9f0d8 or attached as journal-worker-1-2022Mar08.log

The behavior is the same as described initially in this BZ: the node started correctly on the first boot, but on the second boot the ovs-configuration.service failed.

$ cat journal-worker-1-2022Mar08.log | grep configure-ovs.sh | grep -A1 convert_to
Mar 08 06:56:09 worker-1 configure-ovs.sh[4778]: + convert_to_bridge bond0 br-ex phys0
Mar 08 06:56:09 worker-1 configure-ovs.sh[4778]: + iface=bond0
--
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + convert_to_bridge '' br-ex phys0
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + iface=

$ cat journal-worker-1-2022Mar08.log | grep configure-ovs.sh | less
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + convert_to_bridge '' br-ex phys0
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + iface=
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + bridge_name=br-ex
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + port_name=phys0
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + ovs_port=ovs-port-br-ex
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + ovs_interface=ovs-if-br-ex
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + default_port_name=ovs-port-phys0
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + bridge_interface_name=ovs-if-phys0
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + '[' '' = br-ex ']'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + '[' -z '' ']'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + echo 'ERROR: Unable to find default gateway interface'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: ERROR: Unable to find default gateway interface
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + exit 1
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + handle_exit
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + e=1
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + '[' 1 -eq 0 ']'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + echo 'ERROR: configure-ovs exited with error: 1'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: ERROR: configure-ovs exited with error: 1
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + print_state
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: + echo 'Current device, connection, interface and routing state:'
Mar 08 07:14:55 worker-1 configure-ovs.sh[7337]: Current device, connection, interface and routing state:


Also FYI, we included autoconnect-priority=99 in all the following files; do you see a problem with that?
/etc/NetworkManager/system-connections/bond0.nmconnection
/etc/NetworkManager/system-connections/bond0.340.nmconnection
/etc/NetworkManager/system-connections/ens1f0.nmconnection
/etc/NetworkManager/system-connections/ens1f1.nmconnection
/etc/NetworkManager/system-connections/ens3f0.nmconnection
/etc/NetworkManager/system-connections/ens3f1.nmconnection

Please let me know if we can provide more details; I'll keep checking more carefully for new occurrences.

Comment 10 Thomas Haller 2022-03-09 23:01:15 UTC
> Mar 08 07:13:46 worker-1 configure-ovs.sh[7337]: + nmcli network off
> Mar 08 07:13:46 worker-1 configure-ovs.sh[7337]: + systemctl restart NetworkManager
> Mar 08 07:13:46 worker-1 configure-ovs.sh[7337]: + nmcli network on

> 1. Why does nm-online -s -t 60 terminate before bond0 connection has finished activation? 


It's almost always most useful to collect full level=TRACE logs and to attach the full logs.

Anyway. If you first do `nmcli network off`, and then restart NM, then NM will start with networking disabled.
That means it has nothing to do and will very quickly reach "startup-complete" (which already unblocks `nm-online -s`).

The fact that something happens afterwards (`nmcli network on`) which causes additional work for NM does not undo the fact that NM has already completed startup.

You'll also see a log line with "audit:" which is about `nmcli network on`. If that happens *after* the "startup complete" line, then it's too late.

> So in my case it is waiting until the default dhcp timeout of 45 secs.

Yes, there is a race here. If you manage to sneak the `nmcli network on` in before "startup complete" happens, then NM will wait longer.



I am not clear about what is done here. `nm-online -s` does not make much sense aside from being an implementation detail of NetworkManager-wait-online.service. Is there a script that restarts NetworkManager? If you restart NetworkManager, NetworkManager-wait-online.service probably completes (or fails) right away, so it doesn't really matter what `nm-online -s` says afterwards. Also, restarting NetworkManager is usually not the right thing to do. Why is that done?
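
To illustrate the race described above, here is a simplified sketch based on the sequence quoted from the logs (not the actual configure-ovs.sh code):

# Ordering quoted above: NM can reach "startup complete" while networking is
# still off, so `nm-online -s` may return before any profile activates.
nmcli network off
systemctl restart NetworkManager
nmcli network on     # if this lands after "startup complete", it is too late
nm-online -s -t 60

# Ordering along the lines of the later fix (MCO PR 3006): turn networking
# back on before the restart, so NM starts up with work still to do.
nmcli network off
nmcli network on
systemctl restart NetworkManager
nm-online -s -t 60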

Comment 12 Thomas Haller 2022-03-11 09:28:32 UTC
> - wait for all autoconnect profiles to be active

ok. Seems cumbersome. And the script has races.

- `nmcli network off` does not wait for all devices to be disconnected, so it's not guaranteed that everything is down by the time of the next `systemctl (re)start`.
- the race already discussed.


> - wait for all autoconnect profiles to be active

NM could add an API so this race could be avoided. On the D-Bus API there is currently a boolean flag `busctl get-property org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager Startup`, which starts as TRUE and then switches to FALSE (never to switch back). Internally, that is basically tied to the notion of NetworkManager being busy doing something. If that busy flag were exposed, you could do `nmcli network on` and wait for the busy flag to go away.

But, it's not clear whether that is really the right solution. And it would only be a future addition.
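
For reference, the existing property can only be observed like this (a sketch; it reports true during initial startup and then stays false, so it cannot express the later "busy" state that `nmcli network on` creates):

# Prints "b true" while NM is still starting up and "b false" afterwards;
# it never switches back to true, even when NM gets new work to do.
busctl get-property org.freedesktop.NetworkManager \
    /org/freedesktop/NetworkManager org.freedesktop.NetworkManager Startup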

@bgalvani, WDYT?

Comment 13 Thomas Haller 2022-03-11 09:49:55 UTC
> We want a procedure to reload NM after making profile configuration changes (either through nmcli or directly on disk) on a running host so that it behaves as on boot:

can you share the script/code that you have for that?

Comment 27 Ross Brattain 2022-05-24 10:23:28 UTC
Verified on 4.11.0-0.nightly-2022-05-20-213928, IPI baremetal dual-stack bonding, with `autoconnect-priority=99`.



/etc/NetworkManager/system-connections/enp6s0.nmconnection
[connection]
id=enp6s0
type=ethernet
interface-name=enp6s0
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99

/etc/NetworkManager/system-connections/enp5s0.nmconnection:
[connection]
id=enp5s0
type=ethernet
interface-name=enp5s0
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99




/etc/NetworkManager/system-connections/bond0.nmconnection:
[connection]
id=bond0
type=bond
interface-name=bond0
autoconnect=true
connection.autoconnect-slaves=1
autoconnect-priority=99

[bond]
mode=802.3ad
miimon=100

[ipv4]
method=auto
dhcp-timeout=2147483647

[ipv6]
method=auto
dhcp-timeout=2147483647
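
A hypothetical spot check (not part of the verification output below) to confirm the priority applies on a node; given the keyfiles above it should print 99 for each custom profile:

$ nmcli -g connection.autoconnect-priority connection show bond0
99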



May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + for connection in $(nmcli -g NAME c | grep -- "$MANAGED_NM_CONN_SUFFIX")
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + local conn=enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + local active_state=
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + '[' '' '!=' activated ']'
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 1 to bring up connection enp5s0-slave-ovs-clone'
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Attempt 1 to bring up connection enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + nmcli conn up enp5s0-slave-ovs-clone
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Error: Connection activation failed: Unknown reason
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: Hint: use 'journalctl -xe NM_CONNECTION=a9b3f226-7435-47bd-a431-d106c8fd8f39 + NM_DEVICE=enp5s0' to get more details.
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + s=4
May 24 09:54:06 master-0-2 configure-ovs.sh[3001]: + sleep 5
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 2 to bring up connection enp5s0-slave-ovs-clone'
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: Attempt 2 to bring up connection enp5s0-slave-ovs-clone
May 24 09:54:11 master-0-2 configure-ovs.sh[3001]: + nmcli conn up enp5s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/20)
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + s=0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + break
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Brought up connection enp5s0-slave-ovs-clone successfully'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Brought up connection enp5s0-slave-ovs-clone successfully
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod enp5s0-slave-ovs-clone connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + for connection in $(nmcli -g NAME c | grep -- "$MANAGED_NM_CONN_SUFFIX")
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show enp6s0-slave-ovs-clone
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' activated '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Connection enp6s0-slave-ovs-clone already activated'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection enp6s0-slave-ovs-clone already activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod enp6s0-slave-ovs-clone connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show ovs-if-phys0
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' activated '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Connection ovs-if-phys0 already activated'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Connection ovs-if-phys0 already activated
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli c mod ovs-if-phys0 connection.autoconnect yes
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + activate_nm_conn ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local conn=ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: ++ nmcli -g GENERAL.STATE conn show ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + local active_state=
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + '[' '' '!=' activated ']'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + for i in {1..10}
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + echo 'Attempt 1 to bring up connection ovs-if-br-ex'
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: Attempt 1 to bring up connection ovs-if-br-ex
May 24 09:54:29 master-0-2 configure-ovs.sh[3001]: + nmcli conn up ovs-if-br-ex
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: Connection successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/21)
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + s=0
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + break
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + echo 'Brought up connection ovs-if-br-ex successfully'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: Brought up connection ovs-if-br-ex successfully
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + nmcli c mod ovs-if-br-ex connection.autoconnect yes
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' -f /etc/ovnk/extra_bridge ']'
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + handle_exit
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + e=0
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + '[' 0 -eq 0 ']'


May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + ip route show
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: default via 192.168.123.1 dev br-ex proto dhcp metric 48
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: 192.168.123.0/24 dev br-ex proto kernel scope link src 192.168.123.97 metric 48
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: + ip -6 route show
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: ::1 dev lo proto kernel metric 256 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fd2e:6f44:5dd8::4d dev br-ex proto kernel metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fd2e:6f44:5dd8::/64 dev br-ex proto ra metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fe80::/64 dev br-ex proto kernel metric 48 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: fe80::/64 dev genev_sys_6081 proto kernel metric 256 pref medium
May 24 09:55:16 master-0-2 configure-ovs.sh[3001]: default via fe80::5054:ff:fe97:2978 dev br-ex proto ra metric 48 pref medium

Comment 31 Jaime Caamaño Ruiz 2022-06-08 13:42:43 UTC
*** Bug 2089943 has been marked as a duplicate of this bug. ***

Comment 33 errata-xmlrpc 2022-08-10 10:50:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

