Bug 2018003

Summary: ovs-configuration service fails when external network is configured via dnsmasq on a bond device on a baremetal IPI deployment
Product: OpenShift Container Platform Reporter: Will Russell <wrussell>
Component: Networking Assignee: Jaime Caamaño Ruiz <jcaamano>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED DUPLICATE Docs Contact:
Severity: urgent    
Priority: unspecified CC: bpickard, jcaamano, rbrattai, rhowe, vpickard, vvoronko
Version: 4.7   
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-11-03 18:47:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Will Russell 2021-10-27 22:19:09 UTC
Description of problem:

OCP 4.7.24
BareMetal IPI installation
Bond Network

Version-Release number of selected component (if applicable):


How reproducible:
80% of the time. The result is consistent per node on every attempt: master-1 ALWAYS succeeds, while master-0 and master-2 always fail (succeeds on one master node, fails on two).

Steps to Reproduce:
1. Configure the IPI manifests, prepare the dnsmasq values/service, and deploy the cluster with the defined bond interface details.
2. Bootstrap initializes the nodes; all 6 NICs are granted IP addresses and default routes successfully. Restart #1 succeeds and the NICs remain active. configure-ovs.sh kicks off during the provisioning step; when OVN comes online, the IP data for the interfaces is lost: the IP configuration for ens2f0 and ens2f1 is wiped, the network fails to provision, and the deployment subsequently times out.
3. SSH into the nodes via the provisioning network and observe the loss of network data, and that the configure-ovs.sh output differs between master-1 (succeeded/online) and master-0/2 (failed/offline).

Actual results:

Deployment fails; IP/routes are lost when OVS comes up for primary interface ens2f0.

Expected results:

ens2f0 should retain baremetal IP information and provision successfully

Additional info:

Case linked includes the following data:

Output for the failed service on master-0 and the successful service on master-1 for comparison, dnsmasq provisioning data, and more. We have been able to reproduce this behavior on roughly 10 separate installs with several changes, but each time the consistent issue is that the default route loses its network connectivity/IP/route information when OVS comes up. The deployment then hangs: OVS defines the bond as the primary connection, so a working link is required before the node can communicate, and the deploys fail out. We are unable to successfully re-run the script even after running a reset-ovs.sh script to rebuild the baseline connection. Every time the script at `/usr/local/bin/configure-ovs.sh` is run, the connections clear their values and the bond fails to provision.

https://github.com/openshift/machine-config-operator/blob/release-4.7/templates/common/_base/files/configure-ovs-network.yaml#L259

~~~
Oct 27 16:28:30 master-0 configure-ovs.sh[3721]: + grep manual
Oct 27 16:28:30 master-0 configure-ovs.sh[3721]: + nmcli c add type ovs-interface slave-type ovs-port conn.interface br-ex master ovs-port-br-ex con-name ovs-if-br-ex 802-3-ethernet.mtu 9100 802-3-ethernet.cloned>
Oct 27 16:28:30 master-0 configure-ovs.sh[3721]: Connection 'ovs-if-br-ex' (<UUID>) successfully added.
Oct 27 16:28:30 master-0 configure-ovs.sh[3721]: + counter=0
Oct 27 16:28:30 master-0 configure-ovs.sh[3721]: + '[' 0 -lt 5 ']'
Oct 27 16:28:30 master-0 configure-ovs.sh[3721]: + sleep 5
Oct 27 16:28:35 master-0 configure-ovs.sh[3721]: + nmcli --fields GENERAL.STATE conn show ovs-if-br-ex
Oct 27 16:28:35 master-0 configure-ovs.sh[3721]: + grep -i activated
Oct 27 16:28:35 master-0 configure-ovs.sh[3721]: + counter=1
Oct 27 16:28:35 master-0 configure-ovs.sh[3721]: + '[' 1 -lt 5 ']'
Oct 27 16:28:35 master-0 configure-ovs.sh[3721]: + sleep 5
Oct 27 16:28:40 master-0 configure-ovs.sh[3721]: + nmcli --fields GENERAL.STATE conn show ovs-if-br-ex
Oct 27 16:28:40 master-0 configure-ovs.sh[3721]: + grep -i activated
Oct 27 16:28:40 master-0 configure-ovs.sh[3721]: + counter=2
Oct 27 16:28:40 master-0 configure-ovs.sh[3721]: + '[' 2 -lt 5 ']'
Oct 27 16:28:40 master-0 configure-ovs.sh[3721]: + sleep 5
Oct 27 16:28:45 master-0 configure-ovs.sh[3721]: + nmcli --fields GENERAL.STATE conn show ovs-if-br-ex
Oct 27 16:28:45 master-0 configure-ovs.sh[3721]: + grep -i activated
Oct 27 16:28:45 master-0 configure-ovs.sh[3721]: + counter=3
Oct 27 16:28:45 master-0 configure-ovs.sh[3721]: + '[' 3 -lt 5 ']'
Oct 27 16:28:45 master-0 configure-ovs.sh[3721]: + sleep 5
Oct 27 16:28:50 master-0 configure-ovs.sh[3721]: + nmcli --fields GENERAL.STATE conn show ovs-if-br-ex
Oct 27 16:28:50 master-0 configure-ovs.sh[3721]: + grep -i activated
Oct 27 16:28:50 master-0 configure-ovs.sh[3721]: + counter=4
Oct 27 16:28:50 master-0 configure-ovs.sh[3721]: + '[' 4 -lt 5 ']'
Oct 27 16:28:50 master-0 configure-ovs.sh[3721]: + sleep 5
Oct 27 16:28:55 master-0 configure-ovs.sh[3721]: + nmcli --fields GENERAL.STATE conn show ovs-if-br-ex
Oct 27 16:28:55 master-0 configure-ovs.sh[3721]: + grep -i activated
Oct 27 16:28:55 master-0 configure-ovs.sh[3721]: + counter=5
Oct 27 16:28:55 master-0 configure-ovs.sh[3721]: + '[' 5 -lt 5 ']'
Oct 27 16:28:55 master-0 configure-ovs.sh[3721]: + echo 'WARN: OVS did not succesfully activate NM connection. Attempting to bring up connections'
Oct 27 16:28:55 master-0 configure-ovs.sh[3721]: WARN: OVS did not succesfully activate NM connection. Attempting to bring up connections
Oct 27 16:28:55 master-0 configure-ovs.sh[3721]: + counter=0
Oct 27 16:28:55 master-0 configure-ovs.sh[3721]: + '[' 0 -lt 5 ']'
Oct 27 16:28:55 master-0 configure-ovs.sh[3721]: + nmcli conn up ovs-if-br-ex
Oct 27 16:29:40 master-0 configure-ovs.sh[3721]: Error: Connection activation failed: IP configuration could not be reserved (no available address, timeout, etc.)
Oct 27 16:29:40 master-0 configure-ovs.sh[3721]: Hint: use 'journalctl -xe NM_CONNECTION=<UUID> + NM_DEVICE=br-ex' to get more details.
Oct 27 16:29:40 master-0 configure-ovs.sh[3721]: + sleep 5
~~~
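
For readers following the trace, the polling sequence visible above can be sketched roughly as follows: sleep 5 seconds, check the `ovs-if-br-ex` state, repeat up to five times, then fall back to `nmcli conn up`. This is reconstructed from the log output only and is not copied from configure-ovs.sh. The function names `is_activated` and `wait_for_activation` and the `SLEEP_SECS` variable are hypothetical, introduced here for illustration.

```shell
#!/bin/bash
# Hypothetical reconstruction of the wait loop seen in the trace above;
# NOT the actual configure-ovs.sh source.

# Delay between polls. The trace shows 5 seconds; overridable for dry runs.
SLEEP_SECS="${SLEEP_SECS:-5}"

# is_activated CONNECTION
# Succeeds once NetworkManager reports the connection profile as activated.
# The nmcli invocation is the one visible in the trace.
is_activated() {
  nmcli --fields GENERAL.STATE conn show "$1" | grep -qi activated
}

# wait_for_activation CONNECTION [MAX_TRIES]
# Polls up to MAX_TRIES times (default 5, as in the trace), sleeping before
# each check. Returns 0 as soon as the connection is activated, 1 otherwise.
wait_for_activation() {
  local conn="$1" max_tries="${2:-5}" counter=0
  while [ "$counter" -lt "$max_tries" ]; do
    sleep "$SLEEP_SECS"
    if is_activated "$conn"; then
      return 0
    fi
    counter=$((counter + 1))
  done
  return 1
}
```

Under these assumptions, `wait_for_activation ovs-if-br-ex || nmcli conn up ovs-if-br-ex` would mirror what the log shows on master-0: five unsuccessful polls between 16:28:30 and 16:28:55, followed by the explicit `nmcli conn up` attempt that fails at 16:29:40 with "IP configuration could not be reserved".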

Comment 1 Will Russell 2021-10-27 22:21:02 UTC
*** Bug 2013438 has been marked as a duplicate of this bug. ***

Comment 10 Will Russell 2021-11-03 18:47:11 UTC

*** This bug has been marked as a duplicate of bug 1975174 ***