Bug 1876216

Summary: [oVirt] api-int not resolvable for a short period of time
Product: OpenShift Container Platform Reporter: Gal Zaidman <gzaidman>
Component: Machine Config Operator Assignee: Gal Zaidman <gzaidman>
Status: CLOSED ERRATA QA Contact: Lucie Leistnerova <lleistne>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.6 CC: eslutsky, jerzhang
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:38:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Gal Zaidman 2020-09-06 12:39:19 UTC
Description of problem:

In the oVirt e2e tests we noticed errors in the master and worker journals such as:
dial tcp: lookup api-int.ovirt1X.gcp.devcluster.openshift.com on 192.168.21X.1:53: no such host

see:
1. https://bugzilla.redhat.com/show_bug.cgi?id=1846529#c18
2. ovirt17-kcphn-worker-0-c2rd4 journal on CI job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1302427462155636736:
cat journal|grep "api-int.*1:53: no such host"|wc -l
945

On oVirt CI, 192.168.21X.1 is the upstream DNS server, and api-int is resolvable only through CoreDNS.
That means that at some points CoreDNS is not available to the node and the node tries to use the upstream DNS instead.
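
To check which server a node would consult first, one can read the first nameserver entry from resolv.conf, since resolvers normally try the listed servers in order. The following is a minimal sketch; the function name first_nameserver is hypothetical and not part of any existing tooling.

```shell
# Hypothetical helper: print the first nameserver listed in a resolv.conf file.
# Defaults to /etc/resolv.conf when no file is given.
first_nameserver() {
    awk '$1 == "nameserver" { print $2; exit }' "${1:-/etc/resolv.conf}"
}
```

On a setup like the one described here, seeing the upstream address instead of the CoreDNS address would mean api-int lookups are going to a server that cannot answer them.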

Additional thoughts:
The NetworkManager-resolve-prepender is responsible for adding the CoreDNS nameserver to resolv.conf. As we can see in https://bugzilla.redhat.com/show_bug.cgi?id=1846529#c28
"""
by looking at ovirt11-wz8kt-worker-0-5686p workers journal [0]:

during this lookup failure we can see :

# cat workers-journal | grep nm-dispatcher | grep 'worker-0-5686p' | grep resolv-prepender
Aug 16 02:45:15.570577 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]: <13>Aug 16 02:45:15 root: NM resolv-prepender triggered by ens3 dhcp4-change.
Aug 16 02:45:17.726649 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]: <13>Aug 16 02:45:17 root: NM resolv-prepender: Prepending 'nameserver 192.168.211.118' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)
"""

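The script takes 2s to finish. At the beginning of the script we copy /var/run/NetworkManager/resolv.conf to /etc/resolv.conf, and the CoreDNS nameserver is only prepended at the end. That means that for those 2 seconds /etc/resolv.conf points at the wrong DNS, which can lead to unexpected problems such as the lookup failures above.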
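
One possible way to close that window, sketched here under assumptions, is to build the complete file in a temporary location and rename it over the destination in a single step, so that /etc/resolv.conf never exists without the prepended entry. The function name prepend_nameserver and its arguments are hypothetical; this is not the actual resolv-prepender script and not necessarily how the issue was fixed.

```shell
# Hypothetical helper, not the actual resolv-prepender script.
# Usage: prepend_nameserver <nameserver-ip> <source-file> <destination-file>
prepend_nameserver() {
    ns="$1"; src="$2"; dst="$3"
    [ -r "$src" ] || return 1
    # Create the temp file next to the destination so the final mv is a
    # rename within one filesystem, which replaces the file in one step.
    tmp="$(mktemp "$(dirname "$dst")/resolv.conf.XXXXXX")" || return 1
    {
        echo "nameserver $ns"
        # Copy the source, dropping an existing identical entry.
        # grep exits 1 when it prints nothing, so that status is ignored.
        grep -v -x -F "nameserver $ns" "$src" || true
    } > "$tmp"
    # mktemp creates the file with mode 0600; make it world-readable.
    chmod 644 "$tmp"
    mv -f "$tmp" "$dst"
}
```

For the case in this report, the call would look like: prepend_nameserver 192.168.211.118 /var/run/NetworkManager/resolv.conf /etc/resolv.conf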

Comment 1 Gal Zaidman 2020-09-07 13:56:07 UTC
*** Bug 1876215 has been marked as a duplicate of this bug. ***

Comment 5 Lucie Leistnerova 2020-09-17 09:47:03 UTC
We no longer see the issue occurring in CI.
Verified in OCP 4.6.0-0.nightly-2020-09-17-031725 with RHV 4.4.0.3-1.el8

Comment 6 Ben Nemec 2020-09-28 20:39:41 UTC
*** Bug 1846529 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2020-10-27 16:38:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196