Bug 1876216 - [oVirt] api-int not resolvable for a short period of time
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Gal Zaidman
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Duplicates: 1876215
Depends On:
Blocks:
Reported: 2020-09-06 12:39 UTC by Gal Zaidman
Modified: 2020-09-17 09:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:




Links
System: Github
ID: openshift machine-config-operator pull 2062
Priority: None
Status: closed
Summary: Bug 1876216: Check if /etc/resolv.conf is present before overriding
Last Updated: 2020-09-21 06:20:56 UTC

Description Gal Zaidman 2020-09-06 12:39:19 UTC
Description of problem:

In oVirt e2e tests we noticed errors in the master/worker journals such as:
dial tcp: lookup api-int.ovirt1X.gcp.devcluster.openshift.com on 192.168.21X.1:53: no such host

see:
1. https://bugzilla.redhat.com/show_bug.cgi?id=1846529#c18
2. ovirt17-kcphn-worker-0-c2rd4 journal on CI job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1302427462155636736:
cat journal|grep "api-int.*1:53: no such host"|wc -l
945
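
To see whether the failures cluster into short bursts rather than being spread evenly, the same grep can be bucketed per minute. This is only a sketch: the function name failures_per_minute is hypothetical, and it assumes every journal line starts with "Mon DD HH:MM:SS.micro" as in the excerpts quoted in this report.

```shell
# Count api-int lookup failures per minute in one or more journal files
# (or stdin when no file is given).
# Assumption: fields 1-3 of each line are month, day, and HH:MM:SS.micro.
failures_per_minute() {
    cat "$@" |
        grep 'api-int.*1:53: no such host' |
        awk '{ print $1, $2, substr($3, 1, 5) }' |
        sort | uniq -c
}
```

Running failures_per_minute journal prints one line per minute that contains at least one failure, prefixed by the number of failures in that minute.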

On oVirt CI, 192.168.21X.1 is the upstream DNS server, and api-int is resolvable only through CoreDNS.
That means that at some points CoreDNS is not available to the node, and the node tries to use the upstream DNS server instead.
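
For context: the resolver queries the servers in /etc/resolv.conf in the order they are listed (unless the rotate option is set), and a "no such host" answer from the first server is normally taken as final rather than retried against the next one. A minimal sketch for checking which server a node would ask first; the function name first_nameserver is hypothetical and not part of any OpenShift tooling.

```shell
# Print the address of the first "nameserver" entry in a
# resolv.conf-style file. Comment and "search" lines are skipped
# because their first field is not "nameserver".
first_nameserver() {
    awk '$1 == "nameserver" { print $2; exit }' "$1"
}
```

If first_nameserver /etc/resolv.conf prints the upstream address instead of the node's CoreDNS address, api-int lookups made at that moment would go to the upstream server.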

Additional thoughts:
The NetworkManager resolv-prepender script is responsible for adding the CoreDNS address to /etc/resolv.conf. As we can see in https://bugzilla.redhat.com/show_bug.cgi?id=1846529#c28:
"""
by looking at ovirt11-wz8kt-worker-0-5686p workers journal [0]:

during this lookup failure we can see :

# cat workers-journal | grep nm-dispatcher | grep 'worker-0-5686p' | grep resolv-prepender
Aug 16 02:45:15.570577 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]: <13>Aug 16 02:45:15 root: NM resolv-prepender triggered by ens3 dhcp4-change.
Aug 16 02:45:17.726649 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]: <13>Aug 16 02:45:17 root: NM resolv-prepender: Prepending 'nameserver 192.168.211.118' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)
 """

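The script takes about 2 seconds to finish. At its beginning it copies /var/run/NetworkManager/resolv.conf to /etc/resolv.conf, which means that for those 2 seconds /etc/resolv.conf lists only the upstream DNS servers, without the CoreDNS entry. During that window the node uses the wrong DNS server, which leads to unexpected problems such as the lookup failures above.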
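
The title of the linked pull request indicates that the fix checks whether /etc/resolv.conf is present before overriding it. The sketch below is not that change. It illustrates a different, generic way to avoid exposing an intermediate file: build the final content in a temporary file next to the target and rename it into place. The function name prepend_nameserver and its arguments are hypothetical.

```shell
# prepend_nameserver ADDR SOURCE TARGET
# Write TARGET as "nameserver ADDR" followed by the content of SOURCE,
# without TARGET ever holding SOURCE's content alone.
# Hypothetical helper, not the actual resolv-prepender script.
prepend_nameserver() {
    addr=$1
    src=$2
    target=$3

    # Create the temp file in the same directory as TARGET so the
    # final mv is a rename on one filesystem.
    tmp=$(mktemp "${target}.XXXXXX") || return 1

    {
        printf 'nameserver %s\n' "$addr"
        # Copy SOURCE, dropping an existing entry for ADDR so it is
        # not listed twice.
        awk -v a="$addr" '!($1 == "nameserver" && $2 == a)' "$src"
    } > "$tmp" || { rm -f "$tmp"; return 1; }

    chmod 644 "$tmp"
    mv -f "$tmp" "$target"
}
```

A call such as prepend_nameserver 192.168.211.118 /var/run/NetworkManager/resolv.conf /etc/resolv.conf would match the values in the log excerpt above. One caveat of this design: if the target is a symlink, mv replaces the link itself rather than the file it points to.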

Comment 1 Gal Zaidman 2020-09-07 13:56:07 UTC
*** Bug 1876215 has been marked as a duplicate of this bug. ***

Comment 5 Lucie Leistnerova 2020-09-17 09:47:03 UTC
We no longer see the issue occurring in CI.
Verified in OCP 4.6.0-0.nightly-2020-09-17-031725 with RHV 4.4.0.3-1.el8

