Bug 1876216 - [oVirt] api-int not resolvable for a short period of time
Summary: [oVirt] api-int not resolvable for a short period of time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Gal Zaidman
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Duplicates: 1846529 1876215
Depends On:
Blocks:
Reported: 2020-09-06 12:39 UTC by Gal Zaidman
Modified: 2020-10-27 16:38 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:38:24 UTC
Target Upstream Version:
Embargoed:



Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift machine-config-operator pull 2062 | 0 | None | closed | Bug 1876216: Check if /etc/resolv.conf is present before overriding | 2021-02-03 19:38:37 UTC
Red Hat Product Errata RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:38:38 UTC

Description Gal Zaidman 2020-09-06 12:39:19 UTC
Description of problem:

In the oVirt e2e tests we noticed errors like the following in the master/worker journals:
dial tcp: lookup api-int.ovirt1X.gcp.devcluster.openshift.com on 192.168.21X.1:53: no such host

see:
1. https://bugzilla.redhat.com/show_bug.cgi?id=1846529#c18
2. ovirt17-kcphn-worker-0-c2rd4 journal on CI job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1302427462155636736:
cat journal|grep "api-int.*1:53: no such host"|wc -l
945
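For scripted checks (for example, across many CI journals), the `grep | wc -l` pipeline above can be reproduced in a few lines. This is a minimal sketch that reuses exactly the same pattern. The helper name `count_lookup_failures` is hypothetical and not part of any OpenShift tooling.

```python
import re
from typing import Iterable

# Same pattern as the grep above: "api-int.*1:53: no such host".
LOOKUP_FAILURE = re.compile(r"api-int.*1:53: no such host")


def count_lookup_failures(lines: Iterable[str]) -> int:
    """Count lines matching the pattern (equivalent to grep ... | wc -l)."""
    return sum(1 for line in lines if LOOKUP_FAILURE.search(line))
```

Usage: `with open("journal") as f: print(count_lookup_failures(f))`. An open file object iterates line by line, so the whole journal is never loaded into memory.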

In oVirt CI, 192.168.21X.1 is the upstream DNS server, and api-int is resolvable only through CoreDNS.
This means that at some points CoreDNS is not available to the node, so the node queries the upstream DNS server instead.
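Resolvers generally try the `nameserver` entries in /etc/resolv.conf in the order listed. A "no such host" answer from the upstream server is treated as final and does not cause a retry against CoreDNS. What matters is therefore whether the CoreDNS address is listed first. The sketch below checks that for a given resolv.conf text. The function names are hypothetical. 192.0.2.1 is a documentation address standing in for the upstream server, and 192.168.211.118 is the address the prepender writes in the log quoted below.

```python
from typing import List


def nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses in the order they appear."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Comment lines start with '#' or ';', so their first field never matches.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def coredns_is_first(resolv_conf_text: str, coredns_ip: str) -> bool:
    """True if the CoreDNS address is the first nameserver the resolver will query."""
    servers = nameservers(resolv_conf_text)
    return bool(servers) and servers[0] == coredns_ip
```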

Additional thoughts:
The NetworkManager resolv-prepender dispatcher script is responsible for prepending the CoreDNS nameserver to /etc/resolv.conf. As we can see in https://bugzilla.redhat.com/show_bug.cgi?id=1846529#c28:
"""
by looking at ovirt11-wz8kt-worker-0-5686p workers journal [0]:

during this lookup failure we can see :

# cat workers-journal | grep nm-dispatcher | grep 'worker-0-5686p' | grep resolv-prepender
Aug 16 02:45:15.570577 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]: <13>Aug 16 02:45:15 root: NM resolv-prepender triggered by ens3 dhcp4-change.
Aug 16 02:45:17.726649 ovirt11-wz8kt-worker-0-5686p nm-dispatcher[302731]: <13>Aug 16 02:45:17 root: NM resolv-prepender: Prepending 'nameserver 192.168.211.118' to /etc/resolv.conf (other nameservers from /var/run/NetworkManager/resolv.conf)
 """
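The length of the window can be read off the two journal timestamps quoted above. The following is a small sketch that computes it. The journal prefix carries no year, so the helper takes one as a parameter. The default of 2020 is an assumption and does not affect the result for two lines on the same day. The name `window_seconds` is hypothetical.

```python
from datetime import datetime


def window_seconds(first_line: str, second_line: str, year: int = 2020) -> float:
    """Seconds between two journal lines of the form 'Mon DD HH:MM:SS.ffffff host ...'."""
    def timestamp(line: str) -> datetime:
        month, day, clock = line.split()[:3]
        # Supply a year explicitly so strptime does not have to guess one.
        return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S.%f")

    return (timestamp(second_line) - timestamp(first_line)).total_seconds()
```

For the "triggered" and "Prepending" lines above, this gives 2.156072 seconds.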

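The script takes about 2 seconds to finish (02:45:15.57 to 02:45:17.73 in the log above). At the beginning of the script, /var/run/NetworkManager/resolv.conf is copied to /etc/resolv.conf. For those 2 seconds, /etc/resolv.conf therefore lists only the upstream DNS server without the CoreDNS entry, and api-int lookups fail with "no such host".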
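The window comes from /etc/resolv.conf being overwritten in place before the CoreDNS entry is added. A common general technique for avoiding such a window is to build the complete new file under a temporary name and atomically rename it over the target. The following is a minimal Python sketch of that technique. It is not the change made in machine-config-operator pull 2062, whose title describes checking whether /etc/resolv.conf is present before overriding it. `prepend_nameserver` is a hypothetical helper. The sketch assumes the target is a regular file: if /etc/resolv.conf is a symlink, `os.replace` replaces the link itself.

```python
import os
import tempfile


def prepend_nameserver(nm_resolv_conf: str, target: str, nameserver: str) -> None:
    """Hypothetical helper: write 'nameserver <ip>' followed by the
    NetworkManager resolv.conf contents to `target` in one atomic step."""
    with open(nm_resolv_conf) as src:
        upstream = src.read()
    content = f"nameserver {nameserver}\n{upstream}"

    # Create the temp file next to the target so the rename stays on one
    # filesystem, which is what makes it atomic on POSIX.
    directory = os.path.dirname(os.path.abspath(target))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".resolv.conf.")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file with mode 0600
        # Readers see either the previous file or the complete new one,
        # never a half-written intermediate version.
        os.replace(tmp_path, target)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The design choice is that no reader ever opens /etc/resolv.conf while it contains only the upstream server copied from NetworkManager. The cost is one extra temporary file in /etc for the duration of the write.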

Comment 1 Gal Zaidman 2020-09-07 13:56:07 UTC
*** Bug 1876215 has been marked as a duplicate of this bug. ***

Comment 5 Lucie Leistnerova 2020-09-17 09:47:03 UTC
We don't see the issue occurring in CI anymore.
Verified in OCP 4.6.0-0.nightly-2020-09-17-031725 with RHV 4.4.0.3-1.el8

Comment 6 Ben Nemec 2020-09-28 20:39:41 UTC
*** Bug 1846529 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2020-10-27 16:38:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

