Bug 1857169

Summary: coredns container misses localhost entry in /etc/resolv.conf
Product: OpenShift Container Platform Reporter: Jan Zmeskal <jzmeskal>
Component: Installer Assignee: Roy Golan <rgolan>
Installer sub component: OpenShift on RHV QA Contact: Jan Zmeskal <jzmeskal>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: rgolan
Version: 4.6   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:14:38 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jan Zmeskal 2020-07-15 10:27:37 UTC
Description of problem:
Because RHCOS 4.6 dropped the dhclient binary, the localhost entry is no longer prepended to /etc/resolv.conf in the coredns container. As a result, installation fails in the bootstrap stage because api-int cannot be resolved.

Here is the specific error message from bootstrap's journalctl:
Jul 15 10:23:00 <fqdn> bootkube.sh[2272]: E0715 10:23:00.361298       1 reflector.go:178] k8s.io/client-go.3/tools/cache/reflector.go:125: Failed to list *v1.Etcd: Get "https://api-int.<domain>:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0": dial tcp: lookup api-int.<domain> on <dns_server_ip>:53: no such host

More details in this thread: https://coreos.slack.com/archives/CNSJG0ZED/p1594736797433300


Version-Release number of the following components:
openshift-install-linux-4.6.0-0.nightly-2020-07-15-004428
RHCOS 46.82.202007051540-0

How reproducible:
100%

Steps to Reproduce:
1. Run openshift-install create cluster
2. Wait for the bootstrap machine to be up
3. On the bootstrap machine, run: journalctl -b -f -u release-image.service -u bootkube.service

Actual results:
Installation fails

Comment 1 Roy Golan 2020-07-15 11:35:38 UTC
The reason is that RHCOS 4.6 does not contain the dhclient binary, so /etc/resolv.conf
is missing the first nameserver entry, 127.0.0.1, which should point at coredns.

In other words, /etc/dhcp/dhclient.conf is simply ignored.

The solution is to switch to a NetworkManager script that prepends that nameserver to /etc/resolv.conf
(only for bootstrap; the nodes already use that approach).
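
For illustration only, here is a minimal Python sketch of the text transformation such a script would need to perform: making 127.0.0.1 the first nameserver in the contents of resolv.conf. It is not the NetworkManager script used in the fix. The function name prepend_nameserver and the sample addresses are hypothetical.

```python
def prepend_nameserver(resolv_conf: str, ip: str = "127.0.0.1") -> str:
    """Return resolv.conf text with `nameserver <ip>` as the first nameserver.

    Hypothetical helper for illustration; it works on text and touches no files.
    """
    entry = f"nameserver {ip}"
    # Drop any existing copy of the entry so it is never duplicated.
    lines = [
        line for line in resolv_conf.splitlines()
        if line.split() != ["nameserver", ip]
    ]
    # Insert the entry just before the first remaining nameserver line.
    for index, line in enumerate(lines):
        if line.split()[:1] == ["nameserver"]:
            lines.insert(index, entry)
            break
    else:
        # No nameserver lines at all: append the entry at the end.
        lines.append(entry)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "search example.com\nnameserver 192.0.2.53\n"
    print(prepend_nameserver(sample), end="")
```

Other directives such as search are left in place, and applying the function twice gives the same result as applying it once.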

Comment 4 Jan Zmeskal 2020-07-16 11:08:40 UTC
Verified with: openshift-install-linux-4.6.0-0.ci-2020-07-16-011059

Verification steps:
1. Run an OCP 4.6 installation
2. Make sure it finishes successfully
3. During bootstrap:
3.1 ssh core@<bastion_vm>
3.2 crictl ps
3.3 crictl exec -it <coredns_container_id> cat /etc/resolv.conf
The first nameserver must be 127.0.0.1.
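
The final check could also be automated. Below is a rough Python sketch that takes the contents of /etc/resolv.conf (for example, output captured from the coredns container) and returns the first nameserver. The function name first_nameserver and the sample contents are hypothetical.

```python
from typing import Optional


def first_nameserver(resolv_conf: str) -> Optional[str]:
    """Return the address on the first `nameserver` line, or None if there is none."""
    for line in resolv_conf.splitlines():
        fields = line.split()
        # Comment and blank lines never have "nameserver" as their first field.
        if len(fields) >= 2 and fields[0] == "nameserver":
            return fields[1]
    return None


if __name__ == "__main__":
    sample = "search example.com\nnameserver 127.0.0.1\nnameserver 192.0.2.53\n"
    # The verification passes only when the first nameserver is the local coredns.
    assert first_nameserver(sample) == "127.0.0.1"
```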

Comment 6 errata-xmlrpc 2020-10-27 16:14:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196