Bug 1857169 - coredns container misses localhost entry in /etc/resolv.conf
Summary: coredns container misses localhost entry in /etc/resolv.conf
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Roy Golan
QA Contact: Jan Zmeskal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-15 10:27 UTC by Jan Zmeskal
Modified: 2020-10-27 16:14 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:14:38 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub openshift/installer pull 3892 (closed): "Bug 1857169: ovirt: Use NetworkManager instead of dhclient in bootstrap", last updated 2020-10-13 11:11:06 UTC
- Red Hat Product Errata RHBA-2020:4196, last updated 2020-10-27 16:14:57 UTC

Description Jan Zmeskal 2020-07-15 10:27:37 UTC
Description of problem:
Because RHCOS 4.6 dropped the dhclient binary, the localhost entry (nameserver 127.0.0.1) is no longer prepended to /etc/resolv.conf in the coredns container. As a result, the installation fails at the bootstrap stage because api-int cannot be resolved.

Here is the specific error message from the bootstrap machine's journal:
Jul 15 10:23:00 <fqdn> bootkube.sh[2272]: E0715 10:23:00.361298       1 reflector.go:178] k8s.io/client-go.3/tools/cache/reflector.go:125: Failed to list *v1.Etcd: Get "https://api-int.<domain>:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0": dial tcp: lookup api-int.<domain> on <dns_server_ip>:53: no such host
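When reproducing this, it can help to check the bootstrap journal for this specific failure signature instead of reading the stream by eye. The following is a minimal sketch: the function name `has_api_int_lookup_failure` is hypothetical, and the pattern is derived only from the message above, so other DNS failure modes will not match it.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the log text on stdin contains the
# "lookup api-int.<domain> on <server>:53: no such host" failure seen above.
has_api_int_lookup_failure() {
  grep -Eq 'lookup api-int\.[^ ]+ on [^ ]+:53: no such host'
}
```

For example, on the bootstrap machine: journalctl -b -u bootkube.service | has_api_int_lookup_failure && echo "api-int lookup failed"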

More details in this thread: https://coreos.slack.com/archives/CNSJG0ZED/p1594736797433300


Version-Release number of the following components:
openshift-install-linux-4.6.0-0.nightly-2020-07-15-004428
RHCOS 46.82.202007051540-0

How reproducible:
100%

Steps to Reproduce:
1. openshift-install create cluster
2. Wait for bootstrap machine to be up
3. On the bootstrap machine, run: journalctl -b -f -u release-image.service -u bootkube.service

Actual results:
Installation fails

Comment 1 Roy Golan 2020-07-15 11:35:38 UTC
The reason is that RHCOS 4.6 doesn't contain the dhclient binary, so /etc/resolv.conf is missing its first nameserver entry, 127.0.0.1, which should point at coredns.

In other words, /etc/dhcp/dhclient.conf is simply ignored.

The solution is to switch to a NetworkManager script that prepends that nameserver to /etc/resolv.conf (only on the bootstrap machine; the other nodes already use this approach).
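A NetworkManager dispatcher script along these lines could do the prepending. This is a minimal sketch, not the script from the linked pull request: the install path `/etc/NetworkManager/dispatcher.d/30-local-dns-prepender`, the function name `prepend_local_nameserver`, and the set of actions it reacts to are assumptions. NetworkManager runs executable scripts in `dispatcher.d` with the interface name as the first argument and the action (for example `up` or `dhcp4-change`) as the second.

```shell
#!/bin/bash
# Sketch of a NetworkManager dispatcher script (hypothetical path:
# /etc/NetworkManager/dispatcher.d/30-local-dns-prepender, root-owned, mode 0755).
# NetworkManager passes $1 = interface name, $2 = action.

# Make "nameserver 127.0.0.1" the first nameserver in the given file.
# Idempotent: does nothing if it is already first, and drops any later
# copy of the entry so it is listed only once.
prepend_local_nameserver() {
  local resolv="$1" first tmp
  first=$(grep -m1 '^nameserver' "$resolv" 2>/dev/null || true)
  if [ "$first" = "nameserver 127.0.0.1" ]; then
    return 0
  fi
  tmp=$(mktemp) || return 1
  {
    echo "nameserver 127.0.0.1"
    grep -v '^nameserver 127\.0\.0\.1$' "$resolv" 2>/dev/null || true
  } > "$tmp"
  # Overwrite in place, keeping the original file's inode and permissions.
  cat "$tmp" > "$resolv"
  rm -f "$tmp"
}

# Assumed trigger actions: interface up and DHCP lease changes.
case "${2:-}" in
  up|dhcp4-change|dhcp6-change)
    prepend_local_nameserver /etc/resolv.conf
    ;;
esac
```

Since NetworkManager may rewrite /etc/resolv.conf whenever DNS settings change, the sketch re-applies the entry on each relevant event and is written to be safe to run repeatedly.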

Comment 4 Jan Zmeskal 2020-07-16 11:08:40 UTC
Verified with: openshift-install-linux-4.6.0-0.ci-2020-07-16-011059

Verification steps:
1. Run OCP4.6 installation
2. Make sure it finishes successfully
3. During bootstrap:
3.1 ssh core@<bastion_vm>
3.2 crictl ps
3.3 crictl exec -it <coredns_container_id> sh
3.4 cat /etc/resolv.conf
The first nameserver must be 127.0.0.1.
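Steps 3.2 to 3.4 can also be scripted. This is a minimal sketch, assuming it runs as root on the bootstrap machine and that the coredns container's name matches `coredns` in `crictl ps` output; the function names are hypothetical. It uses `crictl ps --name` to filter containers, `-q` to print only container IDs, and a non-interactive `crictl exec` in place of the interactive shell.

```shell
#!/bin/bash
# Print the address of the first "nameserver" line in resolv.conf text on stdin.
first_nameserver() {
  awk '$1 == "nameserver" { print $2; exit }'
}

# Succeed only if the first nameserver read from stdin is 127.0.0.1.
first_nameserver_is_local() {
  [ "$(first_nameserver)" = "127.0.0.1" ]
}

# Check /etc/resolv.conf inside every running container whose name matches "coredns".
check_coredns_containers() {
  local ids id rc=0
  ids=$(crictl ps --name coredns -q)
  if [ -z "$ids" ]; then
    echo "no running coredns container found" >&2
    return 1
  fi
  for id in $ids; do
    if crictl exec "$id" cat /etc/resolv.conf | first_nameserver_is_local; then
      echo "$id: OK"
    else
      echo "$id: first nameserver is not 127.0.0.1" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Running check_coredns_containers during bootstrap then reports each matching container, and returns non-zero if none is found or any of them lacks the localhost entry in first position.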

Comment 6 errata-xmlrpc 2020-10-27 16:14:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

