Bug 1967355

Summary:	vsphere IPI - local dns prepender is not prepending nameserver 127.0.0.1
Product:	OpenShift Container Platform	Reporter:	OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component:	Installer	Assignee:	aos-install
Installer sub component:	openshift-installer	QA Contact:	jima
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	unspecified	CC:	mstaeble
Version:	4.7
Target Milestone:	---
Target Release:	4.7.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The bootstrap machine when installing to vSphere may not get its /etc/resolv.conf updated to include 127.0.0.1 as a nameserver. Consequence: The bootstrap machine is unable to access the temporary control plane that it creates. This results in a failed installation. Fix: Adjust the 30-local-dns-prepender NetworkManager dispatcher so that the sed command more reliably finds the line after which to add the nameserver line. Result: The bootstrap machine is able to access its temporary control plane, and the installation succeeds.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-06-29 04:19:45 UTC	Type:	---
Bug Depends On:	1966862
Bug Blocks:

Description OpenShift BugZilla Robot 2021-06-03 00:55:30 UTC

+++ This bug was initially created as a clone of Bug #1966862 +++

Created attachment 1788616 [details]
log of the openshift-install on vsphere

Version:

$ openshift-install version
openshift-install 4.7.7
built from commit fae650e24e7036b333b2b2d9dfb5a08a29cd07b1
release image quay.io/openshift-release-dev/ocp-release@sha256:aee8055875707962203197c4306e69b024bea1a44fa09ea2c2c621e8c5000794


Platform:

vSphere 7.0U2 with IPI



What happened?

bootkube.sh logs many name lookup errors for api-int and never completes on the bootstrap node. The bootstrap node is not removed because the script never reports that the bootstrap process is complete.

Jun 02 04:34:59 localhost bootkube.sh[2381]: E0602 04:34:59.344418       1 reflector.go:138] k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.ocp4.lab.io:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0": dial tcp: lookup api-int.ocp4.lab.io on 192.168.1.1:53: no such host
Jun 02 04:35:24 localhost bootkube.sh[2381]: E0602 04:35:24.639676       1 reflector.go:138] k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.ocp4.lab.io:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0": dial tcp: lookup api-int.ocp4.lab.io on 192.168.1.1:53: no such host

The NM dispatcher 30-local-dns-prepender is not adding 'nameserver 127.0.0.1' to /etc/resolv.conf consistently.

[root@localhost ~]# journalctl -u NetworkManager-dispatcher --no-pager
-- Logs begin at Wed 2021-06-02 04:13:54 UTC, end at Wed 2021-06-02 04:32:07 UTC. --
Jun 02 04:14:02 localhost systemd[1]: Starting Network Manager Script Dispatcher Service...
Jun 02 04:14:02 localhost systemd[1]: Started Network Manager Script Dispatcher Service.
Jun 02 04:14:03 localhost nm-dispatcher[1720]: <13>Jun  2 04:14:03 root: NM local-dns-prepender triggered by ens192 up.
Jun 02 04:14:03 localhost nm-dispatcher[1720]: <13>Jun  2 04:14:03 root: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf
Jun 02 04:14:03 localhost root[1771]: NM local-dns-prepender: Looking for '# Generated by NetworkManager' in /etc/resolv.conf to place 'nameserver 127.0.0.1'
Jun 02 04:14:03 localhost nm-dispatcher[1720]: <13>Jun  2 04:14:03 root: NM local-dns-prepender: Looking for '# Generated by NetworkManager' in /etc/resolv.conf to place 'nameserver 127.0.0.1'
Jun 02 04:14:16 localhost systemd[1]: NetworkManager-dispatcher.service: Succeeded.
[root@localhost ~]# 
[root@localhost ~]# ls -l /etc/resolv.conf 
-rw-r--r--. 1 root root 79 Jun  2 04:14 /etc/resolv.conf
[root@localhost ~]# 
[root@localhost ~]# cat /etc/resolv.conf 
# Generated by NetworkManager
search ocp4.lab.io lab.io
nameserver 192.168.1.1
[root@localhost ~]# 

The master nodes are brought up but manual intervention is necessary to get the cluster deployed correctly.


What did you expect to happen?

/etc/resolv.conf should have 127.0.0.1 as its first entry so that containers can resolve the new cluster's domain and subdomains. Then bootkube.sh should exit cleanly, completing the bootstrap process successfully.
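
A minimal sketch of how the expected state could be checked, assuming a POSIX shell with awk available. The helper names first_nameserver and local_dns_is_first are hypothetical and are not part of the installer; the sample files are illustrative copies of resolv.conf listings from this report, written to a scratch directory so that the real /etc/resolv.conf is not touched.

```shell
#!/bin/sh
# Hypothetical helper: print the address on the first "nameserver" line
# of a resolv.conf-style file.
first_nameserver() {
    awk '$1 == "nameserver" { print $2; exit }' "$1"
}

# Hypothetical helper: return 0 if the first nameserver is the local
# DNS IP (127.0.0.1), non-zero otherwise.
local_dns_is_first() {
    [ "$(first_nameserver "$1")" = "127.0.0.1" ]
}

# Illustrative sample files modelled on resolv.conf listings in this report.
sample_dir=$(mktemp -d)
printf '%s\n' '# Generated by NetworkManager' 'search ocp4.lab.io lab.io' \
    'nameserver 192.168.1.1' > "$sample_dir/broken.conf"
printf '%s\n' '# Generated by NetworkManager' 'nameserver 127.0.0.1' \
    'search us-west-2.compute.internal' 'nameserver 10.3.192.12' \
    > "$sample_dir/fixed.conf"

for f in "$sample_dir/broken.conf" "$sample_dir/fixed.conf"; do
    if local_dns_is_first "$f"; then
        echo "$f: local DNS is first"
    else
        echo "$f: first nameserver is $(first_nameserver "$f")"
    fi
done
```

On the bootstrap node the same check could be pointed at /etc/resolv.conf directly, for example: local_dns_is_first /etc/resolv.conf && echo ok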



How to reproduce it (as minimally and precisely as possible)?

I reproduced this on the bootstrap node by restarting the NetworkManager service, which triggers the dispatcher, and then checking /etc/resolv.conf to confirm that the line 'nameserver 127.0.0.1' was not added.

I also reproduced this by copying the 30-local-dns-prepender script to /etc/NetworkManager/dispatcher.d in a separate CentOS 8 VM. Restarting the NetworkManager service, or even the VM, leads to the same result: 'nameserver 127.0.0.1' is not added to /etc/resolv.conf.


Anything else we need to know?

I was able to fix this by editing /etc/NetworkManager/dispatcher.d/30-local-dns-prepender on the bootstrap node early in the process and changing the line with the sed command, removing the dot, star and dollar sign from the pattern.

It seems that those pattern characters are being expanded during the execution of sed, so the command is not applied in place.

from:
  sed -i "/^# Generated by.*$/a nameserver $DNS_IP" /etc/resolv.conf

to:
  sed -i "/^# Generated by/a nameserver $DNS_IP" /etc/resolv.conf

Once I removed those pattern characters from the sed command and restarted the NetworkManager service, the dispatcher ran and 'nameserver 127.0.0.1' was correctly added to /etc/resolv.conf.

That allowed bootkube.sh to complete successfully on the bootstrap node.
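
The modified command can be tried safely outside the bootstrap node. The following is a sketch only, assuming GNU sed (the one-line form of the "a" command and -i without a suffix) and a scratch file created with mktemp in place of the real /etc/resolv.conf. The sample contents are copied from the resolv.conf listing earlier in this description.

```shell
#!/bin/sh
# Assumption: GNU sed. BSD/macOS sed uses a different syntax for -i and "a".
DNS_IP=127.0.0.1

# Work on a scratch copy so the real /etc/resolv.conf is never modified.
scratch=$(mktemp)
printf '%s\n' '# Generated by NetworkManager' 'search ocp4.lab.io lab.io' \
    'nameserver 192.168.1.1' > "$scratch"

# Same sed expression as the modified dispatcher line, pointed at the scratch file:
# append "nameserver $DNS_IP" after the line starting with "# Generated by".
sed -i "/^# Generated by/a nameserver $DNS_IP" "$scratch"

cat "$scratch"
```

With GNU sed, the scratch file should end up with 'nameserver 127.0.0.1' as its second line, directly below the '# Generated by NetworkManager' header and ahead of the existing nameserver entry.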

--- Additional comment from oaliasbo on 2021-06-02 15:01:45 UTC ---

I submitted the following PR

https://github.com/openshift/installer/pull/4973

I am proposing the removal of the pattern '.*$' to prevent expansion.

At some point after bootkube.sh times out, the localhost entry appears in /etc/resolv.conf. But it is added too late in the process, so the bootstrap node is not removed and the log shows that the bootstrap failed to complete.

This does not prevent the master and worker nodes from being created successfully, but manual intervention is required to complete the installation because the kube-apiserver operator gets stuck.
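
To measure how late the localhost entry arrives, a small polling loop can be used for diagnosis. This is a sketch under assumptions, not part of the installer or the PR: the function name wait_for_nameserver is hypothetical, it assumes a POSIX shell with grep, and it polls once per second.

```shell
#!/bin/sh
# Hypothetical diagnostic helper: poll a resolv.conf-style file until an exact
# "nameserver <ip>" line appears or the timeout (in seconds) expires.
# Prints how long it waited; returns 0 if the line was found, 1 on timeout.
wait_for_nameserver() {
    file=$1
    ip=$2
    timeout=$3
    waited=0
    # -F: treat the pattern as a fixed string, -x: match the whole line.
    while ! grep -qxF "nameserver $ip" "$file" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${waited}s"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "found after ${waited}s"
}
```

For example, running wait_for_nameserver /etc/resolv.conf 127.0.0.1 1800 on the bootstrap node right after boot would report roughly when the entry shows up; the 1800-second timeout is an arbitrary illustrative value.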

Comment 3 jima 2021-06-15 02:29:30 UTC

The issue occurs when pfSense provides the DHCP and DNS services, per https://bugzilla.redhat.com/show_bug.cgi?id=1966862#c5, and QE does not have such an environment.
I did regression testing on VMC with nightly build 4.7.0-0.nightly-2021-06-12-151209, which contains the fix.

On bootstrap server:
# cat /etc/NetworkManager/dispatcher.d/30-local-dns-prepender | grep sed
        sed -i "/^# Generated by/a nameserver $DNS_IP" /etc/resolv.conf

# cat /etc/resolv.conf 
# Generated by NetworkManager
nameserver 127.0.0.1
search us-west-2.compute.internal
nameserver 10.3.192.12

Finally, the bootstrap server was removed successfully and the cluster installation completed.
$ ./openshift-install create cluster --dir ipi1 --log-level debug
......
......
INFO Install complete!                            
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/jima/temp/4.7.0-0.nightly-2021-06-12-151209/ipi1/auth/kubeconfig' 
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.jima1967355.qe.devcluster.openshift.com 
INFO Login to the console with user: "kubeadmin", and password: "pg7NN-oRvf9-VFrvN-AYcYW" 
DEBUG Time elapsed per stage:                      
DEBUG     Infrastructure: 1m42s                    
DEBUG Bootstrap Complete: 11m50s                   
DEBUG                API: 2m15s                    
DEBUG  Bootstrap Destroy: 18s                      
DEBUG  Cluster Operators: 18m18s                   
INFO Time elapsed: 33m23s 

Based on https://bugzilla.redhat.com/show_bug.cgi?id=1966862#c5 and my testing, I am moving the bug to VERIFIED.

Comment 4 OpenShift Automated Release Tooling 2021-06-17 12:29:08 UTC

OpenShift engineering has decided not to ship Red Hat OpenShift Container Platform 4.7.17 due to a regression: https://bugzilla.redhat.com/show_bug.cgi?id=1973006. All the fixes that were part of 4.7.17 will now be part of 4.7.18, which is planned to be available in the candidate channel on June 23, 2021 and in the fast channel on June 28, 2021.

Comment 8 errata-xmlrpc 2021-06-29 04:19:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2502