+++ This bug was initially created as a clone of Bug #1966862 +++

Created attachment 1788616 [details]
log of the openshift-install on vsphere

Version:
$ openshift-install version
openshift-install 4.7.7
built from commit fae650e24e7036b333b2b2d9dfb5a08a29cd07b1
release image quay.io/openshift-release-dev/ocp-release@sha256:aee8055875707962203197c4306e69b024bea1a44fa09ea2c2c621e8c5000794

Platform:
vSphere 7.0U2 with IPI

What happened?
bootkube.sh logs repeated lookup failures for api-int, and the bootstrap process never completes on the bootstrap node. The node is not removed because the script never reports that the bootstrap process is complete.

Jun 02 04:34:59 localhost bootkube.sh[2381]: E0602 04:34:59.344418 1 reflector.go:138] k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.ocp4.lab.io:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0": dial tcp: lookup api-int.ocp4.lab.io on 192.168.1.1:53: no such host
Jun 02 04:35:24 localhost bootkube.sh[2381]: E0602 04:35:24.639676 1 reflector.go:138] k8s.io/client-go.1/tools/cache/reflector.go:167: Failed to watch *v1.Etcd: failed to list *v1.Etcd: Get "https://api-int.ocp4.lab.io:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0": dial tcp: lookup api-int.ocp4.lab.io on 192.168.1.1:53: no such host

The NetworkManager dispatcher script 30-local-dns-prepender is not consistently adding 'nameserver 127.0.0.1' to /etc/resolv.conf.

[root@localhost ~]# journalctl -u NetworkManager-dispatcher --no-pager
-- Logs begin at Wed 2021-06-02 04:13:54 UTC, end at Wed 2021-06-02 04:32:07 UTC. --
Jun 02 04:14:02 localhost systemd[1]: Starting Network Manager Script Dispatcher Service...
Jun 02 04:14:02 localhost systemd[1]: Started Network Manager Script Dispatcher Service.
Jun 02 04:14:03 localhost nm-dispatcher[1720]: <13>Jun 2 04:14:03 root: NM local-dns-prepender triggered by ens192 up.
Jun 02 04:14:03 localhost nm-dispatcher[1720]: <13>Jun 2 04:14:03 root: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf
Jun 02 04:14:03 localhost root[1771]: NM local-dns-prepender: Looking for '# Generated by NetworkManager' in /etc/resolv.conf to place 'nameserver 127.0.0.1'
Jun 02 04:14:03 localhost nm-dispatcher[1720]: <13>Jun 2 04:14:03 root: NM local-dns-prepender: Looking for '# Generated by NetworkManager' in /etc/resolv.conf to place 'nameserver 127.0.0.1'
Jun 02 04:14:16 localhost systemd[1]: NetworkManager-dispatcher.service: Succeeded.

[root@localhost ~]# ls -l /etc/resolv.conf
-rw-r--r--. 1 root root 79 Jun 2 04:14 /etc/resolv.conf

[root@localhost ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search ocp4.lab.io lab.io
nameserver 192.168.1.1

The master nodes are brought up, but manual intervention is necessary to get the cluster deployed correctly.

What did you expect to happen?
/etc/resolv.conf should have 127.0.0.1 as its first nameserver entry so that containers can resolve the new cluster's domain and subdomains, and bootkube.sh should then exit cleanly, completing the bootstrap process successfully.

How to reproduce it (as minimally and precisely as possible)?
I reproduced this on the bootstrap node by restarting the NetworkManager service, which triggers the dispatcher, and then checking /etc/resolv.conf to confirm that the line 'nameserver 127.0.0.1' was not added.
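For anyone hitting the same symptom, a quick check of the resolver state on the bootstrap node could look roughly like the sketch below. This is illustrative only; command availability on RHCOS is an assumption, and the api-int.ocp4.lab.io domain is simply the one from this report.

# Is the local resolver the first nameserver? Expected: nameserver 127.0.0.1
awk '/^nameserver/ {print; exit}' /etc/resolv.conf

# Re-trigger the dispatcher and inspect its log
systemctl restart NetworkManager
journalctl -u NetworkManager-dispatcher --no-pager | grep local-dns-prepender

# Confirm the cluster's internal API name resolves (domain taken from this report)
getent hosts api-int.ocp4.lab.io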
I also reproduced this by copying the 30-local-dns-prepender script to /etc/NetworkManager/dispatcher.d on a separate CentOS 8 VM. Restarting the NetworkManager service, or even the VM, leads to the same result: 'nameserver 127.0.0.1' is not added to /etc/resolv.conf.

Anything else we need to know?
I was able to fix this by editing /etc/NetworkManager/dispatcher.d/30-local-dns-prepender early on the bootstrap node, changing the sed command to remove the dot, star, and dollar sign from the pattern. It seems those pattern characters are being expanded during the execution of sed, causing the in-place edit not to be applied.

from:
sed -i "/^# Generated by.*$/a nameserver $DNS_IP" /etc/resolv.conf

to:
sed -i "/^# Generated by/a nameserver $DNS_IP" /etc/resolv.conf

Once I removed those pattern characters from the sed command and restarted the NetworkManager service, the dispatcher ran and 'nameserver 127.0.0.1' was correctly added to /etc/resolv.conf. That allowed bootkube.sh to complete successfully on the bootstrap node.

--- Additional comment from oaliasbo on 2021-06-02 15:01:45 UTC ---

I submitted the following PR: https://github.com/openshift/installer/pull/4973
I am proposing the removal of the pattern '.*$' to prevent expansion.

At some point after the bootkube.sh timeout, localhost does appear in /etc/resolv.conf, but it is added too late in the process, so the bootstrap node is not removed and the log shows that bootstrap failed to complete. This does not prevent the master and worker nodes from being created successfully, but manual intervention is required to complete the installation because the kube-apiserver operator gets stuck.
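To compare the two sed invocations outside the dispatcher, they can be run against throwaway copies of resolv.conf, as in the sketch below. The /tmp file names and the DNS_IP value are chosen only for this test, and this does not replace testing inside the dispatcher environment, where the report above says the behaviour differs.

# Test both patterns on scratch copies, not on the live file.
DNS_IP=127.0.0.1
cp /etc/resolv.conf /tmp/resolv.pattern-old
cp /etc/resolv.conf /tmp/resolv.pattern-new

# Pattern as shipped in 30-local-dns-prepender:
sed -i "/^# Generated by.*$/a nameserver $DNS_IP" /tmp/resolv.pattern-old

# Simplified pattern from the fix:
sed -i "/^# Generated by/a nameserver $DNS_IP" /tmp/resolv.pattern-new

# Each copy should end up with 'nameserver 127.0.0.1' directly under the
# '# Generated by NetworkManager' comment; diff shows any divergence.
diff /tmp/resolv.pattern-old /tmp/resolv.pattern-new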
The issue happened when using pfSense to provide DHCP and DNS services, per https://bugzilla.redhat.com/show_bug.cgi?id=1966862#c5; QE does not have such an environment. I did regression testing of the fix on VMC with nightly build 4.7.0-0.nightly-2021-06-12-151209.

On the bootstrap server:

# cat /etc/NetworkManager/dispatcher.d/30-local-dns-prepender | grep sed
sed -i "/^# Generated by/a nameserver $DNS_IP" /etc/resolv.conf

# cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 127.0.0.1
search us-west-2.compute.internal
nameserver 10.3.192.12

Finally, the bootstrap server was removed successfully and the cluster installation completed.

$ ./openshift-install create cluster --dir ipi1 --log-level debug
......
......
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/jima/temp/4.7.0-0.nightly-2021-06-12-151209/ipi1/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.jima1967355.qe.devcluster.openshift.com
INFO Login to the console with user: "kubeadmin", and password: "pg7NN-oRvf9-VFrvN-AYcYW"
DEBUG Time elapsed per stage:
DEBUG Infrastructure: 1m42s
DEBUG Bootstrap Complete: 11m50s
DEBUG API: 2m15s
DEBUG Bootstrap Destroy: 18s
DEBUG Cluster Operators: 18m18s
INFO Time elapsed: 33m23s

According to https://bugzilla.redhat.com/show_bug.cgi?id=1966862#c5 and my testing, moving the bug to VERIFIED.
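As an additional pre-flight check (a hypothetical sketch, not part of the QE verification above), the generated bootstrap Ignition can be inspected for the fixed sed pattern before any VM is booted. This assumes jq is available and that the dispatcher script is embedded in bootstrap.ign as a base64 data URL, which may not hold for every release; adjust the --dir path as needed.

# Confirm the rendered bootstrap config carries the fixed sed line.
openshift-install create ignition-configs --dir ipi1
jq -r '.storage.files[] | select(.path | endswith("30-local-dns-prepender")) | .contents.source' ipi1/bootstrap.ign \
  | sed 's/^data:.*base64,//' | base64 -d | grep 'Generated by'
# Expected output contains: sed -i "/^# Generated by/a nameserver $DNS_IP" /etc/resolv.conf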
OpenShift engineering has decided not to ship Red Hat OpenShift Container Platform 4.7.17 due to a regression, https://bugzilla.redhat.com/show_bug.cgi?id=1973006. All the fixes that were part of 4.7.17 will now be part of 4.7.18, which is planned to be available in the candidate channel on June 23, 2021 and in the fast channel on June 28.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2502