# Version

```
$ openshift-install version
./openshift-install unreleased-master-3051-gaecb9201459e3e92f12344e78aaaf03b6550fde7-dirty
built from commit aecb9201459e3e92f12344e78aaaf03b6550fde7
release image registry.svc.ci.openshift.org/origin/release:4.5
```

# Platform

baremetal, from the assisted-installer, which is essentially automated UPI.

# Background

The local-dns-prepender ensures that during installation, the bootstrap host can resolve the cluster's own domain and subdomains using its own DNS server running on localhost:

https://github.com/openshift/installer/blob/80c458b/data/data/bootstrap/baremetal/files/etc/NetworkManager/dispatcher.d/30-local-dns-prepender

# What happened?

bootkube.sh never completes on my bootstrap host. I traced the failure to local-dns-prepender failing to maintain an entry in /etc/resolv.conf.

The last thing bootkube.sh does is run the cluster-etcd-operator, which checks the status of etcd on the new control plane and keeps retrying until it succeeds:

https://github.com/openshift/installer/blob/7e2b0ad/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L400-L409

That container logged errors like this one over and over, indicating that it could not resolve the new cluster's domain:

```
E0729 17:41:25.822693 1 reflector.go:178] k8s.io/client-go.3/tools/cache/reflector.go:125: Failed to list *v1.Etcd: Get https://api-int.demo.ocp.home:6443/apis/operator.openshift.io/v1/etcds?fieldSelector=metadata.name%3Dcluster&limit=500&resourceVersion=0: dial tcp: lookup api-int.demo.ocp.home on [2002:18e1:510b:0:12da:43ff:feba:19ce]:53: no such host
```

Looking in /etc/resolv.conf, I found that it did not include an entry for localhost, only the DNS host on my LAN. It listed both an IPv6 and an IPv4 address, and in the log statement above you can see the lookup using the IPv6 address. After I manually added localhost to resolv.conf and restarted the cluster-etcd-operator container, its query succeeded and it exited cleanly.
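As the dispatcher logs later in this report show, the prepender's core job is to look for NetworkManager's `# Generated by NetworkManager` header and place `nameserver 127.0.0.1` after it. A minimal, simplified sketch of that logic — not the actual script (see the link above); `prepend_local_dns` is a hypothetical name, and the file is taken as an argument so the sketch can be tried without touching the real /etc/resolv.conf:

```shell
#!/bin/bash
# Simplified sketch of the local-dns-prepender logic, not the actual
# dispatcher script. prepend_local_dns is a hypothetical name.
prepend_local_dns() {
    local resolv="$1"
    # If localhost is already listed, there is nothing to do.
    if grep -q '^nameserver 127\.0\.0\.1' "$resolv"; then
        return 0
    fi
    # Place the localhost entry directly after NetworkManager's header
    # comment so it precedes any DHCP-provided nameservers.
    sed -i '/^# Generated by NetworkManager/a nameserver 127.0.0.1' "$resolv"
}
```

The catch, described under "Cause" below, is not this logic itself but *when* it runs.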
# What you expected to happen?

/etc/resolv.conf should have localhost as its first entry so that containers such as the cluster-etcd-operator can resolve the new cluster's domain and subdomains. With that working, bootkube.sh should exit cleanly, enabling the bootstrap host to complete its work and reboot.

# Cause

local-dns-prepender ensures that localhost is the first entry in resolv.conf. It is triggered by an interface being activated. BUT on my network, a dhcp6 lease arrives after an interface is activated and local-dns-prepender has already run, and the lease includes a name server entry. (See the last three lines of the log excerpt below.) That causes NetworkManager to rewrite resolv.conf, discarding the changes made by local-dns-prepender.

Relevant logs from the bootstrap host:

```
Jul 29 19:17:00 localhost root[3347]: NM local-dns-prepender triggered by cni-podman0 up.
Jul 29 19:17:00 localhost nm-dispatcher[3250]: <13>Jul 29 19:17:00 root: NM local-dns-prepender triggered by cni-podman0 up.
Jul 29 19:17:00 localhost root[3348]: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf
Jul 29 19:17:00 localhost nm-dispatcher[3250]: <13>Jul 29 19:17:00 root: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf
Jul 29 19:17:00 localhost root[3352]: NM local-dns-prepender: Looking for '# Generated by NetworkManager' in /etc/resolv.conf to place 'nameserver 127.0.0.1'
Jul 29 19:17:00 localhost nm-dispatcher[3250]: <13>Jul 29 19:17:00 root: NM local-dns-prepender: Looking for '# Generated by NetworkManager' in /etc/resolv.conf to place 'nameserver 127.0.0.1'
Jul 29 19:17:00 localhost systemd[1]: iscsi.service: Unit cannot be reloaded because it is inactive.
Jul 29 19:17:00 localhost root[3359]: NM local-dns-prepender triggered by ens2 up.
Jul 29 19:17:00 localhost nm-dispatcher[3250]: <13>Jul 29 19:17:00 root: NM local-dns-prepender triggered by ens2 up.
Jul 29 19:17:00 localhost root[3360]: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf
Jul 29 19:17:00 localhost nm-dispatcher[3250]: <13>Jul 29 19:17:00 root: NM local-dns-prepender: Checking if local DNS IP is the first entry in resolv.conf
Jul 29 19:17:00 localhost root[3364]: NM local-dns-prepender: local DNS IP already is the first entry in resolv.conf
Jul 29 19:17:00 localhost nm-dispatcher[3250]: <13>Jul 29 19:17:00 root: NM local-dns-prepender: local DNS IP already is the first entry in resolv.conf
Jul 29 19:17:00 localhost NetworkManager[3260]: <info> [1596050220.5900] dhcp6 (ens2): activation: beginning transaction (timeout in 45 seconds)
Jul 29 19:17:00 localhost NetworkManager[3260]: <info> [1596050220.5935] dhcp6 (ens2): option dhcp6_name_servers => '2002:18e1:510b:0:12da:43ff:feba:19ce'
Jul 29 19:17:00 localhost NetworkManager[3260]: <info> [1596050220.5935] dhcp6 (ens2): state changed unknown -> bound
```

I suspect that even in an IPv4-only environment, a dhcp lease renewal would trigger this same bug by inducing NetworkManager to overwrite resolv.conf. That would only be seen if the lease renewal period were shorter than the time it takes a bootstrap host to finish its work. Such short renewal times are likely uncommon, but could reasonably be found in a user's network. The problem would also be exacerbated by a bootstrap host that for some reason took a long time to complete its work.

# How to reproduce

1. Initiate installation with the bare metal platform in a network environment that will either send a dhcp6 lease or send a dhcp4 lease renewal before bootkube.sh completes. Such an installation can be initiated using the new assisted-installer project, but the bug is not specific to that project.
2. Wait for two of the hosts to finish installation as masters, at which point the bootstrap host should reboot and become the third master.
3. Observe that the bootstrap host does not reboot or complete its own installation.
# Potential Fixes

1. **Catch more NetworkManager action types.** A normal host in a cluster is configured by the machine-config-operator to apply DNS config in response to more actions than just `up`. Following that example by expanding the list of actions in local-dns-prepender to include `dhcp4-change` and `dhcp6-change` would probably fix this bug. I tested this approach, and it resolved the issue.

2. **Use NetworkManager directly to add localhost DNS.** NetworkManager theoretically holds the "source of truth" for DNS servers and writes it out to resolv.conf. RHCOS hosts may be able to use NetworkManager to add localhost as the preferred DNS server instead of continuously modifying NetworkManager's output. But since this seems like an obvious approach that apparently was not taken, there may be a good reason why it is not viable.

# Anything else we need to know?

The openstack, ovirt, and vsphere platforms have similar or identical 30-local-dns-prepender scripts that may be susceptible to the same bug.
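The first fix described above (the one I tested) amounts to widening the action check in the dispatcher script. A sketch of the shape of that change — `handle_nm_event` is a hypothetical name standing in for the dispatcher script body; NetworkManager passes the interface and action as the first two arguments to dispatcher scripts:

```shell
#!/bin/bash
# Sketch of the first proposed fix: react not only to an interface
# coming up, but also to DHCP lease changes, since either event makes
# NetworkManager rewrite resolv.conf. handle_nm_event is a
# hypothetical name, not the real script's structure.
handle_nm_event() {
    local interface="$1" action="$2"
    case "$action" in
        up|dhcp4-change|dhcp6-change)
            echo "re-prepending local DNS after '$action' on $interface"
            ;;
        *)
            echo "ignoring '$action' on $interface"
            ;;
    esac
}
```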
Pull request: https://github.com/openshift/installer/pull/4015
Verified on version 4.6.0-0.nightly-2020-08-31-220837, with the dhcp lease time set to 2 minutes. Deployed the OCP cluster with the assisted-installer. The deployment of the masters completed successfully, and the bootstrap host pivoted successfully into a master.
```
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 DHCPREQUEST(eno2) 10.19.134.3 98:03:9b:61:7c:21
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 tags: known, eno2
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 DHCPACK(eno2) 10.19.134.3 98:03:9b:61:7c:21 openshift-master-1
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 requested options: 1:netmask, 2:time-offset, 6:dns-server, 12:hostname,
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 requested options: 15:domain-name, 26:mtu, 28:broadcast, 121:classless-static-route,
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 requested options: 3:router, 33:static-route, 40:nis-domain,
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 requested options: 41:nis-server, 42:ntp-server, 119:domain-search,
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 requested options: 249, 252, 17:root-path
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 next server: 10.19.134.1
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 sent size: 1 option: 53 message-type 5
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 sent size: 4 option: 54 server-identifier 10.19.134.1
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 sent size: 4 option: 51 lease-time 2m
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 sent size: 4 option: 58 T1 55s
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 sent size: 4 option: 59 T2 1m40s
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 sent size: 4 option: 1 netmask 255.255.255.128
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 sent size: 4 option: 28 broadcast 10.19.134.127
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 sent size: 30 option: 15 domain-name qe1.kni.lab.eng.bos.redhat.com
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 sent size: 18 option: 12 hostname openshift-master-1
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 sent size: 12 option: 6 dns-server 10.19.42.41, 10.11.5.19, 10.5.30.160
Sep 30 11:46:57 r640-u01.qe1.kni.lab.eng.bos.redhat.com dnsmasq-dhcp[27062]: 206316558 sent size: 4 option: 3 router 10.19.134.126
```

The above is from the dnsmasq log on the dhcp server (which is provided by dnsmasq). Note the lease time of 2m in the `option: 51 lease-time 2m` line.
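The 2-minute lease used for this verification can be configured in dnsmasq with a stanza along these lines (illustrative only; the range endpoints here are made up, while the netmask matches the log above):

```
# /etc/dnsmasq.d/short-lease.conf -- hand out 2-minute leases so a
# renewal arrives while the bootstrap host is still in bootkube.sh
dhcp-range=10.19.134.2,10.19.134.100,255.255.255.128,2m
```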
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196