Version: Deploying OpenShift 4.9.7
Platform: baremetal
Using AI (Assisted Installer) on ACM

What happened?

Trying to deploy a 3-node compact cluster, all on baremetal. The lab environment's main DNS has no record for the 'api-int.$cluster' address. The first 2 master nodes installed properly, but the bootstrap node stayed stuck as a bootstrap node forever. According to the bootkube.service logs, it was repeatedly trying and failing to resolve api-int.$cluster:

> Dec 03 20:34:07 cnfdf02.telco5gran.eng.rdu2.redhat.com bootkube.sh[26004]: Unable to connect to the server: dial tcp: lookup api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com on 10.11.5.19:53: no such host

That is correct insofar as the upstream DNS (10.11.5.19) indeed has no record for api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com. However, the internal DNS does:

$ dig @localhost api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com

; <<>> DiG 9.11.26-RedHat-9.11.26-4.el8_4 <<>> @localhost api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61870
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: d1e499aa081db818 (echoed)
;; QUESTION SECTION:
;api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com. IN A

;; ANSWER SECTION:
api-int.cnfdf02.telco5gran.eng.rdu2.redhat.com. 30 IN A 10.8.34.52

So the bootstrap node should be able to resolve this address. However, /etc/resolv.conf says:

> # Generated by NetworkManager
> search telco5gran.eng.rdu2.redhat.com
> nameserver 10.11.5.19

After some conversation in Slack, it looks like there may be a race condition between NetworkManager bringing up the interface, the nm-dispatcher adding localhost to /etc/resolv.conf, and NetworkManager doing further processing that resets resolv.conf to only what is in the nmconnection file: https://coreos.slack.com/archives/CUPJTHQ5P/p1638577329289700?thread_ts=1638483562.237600&cid=CUPJTHQ5P

What did you expect to happen?

The bootstrap node should always have its own address in /etc/resolv.conf so it can always resolve api-int.$cluster and complete the install successfully.

How to reproduce it (as minimally and precisely as possible)?

Deploy a cluster with a static IPv4 configuration in an environment where there is no DNS record for 'api-int.$cluster'. Example nmconnection:

[connection]
id=eno1
uuid=60a1b8f8-d3de-44cc-a09e-72fd1e76c9c6
type=ethernet
interface-name=eno1
permissions=
autoconnect=true
autoconnect-priority=1

[ethernet]
mac-address-blacklist=

[ipv4]
address1=10.8.34.12/24
dhcp-client-id=mac
dns=10.11.5.19;
dns-priority=40
dns-search=telco5gran.eng.rdu2.redhat.com;
method=manual
route1=0.0.0.0/0,10.8.34.254
route1_options=table=254

[ipv6]
addr-gen-mode=eui64
dhcp-duid=ll
dhcp-iaid=mac
dns-search=
method=disabled

[proxy]

(Note: this was generated with `nmstate gc <config>`, but nmstate is not running on the node.)

Anything else we should know:

There are 2 workarounds:

- On a system where the bootstrap node is in this stuck state, running something trivial like "sudo nmcli device disconnect eno1; wait ; sudo nmcli device connect eno1" will cause the localhost entry to be re-added to resolv.conf, and the install proceeds.
- On a fresh install, adding "127.0.0.1" to the static DNS configuration will also let the install proceed (a sketch of the equivalent change follows below).
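For reference, here is a minimal sketch of what the second workaround amounts to when applied by hand on an already-booted node (on a fresh install the 127.0.0.1 entry would instead go into the static network config fed to the installer). The profile path under /etc/NetworkManager/system-connections/ and the reload/re-up steps are assumptions for illustration, not something captured from the affected node:

# Prepend the local resolver to the static DNS list in the generated profile
# shown above, so /etc/resolv.conf always contains 127.0.0.1 regardless of
# whether the dispatcher script gets to run. Profile path is an assumption.
sudo sed -i 's/^dns=10\.11\.5\.19;/dns=127.0.0.1;10.11.5.19;/' \
    /etc/NetworkManager/system-connections/eno1.nmconnection

# Reload the edited profile and reactivate the connection so NetworkManager
# regenerates /etc/resolv.conf from it.
sudo nmcli connection reload
sudo nmcli connection up eno1

grep nameserver /etc/resolv.conf    # should now list 127.0.0.1 first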
Created attachment 1844924 [details] bootkube.service.log
Can you please make the following modifications to the bug description:

- Remove references to "nmstate". nmstate is not being used here; these are just raw nmconnection files generated by the assisted service (using nmstate, but that's beside the point). In your case this is what gets generated, and this is what matters:

[connection]
id=eno1
uuid=60a1b8f8-d3de-44cc-a09e-72fd1e76c9c6
type=ethernet
interface-name=eno1
permissions=
autoconnect=true
autoconnect-priority=1

[ethernet]
mac-address-blacklist=

[ipv4]
address1=10.8.34.12/24
dhcp-client-id=mac
dns=10.11.5.19;
dns-priority=40
dns-search=telco5gran.eng.rdu2.redhat.com;
method=manual
route1=0.0.0.0/0,10.8.34.254
route1_options=table=254

[ipv6]
addr-gen-mode=eui64
dhcp-duid=ll
dhcp-iaid=mac
dns-search=
method=disabled

[proxy]

- Remove the .interfaces stanza from the yaml under "and with nmstate something like the following:"; it is assisted-installer specific and is not relevant to the problem. Only the content under ".config" is the actual nmstate config. And even then, please just state that the nmconnection file above is simply generated with `nmstate gc <config>` and that nmstate is not running on the node.

- Replace the workaround "I have a workaround: If I manually add "127.0.0.1" to the dns-resolver section of my nmstate, the install succeeds." with this workaround: "sudo nmcli device disconnect eno1; wait ; sudo nmcli device connect eno1". It shows that simply performing a no-op action on the interface triggers the dispatcher script, which then works as intended (an illustrative sketch of such a hook follows below).
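For illustration only, a minimal sketch of the kind of dispatcher hook being referred to. The actual script shipped on the bootstrap node is not shown in this report, so the file name, trigger condition, and logic below are assumptions:

#!/bin/bash
# Illustrative NetworkManager dispatcher hook (e.g. something under
# /etc/NetworkManager/dispatcher.d/); NOT the real script from the bootstrap
# image. NetworkManager calls dispatcher hooks with the interface name and the
# action as arguments, so a plain disconnect/connect of eno1 is enough to make
# the hook run again and repair /etc/resolv.conf.
IFACE="$1"
ACTION="$2"

if [ "$ACTION" = "up" ] && ! grep -q '^nameserver 127\.0\.0\.1' /etc/resolv.conf; then
    # Insert the node-local resolver above the first existing nameserver entry.
    sed -i '0,/^nameserver/s//nameserver 127.0.0.1\n&/' /etc/resolv.conf
fi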
*** Bug 2033550 has been marked as a duplicate of this bug. ***
This issue is not unique to baremetal. See https://bugzilla.redhat.com/show_bug.cgi?id=2033550 where the same issue is happening with vSphere.
*** Bug 2027836 has been marked as a duplicate of this bug. ***
The issue has happened several times recently against 4.10, both in QE CI and in manual installations. Is there any plan to fix it in 4.10?
Once this happens, the cluster cannot be set up successfully. Per comment 13, updating the severity to high.
We are researching who the correct assignee for this bz is.
upi-on-vsphere installation failed at the bootstrap stage when using nightly build 4.11.0-0.nightly-2022-04-24-085400 (containing the fix) or a later payload; it succeeded with 4.11.0-0.nightly-2022-04-23-153426.

Checked on the bootstrap instance: /etc/resolv.conf was not generated.

[root@bootstrap-0 ~]# ls -ltr /etc/resolv.conf
ls: cannot access '/etc/resolv.conf': No such file or directory

And rc-manager is configured as unmanaged:

[root@bootstrap-0 ~]# ls -ltr /etc/NetworkManager/conf.d/99-vsphere.conf
-rw-------. 1 root root 28 Apr 25 03:04 /etc/NetworkManager/conf.d/99-vsphere.conf
[root@bootstrap-0 ~]# cat /etc/NetworkManager/conf.d/99-vsphere.conf
[main]
rc-manager=unmanaged
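In case anyone else hits this on a vSphere UPI bootstrap node, a quick triage sketch. This is a diagnostic aid only, not the fix that later landed in the installer, and the nameserver value is a placeholder for whatever resolver the environment actually uses:

# With rc-manager=unmanaged NetworkManager never writes /etc/resolv.conf,
# so confirm that setting is present and then drop in a temporary file so
# name resolution works while debugging.
grep -H rc-manager /etc/NetworkManager/conf.d/*.conf

if [ ! -e /etc/resolv.conf ]; then
    echo 'nameserver 10.11.5.19' > /etc/resolv.conf    # placeholder resolver
fi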
The UPI bug was fixed by https://github.com/openshift/installer/pull/5842. This should be ready for testing again.
The vSphere UPI installation issue from comment 21 was fixed by https://github.com/openshift/installer/pull/5842 and verification passed; the UPI installation now completes without errors. The original issue described in this bug also happened occasionally on ipi-on-vsphere in QE CI (1-2 times per week). After PR installer#5482 was merged, I monitored QE CI for two weeks and no longer hit the issue in CI or in manual installations. The issue should be fixed; moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069