Description of problem:

Trying to install OpenShift 4.3.1 on VMware using the following OVA template:
https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.3/latest/rhcos-4.3.0-x86_64-vmware.ova

After the masters and workers boot, kubelet starts before the host receives its IP/hostname through DHCP. Kubelet then creates a node certificate with the name "localhost", and the bootstrap process finishes "successfully" (etcd and the API are up, but not all of the nodes). Because all nodes try to register themselves as "localhost", only one control plane node comes up and the installation cannot proceed.

Noticed that this service did not come up during boot:

  UNIT                                 LOAD   ACTIVE SUB    DESCRIPTION
* NetworkManager-wait-online.service   loaded failed failed Network Manager Wait Online

Feb 17 16:37:27 localhost systemd[1]: Starting Network Manager Wait Online...
Feb 17 16:37:57 localhost systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Feb 17 16:37:57 localhost systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Feb 17 16:37:57 localhost systemd[1]: Failed to start Network Manager Wait Online.
Feb 17 16:37:57 localhost systemd[1]: NetworkManager-wait-online.service: Consumed 25ms CPU time

Starting the service manually after boot does not produce any errors. If we restart the kubelet service, kubelet tries to register itself with the proper hostname. Maybe there is a race condition here?

As a workaround we recovered/recreated all control plane certificates and re-registered the nodes with kubelet, but this created many other problems: "oc logs" stopped working due to an unknown certificate, the "system:admin" kubeconfig from the installation no longer worked, etc. We are now looking for a root cause and a proper solution.

More information:
* hostname -f works fine; DNS entries are correct and resolving
* Reverse DNS is also working
* Recovery/workaround used: https://docs.openshift.com/container-platform/4.3/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html

Version-Release number:
OpenShift 4.3.1
Image: rhcos-4.3.0-x86_64-vmware.ova

How reproducible:

Steps to Reproduce:
1. Install OpenShift 4.3.1 on VMware using the provided OVA
2. IPs are assigned via DHCP (MAC/IP reservations)
3. Observe the masters and workers booting and creating CSRs for "localhost" (journalctl -f)

Actual results:
Nodes bootstrap themselves into the cluster as "localhost"

Expected results:
Nodes bootstrap themselves into the cluster with their hostname
Do you have a console log from one of the affected hosts?
I don't have it currently. Do you know of a way to get it in VMware? Or are there specific files that contain the logs you want (dmesg, journalctl, etc.)?
Can you fetch dmesg, journalctl and the contents of /etc/resolv.conf?
This is something we should investigate as part of the work to improve static IP networking on VMware in 4.5.
Do you recommend any workarounds? The customer can't install 4.x on their infrastructure because of this. Would using the bare metal installation with the raw image instead of the .ova work? Please note that we used DHCP and not kernel command-line options to set a static IP.
A likely workaround is to encode /etc/hostname in Ignition. But I agree we should probably be handling this by default.
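For reference, a minimal sketch of what encoding /etc/hostname in Ignition could look like, assuming an Ignition spec 2.2.0 config (the spec version used by OpenShift 4.3) and a hypothetical hostname "master-0.example.com". A fragment like this would have to be merged into each node's individual Ignition config (for example, the per-VM copy passed through guestinfo.ignition.config.data), since every node needs a different hostname:

{
  "ignition": { "version": "2.2.0" },
  "storage": {
    "files": [
      {
        "filesystem": "root",
        "path": "/etc/hostname",
        "mode": 420,
        "contents": { "source": "data:,master-0.example.com" }
      }
    ]
  }
}

Since Ignition writes the file before kubelet starts, the node should register with its real name even if DHCP/reverse DNS is slow during boot.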
Fix for this should be in 4.3.4. *** This bug has been marked as a duplicate of bug 1763700 ***
Hi Colin, the referenced bug 1763700 was fixed in 4.3.0. Is there a new bug to track this for 4.3.4?
For some reason that patch didn't actually make it into 4.3.0, but it should be in the next 4.3.x release (which is probably 4.3.5). I specifically verified that the fix is in https://openshift-release.svc.ci.openshift.org/releasestream/4.4.0-0.nightly/release/4.4.0-0.nightly-2020-03-05-142733
*** Bug 1839900 has been marked as a duplicate of this bug. ***