Bug 1803962 - Installing on VMware, all nodes try to bootstrap themselves as "localhost"
Summary: Installing on VMware, all nodes try to bootstrap themselves as "localhost"
Keywords:
Status: CLOSED DUPLICATE of bug 1763700
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.3.z
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Ben Howard
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1839900
Depends On:
Blocks:
 
Reported: 2020-02-17 20:06 UTC by Hugo Cisneiros (Eitch)
Modified: 2023-09-07 21:55 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-05 18:14:04 UTC
Target Upstream Version:
Embargoed:



Description Hugo Cisneiros (Eitch) 2020-02-17 20:06:06 UTC
Description of problem:

Trying to install OpenShift 4.3.1 on VMware, using the following OVA template:

https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.3/latest/rhcos-4.3.0-x86_64-vmware.ova

After booting the masters and workers, kubelet starts before the host gets its IP/hostname through DHCP. Kubelet then requests a node certificate with the name "localhost", and the bootstrap process finishes "successfully" (etcd and the API are up, but not all of the nodes).

Because all of the nodes try to register themselves as "localhost", only one control plane node comes up and the installation can't proceed.
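
For reference, this is how the symptom looks from the admin side. A hedged sketch, assuming the installer-generated kubeconfig is available; the CSR name is a placeholder:

# Point oc at the installer-generated kubeconfig (path may differ)
export KUBECONFIG=./auth/kubeconfig

# Only a single "localhost" node appears instead of one node per machine
oc get nodes

# Inspecting a pending CSR shows the identity being requested; here the
# subject common name comes back as system:node:localhost
oc get csr
oc describe csr <csr-name> | grep -A2 'Subject:'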

We noticed that this service failed to come up during boot:

  UNIT                               LOAD   ACTIVE SUB    DESCRIPTION
* NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online

Feb 17 16:37:27 localhost systemd[1]: Starting Network Manager Wait Online...
Feb 17 16:37:57 localhost systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Feb 17 16:37:57 localhost systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Feb 17 16:37:57 localhost systemd[1]: Failed to start Network Manager Wait Online.
Feb 17 16:37:57 localhost systemd[1]: NetworkManager-wait-online.service: Consumed 25ms CPU time

Starting it manually after boot doesn't produce any errors.
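
Since NetworkManager-wait-online is essentially a wrapper around nm-online, the check can be reproduced by hand. A hedged sketch, assuming the default unit settings (roughly a 30-second timeout):

# Approximately what the unit runs at boot: quietly wait for NetworkManager
# to report that startup has finished
nm-online -s -q
echo $?   # 0 = network came up in time; 1 = the same timeout we hit at boot

# After boot the unit starts cleanly, matching what we observed
systemctl start NetworkManager-wait-online.service
systemctl status NetworkManager-wait-online.service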

If we restart the kubelet service, kubelet re-registers itself with the proper hostname. Maybe there's a race condition there?
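
Concretely, the manual recovery looks roughly like this. A hedged sketch, assuming the API is reachable and that deleting the bogus node record is acceptable in your environment:

# On each affected node, once the hostname has been set by DHCP
sudo systemctl restart kubelet

# From an admin host: drop the bogus record and approve the new
# per-node CSRs as they arrive
oc delete node localhost
oc get csr -o name | xargs oc adm certificate approve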

A workaround was done by recovering / recreating all control plane certificates and re-registering with kubelet, but this created a lot of other problems, like "oc logs" not working due to unknown certificate, "system:admin"'s kubeconfig from the installation didn't work, etc. Now we're looking for a root cause and solution.

More information:

* hostname -f works fine; DNS entries are OK and resolving;
* Reverse DNS is also working (see the sketch after this list);
* Recovery/Workaround used: https://docs.openshift.com/container-platform/4.3/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html
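
For completeness, the DNS checks behind the first two items, as a hedged sketch (master-0.example.com and 10.0.0.10 are placeholders for a real node name and IP):

hostname -f                      # returns the node FQDN, not localhost
dig +short master-0.example.com  # forward lookup resolves to the node IP
dig +short -x 10.0.0.10          # reverse lookup returns the FQDN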

Version-Release number:

OpenShift 4.3.1
Image: rhcos-4.3.0-x86_64-vmware.ova

How reproducible:

Steps to Reproduce:
1. Install OpenShift 4.3.1 using provided OVA in VMware
2. IPs are obtained from DHCP (MAC/IP reservation)
3. Observe masters and workers booting and creating CSRs for "localhost" (journalctl -f)
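
A hedged sketch of what step 3 looks like from a node console (the exact log lines vary; the grep pattern is illustrative):

# The hostname is still "localhost" until the DHCP lease lands
hostnamectl

# Watch kubelet come up and register under the wrong name
journalctl -u kubelet -f | grep -i localhost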

Actual results:

Nodes bootstrap themselves on the cluster as "localhost"

Expected results:

Nodes bootstrap themselves on the cluster with their hostname

Comment 3 Ben Howard 2020-02-17 23:16:26 UTC
Do you have a console log from one of the affected hosts?

Comment 4 Hugo Cisneiros (Eitch) 2020-02-18 15:52:53 UTC
I don't have one at the moment; do you know of any way to get it in VMware?

Or are there specific files that contain the logs you want (dmesg, journalctl, etc.)?

Comment 5 Ben Howard 2020-02-18 22:56:43 UTC
Can you fetch dmesg, journalctl and the contents of /etc/resolv.conf?

Comment 9 Micah Abbott 2020-02-26 20:49:29 UTC
This is something we should investigate as part of the work to improve static IP networking on VMware in 4.5.

Comment 10 Hugo Cisneiros (Eitch) 2020-02-26 22:04:40 UTC
Do you recommend any workarounds? The customer can't install 4.x on their infrastructure because of this.

Maybe using the bare metal installation and the raw image instead of the .ova would work? Please note that we used DHCP (with MAC/IP reservations) rather than kernel command line options to set static IPs.

Comment 11 Colin Walters 2020-02-26 22:10:30 UTC
A likely workaround is to encode /etc/hostname in Ignition. But I agree we should probably be handling this by default.
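
For anyone who needs it in the meantime, a minimal sketch of that workaround, assuming Ignition config spec 2.2.0 (the version 4.3-era RHCOS consumes) and a hypothetical hostname master-0.example.com; you would generate one such snippet per node and merge it into that node's Ignition config:

# Hypothetical per-node snippet; adjust the hostname for each machine
cat > master-0-hostname.ign <<'EOF'
{
  "ignition": { "version": "2.2.0" },
  "storage": {
    "files": [
      {
        "filesystem": "root",
        "path": "/etc/hostname",
        "mode": 420,
        "contents": { "source": "data:,master-0.example.com" }
      }
    ]
  }
}
EOF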

Comment 12 Colin Walters 2020-03-05 18:14:04 UTC
Fix for this should be in 4.3.4.

*** This bug has been marked as a duplicate of bug 1763700 ***

Comment 13 Hugo Cisneiros (Eitch) 2020-03-05 18:41:29 UTC
Hi Colin,

The referenced bug 1763700 was fixed in 4.3.0. Is there a new bug to track this for 4.3.4?

Comment 14 Colin Walters 2020-03-05 18:50:52 UTC
For some reason that patch didn't actually make it into 4.3.0, but it should be in the next 4.3.X release (which is probably 4.3.5).
I specifically verified the fix is in https://openshift-release.svc.ci.openshift.org/releasestream/4.4.0-0.nightly/release/4.4.0-0.nightly-2020-03-05-142733
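
For the record, a hedged sketch of how such a check can be done, assuming the nightly's pull spec lives on the CI registry under the usual ocp/release repository (exact output varies by oc version):

# List the commits baked into each image of the nightly payload
oc adm release info --commits \
  registry.svc.ci.openshift.org/ocp/release:4.4.0-0.nightly-2020-03-05-142733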

Comment 15 Colin Walters 2020-05-26 13:21:47 UTC
*** Bug 1839900 has been marked as a duplicate of this bug. ***

