1855392 – [4.3.z] race condition during installation between nodes getting their hostnames and crio+kubelet starting

Keywords:
Status:	CLOSED DUPLICATE of bug 1855879
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	RHCOS
Sub Component:
Version:	4.3.z
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.3.z
Assignee:	Ben Howard
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:	1850775
Blocks:

Reported:	2020-07-09 18:32 UTC by Micah Abbott
Modified:	2020-07-21 14:26 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1850775
Environment:
Last Closed:	2020-07-10 19:37:43 UTC
Target Upstream Version:
Embargoed:

Description Micah Abbott 2020-07-09 18:32:09 UTC

+++ This bug was initially created as a clone of Bug #1850775 +++

This bug was initially created as a copy of Bug #1845885

Description of problem:
When installing OCP in a bare-metal (BM) environment, if the DHCP server takes *minutes* to hand the nodes their hostnames, the crio daemon and the kubelet start while the system hostname is still localhost.localdomain, and the pods cannot pull their images from the registry.

Version-Release number of selected component (if applicable):
4.4.6

How reproducible:
Install OCP 4.4.6 with a DHCP server that is slow to answer the nodes' DHCP requests. The nodes boot RHCOS with the hostname localhost.localdomain, and during that window crio and the kubelet start with the wrong hostname data.

Steps to Reproduce:
1. Run the OCP installation with a DHCP server providing the network data.
2. Have the DHCP server delay sending the hostnames.
3. Observe that the nodes still have localhost.localdomain hostnames while crio and kubelet are starting.

Actual results:
The nodes boot and eventually get the proper hostnames, but the first pods stay in "ContainerCreating" status because they cannot pull the images from the registry.

Expected results:
None of that happens. Even if the DHCP answer takes a long time, the crio and kubelet daemons should wait to start until the hostname is correct (different from localhost.localdomain).
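
The expected behavior can be illustrated with a small wait loop. This is only a sketch of the idea, not the fix that was eventually shipped: the function names, the GET_HOSTNAME override, and the default timeout and interval are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (returns 0) while the given name is still
# a placeholder rather than a hostname handed out by DHCP.
is_placeholder_hostname() {
    case "$1" in
        ""|localhost|localhost.localdomain) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: poll until the hostname is no longer a placeholder,
# or give up once the timeout has elapsed.
#   $1 = timeout in seconds (assumed default: 300)
#   $2 = poll interval in seconds (assumed default: 5)
# GET_HOSTNAME may be set to another command for testing; it defaults to
# the standard `hostname` utility.
wait_for_real_hostname() {
    wfh_timeout="${1:-300}"
    wfh_interval="${2:-5}"
    wfh_waited=0
    while :; do
        wfh_name="$(${GET_HOSTNAME:-hostname})"
        if ! is_placeholder_hostname "$wfh_name"; then
            echo "$wfh_name"
            return 0
        fi
        if [ "$wfh_waited" -ge "$wfh_timeout" ]; then
            echo "timed out waiting for hostname (still '$wfh_name')" >&2
            return 1
        fi
        sleep "$wfh_interval"
        wfh_waited=$((wfh_waited + wfh_interval))
    done
}
```

One way to use such a check would be to call wait_for_real_hostname from a systemd ExecStartPre= step ahead of crio and kubelet, so that neither daemon starts while the node is still named localhost.localdomain.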

Additional info:
Telco setups and disconnected environments could hit this situation, with long delays between the nodes requesting DHCP data and the proper hostnames being set.
A workaround that worked for us was to log in to each node and run the following commands:
sudo systemctl daemon-reload
sudo systemctl restart nodeip-configuration.service
sudo systemctl restart crio
sudo systemctl restart kubelet.service
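
The workaround commands above could be wrapped in a small function so that they run in the same order on every node and stop at the first failure. This is a minimal sketch; the function name and the SYSTEMCTL override are hypothetical and only there to allow a dry run.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual workaround: reload systemd, then
# restart the affected units in the order given in this report.
# SYSTEMCTL may be overridden (for example SYSTEMCTL=echo) for a dry run;
# it defaults to `sudo systemctl`.
restart_node_services() {
    ${SYSTEMCTL:-sudo systemctl} daemon-reload || return 1
    for rns_unit in nodeip-configuration.service crio kubelet.service; do
        # Stop at the first unit that fails to restart.
        ${SYSTEMCTL:-sudo systemctl} restart "$rns_unit" || return 1
    done
}
```

Running `SYSTEMCTL=echo restart_node_services` prints the systemctl arguments without touching any service, which is a cheap way to check the order before running it for real on a node.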

--- Additional comment from Ben Howard on 2020-06-24 20:51:37 UTC ---

Cloning bug for cherry pick.

--- Additional comment from Micah Abbott on 2020-07-09 18:31:42 UTC ---

To appease the BZ bots that work on the MCO repo, we need to clone to 4.4.z + cherry-pick before we can go to 4.3.z

Comment 1 Ben Howard 2020-07-10 19:37:43 UTC

Closing this in favor of 1855879. The backport from 4.6 contains four hostname fixes, including this race condition.

*** This bug has been marked as a duplicate of bug 1855879 ***
