Bug 1845885 - race condition during installation between nodes getting their hostnames and crio+kubelet starting
Summary: race condition during installation between nodes getting their hostnames and crio+kubelet starting
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.6
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Ben Howard
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On:
Blocks: 1186913 1853400
 
Reported: 2020-06-10 10:40 UTC by Pedro Ibáñez
Modified: 2024-03-25 16:02 UTC
CC: 24 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1853400
Environment:
Last Closed: 2020-10-27 16:06:07 UTC
Target Upstream Version:
Embargoed:
miabbott: needinfo-




Links
  Github: openshift machine-config-operator pull 1813 (closed) - Bug 1845885: templates: add node-valid-hostname.service for hostname check (last updated 2021-02-16 09:58:51 UTC)
  Red Hat Product Errata: RHBA-2020:4196 (last updated 2020-10-27 16:06:38 UTC)

Description Pedro Ibáñez 2020-06-10 10:40:09 UTC
Description of problem:
When installing OCP in a bare metal environment, if there is a delay of several minutes before the nodes get their hostnames via DHCP, the crio daemon and kubelet start while the system still has the hostname localhost.localdomain, and the pods are then unable to retrieve their images from the registry.

Version-Release number of selected component (if applicable):
4.4.6

How reproducible:
Install OCP 4.4.6 with a DHCP server that takes some time to answer the nodes' DHCP requests. The nodes boot RHCOS with the hostname localhost.localdomain, and during that window crio and the kubelet start with the wrong hostname data.

Steps to Reproduce:
1. Run the OCP installation with a DHCP server providing the network data.
2. Have the DHCP server send the hostnames with a delay.
3. Observe that the nodes still have the hostname localhost.localdomain while crio and kubelet are starting.

Actual results:
The nodes boot and after a while they get the proper hostnames, but the first pods stay in "ContainerCreating" status because they are not able to pull their images from the registry.

Expected results:
None of that happens: even if the DHCP answer takes a long time, the crio and kubelet daemons should wait to start until the hostname information is correct (different from localhost.localdomain).

Additional info:
Telco setups and disconnected environments could face this situation, with long delays between the nodes requesting DHCP data and the proper hostnames being set.
A workaround that worked for us was to log in to each node and run the following commands:
sudo systemctl daemon-reload
sudo systemctl restart nodeip-configuration.service
sudo systemctl restart crio
sudo systemctl restart kubelet.service
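
For reference, a quick way to confirm the node has actually picked up its DHCP hostname before restarting the services (not part of the original workaround, just a sanity check):

    hostname    # should print the node's real name, not localhost.localdomain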

Comment 2 Micah Abbott 2020-06-10 14:01:17 UTC
Not sure where the best place to fix this is.  If the nodes are eventually getting a valid hostname via DHCP, then it would seem like crio/kubelet could be configured to wait to start until that event happens.  Not sure if systemd units have a condition that could be used for this.

Possibly, RHCOS could be configured to drop a file indicating the hostname has been set and the presence of that file could be the condition which crio/kubelet is waiting on.

@mrunal @rphillips Any thoughts here?
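
For illustration only, a minimal sketch of the marker-file idea above, assuming a hypothetical marker path /run/hostname-set written by whatever applies the DHCP hostname (unit and path names are made up, not the eventual fix):

    # hostname-marker-wait.service (hypothetical): block until the marker exists
    [Unit]
    Description=Wait for the hostname marker file
    Before=crio.service kubelet.service

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    # Poll until whatever sets the hostname has dropped the marker
    ExecStart=/bin/bash -c 'until [ -e /run/hostname-set ]; do sleep 1; done'

    # crio.service and kubelet.service would then carry a drop-in with:
    # [Unit]
    # After=hostname-marker-wait.service
    # Wants=hostname-marker-wait.service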

Comment 3 Ben Howard 2020-06-10 20:37:29 UTC
I concur with Micah's comments about where the best place to fix this particular issue is. The problem that I see is that "localhost" is a valid Linux hostname. Since it's an invalid cluster hostname, it would appear that the better place is the Machine Config Operator. I've put up a WIP PR at https://github.com/openshift/machine-config-operator/pull/18

I've re-scoped this to 4.6. If the WIP PR lands, then I expect we'll backport this to 4.3.
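
As a rough sketch of the kind of hostname check described here (checking the hostname directly rather than a marker file; illustrative only, not the unit from the PR):

    # Hypothetical oneshot that delays crio/kubelet until the hostname is no longer the default
    [Unit]
    Description=Wait for a hostname other than localhost/localhost.localdomain
    Before=crio.service kubelet.service

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/bin/bash -c 'while [ "$(hostname)" = "localhost" ] || [ "$(hostname)" = "localhost.localdomain" ]; do sleep 1; done'
    TimeoutStartSec=300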

Comment 6 Micah Abbott 2020-06-11 14:04:58 UTC
@pedro, see newly linked PR  :)

Comment 7 Pedro Ibáñez 2020-06-15 09:06:54 UTC
Thanks @Micah.

Comment 11 Micah Abbott 2020-06-19 22:26:56 UTC
@Pedro would you be able to retest your configuration with the latest OCP 4.6 nightly?

Comment 13 Micah Abbott 2020-06-25 13:43:25 UTC
@Pedro any 4.6 nightly released after Jun 17 should suffice

Comment 14 Pedro Ibáñez 2020-06-25 14:02:04 UTC
Thanks Micah, I'll come back with the results.

Comment 17 Micah Abbott 2020-07-22 20:17:41 UTC
The linked PR (https://github.com/openshift/machine-config-operator/pull/1813) was superseded by https://github.com/openshift/machine-config-operator/pull/1914, which references BZ#1853584 and is marked as VERIFIED.

We don't have an easy way to set up a bare metal environment where the DHCP responses are delayed, so I'm going to ask Pedro again to please retest this configuration with a recent 4.6 nightly.

If we don't hear back in the next few weeks, I'll mark this as VERIFIED based on BZ#1853584

Comment 18 Sabina Aledort 2020-07-23 07:59:52 UTC
Hi,

We are facing the same issue when deploying OCP 4.6 in our bare metal environments. We deployed version 4.6.0-0.ci-2020-07-21-114552, but the master nodes are not joining the cluster. When applying the workaround mentioned above (logging in to each node and running the commands), the issue is resolved and the nodes join the cluster.

Comment 25 Eric Lajoie 2020-09-08 16:24:37 UTC
One workaround, if you are using playbooks, is to edit the Ignition files before PXE booting. In the link below, the Ignition config sets hostnames and wipes the disks used for OCS.

https://github.com/elajoie/5G-SA-POC-prereq/blob/master/roles/init/tasks/ocp-secret.yml

Example play:

#If the machines enter emergency mode then validate the ignition here: https://coreos.com/validate/
    - name: Edit the worker and master ignitions to set hostname
      replace:
        path: "{{ ign_folder }}/{{ item.name }}.ign"
        regexp: '"storage":{}'
        replace: '"storage":{"files":[{"filesystem":"root","path":"/etc/hostname","mode":420,"overwrite":true,"contents":{"source":"data:,{{ item.name }}"}}]}'
      with_items:
        - "{{  nodes.masters  }}"
        - "{{  nodes.workers  }}"

Comment 26 Micah Abbott 2020-09-09 20:25:36 UTC
@Pedro this bug is just about fixing the problem described for 4.6 releases.  If possible, please try to reproduce the problem using an Accepted 4.6 nightly payload from here - https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/#4.6.0-0.nightly

We do not have an environment where we can replicate this problem, so we are relying on the reporter to help with the verification.


If you want to test this has been fixed in 4.5.z, please leave a comment on https://bugzilla.redhat.com/show_bug.cgi?id=1853400.  That BZ indicates the fix was released as part of 4.5.7 (https://access.redhat.com/errata/RHBA-2020:3436).  If the issue persists on 4.5.z, please open a new 4.5 BZ for the problem.


If you want to test this has been fixed in 4.4.z, please leave a comment on https://bugzilla.redhat.com/show_bug.cgi?id=1855878.  That BZ indicates the fix was released as part of 4.4.17 (https://access.redhat.com/errata/RHBA-2020:3334).  If the issue persists on 4.4.z, please open a new 4.4 BZ for the problem.

Comment 27 Pedro Ibáñez 2020-09-10 07:37:13 UTC
@Micah ack
I'm waiting for 4.5.7 verification on the affected environment.

Thanks!

Comment 30 errata-xmlrpc 2020-10-27 16:06:07 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

