Bug 1845885
| Summary: | race condition during installation between nodes getting their hostnames and crio+kubelet starting | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Pedro Ibáñez <pibanezr> |
| Component: | RHCOS | Assignee: | Ben Howard <behoward> |
| Status: | CLOSED ERRATA | QA Contact: | Micah Abbott <miabbott> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.6 | CC: | augol, bbreard, bchardim, behoward, dollierp, dornelas, ealcaniz, elajoie, imcleod, jligon, kholtz, mdekan, miabbott, mpatel, mrobson, nstielau, ohochman, rgregory, rphillips, saledort, sferguso, smilner, travier, ykashtan |
| Target Milestone: | --- | Flags: | miabbott: needinfo-, miabbott: needinfo- |
| Target Release: | 4.6.0 | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1853400 (view as bug list) | Environment: | |
| Last Closed: | 2020-10-27 16:06:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1186913, 1853400 | | |
Description Pedro Ibáñez 2020-06-10 10:40:09 UTC
Not sure where the best place to fix this is. If the nodes are eventually getting a valid hostname via DHCP, then it would seem like crio/kubelet could be configured to wait to start until that event happens. Not sure if systemd units have a condition that could be used for this. Possibly, RHCOS could be configured to drop a file indicating the hostname has been set, and the presence of that file could be the condition which crio/kubelet is waiting on. @mrunal @rphillips Any thoughts here?

I concur with Micah's comments about where the best place to fix this particular issue is. The problem that I see is that "localhost" is a valid Linux hostname. Since it's an invalid cluster hostname, it would appear that the better place is in the Machine Config Operator. I've thrown up a WIP PR to https://github.com/openshift/machine-config-operator/pull/18

I've re-scoped this to 4.6. If the WIP PR lands, then I expect we'll backport this to 4.3.

@pedro, see newly linked PR :)

Thanks @Micah.

@Pedro would you be able to retest your configuration with the latest OCP 4.6 nightly?

@Pedro any 4.6 nightly released after Jun 17 should suffice.

Thanks Micah, I'll come back with the results.

The linked PR (https://github.com/openshift/machine-config-operator/pull/1813) was superseded by https://github.com/openshift/machine-config-operator/pull/1914, which references BZ#1853584 and is marked as VERIFIED. We don't have an easy way to set up a bare metal environment where the DHCP responses are delayed, so I'm going to ask Pedro again to please retest this configuration with a recent 4.6 nightly. If we don't hear back in the next few weeks, I'll mark this as VERIFIED based on BZ#1853584.

Hi, we are facing the same issue when deploying OCP 4.6 in our bare metal environments. We deployed version 4.6.0-0.ci-2020-07-21-114552, but the master nodes are not joining the cluster. When applying the workaround mentioned above (logging in to each node and running the commands), the issue is resolved and the nodes join the cluster.

One workaround if you are using playbooks is to edit the Ignition files before PXE booting. In the link below, the ignition sets hostnames and wipes disks used for OCS: https://github.com/elajoie/5G-SA-POC-prereq/blob/master/roles/init/tasks/ocp-secret.yml

Example play:

```yaml
# If the machines enter emergency mode then validate the ignition here: https://coreos.com/validate/
- name: Edit the worker and master ignitions to set hostname
  replace:
    path: "{{ ign_folder }}/{{ item.name }}.ign"
    regexp: '"storage":{}'
    replace: '"storage":{"files":[{"filesystem":"root","path":"/etc/hostname","mode":420,"overwrite":true,"contents":{"source":"data:,{{ item.name }}"}}]}'
  with_items:
    - "{{ nodes.masters }}"
    - "{{ nodes.workers }}"
```

@Pedro this bug is just about fixing the problem described for 4.6 releases. If possible, please try to reproduce the problem using an Accepted 4.6 nightly payload from here: https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/#4.6.0-0.nightly
We do not have an environment where we can replicate this problem, so we are relying on the reporter to help with the verification.

If you want to test whether this has been fixed in 4.5.z, please leave a comment on https://bugzilla.redhat.com/show_bug.cgi?id=1853400. That BZ indicates the fix was released as part of 4.5.7 (https://access.redhat.com/errata/RHBA-2020:3436). If the issue persists on 4.5.z, please open a new 4.5 BZ for the problem.
If you want to test whether this has been fixed in 4.4.z, please leave a comment on https://bugzilla.redhat.com/show_bug.cgi?id=1855878. That BZ indicates the fix was released as part of 4.4.17 (https://access.redhat.com/errata/RHBA-2020:3334). If the issue persists on 4.4.z, please open a new 4.4 BZ for the problem.

@Micah ack, I'm waiting for 4.5.7 verification on the affected environment. Thanks!

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
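For context on the "wait for a valid hostname before starting crio/kubelet" idea discussed in the first comments, here is a minimal sketch of what such a gate could look like as a MachineConfig carrying a systemd unit. This is illustrative only and is not the fix that actually landed via the linked machine-config-operator PRs; the MachineConfig name, unit name, Ignition spec version, and the localhost check are all assumptions for the sketch.

```yaml
# Illustrative sketch only; not the fix from the linked MCO PRs.
# Assumes Ignition spec 3.1.0 (OCP 4.6) and the worker pool; adjust role/version as needed.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-wait-for-hostname
spec:
  config:
    ignition:
      version: 3.1.0
    systemd:
      units:
        - name: wait-for-valid-hostname.service
          enabled: true
          contents: |
            [Unit]
            Description=Block crio/kubelet until the node has a non-localhost hostname
            # Order this unit before the container runtime and kubelet so they only
            # start once DHCP (or /etc/hostname) has provided a real name.
            Before=crio.service kubelet.service

            [Service]
            Type=oneshot
            RemainAfterExit=yes
            # "$$" escapes "$" for systemd, so bash receives "$(hostname)".
            ExecStart=/bin/bash -c 'while [ "$$(hostname)" = "localhost" ] || [ "$$(hostname)" = "localhost.localdomain" ]; do sleep 1; done'

            [Install]
            WantedBy=multi-user.target
```

The point of the sketch is only the ordering idea from the earlier comment (a one-shot unit that loops until the hostname is no longer "localhost", ordered Before= kubelet and crio); the real fix was implemented in the MCO pull requests referenced above.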