+++ This bug was initially created as a clone of Bug #1840222 +++

...

--- Additional comment from Colin Walters on 2020-05-28 17:28:30 UTC ---

Current diagnosis is that something in the IPI stack (probably Ironic) is forcibly powering off machines during the first boot. Our stack is currently not very robust around this. We need to change the MCO's firstboot handling to be more idempotent:
https://github.com/openshift/machine-config-operator/pull/1762
Tested with 4.5.0-0.nightly-2020-06-17-234944.

There is no clear reproducer for this because the failure is flaky. I spun up a cluster with the above nightly and verified that the change to the unit is present (no more BindsTo):

17:25:15 [~/Downloads] export KUBECONFIG=cluster-bot-2020-06-18-142422.kubeconfig
17:25:19 [~/Downloads] oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-163-11.us-west-2.compute.internal    Ready    master   30m   v1.18.3+91d0edd
ip-10-0-163-148.us-west-2.compute.internal   Ready    master   31m   v1.18.3+91d0edd
ip-10-0-177-120.us-west-2.compute.internal   Ready    worker   20m   v1.18.3+91d0edd
ip-10-0-185-241.us-west-2.compute.internal   Ready    worker   20m   v1.18.3+91d0edd
ip-10-0-245-33.us-west-2.compute.internal    Ready    master   30m   v1.18.3+91d0edd
ip-10-0-249-229.us-west-2.compute.internal   Ready    worker   20m   v1.18.3+91d0edd
17:25:24 [~/Downloads] oc debug node ip-10-0-163-11.us-west-2.compute.internal
Error from server (NotFound): pods "node" not found
17:25:30 [~/Downloads] oc debug node/ip-10-0-163-11.us-west-2.compute.internal    1 ↵
Starting pod/ip-10-0-163-11us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.163.11
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host bash
[root@ip-10-0-163-11 /]# systemctl cat machine-config-daemon-firstboot.service
# /etc/systemd/system/machine-config-daemon-firstboot.service
[Unit]
Description=Machine Config Daemon Firstboot
# Make sure it runs only on OSTree booted system
ConditionPathExists=/run/ostree-booted
# Removal of this file signals firstboot completion
ConditionPathExists=/etc/ignition-machine-config-encapsulated.json
# We only want to run on 4.3 clusters and above; this came from
# https://github.com/coreos/coreos-assembler/pull/768
ConditionPathExists=/sysroot/.coreos-aleph-version.json
After=ignition-firstboot-complete.service
Before=crio.service crio-wipe.service
Before=kubelet.service

[Service]
# Need oneshot to delay kubelet
Type=oneshot
ExecStart=/usr/libexec/machine-config-daemon firstboot-complete-machineconfig

[Install]
WantedBy=multi-user.target
RequiredBy=crio.service kubelet.service
[root@ip-10-0-163-11 /]#
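
The manual check above (reading the unit and confirming there is no BindsTo) could be scripted roughly as follows. This is only a sketch: the helper name `unit_has_no_bindsto` is hypothetical and not part of the MCO or systemd, and it assumes the unit text is piped in on stdin.

```shell
# Hypothetical helper (not part of the MCO): reads unit file text on stdin
# and succeeds only when no BindsTo= directive is present.
unit_has_no_bindsto() {
    # grep -q exits 0 when it finds a match, so invert the result.
    ! grep -Eq '^[[:space:]]*BindsTo='
}

# Assumed usage on a node, after `chroot /host`:
#   systemctl cat machine-config-daemon-firstboot.service | unit_has_no_bindsto \
#       && echo "no BindsTo in unit"
```

This only confirms that the unit text matches what was verified by eye here; it does not reproduce the forced power-off itself.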
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409