The RHEL7 workers scaleup playbook succeeds, but after the RHCOS nodes are removed the worker machine config pool never completely rolls out.

Across 177 runs and 22 jobs (80.23% failed), matched 81.69% of failing runs and 36.36% of jobs.

CI Search: https://search.ci.openshift.org/?search=error%3A+timed+out+waiting+for+the+condition+on+machineconfigpools%2Fworker&maxAge=48h&context=1&type=bug%2Bjunit&name=e2e-aws-workers-rhel7&maxMatches=5&maxBytes=20971520&groupBy=job

Example job: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2428/pull-ci-openshift-machine-config-operator-master-e2e-aws-workers-rhel7/1365006063505510400

Build log snippet (after the scaleup.yml playbook completes successfully):

+ echo '2021-02-25 19:32:11+00:00 - Deleting CoreOS machinesets'
2021-02-25 19:32:11+00:00 - Deleting CoreOS machinesets
+ mapfile -t COREOS_MACHINE_SETS
++ oc get machinesets --namespace openshift-machine-api
++ grep worker
++ grep -v rhel
++ awk '{print $1}'
+ [[ 2 != 0 ]]
+ oc delete machinesets --namespace openshift-machine-api ci-op-7q1m0tqc-8eab8-vmr4x-worker-us-west-1a ci-op-7q1m0tqc-8eab8-vmr4x-worker-us-west-1b
machineset.machine.openshift.io "ci-op-7q1m0tqc-8eab8-vmr4x-worker-us-west-1a" deleted
machineset.machine.openshift.io "ci-op-7q1m0tqc-8eab8-vmr4x-worker-us-west-1b" deleted
++ date -u --rfc-3339=seconds
+ echo '2021-02-25 19:32:12+00:00 - Waiting for CoreOS nodes to be removed'
2021-02-25 19:32:12+00:00 - Waiting for CoreOS nodes to be removed
+ oc wait node --for=delete --timeout=10m --selector node.openshift.io/os_id=rhcos,node-role.kubernetes.io/worker
node/ip-10-0-149-56.us-west-1.compute.internal condition met
node/ip-10-0-156-31.us-west-1.compute.internal condition met
node/ip-10-0-208-64.us-west-1.compute.internal condition met
++ date -u --rfc-3339=seconds
+ echo '2021-02-25 19:34:32+00:00 - Waiting for worker machineconfigpool to update'
2021-02-25 19:34:32+00:00 - Waiting for worker machineconfigpool to update
+ oc wait machineconfigpool/worker --for=condition=Updated=True --timeout=10m
error: timed out waiting for the condition on machineconfigpools/worker

Example from MCD pod logs:

2021-02-25T19:42:02.027958844Z I0225 19:42:02.027484 23752 start.go:108] Version: machine-config-daemon-4.6.0-202006240615.p0-678-g4bda8ca2-dirty (4bda8ca28522b3f2e663e8e27f56e79884e8d6ce)
2021-02-25T19:42:02.030583723Z I0225 19:42:02.030550 23752 start.go:121] Calling chroot("/rootfs")
2021-02-25T19:42:02.067087660Z I0225 19:42:02.067030 23752 start.go:97] Copied self to /run/bin/machine-config-daemon on host
2021-02-25T19:42:02.070540459Z I0225 19:42:02.070489 23752 metrics.go:105] Registering Prometheus metrics
2021-02-25T19:42:02.070671155Z I0225 19:42:02.070630 23752 metrics.go:110] Starting metrics listener on 127.0.0.1:8797
2021-02-25T19:42:02.072647740Z I0225 19:42:02.072611 23752 update.go:1901] Starting to manage node: ip-10-0-130-229.us-west-1.compute.internal
2021-02-25T19:42:02.075268617Z I0225 19:42:02.075229 23752 rpm-ostree.go:261] Running captured: journalctl --list-boots
2021-02-25T19:42:02.076415481Z I0225 19:42:02.076381 23752 daemon.go:682] Detected a new login session: New session 1 of user ec2-user.
2021-02-25T19:42:02.076415481Z I0225 19:42:02.076403 23752 daemon.go:683] Login access is discouraged!
Applying annotation: machineconfiguration.openshift.io/ssh
2021-02-25T19:42:02.078529860Z I0225 19:42:02.078484 23752 daemon.go:871] journalctl --list-boots:
2021-02-25T19:42:02.078529860Z 0 2d64c5efd7cc4f1d988178be598026e4 Thu 2021-02-25 19:29:38 UTC—Thu 2021-02-25 19:42:02 UTC
2021-02-25T19:42:02.078529860Z I0225 19:42:02.078511 23752 rpm-ostree.go:261] Running captured: systemctl list-units --state=failed --no-legend
2021-02-25T19:42:02.082326587Z I0225 19:42:02.082286 23752 daemon.go:886] systemd service state: OK
2021-02-25T19:42:02.082326587Z I0225 19:42:02.082307 23752 daemon.go:620] Starting MachineConfigDaemon
2021-02-25T19:42:02.082326587Z I0225 19:42:02.082313 23752 daemon.go:590] Guarding against sigterm signal
2021-02-25T19:42:02.082369207Z I0225 19:42:02.082332 23752 daemon.go:627] Enabling Kubelet Healthz Monitor
2021-02-25T19:42:02.182675828Z W0225 19:42:02.182578 23752 daemon.go:648] Got an error from auxiliary tools: error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ip-10-0-130-229.us-west-1.compute.internal" not found
2021-02-25T19:42:02.182675828Z I0225 19:42:02.182628 23752 daemon.go:649] Shutting down MachineConfigDaemon
2021-02-25T19:42:02.182757582Z F0225 19:42:02.182731 23752 helpers.go:147] error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ip-10-0-130-229.us-west-1.compute.internal" not found

Node ip-10-0-130-229.us-west-1.compute.internal Ready state is True.
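So the daemon hits the "node not found" error while applying the SSH annotation, logs it as fatal, and shuts down, even though the node itself is Ready; that would leave the node's config state unreported and the pool stuck short of Updated. When this reproduces, something along the following lines should show what the pool is waiting on and which workers never report Done (the machineconfiguration.openshift.io/state node annotation is assumed from the usual MCO layout, not taken from this job):

  # Worker pool conditions and the message behind the stalled Updated condition
  oc get machineconfigpool worker -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'

  # Per-node MCD state annotation on the remaining (RHEL) workers
  oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}{end}'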
Looking at the job failure history, it appears this PR introduced a change that causes the failure on RHEL nodes: the merge just before this PR passed e2e-aws-workers-rhel7, and this PR did not. https://github.com/openshift/machine-config-operator/pull/2395
Hmm, I wonder if the MCD is failing to terminate because of some weird RHEL upgrade limbo state. Passing to Ben to take a look.
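If it is a stuck or crash-looping daemon rather than a one-off error, the restart counts and the previous-container logs of the daemon pod on the affected node should show whether it keeps dying on the same annotation failure. A possible check (the k8s-app=machine-config-daemon label and the machine-config-daemon container name are assumed from the usual daemonset; the pod name is a placeholder):

  # Daemon pods with their node assignment and restart counts
  oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide

  # Logs from the previously crashed container of the pod running on ip-10-0-130-229
  oc -n openshift-machine-config-operator logs <mcd-pod-on-ip-10-0-130-229> -c machine-config-daemon --previous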
This is a 4.8 blocker since it is a regression.
Dup of Bug 1933772

*** This bug has been marked as a duplicate of bug 1933772 ***