See https://docs.google.com/document/d/1dt1J3Ds3TVTG8KzFd0qVU-0w7AiWcX8BouWgElZBir0/edit#

I reproduced this in a cluster-bot cluster updated from 4.4 to 4.5 to 4.6. The MCS will only serve the *status* config for new workers until the worker pool is complete. See https://github.com/openshift/machine-config-operator/pull/2035

Operating theory: the control plane (and hence the MCS) is upgraded and starts to serve Ignition spec 2.5.0-experimental. This is supported by 4.5 bootimages but *not* 4.4 or below. So this isn't actually directly related to the spec 3 transition; it's that in 4.6 we (incorrectly) started serving too new a spec 2 version.
My original theory was wrong. The actual problem is this: https://github.com/openshift/machine-config-operator/blob/130947243313dcfa8a4f0ef487f458f923df1128/pkg/server/server.go#L63 will always be spec 3. And this ends up being an Ignition file we stick inside Ignition - that's what the firstboot program reads.

This isn't a problem after the worker pool finishes upgrading to 4.6, because at that point the MCO will switch over to using the *new* MCD from this PR: https://github.com/openshift/machine-config-operator/pull/1766 (After 4.6 we are much less exposed to bugs from old bootimage versions in general.)

Working around this is...slightly tricky. The cleanest thing would probably be having the MCS gather up front, from the client, which Ignition version to render, generating the Ignition from that, and putting the right version into the embedded config. One possibility is to add a change directly to 4.5 that would fix this, and require people upgrading to go through that... the big hammer would be backporting https://github.com/openshift/machine-config-operator/pull/1766 to 4.5 (but that's a *big* hammer).
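For anyone debugging this, a quick way to see what the MCS actually hands to a newly scaled-up worker is to fetch the worker config and inspect its spec version. This is a hedged sketch, not from the original report: the api-int address is a placeholder, jq usage is illustrative, and port 22623 is normally only reachable from hosts inside the cluster network.

$ curl -sk https://api-int.<cluster-domain>:22623/config/worker > worker.ign
# Top-level spec version of the config the bootimage's Ignition has to parse
$ jq .ignition.version worker.ign
# Paths of the files the MCS embeds for firstboot (the "Ignition file inside Ignition")
$ jq '[.storage.files[].path]' worker.ign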
Eric Paris says this is a blocker.
Target 4.7.0 for the master PR, then we can clone back to 4.6.0 for the backport.
To reproduce this reliably, you should:
- Provision a 4.5 (or older) cluster
- Add a pod disruption budget (or otherwise un-drainable workload) that blocks upgrades on at least one worker (see the example manifest below)
- Start an upgrade to 4.6
- Try scaling up a worker machineset

You've reproduced the bug if the worker is stuck in Provisioning. The key is blocking the upgrade of at least one worker.
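Here's one way to create such an un-drainable workload: a minimal sketch of a Deployment plus a PodDisruptionBudget with maxUnavailable: 0, so drain can never evict the pod and the worker pool upgrade stays blocked on that node. The name, namespace, and image are illustrative, not part of the original report.

$ cat <<'EOF' | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdb-blocker
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pdb-blocker
  template:
    metadata:
      labels:
        app: pdb-blocker
    spec:
      containers:
      - name: sleep
        image: registry.access.redhat.com/ubi8/ubi
        command: ["sleep", "infinity"]
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: pdb-blocker
  namespace: default
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: pdb-blocker
EOF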
More elaborate steps:
- Provision a 4.5 (or older) cluster
- Add a pod disruption budget (or otherwise un-drainable workload) that effectively blocks upgrades on at least one worker
- Start an upgrade to 4.6
- Watch `oc -n openshift-machine-config-operator get ds/machine-config-server` for the new MCS to roll out; you can use e.g. `oc -n openshift-machine-config-operator logs pod/machine-config-server-xyz` and verify it shows its version as 4.6
- Verify in `oc get machineconfigpool/worker` that the pool is still progressing (you have at least one worker blocked)
- Try scaling up a worker machineset via e.g. `oc -n openshift-machine-api scale machineset/worker-xyz` (see the commands below)
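A hedged sketch of that last step; the machineset name and replica count are placeholders. On an affected cluster, the new machine staying in Provisioning/Provisioned without ever registering as a node is the failure signature described above.

$ oc -n openshift-machine-api get machinesets
$ oc -n openshift-machine-api scale machineset/<worker-machineset> --replicas=2
# Compare the machine phases against the node list; a stuck machine
# shows up in the first command but never in the second:
$ oc -n openshift-machine-api get machines
$ oc get nodes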
This isn't merged yet but marking VERIFIED since we have tested it manually.
We'll leave it MODIFIED, and I overwrote the flag on the 4.6 bug.
Like all good bugs, a few separate things combined here to let this sneak under the radar.

The biggest, obviously, is the fact that this only reproduces *during* an upgrade; it's the kind of thing that's easy to hit in a cluster someone's using actively, but a lot of our CI is a bit too "synthetic". (Hmm, do any of our periodics enable the autoscaler? That would have helped.) @wking was looking at some of this in https://github.com/openshift/release/pull/13009

The second biggest thing here, though, is our lack of visibility into machines which fail during Ignition/firstboot. xref https://github.com/coreos/ignition/issues/585 (This issue is really the flip side of Ignition; we're in a known good state if it succeeds, but debugging is painful.) I would guess it's quite possible that some people have hit this but not realized it - I think last time this happened the machine API team at least added a Prometheus alert: https://github.com/openshift/machine-api-operator/commit/706ecf9cc21fe901fab84a7c0a49a726970560f2 But diagnosing from that alert is a whole other thing, requiring ssh (in this case) or (in cases of Ignition failure) going to the console. Even if our e2e testing hit this, it'd feel like a flake and could get lost in retesting.

If we had the failure reporting, I think we'd want any failures like that to be reported very loudly, and be sure the failure turns into a `clusteroperator/machine-config` critical alert or so. Then it should be easier to be sure that if we see this on any periodics or CI jobs (or from users, obviously) it gets in front of the MCO/CoreOS teams.
Based on comment 6, we reproduced this bug by upgrading from 4.5.0-0.nightly-2020-10-21-224736 to 4.6.0-0.nightly-2020-10-21-195503; machines got stuck in the Provisioned status. Verified the fix by upgrading from 4.5.0-0.nightly-2020-10-21-224736 to 4.7.0-0.ci-2020-10-22-020841; the new worker nodes could join the cluster.

Reproducer:
1. Provision a 4.5 cluster
2. Create a PDB for a deployment, so that node drain cannot succeed because the PDB prevents it
3. `oc adm upgrade`
4. Watch `oc -n openshift-machine-config-operator get ds/machine-config-server` for the new MCS to roll out; you can use e.g. `oc -n openshift-machine-config-operator logs pod/machine-config-server-xyz` and verify it shows its version as 4.6
5. Verify in `oc get machineconfigpool/worker` that the pool is still progressing (you have at least one worker blocked)
6. Try scaling up a worker machineset via e.g. `oc -n openshift-machine-api scale machineset/worker-xyz`

$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.ci-2020-10-22-020841   True        False         101m    Cluster version is 4.7.0-0.ci-2020-10-22-020841

$ oc get machine
NAME                                             PHASE     TYPE              REGION           ZONE   AGE
zhsun22azure-6s5ls-master-0                      Running   Standard_D8s_v3   northcentralus          4h22m
zhsun22azure-6s5ls-master-1                      Running   Standard_D8s_v3   northcentralus          4h22m
zhsun22azure-6s5ls-master-2                      Running   Standard_D8s_v3   northcentralus          4h22m
zhsun22azure-6s5ls-worker-northcentralus-7nglq   Running   Standard_D2s_v3   northcentralus          4h10m
zhsun22azure-6s5ls-worker-northcentralus-fhctl   Running   Standard_D2s_v3   northcentralus          4h10m
zhsun22azure-6s5ls-worker-northcentralus-gmd8v   Running   Standard_D2s_v3   northcentralus          12m
zhsun22azure-6s5ls-worker-northcentralus-hfwlk   Running   Standard_D2s_v3   northcentralus          12m

$ oc get node
NAME                                             STATUS   ROLES    AGE     VERSION
zhsun22azure-6s5ls-master-0                      Ready    master   4h30m   v1.19.0+80fd895
zhsun22azure-6s5ls-master-1                      Ready    master   4h31m   v1.19.0+80fd895
zhsun22azure-6s5ls-master-2                      Ready    master   4h30m   v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-7nglq   Ready    worker   4h15m   v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-fhctl   Ready    worker   4h14m   v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-gmd8v   Ready    worker   17m     v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-hfwlk   Ready    worker   17m     v1.19.0+80fd895
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475