Bug 1890250 - workers may fail to join the cluster during an update from 4.5
Summary: workers may fail to join the cluster during an update from 4.5
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1890362
Reported: 2020-10-21 18:27 UTC by Colin Walters
Modified: 2021-02-24 15:27 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1890362
Environment:
Last Closed: 2021-02-24 15:27:22 UTC
Target Upstream Version:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2167 | 0 | None | closed | Bug 1890250: mcs: Ensure that the encapsulated config is spec 2 if requested | 2021-01-27 09:12:12 UTC
Red Hat Product Errata | RHSA-2020:5633 | 0 | None | None | None | 2021-02-24 15:27:46 UTC

Description Colin Walters 2020-10-21 18:27:23 UTC
See https://docs.google.com/document/d/1dt1J3Ds3TVTG8KzFd0qVU-0w7AiWcX8BouWgElZBir0/edit#

I reproduced this in a cluster-bot cluster updated from 4.4 to 4.5 to 4.6.

The MCS will only serve the *status* config for new workers until the worker pool is complete.  See https://github.com/openshift/machine-config-operator/pull/2035

Operating theory is that the control plane (and hence MCS) is upgraded and starts to serve Ignition spec 2.5.0-experimental.  This is supported by 4.5 bootimages but *not* 4.4 or below.

So this isn't actually directly related to the spec 3 transition, it's that in 4.6 we (incorrectly) started serving a too new spec2 version.

Comment 1 Colin Walters 2020-10-21 20:56:19 UTC
My original theory was wrong.  The actual problem is this:

https://github.com/openshift/machine-config-operator/blob/130947243313dcfa8a4f0ef487f458f923df1128/pkg/server/server.go#L63

Will always be spec3.  And this ends up being an ignition file we stick inside ignition - that's what the firstboot program reads.

This isn't a problem after the worker pool finishes upgrading to 4.6 because at that point the MCO will switch over to using the *new* MCD from this PR:
https://github.com/openshift/machine-config-operator/pull/1766
(After 4.6 we are much less exposed to bugs from old bootimage versions in general)

Working around this is...slightly tricky.  The cleanest thing would probably be having the MCS gather up front which ignition version
to render from the client and generating the Ignition from that, including putting the right version into the 

One possibility is to add a change directly to 4.5 that would fix this, and require people upgrading to go through that...the big hammer
would be backporting https://github.com/openshift/machine-config-operator/pull/1766 to 4.5 (but that's a *big* hammer).
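
The "gather up front which ignition version to render from the client" idea could be sketched roughly as follows. This is a minimal Go sketch under stated assumptions, not the MCO's actual code: the helper name `specMajorFromAccept` is hypothetical, and it assumes the Ignition client advertises its spec version in the HTTP Accept header as `application/vnd.coreos.ignition+json; version=X.Y.Z`. The returned major version would then decide which spec both the outer config and the encapsulated one are rendered in.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ignitionMediaType is the media type this sketch assumes the client sends.
const ignitionMediaType = "application/vnd.coreos.ignition+json"

// specMajorFromAccept is a hypothetical helper. It returns the major Ignition
// spec version (2 or 3) advertised in an Accept header value, or an error if
// no usable version parameter is present.
func specMajorFromAccept(accept string) (int, error) {
	for _, part := range strings.Split(accept, ",") {
		fields := strings.Split(part, ";")
		if strings.TrimSpace(fields[0]) != ignitionMediaType {
			continue // some other media type, e.g. */*
		}
		for _, param := range fields[1:] {
			kv := strings.SplitN(strings.TrimSpace(param), "=", 2)
			if len(kv) != 2 || strings.TrimSpace(kv[0]) != "version" {
				continue // not the version parameter (e.g. q=0.1)
			}
			// Keep only the part before the first dot: "2.2.0" -> "2".
			major := strings.SplitN(strings.TrimSpace(kv[1]), ".", 2)[0]
			n, err := strconv.Atoi(major)
			if err != nil {
				return 0, fmt.Errorf("unparseable version %q: %w", kv[1], err)
			}
			if n != 2 && n != 3 {
				return 0, fmt.Errorf("unsupported Ignition spec major version %d", n)
			}
			return n, nil
		}
	}
	return 0, fmt.Errorf("no Ignition version found in Accept header %q", accept)
}

func main() {
	// Example header values; the exact versions are illustrative only.
	headers := []string{
		"application/vnd.coreos.ignition+json; version=2.2.0, */*; q=0.1",
		"application/vnd.coreos.ignition+json; version=3.1.0, */*; q=0.1",
	}
	for _, h := range headers {
		major, err := specMajorFromAccept(h)
		fmt.Println(major, err)
	}
}
```

The point of the sketch is only that the decision is made once, from what the client says it supports, rather than hard-coding spec 3 for the encapsulated config.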

Comment 2 W. Trevor King 2020-10-21 21:48:44 UTC
Eric Paris says this is a blocker.

Comment 4 W. Trevor King 2020-10-21 21:51:02 UTC
Target 4.7.0 for the master PR, then we can clone back to 4.6.0 for the backport.

Comment 5 Colin Walters 2020-10-21 22:08:56 UTC
To reproduce this reliably, you should:

- Provision a 4.5 (or older) cluster
- Add a pod disruption budget (or otherwise un-drainable workload) that blocks upgrades on at least one worker
- Start an upgrade to 4.6
- Try scaling up a worker machineset

You've reproduced the bug if the worker is stuck in Provisioning.

The key is blocking the upgrade of at least one worker.
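
The "worker is stuck" check could be scripted roughly as below. This is a minimal Go sketch, not an existing tool: the function name `machinesNotRunning` and the sample machine names are hypothetical, and it assumes it is fed the plain table printed by `oc get machine`, with a header row first and NAME and PHASE as the first two columns. It only reports a point-in-time snapshot, so a machine listed here is "stuck" only if it keeps showing up across repeated runs.

```go
package main

import (
	"fmt"
	"strings"
)

// machinesNotRunning scans the text output of `oc get machine` and returns the
// names of machines whose PHASE column is anything other than "Running".
func machinesNotRunning(output string) []string {
	var names []string
	for i, line := range strings.Split(strings.TrimSpace(output), "\n") {
		fields := strings.Fields(line)
		if i == 0 || len(fields) < 2 {
			continue // skip the header row and blank or short lines
		}
		if fields[1] != "Running" {
			names = append(names, fields[0])
		}
	}
	return names
}

func main() {
	// Hypothetical sample output; names and values are made up for illustration.
	sample := `NAME              PHASE         TYPE              REGION           ZONE   AGE
example-worker-a  Running       Standard_D2s_v3   northcentralus          4h10m
example-worker-b  Provisioned   Standard_D2s_v3   northcentralus          12m`

	for _, name := range machinesNotRunning(sample) {
		fmt.Println("not running:", name)
	}
}
```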

Comment 6 Colin Walters 2020-10-21 22:37:03 UTC
More elaborate steps

- Provision a 4.5 (or older) cluster
- Add a pod disruption budget (or otherwise un-drainable workload) that effectively blocks upgrades on at least one worker
- Start an upgrade to 4.6
- Watch `oc -n openshift-machine-config-operator get ds/machine-config-server` and wait for the new MCS to roll out;
  you can use e.g. `oc -n openshift-machine-config-operator logs pod/machine-config-server-xyz` to verify
  that it reports its version as 4.6.
- Verify in `oc get machineconfigpool/worker` that the pool is still progressing (you have at least one worker blocked)
- Try scaling up a worker machineset via e.g. `oc -n openshift-machine-api scale machineset/worker-xyz --replicas=<N>`

Comment 7 Colin Walters 2020-10-22 01:35:30 UTC
This isn't merged yet but marking VERIFIED since we have tested it manually.

Comment 8 Eric Paris 2020-10-22 01:49:44 UTC
we'll leave it modified and I overwrote the flag on the 4.6 bug.

Comment 9 Colin Walters 2020-10-22 02:22:31 UTC
Like all good bugs, a few separate things combined here to let this sneak under the radar.

The biggest obviously is the fact that this only reproduces *during* an upgrade; it's
the kind of thing easy to hit in a cluster someone's using actively, but a lot of our CI is
a bit too "synthetic".  (Hmm, do any of our periodics enable the autoscaler?  That would have helped)
@wking was looking at some of this in https://github.com/openshift/release/pull/13009

The second biggest thing here though is our lack of visibility into machines which fail during ignition/firstboot.
xref https://github.com/coreos/ignition/issues/585
(This issue is really the flip side of Ignition; we're in a known good state if it succeeds, but debugging is painful)

I would guess it's quite possible that some people have hit this but not realized it - I think last time this happened the machineAPI team at least added a prometheus alert:
https://github.com/openshift/machine-api-operator/commit/706ecf9cc21fe901fab84a7c0a49a726970560f2
But diagnosing from that alert is a whole other thing, requiring ssh (in this case) or (in cases of ignition failure) going to the console.

Even if our e2e testing hit this it'd feel like a flake and could get lost in retesting.
If we had the failure reporting I think we'd want any failures like that to be reported very loudly, and be sure that failure turns into a `clusteroperator/machine-config` critical alert or so.
Then it should be easier to be sure that if we see this on any periodics or CI jobs (or from users obviously) it gets in front of the MCO/CoreOS teams.

Comment 11 sunzhaohua 2020-10-22 08:08:39 UTC
Based on Comment 6, we reproduced this bug by upgrading from 4.5.0-0.nightly-2020-10-21-224736 to 4.6.0-0.nightly-2020-10-21-195503. Machines were stuck in Provisioned status.
Verified the fix by upgrading from 4.5.0-0.nightly-2020-10-21-224736 to 4.7.0-0.ci-2020-10-22-020841. The worker nodes could join the cluster.
Reproducer:
1. Provision a 4.5 cluster
2. Create a PDB for the deployment. Node drain would not succeed because PDB prevents it
3. Start the upgrade with `oc adm upgrade`
4. Watch `oc -n openshift-machine-config-operator get ds/machine-config-server` and wait for the new MCS to roll out; you can use e.g. `oc -n openshift-machine-config-operator logs pod/machine-config-server-xyz` to verify that it reports its version as 4.6.
5. Verify in `oc get machineconfigpool/worker` that the pool is still progressing (you have at least one worker blocked)
6. Try scaling up a worker machineset via e.g. `oc -n openshift-machine-api scale machineset/worker-xyz --replicas=<N>`

$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.ci-2020-10-22-020841   True        False         101m    Cluster version is 4.7.0-0.ci-2020-10-22-020841

$ oc get machine
NAME                                             PHASE     TYPE              REGION           ZONE   AGE
zhsun22azure-6s5ls-master-0                      Running   Standard_D8s_v3   northcentralus          4h22m
zhsun22azure-6s5ls-master-1                      Running   Standard_D8s_v3   northcentralus          4h22m
zhsun22azure-6s5ls-master-2                      Running   Standard_D8s_v3   northcentralus          4h22m
zhsun22azure-6s5ls-worker-northcentralus-7nglq   Running   Standard_D2s_v3   northcentralus          4h10m
zhsun22azure-6s5ls-worker-northcentralus-fhctl   Running   Standard_D2s_v3   northcentralus          4h10m
zhsun22azure-6s5ls-worker-northcentralus-gmd8v   Running   Standard_D2s_v3   northcentralus          12m
zhsun22azure-6s5ls-worker-northcentralus-hfwlk   Running   Standard_D2s_v3   northcentralus          12m

$ oc get node
NAME                                             STATUS   ROLES    AGE     VERSION
zhsun22azure-6s5ls-master-0                      Ready    master   4h30m   v1.19.0+80fd895
zhsun22azure-6s5ls-master-1                      Ready    master   4h31m   v1.19.0+80fd895
zhsun22azure-6s5ls-master-2                      Ready    master   4h30m   v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-7nglq   Ready    worker   4h15m   v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-fhctl   Ready    worker   4h14m   v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-gmd8v   Ready    worker   17m     v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-hfwlk   Ready    worker   17m     v1.19.0+80fd895

Comment 15 errata-xmlrpc 2021-02-24 15:27:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

