1890250 – workers may fail to join the cluster during an update from 4.5

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Antonio Murdaca
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1890362

Reported:	2020-10-21 18:27 UTC by Colin Walters
Modified:	2021-04-15 07:56 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1890362 (view as bug list)
Environment:
Last Closed:	2021-02-24 15:27:22 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2167	0	None	closed	Bug 1890250: mcs: Ensure that the encapsulated config is spec 2 if requested	2021-01-27 09:12:12 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:27:46 UTC

Description Colin Walters 2020-10-21 18:27:23 UTC

See https://docs.google.com/document/d/1dt1J3Ds3TVTG8KzFd0qVU-0w7AiWcX8BouWgElZBir0/edit#

I reproduced this in a cluster-bot cluster updated from 4.4 to 4.5 to 4.6.

The MCS will only serve the *status* config for new workers until the worker pool is complete.  See https://github.com/openshift/machine-config-operator/pull/2035

Operating theory is that the control plane (and hence MCS) is upgraded and starts to serve Ignition spec 2.5.0-experimental.  This is supported by 4.5 bootimages but *not* 4.4 or below.

So this isn't actually directly related to the spec 3 transition; it's that in 4.6 we (incorrectly) started serving a spec 2 version that is too new.

Comment 1 Colin Walters 2020-10-21 20:56:19 UTC

My original theory was wrong.  The actual problem is this:

https://github.com/openshift/machine-config-operator/blob/130947243313dcfa8a4f0ef487f458f923df1128/pkg/server/server.go#L63

The config built there will always be spec 3.  It ends up being an Ignition file we stick inside the Ignition config we serve, and that embedded file is what the firstboot program reads.

This isn't a problem after the worker pool finishes upgrading to 4.6 because at that point the MCO will switch over to using the *new* MCD from this PR:
https://github.com/openshift/machine-config-operator/pull/1766
(After 4.6 we are much less exposed to bugs from old bootimage versions in general)
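
To make the failure mode concrete, here is a minimal Python sketch (not the MCS code) of a served config whose outer Ignition version differs from the version of the config embedded inside it. The file path, the JSON shape of the embedded object, and the function names are illustrative assumptions, and the file entry is simplified compared to a real Ignition config.

```python
import json
from urllib.parse import quote, unquote

# Assumed location of the embedded config; illustrative only.
ENCAPSULATED_PATH = "/etc/ignition-machine-config-encapsulated.json"


def major(version):
    """Return the major component of a version such as '3.1.0'."""
    return int(version.split(".")[0])


def encapsulate(outer_version, inner_version):
    """Build a toy served config that embeds a second config as a data URL."""
    inner = {"spec": {"config": {"ignition": {"version": inner_version}}}}
    source = "data:," + quote(json.dumps(inner))
    return {
        "ignition": {"version": outer_version},
        "storage": {
            "files": [{"path": ENCAPSULATED_PATH, "contents": {"source": source}}]
        },
    }


def embedded_spec_mismatch(served):
    """True if the embedded config uses a different major spec than the outer one."""
    outer = served["ignition"]["version"]
    for entry in served.get("storage", {}).get("files", []):
        if entry["path"] != ENCAPSULATED_PATH:
            continue
        payload = unquote(entry["contents"]["source"].split(",", 1)[1])
        inner = json.loads(payload)["spec"]["config"]["ignition"]["version"]
        return major(outer) != major(inner)
    return False
```

In this model, `embedded_spec_mismatch(encapsulate("2.2.0", "3.1.0"))` is the situation described above: the outer config is something an old bootimage can parse, but the embedded one is not.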

Working around this is...slightly tricky.  The cleanest thing would probably be having the MCS gather up front which ignition version
to render from the client and generating the Ignition from that, including putting the right version into the 

One possibility is to add a change directly to 4.5 that would fix this, and require people upgrading to go through that...the big hammer
would be backporting https://github.com/openshift/machine-config-operator/pull/1766 to 4.5 (but that's a *big* hammer).
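
A rough sketch of the "gather the version up front" idea, in Python rather than the MCO's Go. It assumes the booting node advertises the spec it understands in an HTTP Accept header of the form `application/vnd.coreos.ignition+json;version=X.Y.Z`; the header handling, the function names, and the concrete spec versions chosen per major are assumptions for illustration, not what the MCS does.

```python
IGNITION_MEDIA_TYPE = "application/vnd.coreos.ignition+json"

# Illustrative choice of spec version to render for each major version.
SPEC_FOR_MAJOR = {2: "2.2.0", 3: "3.1.0"}


def requested_major(accept_header, default=3):
    """Pick the major spec version with the highest q-value in the header."""
    best = None  # (q, major)
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0] != IGNITION_MEDIA_TYPE:
            continue
        params = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
        version = params.get("version")
        if not version:
            continue
        q = float(params.get("q", "1"))
        if best is None or q > best[0]:
            best = (q, int(version.split(".")[0]))
    return best[1] if best else default


def render_for_client(accept_header):
    """Use one spec version for both the outer and the embedded config."""
    wanted = requested_major(accept_header)
    if wanted not in SPEC_FOR_MAJOR:
        raise ValueError(f"unsupported Ignition major version: {wanted}")
    version = SPEC_FOR_MAJOR[wanted]
    return {"outer": version, "encapsulated": version}
```

The point of the sketch is only that the version decision is made once, from the request, and then applied to everything that gets rendered.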

Comment 2 W. Trevor King 2020-10-21 21:48:44 UTC

Eric Paris says this is a blocker.

Comment 4 W. Trevor King 2020-10-21 21:51:02 UTC

Target 4.7.0 for the master PR, then we can clone back to 4.6.0 for the backport.

Comment 5 Colin Walters 2020-10-21 22:08:56 UTC

To reproduce this reliably, you should:

- Provision a 4.5 (or older) cluster
- Add a pod disruption budget (or otherwise un-drainable workload) that blocks upgrades on at least one worker
- Start an upgrade to 4.6
- Try scaling up a worker machineset

You've reproduced the bug if the worker is stuck in Provisioning.

The key is blocking the upgrade of at least one worker.
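
For the pod disruption budget step, one possible manifest is sketched below as a small Python generator. The name, namespace, and label are hypothetical, and the sketch assumes a deployment with matching pods is already running on a worker. `policy/v1beta1` is used because that is the PodDisruptionBudget API available on 4.5-era clusters.

```python
import json


def blocking_pdb(name="block-drain", namespace="default", app_label="block-drain"):
    """Return a PodDisruptionBudget that forbids evicting any matching pod."""
    return {
        "apiVersion": "policy/v1beta1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # With zero allowed disruptions, a drain of the node hosting
            # a matching pod cannot complete.
            "maxUnavailable": 0,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(blocking_pdb(), indent=2))
```

Saved as a script (the file name is up to you), the output can be piped to `oc apply -f -`.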

Comment 6 Colin Walters 2020-10-21 22:37:03 UTC

More elaborate steps

- Provision a 4.5 (or older) cluster
- Add a pod disruption budget (or otherwise un-drainable workload) that effectively blocks upgrades on at least one worker
- Start an upgrade to 4.6
- Watch `oc -n openshift-machine-config-operator get ds/machine-config-server` until the new MCS has rolled out;
  you can use e.g. `oc -n openshift-machine-config-operator logs pod/machine-config-server-xyz` to verify
  that it reports its version as 4.6.
- Verify in `oc get machineconfigpool/worker` that the pool is still progressing (you have at least one worker blocked)
- Try scaling up a worker machineset via e.g. `oc -n openshift-machine-api scale machineset/worker-xyz --replicas=<N>`
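
The last two checks can also be done on JSON output, e.g. from `oc get machineconfigpool/worker -o json` and `oc -n openshift-machine-api get machines -o json`. The helpers below are a hedged sketch with hypothetical function names; they rely only on the `status.conditions` of the pool and the `status.phase` and `status.nodeRef` fields of each Machine.

```python
def pool_is_updating(pool):
    """True if the MachineConfigPool reports its Updating condition as True."""
    for cond in pool.get("status", {}).get("conditions", []):
        if cond.get("type") == "Updating":
            return cond.get("status") == "True"
    return False


def stuck_machines(machine_list):
    """Names of Machines that are provisioned or provisioning but have no node."""
    stuck = []
    for machine in machine_list.get("items", []):
        status = machine.get("status", {})
        if status.get("phase") in ("Provisioning", "Provisioned") and not status.get("nodeRef"):
            stuck.append(machine["metadata"]["name"])
    return stuck
```

A new machine normally spends a few minutes without a node, so a non-empty result only indicates the bug if it persists.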

Comment 7 Colin Walters 2020-10-22 01:35:30 UTC

This isn't merged yet but marking VERIFIED since we have tested it manually.

Comment 8 Eric Paris 2020-10-22 01:49:44 UTC

We'll leave it as MODIFIED, and I overwrote the flag on the 4.6 bug.

Comment 9 Colin Walters 2020-10-22 02:22:31 UTC

Like all good bugs, a few separate things combined here to let this sneak under the radar.

The biggest obviously is the fact that this only reproduces *during* an upgrade; it's
the kind of thing easy to hit in a cluster someone's using actively, but a lot of our CI is
a bit too "synthetic". (Hmm, do any of our periodics enable the autoscaler? That would have helped)
@wking was looking at some of this in https://github.com/openshift/release/pull/13009

The second biggest thing here though is our lack of visibility into machines which fail during ignition/firstboot.
xref https://github.com/coreos/ignition/issues/585
(This issue is really the flip side of Ignition; we're in a known good state if it succeeds, but debugging is painful)

I would guess it's quite possible that some people have hit this but not realized it - I think last time this happened the machineAPI team at least added a prometheus alert:
https://github.com/openshift/machine-api-operator/commit/706ecf9cc21fe901fab84a7c0a49a726970560f2
But diagnosing from that alert is a whole other thing, requiring ssh (in this case) or (in cases of ignition failure) going to the console.

Even if our e2e testing hit this it'd feel like a flake and could get lost in retesting.
If we had the failure reporting I think we'd want any failures like that to be reported very loudly, and be sure that failure turns into a `clusteroperator/machine-config` critical alert or so.
Then it should be easier to be sure that if we see this on any periodics or CI jobs (or from users obviously) it gets in front of the MCO/CoreOS teams.

Comment 11 sunzhaohua 2020-10-22 08:08:39 UTC

Based on Comment 6, we reproduced this bug by upgrading from 4.5.0-0.nightly-2020-10-21-224736 to 4.6.0-0.nightly-2020-10-21-195503: the new machines were stuck in the Provisioned phase.
We verified the fix by upgrading from 4.5.0-0.nightly-2020-10-21-224736 to 4.7.0-0.ci-2020-10-22-020841: the new worker nodes could join the cluster.
Reproducer:
1. Provision a 4.5 cluster.
2. Create a PDB for a deployment so that the node drain cannot succeed.
3. Start the upgrade with `oc adm upgrade`.
4. Watch `oc -n openshift-machine-config-operator get ds/machine-config-server` until the new MCS has rolled out; you can use e.g. `oc -n openshift-machine-config-operator logs pod/machine-config-server-xyz` to verify that it reports its version as 4.6.
5. Verify in `oc get machineconfigpool/worker` that the pool is still progressing (you have at least one worker blocked).
6. Try scaling up a worker machineset via e.g. `oc -n openshift-machine-api scale machineset/worker-xyz --replicas=<N>`.

$ oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.ci-2020-10-22-020841   True        False         101m    Cluster version is 4.7.0-0.ci-2020-10-22-020841

$ oc get machine
NAME                                             PHASE     TYPE              REGION           ZONE   AGE
zhsun22azure-6s5ls-master-0                      Running   Standard_D8s_v3   northcentralus          4h22m
zhsun22azure-6s5ls-master-1                      Running   Standard_D8s_v3   northcentralus          4h22m
zhsun22azure-6s5ls-master-2                      Running   Standard_D8s_v3   northcentralus          4h22m
zhsun22azure-6s5ls-worker-northcentralus-7nglq   Running   Standard_D2s_v3   northcentralus          4h10m
zhsun22azure-6s5ls-worker-northcentralus-fhctl   Running   Standard_D2s_v3   northcentralus          4h10m
zhsun22azure-6s5ls-worker-northcentralus-gmd8v   Running   Standard_D2s_v3   northcentralus          12m
zhsun22azure-6s5ls-worker-northcentralus-hfwlk   Running   Standard_D2s_v3   northcentralus          12m

$ oc get node
NAME                                             STATUS   ROLES    AGE     VERSION
zhsun22azure-6s5ls-master-0                      Ready    master   4h30m   v1.19.0+80fd895
zhsun22azure-6s5ls-master-1                      Ready    master   4h31m   v1.19.0+80fd895
zhsun22azure-6s5ls-master-2                      Ready    master   4h30m   v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-7nglq   Ready    worker   4h15m   v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-fhctl   Ready    worker   4h14m   v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-gmd8v   Ready    worker   17m     v1.19.0+80fd895
zhsun22azure-6s5ls-worker-northcentralus-hfwlk   Ready    worker   17m     v1.19.0+80fd895

Comment 15 errata-xmlrpc 2021-02-24 15:27:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 16 W. Trevor King 2021-04-05 17:47:46 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
