1890362 – workers may fail to join the cluster during an update from 4.5

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Antonio Murdaca
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:	1890250
Blocks:

Reported:	2020-10-22 01:33 UTC by Eric Paris
Modified:	2021-04-15 07:57 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1890250
Environment:
Last Closed:	2020-10-27 16:47:52 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2168	0	None	closed	Bug 1890362: mcs: Ensure that the encapsulated config is spec 2 if requested	2020-12-17 03:42:06 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:48:11 UTC

Comment 1 sunzhaohua 2020-10-22 08:13:14 UTC

Verified by upgrading from 4.5.0-0.nightly-2020-10-21-224736 to 4.6.0-0.nightly-2020-10-22-034051. Worker nodes could join the cluster.
Reproducer:
1. Provision a 4.5 cluster.
2. Create a PDB for a deployment so that node drain cannot succeed (the PDB prevents it).
3. Run `oc adm upgrade`.
4. Watch `oc -n openshift-machine-config-operator get ds/machine-config-server` until the new MCS rolls out; you can use e.g. `oc -n openshift-machine-config-operator logs pod/machine-config-server-xyz` to verify that it reports its version as 4.6.
5. Verify in `oc get machineconfigpool/worker` that the pool is still progressing (at least one worker is blocked).
6. Try scaling up a worker machineset via e.g. `oc -n openshift-machine-api scale machineset/worker-xyz`.
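
Step 5 can be checked by script rather than by eye. The following is only a sketch: it assumes that the table printed by `oc get machineconfigpool/worker` has a header row with a column named `UPDATING` (not shown in this report), and `pool_is_updating` is a hypothetical helper name.

```shell
# pool_is_updating: reads `oc get machineconfigpool/<name>` output on stdin.
# Exits 0 when the UPDATING column of the first data row is "True", 1 otherwise.
# Assumption: the first line is a header that contains a column named UPDATING.
pool_is_updating() {
  awk '
    # Header row: remember which field number holds the UPDATING column.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "UPDATING") idx = i; next }
    # First data row: record whether that field is "True".
    NR == 2 && idx > 0 { found = ($idx == "True") }
    END { exit (found ? 0 : 1) }
  '
}

# Hypothetical usage against a live cluster:
#   oc get machineconfigpool/worker | pool_is_updating && echo "worker pool still progressing"
```

If the column is missing or the input is empty, the helper exits 1, so it never reports a pool as progressing by accident.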

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-10-22-034051   True        False         7m17s   Cluster version is 4.6.0-0.nightly-2020-10-22-034051

$ oc get machine
NAME                                         PHASE     TYPE         REGION      ZONE         AGE
zhsun22aws1-4d6b9-master-0                   Running   m4.xlarge    us-east-2   us-east-2a   3h39m
zhsun22aws1-4d6b9-master-1                   Running   m4.xlarge    us-east-2   us-east-2b   3h39m
zhsun22aws1-4d6b9-master-2                   Running   m4.xlarge    us-east-2   us-east-2c   3h39m
zhsun22aws1-4d6b9-worker-us-east-2a-6vblc    Running   m4.large     us-east-2   us-east-2a   9m27s
zhsun22aws1-4d6b9-worker-us-east-2aa-m8d4t   Running   m5.4xlarge   us-east-2   us-east-2a   78m
zhsun22aws1-4d6b9-worker-us-east-2b-2r695    Running   m4.large     us-east-2   us-east-2b   9m27s
zhsun22aws1-4d6b9-worker-us-east-2b-s6rmx    Running   m4.large     us-east-2   us-east-2b   3h26m
zhsun22aws1-4d6b9-worker-us-east-2c-44fkq    Running   m4.large     us-east-2   us-east-2c   3h26m

$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-144-169.us-east-2.compute.internal   Ready    worker   24m     v1.19.0+d59ce34
ip-10-0-154-83.us-east-2.compute.internal    Ready    master   3h50m   v1.19.0+d59ce34
ip-10-0-159-91.us-east-2.compute.internal    Ready    worker   94m     v1.19.0+d59ce34
ip-10-0-177-122.us-east-2.compute.internal   Ready    worker   3h41m   v1.19.0+d59ce34
ip-10-0-177-223.us-east-2.compute.internal   Ready    worker   23m     v1.19.0+d59ce34
ip-10-0-180-169.us-east-2.compute.internal   Ready    master   3h50m   v1.19.0+d59ce34
ip-10-0-200-202.us-east-2.compute.internal   Ready    master   3h50m   v1.19.0+d59ce34
ip-10-0-209-66.us-east-2.compute.internal    Ready    worker   3h41m   v1.19.0+d59ce34
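
The verification above amounts to every machine reporting `Running` and every node reporting `Ready`. A minimal sketch of that check, assuming tabular output like the listings above (header row first, whitespace-separated columns); `count_not_matching` is a hypothetical helper name, not part of `oc`:

```shell
# count_not_matching COLUMN EXPECTED
# Reads tabular `oc get` output on stdin and prints the number of data rows
# whose value in the column headed COLUMN differs from EXPECTED.
# Exits 2 if the header row has no column named COLUMN.
count_not_matching() {
  awk -v col="$1" -v want="$2" '
    # Header row: find the field number of the requested column.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    # Data rows: count non-empty rows whose value is not the expected one.
    NF > 0 && idx > 0 && $idx != want { bad++ }
    END { if (idx == 0) exit 2; print bad + 0 }
  '
}

# Hypothetical usage against a live cluster:
#   oc get node    | count_not_matching STATUS Ready
#   oc get machine | count_not_matching PHASE  Running
```

A result of `0` for both commands matches the state shown in this comment. The comparison is exact, so any status other than plain `Ready` is counted as not matching.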

Comment 6 errata-xmlrpc 2020-10-27 16:47:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
