Bug 1943754 - Update blocked from 4.6 to 4.7 when using spot/preemptible instances
Summary: Update blocked from 4.6 to 4.7 when using spot/preemptible instances
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6.z
Hardware: All
OS: All
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.z
Assignee: Joel Speed
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On: 1938947
Blocks:
 
Reported: 2021-03-27 04:30 UTC by OpenShift BugZilla Robot
Modified: 2021-04-20 18:52 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-20 18:52:39 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 539 0 None open Bug 1943754: Ensure automountServiceAccountToken is synced on service account updates 2021-03-28 03:40:26 UTC
Red Hat Product Errata RHBA-2021:1149 0 None None None 2021-04-20 18:52:59 UTC

Description OpenShift BugZilla Robot 2021-03-27 04:30:44 UTC
+++ This bug was initially created as a clone of Bug #1938947 +++

Description of problem:
The update hangs at 26% with the machine-api-termination-handler not starting any pods; there are currently 0 replicas of it. We did check the SCC: it generates its own correctly but does not seem to use it (maybe #554).
We already deleted its DaemonSet and disabled the Cluster Autoscaler and the MachineAutoscalers, but we still get the following events (and no other events):

```
Error creating: pods "machine-api-termination-handler-" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.volumes[2]: Invalid value: "secret": secret volumes are not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]
```
We need a way to get the update done completely. Can we skip this component or something? The update has already started, so we can't reset it.

Version-Release number of selected component (if applicable):
from 4.6 to 4.7

OKD reference: https://github.com/openshift/okd/issues/559

Additional info:
must-gather
https://drive.google.com/file/d/1UxcwoCKTcTM9lVsFEUeJkDHsA4tgzpR2/view?usp=sharing

--- Additional comment from alexander on 2021-03-15 10:18:59 UTC ---

Workaround so the update stops hanging and continues: add the "privileged" SCC to the service account of the machine-api-termination-handler.
This might get overwritten by the operator again, but for now it lets us get on with the update.

It still does not fix the bug.
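
(For reference, a sketch of that workaround as a single command, assuming the default openshift-machine-api namespace; note that the privileged SCC is broader than the dedicated SCC the operator intends the handler to use:)

```
oc adm policy add-scc-to-user privileged -n openshift-machine-api \
  -z machine-api-termination-handler
```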

--- Additional comment from jspeed on 2021-03-15 12:13:00 UTC ---

I think the problem here has to do with the ServiceAccount, though this doesn't seem to have been captured in the must-gather.

Could you check that the service account for the termination handler matches https://github.com/openshift/machine-api-operator/blob/ff46cf5e8df5cb27d34b1e1e67e297ed21b42b3e/install/0000_30_machine-api-operator_09_rbac.yaml#L21-L29?

In particular, is the `automountServiceAccountToken` line present and correct?

Based on the output, I think the pod is trying to mount a secret volume (which is not in the DaemonSet spec), and that is not allowed by the dedicated SCC. The only reason I can think it would do that is that it is trying to mount the service account token.
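
(A quick way to check whether the field is set on the live object; an empty result means `automountServiceAccountToken` is missing:)

```
oc get serviceaccount machine-api-termination-handler -n openshift-machine-api \
  -o jsonpath='{.automountServiceAccountToken}'
```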

--- Additional comment from alexander on 2021-03-15 13:10:40 UTC ---

Yeah, the service account matches the YAML config; we checked that.

`automountServiceAccountToken` does not exist. Where do I find it?

--- Additional comment from jspeed on 2021-03-15 13:15:59 UTC ---

That is line 29 of the service account definition, https://github.com/openshift/machine-api-operator/blob/ff46cf5e8df5cb27d34b1e1e67e297ed21b42b3e/install/0000_30_machine-api-operator_09_rbac.yaml#L29. Are you sure it is definitely there?

--- Additional comment from alexander on 2021-03-15 13:21:35 UTC ---

This is our service account definition:

kind: ServiceAccount
apiVersion: v1
metadata:
  name: machine-api-termination-handler
  namespace: openshift-machine-api
  selfLink: >-
    /api/v1/namespaces/openshift-machine-api/serviceaccounts/machine-api-termination-handler
  uid: 6278662a-c7f5-427b-ac3a-483abbe39ea9
  resourceVersion: '42427226'
  creationTimestamp: '2020-12-21T08:14:20Z'
  annotations:
    include.release.openshift.io/self-managed-high-availability: 'true'
    include.release.openshift.io/single-node-developer: 'true'
secrets:
  - name: machine-api-termination-handler-token-gmr6v
  - name: machine-api-termination-handler-dockercfg-rjnnx
imagePullSecrets:
  - name: machine-api-termination-handler-dockercfg-rjnnx

--- Additional comment from jspeed on 2021-03-15 13:52:26 UTC ---

OK, so that looks to be the problem: the `automountServiceAccountToken` field is missing. You should be able to add it with a value of `false`.
It should be at the same indentation level as `secrets`. Check the link in the previous comment for an example.
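
(For illustration, a minimal sketch of the corrected ServiceAccount, showing only the relevant fields; the secret names are the ones from the definition pasted above:)

```
kind: ServiceAccount
apiVersion: v1
metadata:
  name: machine-api-termination-handler
  namespace: openshift-machine-api
secrets:
  - name: machine-api-termination-handler-token-gmr6v
  - name: machine-api-termination-handler-dockercfg-rjnnx
# top-level field, at the same indentation level as secrets
automountServiceAccountToken: false
```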

--- Additional comment from jspeed on 2021-03-26 11:22:17 UTC ---

@alexander Did you get anywhere with this? Do you need further assistance?

--- Additional comment from alexander on 2021-03-26 11:42:03 UTC ---

Don't know. We worked around the initial bug by adding the "privileged" SCC to the service account of the machine-api-termination-handler.
We never had any issues after that, but I don't know whether this happens for other people too, or whether our service account now has root privileges it shouldn't have.

--- Additional comment from jspeed on 2021-03-26 11:56:06 UTC ---

Could you tell me exactly which OKD release you used so I can try to reproduce the upgrade? I assume it was one of the releases from https://github.com/openshift/okd/releases?

--- Additional comment from alexander on 2021-03-26 12:10:28 UTC ---

Currently on my phone: https://github.com/openshift/okd/issues/559, on AWS installer-provisioned infrastructure.

--- Additional comment from jspeed on 2021-03-26 13:47:41 UTC ---

I've managed to reproduce this today. This is an upgrade blocker for anyone who uses spot instances.

The issue seems to be that the images are updated before the manifests in the payload are deployed by the CVO.

When the images are updated, the MAO restarts and updates the DaemonSet.

The updated DaemonSet NEEDS the updated service account from the manifests, but for some reason this hasn't been updated yet.

Because the DaemonSet cannot be healthy without the updated service account, the machine-api cluster operator goes Degraded, blocking further upgrade progress.

We need to work out why the RBAC changes aren't being deployed before/with the image reference updates; we will need some help from the CVO folks for this.
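
(For context, the fix tracked in cluster-version-operator pull 539, "Ensure automountServiceAccountToken is synced on service account updates", addresses the manifest-sync side of this. A minimal sketch of that kind of merge logic, using hypothetical package and function names rather than the actual CVO code:)

```
// Sketch only: copy automountServiceAccountToken from the required
// (manifest) ServiceAccount onto the existing (in-cluster) object, so the
// field is not dropped during reconciliation. Names are illustrative,
// not the real CVO implementation.
package resourcemergesketch

import corev1 "k8s.io/api/core/v1"

// ensureAutomountToken returns true if the existing object was modified.
func ensureAutomountToken(existing, required *corev1.ServiceAccount) bool {
	if required.AutomountServiceAccountToken == nil {
		return false
	}
	if existing.AutomountServiceAccountToken != nil &&
		*existing.AutomountServiceAccountToken == *required.AutomountServiceAccountToken {
		return false
	}
	existing.AutomountServiceAccountToken = required.AutomountServiceAccountToken
	return true
}
```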

--- Additional comment from alexander on 2021-03-26 14:03:12 UTC ---

Yeah, we used spot instances too, but after the workaround we hit this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1939054. So we decided to deactivate spot instances for now.

--- Additional comment from lmohanty on 2021-03-26 14:30:51 UTC ---

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

--- Additional comment from jspeed on 2021-03-26 14:59:44 UTC ---

Who is impacted?  
- Any customer upgrading from any 4.6.z to any 4.7.z (this should be patched in 4.8), if and only if they are using spot/preemptible instances on AWS, GCP or Azure
What is the impact?
- Upgrade stops at the machine-api cluster operator as the MAO goes into a Degraded state
- Spot termination handlers are not running, so spot instances may be removed without warning or graceful termination
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- Remediation requires a patch to the machine-api-termination-handler service account, command below:
- oc patch --type merge -n openshift-machine-api serviceaccount machine-api-termination-handler -p '{"automountServiceAccountToken":false}'
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- No. We changed the way the termination handlers work in 4.7, but it is all permission changes, so there is no change in functionality

Comment 1 W. Trevor King 2021-03-31 03:47:07 UTC
I'm clearing UpgradeBlocker from this series based on the straightforward 'oc patch ...' workaround [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1938947#c17

Comment 4 sunzhaohua 2021-04-14 06:24:14 UTC
Failed to verify.
Steps:
1. Set up a 4.6.24 cluster
2. Create a spot instance
3. Upgrade to 4.7.0-0.nightly-2021-04-13-144216; the update hangs at 26% with machine-api-termination-handler not starting any pods.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.24    True        True          123m    Working towards 4.7.0-0.nightly-2021-04-13-144216: 178 of 668 done (26% complete), waiting on machine-api

$ oc get co machine-api -o yaml
    message: 'Failed when progressing towards operator: 4.7.0-0.nightly-2021-04-13-144216 because daemonset machine-api-termination-handler is not ready. status: (desired: 1, updated: 0, available: 0, unavailable: 1)'
    reason: SyncingFailed
    status: "True"
    type: Degraded

4m28s       Warning   FailedCreate        daemonset/machine-api-termination-handler           Error creating: pods "machine-api-termination-handler-" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.volumes[2]: Invalid value: "secret": secret volumes are not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         0       0            0           machine.openshift.io/interruptible-instance=   3h10m

$ oc get po
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-56c4b7fc94-82chn   2/2     Running   0          168m
machine-api-controllers-f64fd7646-svxtk        7/7     Running   0          100m
machine-api-operator-cf4d88fc4-bkzlh           2/2     Running   0          102m

$ oc get sa machine-api-termination-handler -o yaml
apiVersion: v1
imagePullSecrets:
- name: machine-api-termination-handler-dockercfg-tw66n
kind: ServiceAccount
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2021-04-14T02:57:28Z"
  name: machine-api-termination-handler
  namespace: openshift-machine-api
  resourceVersion: "57024"
  selfLink: /api/v1/namespaces/openshift-machine-api/serviceaccounts/machine-api-termination-handler
  uid: 593ac503-c57d-4ce3-920e-b4e1c447a6aa
secrets:
- name: machine-api-termination-handler-token-2mn2p
- name: machine-api-termination-handler-dockercfg-tw66n

Comment 7 sunzhaohua 2021-04-16 06:53:11 UTC
Verified
clusterversion: 4.7.0-0.nightly-2021-04-15-035247

Steps:
1. Set up a 4.6.24 cluster
2. Create a spot instance
3. Upgrade to 4.7.0-0.nightly-2021-04-15-035247; the upgrade is successful

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-04-15-035247   True        False         13m     Cluster version is 4.7.0-0.nightly-2021-04-15-035247
$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         1       1            1           machine.openshift.io/interruptible-instance=   162m

Comment 9 errata-xmlrpc 2021-04-20 18:52:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.7 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1149

