1938947 – Update blocked from 4.6 to 4.7 when using spot/preemptible instances

Summary: Update blocked from 4.6 to 4.7 when using spot/preemptible instances

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.6.z
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Joel Speed
QA Contact:	sunzhaohua
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1943754

Reported:	2021-03-15 10:13 UTC by Alexander Niebuhr
Modified:	2021-07-27 22:53 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 22:53:18 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 537	0	None	open	Bug 1938947: Ensure automountServiceAccountToken is synced on service account updates	2021-03-26 16:27:00 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:53:42 UTC

Description Alexander Niebuhr 2021-03-15 10:13:40 UTC

Description of problem:
The update hangs at 26% with machine-termination-handler not starting any pods; it currently has 0 replicas. We checked the SCC: it generates its own correctly but does not seem to use it (maybe related to #554).
We have already deleted its DaemonSet and disabled the Cluster Autoscaler and Machine Autoscalers, but we still get the following events and no others:

```
Error creating: pods "machine-api-termination-handler-" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.volumes[2]: Invalid value: "secret": secret volumes are not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]
```
We need a way to get the update done completely. Can we skip this step or something similar? The update has already started, so we can't reset it.

Version-Release number of selected component (if applicable):
from 4.6 to 4.7

OKD reference: https://github.com/openshift/okd/issues/559

Additional info:
must-gather
https://drive.google.com/file/d/1UxcwoCKTcTM9lVsFEUeJkDHsA4tgzpR2/view?usp=sharing

Comment 1 Alexander Niebuhr 2021-03-15 10:18:59 UTC

Workaround so the update does not hang and continues: add the "privileged" SCC to the service account of machine-termination-handler.
This might get overwritten by the operator again, but for now it lets us carry on with the update.

This still does not fix the bug.

Comment 2 Joel Speed 2021-03-15 12:13:00 UTC

I think the problem here is to do with the ServiceAccount, though this doesn't seem to have been captured in the must gather.

Could you check that the service account for the termination handler matches https://github.com/openshift/machine-api-operator/blob/ff46cf5e8df5cb27d34b1e1e67e297ed21b42b3e/install/0000_30_machine-api-operator_09_rbac.yaml#L21-L29

In particular, that the `automountServiceAccountToken` line is correct?

I think the problem here is that (based on the output) it is trying to mount a secret (which is not in the spec of the daemonset) which is not allowed by the dedicated SCC. The only reason I can think it would be doing that is because it's trying to mount the service account token.
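
The reasoning above can be checked mechanically against the event text. The following is a minimal Python sketch, not part of any OpenShift tooling: the function name `rejected_volume_types` and the regular expression are assumptions made for illustration, and the sample message is a shortened copy of the event from the description. It collects the volume types that the SCC rejected; a `secret` entry that the DaemonSet spec does not declare would be consistent with the service account token being auto-mounted.

```python
import re

# Hypothetical helper: matches fragments such as
#   spec.volumes[2]: Invalid value: "secret"
# in an SCC validation error. The pattern is an assumption based on the
# event text quoted in this bug, not a documented message format.
_VOLUME_ERROR = re.compile(r'spec\.volumes\[(\d+)\]: Invalid value: "([A-Za-z]+)"')


def rejected_volume_types(message: str) -> dict:
    """Map each rejected volume index to its volume type."""
    return {int(index): kind for index, kind in _VOLUME_ERROR.findall(message)}


# Shortened copy of the event from the description.
EVENT = (
    'spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used '
    'spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used '
    'spec.volumes[2]: Invalid value: "secret": secret volumes are not allowed to be used'
)

rejected = rejected_volume_types(EVENT)
# A rejected "secret" volume that is absent from the DaemonSet spec hints at
# an auto-mounted service account token.
has_unexpected_secret = "secret" in rejected.values()
```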

Comment 3 Alexander Niebuhr 2021-03-15 13:10:40 UTC

Yes, the service account matches the YAML config; we checked that.

`automountServiceAccountToken` does not exist. Where do I find it?

Comment 4 Joel Speed 2021-03-15 13:15:59 UTC

That is line 29 of the service account https://github.com/openshift/machine-api-operator/blob/ff46cf5e8df5cb27d34b1e1e67e297ed21b42b3e/install/0000_30_machine-api-operator_09_rbac.yaml#L29, are you sure it is definitely there?

Comment 5 Alexander Niebuhr 2021-03-15 13:21:35 UTC

This is our service account definition:

kind: ServiceAccount
apiVersion: v1
metadata:
  name: machine-api-termination-handler
  namespace: openshift-machine-api
  selfLink: >-
    /api/v1/namespaces/openshift-machine-api/serviceaccounts/machine-api-termination-handler
  uid: 6278662a-c7f5-427b-ac3a-483abbe39ea9
  resourceVersion: '42427226'
  creationTimestamp: '2020-12-21T08:14:20Z'
  annotations:
    include.release.openshift.io/self-managed-high-availability: 'true'
    include.release.openshift.io/single-node-developer: 'true'
secrets:
  - name: machine-api-termination-handler-token-gmr6v
  - name: machine-api-termination-handler-dockercfg-rjnnx
imagePullSecrets:
  - name: machine-api-termination-handler-dockercfg-rjnnx

Comment 6 Joel Speed 2021-03-15 13:52:26 UTC

OK, so that looks to be the problem: the `automountServiceAccountToken` field is missing. You should be able to add it with the value `false`.
It should be on the same indentation level as `secrets`. Check the link in the previous comment for an example.
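
As a minimal sketch of that check (an illustration only: the function name `missing_automount_opt_out` is hypothetical, and the manifest is assumed to be already loaded into a Python dict), the following reports whether a ServiceAccount lacks the top-level `automountServiceAccountToken: false` setting:

```python
def missing_automount_opt_out(service_account: dict) -> bool:
    """Return True when automountServiceAccountToken is absent or not False.

    The field sits at the top level of the object, next to `secrets`,
    not under `metadata`.
    """
    return service_account.get("automountServiceAccountToken") is not False


# Abridged version of the ServiceAccount posted in comment 5.
reported = {
    "kind": "ServiceAccount",
    "apiVersion": "v1",
    "metadata": {
        "name": "machine-api-termination-handler",
        "namespace": "openshift-machine-api",
    },
    "secrets": [{"name": "machine-api-termination-handler-token-gmr6v"}],
}

# Same object with the field added at the same level as `secrets`.
fixed = {**reported, "automountServiceAccountToken": False}

needs_fix_before = missing_automount_opt_out(reported)
needs_fix_after = missing_automount_opt_out(fixed)
```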

Comment 7 Joel Speed 2021-03-26 11:22:17 UTC

@alexander Did you get anywhere with this? Do you need further assistance?

Comment 8 Alexander Niebuhr 2021-03-26 11:42:03 UTC

I don't know. We worked around the initial bug by adding the "privileged" SCC to the service account of machine-termination-handler.
We never had any issues after that, but I don't know whether this happens for other people too, or whether our service account now has root privileges it shouldn't have.

Comment 9 Joel Speed 2021-03-26 11:56:06 UTC

Could you tell me exactly which OKD release you used so I can try to reproduce the upgrade? I assume it was one of the releases from https://github.com/openshift/okd/releases?

Comment 10 Alexander Niebuhr 2021-03-26 12:10:28 UTC

I'm currently on my phone; see https://github.com/openshift/okd/issues/559. This is on AWS installer-provisioned infrastructure.

Comment 11 Joel Speed 2021-03-26 13:47:41 UTC

I've managed to reproduce this today. This is an upgrade blocker for anyone who uses spot instances.

The issue seems to be that the images are being updated before the manifests in the payload are being deployed by CVO.

When the images are updated, the MAO restarts and updates the DaemonSet.

The updated daemonset NEEDS the updated service account from the manifests, but for some reason this hasn't been updated yet.

Because the daemonset cannot be healthy without the updated service account, this degrades the MAO cluster operator blocking further upgrades.

Need to work out why the RBAC changes aren't being deployed before/with the image reference updates, will need some help from CVO folks for this

Comment 12 Alexander Niebuhr 2021-03-26 14:03:12 UTC

Yes, we used spot instances too,
but after the workaround we hit this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1939054
so we decided to deactivate spot instances for now.

Comment 13 Lalatendu Mohanty 2021-03-26 14:30:51 UTC

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?  Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it’s always been like this we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 14 Joel Speed 2021-03-26 14:59:44 UTC

Who is impacted?  
- Any customer upgrading from any 4.6.z to any 4.7.z (this should be patched in 4.8), if and only if they are using spot/preemptible instances on AWS, GCP or Azure
What is the impact?
- Upgrade stops at Machine API as MAO goes into degraded state
- Spot termination handlers are not running, spot instances may be removed without warning/graceful termination
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- Remediation requires a patch to the machine-api-termination-handler service account, command below:
- oc patch --type merge -n openshift-machine-api serviceaccount machine-api-termination-handler -p '{"automountServiceAccountToken":false}'
Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- No, we changed the way the termination handlers work in 4.7, but it is all permission changes, so no change in functionality
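
For scripted remediation, the `oc patch` command in this comment can be assembled programmatically. This is a sketch under assumptions: the helper name `build_patch_command` is hypothetical, and it only builds the argument list, leaving execution (for example via `subprocess.run`) to the caller on a machine where `oc` is logged in to the cluster.

```python
import json
import shlex


def build_patch_command(namespace: str, name: str) -> list:
    """Build the argv for the merge patch given in comment 14."""
    # Compact separators reproduce the JSON exactly as written in the comment.
    patch = json.dumps({"automountServiceAccountToken": False}, separators=(",", ":"))
    return [
        "oc", "patch", "--type", "merge",
        "-n", namespace,
        "serviceaccount", name,
        "-p", patch,
    ]


argv = build_patch_command("openshift-machine-api", "machine-api-termination-handler")
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
command_line = shlex.join(argv)
```

Building an argument list rather than a single string avoids shell quoting problems if the command is later passed to `subprocess.run` directly.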

Comment 15 W. Trevor King 2021-03-27 04:37:39 UTC

Potential verification process:

1. Install a 4.8 nightly with the fix.
2. Poke automountServiceAccountToken in some CVO-managed ServiceAccount, e.g. machine-api-termination-handler in the openshift-machine-api namespace [1].
3. Wait a few minutes.
4. Confirm that the CVO has stomped your change, and the property is back to its original value.

[1]: https://github.com/openshift/machine-api-operator/commit/deaa09f1dcfaa8cdbc84a0e760edc03f1255d903#diff-9cd166d71ea385fc76930a2e6b3df411a0c7418edcdee6e5039218dce403c175R19-R26
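
The behaviour being verified can be modelled with a toy reconcile step. This Python sketch is not the CVO's code (the CVO is written in Go); the function `sync_automount` and the dict-based objects are assumptions made purely to illustrate steps 2 to 4: a manual change to the field is overwritten on the next sync with the value from the release manifest.

```python
def sync_automount(existing: dict, required: dict) -> bool:
    """Copy automountServiceAccountToken from the manifest to the live object.

    Returns True when the live object had drifted and was modified.
    """
    key = "automountServiceAccountToken"
    if key not in required:
        # Assumption for this toy model: an unset manifest field is left alone.
        return False
    if existing.get(key) == required[key]:
        return False
    existing[key] = required[key]
    return True


manifest = {"automountServiceAccountToken": False}  # value shipped in the payload
live = {"automountServiceAccountToken": False}

live["automountServiceAccountToken"] = True   # step 2: poke the field
changed = sync_automount(live, manifest)      # steps 3-4: next sync stomps it
second_pass = sync_automount(live, manifest)  # nothing left to fix
```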

Comment 17 W. Trevor King 2021-03-31 03:46:10 UTC

I'm clearing UpgradeBlocker based on the 'oc patch ...' command from comment 14 being a sufficiently straightforward workaround for anyone who gets bit by this before we get a fix out.

Comment 18 sunzhaohua 2021-03-31 07:40:21 UTC

Verified:
1. Set up a 4.7.4 cluster.
2. Create a spot instance.
3. Upgrade to 4.8.0-0.nightly-2021-03-30-181828 successfully.
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-30-181828   True        False         44m     Cluster version is 4.8.0-0.nightly-2021-03-30-181828
$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-134-198.us-east-2.compute.internal   Ready    master   3h29m   v1.20.0+29a606d
ip-10-0-152-81.us-east-2.compute.internal    Ready    worker   3h21m   v1.20.0+29a606d
ip-10-0-168-137.us-east-2.compute.internal   Ready    worker   3h21m   v1.20.0+29a606d
ip-10-0-184-32.us-east-2.compute.internal    Ready    master   3h29m   v1.20.0+29a606d
ip-10-0-203-35.us-east-2.compute.internal    Ready    master   3h29m   v1.20.0+29a606d
ip-10-0-211-71.us-east-2.compute.internal    Ready    worker   169m    v1.20.0+29a606d
$ oc get po
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-6c5c5b564c-fb996   2/2     Running   0          52m
cluster-baremetal-operator-664cb9c5d9-vjmsm    2/2     Running   0          52m
machine-api-controllers-78fbffc475-26p97       7/7     Running   0          52m
machine-api-operator-5675cb644f-mcmz2          2/2     Running   0          52m
machine-api-termination-handler-wh6g5          1/1     Running   0          135m

Comment 21 errata-xmlrpc 2021-07-27 22:53:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
