+++ This bug was initially created as a clone of Bug #1938947 +++

Description of problem:

The update hangs at 26% with machine-api-termination-handler not starting any pods; there are currently 0 replicas of it. We did check the SCC: it generates its own correctly but does not seem to use it, possibly #554. We already deleted its DaemonSet and disabled the Cluster Autoscaler and the MachineAutoscalers, but we still get the following events, and no others:

```
Error creating: pods "machine-api-termination-handler-" is forbidden: unable to validate against any security context constraint: [provider restricted:
.spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used
spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used
spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used
spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used
spec.volumes[2]: Invalid value: "secret": secret volumes are not allowed to be used
spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used
spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]
```

We need a way to get the update done completely. Can we skip it or something? The update has already started, so we can't reset it.

Version-Release number of selected component (if applicable):

from 4.6 to 4.7

OKD reference: https://github.com/openshift/okd/issues/559

Additional info:

must-gather: https://drive.google.com/file/d/1UxcwoCKTcTM9lVsFEUeJkDHsA4tgzpR2/view?usp=sharing

--- Additional comment from alexander on 2021-03-15 10:18:59 UTC ---

Workaround so the update is not hanging and continues: add the "privileged" SCC to the service account of machine-api-termination-handler. This might get overwritten by the operator again, but for now it lets us continue the update. It still does not fix the bug.

--- Additional comment from jspeed on 2021-03-15 12:13:00 UTC ---

I think the problem here is to do with the ServiceAccount, though this doesn't seem to have been captured in the must-gather. Could you check that the service account for the termination handler matches https://github.com/openshift/machine-api-operator/blob/ff46cf5e8df5cb27d34b1e1e67e297ed21b42b3e/install/0000_30_machine-api-operator_09_rbac.yaml#L21-L29, and in particular that the `automountServiceAccountToken` line is correct?

I think the problem is that (based on the output) the pod is trying to mount a secret which is not in the spec of the DaemonSet and which is not allowed by the dedicated SCC. The only reason I can think it would be doing that is because it's trying to mount the service account token.

--- Additional comment from alexander on 2021-03-15 13:10:40 UTC ---

Yeah, the service account matches the YAML config; we checked that. `automountServiceAccountToken` does not exist. Where do I find it?

--- Additional comment from jspeed on 2021-03-15 13:15:59 UTC ---

That is line 29 of the service account, https://github.com/openshift/machine-api-operator/blob/ff46cf5e8df5cb27d34b1e1e67e297ed21b42b3e/install/0000_30_machine-api-operator_09_rbac.yaml#L29. Are you sure it is definitely there?
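For reference, the manifest linked above defines the termination handler's ServiceAccount roughly as follows. This is an abridged sketch (annotations omitted); see the linked file for the authoritative version:

```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: machine-api-termination-handler
  namespace: openshift-machine-api
# Line 29 in the linked file: with automounting disabled, the pod spec needs
# no service-account-token secret volume, so it validates against the
# dedicated SCC.
automountServiceAccountToken: false
```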
--- Additional comment from alexander on 2021-03-15 13:21:35 UTC ---

This is our service account definition:

```
kind: ServiceAccount
apiVersion: v1
metadata:
  name: machine-api-termination-handler
  namespace: openshift-machine-api
  selfLink: >-
    /api/v1/namespaces/openshift-machine-api/serviceaccounts/machine-api-termination-handler
  uid: 6278662a-c7f5-427b-ac3a-483abbe39ea9
  resourceVersion: '42427226'
  creationTimestamp: '2020-12-21T08:14:20Z'
  annotations:
    include.release.openshift.io/self-managed-high-availability: 'true'
    include.release.openshift.io/single-node-developer: 'true'
secrets:
  - name: machine-api-termination-handler-token-gmr6v
  - name: machine-api-termination-handler-dockercfg-rjnnx
imagePullSecrets:
  - name: machine-api-termination-handler-dockercfg-rjnnx
```

--- Additional comment from jspeed on 2021-03-15 13:52:26 UTC ---

Ok, so that looks to be the problem: the `automountServiceAccountToken` field is missing. You should be able to add it with a value of false. It should be at the same indentation level as `secrets`. Check the link in the previous comment for an example.

--- Additional comment from jspeed on 2021-03-26 11:22:17 UTC ---

@alexander Did you get anywhere with this? Do you need further assistance?

--- Additional comment from alexander on 2021-03-26 11:42:03 UTC ---

I don't know. We worked around the initial bug by adding the "privileged" SCC to the service account of machine-api-termination-handler and have never had any issues since, but I don't know whether that happens for other people too, or whether our service account now has root privileges it shouldn't have.

--- Additional comment from jspeed on 2021-03-26 11:56:06 UTC ---

Could you tell me exactly which OKD release you used so I can try to reproduce the upgrade? I assume it was one of the releases from https://github.com/openshift/okd/releases?

--- Additional comment from alexander on 2021-03-26 12:10:28 UTC ---

Currently on my phone: https://github.com/openshift/okd/issues/559, on AWS installer-provisioned infrastructure.

--- Additional comment from jspeed on 2021-03-26 13:47:41 UTC ---

I've managed to reproduce this today. This is an upgrade blocker for anyone who uses spot instances.

The issue seems to be that the images are being updated before the manifests in the payload are deployed by CVO. When the images are updated, the MAO restarts and updates the DaemonSet. The updated DaemonSet needs the updated service account from the manifests, but for some reason this hasn't been updated yet. Because the DaemonSet cannot be healthy without the updated service account, this degrades the machine-api cluster operator, blocking further upgrades.

We need to work out why the RBAC changes aren't being deployed before/with the image reference updates; we will need some help from the CVO folks for this.

--- Additional comment from alexander on 2021-03-26 14:03:12 UTC ---

Yeah, we used spot instances too, but after the workaround we hit this issue, https://bugzilla.redhat.com/show_bug.cgi?id=1939054, so we decided to deactivate spot instances for now.
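To make the two approaches discussed above concrete, here is a minimal sketch, assuming cluster-admin access. The first command is the temporary privileged-SCC workaround alexander describes; the second applies jspeed's actual fix, the same patch given as the remediation further down in this thread:

```
# Temporary workaround: let the termination handler run under the privileged
# SCC. The machine-api-operator may revert this grant on a later sync.
oc adm policy add-scc-to-user privileged \
    -n openshift-machine-api \
    -z machine-api-termination-handler

# Actual fix: add the missing field (top level, same indentation as
# "secrets") so the token is no longer automounted into the pod.
oc patch serviceaccount machine-api-termination-handler \
    -n openshift-machine-api \
    --type merge \
    -p '{"automountServiceAccountToken": false}'
```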
--- Additional comment from lmohanty on 2021-03-26 14:30:51 UTC ---

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
- example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
- example: Up to 2 minute disruption in edge routing
- example: Up to 90 seconds of API downtime
- example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- example: Issue resolves itself after five minutes
- example: Admin uses oc to fix things
- example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- example: No, it's always been like this, we just never noticed
- example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

--- Additional comment from jspeed on 2021-03-26 14:59:44 UTC ---

Who is impacted?
- Any customer upgrading from any 4.6.z to any 4.7.z (this should be patched in 4.8), if and only if they are using spot/preemptible instances on AWS, GCP, or Azure

What is the impact?
- The upgrade stops at Machine API as the MAO goes into a degraded state
- Spot termination handlers are not running, so spot instances may be removed without warning or graceful termination

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- Remediation requires a patch to the machine-api-termination-handler service account, command below:
- oc patch --type merge -n openshift-machine-api serviceaccount machine-api-termination-handler -p '{"automountServiceAccountToken":false}'

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- No, we changed the way the termination handlers work in 4.7, but it is all permission changes, so there is no change in functionality
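One way to confirm the remediation has taken effect (a sketch, assuming the patch above has been applied):

```
# Should print "false" once the ServiceAccount has been patched.
oc get serviceaccount machine-api-termination-handler \
    -n openshift-machine-api \
    -o jsonpath='{.automountServiceAccountToken}'

# The DaemonSet should reach 1 ready/available pod per spot instance once the
# SCC validation stops rejecting the pod spec.
oc get daemonset machine-api-termination-handler -n openshift-machine-api
```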
I'm clearing UpgradeBlocker from this series based on the straightforward 'oc patch ...' workaround [1].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1938947#c17
Failed to verify.

Steps:
1. Set up a 4.6.24 cluster.
2. Create a spot instance.
3. Upgrade to 4.7.0-0.nightly-2021-04-13-144216; the update hangs at 26% with machine-api-termination-handler not starting any pods.

```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.24    True        True          123m    Working towards 4.7.0-0.nightly-2021-04-13-144216: 178 of 668 done (26% complete), waiting on machine-api

$ oc get co machine-api -o yaml
  message: 'Failed when progressing towards operator: 4.7.0-0.nightly-2021-04-13-144216 because daemonset machine-api-termination-handler is not ready. status: (desired: 1, updated: 0, available: 0, unavailable: 1)'
  reason: SyncingFailed
  status: "True"
  type: Degraded

4m28s   Warning   FailedCreate   daemonset/machine-api-termination-handler   Error creating: pods "machine-api-termination-handler-" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.volumes[2]: Invalid value: "secret": secret volumes are not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         0       0            0           machine.openshift.io/interruptible-instance=   3h10m

$ oc get po
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-56c4b7fc94-82chn   2/2     Running   0          168m
machine-api-controllers-f64fd7646-svxtk        7/7     Running   0          100m
machine-api-operator-cf4d88fc4-bkzlh           2/2     Running   0          102m

$ oc get sa machine-api-termination-handler -o yaml
apiVersion: v1
imagePullSecrets:
- name: machine-api-termination-handler-dockercfg-tw66n
kind: ServiceAccount
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2021-04-14T02:57:28Z"
  name: machine-api-termination-handler
  namespace: openshift-machine-api
  resourceVersion: "57024"
  selfLink: /api/v1/namespaces/openshift-machine-api/serviceaccounts/machine-api-termination-handler
  uid: 593ac503-c57d-4ce3-920e-b4e1c447a6aa
secrets:
- name: machine-api-termination-handler-token-2mn2p
- name: machine-api-termination-handler-dockercfg-tw66n
```

Note that the service account above still lacks `automountServiceAccountToken: false`, which matches the failure.
Verified.

clusterversion: 4.7.0-0.nightly-2021-04-15-035247

Steps:
1. Set up a 4.6.24 cluster.
2. Create a spot instance.
3. Upgrade to 4.7.0-0.nightly-2021-04-15-035247; the upgrade is successful.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-04-15-035247   True        False         13m     Cluster version is 4.7.0-0.nightly-2021-04-15-035247

$ oc get ds
NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                  AGE
machine-api-termination-handler   1         1         1       1            1           machine.openshift.io/interruptible-instance=   162m
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.7 bug fix update) and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1149