Description of problem:

The update hangs at 26% with machine-api-termination-handler not starting any pods; there are currently 0 replicas of it. We did check the SCC: the operator generates its own correctly, but it does not seem to be used, possibly #554. We already deleted its DaemonSet and disabled the Cluster Autoscaler and MachineAutoscalers, but we still get only the following events, and no others:

```
Error creating: pods "machine-api-termination-handler-" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.volumes[2]: Invalid value: "secret": secret volumes are not allowed to be used spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used]
```

We need a way to get the update completed. Can we skip this somehow? The update has already started, so we cannot roll it back.

Version-Release number of selected component (if applicable):
from 4.6 to 4.7

OKD reference: https://github.com/openshift/okd/issues/559

Additional info:
must-gather: https://drive.google.com/file/d/1UxcwoCKTcTM9lVsFEUeJkDHsA4tgzpR2/view?usp=sharing
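For anyone hitting the same state, this is roughly what we looked at (a sketch of standard `oc` queries; the dedicated SCC name machine-api-termination-handler is an assumption):

```
oc get daemonset machine-api-termination-handler -n openshift-machine-api
oc get events -n openshift-machine-api --field-selector reason=FailedCreate
oc get scc machine-api-termination-handler
```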
Workaround so the update is not hanging and continues: add the "privileged" SCC to the service account of machine-api-termination-handler. This might get overwritten by the operator again, but for now it lets us continue the update. It still does not fix the bug.
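The workaround was along these lines (one standard way to grant an SCC to a service account; the exact invocation we ran may have differed, and it grants broader permissions than the termination handler normally needs):

```
oc adm policy add-scc-to-user privileged -z machine-api-termination-handler -n openshift-machine-api
```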
I think the problem here is to do with the ServiceAccount, though this doesn't seem to have been captured in the must-gather. Could you check that the service account for the termination handler matches https://github.com/openshift/machine-api-operator/blob/ff46cf5e8df5cb27d34b1e1e67e297ed21b42b3e/install/0000_30_machine-api-operator_09_rbac.yaml#L21-L29? In particular, is the `automountServiceAccountToken` line correct?

Based on the output, the pod is trying to mount a secret that is not in the spec of the DaemonSet, and secret volumes are not allowed by the dedicated SCC. The only reason I can think of for it doing that is that it's trying to mount the service account token.
Yeah, the service account matches the YAML config; we checked that. `automountServiceAccountToken` does not exist, though. Where do I find it?
That is line 29 of the service account: https://github.com/openshift/machine-api-operator/blob/ff46cf5e8df5cb27d34b1e1e67e297ed21b42b3e/install/0000_30_machine-api-operator_09_rbac.yaml#L29. Are you sure it is definitely there?
This is our service account definition:

```
kind: ServiceAccount
apiVersion: v1
metadata:
  name: machine-api-termination-handler
  namespace: openshift-machine-api
  selfLink: >-
    /api/v1/namespaces/openshift-machine-api/serviceaccounts/machine-api-termination-handler
  uid: 6278662a-c7f5-427b-ac3a-483abbe39ea9
  resourceVersion: '42427226'
  creationTimestamp: '2020-12-21T08:14:20Z'
  annotations:
    include.release.openshift.io/self-managed-high-availability: 'true'
    include.release.openshift.io/single-node-developer: 'true'
secrets:
  - name: machine-api-termination-handler-token-gmr6v
  - name: machine-api-termination-handler-dockercfg-rjnnx
imagePullSecrets:
  - name: machine-api-termination-handler-dockercfg-rjnnx
```
Ok, so yeah, that looks to be the problem: the `automountServiceAccountToken` field is missing. You should be able to add it with a value of `false`; it should be at the same indentation level as `secrets`. Check the link in the previous comment for an example.
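For reference, a minimal sketch of the shape the service account should end up with (based on the linked manifest; the secret names are the ones from your cluster above, and server-populated metadata is omitted):

```
kind: ServiceAccount
apiVersion: v1
metadata:
  name: machine-api-termination-handler
  namespace: openshift-machine-api
automountServiceAccountToken: false
secrets:
  - name: machine-api-termination-handler-token-gmr6v
  - name: machine-api-termination-handler-dockercfg-rjnnx
imagePullSecrets:
  - name: machine-api-termination-handler-dockercfg-rjnnx
```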
@alexander Did you get anywhere with this? Do you need further assistance?
I don't know; we worked around the initial bug by adding the "privileged" SCC to the service account of machine-api-termination-handler. We never had any issues after that, but I don't know whether this happens for other people too, or whether our service account now has root privileges it shouldn't have.
Could you tell me exactly which OKD release you used so I can try to reproduce the upgrade? I assume it was one of the releases from https://github.com/openshift/okd/releases?
I'm currently on my phone: https://github.com/openshift/okd/issues/559. This was on AWS installer-provisioned infrastructure.
I've managed to reproduce this today. This is an upgrade blocker for anyone who uses spot instances.

The issue seems to be that the images are being updated before the manifests in the payload are deployed by the CVO. When the images are updated, the MAO restarts and updates the DaemonSet. The updated DaemonSet NEEDS the updated service account from the manifests, but for some reason that hasn't been applied yet. Because the DaemonSet cannot be healthy without the updated service account, this degrades the machine-api cluster operator, blocking further upgrades.

We need to work out why the RBAC changes aren't being deployed before/with the image reference updates; we will need some help from the CVO folks for this.
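The mismatch can be confirmed on an affected cluster with something like the following (a sketch using standard `oc` queries against the resources already named in this bug):

```
oc get clusteroperator machine-api
oc get serviceaccount machine-api-termination-handler -n openshift-machine-api \
  -o jsonpath='{.automountServiceAccountToken}{"\n"}'
oc get daemonset machine-api-termination-handler -n openshift-machine-api \
  -o jsonpath='{.spec.template.spec.serviceAccountName}{"\n"}'
```

An empty result from the second query indicates the `automountServiceAccountToken` field is missing from the service account.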
Yeah, we used spot instances too, but after the workaround we hit this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1939054, so we decided to deactivate spot instances for now.
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Who is impacted?
- Any customer upgrading from any 4.6.z to any 4.7.z (this should be patched in 4.8), if and only if they are using spot/preemptible instances on AWS, GCP or Azure

What is the impact?
- Upgrade stops at Machine API as the MAO goes into a degraded state
- Spot termination handlers are not running, so spot instances may be removed without warning/graceful termination

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
- Remediation requires a patch to the machine-api-termination-handler service account, command below:
- `oc patch --type merge -n openshift-machine-api serviceaccount machine-api-termination-handler -p '{"automountServiceAccountToken":false}'`

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
- No. We changed the way the termination handlers work in 4.7, but it is all permission changes, so no change in functionality.
Potential verification process:

1. Install a 4.8 nightly with the fix.
2. Poke automountServiceAccountToken in some CVO-managed ServiceAccount, e.g. machine-api-termination-handler in the openshift-machine-api namespace [1].
3. Wait a few minutes.
4. Confirm that the CVO has stomped your change, and the property is back to its original value.

[1]: https://github.com/openshift/machine-api-operator/commit/deaa09f1dcfaa8cdbc84a0e760edc03f1255d903#diff-9cd166d71ea385fc76930a2e6b3df411a0c7418edcdee6e5039218dce403c175R19-R26
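Steps 2-4 could be run with something like the following (a sketch that reuses the patch command from the remediation above, just flipping the value the other way):

```
oc patch --type merge -n openshift-machine-api serviceaccount machine-api-termination-handler \
  -p '{"automountServiceAccountToken":true}'
# wait a few minutes for the CVO to reconcile the payload manifests, then:
oc get serviceaccount machine-api-termination-handler -n openshift-machine-api \
  -o jsonpath='{.automountServiceAccountToken}{"\n"}'
# expect the original value (false) to be restored
```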
I'm clearing UpgradeBlocker based on the 'oc patch ...' command from comment 14 being a sufficiently straightforward workaround for anyone who gets bit by this before we get a fix out.
Verified:

1. Set up a 4.7.4 cluster.
2. Create a spot instance.
3. Upgrade to 4.8.0-0.nightly-2021-03-30-181828 successfully.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-30-181828   True        False         44m     Cluster version is 4.8.0-0.nightly-2021-03-30-181828

$ oc get node
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-134-198.us-east-2.compute.internal   Ready    master   3h29m   v1.20.0+29a606d
ip-10-0-152-81.us-east-2.compute.internal    Ready    worker   3h21m   v1.20.0+29a606d
ip-10-0-168-137.us-east-2.compute.internal   Ready    worker   3h21m   v1.20.0+29a606d
ip-10-0-184-32.us-east-2.compute.internal    Ready    master   3h29m   v1.20.0+29a606d
ip-10-0-203-35.us-east-2.compute.internal    Ready    master   3h29m   v1.20.0+29a606d
ip-10-0-211-71.us-east-2.compute.internal    Ready    worker   169m    v1.20.0+29a606d

$ oc get po
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-6c5c5b564c-fb996   2/2     Running   0          52m
cluster-baremetal-operator-664cb9c5d9-vjmsm    2/2     Running   0          52m
machine-api-controllers-78fbffc475-26p97       7/7     Running   0          52m
machine-api-operator-5675cb644f-mcmz2          2/2     Running   0          52m
machine-api-termination-handler-wh6g5          1/1     Running   0          135m
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438