Description of problem:

Version-Release number of the following components:
OpenShift 4.3.8

How reproducible:

Steps to Reproduce:
1. Deploy 4.3.8 cluster
2. Try to upgrade to 4.3.9

Actual results:
'Unable to apply 4.3.9: it may not be safe to apply this update'
'Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]'

Expected results:
4.3.9 cluster

Additional info:
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1818893

We installed NetApp Trident in this cluster and it adds itself to the privileged SCC:

[ncc@t490s ~]$ oc get scc privileged -o yaml
allowHostDirVolumePlugin: true
allowHostIPC: true
allowHostNetwork: true
allowHostPID: true
allowHostPorts: true
allowPrivilegeEscalation: true
allowPrivilegedContainer: true
allowedCapabilities:
- '*'
allowedUnsafeSysctls:
- '*'
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups:
- system:cluster-admins
- system:nodes
- system:masters
kind: SecurityContextConstraints
metadata:
  annotations:
    kubernetes.io/description: 'privileged allows access to all privileged and host features and the ability to run as any user, any group, any fsGroup, and with any SELinux context. WARNING: this is the most relaxed SCC and should be used only for cluster administration. Grant with caution.'
  creationTimestamp: "2020-04-06T21:56:25Z"
  generation: 2
  name: privileged
  resourceVersion: "301708"
  selfLink: /apis/security.openshift.io/v1/securitycontextconstraints/privileged
  uid: daebc90c-f795-42ab-a830-dc3c1e8ad962
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities: null
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
seccompProfiles:
- '*'
supplementalGroups:
  type: RunAsAny
users:
- system:admin
- system:serviceaccount:openshift-infra:build-controller
- system:serviceaccount:trident:trident-csi
volumes:
This is expected.
Customers adding additional users to the default SCCs is a routine operation. Many of our third-party integrations, such as the NetApp Trident integration mentioned here, add themselves to the default SCCs. What is the expected upgrade path for clusters such as this one? It seems as though we should ignore the scc.users array in this check?
Objects created by the system are owned by the system. We cannot reconcile arbitrary changes to them and still guarantee working upgrades. The workaround: either ship another SCC as part of the third-party component, or add the user to one of the groups that are allowed to access the SCC. Being able to run with the privileged SCC is equivalent to running as cluster-admin.
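A minimal sketch of the first workaround (shipping a separate SCC instead of editing the default one), assuming the trident-csi service account in the trident namespace from this bug and a hypothetical SCC name trident-privileged; this is only an illustration, not the official Trident fix:

$ oc get scc privileged -o yaml > trident-privileged-scc.yaml
  # edit the file: set .metadata.name to trident-privileged, delete the
  # creationTimestamp/generation/resourceVersion/selfLink/uid fields, and
  # empty the users/groups lists copied from the default SCC
$ oc apply -f trident-privileged-scc.yaml
$ oc adm policy add-scc-to-user trident-privileged -z trident-csi -n trident

This leaves the default privileged SCC untouched, so the mutation check has nothing to complain about.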
*** This bug has been marked as a duplicate of bug 1820231 ***
Created trident issue here: https://github.com/NetApp/trident/issues/374
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
  No, it's always been like this, we just never noticed
  Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

Even if we decide not to do anything to revert the newly set Upgradeable=False condition, since this is the triggering bug we can close it WONTFIX.
For the workaround, do we have the commands the customer can run to fix the SCC so that it no longer shows as mutated? Or we can suggest a force upgrade.
Adding UpgradeBlocker as we do not want customers to go to versions that would mark their SCCs as mutated.
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1820231
How do 4.2 updates in the 4.2.z stream get impacted if a cluster has mutated SCCs? Also, does it impact 4.2.x to 4.3.x upgrades?
> How do 4.2 updates in the 4.2.z stream get impacted if a cluster has mutated SCCs? Also, does it impact 4.2.x to 4.3.x upgrades?

The bug series that introduced the mutated check is bug 1794309. That was backported to 4.3.z with bug 1808602, but not cloned or backported to 4.2.z (yet). So there should be no impact on 4.1, 4.2, or 4.2 -> 4.3 updates.
Also, for recovering in the meantime (before bug 1820231 lands in 4.3.z, and for folks who attempt 4.3 -> 4.4 updates before un-modifying their default SCCs), the procedure from bug 1822752 is:

1. Figure out the release image pullspec for your current release (e.g. 4.3.8, if you're stuck on 4.3.8 -> 4.3.9) by looking in your ClusterVersion .status.history.
2. oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --to-image $CURRENT_RELEASE_PULLSPEC_BY_DIGEST

That's safe (as long as you don't fumble the pullspec), because when you are stuck on the precondition, the cluster-version operator has not actually begun applying any of the next release's manifests. And going forward, bug 1822752 will hopefully give us a more convenient way to cancel precondition-blocked updates.

Canceling a stuck update (e.g. 4.3.8 -> 4.3.9) will still leave your cluster on a release that has an overly sensitive CVO precondition trigger. Ideally, un-modify your default SCCs. But if you want to update within 4.3.z without doing that yet, you may be able to blow through that with:

$ oc adm upgrade --allow-upgrade-with-warnings --to ...

although I haven't actually tested that yet. And I don't know what the web console has in this space, since things like --allow-upgrade-with-warnings are currently client-side oc checks.
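For step 1, a sketch of reading the by-digest pullspec out of ClusterVersion .status.history (assumes jq is available locally; the variable name just matches the placeholder in step 2):

$ CURRENT_RELEASE_PULLSPEC_BY_DIGEST=$(oc get clusterversion version -o json | jq -r '[.status.history[] | select(.state == "Completed")][0].image')
$ echo "$CURRENT_RELEASE_PULLSPEC_BY_DIGEST"
$ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --to-image "$CURRENT_RELEASE_PULLSPEC_BY_DIGEST"

The select(.state == "Completed") filter makes sure you pick the most recently completed release, even if the history also contains a Partial entry for the stuck target.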
Created attachment 1677680 [details]
Default SCC objects for 4.3

Default SCC object(s) for 4.3
Background:
OpenShift ships with a set of default SecurityContextConstraints object(s): [anyuid, hostaccess, hostmount-anyuid, hostnetwork, nonroot, privileged, restricted]. So far, cluster admins have been granting user(s) access to the default SCCs by directly adding the user to them, using the `oc adm policy add-scc-to-user` command. The official documentation covers this topic - https://docs.openshift.com/container-platform/3.11/admin_guide/manage_scc.html#grant-access-to-the-privileged-scc

On the other hand, we also treat the default SCCs as unmodifiable; the beginning of that doc says: "Do not modify the default SCCs. Customizing the default SCCs can lead to issues when upgrading. Instead, create new SCCs." The default SecurityContextConstraints object(s) are explicitly reserved for the system.

In OpenShift 4.4 we brought the default SCCs under active management. Any changes made to the default SecurityContextConstraints object(s) are automatically stomped. This means customer workloads that were relying on this explicitly forbidden behavior will experience outages. The following sections highlight the changes we have made, the customer impact, and the resolution.

Changes:
- 4.4: Starting with 4.4 and onward, the default SecurityContextConstraints object(s) are managed by the Cluster Version Operator (CVO), and hence any changes to these SecurityContextConstraints object(s) will be automatically stomped by the CVO.
- 4.3: In 4.3 we mark the cluster as `Upgradeable=False` if any of the default SecurityContextConstraints has been mutated. We have a controller that watches the SecurityContextConstraints object(s), and if it detects any change, `Upgradeable` is set to `False`, as shown below:

  - lastTransitionTime: "2020-03-11T06:05:31Z"
    type: Upgradeable
    status: "False"
    message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [anyuid hostmount-anyuid privileged]'
    reason: DefaultSecurityContextConstraints_Mutated

  This feature is available from 4.3.8.

Who is impacted?
Customers who modified system-managed SCCs on clusters running 4.3.z (z >= 8).

What is the impact?
- clusteroperator/kube-apiserver Upgradeable=false.
- 4.3 -> 4.4 upgrade is disallowed by CVO.
- z-level upgrade (4.3.z to 4.3.z+1) is disallowed by CVO, but would still likely succeed if forced with --allow-upgrade-with-warnings, as long as DefaultSecurityContextConstraints_Mutated was the only Upgradeable=False condition being overridden.

Resolution:
- Restore the default SCC, and have the customer workload use a new SCC or role-based access to the default SCC.
- Force upgrade.

How do I grant user(s) access to the default SCC without changing it?
- A cluster admin can always create new SCC object(s).
- You can use RBAC to grant user(s) access to the default SCC and thus avoid making changes to it (see the sketch after this comment). For more information see - https://docs.openshift.com/container-platform/4.3/authentication/managing-security-context-constraints.html#role-based-access-to-ssc_configuring-internal-oauth

How do I revert the changes made to the default SCC?
- Download the attachment "Default SCC objects for 4.3" from the BZ. It has a YAML file (4.3-default-scc-list.yaml) that contains the set of default SCC objects that ships with the cluster.
- Apply the YAML file on the cluster - 'oc apply -f 4.3-default-scc-list.yaml'. It should revert the changes made to any default SCC.
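A minimal sketch of the RBAC approach mentioned above, assuming the trident-csi service account in the trident namespace from this bug and hypothetical role/binding names. The ClusterRole grants the "use" verb on the default privileged SCC instead of editing the SCC itself:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: use-privileged-scc          # hypothetical name
rules:
- apiGroups:
  - security.openshift.io
  resources:
  - securitycontextconstraints
  resourceNames:
  - privileged
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: trident-csi-use-privileged  # hypothetical name
  namespace: trident
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: use-privileged-scc
subjects:
- kind: ServiceAccount
  name: trident-csi
  namespace: trident

Apply with 'oc apply -f <file>.yaml'; pods running as that service account should then be admitted under the privileged SCC without the default SCC's users list ever being touched.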
Recovery:
In the meantime, to recover (before bug https://bugzilla.redhat.com/show_bug.cgi?id=1820231 lands in 4.3.z, and for folks who attempt 4.3 -> 4.4 updates before un-modifying their default SCCs), the procedure from bug https://bugzilla.redhat.com/show_bug.cgi?id=1822752 is:

1. Figure out the release image pullspec for your current release (e.g. 4.3.8, if you're stuck on 4.3.8 -> 4.3.9) by looking in your ClusterVersion .status.history.
2. oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --to-image $CURRENT_RELEASE_PULLSPEC_BY_DIGEST

That's safe (as long as you don't fumble the pullspec), because when you are stuck on the precondition, the cluster-version operator has not actually begun applying any of the next release's manifests. And going forward, bug 1822752 will hopefully give us a more convenient way to cancel precondition-blocked updates.

Canceling a stuck update (e.g. 4.3.8 -> 4.3.9) will still leave your cluster on a release that has an overly sensitive CVO precondition trigger. Ideally, un-modify your default SCCs. But if you want to update within 4.3.z without doing that yet, you may be able to blow through that with:

$ oc adm upgrade --allow-upgrade-with-warnings --to ...
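For the "un-modify your default SCCs" part, a sketch for this bug's specific case, assuming the only modification was the Trident service account added directly to the privileged SCC's users list:

$ oc adm policy remove-scc-from-user privileged -z trident-csi -n trident

Alternatively, apply the "Default SCC objects for 4.3" attachment from this bug to restore all of the default SCCs at once. Either way, the DefaultSecurityContextConstraintsUpgradeable condition on the kube-apiserver cluster operator should then clear, which you can check with:

$ oc get clusteroperator kube-apiserver -o yaml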
Abu's nice write-up has everything Scott was asking for; clearing NEEDSINFO.
> Even if we decide not to do anything to revert the newly set Upgradeable=False condition, since this is the triggering bug we can close it WONTFIX.

This is my impression. Bug 1820231 is ON_QA and will go out with the next 4.3.z release image. There's nothing we can do to the existing 4.3.z releases. So there is not much else we can do short of pulling edges from the Cincinnati graph, and we can do that or not regardless of the bug state. Closing as CANTFIX, because the existing releases are immutable and folks don't want to close this experience as a dup of bug 1820231 ;).
*** Bug 1823609 has been marked as a duplicate of this bug. ***
As an FYI, we are going to make some changes:

- OpenShift 4.3: Revert DefaultSecurityContextConstraints_Mutated in 4.3. We have a PR open for this - https://github.com/openshift/cluster-kube-apiserver-operator/pull/830. It will go into 4.3.z.
- OpenShift 4.4: Mark the CVO manifests for the default SCCs as `create-only`. The CVO will create/recreate any default SCCs that are deleted, but will tolerate changes made to any default SCC. https://github.com/openshift/cluster-kube-apiserver-operator/pull/831 (will be backported to 4.4)
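For reference, `create-only` here refers to the manifest annotation the cluster-version operator honors: a manifest carrying release.openshift.io/create-only: "true" is created if it is missing, but otherwise left untouched. A sketch of what that looks like on an SCC manifest (abbreviated, not the literal manifest from the PR):

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: privileged
  annotations:
    release.openshift.io/create-only: "true"
# ... rest of the default privileged SCC definition ...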
Created attachment 1679747 [details]
Default SCC Objects

Default SCC Objects that ship with OpenShift
Best way to unstick yourself is still:

(In reply to Abu Kashem from comment #22)
> Resolution:
> - Restore the default SCC, and have the customer workload use a new SCC or
>   role-based access to the default SCC.
> - Force upgrade.
> ...
> Ideally, un-modify your default SCCs. But if you want to update in 4.3.z
> without doing that yet, you may be able to blow through that with:
>
> $ oc adm upgrade --allow-upgrade-with-warnings --to ...

Restoring the default SCCs is still the best way to unstick yourself. But if you have to update before you can do that (for some reason), the safest way is probably:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
2020-04-17T20:54:15Z RetrievedUpdates True
2020-04-17T21:11:52Z Available True Done applying 4.3.10
2020-04-17T21:17:47Z Progressing True Unable to apply 4.3.13: it may not be safe to apply this update
2020-04-17T21:17:47Z Upgradeable False Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
2020-04-17T21:18:07Z Failing True Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]

Then check the failed preconditions [1] to ensure they are acceptable (e.g. DefaultSecurityContextConstraints_Mutated is acceptable, but ClusterVersionOverridesSet is probably not). If you are comfortable waiving them, run:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/desiredUpdate/force", "value": true}]'

That will tell the cluster-version operator (CVO) to waive the preconditions, but since the CVO had already made it through signature validation, etc., you know you aren't risking installing a potentially malicious release image. Afterwards, the update will proceed as usual, and you'll be back to your usual flows. Bug 1825396 means the history data will be a bit garbled, but a buggy verified=true in the history won't impact cluster functionality.

[1]: https://github.com/openshift/cluster-version-operator/blob/b8af7e484941f2c57dab55f36216f4a0bcf4d11a/pkg/payload/precondition/precondition.go#L60
> [1]: https://github.com/openshift/cluster-version-operator/blob/b8af7e484941f2c57dab55f36216f4a0bcf4d11a/pkg/payload/precondition/precondition.go#L60

Oops, didn't actually explain this reference. It is just showing that after signature verification, the CVO is running all preconditions (even if an earlier precondition fails), so the failed-precondition report will include all failing preconditions. So you won't have, for example, a DefaultSecurityContextConstraints_Mutated in the Failing=True message that is masking an unreported ClusterVersionOverridesSet condition. If both preconditions are failing, the Failing=True message will include both.