Bug 1821905 - Cannot upgrade from 4.3.8 -> 4.3.9 due to "DefaultSecurityContextConstraints_Mutated"
Summary: Cannot upgrade from 4.3.8 -> 4.3.9 due to "DefaultSecurityContextConstraints_Mutated"
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Stefan Schimanski
QA Contact: scheng
URL:
Whiteboard:
Duplicates: 1823609
Depends On:
Blocks:
 
Reported: 2020-04-07 19:37 UTC by Nick Curry
Modified: 2021-12-17 08:03 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-10 01:41:46 UTC
Target Upstream Version:
Embargoed:


Attachments
Default SCC Objects (7.17 KB, text/plain)
2020-04-17 19:39 UTC, Abu Kashem


Links
Red Hat Knowledge Base (Solution) 4972291, last updated 2020-04-14 11:55:52 UTC

Description Nick Curry 2020-04-07 19:37:28 UTC
Description of problem:

Version-Release number of the following components:
OpenShift 4.3.8

How reproducible:

Steps to Reproduce:
1. Deploy a 4.3.8 cluster
2. Try to upgrade to 4.3.9

Actual results:
'Unable to apply 4.3.9: it may not be safe to apply this update'


'Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated":
        Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable:
        Default SecurityContextConstraints object(s) have mutated [privileged]'


Expected results:

Cluster successfully upgrades to 4.3.9

Additional info:

Comment 1 Nick Curry 2020-04-07 19:44:55 UTC
Related to 
https://bugzilla.redhat.com/show_bug.cgi?id=1818893

We installed NetApp Trident in this cluster, and it adds itself to the privileged SCC:

[ncc@t490s ~]$ oc get scc privileged -o yaml
allowHostDirVolumePlugin: true
allowHostIPC: true
allowHostNetwork: true
allowHostPID: true
allowHostPorts: true
allowPrivilegeEscalation: true
allowPrivilegedContainer: true
allowedCapabilities:
- '*'
allowedUnsafeSysctls:
- '*'
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups:
- system:cluster-admins
- system:nodes
- system:masters
kind: SecurityContextConstraints
metadata:
  annotations:
    kubernetes.io/description: 'privileged allows access to all privileged and host
      features and the ability to run as any user, any group, any fsGroup, and with
      any SELinux context.  WARNING: this is the most relaxed SCC and should be used
      only for cluster administration. Grant with caution.'
  creationTimestamp: "2020-04-06T21:56:25Z"
  generation: 2
  name: privileged
  resourceVersion: "301708"
  selfLink: /apis/security.openshift.io/v1/securitycontextconstraints/privileged
  uid: daebc90c-f795-42ab-a830-dc3c1e8ad962
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities: null
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
seccompProfiles:
- '*'
supplementalGroups:
  type: RunAsAny
users:
- system:admin
- system:serviceaccount:openshift-infra:build-controller
- system:serviceaccount:trident:trident-csi
volumes:

Comment 2 Scott Dodson 2020-04-08 13:23:09 UTC
This is expected.

Comment 3 Nick Curry 2020-04-08 13:36:49 UTC
Customers adding additional users to the default SCCs is a routine operation. Many of our third-party integrations, such as the NetApp Trident driver mentioned here, add themselves to the default SCCs.


What is the expected upgrade path from clusters such as this one?


It seems as though we should ignore the scc.users array in this check?

Comment 4 Stefan Schimanski 2020-04-08 14:00:08 UTC
Objects created by the system are owned by the system. We cannot reconcile arbitrary changes to them and guarantee working upgrades.

The workaround: either ship another SCC as part of the third-party component, or add the user to one of the groups that is allowed to access the SCC. Being able to run with the privileged SCC is equivalent to running as cluster-admin.
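
A minimal sketch of the first option, using the Trident service account from comment 1 as the example subject (the SCC name and file name are illustrative, not something Trident actually ships):

  # Copy the privileged SCC under a new name; edit the copy to rename
  # .metadata.name, drop server-set fields (uid, resourceVersion,
  # creationTimestamp, selfLink), and clear the inherited users/groups.
  $ oc get scc privileged -o yaml > trident-privileged.yaml
  $ oc create -f trident-privileged.yaml
  # Grant access to the copy instead of the system-managed object:
  $ oc adm policy add-scc-to-user trident-privileged \
      system:serviceaccount:trident:trident-csi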

Comment 5 W. Trevor King 2020-04-08 14:08:01 UTC

*** This bug has been marked as a duplicate of bug 1820231 ***

Comment 6 Nick Curry 2020-04-08 14:14:01 UTC
Created trident issue here:
https://github.com/NetApp/trident/issues/374

Comment 7 Scott Dodson 2020-04-09 15:08:50 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minutes of disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it's always been like this; we just never noticed
  Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

Even if we decide not to do anything to revert the newly set Upgradeable=False condition, this is the triggering bug, so we can close it WONTFIX.

Comment 9 Lalatendu Mohanty 2020-04-09 15:44:38 UTC
For the workaround, do we have the commands a customer can run to fix the SCC so that it does not show as mutated?

Or we can suggest a force upgrade.

Comment 10 Lalatendu Mohanty 2020-04-09 15:46:49 UTC
Adding UpgradeBlocker, as we do not want customers to go to versions which would mark their SCCs as mutated.

Comment 11 Lalatendu Mohanty 2020-04-09 15:48:41 UTC
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1820231

Comment 16 Lalatendu Mohanty 2020-04-09 17:32:49 UTC
How do updates in the 4.2.z stream get impacted if a cluster has mutated SCCs? Also, does it impact 4.2.x to 4.3.x upgrades?

Comment 19 W. Trevor King 2020-04-09 19:05:42 UTC
> How do updates in the 4.2.z stream get impacted if a cluster has mutated SCCs? Also, does it impact 4.2.x to 4.3.x upgrades?

The bug series that introduced the mutated check is bug 1794309.  That was backported to 4.3.z with bug 1808602, but not cloned or backported to 4.2.z (yet).  So there should be no impact on 4.1, 4.2, or 4.2 -> 4.3 updates.

Comment 20 W. Trevor King 2020-04-09 20:02:46 UTC
Also, for recovering in the meantime (before bug 1820231 lands in 4.3.z, and for folks who attempt 4.3 -> 4.4 updates before un-modifying their default SCCs), the procedure from bug 1822752 is:

1. Figure out the release image pullspec for your current release (e.g. 4.3.8, if you're stuck on 4.3.8 -> 4.3.9) by looking in your ClusterVersion .status.history.
2. oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --to-image $CURRENT_RELEASE_PULLSPEC_BY_DIGEST

That's safe (as long as you don't fumble the pullspec), because when you are stuck on the precondition, the cluster-version operator has not actually begun applying any of the next release's manifests.  And going forward, bug 1822752 will hopefully give us a more convenient way to cancel precondition-blocked updates.
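
Putting steps 1 and 2 together, a hedged sketch (untested here; the jsonpath targets the ClusterVersion API, and .status.history is ordered most-recent-first, so the first Completed entry is the release you are currently running):

  # Pull the by-digest pullspec of the current release from history:
  $ CURRENT="$(oc get clusterversion version \
      -o jsonpath='{.status.history[?(@.state=="Completed")].image}' \
      | cut -d' ' -f1)"
  # Re-target the update at the release you are already running:
  $ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings \
      --to-image "${CURRENT}"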

Canceling a stuck update (e.g. 4.3.8 -> 4.3.9) will still leave your cluster on a release that has an overly sensitive CVO precondition trigger.  Ideally, un-modify your default SCCs.  But if you want to update in 4.3.z without doing that yet, you may be able to blow through that with:

  $ oc adm upgrade --allow-upgrade-with-warnings --to ...

although I haven't actually tested that yet.  And I dunno what the web console has in this space, since things like --allow-upgrade-with-warnings are currently client-side oc checks.

Comment 21 Abu Kashem 2020-04-10 00:33:32 UTC
Created attachment 1677680 [details]
Default SCC objects for 4.3

Default SCC object(s) for 4.3

Comment 22 Abu Kashem 2020-04-10 01:03:08 UTC
Background:
OpenShift ships with a set of default SecurityContextConstraints object(s): [anyuid, hostaccess, hostmount-anyuid, hostnetwork, nonroot, privileged, restricted].

So far, cluster admins have been granting user(s) access to the default SCCs by directly adding the user to them, using the `oc adm policy add-scc-to-user` command. The official documentation covers this topic - https://docs.openshift.com/container-platform/3.11/admin_guide/manage_scc.html#grant-access-to-the-privileged-scc
On the other hand, we also treat the default SCCs as unmodifiable; the beginning of that doc warns: "Do not modify the default SCCs. Customizing the default SCCs can lead to issues when upgrading. Instead, create new SCCs."
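
For illustration only, this is the pattern that mutates a default SCC (the service account mirrors comment 1); it appends the subject to the SCC's users list, which is exactly what trips the mutation check on 4.3.8+:

  $ oc adm policy add-scc-to-user privileged \
      system:serviceaccount:trident:trident-csi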

The default SecurityContextConstraints object(s) are explicitly reserved for the system. In OpenShift 4.4 we brought the default SCCs under active management.
Any changes made to the default SecurityContextConstraints object(s) are automatically stomped. This means customer workloads that were relying on this explicitly forbidden behavior will experience outages.

The following sections highlight the changes we have made, the customer impact, and the resolution.

Changes:
- 4.4: Starting with 4.4 and onward, the default SecurityContextConstraints object(s) are managed by the Cluster Version Operator (CVO), and hence any changes to these objects will be automatically stomped by the CVO.

- 4.3: In 4.3 we mark the cluster as `Upgradeable=False` if any of the default SecurityContextConstraints has been mutated. We have a controller that watches the SecurityContextConstraints object(s); if it detects any change, `Upgradeable` is set to `False`, as shown below.

    - lastTransitionTime: "2020-03-11T06:05:31Z"
      type: Upgradeable
      status: "False"
      message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable:
        Default SecurityContextConstraints object(s) have mutated [anyuid hostmount-anyuid privileged]'
      reason: DefaultSecurityContextConstraints_Mutated

This feature is available from 4.3.8.
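
A quick way to check whether a cluster has tripped this condition (a sketch; the exact message will vary per cluster):

  $ oc get clusteroperator kube-apiserver \
      -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].message}'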


Who is impacted?
Customers who modified system-managed SCCs on clusters running 4.3.z (z >= 8).

What is the impact?
- clusteroperator/kube-apiserver Upgradeable=false.
- 4.3 -> 4.4 upgrade is disallowed by CVO.
- z-level upgrade (4.3.z to 4.3.z+) is disallowed by CVO, but would still likely succeed if forced with --allow-upgrade-with-warnings as long as DefaultSecurityContextConstraints_Mutated was the only Upgradeable=False condition being overridden.


Resolution:
- Restore the default SCCs, have the customer workload use a new SCC, or use role-based access to the default SCC.
- Force upgrade.

How do I grant user(s) access to the default SCC without changing it?
- A cluster admin can always create new SCC object(s).
- You can use RBAC to grant user(s) access to the default SCC and thus avoid making changes to it; a sketch follows below. For more information see - https://docs.openshift.com/container-platform/4.3/authentication/managing-security-context-constraints.html#role-based-access-to-ssc_configuring-internal-oauth
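
A minimal sketch of that RBAC approach, again using the Trident service account from comment 1 as the example subject (the role and binding names are illustrative); the key is the "use" verb on the named SCC, so the default object itself is never edited:

  $ oc create clusterrole use-privileged-scc --verb=use \
      --resource=securitycontextconstraints.security.openshift.io \
      --resource-name=privileged
  $ oc create rolebinding trident-use-privileged \
      --clusterrole=use-privileged-scc \
      --serviceaccount=trident:trident-csi -n trident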

How do I revert the changes made to the default SCC?
- Download the attachment "Default SCC objects for 4.3" from this BZ. It has a YAML file (4.3-default-scc-list.yaml) that contains the set of default SCC objects that ships with the cluster.
- Apply the YAML file on the cluster - 'oc apply -f 4.3-default-scc-list.yaml'. It should revert the changes made to any default SCC.


Recovery:
In the meantime, to recover (before bug https://bugzilla.redhat.com/show_bug.cgi?id=1820231 lands in 4.3.z, and for folks who attempt 4.3 -> 4.4 updates before un-modifying their default SCCs), the procedure from bug https://bugzilla.redhat.com/show_bug.cgi?id=1822752 is:

1. Figure out the release image pullspec for your current release (e.g. 4.3.8, if you're stuck on 4.3.8 -> 4.3.9) by looking in your ClusterVersion .status.history.
2. oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --to-image $CURRENT_RELEASE_PULLSPEC_BY_DIGEST

That's safe (as long as you don't fumble the pullspec), because when you are stuck on the precondition, the cluster-version operator has not actually begun applying any of the next release's manifests.  And going forward, bug 1822752 will hopefully give us a more convenient way to cancel precondition-blocked updates.

Canceling a stuck update (e.g. 4.3.8 -> 4.3.9) will still leave your cluster on a release that has an overly sensitive CVO precondition trigger.  Ideally, un-modify your default SCCs.  But if you want to update in 4.3.z without doing that yet, you may be able to blow through that with:

  $ oc adm upgrade --allow-upgrade-with-warnings --to ...

Comment 23 W. Trevor King 2020-04-10 01:28:30 UTC
Abu's nice write-up has everything Scott was asking for; clearing NEEDSINFO.

Comment 24 W. Trevor King 2020-04-10 01:41:46 UTC
> Even if we decide not to do anything to revert the newly set Upgradeable=False condition, this is the triggering bug, so we can close it WONTFIX.

This is my impression.  Bug 1820231 is ON_QA and will go out with the next 4.3.z release image.  There's nothing we can do to the existing 4.3.z releases.  So not much else we can do short of pulling edges from the Cincinnati graph, and we can do that or not regardless of the bug state.  Closing as CANTFIX, because the existing releases are immutable and folks don't want to close this experience as a dup of bug 1820231 ;).

Comment 27 Stephen Cuppett 2020-04-14 11:55:52 UTC
*** Bug 1823609 has been marked as a duplicate of this bug. ***

Comment 28 Abu Kashem 2020-04-15 19:42:35 UTC
As an FYI, we are going to make some changes:
- OpenShift 4.3: Revert DefaultSecurityContextConstraints_Mutated in 4.3. We have a PR open for this - https://github.com/openshift/cluster-kube-apiserver-operator/pull/830. It will go into 4.3.z.

- OpenShift 4.4: Mark the CVO manifests for the default SCCs as `create-only`. The CVO will create/recreate any default SCCs that are deleted but will tolerate changes made to any default SCC.
https://github.com/openshift/cluster-kube-apiserver-operator/pull/831 (will be back ported to 4.4)

Comment 29 Abu Kashem 2020-04-17 19:39:38 UTC
Created attachment 1679747 [details]
Default SCC Objects

Default SCC Objects that ship with OpenShift

Comment 30 W. Trevor King 2020-04-17 22:21:21 UTC
Best way to unstick yourself is still (In reply to Abu Kashem from comment #22)
> Resolution:
> - Restore the default SCC, have the customer workload use new SCC or use
> role based access to the default SCC.
> - Force upgrade.
> ...
> Ideally, un-modify your default SCCs.  But if you want to update in 4.3.z
> without doing that yet, you may be able to blow through that with:
> 
>   $ oc adm upgrade --allow-upgrade-with-warnings --to ...

Restoring the default SCCs is still the best way to unstick yourself.  But if you have to update before you can do that (for some reason), the safest way is probably:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
2020-04-17T20:54:15Z RetrievedUpdates True
2020-04-17T21:11:52Z Available True Done applying 4.3.10
2020-04-17T21:17:47Z Progressing True Unable to apply 4.3.13: it may not be safe to apply this update
2020-04-17T21:17:47Z Upgradeable False Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
2020-04-17T21:18:07Z Failing True Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]

Then check the failed preconditions to ensure they are acceptable (e.g. DefaultSecurityContextConstraints_Mutated is acceptable, but ClusterVersionOverridesSet is probably not).  If you are comfortable waiving them, run:

$ oc patch clusterversion version --type json -p '[{"op": "add", "path": "/spec/desiredUpdate/force", "value": true}]'

That will tell the cluster-version operator (CVO) to waive the preconditions, but since the CVO had already made it through signature validation, etc., you know you aren't risking installing a potentially malicious release image.  Afterwards, the update will proceed as usual, and you'll be back to your usual flows.  Bug 1825396 means the history data will be a bit garbled, but a buggy verified=true in the history won't impact cluster functionality.

[1]: https://github.com/openshift/cluster-version-operator/blob/b8af7e484941f2c57dab55f36216f4a0bcf4d11a/pkg/payload/precondition/precondition.go#L60

Comment 31 W. Trevor King 2020-04-17 22:31:51 UTC
> [1]: https://github.com/openshift/cluster-version-operator/blob/b8af7e484941f2c57dab55f36216f4a0bcf4d11a/pkg/payload/precondition/precondition.go#L60

Oops, didn't actually explain this reference.  It is just showing that after signature verification, the CVO is running all preconditions (even if an earlier precondition fails), so the failed-precondition report will include all failing preconditions.  So you won't have, for example, a DefaultSecurityContextConstraints_Mutated in the Failing=True message that is masking an unreported ClusterVersionOverridesSet condition.  If both preconditions are failing, the Failing=True message will include both.

