Bug 1818893

Summary: If Upgradeable is False due to default SCC mutation, we should provide better messaging to resolve the issue
Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.3.0
Reporter: Abu Kashem <akashem>
Assignee: Abu Kashem <akashem>
QA Contact: Ke Wang <kewang>
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-07-13 17:24:11 UTC
Clones: 1821447 (view as bug list)
Bug Blocks: 1821447
CC: aos-bugs, armin.kunaschik, bsawyers, carlo.reggiani, cruhm, daniel.hagen, denis, dmoessne, gparente, jkaur, john.johansson, kechung, luferrar, mark.jackson2, mfojtik, mharri, mzali, nagrawal, ncurry, nnosenzo, rabdulra, rbohne, rdomnu, rsandu, sople, sttts, trees, vjaypurk, vpagar, xxia

Description Abu Kashem 2020-03-30 16:04:28 UTC
Description of problem:
If a cluster admin changes any default SCC, cluster upgrades are prevented and the following message appears on the ClusterVersion (`version`) object:

    - lastTransitionTime: "2020-03-11T06:05:31Z"
      message: 'Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable:
        Default SecurityContextConstraints object(s) have mutated [anyuid hostmount-anyuid
        privileged]'
      reason: DefaultSecurityContextConstraints_Mutated
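
For reference, the condition can be read directly from the ClusterVersion object; a minimal sketch (the jsonpath filter is just one way to surface it):

    oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].message}{"\n"}'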

This message is not helpful; it does not instruct the admin on how to resolve the issue.


How reproducible:
Always

Steps to Reproduce:
1. Take a 4.3 nightly cluster
2. Change any default SCC that ships with OpenShift (for example, as sketched below)
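
A minimal sketch of step 2 (the namespace and service account below are only illustrative); either form counts as a mutation of a default SCC:

    # grant an extra service account access to a default SCC ...
    oc adm policy add-scc-to-user anyuid -z default -n test
    # ... or edit one of the shipped SCC objects directly
    oc edit scc anyuid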


Actual results:
Upgradeable is set to False with the above error message.

Expected results:
Upgradeable is set to False with a brief message that describes how a cluster-admin can clear the changes if desired, a command that shows a diff against the defaults, and a link explaining how to configure access without changing the default SCCs, so the problem can be avoided in the future.

Comment 1 Abu Kashem 2020-04-06 15:08:19 UTC
This feature is enabled in 4.3 only. From 4.4 onward, the CVO will stomp any changes made to the default SCCs.

Comment 3 Abu Kashem 2020-04-06 20:25:48 UTC
This is not an issue in 4.4, since the CVO manages the default SCCs. It's not reproducible in 4.4, but QE can mutate any default SCC and validate (as sketched below) that
- this bug is not present: no DefaultSecurityContextConstraints_Mutated appears in the `Upgradeable` condition, and
- the CVO stomps the changes made to the default SCC.
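
A rough verification sketch for 4.4 (the namespace and service account names are only examples):

    # mutate a default SCC
    oc adm policy add-scc-to-user anyuid -z default -n test
    # the kube-apiserver operator should not report DefaultSecurityContextConstraints_Mutated
    oc get clusteroperator kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}{"\n"}'
    # after a few minutes the CVO should have reverted the change
    oc get scc anyuid -o jsonpath='{.users}{"\n"}'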

Comment 6 Armin Kunaschik 2020-04-07 12:21:43 UTC
The update is blocked just because an admin has added accounts/service accounts to the mentioned SCC.
The version operator doesn't distinguish whether the actual SCC definition was modified or not. This is IMHO wrong.

Adding an account to any SCC should be no reason to block the update. Nor should memberships be removed automatically; this will break e.g. storage provisioners that rely on privileged operations.
To me this check needs to be relaxed to check only the relevant fields of the SCCs.

Comment 8 Abu Kashem 2020-04-07 14:12:37 UTC
Hi sople, armin.kunaschik,
In 4.4, any changes to a default SCC will be stomped by the CVO. So, if we allowed an upgrade (from 4.3) with mutated SCCs, the CVO would stomp those changes once the 4.4 upgrade is underway.
We are preventing the upgrade in 4.3 so that the admin has a chance to fix the issue. Otherwise customers would complain that the upgrade broke their applications (which relied on the mutated SCCs in 4.3).

A workaround to "adding an account to any SCC" is to use RBAC to give a user access to the default SCC. This way the customer can avoid changing the default SCC.
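
As a rough sketch of that RBAC approach (the role, namespace and service account names below are only examples), grant the `use` verb on the SCC instead of editing the SCC itself:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: use-privileged-scc            # illustrative name
    rules:
    - apiGroups: ["security.openshift.io"]
      resources: ["securitycontextconstraints"]
      resourceNames: ["privileged"]
      verbs: ["use"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: use-privileged-scc
      namespace: myproject                # namespace of the workload
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: use-privileged-scc
    subjects:
    - kind: ServiceAccount
      name: myserviceaccount              # service account that needs the SCC
      namespace: myproject

The default SCC object itself stays untouched, so the mutation check is not tripped.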

Hope this clarifies the situation, please let us know if you have any additional questions.
Thanks!

Comment 9 Armin Kunaschik 2020-04-07 14:56:34 UTC
First: This is an issue with 4.3 and therefore needs to be fixed in 4.3!

Second: You introduce big compatibility issues with such a change. It is nowhere documented that adding accounts to e.g. the privileged SCC is forbidden.
There is even plenty of documentation that advises using the default SCCs to achieve e.g. privileged containers!
E.g. https://docs.openshift.com/container-platform/3.11/admin_guide/manage_scc.html#grant-access-to-the-privileged-scc
or
https://docs.openshift.com/container-platform/4.1/cli_reference/administrator-cli-commands.html#policy

The command "oc adm policy add-scc-to-user privileged -z myserviceaccount", as described in the links above, adds(!) myserviceaccount to the privileged SCC. This has been used by OpenShift admins since the beginning!

It is ok to check every definition of an SCC, but NOT the members.

You cannot change this in the middle of a release without telling anybody about it!

Comment 10 Nick Curry 2020-04-07 19:47:07 UTC
Hitting this when trying to upgrade from 4.3.8 -> 4.3.9 after installing NetApp Trident dynamic storage provisioner.

It adds itself to the privileged scc.

Comment 11 Armin Kunaschik 2020-04-08 10:42:32 UTC
It's not just Trident. I also ran into this when I tried to upgrade a cluster with Trident installed.
There is plenty of (commercial) software from Red Hat partners that requires membership in the privileged SCC.
Monitoring applications like Dynatrace, log collection software like Splunk collectors, etc. all require that their service accounts be added to the privileged SCC.

Comment 12 Nick Curry 2020-04-08 14:15:31 UTC
Created trident issue here:
https://github.com/NetApp/trident/issues/374

Comment 13 Marcel Härri 2020-04-08 17:56:19 UTC
Also keep in mind that during upgrades the added software might be required to keep working, so removing the added service account from the SCC is not an option.

Imagine the example with Splunk, where disabling it would mean that during the upgrade phase no audit trails are collected. This would make upgrades impossible in various environments.

Adding a service account to an SCC is not mutating the SCC. What must be ensured is that a) the privileges are not changed and b) none of the built-in (service) accounts are removed from the SCC.

Comment 14 Kevin Chung 2020-04-08 19:27:34 UTC
I have a cluster that is already on 4.3.9 and am running into this error as well while trying to upgrade to 4.3.10. Thus, this error is present in 4.3.8+ and affects 4.3.9 as well. Here are the logs from my cluster-version-operator pod:

I0408 15:13:34.220636       1 sync_worker.go:471] Running sync 4.3.10 (force=false) on generation 72 in state Updating at attempt 32
I0408 15:13:34.220700       1 sync_worker.go:477] Loading payload
I0408 15:13:34.267678       1 payload.go:210] Loading updatepayload from "/etc/cvo/updatepayloads/Wu01xRb7K7Vz9hQJYPhGjg"
E0408 15:13:34.560105       1 precondition.go:49] Precondition "ClusterVersionUpgradeable" failed: Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [anyuid]
E0408 15:13:34.560228       1 sync_worker.go:329] unable to synchronize image (waiting 2m52.525702462s): Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [anyuid]

Comment 15 Armin Kunaschik 2020-04-09 09:57:43 UTC
This workaround did the trick while upgrading from 4.3.8 to 4.3.9:
* Remove all users from privileged SCC except:
- system:admin
- system:serviceaccount:openshift-infra:build-controller
(this might differ on your cluster)

* Start the update
* Add the removed users back a few moments later when the control plane update is in progress

Update will finish without problems.
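
A sketch of the commands involved (the service account and namespace names are illustrative):

    # see who is currently listed on the SCC
    oc get scc privileged -o jsonpath='{.users}{"\n"}'
    # remove the non-default entries
    oc adm policy remove-scc-from-user privileged -z trident -n trident
    # start the upgrade; once the control plane update is underway, add them back
    oc adm policy add-scc-to-user privileged -z trident -n trident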

Comment 16 Luca 2020-04-09 10:17:53 UTC
In my case I had a modification in anyuid, not in privileged, to be able to run Oracle12g.
The other issue is that I'm using nfs-provisioner, which modifies hostmount-anyuid. In this case I don't think the workaround will work, since I'm using NFS backing storage for the whole cluster (including the registry).

Comment 17 Armin Kunaschik 2020-04-09 10:39:00 UTC
@luca: It depends. I'm using Trident with NFS as backing storage and it worked, but probably only because it ran for just a few seconds without the necessary SCC and there were no pod restarts or PVC creations in that window.
As always: it's a workaround and might not work on every cluster :-)

Comment 18 Stefan Schimanski 2020-04-09 11:37:14 UTC
Nothing to check here for QE. Moving to VERIFIED.

Comment 19 Luca 2020-04-09 17:06:26 UTC
I removed the 2 users added by nfs-provisioner to hostmount-anyuid but I'm still getting the same error in the cluster-version-operator log:

1 precondition.go:49] Precondition "ClusterVersionUpgradeable" failed: Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [hostmount-anyuid]
1 sync_worker.go:329] unable to synchronize image (waiting 2m52.525702462s): Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [hostmount-anyuid]

Comment 20 Mark Jackson 2020-04-09 17:44:37 UTC
Hello, I just wanted to say that I am currently using OpenShift in a corporate environment and we are affected by this at the moment. We are using Trident.

Comment 22 daniel.hagen 2020-04-15 17:02:19 UTC
Ran into the same problem when upgrading from OpenShift 4.3.9 to OpenShift 4.3.10.

>> 4.3.10 cannot be applied: It may not be safe to apply this update

The procedure mentioned in Red Hat solution 4972291 works. This is not nice, but until it has been fixed by development, one has to live with it for better or worse.

I simply cloned the privileged SCC (roughly as sketched below) and removed the user I had added for "playground" purposes from the privileged SCC; none of the users belonging to the OpenShift universe should be removed.

The cluster-version-operator pod is recreated and the upgrade to version 4.3.10 starts.
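
A rough sketch of that clone-and-clean-up procedure (object and account names are illustrative; see Red Hat solution 4972291 for the authoritative steps):

    # export the shipped SCC and create a copy under a new name
    oc get scc privileged -o yaml > privileged-custom.yaml
    # edit privileged-custom.yaml: change metadata.name (e.g. to privileged-custom),
    # drop metadata.uid, metadata.resourceVersion and metadata.creationTimestamp,
    # and keep only the extra users that were added
    oc create -f privileged-custom.yaml
    # then remove the extra account from the shipped privileged SCC
    oc adm policy remove-scc-from-user privileged -z playground-sa -n playground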

Comment 23 Courtney Ruhm 2020-04-16 16:21:19 UTC
I noticed a potential bug today.  

When running the update via the web console, my updates would fail (even after resetting the SCCs).

However, when updating using the "oc adm upgrade --to=4.3.10" command, it would work.

I've seen this in both my dev and stage environments now.  I will try on my prod env as well.

Comment 25 Armin Kunaschik 2020-04-17 15:43:16 UTC
I tried to figure out how to achieve the same functionality (privileged SCC) with RBAC, but failed.

I found https://github.com/openshift/pipelines-catalog/issues/9
But if this is the intentional solution, then it actually decreases security.
Can somebody point me/us to the advised way of doing things with the same level of security in later OCP versions?

Comment 26 Armin Kunaschik 2020-04-18 19:37:01 UTC
Please ignore my last comment #25. The described procedure is working.

Comment 27 Carlo Reggiani 2020-04-22 10:00:12 UTC
Please help me restore the default SCC; I'm a novice at OpenShift administration :(

I have a bare-metal OpenShift 4.3.10 cluster and am trying to upgrade to 4.3.13.

I did an "oc adm policy add-scc-to-group anyuid system:authenticated" to be able to reuse a Docker image, and now I'm getting the upgrade error.

I'm looking through the documentation for how to undo the operation, but it's not so clear: could it be a simple "remove-scc-from-group"?

      oc adm policy remove-scc-from-group anyuid system:authenticated

Thanks for any support

Carlo

Comment 29 Ke Wang 2020-04-24 16:34:39 UTC
For OCP 4.3, the default privileged SCC looks like below; leave only the following users.
$ oc get scc privileged -o json | jq .users
[
  "system:admin",
  "system:serviceaccount:openshift-infra:build-controller"
]
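
To spot anything beyond the shipped defaults, a jq set difference along these lines can help (the default user list is the one shown above for the privileged SCC on 4.3):

    oc get scc privileged -o json | jq '.users - ["system:admin","system:serviceaccount:openshift-infra:build-controller"]'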

Comment 30 Luca 2020-04-24 18:33:23 UTC
(In reply to Luca from comment #19)
> I removed the 2 users added by nfs-provisioner in hostmount-anyuid but I'm
> still getting the same error in the operator version log:
> 
> 1 precondition.go:49] Precondition "ClusterVersionUpgradeable" failed:
> Cluster operator kube-apiserver cannot be upgraded:
> DefaultSecurityContextConstraintsUpgradeable: Default
> SecurityContextConstraints object(s) have mutated [hostmount-anyuid]
> 1 sync_worker.go:329] unable to synchronize image (waiting 2m52.525702462s):
> Precondition "ClusterVersionUpgradeable" failed because of
> "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver
> cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default
> SecurityContextConstraints object(s) have mutated [hostmount-anyuid]

Two comments on my own comment:
- the checks that make sure there is no conflict in the SCC configuration are not instant, so it might take a minute for the condition to clear (see the sketch below)
- the node-exporter SCC is recreated automatically (as expected), so no issue there
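
One way to re-check (re-run until the condition clears; the kube-apiserver operator re-evaluates the SCCs on its next sync):

    oc get clusteroperator kube-apiserver -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")]}{"\n"}'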

Comment 32 Abu Kashem 2020-05-19 15:04:42 UTC
Hi vjaypurk,
https://access.redhat.com/solutions/4972291

Comment 34 errata-xmlrpc 2020-07-13 17:24:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409