Description of problem:
Currently, the basic-user ClusterRole is not managed/restored by the CVO. If a user edits the ClusterRole and leaves an empty list of rules, different cluster components stop working. The basic-user ClusterRole is relied on by the image registry, Prometheus, and others, so if a user manages to erase its permissions, various errors start occurring in the cluster:

1) Failure to pull images

2) Pod creation blocked due to image pull failures:
Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/cli:latest": rpc error: code = Unknown desc = unable to retrieve auth token: invalid username/password: unauthorized: unable to validate token: Forbidden

3) Errors from different ServiceAccounts, like prometheus-k8s:
time="2020-05-05T01:28:38.007246679Z" level=error msg="Get user failed with error: users.user.openshift.io \"~\" is forbidden: User \"system:serviceaccount:openshift-monitoring:prometheus-k8s\" cannot get resource \"users\" in API group \"user.openshift.io\" at the cluster scope"

Version-Release number of the following components:
OpenShift 4.3.12

How reproducible:
Every time you perform the ClusterRole update described below

Steps to Reproduce:
1. As a cluster user, edit the basic-user ClusterRole and clear the rules array:
   $ oc edit clusterrole basic-user --as-group=dedicated-admins --as=myded-admin

Actual results:
The errors listed above start occurring in the cluster

Expected results:
The CVO should reconcile the basic-user ClusterRole back to its default/initial configuration
You can duplicate this with a user that is *not* cluster-admin, but *has* been given access to patch ClusterRoles:

  oc patch clusterrole/basic-user --type=json -p '[{"op": "remove", "path": "/rules"}]'

This will cause the impacts described above on the registry and Prometheus.
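A quick way to confirm the damage (my own hedged sketch, not from the original report; the ServiceAccount name is taken from the Prometheus error log above):

  # Confirm the rules array on the ClusterRole is now empty/absent
  oc get clusterrole basic-user -o jsonpath='{.rules}'

  # Check whether the prometheus-k8s ServiceAccount can still read user objects
  oc auth can-i get users.user.openshift.io \
    --as=system:serviceaccount:openshift-monitoring:prometheus-k8s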
The stock ClusterRoles are maintained during kube-apiserver startup, not by an operator or a controller. If this happens, you can force a redeployment of the kube-apiserver to trigger a restart, and the role will be fixed. A better solution, having the kube-apiserver run a controller that continuously maintains these roles, is technically possible but currently expensive in terms of QPS. It could be done, but it won't be something quickly actionable.
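As a hedged sketch of that recovery step (the exact field is an assumption based on the documented kubeapiserver operator config; verify against the docs for your version), forcing a redeployment usually means bumping forceRedeploymentReason:

  # Assumed approach: change forceRedeploymentReason so the operator rolls out new
  # kube-apiserver static pods; the bootstrap RBAC (including basic-user) is
  # re-applied at startup
  oc patch kubeapiserver cluster --type=merge \
    -p '{"spec":{"forceRedeploymentReason":"restore-default-rbac-'"$(date +%s)"'"}}'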
There is a workaround. This has been the behaviour since RBAC was created. As we don't promise to repair every possible modification in the cluster, this is not a bug or a regression. If we want to repair RBAC roles in real time, that's a feature. Please move it to Jira.
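One illustrative sketch of such a workaround (my addition, not from the comment above): export the stock role from a healthy cluster of the same version and reconcile it into the affected one. oc auth reconcile only adds missing permissions, so it is less disruptive than replacing the object outright:

  # On a healthy cluster of the same OpenShift version, capture the stock definition
  # (you may want to strip cluster-specific metadata such as uid/resourceVersion first)
  oc get clusterrole basic-user -o yaml > basic-user.yaml

  # On the affected cluster, add back any missing rules
  oc auth reconcile -f basic-user.yaml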
Created https://issues.redhat.com/browse/RFE-970 for this