Bug 1974401 - oauth-apiserver clusterrolebindings are getting removed from the cluster
Summary: oauth-apiserver clusterrolebindings are getting removed from the cluster
Keywords:
Status: CLOSED DUPLICATE of bug 1975456
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Per da Silva
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-21 15:08 UTC by Rutvik
Modified: 2023-09-15 01:34 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-07 02:04:18 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Knowledge Base (Solution) 6126771 (last updated 2021-06-25 14:02:26 UTC)

Description Rutvik 2021-06-21 15:08:30 UTC
Description of problem:

Recently we encountered an issue where the openshift-oauth-apiserver pods were stuck in a CrashLoopBackOff state, and the logs pointed to an RBAC problem. After investigating, we found that one of the default clusterrolebindings was missing, which caused this outage.

Affected Pod Logs:
---
2021-06-14T15:24:48.524431717Z E0614 15:24:48.524391 1 reflector.go:127] k8s.io/client-go@v0.19.0/tools/cache/reflector.go:156: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps "extension-apiserver-authentication" is forbidden: User "system:serviceaccount:openshift-authentication:oauth-openshift" cannot list resource "configmaps" in API group "" in the namespace "kube-system"
---
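
(For reference, a quick way to confirm the RBAC gap from the error above is to impersonate the affected service account; a sketch, assuming cluster-admin access:)

~~~
# Expect "no" while the RBAC is missing and "yes" once it is restored.
$ oc auth can-i list configmaps -n kube-system \
    --as=system:serviceaccount:openshift-authentication:oauth-openshift
~~~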

We fixed the missing one by creating it manually:

$ cat oauth-apiserver.yaml
-----
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:openshift:oauth-apiserver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: oauth-apiserver-sa
  namespace: openshift-oauth-apiserver
-----
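
(For completeness, a sketch of applying and verifying the fix; the filename is the one shown above:)

~~~
$ oc apply -f oauth-apiserver.yaml
$ oc get clusterrolebinding system:openshift:oauth-apiserver -o yaml
~~~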


Version-Release number of selected component (if applicable):
v4.6.31

The clusterrolebinding "system:openshift:oauth-apiserver" was found to be missing initially.
Later, multiple cluster operators were found to be affected by the same issue.

Actual results:
The missing clusterrolebinding caused an authentication outage.

Expected results:
Default clusterrolebindings should not remain deleted; they should be reconciled by the operator as soon as they are removed.
Customers expect an alert when such important RBAC gets deleted from the cluster.

Comment 3 Sebastian Łaskawiec 2021-06-23 06:57:35 UTC
The file containing the ClusterRoleBinding can be found in the bindata directory of the Cluster Authentication Operator [1]. The file is installed by the Cluster Authentication Operator during the bootstrap process. If the Cluster Authentication Operator fails for some reason, the `system:openshift:oauth-apiserver` ClusterRoleBinding could be missing. This theory could be verified by inspecting the Cluster Authentication Operator logs and events. @Rutvik, could I ask you to check this out?

The Security Context constraint validation seems to be the result of the missing `system:openshift:oauth-apiserver` ClusterRoleBinding. 

[1] https://github.com/openshift/cluster-authentication-operator/blob/master/bindata/oauth-apiserver/apiserver-clusterrolebinding.yaml
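
(For reference, a sketch of one way to check those logs and events, assuming the default operator namespace and deployment name:)

~~~
$ oc logs -n openshift-authentication-operator deployment/authentication-operator | grep -i clusterrolebinding
$ oc get events -n openshift-authentication-operator --sort-by=.lastTimestamp
~~~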

Comment 6 Rutvik 2021-06-24 14:10:12 UTC
Thanks for sharing your inputs.

I recall the case was opened on Jun 10, 2021, and the mentioned CVO error was seen on 2021-06-09.
Just FYI, the customer recently triggered an upgrade from 4.6.31 to 4.7.13 and it didn't move ahead; when I checked the CVO logs, I found the same error you have mentioned, see [1].
Because of this issue, the customer could not hit the upgrade button in the UI; when they attempted it from the CLI, it showed "The cluster is updating to 4.7.13", however the update didn't happen.
I collected all the clusterrolebindings from their cluster and can see this one was missing --> https://github.com/openshift/cluster-version-operator/blob/master/install/0000_00_cluster-version-operator_02_roles.yaml

We are yet to hear about the upgrade progress, but I hope it has completed/progressed by now.

My concern is: when such cascading failures can happen, as you've mentioned, why doesn't each cluster operator have the ability to reconcile its default RBAC?

I will attach the list of clusterrolebindings collected from the v4.6.31 cluster; you can compare it with any of our test clusters and you will notice a significant drop, even though we have fixed most of them manually.
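
(A rough sketch of how the two lists can be compared, assuming access to both an affected and a healthy cluster:)

~~~
# Run on each cluster (switch kubeconfig/context accordingly):
$ oc get clusterrolebindings -o name | sort > crb-affected.txt   # on the affected cluster
$ oc get clusterrolebindings -o name | sort > crb-healthy.txt    # on a healthy reference cluster
# Bindings present on the healthy cluster but missing on the affected one:
$ comm -13 crb-affected.txt crb-healthy.txt
~~~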

Let me know if you need any logs/data for further investigation.


[1] 
~~~
2021-06-23T15:44:06.384942255Z E0623 15:44:06.384905       1 leaderelection.go:321] error retrieving resource lock openshift-cluster-version/version: configmaps "version" is forbidden: User "system:serviceaccount:openshift-cluster-version:default" cannot get resource "configmaps" in API group "" in the namespace "openshift-cluster-version"

2021-06-23T15:44:06.384942255Z I0623 15:44:06.384921       1 leaderelection.go:248] failed to acquire lease openshift-cluster-version/version
~~~

Comment 8 Jack Ottofaro 2021-06-25 15:31:45 UTC
Going back to the original issue for which the case was opened, to be sure we understand: the error below was noticed after a successful 4.6.21 to 4.6.31 upgrade:

---
2021-06-14T15:24:48.524431717Z E0614 15:24:48.524391 1 reflector.go:127] k8s.io/client-go@v0.19.0/tools/cache/reflector.go:156: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps "extension-apiserver-authentication" is forbidden: User "system:serviceaccount:openshift-authentication:oauth-openshift" cannot list resource "configmaps" in API group "" in the namespace "kube-system"
---

Approximately when was that upgrade started, and when did it complete?

I accessed the must-gathers attached to case 02960277, and must-gather-TCB-20210610.tar.gz is the only one with CVO logs, which are dated 06/08 - 06/10. As stated, in that log the CVO started having an issue similar to the one above:

---
2021-06-09T02:06:14.102264864Z E0609 02:06:14.102193       1 leaderelection.go:321] error retrieving resource lock openshift-cluster-version/version: configmaps "version" is forbidden: User "system:serviceaccount:openshift-cluster-version:default" cannot get resource "configmaps" in API group "" in the namespace "openshift-cluster-version"
---

Some 06/11 logs are referenced along with this must-gather: https://attachment-viewer.cee.redhat.com/#/viewer/02961185?uuid=c4fc0893-2066-4825-8ff5-54bfe799a002. I can't access that link; however, case 02961185 has many must-gathers attached. Which one was taken immediately after the upgrade, when this issue was first noticed? Were any taken during the upgrade?

Comment 9 Rutvik 2021-06-25 18:46:00 UTC
Regarding the first case, i.e. 02961185, for which this BZ was reported (it came in after an upgrade from 4.6.21 to 4.6.31): the customer team was unable to capture a must-gather since it failed with some errors. After fixing the authentication RBAC, we captured the must-gather, the first one collected by us, on Jun 12; prior to that the customer did not have any must-gather collected during the 4.6.21->4.6.31 upgrade. We also had SOS reports from all the masters, and the drive link has been shared with you on Gchat.

Cases 02960277 & 02972998 have nothing to do with upgrades. These are cases where we found this issue at random, and no changes/patches had been applied recently.

Comment 10 Jack Ottofaro 2021-06-28 16:02:29 UTC
The CVO is simply getting the same type of error that many other components get when these RBs and CRBs are deleted, so this doesn't tell us anything about the root cause. The reason I focused on upgrades is Comment #5 and the statement that "something went wrong with the upgrade", after which the component was changed to CVO. But as you point out, the log containing that CVO error was from a cluster that wasn't being upgraded, so I'm not sure why upgrades were even mentioned in that context.

I went through all 3 cases and the must-gathers, and it seems the best lead is the audit logs in case 02972998 and the possible correlation with the installation of the Compliance Operator. If you haven't already, you should check whether any such correlation exists on the other clusters, or whether they are also running the Compliance Operator.

In any case this does not appear to be a CVO issue.

Comment 11 Rutvik 2021-06-28 19:51:23 UTC
Hello Jack, 

Thanks for the inputs.

>> it seems the best lead is the audit logs in case 02972998 and the possible correlation with the installation of the Compliance Operator.

I understand your take on the Compliance Operator; however, would you please share a snippet of those audit events that you feel we should take into consideration, or perhaps match against the other clusters' audit history? If I can see those, it will be easier for me to validate them against the other clusters in question.

Comment 12 Jack Ottofaro 2021-06-28 20:45:29 UTC
(In reply to Rutvik from comment #11)

> please share a snip of those audit events where you feel we should take them

I was just referring to the ones cited in the case comments: https://gss--c.visualforce.com/apex/Case_View?sbstr=02972998#comment_a0a2K00000bK6NeQAK and https://gss--c.visualforce.com/apex/Case_View?sbstr=02972998#comment_a0a2K00000bKOxFQAW.

However, I don't know much about the OCP audit events, but I noticed they were for a "delete" and the "requestURI" was "...openshift-compliance...".

Comment 17 Juan Antonio Osorio 2021-07-02 05:48:37 UTC
The Compliance Operator doesn't have permission to modify clusterrolebindings, or any RBAC-related object for that matter. We deliberately only gave it read-only permissions for when we need to verify that a specific permission is set.

Comment 18 Jack Ottofaro 2021-07-02 12:12:18 UTC
(In reply to Juan Antonio Osorio from comment #17)
> The Compliance Operator doesn't have permission to modify
> clusterrolebindings, or any RBAC-related object for that matter. We
> deliberately only gave it read-only permissions for when we need to verify
> that a specific permission is set.

Sorry, that was a mistake. I meant to change this back to the component it was originally opened against, oauth-apiserver.

Comment 19 Sebastian Łaskawiec 2021-07-06 09:40:17 UTC
Unfortunately, we don't have sufficient information to sort this out. We'd need a must-gather taken right after the failure. By inspecting the audit logs we could find out who deleted those objects and, if this happened due to some bug, find the right component.

Rutvik - I'm closing this issue - please reopen it once you get the necessary files.

Comment 20 Rutvik 2021-07-06 10:22:57 UTC
Sure, I understand that a must-gather & audit logs close to the issue timestamp would have been more useful.
However, the customer was unable to run the must-gather collection on the affected cluster in the first place; since multiple operators were unhealthy in the background, it hung.
I will definitely ask the customer to collect audit logs as soon as they identify this type of issue again.

I have a couple of questions regarding this:

1. Is there any specific audit filter/setting you would recommend we use in such a scenario?
2. Why do the operators in general not reconcile their default RBAC? Since this issue affected multiple core operators, it becomes a serious issue for production clusters.
3. If an operator cannot reconcile its CRB, can there at least be an alert that indicates when an operator's CRB goes missing?

Comment 22 Standa Laznicka 2021-07-07 12:52:00 UTC
Rutvik, those don't seem to be audit logs, but events. In order to figure out what happened, we would need the actual audit logs. These live on the nodes and do not have to be taken freshly after the upgrade, although I believe we only store them for 5 days. It would very likely be useful to see what the authentication-operator logs were at that time, too. Without that kind of information we will not be able to RCA this.
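
(For reference, a sketch of pulling the node audit logs with oc, assuming the default kube-apiserver audit log location:)

~~~
# List the rotated audit log files on the masters:
$ oc adm node-logs --role=master --path=kube-apiserver/
# Fetch the current audit log from a specific master (node name is a placeholder):
$ oc adm node-logs <master-node-name> --path=kube-apiserver/audit.log > audit-master.log
~~~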

Comment 30 Abu Kashem 2021-07-19 13:37:51 UTC
Cluster Resource Override is owned by the Node team now, so I believe we should have them look at this as well.

I have a few points to add:
- The CRB is managed by the cluster-authentication-operator, so if it gets deleted, the operator should recreate it, right?
- It's odd that the deletion is not present in the audit logs; the default audit profile should log the delete operation. Can we get the jq or grep command they are using to search the audit logs? Or we can give them the right query (see the sketch after this list).
- The Cluster Resource Override Operator is an OLM-enabled operator. How is the customer removing the operator: using OLM, or by directly removing the operator manifests? I assume the customer is removing it via OLM.
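
(A sketch of such a query, assuming the default audit log location and the standard Kubernetes audit event fields:)

~~~
$ oc adm node-logs --role=master --path=kube-apiserver/audit.log \
    | jq -c 'select(.verb == "delete" and .objectRef.resource == "clusterrolebindings")
             | {time: .requestReceivedTimestamp, user: .user.username, name: .objectRef.name}'
~~~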


I quickly skimmed through the Cluster Resource Override code; it does not have any logic to delete a cluster role binding object directly. When the operator is removed, I guess it's OLM that kicks into action.

I would recommend another test where we do the following (a rough sketch of the corresponding commands follows the list):
- run the kube-apiserver at log level 4; it will then include httplog output
- remove the Cluster Resource Override Operator
- capture the OLM, Cluster Resource Override, and kube-apiserver logs
- capture the audit logs
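
(A rough sketch of the corresponding commands; it assumes that logLevel "Debug" on the KubeAPIServer operator CR maps to -v=4, and the Cluster Resource Override deployment name/namespace shown are illustrative and may differ:)

~~~
# Raise kube-apiserver verbosity (Debug is expected to enable httplog output):
$ oc patch kubeapiserver cluster --type=merge -p '{"spec":{"logLevel":"Debug"}}'
# After removing the Cluster Resource Override Operator via OLM, capture the logs:
$ oc logs -n openshift-operator-lifecycle-manager deployment/olm-operator > olm.log
$ oc logs -n clusterresourceoverride-operator deployment/clusterresourceoverride-operator > croo.log   # name/namespace are assumptions
$ oc logs -n openshift-kube-apiserver kube-apiserver-<master-node-name> > kube-apiserver.log
$ oc adm node-logs --role=master --path=kube-apiserver/audit.log > audit.log
~~~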

Comment 49 Kevin Rizza 2022-01-05 19:25:57 UTC
As Nick noted, this looks like it is probably just a duplicate of another issue that was already resolved. Reassigning this one so someone can take a look and confirm.

Comment 50 Per da Silva 2022-01-07 02:04:18 UTC
From reading the RHKB solution attached to this ticket, I believe the problem is already solved: https://access.redhat.com/solutions/6126771.

*** This bug has been marked as a duplicate of bug 1975456 ***

Comment 51 Red Hat Bugzilla 2023-09-15 01:34:40 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

