Bug 1970910

Summary: Uninstalling kube-descheduler clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5 removes some clusterrolebindings
Product: OpenShift Container Platform Reporter: Jose Ortiz Padilla <jortizpa>
Component: OLM    Assignee: Vu Dinh <vdinh>
OLM sub component: OLM QA Contact: RamaKasturi <knarra>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: aos-bugs, bluddy, cldavey, davegord, inecas, jchaloup, jiazha, jrosenta, knarra, mfojtik, nhale, openshift-bugs-escalate, tsedovic, vdinh
Version: 4.6    Keywords: Triaged
Target Milestone: ---    Flags: vdinh: needinfo-
Target Release: 4.9.0   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Cause: A CSV name over the 63-character limit produces an invalid ownerref label. Consequence: When OLM uses the ownerref label to retrieve the resources it owns (including clusterrolebindings), the lister returns all clusterrolebindings in the cluster because of the invalid label. Fix: OLM now uses a different method so that the server rejects the invalid ownerref label instead. Result: When a CSV has an invalid name, OLM does not remove any clusterrolebindings.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-18 17:33:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1974414    
Attachments:
Description Flags
aio-cluster-kube-descheduler-operator.yaml and kubedescheduler-cluster.yaml none

Description Jose Ortiz Padilla 2021-06-11 12:49:14 UTC
Created attachment 1790212 [details]
aio-cluster-kube-descheduler-operator.yaml and kubedescheduler-cluster.yaml

Description of problem:
Uninstalling kube-descheduler clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5 removes some clusterrolebindings, leaving the cluster unusable.

Version-Release number of selected component (if applicable):
clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5

How reproducible:
Always. 

Steps to Reproduce:
1. Create a fresh installation of OCP 4.6
2. oc create -f aio-cluster-kube-descheduler-operator.yaml
3. oc create -f kubedescheduler-cluster.yaml
4. Check the CSV and clusterrolebindings:
oc get clusterrolebinding -A | wc -l
oc get csv
NAME                                                               DISPLAY                     VERSION                             REPLACES   PHASE
clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5   Kube Descheduler Operator   4.6.0-202106010807.p0.git.5db84c5              Pending
5. oc delete csv clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5
6. Wait for OLM to remove clusterrolebindings
7. oc get clusterrolebinding -A | wc -l

Actual results:
The number of clusterrolebindings is drastically reduced.


Expected results:
Only the clusterrolebindings belonging to the operator's namespace should be removed.

Additional info:
Attaching the YAML files mentioned in the reproducer steps.

Comment 1 Jan Chaloupka 2021-06-11 14:16:55 UTC
> 5. oc delete csv clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5

I can't confirm whether you are supposed to remove the CSV directly.

> 6. Wait for OLM to remove clusterrolebindings

The descheduler operator's RBAC rules for 4.6 are defined in https://github.com/openshift/cluster-kube-descheduler-operator/blob/release-4.6/manifests/4.6/cluster-kube-descheduler-operator.v4.6.0.clusterserviceversion.yaml#L102-L144. If OLM is removing more clusterrolebindings than the descheduler operator's CSV specifies, it's a bug in OLM itself.

Switching to OLM for further analysis.
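For reference, the cluster-scoped RBAC an operator asks OLM to create is declared under spec.install.spec.clusterPermissions in its CSV, so the bindings OLM is expected to clean up can be read from the installed CSV directly (a quick sketch; <install-namespace> is a placeholder for the operator's install namespace):

```
oc get csv clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5 \
  -n <install-namespace> \
  -o jsonpath='{.spec.install.spec.clusterPermissions[*].rules}'
```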

Comment 2 RamaKasturi 2021-06-11 17:41:05 UTC
Hello,

   I tried steps similar to those in the description and was able to reproduce the issue. The clusterrolebinding counts before and after deleting the CSV are below.

Before deleting csv below are the clusterrolebindings:
======================================================
[knarra@knarra openshift-client-linux-4.6.0-0.nightly-2021-06-11-023746]$ ./oc get clusterrolebinding -A | wc -l
187 

After deleting csv below are the clusterrolebindings:
=======================================================
[knarra@knarra openshift-client-linux-4.6.0-0.nightly-2021-06-11-023746]$ ./oc get clusterrolebinding -A | wc -l
140 

The link below contains the must-gather file:

http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1970910/

Comment 5 RamaKasturi 2021-06-17 06:29:48 UTC
Hello Vu Dinh,

   Maybe the version you tried to reproduce with already has the issue fixed? I tried with the same version as mentioned in the bug and was able to reproduce the issue. The must-gather collected from my cluster is available at [1]; you could take a look at it.

[1] http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1970910/

Thanks
kasturi

Comment 6 Vu Dinh 2021-06-17 06:59:40 UTC
Hi Rama,

I would like to ask how you installed version `clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5`, given that it no longer seems to be available in the index for installation. Also, if this version is faulty, the team that owns this operator should recommend that customers upgrade to a different version to avoid this issue. It seems to me this is more a descheduler operator issue than an OLM issue.

Vu

Comment 10 Jose Ortiz Padilla 2021-06-17 10:08:52 UTC
The operator on another cluster is behaving like this:

- lastTransitionTime: "2021-06-17T08:53:26Z"
    lastUpdateTime: "2021-06-17T08:53:26Z"
    message: 'install strategy failed: Deployment.apps "descheduler-operator" is invalid:
      metadata.labels: Invalid value: "clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5":
      must be no more than 63 characters'
    phase: Failed
    reason: InstallComponentFailed

In my reproducer I did not set this version; I just installed the kube-descheduler operator and it upgraded itself to that version. I did nothing other than install it.

It is stuck in an infinite loop:

Pending --> Install Ready --> Failed

If we remove the CSV we will lose the clusterrolebindings, so the question is: how can we get rid of this operator's infinite install loop without breaking the cluster?
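For context, Kubernetes label values are capped at 63 characters, and the 64-character CSV name is rejected anywhere it is used as a label value, which is what the install-strategy error above shows. A quick way to see the rejection with a throwaway label key (illustrative only, any object works):

```
oc label namespace default \
  test-owner=clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5
# fails with: ... must be no more than 63 characters
```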

Comment 11 Jan Chaloupka 2021-06-17 12:48:39 UTC
> It seems to me this is more a descheduler operator issue than an OLM issue.

How can a different version name of the descheduler operator (or any operator) affect how many CRBs get deleted? Is it possible something in the OLM 

> clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5

I am not fully familiar with image promotion for OLM operator images. Still, given that this failed validation, I find it unlikely it would reach a customer, unless some of the validation tests are not required to pass.

> If we remove the CSV we will lose the clusterrolebindings, so the question is: how can we get rid of this operator's infinite install loop without breaking the cluster?

Is it possible to replace the existing CSV with a new one? Or, an older one?

Comment 12 Jan Chaloupka 2021-06-17 13:02:12 UTC
Checking http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1970910/must-gather.local.5159323793101762871/namespaces/openshift-operator-lifecycle-manager/pods/olm-operator-67545f87c4-gv94t/olm-operator/olm-operator/logs/current.log, there's a lot of "cannot delete cluster role binding":

```
2021-06-11T16:46:16.943744597Z time="2021-06-11T16:46:16Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"console\" is forbidden: User \"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operato
2021-06-11T16:46:16.993671976Z time="2021-06-11T16:46:16Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"system:controller:deployment-controller\" is forbidden: User \"system:serviceaccount:openshift-operat
2021-06-11T16:46:17.043913067Z time="2021-06-11T16:46:17Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"default-account-openshift-machine-config-operator\" is forbidden: User \"system:serviceaccount:opensh
2021-06-11T16:46:17.093769174Z time="2021-06-11T16:46:17Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"cluster-node-tuning:tuned\" is forbidden: User \"system:serviceaccount:openshift-operator-lifecycle-m
2021-06-11T16:46:17.143765805Z time="2021-06-11T16:46:17Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"insights-operator\" is forbidden: User \"system:serviceaccount:openshift-operator-lifecycle-manager:o
2021-06-11T16:46:17.193750113Z time="2021-06-11T16:46:17Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"system:openshift:operator:openshift-controller-manager-operator\" is forbidden: User 
...
```

Why would OLM try to delete so many CRBs because of a "broken" operator CSV?
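This lines up with the root cause later captured in the Doc Text above: OLM looks up the resources it owns via ownerref labels, and the over-long CSV name makes that label invalid, so the lister ends up returning every clusterrolebinding instead of just the operator's own. With a valid CSV name the owned set would be only what a selector like this returns (a sketch using OLM's olm.owner label convention; placeholder values):

```
oc get clusterrolebinding -l olm.owner=<csv-name>,olm.owner.namespace=<install-namespace>
```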

Comment 13 Jose Ortiz Padilla 2021-06-17 13:45:05 UTC
The cluster where I reproduced the issue is no longer available.

If I try to install the kube-descheduler operator in a new cluster, the version installed is 'clusterkubedescheduleroperator.4.6.0-202103010126.p0' instead of 'clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5'.

Maybe @knarra has the cluster available to check if 'oc replace --force -f clusterkubedescheduleroperator.4.6.0-202103010126.p0.yaml' works or not. 
Regarding question:


> Why would OLM try to delete so many CRBs because of a "broken" operator CSV?

@jchaloup our guess is that somehow, and we don't know why, OLM tried to remove all CRBs; it simply stopped after removing the CRB that allowed it to remove any more.

Comment 23 Jose Ortiz Padilla 2021-06-18 08:22:16 UTC
Hi, team!

Things I have already checked:

- oc replace won't work, as it keeps the name of the resource
- oc patch or oc edit replacing the name do not work either, as the name of a resource cannot be changed
- Given that we have the version named 'clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5', the operator is not able to update because the name is too long, and the update is stuck in an infinite loop:
  lastTransitionTime: "2021-06-17T08:53:26Z"
    lastUpdateTime: "2021-06-17T08:53:26Z"
    message: 'install strategy failed: Deployment.apps "descheduler-operator" is invalid:
      metadata.labels: Invalid value: "clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5":
      must be no more than 63 characters'
    phase: Failed
    reason: InstallComponentFailed

- So, I see just two ways forward:

- backup CRBs, delete the faulty csv, restore CRBs (rough sketch after this comment)

or:

- wait for the fix, apply the fix (upgrade), restart the cluster (as part of the upgrade) and then safely remove the faulty csv

Am I right, @vdinh?
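A rough sketch of the backup/restore option (comment 25 later rules it out as too risky):

```
# Snapshot all clusterrolebindings first, then delete the faulty CSV and
# recreate whatever went missing. Restoring from a raw dump may require
# stripping uid/resourceVersion fields, and bindings that survived will
# simply fail with "already exists".
oc get clusterrolebinding -o yaml > crb-backup.yaml
oc delete csv clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5
oc create -f crb-backup.yaml
```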

Comment 24 Vu Dinh 2021-06-18 14:17:31 UTC
Hey Jose,

I concur with the options that you mentioned. Unfortunately, those are the only two options at the moment.

Vu

Comment 25 Jose Ortiz Padilla 2021-06-18 14:23:38 UTC
Hi, Vu!

We have been thinking about my first option and it is too risky. We are not sure in which order the CRBs would be removed, so we might lose the ability to recreate them. So it is no longer an option.

The only feasible option is to get the fix to 4.6.

Cheers,
Jose

Comment 27 RamaKasturi 2021-06-21 07:29:02 UTC
Verified the bug by creating a 4.9 cluster using cluster-bot. Below are the steps I followed to verify the bug.

steps followed:
===================
1) create namespace called 'openshift-kube-descheduler-operator'
2) create operatorgroup using the yaml below
[knarra@knarra ~]$ cat /tmp/operatorgroup.yaml 
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-kube-descheduler-operator
  namespace: openshift-kube-descheduler-operator
spec:
  targetNamespaces:
    - openshift-kube-descheduler-operator
3) create catalogsource with index image using the yaml below
[knarra@knarra ~]$ cat /tmp/catalogsource.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: qe-app-registry
  namespace: openshift-kube-descheduler-operator
spec:
  sourceType: grpc
  image: docker.io/dinhxuanvu/descheduler-index:v1

4) create subscription using the yaml file below

[knarra@knarra ~]$ cat /tmp/subscription.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-kube-descheduler-operator
  namespace: openshift-kube-descheduler-operator
spec:
  channel: stable
  name: cluster-kube-descheduler-operator
  source: qe-app-registry
  sourceNamespace: openshift-kube-descheduler-operator

Now you can see that the CSV is in a Pending state with the error "one or more requirements could not be found":

Events:
  Type    Reason               Age                From                        Message
  ----    ------               ----               ----                        -------
  Normal  RequirementsUnknown  40s (x3 over 41s)  operator-lifecycle-manager  requirements not yet checked
  Normal  RequirementsNotMet   40s (x2 over 40s)  operator-lifecycle-manager  one or more requirements couldn't be found

Now check the clusterrolebinding count:

[knarra@knarra ~]$ oc get clusterrolebinding -A | wc -l
200

Delete the csv and check the count again.

[knarra@knarra ~]$ oc delete csv clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5
clusterserviceversion.operators.coreos.com "clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5" deleted
[knarra@knarra ~]$ oc get clusterrolebinding -A | wc -l
200
[knarra@knarra ~]$ oc get clusterrolebinding -A | wc -l
200

I see that the clusterrolebinding count remained the same.

@vdinh could you please help confirm whether the steps above look good for verification, or whether we need to do any additional verification steps?

Comment 28 Vu Dinh 2021-06-21 15:15:14 UTC
Hi,

Yes, the steps you've done are sufficient to verify this BZ.

Vu

Comment 29 RamaKasturi 2021-06-21 15:32:54 UTC
Thanks Vu Dinh !!

Based on comment 27 and comment 28, moving the bug to the verified state.

Comment 40 errata-xmlrpc 2021-10-18 17:33:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759