Created attachment 1790212 [details]
aio-cluster-kube-descheduler-operator.yaml and kubedescheduler-cluster.yaml

Description of problem:
Uninstalling kube-descheduler clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5 removes clusterrolebindings it does not own, leaving the cluster unusable.

Version-Release number of selected component (if applicable):
clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5

How reproducible:
Always.

Steps to Reproduce:
1. Create a fresh installation of OCP 4.6
2. oc create -f aio-cluster-kube-descheduler-operator.yaml
3. oc create -f kubedescheduler-cluster.yaml
4. Check the CSV and clusterrolebindings:
   oc get clusterrolebinding -A | wc -l
   oc get csv
   NAME                                                                DISPLAY                     VERSION                             REPLACES   PHASE
   clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5   Kube Descheduler Operator   4.6.0-202106010807.p0.git.5db84c5              Pending
5. oc delete csv clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5
6. Wait for OLM to remove clusterrolebindings
7. oc get clusterrolebinding -A | wc -l

Actual results:
The number of clusterrolebindings drops drastically.

Expected results:
Only the clusterrolebindings belonging to the operator should be removed.

Additional info:
Attaching the yaml files mentioned in the reproducer steps.
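For reference, a minimal sketch of what the two attached files could contain; the actual attachments are authoritative, and the channel, source, and KubeDescheduler spec below are assumptions:

```
# aio-cluster-kube-descheduler-operator.yaml (sketch): namespace, operator group and subscription
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-kube-descheduler-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-kube-descheduler-operator
  namespace: openshift-kube-descheduler-operator
spec:
  targetNamespaces:
  - openshift-kube-descheduler-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-kube-descheduler-operator
  namespace: openshift-kube-descheduler-operator
spec:
  channel: "4.6"            # assumption
  name: cluster-kube-descheduler-operator
  source: redhat-operators  # assumption
  sourceNamespace: openshift-marketplace
---
# kubedescheduler-cluster.yaml (sketch): the KubeDescheduler CR that triggers the operand
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 3600  # assumption
```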
> 5. oc delete csv clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5

Can't verify whether you are supposed to remove the CSV directly.

> 6. Wait for OLM to remove clusterrolebindings

The descheduler operator's RBAC rules for 4.6 are defined in https://github.com/openshift/cluster-kube-descheduler-operator/blob/release-4.6/manifests/4.6/cluster-kube-descheduler-operator.v4.6.0.clusterserviceversion.yaml#L102-L144. If OLM is removing more clusterrolebindings than specified by the descheduler operator's CSV, it's a bug in OLM itself. Switching to OLM for further analysis.
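For context, the RBAC that OLM creates and is expected to clean up comes from the CSV's clusterPermissions stanza, which has roughly this shape; this is an illustrative sketch only, not the contents of the linked file, and the service account name and rules are assumptions:

```
# illustrative shape of a CSV's clusterPermissions stanza (not the actual file contents)
spec:
  install:
    strategy: deployment
    spec:
      clusterPermissions:
      - serviceAccountName: openshift-descheduler   # assumption
        rules:
        - apiGroups: [""]
          resources: ["pods", "nodes"]
          verbs: ["get", "list", "watch"]
```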
Hello,

I have tried similar steps as per the description and I could reproduce the issue.

Before deleting the csv, below are the clusterrolebindings:
======================================================
[knarra@knarra openshift-client-linux-4.6.0-0.nightly-2021-06-11-023746]$ ./oc get clusterrolebinding -A | wc -l
187

After deleting the csv, below are the clusterrolebindings:
=======================================================
[knarra@knarra openshift-client-linux-4.6.0-0.nightly-2021-06-11-023746]$ ./oc get clusterrolebinding -A | wc -l
140

Below link contains the must-gather file:
http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1970910/
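A sketch of how the removed clusterrolebindings could be identified, assuming snapshots are taken before and after the CSV deletion:

```
# snapshot the clusterrolebinding names before deleting the CSV
oc get clusterrolebinding -o name | sort > crb-before.txt

# delete the CSV (namespace is an assumption) and wait for OLM to finish its cleanup
oc delete csv -n openshift-kube-descheduler-operator \
    clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5

# snapshot again and diff to see exactly which bindings disappeared
oc get clusterrolebinding -o name | sort > crb-after.txt
diff crb-before.txt crb-after.txt
```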
Hello Vu Dinh,

Maybe the version you tried reproducing with already has the fix? I tried with the same version mentioned in the bug and was able to reproduce the issue. The must-gather collected from my cluster is present at [1]; you could take a look at it.

[1] http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1970910/

Thanks
kasturi
Hi Rama,

I would like to ask how you installed this version `clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5`, given that it no longer seems to be available in the index for installation. Plus, if this version is a faulty one, then the team who owns this operator should recommend that customers upgrade to a different version to avoid this issue. This seems to me to be a descheduler operator issue rather than an OLM issue.

Vu
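One way to check which descheduler versions the catalog currently offers is via the packagemanifest; a sketch, assuming the default catalog in openshift-marketplace:

```
# list each channel and the CSV it currently points at for the descheduler package
oc get packagemanifest cluster-kube-descheduler-operator -n openshift-marketplace \
  -o jsonpath='{range .status.channels[*]}{.name}{"\t"}{.currentCSV}{"\n"}{end}'
```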
The operator on another cluster is behaving like this:

  - lastTransitionTime: "2021-06-17T08:53:26Z"
    lastUpdateTime: "2021-06-17T08:53:26Z"
    message: 'install strategy failed: Deployment.apps "descheduler-operator" is invalid:
      metadata.labels: Invalid value: "clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5":
      must be no more than 63 characters'
    phase: Failed
    reason: InstallComponentFailed

In my reproducer I haven't set this version; I just added the kube-descheduler operator and it upgraded to that version by itself. I did nothing but install it. It is stuck in an infinite loop: Pending --> Install Ready --> Failed.

If we remove the CSV we will lose the clusterrolebindings, so... the question is... how can we get rid of this operator's infinite install loop without breaking the cluster?
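The loop can be watched by inspecting the CSV status directly; a sketch, assuming the operator lives in openshift-kube-descheduler-operator:

```
# print the current phase and failure message of the stuck CSV
oc get csv -n openshift-kube-descheduler-operator \
    clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5 \
    -o jsonpath='{.status.phase}{": "}{.status.message}{"\n"}'
```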
> This seems to me to be a descheduler operator issue rather than an OLM issue.

How can a different version name of the descheduler operator (or any operator) have an effect on how many CRBs get deleted? Is it possible something in the OLM...

> clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5

I am not fully familiar with the image promotion for OLM operator images. Though, given this failed the validation, I find it unlikely this would get to a customer, unless some of the validation tests are not required to pass.

> If we remove the CSV we will lose the clusterrolebindings, so... the question is... how can we get rid of this operator's infinite install loop without breaking the cluster?

Is it possible to replace the existing CSV with a new one? Or an older one?
Checking http://virt-openshift-05.lab.eng.nay.redhat.com/knarra/1970910/must-gather.local.5159323793101762871/namespaces/openshift-operator-lifecycle-manager/pods/olm-operator-67545f87c4-gv94t/olm-operator/olm-operator/logs/current.log, there's a lot of "cannot delete cluster role binding":

```
2021-06-11T16:46:16.943744597Z time="2021-06-11T16:46:16Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"console\" is forbidden: User \"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operato
2021-06-11T16:46:16.993671976Z time="2021-06-11T16:46:16Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"system:controller:deployment-controller\" is forbidden: User \"system:serviceaccount:openshift-operat
2021-06-11T16:46:17.043913067Z time="2021-06-11T16:46:17Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"default-account-openshift-machine-config-operator\" is forbidden: User \"system:serviceaccount:opensh
2021-06-11T16:46:17.093769174Z time="2021-06-11T16:46:17Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"cluster-node-tuning:tuned\" is forbidden: User \"system:serviceaccount:openshift-operator-lifecycle-m
2021-06-11T16:46:17.143765805Z time="2021-06-11T16:46:17Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"insights-operator\" is forbidden: User \"system:serviceaccount:openshift-operator-lifecycle-manager:o
2021-06-11T16:46:17.193750113Z time="2021-06-11T16:46:17Z" level=warning msg="cannot delete cluster role binding" error="clusterrolebindings.rbac.authorization.k8s.io \"system:openshift:operator:openshift-controller-manager-operator\" is forbidden: User ...
```

Why would OLM try to delete so many CRBs because of a "broken" operator csv?
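The same warnings can also be pulled from a live cluster rather than the must-gather; a sketch, assuming the default OLM deployment name:

```
# count the "cannot delete cluster role binding" warnings emitted by the OLM operator
oc logs -n openshift-operator-lifecycle-manager deployment/olm-operator \
  | grep "cannot delete cluster role binding" | wc -l
```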
The cluster where I reproduced the issue is no longer available. If I try to install the kube-descheduler operator in a new cluster, the version installed is 'clusterkubedescheduleroperator.4.6.0-202103010126.p0' instead of 'clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5'.

Maybe @knarra still has the cluster available to check whether 'oc replace --force -f clusterkubedescheduleroperator.4.6.0-202103010126.p0.yaml' works or not.

Regarding the question:

> Why would OLM try to delete so many CRBs because of a "broken" operator csv?

@jchaloup our guess is that OLM somehow, and we don't know why, tried to remove all CRBs and only stopped after removing the CRB that allowed it to remove any more.
Hi, team! Things that I have already checked:

- oc replace won't work, as it keeps the name of the resource
- oc patch or oc edit replacing the name do not work either, as the name of the resource cannot be changed
- Given that we have the version named 'clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5', the operator is not able to update because the name is too long, and the update is in an infinite loop:

    lastTransitionTime: "2021-06-17T08:53:26Z"
    lastUpdateTime: "2021-06-17T08:53:26Z"
    message: 'install strategy failed: Deployment.apps "descheduler-operator" is invalid:
      metadata.labels: Invalid value: "clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5":
      must be no more than 63 characters'
    phase: Failed
    reason: InstallComponentFailed

So, I see just two ways to go on:

- backup the CRBs, delete the faulty csv, restore the CRBs (see the sketch below), or
- wait for the fix, apply the fix (upgrade), restart the cluster (as part of the upgrade) and then safely remove the faulty csv

Am I right @vdinh ?
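A rough sketch of what the first option could look like; the namespace is an assumption, and server-generated metadata in the backup may need cleanup before it can be re-applied:

```
# back up every clusterrolebinding before touching the CSV
oc get clusterrolebinding -o yaml > crb-backup.yaml

# delete the stuck CSV and wait for OLM to finish its (over-eager) cleanup
oc delete csv -n openshift-kube-descheduler-operator \
    clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5

# re-create whatever went missing; fields like uid, resourceVersion and
# creationTimestamp may need to be stripped from the backup first
oc apply -f crb-backup.yaml
```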
Hey Jose,

I concur with the options that you mentioned. Unfortunately, those are the only two options at the moment.

Vu
Hi, Vu!

We have been thinking about my first option and it is too risky. We are not sure in which order the CRBs would be removed, so we might lose the ability to recreate them. So, it's no longer an option.

The only feasible option is to get the fix into 4.6.

Cheers,
Jose
Verified the bug by creating a 4.9 cluster using cluster-bot. Below are the steps I followed to verify the bug.

Steps followed:
===================
1) Create a namespace called 'openshift-kube-descheduler-operator'

2) Create an operatorgroup using the yaml below

[knarra@knarra ~]$ cat /tmp/operatorgroup.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: openshift-kube-descheduler-operator
  namespace: openshift-kube-descheduler-operator
spec:
  targetNamespaces:
  - openshift-kube-descheduler-operator

3) Create a catalogsource with the index image using the yaml below

[knarra@knarra ~]$ cat /tmp/catalogsource.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: qe-app-registry
  namespace: openshift-kube-descheduler-operator
spec:
  sourceType: grpc
  image: docker.io/dinhxuanvu/descheduler-index:v1

4) Create a subscription using the yaml below

[knarra@knarra ~]$ cat /tmp/subscription.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-kube-descheduler-operator
  namespace: openshift-kube-descheduler-operator
spec:
  channel: stable
  name: cluster-kube-descheduler-operator
  source: qe-app-registry
  sourceNamespace: openshift-kube-descheduler-operator

Now you can see that the csv is in Pending state with the error "one or more requirements could not be found":

Events:
  Type    Reason               Age                From                        Message
  ----    ------               ----               ----                        -------
  Normal  RequirementsUnknown  40s (x3 over 41s)  operator-lifecycle-manager  requirements not yet checked
  Normal  RequirementsNotMet   40s (x2 over 40s)  operator-lifecycle-manager  one or more requirements couldn't be found

Now check the clusterrolebindings count:

[knarra@knarra ~]$ oc get clusterrolebinding -A | wc -l
200

Delete the csv and check the count again:

[knarra@knarra ~]$ oc delete csv clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5
clusterserviceversion.operators.coreos.com "clusterkubedescheduleroperator.4.6.0-202106010807.p0.git.5db84c5" deleted
[knarra@knarra ~]$ oc get clusterrolebinding -A | wc -l
200
[knarra@knarra ~]$ oc get clusterrolebinding -A | wc -l
200

I see that the clusterrolebinding count remained the same.

@vdinh could you please help confirm if the steps above look good for verification or if we need to do any additional verification steps?
Hi,

Yes, the steps you've done are sufficient to verify this BZ.

Vu
Thanks Vu Dinh!! Based on comment 27 and comment 28, moving the bug to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759