Description of problem:

This bug is a clone of bug 1732214. It should be fixed in the 4.1.z version.

The catalog-operator pod is in a crash loop with the following error:

```
time="2019-07-22T21:21:53Z" level=info msg="log level info"
time="2019-07-22T21:21:53Z" level=info msg="TLS keys set, using https for metrics"
time="2019-07-22T21:21:53Z" level=info msg="Using in-cluster kube client config"
time="2019-07-22T21:21:53Z" level=info msg="Using in-cluster kube client config"
time="2019-07-22T21:21:53Z" level=info msg="Using in-cluster kube client config"
time="2019-07-22T21:21:53Z" level=info msg="connection established. cluster-version: v1.13.4+6569b4f"
time="2019-07-22T21:21:53Z" level=info msg="operator ready"
time="2019-07-22T21:21:53Z" level=info msg="starting informers..."
time="2019-07-22T21:21:53Z" level=info msg="waiting for caches to sync..."
time="2019-07-22T21:21:53Z" level=info msg="starting workers..."
time="2019-07-22T21:21:53Z" level=info msg=syncing id=J0bpl ip=install-ssz97 namespace=openshift-dedicated-admin phase=Installing
time="2019-07-22T21:21:53Z" level=info msg="building connection to registry" currentSource="{community-operators openshift-marketplace}" id=WiU2e source=community-operators
time="2019-07-22T21:21:53Z" level=info msg=syncing id=uqL/b ip=install-7h475 namespace=openshift-logging phase=Complete
time="2019-07-22T21:21:53Z" level=info msg="client hasn't yet become healthy, attempt a health check" currentSource="{community-operators openshift-marketplace}" id=WiU2e source=community-operators
time="2019-07-22T21:21:53Z" level=info msg=syncing id=BZh1k ip=install-c2thj namespace=openshift-monitoring phase=Complete
time="2019-07-22T21:21:54Z" level=info msg="retrying openshift-logging"
E0722 21:21:54.040243       1 queueinformer_operator.go:186] Sync "openshift-logging" failed: no catalog sources available
time="2019-07-22T21:21:54Z" level=info msg="retrying openshift-operator-lifecycle-manager"
E0722 21:21:54.239699       1 queueinformer_operator.go:186] Sync "openshift-operator-lifecycle-manager" failed: no catalog sources available
time="2019-07-22T21:21:54Z" level=info msg="building connection to registry" currentSource="{configure-alertmanager-operator-registry openshift-operator-lifecycle-manager}" id=lLs1i source=configure-alertmanager-operator-registry
time="2019-07-22T21:21:54Z" level=info msg="client hasn't yet become healthy, attempt a health check" currentSource="{configure-alertmanager-operator-registry openshift-operator-lifecycle-manager}" id=lLs1i source=configure-alertmanager-operator-registry

panic: assignment to entry in nil map

goroutine 183 [running]:
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.(*Operator).ExecutePlan(0xc4204bec00, 0xc421924000, 0x1, 0x1)
	/go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:1187 +0x4243
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.transitionInstallPlanState(0xc420084180, 0x1601da0, 0xc4204bec00, 0xc420cde4c0, 0xb, 0xc4206cb5a0, 0x1d, 0xc420cde500, 0xd, 0xc420cde4e0, ...)
	/go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:992 +0x2ad
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.(*Operator).syncInstallPlans(0xc4204bec00, 0x14a01c0, 0xc420888000, 0xc420888000, 0x1)
```

Version-Release number of selected component (if applicable):

OLM: 4.1.6 (release-4.1)

How reproducible:

Always

Steps to Reproduce:
1. Deploy a ClusterRole/ClusterRoleBinding to the cluster manually first.
2. Deploy an operator that attempts to create the same ClusterRole/ClusterRoleBinding.
3. Observe the panic and crash loop in the catalog-operator pod.

Actual results:

Crash-looping catalog-operator pod.

Expected results:

catalog-operator is able to re-label the ClusterRole/ClusterRoleBinding and continue.

Additional info:

Proposed fix for the master branch (4.2): https://github.com/operator-framework/operator-lifecycle-manager/pull/959

This will also require backporting a fix to the release branches, as the github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog package was changed in a substantial way in https://github.com/operator-framework/operator-lifecycle-manager/pull/892.
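For context on the panic itself: "assignment to entry in nil map" is the error the Go runtime raises when code writes a key into a map that was never initialized. One way this can happen during ExecutePlan is that an object fetched from the API server (here, a manually created ClusterRole) carries a nil Labels map, so re-labeling it without a guard crashes the process. Below is a minimal illustrative sketch of that failure mode and the usual defensive guard; it is not the actual OLM code or the exact fix in PR 959, and the type and label key are hypothetical:

```
package main

import "fmt"

// objectMeta is a hypothetical stand-in for the metadata of an object
// fetched from the API server; Labels is nil when the live object
// carries no labels at all.
type objectMeta struct {
	Name   string
	Labels map[string]string
}

// relabelUnsafe mimics the buggy pattern: it writes into Labels without
// a nil check, which panics with "assignment to entry in nil map".
func relabelUnsafe(o *objectMeta) {
	o.Labels["olm.owner"] = "etcdoperator.v0.9.4-clusterwide"
}

// relabelSafe initializes the map before writing, the usual guard.
func relabelSafe(o *objectMeta) {
	if o.Labels == nil {
		o.Labels = map[string]string{}
	}
	o.Labels["olm.owner"] = "etcdoperator.v0.9.4-clusterwide"
}

func main() {
	// A ClusterRole created manually, with no labels: Labels is nil.
	existing := &objectMeta{Name: "etcdoperator.v0.9.4-clusterwide-test"}

	relabelSafe(existing)
	fmt.Println(existing.Labels) // map[olm.owner:etcdoperator.v0.9.4-clusterwide]

	// Calling relabelUnsafe on a fresh object with a nil Labels map
	// would crash the process, which is what the stack trace above shows.
}
```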
*** Bug 1732911 has been marked as a duplicate of this bug. ***
Hi Evan,

Sorry, I couldn't find the fix PR merged into the release-4.1 branch, or am I missing something? Changing the status to `ASSIGNED` for now since there is no fix PR yet.
Making this the bug for 4.2 and will duplicate for 4.1.z
@Evan,

> Making this the bug for 4.2 and will duplicate for 4.1.z

OK, so I changed the `Target Release` of bug 1732214 to 4.1.z since this one is for 4.2 now.
Hi Christoph,

> Steps to Reproduce:
> 1. Deploy ClusterRole/ClusterRoleBinding to the cluster first manually
> 2. Deploy operator that attempts to create the same ClusterRole/ClusterRoleBinding

Based on my understanding, OLM creates the ClusterRole/ClusterRoleBinding objects with a random suffix in their names, such as:

ClusterRole: etcdoperator.v0.9.4-clusterwide-4t9p5
ClusterRoleBinding: etcdoperator.v0.9.4-clusterwide-4t9p5-etcd-operator-zslsf

So my question is: how can we deploy the operator with the same ClusterRole/ClusterRoleBinding? Thanks!
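For reference, the generated names above follow a base-name-plus-random-suffix shape. Here is a stdlib-only Go sketch of that suffixing scheme; the suffix alphabet and length are assumptions for illustration, not OLM's actual implementation:

```
package main

import (
	"fmt"
	"math/rand"
)

// alphanums approximates a Kubernetes-style suffix alphabet of
// lowercase letters and digits; the exact character set used by OLM
// is an assumption here.
const alphanums = "abcdefghijklmnopqrstuvwxyz0123456789"

// suffixedName appends a 5-character random suffix to a base name,
// matching the shape of the generated names shown above.
func suffixedName(base string) string {
	b := make([]byte, 5)
	for i := range b {
		b[i] = alphanums[rand.Intn(len(alphanums))]
	}
	return fmt.Sprintf("%s-%s", base, b)
}

func main() {
	fmt.Println(suffixedName("etcdoperator.v0.9.4-clusterwide"))
	// e.g. etcdoperator.v0.9.4-clusterwide-4t9p5
}
```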
It's possible to include additional ClusterRole/ClusterRoleBinding objects with static names in the operator bundle. This isn't optimal, but it is the scenario in which we saw this bug trigger.
Christoph,

Yeah, thanks! Below are the test steps; please let me know if any more steps are needed. Thanks!

1) Add ClusterRole/ClusterRoleBinding files to the operator bundle. The static ClusterRole/ClusterRoleBinding names are etcdoperator.v0.9.4-clusterwide-test and etcdoperator.v0.9.4-clusterrolebinding-test, see below:

mac:etcd jianzhang$ pwd
/Users/jianzhang/goproject/src/github.com/operator-framework/operator-registry/manifests/etcd
mac:etcd jianzhang$ ls
etcd.package.yaml        etcdclusterrole.yaml         etcdoperator.v0.9.4-clusterwide.clusterserviceversion.yaml
etcdbackup.crd.yaml      etcdclusterrolebinding.yaml  etcdrestore.crd.yaml
etcdcluster.crd.yaml
mac:etcd jianzhang$ cat etcdclusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: etcdoperator.v0.9.4-clusterwide-test
rules:
- apiGroups:
  - etcd.database.coreos.com
  resources:
  - etcdclusters
  - etcdbackups
  - etcdrestores
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - persistentvolumeclaims
  - events
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - deployments
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get
mac:etcd jianzhang$ cat etcdclusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: etcdoperator.v0.9.4-clusterrolebinding-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: etcdoperator.v0.9.4-clusterwide-test
subjects:
- kind: ServiceAccount
  name: etcd-operator
  namespace: openshift-operators

2) Build a test registry image and push it to Quay.

mac:operator-registry jianzhang$ docker build -f upstream-example.Dockerfile -t quay.io/jiazha/etcd-operator:bug-1732302 .
...
Successfully built b25276cabf1e
Successfully tagged quay.io/jiazha/etcd-operator:bug-1732302
mac:operator-registry jianzhang$ docker push quay.io/jiazha/etcd-operator:bug-1732302
...

3) Create a CatalogSource to consume this test image.

mac:~ jianzhang$ cat cs-bug.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: etcd-bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/jiazha/etcd-operator:bug-1732302
  displayName: ETCD Bug Operators
  publisher: jian

mac:~ jianzhang$ oc get catalogsource -n openshift-marketplace
NAME                  DISPLAY               TYPE   PUBLISHER   AGE
certified-operators   Certified Operators   grpc   Red Hat     3h43m
community-operators   Community Operators   grpc   Red Hat     3h43m
etcd-bug-operator     ETCD Bug Operators    grpc   jian        22s
redhat-operators      Red Hat Operators     grpc   Red Hat     3h43m

4) Create the static ClusterRole/ClusterRoleBinding objects.

mac:operator-registry jianzhang$ oc create -f manifests/etcd/etcdclusterrole.yaml
clusterrole.rbac.authorization.k8s.io/etcdoperator.v0.9.4-clusterwide-test created
mac:operator-registry jianzhang$ oc create -f manifests/etcd/etcdclusterrolebinding.yaml
clusterrolebinding.rbac.authorization.k8s.io/etcdoperator.v0.9.4-clusterrolebinding-test created
mac:~ jianzhang$ oc get clusterrolebinding |grep etcd
etcdoperator.v0.9.4-clusterrolebinding-test   8s
mac:~ jianzhang$ oc get clusterrole |grep etcd
etcdoperator.v0.9.4-clusterwide-test   24s

5) Create this test operator.
mac:~ jianzhang$ cat sub-bug.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  generateName: etcd-bug-
  namespace: openshift-operators
spec:
  source: etcd-bug-operator
  sourceNamespace: openshift-marketplace
  name: etcd
  startingCSV: etcdoperator.v0.9.4-clusterwide
  channel: clusterwide-alpha

mac:~ jianzhang$ oc get sub -n openshift-operators
NAME             PACKAGE   SOURCE              CHANNEL
etcd-bug-kjtv2   etcd      etcd-bug-operator   clusterwide-alpha
mac:~ jianzhang$ oc get csv -n openshift-operators
NAME                              DISPLAY   VERSION             REPLACES   PHASE
etcdoperator.v0.9.4-clusterwide   etcd      0.9.4-clusterwide              Succeeded

6) Check the OLM pod status.

mac:~ jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-7d78f889bf-85vlx   1/1     Running   0          4h6m
olm-operator-5c744884f9-q8l4n       1/1     Running   0          4h6m
packageserver-578f95779-288kf       1/1     Running   0          4h3m
packageserver-578f95779-mjhz6       1/1     Running   0          4h3m

7) Re-run steps 1, 2, 4, 5, and 6 above with a new registry image (quay.io/jiazha/etcd-operator:bug2-1732302) that has no `clusterPermissions` configured in the CSV.

mac:~ jianzhang$ cat cs-bug.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: etcd-bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/jiazha/etcd-operator:bug2-1732302
  displayName: ETCD Bug Operators
  publisher: jian

mac:~ jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-7d78f889bf-85vlx   1/1     Running   0          4h23m
olm-operator-5c744884f9-q8l4n       1/1     Running   0          4h23m
packageserver-578f95779-288kf       1/1     Running   0          4h20m
packageserver-578f95779-mjhz6       1/1     Running   0          4h20m

The OLM pods worked well with no panic. LGTM, verifying it.

Cluster and OLM versions:

mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-07-31-162901   True        False         4h12m   Cluster version is 4.2.0-0.nightly-2019-07-31-162901
mac:~ jianzhang$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-7d78f889bf-85vlx -- olm --version
OLM version: 0.11.0
git commit: d2209c409b35f1db4669c474044decc6995f624d
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922