Bug 1732214

Summary: catalog-operator panic on labelling ClusterRole/ClusterRoleBinding
Product: OpenShift Container Platform Reporter: Christoph Blecker <cblecker>
Component: OLM Assignee: Evan Cordell <ecordell>
OLM sub component: OLM QA Contact: Jian Zhang <jiazha>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: bandrade, ecordell, jfan, scolange
Version: 4.1.z   
Target Milestone: ---   
Target Release: 4.1.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-08-28 19:54:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1732911    

Description Christoph Blecker 2019-07-23 03:25:02 UTC
Description of problem:
The catalog-operator pod is in a crash loop with the following error:

```
time="2019-07-22T21:21:53Z" level=info msg="log level info"
time="2019-07-22T21:21:53Z" level=info msg="TLS keys set, using https for metrics"
time="2019-07-22T21:21:53Z" level=info msg="Using in-cluster kube client config"
time="2019-07-22T21:21:53Z" level=info msg="Using in-cluster kube client config"
time="2019-07-22T21:21:53Z" level=info msg="Using in-cluster kube client config"
time="2019-07-22T21:21:53Z" level=info msg="connection established. cluster-version: v1.13.4+6569b4f"
time="2019-07-22T21:21:53Z" level=info msg="operator ready"
time="2019-07-22T21:21:53Z" level=info msg="starting informers..."
time="2019-07-22T21:21:53Z" level=info msg="waiting for caches to sync..."
time="2019-07-22T21:21:53Z" level=info msg="starting workers..."
time="2019-07-22T21:21:53Z" level=info msg=syncing id=J0bpl ip=install-ssz97 namespace=openshift-dedicated-admin phase=Installing
time="2019-07-22T21:21:53Z" level=info msg="building connection to registry" currentSource="{community-operators openshift-marketplace}" id=WiU2e source=community-operators
time="2019-07-22T21:21:53Z" level=info msg=syncing id=uqL/b ip=install-7h475 namespace=openshift-logging phase=Complete
time="2019-07-22T21:21:53Z" level=info msg="client hasn't yet become healthy, attempt a health check" currentSource="{community-operators openshift-marketplace}" id=WiU2e source=community-operators
time="2019-07-22T21:21:53Z" level=info msg=syncing id=BZh1k ip=install-c2thj namespace=openshift-monitoring phase=Complete
time="2019-07-22T21:21:54Z" level=info msg="retrying openshift-logging"
E0722 21:21:54.040243       1 queueinformer_operator.go:186] Sync "openshift-logging" failed: no catalog sources available
time="2019-07-22T21:21:54Z" level=info msg="retrying openshift-operator-lifecycle-manager"
E0722 21:21:54.239699       1 queueinformer_operator.go:186] Sync "openshift-operator-lifecycle-manager" failed: no catalog sources available
time="2019-07-22T21:21:54Z" level=info msg="building connection to registry" currentSource="{configure-alertmanager-operator-registry openshift-operator-lifecycle-manager}" id=lLs1i source=configure-alertmanager-operator-registry
time="2019-07-22T21:21:54Z" level=info msg="client hasn't yet become healthy, attempt a health check" currentSource="{configure-alertmanager-operator-registry openshift-operator-lifecycle-manager}" id=lLs1i source=configure-alertmanager-operator-registry
panic: assignment to entry in nil map

goroutine 183 [running]:
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.(*Operator).ExecutePlan(0xc4204bec00, 0xc421924000, 0x1, 0x1)
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:1187 +0x4243
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.transitionInstallPlanState(0xc420084180, 0x1601da0, 0xc4204bec00, 0xc420cde4c0, 0xb, 0xc4206cb5a0, 0x1d, 0xc420cde500, 0xd, 0xc420cde4e0, ...)
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:992 +0x2ad
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.(*Operator).syncInstallPlans(0xc4204bec00, 0x14a01c0, 0xc420888000, 0xc420888000, 0x1)
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:938 +0x458
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.(*Operator).(github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.syncInstallPlans)-fm(0x14a01c0, 0xc420888000, 0x27, 0x14a01c0)
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:149 +0x3e
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*Operator).sync(0xc4201c41c0, 0xc420309ec0, 0xc42086fcb0, 0x27, 0xc4207193c0, 0x0)
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:215 +0x1a4
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*Operator).processNextWorkItem(0xc4201c41c0, 0xc420309ec0, 0x0)
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:183 +0xfa
github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*Operator).worker(0xc4201c41c0, 0xc420309ec0)
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:169 +0x35
created by github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer.(*Operator).Run.func1
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:151 +0x9cd
```
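
For readers unfamiliar with this class of panic: "assignment to entry in nil map" is what the Go runtime raises when code writes to a map that was never initialized. The sketch below is only an illustration of that behavior and of the usual guard; it is not the code at operator.go:1187. The `object` type and the `setLabelUnsafe`, `setLabel`, and `panics` helpers are hypothetical names made up for this example.

```go
package main

import "fmt"

// object stands in for a Kubernetes object's metadata. It is a hypothetical
// type for this example, not the real ClusterRole type.
type object struct {
	Name   string
	Labels map[string]string // nil when the object was created without labels
}

// setLabelUnsafe writes straight into Labels and panics when the map is nil.
func setLabelUnsafe(o *object, key, value string) {
	o.Labels[key] = value
}

// setLabel initializes the map on first use, so it is safe for unlabeled objects.
func setLabel(o *object, key, value string) {
	if o.Labels == nil {
		o.Labels = map[string]string{}
	}
	o.Labels[key] = value
}

// panics runs fn and returns the panic message, or "" if fn did not panic.
func panics(fn func()) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	fn()
	return ""
}

func main() {
	unlabeled := &object{Name: "example-clusterrole"}

	// Reading from a nil map is allowed and yields the zero value.
	fmt.Printf("read from nil map: %q\n", unlabeled.Labels["owner"])

	// Writing to a nil map is what triggers the runtime panic.
	fmt.Println("unsafe write:", panics(func() { setLabelUnsafe(unlabeled, "owner", "example") }))

	// The guarded write succeeds on the same object.
	setLabel(unlabeled, "owner", "example")
	fmt.Println("guarded write:", unlabeled.Labels)
}
```

Reads from a nil map succeed, which is why this kind of bug tends to surface only on the first write to an object that arrived without labels.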

Version-Release number of selected component (if applicable):
4.1.6 (release-4.1)


How reproducible:
Consistent if you exercise this code path.


Steps to Reproduce:
1. Manually deploy a ClusterRole/ClusterRoleBinding to the cluster
2. Deploy an operator that attempts to create a ClusterRole/ClusterRoleBinding with the same name
3. Observe the panic and crash loop in the catalog-operator pod

Actual results:
Crash looping catalog-operator pod


Expected results:
catalog-operator is able to re-label ClusterRole/ClusterRoleBinding and continue
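
A rough sketch of the expected create-or-relabel behavior, assuming the existing object was created manually and therefore carries no labels. This is not the OLM implementation or the change in the linked PR; `fakeCluster`, `clusterRole`, `ensureClusterRole`, and the `example/owner` label key are hypothetical stand-ins, and the store is in-memory rather than a real API client.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists mimics the "already exists" error an API server returns.
var errAlreadyExists = errors.New("already exists")

// clusterRole is a simplified stand-in, not the rbac/v1 type.
type clusterRole struct {
	Name   string
	Labels map[string]string
}

// fakeCluster is a hypothetical in-memory store keyed by object name.
type fakeCluster struct {
	roles map[string]*clusterRole
}

// create stores the role, or fails if the name is already taken.
func (c *fakeCluster) create(r *clusterRole) error {
	if _, ok := c.roles[r.Name]; ok {
		return errAlreadyExists
	}
	c.roles[r.Name] = r
	return nil
}

// ensureClusterRole creates the role, or re-labels the existing one when the
// name is already taken. ownerKey and ownerValue are illustrative label names.
func ensureClusterRole(c *fakeCluster, want *clusterRole, ownerKey, ownerValue string) error {
	err := c.create(want)
	if err == nil {
		return nil
	}
	if !errors.Is(err, errAlreadyExists) {
		return err
	}
	existing := c.roles[want.Name]
	// A manually created object may carry no labels at all, so the map
	// must be initialized before the first write.
	if existing.Labels == nil {
		existing.Labels = make(map[string]string)
	}
	existing.Labels[ownerKey] = ownerValue
	return nil
}

func main() {
	cluster := &fakeCluster{roles: map[string]*clusterRole{
		// Pre-existing, manually created role with nil Labels.
		"example-role": {Name: "example-role"},
	}}
	want := &clusterRole{Name: "example-role"}
	if err := ensureClusterRole(cluster, want, "example/owner", "example-csv"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cluster.roles["example-role"].Labels)
}
```

The point of the sketch is the ordering: the nil check happens on the object read back from the cluster, not on the object the operator intended to create.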


Additional info:
Proposed fix for master branch: https://github.com/operator-framework/operator-lifecycle-manager/pull/959

This will also require backporting a fix to the release branches, because the github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog package changed substantially in https://github.com/operator-framework/operator-lifecycle-manager/pull/892

Comment 1 Jian Zhang 2019-07-23 07:03:28 UTC
Hi, Christoph

Thanks for your report. I created bug 1732302 for the 4.1.z version.

@Evan
Do we need to submit a separate fix PR to the release-4.1 branch, or can we just cherry-pick this fix PR to it from the master branch?

Comment 2 Evan Cordell 2019-08-06 21:26:06 UTC
I cherry-picked the master PR to 4.1; it should merge after approval.

Comment 3 Evan Cordell 2019-08-06 21:27:06 UTC
*** Bug 1733324 has been marked as a duplicate of this bug. ***

Comment 5 Jian Zhang 2019-08-20 05:59:18 UTC
LGTM. Verification steps are below:
Cluster version is 4.1.0-0.nightly-2019-08-19-173358
OLM version:                
io.openshift.build.commit.url=https://github.com/operator-framework/operator-lifecycle-manager/commit/e782ca5034ae1fc706145ffd4634ebdffb58b2ba
io.openshift.build.source-location=https://github.com/operator-framework/operator-lifecycle-manager

1) Create a CatalogSource which contains additional ClusterRole/ClusterRoleBinding files.
mac:~ jianzhang$ cat cs-bug.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: etcd-bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/jiazha/etcd-operator:bug-1732302
  displayName: ETCD Bug Operators
  publisher: jian

mac:~ jianzhang$ oc create -f cs-bug.yaml 
catalogsource.operators.coreos.com/etcd-bug-operator created
mac:~ jianzhang$ oc get catalogsource -n openshift-marketplace
NAME                  NAME                  TYPE   PUBLISHER   AGE
certified-operators   Certified Operators   grpc   Red Hat     139m
community-operators   Community Operators   grpc   Red Hat     139m
etcd-bug-operator     ETCD Bug Operators    grpc   jian        17s
redhat-operators      Red Hat Operators     grpc   Red Hat     139m

2) Create the static ClusterRole/ClusterRoleBinding objects.
mac:operator-registry jianzhang$ oc create -f manifests/etcd/etcdclusterrole.yaml 
clusterrole.rbac.authorization.k8s.io/etcdoperator.v0.9.4-clusterwide-test created
mac:operator-registry jianzhang$ oc create -f manifests/etcd/etcdclusterrolebinding.yaml 
clusterrolebinding.rbac.authorization.k8s.io/etcdoperator.v0.9.4-clusterrolebinding-test created
mac:operator-registry jianzhang$ oc get clusterrolebinding |grep etcd
etcdoperator.v0.9.4-clusterrolebinding-test                                       12s
mac:operator-registry jianzhang$ oc get clusterrole |grep etcd
etcdoperator.v0.9.4-clusterwide-test                                   43s

3) Create an OperatorGroup in the openshift-marketplace project.
mac:~ jianzhang$ oc get og -n openshift-marketplace
NAME     AGE
bug-og   32s

4) Subscribe to this test operator.
mac:~ jianzhang$ cat sub-bug.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  generateName: etcd-bug-
  namespace: openshift-marketplace
spec:
  source: etcd-bug-operator
  sourceNamespace: openshift-marketplace
  name: etcd
  startingCSV: etcdoperator.v0.9.4-clusterwide
  channel: clusterwide-alpha

mac:~ jianzhang$ oc get csv -n openshift-marketplace
NAME                              DISPLAY   VERSION             REPLACES   PHASE
etcdoperator.v0.9.4-clusterwide   etcd      0.9.4-clusterwide              Succeeded

mac:~ jianzhang$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-68f759cbc7-v4q4r    1/1     Running   0          154m
community-operators-6c5ffdc5f-ldg5f     1/1     Running   0          154m
etcd-bug-operator-gqkqf                 1/1     Running   0          15m
etcd-operator-bf4866946-m7vdz           3/3     Running   0          47s
marketplace-operator-5fc975bc86-c9qsv   1/1     Running   0          154m
redhat-operators-775568dd5-ckb5k        1/1     Running   0          154m

5) Check the status of the OLM pods.
mac:~ jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5d48c4d4bc-xmg5t   1/1     Running   0          164m
olm-operator-7f66446cfb-cb9zq       1/1     Running   0          164m
olm-operators-jcqbz                 1/1     Running   0          160m
packageserver-5c6d7445df-45j9v      1/1     Running   0          160m
packageserver-5c6d7445df-sd8hj      1/1     Running   0          160m

6) Re-run steps 1, 2, 4, and 5 above with a new registry image (quay.io/jiazha/etcd-operator:bug2-1732302) that has no `clusterPermission` configured in the CSV.

mac:~ jianzhang$ cat cs-bug.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: etcd-bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/jiazha/etcd-operator:bug2-1732302
  displayName: ETCD Bug Operators
  publisher: jian

mac:~ jianzhang$ oc create -f cs-bug.yaml 
catalogsource.operators.coreos.com/etcd-bug-operator created
mac:~ jianzhang$ oc get catalogsource -n openshift-marketplace
NAME                  NAME                  TYPE   PUBLISHER   AGE
certified-operators   Certified Operators   grpc   Red Hat     160m
community-operators   Community Operators   grpc   Red Hat     160m
etcd-bug-operator     ETCD Bug Operators    grpc   jian        5s
redhat-operators      Red Hat Operators     grpc   Red Hat     160m

mac:~ jianzhang$ oc get sub -n openshift-marketplace
NAME             PACKAGE   SOURCE              CHANNEL
etcd-bug-4ls2t   etcd      etcd-bug-operator   clusterwide-alpha
mac:~ jianzhang$ oc get csv -n openshift-marketplace
NAME                              DISPLAY   VERSION             REPLACES   PHASE
etcdoperator.v0.9.4-clusterwide   etcd      0.9.4-clusterwide              Succeeded
mac:~ jianzhang$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-68f759cbc7-v4q4r    1/1     Running   0          162m
community-operators-6c5ffdc5f-ldg5f     1/1     Running   0          162m
etcd-bug-operator-w2jb9                 1/1     Running   0          119s
etcd-operator-bf4866946-vrwfj           3/3     Running   0          79s
marketplace-operator-5fc975bc86-c9qsv   1/1     Running   0          162m
redhat-operators-775568dd5-ckb5k        1/1     Running   0          162m

mac:~ jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5d48c4d4bc-xmg5t   1/1     Running   0          169m
olm-operator-7f66446cfb-cb9zq       1/1     Running   0          169m
olm-operators-jcqbz                 1/1     Running   0          165m
packageserver-5c6d7445df-45j9v      1/1     Running   0          165m
packageserver-5c6d7445df-sd8hj      1/1     Running   0          165m

Comment 7 errata-xmlrpc 2019-08-28 19:54:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2547