Bug 1732302 - catalog-operator will panic when the installing operator's ClusterRole/ClusterRoleBinding exist
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Evan Cordell
QA Contact: Jian Zhang
URL:
Whiteboard:
Duplicates: 1732911
Depends On:
Blocks: 1732911 1733324
 
Reported: 2019-07-23 07:00 UTC by Jian Zhang
Modified: 2019-10-16 06:30 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1732911 1733324
Environment:
Last Closed: 2019-10-16 06:30:44 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github operator-framework operator-lifecycle-manager pull 959 0 None closed Bug 1732302: Fix panic when binding already exists 2020-11-11 10:49:16 UTC
Github operator-framework operator-lifecycle-manager pull 964 0 None closed Bug 1732214: Fix panic when binding already exists 2020-11-11 10:49:16 UTC
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:30:59 UTC

Description Jian Zhang 2019-07-23 07:00:12 UTC
Description of problem:
This bug is a clone of bug 1732214. It should be fixed in the 4.1.z version.
The catalog-operator pod is in a crash loop with the following error:

```
time="2019-07-22T21:21:53Z" level=info msg="log level info"
time="2019-07-22T21:21:53Z" level=info msg="TLS keys set, using https for metrics"
time="2019-07-22T21:21:53Z" level=info msg="Using in-cluster kube client config"
time="2019-07-22T21:21:53Z" level=info msg="Using in-cluster kube client config"
time="2019-07-22T21:21:53Z" level=info msg="Using in-cluster kube client config"
time="2019-07-22T21:21:53Z" level=info msg="connection established. cluster-version: v1.13.4+6569b4f"
time="2019-07-22T21:21:53Z" level=info msg="operator ready"
time="2019-07-22T21:21:53Z" level=info msg="starting informers..."
time="2019-07-22T21:21:53Z" level=info msg="waiting for caches to sync..."
time="2019-07-22T21:21:53Z" level=info msg="starting workers..."
time="2019-07-22T21:21:53Z" level=info msg=syncing id=J0bpl ip=install-ssz97 namespace=openshift-dedicated-admin phase=Installing
time="2019-07-22T21:21:53Z" level=info msg="building connection to registry" currentSource="{community-operators openshift-marketplace}" id=WiU2e source=community-operators
time="2019-07-22T21:21:53Z" level=info msg=syncing id=uqL/b ip=install-7h475 namespace=openshift-logging phase=Complete
time="2019-07-22T21:21:53Z" level=info msg="client hasn't yet become healthy, attempt a health check" currentSource="{community-operators openshift-marketplace}" id=WiU2e source=community-operators
time="2019-07-22T21:21:53Z" level=info msg=syncing id=BZh1k ip=install-c2thj namespace=openshift-monitoring phase=Complete
time="2019-07-22T21:21:54Z" level=info msg="retrying openshift-logging"
E0722 21:21:54.040243       1 queueinformer_operator.go:186] Sync "openshift-logging" failed: no catalog sources available
time="2019-07-22T21:21:54Z" level=info msg="retrying openshift-operator-lifecycle-manager"
E0722 21:21:54.239699       1 queueinformer_operator.go:186] Sync "openshift-operator-lifecycle-manager" failed: no catalog sources available
time="2019-07-22T21:21:54Z" level=info msg="building connection to registry" currentSource="{configure-alertmanager-operator-registry openshift-operator-lifecycle-manager}" id=lLs1i source=configure-alertmanager-operator-registry
time="2019-07-22T21:21:54Z" level=info msg="client hasn't yet become healthy, attempt a health check" currentSource="{configure-alertmanager-operator-registry openshift-operator-lifecycle-manager}" id=lLs1i source=configure-alertmanager-operator-registry
panic: assignment to entry in nil map

goroutine 183 [running]:
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.(*Operator).ExecutePlan(0xc4204bec00, 0xc421924000, 0x1, 0x1)
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:1187 +0x4243
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.transitionInstallPlanState(0xc420084180, 0x1601da0, 0xc4204bec00, 0xc420cde4c0, 0xb, 0xc4206cb5a0, 0x1d, 0xc420cde500, 0xd, 0xc420cde4e0, ...)
        /go/src/github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog/operator.go:992 +0x2ad
github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog.(*Operator).syncInstallPlans(0xc4204bec00, 0x14a01c0, 0xc420888000, 0xc420888000, 0x1)
```

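`assignment to entry in nil map` is Go's runtime error for writing into a map that was never initialized; an object created outside OLM can come back from the API server with a nil `Labels` map. A minimal sketch of the failure mode and the guard that prevents it (the `ensureLabel` helper and the `olm.owner` key are illustrative, not OLM's actual code):

```go
package main

import "fmt"

// ensureLabel writes an ownership label into a labels map, guarding against
// the nil-map case. In Go, reading from a nil map is safe, but writing to
// one panics with "assignment to entry in nil map".
func ensureLabel(labels map[string]string, key, value string) map[string]string {
	if labels == nil {
		labels = map[string]string{} // initialize before the first write
	}
	labels[key] = value
	return labels
}

func main() {
	// A hand-created ClusterRole may be returned with a nil Labels map.
	var labels map[string]string

	// labels["olm.owner"] = "x" // would panic: assignment to entry in nil map

	labels = ensureLabel(labels, "olm.owner", "etcdoperator.v0.9.4-clusterwide")
	fmt.Println(labels["olm.owner"]) // prints etcdoperator.v0.9.4-clusterwide
}
```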
Version-Release number of selected component (if applicable):
OLM: 4.1.6 (release-4.1)

How reproducible:
always

Steps to Reproduce:
1. Deploy ClusterRole/ClusterRoleBinding to the cluster first manually
2. Deploy operator that attempts to create the same ClusterRole/ClusterRoleBinding
3. Observe panic and crash loop in catalog-operator pod

Actual results:
Crash looping catalog-operator pod

Expected results:
catalog-operator is able to re-label ClusterRole/ClusterRoleBinding and continue
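The expected behavior amounts to: try the create, and on an "already exists" error fall back to adopting the existing object by stamping ownership labels onto it, initializing the label map first. A minimal, self-contained sketch with stand-in types (the real logic lives in `ExecutePlan` in `pkg/controller/operators/catalog/operator.go`; the toy `store`, sentinel error, and `olm.owner` key here are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("already exists")

// clusterRole is a stand-in for the real rbac/v1 ClusterRole type.
type clusterRole struct {
	Name   string
	Labels map[string]string
}

// store is a toy API server holding cluster roles by name.
type store map[string]*clusterRole

func (s store) create(cr *clusterRole) error {
	if _, ok := s[cr.Name]; ok {
		return errAlreadyExists
	}
	s[cr.Name] = cr
	return nil
}

// ensure creates the role, or adopts a pre-existing one by stamping an
// ownership label on it, initializing the Labels map first so the
// fallback path cannot panic.
func ensure(s store, cr *clusterRole, owner string) error {
	err := s.create(cr)
	if err == nil {
		return nil
	}
	if !errors.Is(err, errAlreadyExists) {
		return err
	}
	existing := s[cr.Name]
	if existing.Labels == nil {
		existing.Labels = map[string]string{}
	}
	existing.Labels["olm.owner"] = owner
	return nil
}

func main() {
	name := "etcdoperator.v0.9.4-clusterwide-test"
	// The object already exists, e.g. created by hand, with no labels.
	s := store{name: &clusterRole{Name: name}}
	if err := ensure(s, &clusterRole{Name: name}, "etcdoperator.v0.9.4-clusterwide"); err != nil {
		panic(err)
	}
	fmt.Println(s[name].Labels["olm.owner"]) // prints etcdoperator.v0.9.4-clusterwide
}
```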


Additional info:
Proposed fix for the master branch (4.2): https://github.com/operator-framework/operator-lifecycle-manager/pull/959

This will also require backporting a fix to release branches, as the github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/catalog package was changed in a substantial way in https://github.com/operator-framework/operator-lifecycle-manager/pull/892

Comment 1 Evan Cordell 2019-07-24 16:58:19 UTC
*** Bug 1732911 has been marked as a duplicate of this bug. ***

Comment 3 Jian Zhang 2019-07-25 01:46:44 UTC
Hi, Evan

Sorry, I couldn't find the fix PR merged in the release-4.1 branch, or am I missing something?
Changing the status back to `ASSIGNED` for now since there is no merged fix PR.

Comment 6 Evan Cordell 2019-07-25 17:20:58 UTC
Making this the bug for 4.2; a duplicate will track 4.1.z.

Comment 8 Jian Zhang 2019-07-29 05:31:57 UTC
@Evan,

> Making this the bug for 4.2 and will duplicate for 4.1.z

OK, so I changed the `Target Release` of bug 1732214 to 4.1.z since this one is for 4.2 now.

Comment 9 Jian Zhang 2019-07-29 06:12:18 UTC
Hi, Christoph

> Steps to Reproduce:
> 1. Deploy ClusterRole/ClusterRoleBinding to the cluster first manually
> 2. Deploy operator that attempts to create the same ClusterRole/ClusterRoleBinding

Based on my understanding, OLM creates the `ClusterRole/ClusterRoleBinding` objects with a random suffix, for example:
ClusterRole: etcdoperator.v0.9.4-clusterwide-4t9p5  
ClusterRoleBinding: etcdoperator.v0.9.4-clusterwide-4t9p5-etcd-operator-zslsf 
So, my question is how can we deploy the operator with the same ClusterRole/ClusterRoleBinding? Thanks!

Comment 10 Christoph Blecker 2019-07-30 18:07:57 UTC
It's possible to include additional ClusterRole/ClusterRoleBinding objects with static names in the operator bundle. This isn't optimal, but it is the scenario in which we saw this bug trigger.

Comment 11 Jian Zhang 2019-08-01 06:12:28 UTC
Christoph,

Yeah, thanks! Below are the test steps; please let me know if any steps are missing. Thanks!
1) Add ClusterRole/ClusterRoleBinding files to the operator bundle. The static ClusterRole/ClusterRoleBinding names are etcdoperator.v0.9.4-clusterwide-test and etcdoperator.v0.9.4-clusterrolebinding-test; see below:
mac:etcd jianzhang$ pwd
/Users/jianzhang/goproject/src/github.com/operator-framework/operator-registry/manifests/etcd
mac:etcd jianzhang$ ls
etcd.package.yaml                                          etcdclusterrolebinding.yaml
etcdbackup.crd.yaml                                        etcdoperator.v0.9.4-clusterwide.clusterserviceversion.yaml
etcdcluster.crd.yaml                                       etcdrestore.crd.yaml
etcdclusterrole.yaml
mac:etcd jianzhang$ cat etcdclusterrole.yaml 
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: etcdoperator.v0.9.4-clusterwide-test
rules:
- apiGroups:
  - etcd.database.coreos.com
  resources:
  - etcdclusters
  - etcdbackups
  - etcdrestores
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - persistentvolumeclaims
  - events
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - deployments
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get

mac:etcd jianzhang$ cat etcdclusterrolebinding.yaml 
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: etcdoperator.v0.9.4-clusterrolebinding-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: etcdoperator.v0.9.4-clusterwide-test
subjects:
- kind: ServiceAccount
  name: etcd-operator
  namespace: openshift-operators

2) Build a test registry image and push it to Quay.
mac:operator-registry jianzhang$ docker build -f upstream-example.Dockerfile -t quay.io/jiazha/etcd-operator:bug-1732302 .
...
Successfully built b25276cabf1e
Successfully tagged quay.io/jiazha/etcd-operator:bug-1732302
mac:operator-registry jianzhang$ docker push quay.io/jiazha/etcd-operator:bug-1732302
...

3) Create a CatalogSource to consume this test image.
mac:~ jianzhang$ cat cs-bug.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: etcd-bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/jiazha/etcd-operator:bug-1732302
  displayName: ETCD Bug Operators
  publisher: jian

mac:~ jianzhang$ oc get catalogsource -n openshift-marketplace
NAME                  NAME                  TYPE   PUBLISHER   AGE
certified-operators   Certified Operators   grpc   Red Hat     3h43m
community-operators   Community Operators   grpc   Red Hat     3h43m
etcd-bug-operator     ETCD Bug Operators    grpc   jian        22s
redhat-operators      Red Hat Operators     grpc   Red Hat     3h43m

4) Create the static ClusterRole/ClusterRoleBinding objects.
mac:operator-registry jianzhang$ oc create -f manifests/etcd/etcdclusterrole.yaml 
clusterrole.rbac.authorization.k8s.io/etcdoperator.v0.9.4-clusterwide-test created
mac:operator-registry jianzhang$ oc create -f manifests/etcd/etcdclusterrolebinding.yaml 
clusterrolebinding.rbac.authorization.k8s.io/etcdoperator.v0.9.4-clusterrolebinding-test created
mac:~ jianzhang$ oc get clusterrolebinding |grep etcd
etcdoperator.v0.9.4-clusterrolebinding-test                                       8s
mac:~ jianzhang$ oc get clusterrole |grep etcd
etcdoperator.v0.9.4-clusterwide-test                                   24s

5) Create this test operator.
mac:~ jianzhang$ cat sub-bug.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  generateName: etcd-bug-
  namespace: openshift-operators
spec:
  source: etcd-bug-operator
  sourceNamespace: openshift-marketplace
  name: etcd
  startingCSV: etcdoperator.v0.9.4-clusterwide
  channel: clusterwide-alpha

mac:~ jianzhang$ oc get sub -n openshift-operators
NAME             PACKAGE   SOURCE              CHANNEL
etcd-bug-kjtv2   etcd      etcd-bug-operator   clusterwide-alpha
mac:~ jianzhang$ oc get csv -n openshift-operators
NAME                              DISPLAY   VERSION             REPLACES   PHASE
etcdoperator.v0.9.4-clusterwide   etcd      0.9.4-clusterwide              Succeeded

6) Check the OLM pod status.
mac:~ jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-7d78f889bf-85vlx   1/1     Running   0          4h6m
olm-operator-5c744884f9-q8l4n       1/1     Running   0          4h6m
packageserver-578f95779-288kf       1/1     Running   0          4h3m
packageserver-578f95779-mjhz6       1/1     Running   0          4h3m

7) Re-run the above steps with a new registry image (quay.io/jiazha/etcd-operator:bug2-1732302) whose CSV has no `clusterPermissions` configured.
mac:~ jianzhang$ cat cs-bug.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: etcd-bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/jiazha/etcd-operator:bug2-1732302
  displayName: ETCD Bug Operators
  publisher: jian
mac:~ jianzhang$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-7d78f889bf-85vlx   1/1     Running   0          4h23m
olm-operator-5c744884f9-q8l4n       1/1     Running   0          4h23m
packageserver-578f95779-288kf       1/1     Running   0          4h20m
packageserver-578f95779-mjhz6       1/1     Running   0          4h20m

The OLM pods work well with no panic. LGTM, verifying it. Cluster and OLM versions:
mac:~ jianzhang$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-07-31-162901   True        False         4h12m   Cluster version is 4.2.0-0.nightly-2019-07-31-162901

mac:~ jianzhang$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-7d78f889bf-85vlx -- olm --version
OLM version: 0.11.0
git commit: d2209c409b35f1db4669c474044decc6995f624d

Comment 12 errata-xmlrpc 2019-10-16 06:30:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

