Deleting and recreating a Subscription object without any delay causes the operator install to fail.

How to reproduce:
1. Create a CatalogSource object, wait for it to become healthy.
2. Create a Subscription that refers to the CatalogSource above.
3. Wait for the operator to install successfully.
4. Update the CatalogSource.
5. Wait for the updated CatalogSource to become healthy.
6. Delete the Subscription object (created above).
7. Recreate the Subscription object (no time delay between delete and create).

Delete and Create can be done one after another; there is no need to make them concurrent.

Actual Result:
- Operator install fails. The Subscription status has the following condition with reason `ReferencedInstallPlanNotFound`.

{
  "lastTransitionTime": "2019-08-19T18:42:34Z",
  "reason": "ReferencedInstallPlanNotFound",
  "status": "True",
  "type": "InstallPlanMissing"
}

The InstallPlan object referenced by the new Subscription object no longer exists on the cluster.

Root cause:
- OLM uses a lister to get the list of Subscription(s) in a given namespace and sets the relevant Subscription(s) found in the list as owner of the InstallPlan object(s).
- Because the lister uses a cache, it will return a deleted Subscription until the cache is synced.
- The new InstallPlan object may get an owner ref that points to the deleted Subscription.
- The garbage collector (GC) collects the deleted Subscription and consequently deletes the new InstallPlan object.
- The Subscription reconciler reports that the new InstallPlan object is missing and moves the Subscription to a Failed state.

The API audit log has entries that validate that GC is "deleting" the new InstallPlan object.

Fix: For now, use a direct non-cached client to retrieve the list of Subscriptions.
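The root cause above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: `ToyCluster`, `CachedLister`, and `install_plan_survives` are hypothetical names invented for illustration, the garbage collector is reduced to "delete any object whose owners are all gone", and none of this is OLM or client-go code.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Obj:
    kind: str
    name: str
    uid: str
    owner_uids: List[str] = field(default_factory=list)


class ToyCluster:
    """Hypothetical stand-in for the API server: the authoritative store."""

    def __init__(self) -> None:
        self.objects: Dict[str, Obj] = {}
        self._ids = itertools.count(1)

    def create(self, kind: str, name: str, owner_uids=()) -> Obj:
        # Every created object gets a fresh UID, even if the name is reused.
        obj = Obj(kind, name, f"uid-{next(self._ids)}", list(owner_uids))
        self.objects[obj.uid] = obj
        return obj

    def delete(self, uid: str) -> None:
        self.objects.pop(uid, None)

    def list_kind(self, kind: str) -> List[Obj]:
        return [o for o in self.objects.values() if o.kind == kind]

    def collect_garbage(self) -> None:
        # Simplified GC: remove any object whose owners have all been deleted.
        for obj in list(self.objects.values()):
            if obj.owner_uids and not any(u in self.objects for u in obj.owner_uids):
                del self.objects[obj.uid]


class CachedLister:
    """Hypothetical lister: serves a snapshot refreshed only by sync()."""

    def __init__(self, cluster: ToyCluster) -> None:
        self._cluster = cluster
        self._snapshot: List[Obj] = []

    def sync(self) -> None:
        self._snapshot = self._cluster.list_kind("Subscription")

    def list_subscriptions(self) -> List[Obj]:
        return list(self._snapshot)


def install_plan_survives(use_cache: bool) -> bool:
    cluster = ToyCluster()
    lister = CachedLister(cluster)

    old_sub = cluster.create("Subscription", "etcd")
    lister.sync()  # the cache now knows the old Subscription

    cluster.delete(old_sub.uid)             # delete ...
    cluster.create("Subscription", "etcd")  # ... and recreate with no delay

    # The cache has not been synced since the delete/recreate.
    if use_cache:
        subs = lister.list_subscriptions()
    else:
        subs = cluster.list_kind("Subscription")

    # The new InstallPlan is owned by whatever Subscriptions were listed.
    plan = cluster.create("InstallPlan", "install-new", [s.uid for s in subs])

    cluster.collect_garbage()
    return plan.uid in cluster.objects


if __name__ == "__main__":
    print("cached lister, plan survives:", install_plan_survives(use_cache=True))
    print("direct client, plan survives:", install_plan_survives(use_cache=False))
```

With the stale cache the InstallPlan is owned only by the deleted Subscription's UID, so the simplified GC removes it. With the direct listing it is owned by the recreated Subscription and survives, which is the behaviour the proposed fix relies on.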
Cluster version is 4.2.0-0.nightly-2019-08-22-153337

mac:~ jianzhang$ oc -n openshift-operator-lifecycle-manager exec catalog-operator-fb7988846-kbg25 -- olm --version
OLM version: 0.11.0
git commit: 33c4d969a098c1a32e2a2f3f6ca1a1b417923acb

1. Create a CatalogSource object, wait for it to become healthy.

mac:~ jianzhang$ oc create -f cs-bug.yaml
catalogsource.operators.coreos.com/bug-operator created

mac:~ jianzhang$ cat cs-bug.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/jiazha/etcd-operator:bug-1732302
  displayName: Bug Operators
  publisher: jian

mac:~ jianzhang$ oc get catalogsource -n openshift-marketplace
NAME                  DISPLAY               TYPE   PUBLISHER   AGE
bug-operator          Bug Operators         grpc   jian        49s
certified-operators   Certified Operators   grpc   Red Hat     2d22h
community-operators   Community Operators   grpc   Red Hat     2d22h

2. Create a Subscription that refers to the CatalogSource above.
3. Wait for the operator to install successfully.

mac:~ jianzhang$ oc get sub -n openshift-operators
NAME        PACKAGE   SOURCE         CHANNEL
bug-cp6hp   etcd      bug-operator   clusterwide-alpha

mac:~ jianzhang$ oc get csv -n openshift-operators
NAME                              DISPLAY   VERSION             REPLACES   PHASE
etcdoperator.v0.9.4-clusterwide   etcd      0.9.4-clusterwide              Succeeded

mac:~ jianzhang$ oc get pods -n openshift-operators
NAME                             READY   STATUS    RESTARTS   AGE
etcd-operator-7cd8588557-4dl55   3/3     Running   0          114s

4. Update the CatalogSource
Change the catalogsource image to "quay.io/jiazha/etcd-operator:bug2-1732302" from "quay.io/jiazha/etcd-operator:bug-1732302".

5. Wait for the updated CatalogSource to become healthy.

mac:~ jianzhang$ oc get pods -n openshift-marketplace
NAME                 READY   STATUS    RESTARTS   AGE
bug-operator-wp7w7   1/1     Running   0          25s

6. Delete the Subscription object (created above).
7. Recreate the Subscription object (no time delay between delete and create).
Delete and Create can be done one after another; there is no need to make them concurrent.

mac:~ jianzhang$ oc delete sub bug-cp6hp -n openshift-operators
subscription.operators.coreos.com "bug-cp6hp" deleted

mac:~ jianzhang$ oc create -f sub-bug.yaml
subscription.operators.coreos.com/bug-7q6cp created

mac:~ jianzhang$ oc get sub -n openshift-operators
NAME        PACKAGE   SOURCE         CHANNEL
bug-7q6cp   etcd      bug-operator   clusterwide-alpha

mac:~ jianzhang$ oc get ip -n openshift-operators
NAME            CSV                               SOURCE   APPROVAL    APPROVED
install-phrbw   etcdoperator.v0.9.4-clusterwide            Automatic   true
install-pwwxl   etcdoperator.v0.9.4-clusterwide            Automatic   true

mac:~ jianzhang$ oc get csv -n openshift-operators
NAME                              DISPLAY   VERSION             REPLACES   PHASE
etcdoperator.v0.9.4-clusterwide   etcd      0.9.4-clusterwide              Succeeded

mac:~ jianzhang$ oc get pod -n openshift-operators
NAME                             READY   STATUS    RESTARTS   AGE
etcd-operator-7cd8588557-4dl55   3/3     Running   0          16m

The CSV and pod were not updated. The newly created Subscription refers to the old InstallPlan.

mac:~ jianzhang$ oc get sub -n openshift-operators bug-7q6cp -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
...
  conditions:
  - lastTransitionTime: "2019-08-23T10:14:05Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  currentCSV: etcdoperator.v0.9.4-clusterwide
  installPlanRef:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-phrbw
    namespace: openshift-operators
    resourceVersion: "1478436"
    uid: bd1140b9-c58e-11e9-9e50-fa163e923999
  installedCSV: etcdoperator.v0.9.4-clusterwide
  installplan:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-phrbw
    uuid: bd1140b9-c58e-11e9-9e50-fa163e923999
  lastUpdated: "2019-08-23T10:14:07Z"
  state: AtLatestKnown

mac:~ jianzhang$ oc get ip -n openshift-operators install-phrbw -o yaml |grep "conditions:" -A 4
  conditions:
  - lastTransitionTime: "2019-08-23T10:14:05Z"
    lastUpdateTime: "2019-08-23T10:14:05Z"
    status: "True"
    type: Installed

mac:~ jianzhang$ oc get ip -n openshift-operators install-pwwxl -o yaml |grep "conditions:" -A 4
  conditions:
  - lastTransitionTime: "2019-08-23T10:05:48Z"
    lastUpdateTime: "2019-08-23T10:05:48Z"
    status: "True"
    type: Installed

Verification failed.
LGTM, marking as verified.

Cluster Version: 4.2.0-0.nightly-2019-09-09-073137
OLM version: 0.11.0
git commit: d6056ddf181798e740178d2c6cad76d60bd0b52c

Steps used to validate:

1. Create a namespace.

oc create ns test-operators-2

2. Create a CatalogSource object, wait for it to become healthy.

oc apply -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/configmap/configmap_etcd.yaml -n openshift-marketplace
oc apply -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/catalogsource/catalogsource.yaml -n openshift-marketplace

3. Create an OperatorGroup.

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: test-operators-og
  namespace: test-operators-2
spec:
  targetNamespaces:
  - test-operators-2
EOF

4. Create a Subscription that refers to the CatalogSource above.

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-etcdoperator.v0.9.2
  namespace: test-operators-2
spec:
  channel: alpha
  installPlanApproval: Automatic
  name: etcd-update
  source: installed-community-global-operators
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.2
EOF

5. Wait for the operator to install successfully.

oc get csv -n test-operators-2
NAME                  DISPLAY   VERSION   REPLACES   PHASE
etcdoperator.v0.9.2   etcd      0.9.2                Succeeded

6. Update the CatalogSource.

oc apply -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/configmap/configmap_etcdv4.yaml -n openshift-marketplace

7. Wait for the updated CatalogSource to become healthy.

oc get pods -n openshift-marketplace

8. Delete the Subscription object (created above).

oc delete subs etcd-etcdoperator.v0.9.2 -n test-operators-2

9. Recreate the Subscription object (no time delay between delete and create). Delete and Create can be done one after another; there is no need to make them concurrent.

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-etcdoperator.v0.9.2
  namespace: test-operators-2
spec:
  channel: alpha
  installPlanApproval: Automatic
  name: etcd-update
  source: installed-community-global-operators
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.2
EOF
subscription.operators.coreos.com/etcd-etcdoperator.v0.9.2 created

The recreated Subscription references the new InstallPlan:

oc get csv -n test-operators-2
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Succeeded

oc get ip -n test-operators-2
NAME            CSV                   APPROVAL    APPROVED
install-htfxs   etcdoperator.v0.9.2   Automatic   true
install-vvshp   etcdoperator.v0.9.4   Automatic   true

oc get ip install-vvshp -o yaml -n test-operators-2 |grep "conditions:" -A 4
  conditions:
  - lastTransitionTime: "2019-09-10T16:40:14Z"
    lastUpdateTime: "2019-09-10T16:40:14Z"
    status: "True"
    type: Installed

oc get subs etcd-etcdoperator.v0.9.2 -o yaml -n test-operators-2
  installPlanRef:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-vvshp
    namespace: test-operators-2
    resourceVersion: "874190"
    uid: aa245b66-d3e9-11e9-93a3-02ecac833838
  installedCSV: etcdoperator.v0.9.4
  installplan:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-vvshp
    uuid: aa245b66-d3e9-11e9-93a3-02ecac833838
  lastUpdated: "2019-09-10T16:40:18Z"
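
The manual checks in this verification could be scripted roughly as follows. This is a minimal sketch, not part of OLM or `oc`: `check_subscription` is a hypothetical helper, the dictionaries are hand-copied from the output above rather than fetched from a cluster, and the expected CSV (`etcdoperator.v0.9.4`) is taken from the `oc get csv` output.

```python
from typing import Dict, List

# Values copied by hand from the verification output above.
STATUS: Dict = {
    "installPlanRef": {
        "name": "install-vvshp",
        "uid": "aa245b66-d3e9-11e9-93a3-02ecac833838",
    },
    "installedCSV": "etcdoperator.v0.9.4",
}
INSTALL_PLANS: List[Dict] = [
    {"name": "install-htfxs", "csv": "etcdoperator.v0.9.2"},
    {"name": "install-vvshp", "csv": "etcdoperator.v0.9.4"},
]


def check_subscription(status: Dict, install_plans: List[Dict], expected_csv: str) -> List[str]:
    """Hypothetical helper: return a list of problems (empty means all checks passed)."""
    problems: List[str] = []
    ref_name = status.get("installPlanRef", {}).get("name")
    plans_by_name = {p["name"]: p for p in install_plans}
    plan = plans_by_name.get(ref_name)

    if plan is None:
        # Corresponds to the ReferencedInstallPlanNotFound symptom in the report.
        problems.append(f"referenced InstallPlan {ref_name!r} not found")
    elif plan["csv"] != expected_csv:
        problems.append(
            f"InstallPlan {ref_name!r} installs {plan['csv']!r}, expected {expected_csv!r}"
        )

    if status.get("installedCSV") != expected_csv:
        problems.append(
            f"installedCSV is {status.get('installedCSV')!r}, expected {expected_csv!r}"
        )
    return problems


if __name__ == "__main__":
    for problem in check_subscription(STATUS, INSTALL_PLANS, "etcdoperator.v0.9.4"):
        print("FAIL:", problem)
```

Against the values above the helper reports no problems. Removing `install-vvshp` from the InstallPlan list makes it report the missing reference, which mirrors the original failure mode.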
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922