Description of problem:

When an InstallPlan fails to apply a CRD -- after applying a new CSV -- during an Operator upgrade, it's possible for the new CSV to temporarily transition to Succeeded, which causes OLM to erroneously garbage collect the CSV being replaced, along with required resources that have yet to be adopted by the new CSV (their application is blocked in the InstallPlan behind the failed CRD application). The end result is a new Operator stuck in a permanent Pending state, missing a subset of its required resources.

Version-Release number of selected component (if applicable):
4.4.8

How reproducible:
Always

Steps to Reproduce:
1. Build a catalog containing an operator with two bundles in a package/channel:
   - 0.0.1: requires a CRD "Foo" and specifies permissions on a ServiceAccount "sa"
   - 0.0.2: requires the same CRD "Foo" with an invalid OpenAPI schema, specifies permissions on the same ServiceAccount "sa", and replaces 0.0.1
2. Create a Namespace and an OperatorGroup compatible with both CSVs
3. Create a CatalogSource in the Namespace referencing the catalog
4. Create a Subscription in the Namespace on the package/channel with a Manual approval strategy and startingCSV set to 0.0.1
5. Approve the resulting InstallPlan and wait for the 0.0.1 CSV to transition to Succeeded
6. Approve the next InstallPlan to be generated

Actual results:
- CSV 0.0.1 is deleted, along with ServiceAccount "sa"
- CSV 0.0.2 is in a Pending state, with conditions showing it had transitioned to Succeeded

Expected results:
- CSV 0.0.1 and CSV 0.0.2 are both present
- CSV 0.0.2 never transitions to Succeeded

Additional info:

Customer report:
- Succeeded new CSV: https://gist.github.com/alexeykazakov/2f16daab0b14c83b1852f8a93cbf47bd
- InstallPlan blocked on CRD upgrade issue: https://gist.github.com/alexeykazakov/fa3a224b48091a6d21c5c886666bec22
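For reference, one way to give the 0.0.2 bundle a CRD the apiserver rejects is to use an OpenAPI type that is not valid in OpenAPI v3. This is a hypothetical sketch (the group, names, and field below are illustrative, not taken from the report); applying it fails server-side validation, so the InstallPlan's CRD step fails just after the new CSV has been applied:

```yaml
# Hypothetical CRD fragment for the 0.0.2 bundle. "invalidtype" is not a
# valid OpenAPI v3 type, so the apiserver rejects the CRD update and the
# InstallPlan stalls on this step, reproducing the blocked upgrade.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: foos.example.com
spec:
  group: example.com
  names:
    kind: Foo
    plural: foos
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: invalidtype  # must be one of object/array/string/number/integer/boolean
```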
I can reproduce this. Cluster version is 4.7.0-0.nightly-2020-12-04-013308.

1. Create the index image for the etcd 0.9.2 version.

[root@preserve-olm-env etcd]# opm alpha bundle build -c alpha -e alpha -d ./0.9.2/ -o -b docker -p etcd -t quay.io/olmqe/etcd-bundle:0.9.2-sa
...
[root@preserve-olm-env etcd]# docker push quay.io/olmqe/etcd-bundle:0.9.2-sa
The push refers to repository [quay.io/olmqe/etcd-bundle]
1f7e5652ecb7: Pushed
f9cde18c30f6: Pushed
0.9.2-sa: digest: sha256:5aedf81994df417ea9a051738d499e7bd66b9faf1bf74be528d92b8a35fbae20 size: 732
[root@preserve-olm-env etcd]# opm index add -b quay.io/olmqe/etcd-bundle:0.9.2-sa -t quay.io/olmqe/etcd-index:0.9.2-sa
INFO[0000] building the index    bundles="[quay.io/olmqe/etcd-bundle:0.9.2-sa]"
[root@preserve-olm-env etcd]# docker push quay.io/olmqe/etcd-index:0.9.2-sa
The push refers to repository [quay.io/olmqe/etcd-index]
...

2. Modify the etcdcluster CRD for the etcd 0.9.4 version to add an invalid OpenAPI schema:
https://github.com/jianzhangbjz/community-operators/tree/bug-1857877/community-operators/etcd/0.9.4

1) Create a bundle image:
[root@preserve-olm-env etcd]# opm alpha bundle build -c alpha -e alpha -d ./0.9.4/ -o -b docker -p etcd -t quay.io/olmqe/etcd-bundle:0.9.4-sa
...

2) Add the bundle image to the 0.9.2 index image and generate a new index image, quay.io/olmqe/etcd-index:0.9.4-sa:
[root@preserve-olm-env etcd]# opm index add -f quay.io/olmqe/etcd-index:0.9.2-sa --mode semver -c docker -b quay.io/olmqe/etcd-bundle:0.9.4-sa -t quay.io/olmqe/etcd-index:0.9.4-sa
INFO[0000] building the index    bundles="[quay.io/olmqe/etcd-bundle:0.9.4-sa]"
...

3. Consume this index image on the cluster.
[root@preserve-olm-env etcd]# cat /data/cs-etcd.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: etcd-test
  namespace: openshift-marketplace
spec:
  displayName: Jian Test
  publisher: Jian
  sourceType: grpc
  image: quay.io/olmqe/etcd-index:0.9.4-sa
  updateStrategy:
    registryPoll:
      interval: 10m

[root@preserve-olm-env etcd]# oc get catalogsource -n openshift-marketplace
NAME        DISPLAY     TYPE   PUBLISHER   AGE
...
etcd-test   Jian Test   grpc   Jian        94m

4. Subscribe to the etcd operator with manual approval.

[root@preserve-olm-env etcd]# cat /data/og.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: test-og
  namespace: default
spec:
  targetNamespaces:
  - default

[root@preserve-olm-env etcd]# cat /data/sub-0.9.2.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-sub
  namespace: default
spec:
  installPlanApproval: Manual
  channel: alpha
  name: etcd
  source: etcd-test
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.2

[root@preserve-olm-env etcd]# oc get sub -n default
NAME       PACKAGE   SOURCE      CHANNEL
etcd-sub   etcd      etcd-test   alpha

[root@preserve-olm-env etcd]# oc get ip -n default
NAME            CSV                   APPROVAL   APPROVED
install-mc4cw   etcdoperator.v0.9.2   Manual     false

[root@preserve-olm-env etcd]# oc get ip
NAME            CSV                   APPROVAL   APPROVED
install-jfnrv   etcdoperator.v0.9.4   Manual     false
install-mc4cw   etcdoperator.v0.9.2   Manual     true

[root@preserve-olm-env etcd]# oc get csv
NAME                  DISPLAY   VERSION   REPLACES   PHASE
etcdoperator.v0.9.2   etcd      0.9.2                Succeeded

5. Approve the 0.9.4 InstallPlan.

[root@preserve-olm-env etcd]# oc get ip
NAME            CSV                   APPROVAL   APPROVED
install-jfnrv   etcdoperator.v0.9.4   Manual     true
install-mc4cw   etcdoperator.v0.9.2   Manual     true

[root@preserve-olm-env etcd]# oc get csv
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.2   etcd      0.9.2                           Replacing
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Installing

[root@preserve-olm-env etcd]# oc get csv
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Pending

The etcd-operator ServiceAccount was deleted:

[root@preserve-olm-env etcd]# oc get sa
NAME       SECRETS   AGE
builder    2         4h59m
default    2         5h11m
deployer   2         4h59m

[root@preserve-olm-env etcd]# oc get csv etcdoperator.v0.9.4 -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
...
  phase: Pending
  reason: RequirementsNotMet
  requirementStatus:
  ...
  - group: ""
    kind: ServiceAccount
    message: Service account does not exist
    name: etcd-operator
    status: NotPresent
    version: v1

Test it on a cluster that contains the fix. Cluster version is 4.7.0-0.nightly-2020-12-07-232943.

[root@preserve-olm-env data]# oc -n openshift-operator-lifecycle-manager exec catalog-operator-8649b7f8d5-f4lhq -- olm --version
OLM version: 0.17.0
git commit: 4ee4e876522c4d1b97e59d96588b2468149673eb

Rerun steps 3, 4, and 5 above:

[root@preserve-olm-env data]# oc get sa
NAME            SECRETS   AGE
builder         2         31m
default         2         45m
deployer        2         31m
etcd-operator   2         2m14s

[root@preserve-olm-env data]# oc get csv
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.2   etcd      0.9.2                           Replacing
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Pending
...

The ServiceAccount still exists, and its owner is the v0.9.2 CSV:

[root@preserve-olm-env data]# oc get sa etcd-operator -o yaml
apiVersion: v1
imagePullSecrets:
...
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: false
    kind: ClusterServiceVersion
    name: etcdoperator.v0.9.2
    uid: 6f4527b0-e200-49be-ab5d-7c3c387bc441

The error info is "Service account is not owned by this ClusterServiceVersion", LGTM:

[root@preserve-olm-env data]# oc get csv etcdoperator.v0.9.4 -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  annotations:
...
  - group: ""
    kind: ServiceAccount
    message: Service account is not owned by this ClusterServiceVersion
    name: etcd-operator
    status: PresentNotSatisfied
    version: v1

Verified.
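The behavior change in the fixed build amounts to the requirement check distinguishing "present and owned by this CSV" from "present but owned by another CSV" (the PresentNotSatisfied status above), rather than reporting the requirement as satisfied. A minimal, stdlib-only sketch of that ownership test, with simplified stand-in types (the real operator-lifecycle-manager code uses k8s.io/apimachinery types, and these names are illustrative):

```go
package main

import "fmt"

// OwnerReference is a simplified stand-in for metav1.OwnerReference.
type OwnerReference struct {
	Kind string
	Name string
}

// saRequirementStatus reports the requirement status for a ServiceAccount
// that exists on-cluster, as seen by the CSV named csvName: only an owner
// reference to that CSV itself satisfies the requirement. A ServiceAccount
// owned by a different CSV (e.g. the one being replaced, whose adoption is
// blocked in a stalled InstallPlan) must not count as satisfied.
func saRequirementStatus(owners []OwnerReference, csvName string) string {
	for _, o := range owners {
		if o.Kind == "ClusterServiceVersion" && o.Name == csvName {
			return "Present"
		}
	}
	return "PresentNotSatisfied"
}

func main() {
	// Owners as observed in the verification run above.
	owners := []OwnerReference{{Kind: "ClusterServiceVersion", Name: "etcdoperator.v0.9.2"}}
	fmt.Println(saRequirementStatus(owners, "etcdoperator.v0.9.4")) // PresentNotSatisfied
	fmt.Println(saRequirementStatus(owners, "etcdoperator.v0.9.2")) // Present
}
```

Because the 0.9.4 CSV never reports its requirements as met, it never transitions to Succeeded, so OLM never garbage collects the 0.9.2 CSV or the ServiceAccount it owns.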
*** Bug 1907586 has been marked as a duplicate of this bug. ***
*** Bug 1904585 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633