Bug 1744245 - Delete and recreate of a subscription object without delay should not cause operator install to fail.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Abu Kashem
QA Contact: Bruno Andrade
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-21 15:45 UTC by Abu Kashem
Modified: 2019-10-16 06:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:37:03 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | operator-framework operator-lifecycle-manager pull 1001 | 0 | None | closed | Bug 1744245: fix e2e failure | 2021-01-26 02:50:57 UTC
Github | operator-framework operator-lifecycle-manager pull 1023 | 0 | None | closed | Bug 1744245: Subscription should not point to deleted ip | 2021-01-26 02:50:57 UTC
Red Hat Product Errata | RHBA-2019:2922 | 0 | None | None | None | 2019-10-16 06:37:11 UTC

Description Abu Kashem 2019-08-21 15:45:12 UTC
Delete and recreate of a subscription object without any delay causes
operator install to fail.

How to reproduce:

1. Create a CatalogSource object, wait for it to become healthy.
2. Create a Subscription that refers to the CatalogSource above.
3. Wait for the operator to install successfully.
4. Update the CatalogSource
5. Wait for the updated CatalogSource to become healthy.
6. Delete the Subscription object (created above).
7. Recreate the Subscription object with no delay between the delete
and the create. The delete and create can be issued one after another; there is no need to make them concurrent.

Actual Result:
- Operator install fails. The Subscription status has the following condition, with reason `ReferencedInstallPlanNotFound`:

{
       "lastTransitionTime": "2019-08-19T18:42:34Z",
       "reason": "ReferencedInstallPlanNotFound",
       "status": "True",
       "type": "InstallPlanMissing"
}
 

The InstallPlan object referenced by the new Subscription object no longer exists on the cluster. 
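
A minimal Go sketch of how a script could detect this failure from the Subscription status. The `Condition` struct only mirrors the four fields shown in the condition above, and `parseConditions` and `installPlanMissing` are hypothetical helpers for illustration, not part of OLM:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Condition mirrors only the fields visible in the condition above.
// It is a simplified stand-in, not the OLM API type.
type Condition struct {
	LastTransitionTime string `json:"lastTransitionTime"`
	Reason             string `json:"reason"`
	Status             string `json:"status"`
	Type               string `json:"type"`
}

// parseConditions decodes a JSON array of conditions.
func parseConditions(raw string) ([]Condition, error) {
	var conds []Condition
	err := json.Unmarshal([]byte(raw), &conds)
	return conds, err
}

// installPlanMissing (hypothetical helper) reports whether any condition
// says that the referenced InstallPlan could not be found.
func installPlanMissing(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "InstallPlanMissing" && c.Status == "True" &&
			c.Reason == "ReferencedInstallPlanNotFound" {
			return true
		}
	}
	return false
}

func main() {
	// The condition from this report, wrapped in a JSON array.
	raw := `[{"lastTransitionTime":"2019-08-19T18:42:34Z",` +
		`"reason":"ReferencedInstallPlanNotFound",` +
		`"status":"True","type":"InstallPlanMissing"}]`
	conds, err := parseConditions(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("install plan missing:", installPlanMissing(conds))
}
```

The input is assumed to be the `conditions` array extracted from the Subscription status; how it is fetched from the cluster is left out on purpose.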


Root cause:
- OLM uses a lister to get the list of Subscription(s) in a given namespace and sets the relevant Subscription(s) found in the list as owners of the InstallPlan object(s).
- Because the lister reads from a cache, it will return a deleted Subscription
until the cache is synced.
- The new InstallPlan object may get an owner reference that points to the
deleted Subscription.
- The garbage collector (GC) collects the deleted Subscription and consequently deletes the new InstallPlan object.
- The Subscription reconciler reports that the new InstallPlan object is missing and moves the Subscription to a Failed state. The API audit log has entries that confirm that the GC is "deleting" the new InstallPlan object.

Fix:

For now, use a direct non-cached client to retrieve the list of
Subscriptions.
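
The root cause and the fix can be illustrated with a small, self-contained Go simulation. This is a sketch under stated assumptions, not the OLM code: `Subscription` and `InstallPlan` are simplified stand-ins for the real API types, the UIDs and plan names are made up, and `collected` only approximates the garbage collector rule that an object is removed once none of its owners exist.

```go
package main

import "fmt"

// Subscription and InstallPlan are simplified stand-ins, not the OLM types.
type Subscription struct {
	Name string
	UID  string
}

type InstallPlan struct {
	Name      string
	OwnerUIDs []string
}

// newInstallPlan sets every listed Subscription as an owner of the plan,
// mirroring how owner references are derived from the listed Subscriptions.
func newInstallPlan(name string, subs []Subscription) InstallPlan {
	plan := InstallPlan{Name: name}
	for _, s := range subs {
		plan.OwnerUIDs = append(plan.OwnerUIDs, s.UID)
	}
	return plan
}

// collected approximates the garbage collector: a plan that has owner
// references is removed once none of those owners exist any more.
func collected(plan InstallPlan, live []Subscription) bool {
	if len(plan.OwnerUIDs) == 0 {
		return false // objects without owners are never collected
	}
	for _, uid := range plan.OwnerUIDs {
		for _, s := range live {
			if s.UID == uid {
				return false // at least one owner still exists
			}
		}
	}
	return true
}

func main() {
	// Hypothetical UIDs: the recreated Subscription gets a new UID.
	oldSub := Subscription{Name: "etcd", UID: "uid-old"}
	newSub := Subscription{Name: "etcd", UID: "uid-new"}

	staleCache := []Subscription{oldSub} // lister cache not yet synced
	live := []Subscription{newSub}       // what the API server holds

	cachedPlan := newInstallPlan("plan-from-cache", staleCache)
	directPlan := newInstallPlan("plan-from-direct-read", live)

	fmt.Println("cached list, plan collected:", collected(cachedPlan, live))
	fmt.Println("direct list, plan collected:", collected(directPlan, live))
}
```

With the stale list, the plan is owned only by the deleted Subscription and is collected; with the direct read, the owner is the live Subscription and the plan survives.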

Comment 2 Jian Zhang 2019-08-23 10:28:41 UTC
Cluster version is 4.2.0-0.nightly-2019-08-22-153337
mac:~ jianzhang$ oc  -n openshift-operator-lifecycle-manager  exec catalog-operator-fb7988846-kbg25 -- olm --version
OLM version: 0.11.0
git commit: 33c4d969a098c1a32e2a2f3f6ca1a1b417923acb

1. Create a CatalogSource object, wait for it to become healthy.
mac:~ jianzhang$ oc create -f cs-bug.yaml 
catalogsource.operators.coreos.com/bug-operator created
mac:~ jianzhang$ cat cs-bug.yaml 
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: bug-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/jiazha/etcd-operator:bug-1732302
  displayName: Bug Operators
  publisher: jian

mac:~ jianzhang$ oc get catalogsource -n openshift-marketplace
NAME                  DISPLAY                        TYPE   PUBLISHER   AGE
bug-operator          Bug Operators                  grpc   jian        49s
certified-operators   Certified Operators            grpc   Red Hat     2d22h
community-operators   Community Operators            grpc   Red Hat     2d22h

2. Create a Subscription that refers to the CatalogSource above.
3. Wait for the operator to install successfully.

mac:~ jianzhang$ oc get sub -n openshift-operators
NAME                     PACKAGE                  SOURCE         CHANNEL
bug-cp6hp                etcd                     bug-operator   clusterwide-alpha

mac:~ jianzhang$ oc get csv -n openshift-operators
NAME                              DISPLAY                  VERSION             REPLACES   PHASE
etcdoperator.v0.9.4-clusterwide   etcd                     0.9.4-clusterwide              Succeeded

mac:~ jianzhang$ oc get pods -n openshift-operators
NAME                                      READY   STATUS    RESTARTS   AGE
etcd-operator-7cd8588557-4dl55            3/3     Running   0          114s

4. Update the CatalogSource
Change the catalogsource image to "quay.io/jiazha/etcd-operator:bug2-1732302" from "quay.io/jiazha/etcd-operator:bug-1732302".

5. Wait for the updated CatalogSource to become healthy.
mac:~ jianzhang$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
bug-operator-wp7w7                      1/1     Running   0          25s

6. Delete the Subscription object (created above).
7. Recreate the Subscription object with no delay between the delete
and the create. The delete and create can be issued one after another; there is no need to make them concurrent.

mac:~ jianzhang$ oc delete sub bug-cp6hp  -n openshift-operators
subscription.operators.coreos.com "bug-cp6hp" deleted
mac:~ jianzhang$ oc create -f sub-bug.yaml 
subscription.operators.coreos.com/bug-7q6cp created

mac:~ jianzhang$ oc get sub  -n openshift-operators
NAME                     PACKAGE                  SOURCE         CHANNEL
bug-7q6cp                etcd                     bug-operator   clusterwide-alpha

mac:~ jianzhang$ oc get ip  -n openshift-operators
NAME            CSV                               SOURCE   APPROVAL    APPROVED
install-phrbw   etcdoperator.v0.9.4-clusterwide            Automatic   true
install-pwwxl   etcdoperator.v0.9.4-clusterwide            Automatic   true

mac:~ jianzhang$ oc get csv  -n openshift-operators
NAME                              DISPLAY                  VERSION             REPLACES   PHASE
etcdoperator.v0.9.4-clusterwide   etcd                     0.9.4-clusterwide              Succeeded
mac:~ jianzhang$ oc get pod  -n openshift-operators
NAME                                      READY   STATUS    RESTARTS   AGE
etcd-operator-7cd8588557-4dl55            3/3     Running   0          16m

The CSV and pod were not updated. The newly created Subscription refers to the old InstallPlan.
mac:~ jianzhang$ oc get sub  -n openshift-operators bug-7q6cp -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
...
  conditions:
  - lastTransitionTime: "2019-08-23T10:14:05Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  currentCSV: etcdoperator.v0.9.4-clusterwide
  installPlanRef:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-phrbw
    namespace: openshift-operators
    resourceVersion: "1478436"
    uid: bd1140b9-c58e-11e9-9e50-fa163e923999
  installedCSV: etcdoperator.v0.9.4-clusterwide
  installplan:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-phrbw
    uuid: bd1140b9-c58e-11e9-9e50-fa163e923999
  lastUpdated: "2019-08-23T10:14:07Z"
  state: AtLatestKnown

mac:~ jianzhang$ oc get ip  -n openshift-operators install-phrbw -o yaml |grep "conditions:" -A 4
  conditions:
  - lastTransitionTime: "2019-08-23T10:14:05Z"
    lastUpdateTime: "2019-08-23T10:14:05Z"
    status: "True"
    type: Installed
mac:~ jianzhang$ oc get ip  -n openshift-operators install-pwwxl -o yaml |grep "conditions:" -A 4
  conditions:
  - lastTransitionTime: "2019-08-23T10:05:48Z"
    lastUpdateTime: "2019-08-23T10:05:48Z"
    status: "True"
    type: Installed

Verification failed.

Comment 4 Bruno Andrade 2019-09-10 16:50:23 UTC
LGTM, marking as verified.

Cluster Version: 4.2.0-0.nightly-2019-09-09-073137
OLM version: 0.11.0
git commit: d6056ddf181798e740178d2c6cad76d60bd0b52c

Steps used to validate:
1. Create a namespace.
oc create ns test-operators-2

2. Create a CatalogSource object, wait for it to become healthy.
oc apply -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/configmap/configmap_etcd.yaml -n openshift-marketplace
oc apply -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/catalogsource/catalogsource.yaml -n openshift-marketplace

3. Create an Operator Group
oc create -f - <<EOF
 apiVersion: operators.coreos.com/v1
 kind: OperatorGroup
 metadata:
   name: test-operators-og
   namespace: test-operators-2
 spec:
   targetNamespaces:
   - test-operators-2
EOF

3. Create a Subscription that refers to the CatalogSource above.
oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-etcdoperator.v0.9.2
  namespace: test-operators-2
spec:
  channel: alpha
  installPlanApproval: Automatic
  name: etcd-update
  source: installed-community-global-operators
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.2
EOF



3. Wait for the operator to install successfully.
oc get csv -n test-operators-2
NAME                  DISPLAY   VERSION   REPLACES   PHASE
etcdoperator.v0.9.2   etcd      0.9.2                Succeeded

4. Update the CatalogSource
oc apply -f https://raw.githubusercontent.com/bandrade/v3-testfiles/v4.1/olm/configmap/configmap_etcdv4.yaml -n openshift-marketplace

5. Wait for the updated CatalogSource to become healthy.
oc get pods -n openshift-marketplace


6. Delete the Subscription object (created above).
oc delete subs etcd-etcdoperator.v0.9.2 -n test-operators-2


7. Recreate the Subscription object with no delay between the delete
and the create. The delete and create can be issued one after another; there is no need to make them concurrent.

oc create -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: etcd-etcdoperator.v0.9.2
  namespace: test-operators-2
spec:
  channel: alpha
  installPlanApproval: Automatic
  name: etcd-update
  source: installed-community-global-operators
  sourceNamespace: openshift-marketplace
  startingCSV: etcdoperator.v0.9.2
EOF

subscription.operators.coreos.com/etcd-etcdoperator.v0.9.2 created

The recreated Subscription references the new InstallPlan:

oc get csv  -n test-operators-2
NAME                  DISPLAY   VERSION   REPLACES              PHASE
etcdoperator.v0.9.4   etcd      0.9.4     etcdoperator.v0.9.2   Succeeded

oc get ip -n test-operators-2
NAME            CSV                   APPROVAL    APPROVED
install-htfxs   etcdoperator.v0.9.2   Automatic   true
install-vvshp   etcdoperator.v0.9.4   Automatic   true

oc get ip install-vvshp -o yaml -n test-operators-2 |grep "conditions:" -A 4
  conditions:
  - lastTransitionTime: "2019-09-10T16:40:14Z"
    lastUpdateTime: "2019-09-10T16:40:14Z"
    status: "True"
    type: Installed

oc get subs etcd-etcdoperator.v0.9.2 -o yaml -n test-operators-2

  installPlanRef:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-vvshp
    namespace: test-operators-2
    resourceVersion: "874190"
    uid: aa245b66-d3e9-11e9-93a3-02ecac833838
  installedCSV: etcdoperator.v0.9.4
  installplan:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-vvshp
    uuid: aa245b66-d3e9-11e9-93a3-02ecac833838
  lastUpdated: "2019-09-10T16:40:18Z"
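
The manual check above could be expressed as a small Go helper. This is only a sketch: `InstallPlan` is a simplified stand-in holding the two columns from the `oc get ip` output, and `refIsCurrent` is a hypothetical function that returns true when the referenced InstallPlan exists and installs the CSV that the Subscription reports as installed.

```go
package main

import "fmt"

// InstallPlan holds only the NAME and CSV columns of `oc get ip`.
// It is a simplified stand-in, not the OLM API type.
type InstallPlan struct {
	Name string
	CSV  string
}

// refIsCurrent (hypothetical helper) checks that the InstallPlan named by
// the Subscription's installPlanRef exists and carries the installed CSV.
func refIsCurrent(plans []InstallPlan, refName, installedCSV string) bool {
	for _, p := range plans {
		if p.Name == refName {
			return p.CSV == installedCSV
		}
	}
	return false // the referenced InstallPlan does not exist
}

func main() {
	// Values taken from the verification output above.
	plans := []InstallPlan{
		{Name: "install-htfxs", CSV: "etcdoperator.v0.9.2"},
		{Name: "install-vvshp", CSV: "etcdoperator.v0.9.4"},
	}
	ok := refIsCurrent(plans, "install-vvshp", "etcdoperator.v0.9.4")
	fmt.Println("subscription references the current install plan:", ok)
}
```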

Comment 5 errata-xmlrpc 2019-10-16 06:37:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

