Bug 1706232

Summary: Recreating a CatalogSource and subscription for something from that catalog source results in 'stuck' subscription
Product: OpenShift Container Platform
Reporter: Paul Morie <pmorie>
Component: OLM
Assignee: Evan Cordell <ecordell>
OLM sub component: OLM
QA Contact: Cuiping HUO <chuo>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: chuo, eparis, jiazha
Version: 4.1.0
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Paul Morie 2019-05-03 20:50:01 UTC
Description of problem:

When a CatalogSource is deleted (with all associated subscriptions torn down) and then recreated, a new subscription to something offered by that CatalogSource appears to be 'stuck' (i.e., it never gains a status) until the catalog-operator pod is deleted and the subscription is recreated or modified.

Version-Release number of selected component (if applicable):


How reproducible:

1. Create a catalog source.
2. Create a subscription to something that comes from that catalog source.
3. Everything works OK.
4. Delete the subscription.
5. Delete the catalog source.
6. Create the catalog source.
7. Create the subscription.
8. Nothing happens.
9. Delete the catalog-operator pod.
10. Sometimes things get unstuck at this point. More often, the Subscription gains a status only after it is bumped in some way: restarting the catalog-operator pod alone appears to be insufficient, and the subscription resource has to be modified manually.
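
Steps 4 through 8 could be scripted roughly as follows. This is a minimal sketch, not the reporter's actual reproducer: the function name `repro_cycle`, the manifest file arguments, and the 60-second settle time are assumptions made for illustration. It relies only on `oc delete`, `oc apply`, and `oc get`, and assumes a logged-in `oc` session.

```shell
# Hypothetical reproducer sketch for steps 4-8; manifest paths are placeholders.
# Usage: repro_cycle <namespace> <catalogsource.yaml> <subscription.yaml> <subscription-name> [settle-seconds]
repro_cycle() {
  ns="$1"; catsrc_file="$2"; sub_file="$3"; sub_name="$4"; settle="${5:-60}"

  # Steps 4-5: tear down the subscription, then the catalog source.
  oc delete -n "$ns" -f "$sub_file" --ignore-not-found
  oc delete -n "$ns" -f "$catsrc_file" --ignore-not-found

  # Steps 6-7: recreate both from the same manifests.
  oc apply -n "$ns" -f "$catsrc_file"
  oc apply -n "$ns" -f "$sub_file"

  # Step 8: give the catalog-operator time to react (assumed delay), then read the state.
  sleep "$settle"
  state="$(oc get subscription.operators.coreos.com "$sub_name" -n "$ns" \
    -o jsonpath='{.status.state}')" || state=""
  if [ -z "$state" ]; then
    echo "STUCK: subscription $sub_name has no status.state"
    return 1
  fi
  echo "OK: $state"
}
```

Since the problem is intermittent, calling the function in a loop (for example, ten times in the same namespace) is more likely to hit a stuck subscription than a single pass.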

Actual results:

Subscription does not gain a status without outside intervention.

Expected results:

Subscriptions should be serviced when they are created; if there is a problem servicing a resource it should be reflected in the status.
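
One way to check this expectation from the command line is to poll the Subscription until `.status.state` is populated. The helper below is a sketch under stated assumptions: the name `wait_for_sub_state` and the default of 30 attempts at 10-second intervals are illustrative, and it assumes `oc` runs with its default handling of missing template keys, so that an absent field yields empty output.

```shell
# Hypothetical helper: poll until the Subscription reports a state, or give up.
# Usage: wait_for_sub_state <name> <namespace> [attempts] [interval-seconds]
wait_for_sub_state() {
  name="$1"; ns="$2"; attempts="${3:-30}"; interval="${4:-10}"
  n=0
  while [ "$n" -lt "$attempts" ]; do
    # An empty result is treated as "no status yet" (also covers a failed lookup).
    state="$(oc get subscription.operators.coreos.com "$name" -n "$ns" \
      -o jsonpath='{.status.state}' 2>/dev/null)" || state=""
    if [ -n "$state" ]; then
      echo "$state"
      return 0
    fi
    n=$((n + 1))
    sleep "$interval"
  done
  echo "subscription $name in $ns has no status.state after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_sub_state etcd test` prints the state once it appears and returns non-zero if the subscription never gains one within the allotted attempts.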

Additional info:

I am happy to provide a reproducer for this interactively with someone; I know that folks have said that they have trouble reproducing this.

Comment 1 Jian Zhang 2019-05-06 10:21:18 UTC
Paul,

Thanks for reporting this! I could NOT reproduce the issue with the version below:
OLM version: io.openshift.build.commit.id=b2d1cd21368bc8cc10e4ca11a231f09077630c33
Cluster version is 4.1.0-0.nightly-2019-05-06-011159

1, Create a new project called "debug" and install the "AMQ" operator in it.
mac:~ jianzhang$ oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
amq-streams-cluster-operator-779f9ffbd4-dfz69   1/1     Running   0          8m28s

2, Delete the subscription and catalog source.
3, Recreate the subscription and catalog source.
4, Check the status of the subscription. It works well, as below:
mac:~ jianzhang$ oc get sub
NAME          PACKAGE       SOURCE                   CHANNEL
amq-streams   amq-streams   installed-redhat-debug   stable
mac:~ jianzhang$ oc get sub amq-streams -o go-template='{{ .status.state }}'
AtLatestKnownmac:~ jianzhang$ 

Could you share the detailed steps to reproduce this issue? Thanks!

Comment 2 Evan Cordell 2019-05-07 15:23:23 UTC
I was finally able to reproduce this by running the same test multiple times in a cluster - thanks for the report!

This is fixed in this commit: https://github.com/operator-framework/operator-lifecycle-manager/pull/846/commits/8d9664a6e3ecbf5615a1e74911a6a87efb11e998 (may go in a different PR depending on how other PRs merge)

After this change, I can no longer reproduce the stuck subscription bug.

Comment 3 Evan Cordell 2019-05-07 17:32:50 UTC
Proposed fix doesn't address the issue.

Comment 4 Evan Cordell 2019-05-07 18:20:03 UTC
This PR contains the fix for the issue: https://github.com/operator-framework/operator-lifecycle-manager/pull/847

Comment 5 Evan Cordell 2019-05-07 18:25:34 UTC
*** Bug 1704940 has been marked as a duplicate of this bug. ***

Comment 10 Cuiping HUO 2019-05-10 09:27:31 UTC
Verification failed with the version below:
OLM version: io.openshift.build.commit.id=19e7914e33f723c6f77f7aaa0892c7684ce94ed4
Cluster version: 4.1.0-0.nightly-2019-05-09-182710

1, Install the "etcd" operator in project "default".
[chuo@dhcp-140-165 .kube]$ oc get sub
NAME                             PACKAGE                          SOURCE                        CHANNEL
couchbase-enterprise-certified   couchbase-enterprise-certified   installed-certified-default   preview
etcd                             etcd                             installed-community-default   singlenamespace-alpha
[chuo@dhcp-140-165 .kube]$ oc get catsrc
NAME                          NAME                  TYPE   PUBLISHER   AGE
installed-certified-default   Certified Operators   grpc   Certified   38m
installed-community-default   Community Operators   grpc   Community   46s
[chuo@dhcp-140-165 .kube]$ oc get ip
NAME            CSV                         SOURCE   APPROVAL    APPROVED
install-cgsm4   couchbase-operator.v1.1.0            Automatic   true
install-g7wmf   etcdoperator.v0.9.4                  Manual      false

2, Manually approve the ip (InstallPlan).
[chuo@dhcp-140-165 .kube]$ oc edit ip install-g7wmf
installplan.operators.coreos.com/install-g7wmf edited
[chuo@dhcp-140-165 .kube]$ oc get ip
NAME            CSV                         SOURCE   APPROVAL    APPROVED
install-g7wmf   etcdoperator.v0.9.4                  Manual      true
[chuo@dhcp-140-165 .kube]$ oc get csv
NAME                        DISPLAY              VERSION   REPLACES              PHASE
etcdoperator.v0.9.4         etcd                 0.9.4     etcdoperator.v0.9.2   Succeeded

3, Delete the subscription and catalog source.
[chuo@dhcp-140-165 .kube]$ oc delete catsrc installed-community-default
catalogsource.operators.coreos.com "installed-community-default" deleted
[chuo@dhcp-140-165 .kube]$ oc delete sub etcd
subscription.operators.coreos.com "etcd" deleted
[chuo@dhcp-140-165 .kube]$ oc get catsrc
NAME                          NAME                  TYPE   PUBLISHER   AGE
installed-certified-default   Certified Operators   grpc   Certified   46m
[chuo@dhcp-140-165 .kube]$ oc get sub
NAME                             PACKAGE                          SOURCE                        CHANNEL
couchbase-enterprise-certified   couchbase-enterprise-certified   installed-certified-default   preview
[chuo@dhcp-140-165 .kube]$ oc get ip
NAME            CSV                         SOURCE   APPROVAL    APPROVED
install-cgsm4   couchbase-operator.v1.1.0            Automatic   true
install-g7wmf   etcdoperator.v0.9.4                  Manual      true

The subscription can be deleted from the back end, but the ip (InstallPlan) still exists. Meanwhile, in the web console under Installed Operators (for project "default"), the etcd operator still appears with status "InstallSucceeded".

Comment 12 Cuiping HUO 2019-05-13 08:40:30 UTC
Verification succeeded with the version below:
OLM version: io.openshift.build.commit.id=19e7914e33f723c6f77f7aaa0892c7684ce94ed4
Cluster version: 4.1.0-0.nightly-2019-05-09-182710

1, Install the "etcd" operator in project "test".
[chuo@dhcp-140-165 .kube]$ oc get sub
NAME   PACKAGE   SOURCE                     CHANNEL
etcd   etcd      installed-community-test   singlenamespace-alpha
[chuo@dhcp-140-165 .kube]$ oc get catsrc
NAME                       NAME                  TYPE   PUBLISHER   AGE
installed-community-test   Community Operators   grpc   Community   2m16s
[chuo@dhcp-140-165 .kube]$ oc get ip
NAME            CSV                   SOURCE   APPROVAL    APPROVED
install-7wnvx   etcdoperator.v0.9.4            Automatic   true

2, Delete the subscription and catalog source.
3, Re-create the subscription and catalog source.
[chuo@dhcp-140-165 .kube]$ oc get sub
NAME   PACKAGE   SOURCE                     CHANNEL
etcd   etcd      installed-community-test   singlenamespace-alpha
[chuo@dhcp-140-165 .kube]$ oc get catsrc
NAME                       NAME                  TYPE   PUBLISHER   AGE
installed-community-test   Community Operators   grpc   Community   2m16s
[chuo@dhcp-140-165 .kube]$ oc get ip
NAME            CSV                   SOURCE   APPROVAL    APPROVED
install-7wnvx   etcdoperator.v0.9.4            Automatic   true

4, Repeat steps 2 and 3 ten times; the subscription succeeds all 10 times.
5, Delete the catalog-operator pod and wait until the new pod is running.
[chuo@dhcp-140-165 .kube]$ oc delete po catalog-operator-569b689878-g8zzh -n openshift-operator-lifecycle-manager
pod "catalog-operator-569b689878-g8zzh" deleted
[chuo@dhcp-140-165 .kube]$ oc  get po -n openshift-operator-lifecycle-manager 
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-569b689878-p7c2f   0/1     Running   0          14s

6, Re-create the subscription and catalog source.
[chuo@dhcp-140-165 .kube]$ oc get sub
NAME   PACKAGE   SOURCE                     CHANNEL
etcd   etcd      installed-community-test   singlenamespace-alpha
[chuo@dhcp-140-165 .kube]$ oc get ip
NAME            CSV                   SOURCE   APPROVAL    APPROVED
install-znv6w   etcdoperator.v0.9.4            Automatic   true
[chuo@dhcp-140-165 .kube]$ oc get catsrc
NAME                       NAME                  TYPE   PUBLISHER   AGE
installed-community-test   Community Operators   grpc   Community   2m33s

[chuo@dhcp-140-165 .kube]$ oc get sub etcd -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: "2019-05-13T08:34:10Z"
  generation: 1
  labels:
    csc-owner-name: installed-community-test
    csc-owner-namespace: openshift-marketplace
  name: etcd
  namespace: test
  resourceVersion: "149673"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/test/subscriptions/etcd
  uid: e1912e81-7559-11e9-8544-0aa12d6c2fce
spec:
  channel: singlenamespace-alpha
  installPlanApproval: Automatic
  name: etcd
  source: installed-community-test
  sourceNamespace: test
  startingCSV: etcdoperator.v0.9.4
status:
  currentCSV: etcdoperator.v0.9.4
  installPlanRef:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-znv6w
    namespace: test
    resourceVersion: "149643"
    uid: e2095587-7559-11e9-8bba-02c4299c1f3a
  installedCSV: etcdoperator.v0.9.4
  installplan:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-znv6w
    uuid: e2095587-7559-11e9-8bba-02c4299c1f3a
  lastUpdated: "2019-05-13T08:34:14Z"
  state: AtLatestKnown

[chuo@dhcp-140-165 .kube]$ oc get catsrc installed-community-test -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  creationTimestamp: "2019-05-13T08:34:02Z"
  generation: 1
  labels:
    csc-owner-name: installed-community-test
    csc-owner-namespace: openshift-marketplace
  name: installed-community-test
  namespace: test
  resourceVersion: "152087"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/test/catalogsources/installed-community-test
  uid: dcdf4ca8-7559-11e9-b532-06d8365d7bd0
spec:
  address: 172.30.208.199:50051
  displayName: Community Operators
  icon:
    base64data: ""
    mediatype: ""
  publisher: Community
  sourceType: grpc
status:
  lastSync: "2019-05-13T08:39:09Z"
  registryService:
    createdAt: "2019-05-13T08:39:07Z"
    protocol: grpc

Comment 14 errata-xmlrpc 2019-06-04 10:48:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758