Description of problem:
Reinstalling operators fails in a namespace where an operator was previously uninstalled; no InstallPlan is generated.

Version-Release number of selected component (if applicable):
Cluster version: 4.1.0-0.nightly-2019-04-22-005054
OLM version: io.openshift.build.commit.id=c718ec855bb26a111d66ba2ba193d30e54f7feb1

How reproducible:
Sometimes

Steps to Reproduce:
1. Create a project called `test` and install the `etcd-operator` in it from the web console.
2. Remove the `etcd-operator` from the web console.
3. Install the `etcd-operator` and the `couchbase` operator in the same project.

Actual results:
Both the `etcd-operator` and the `couchbase` operator fail to install; no `InstallPlan` objects are generated.

mac:~ jianzhang$ oc get sub -n test
NAME                             PACKAGE                          SOURCE                     CHANNEL
couchbase-enterprise-certified   couchbase-enterprise-certified   installed-certified-test   preview
etcd                             etcd                             installed-custom-test      alpha
mac:~ jianzhang$ oc get ip -n test
No resources found.
mac:~ jianzhang$ oc get csv -n test
No resources found.
mac:~ jianzhang$ oc get catalogsource -n test
NAME                       NAME                  TYPE   PUBLISHER   AGE
installed-certified-test   Certified Operators   grpc   Certified   24m
installed-custom-test      Custom Operators      grpc   Custom      34m

Expected results:
The `etcd-operator` and the `couchbase` operator reinstall successfully.
Additional info:

mac:~ jianzhang$ oc get pods -n openshift-marketplace
NAME                                         READY   STATUS    RESTARTS   AGE
certified-operators-7ff46f9885-knddq         1/1     Running   0          10h
community-operators-5b4db765fb-mktr6         1/1     Running   0          10h
installed-certified-test-544fb97654-g892x    1/1     Running   0          3h43m
installed-community-global-operators-ph4b7   1/1     Running   0          8h
installed-community-jian-5689d9889c-hnq8m    1/1     Running   0          7h20m
installed-custom-jian-65cfbd784c-mm77s       1/1     Running   0          5h7m
installed-custom-test-567bfd845c-85nq8       1/1     Running   0          3h53m
marketplace-operator-698d46f9d5-prm9j        1/1     Running   0          10h
redhat-operators-5cb954dc98-2n46r            1/1     Running   0          10h

mac:~ jianzhang$ oc logs installed-certified-test-544fb97654-g892x -n openshift-marketplace
time="2019-04-22T10:12:32Z" level=info msg="Using in-cluster kube client config" port=50051 type=appregistry
time="2019-04-22T10:12:32Z" level=info msg="operator source(s) specified are - [https://quay.io/cnr%7Ccertified-operators]" port=50051 type=appregistry
time="2019-04-22T10:12:32Z" level=info msg="package(s) specified are - couchbase-enterprise-certified" port=50051 type=appregistry
time="2019-04-22T10:12:32Z" level=info msg="input has been sanitized" port=50051 type=appregistry
time="2019-04-22T10:12:32Z" level=info msg="sources: [https://quay.io/cnr/certified-operators]" port=50051 type=appregistry
time="2019-04-22T10:12:32Z" level=info msg="packages: [couchbase-enterprise-certified]" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="resolved the following packages: [certified-operators/couchbase-enterprise-certified:1.0.0]" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="downloading repository: certified-operators/couchbase-enterprise-certified:1.0.0 from https://quay.io/cnr" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="download complete - 1 repositories have been downloaded" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="decoding the downloaded operator manifest(s)" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="manifest format is - flattened" port=50051 repository="certified-operators/couchbase-enterprise-certified:1.0.0" type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="decoded successfully" port=50051 repository="certified-operators/couchbase-enterprise-certified:1.0.0" type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="merging all flattened manifests into a single configmap 'data' section" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="decoded 1 flattened and 0 nested operator manifest(s)" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="loading flattened operator manifest(s) into sqlite" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="using configmap loader to build sqlite database" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="loading CRDs" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="loading Bundles" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="loading Packages" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="extracting provided API information" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="serving registry" port=50051 type=appregistry

mac:~ jianzhang$ oc logs installed-custom-test-567bfd845c-85nq8 -n openshift-marketplace
time="2019-04-22T10:02:43Z" level=info msg="Using in-cluster kube client config" port=50051 type=appregistry
time="2019-04-22T10:02:43Z" level=info msg="operator source(s) specified are - [https://quay.io/cnr%7Ccommunity-operators]" port=50051 type=appregistry
time="2019-04-22T10:02:43Z" level=info msg="package(s) specified are - etcd" port=50051 type=appregistry
time="2019-04-22T10:02:43Z" level=info msg="input has been sanitized" port=50051 type=appregistry
time="2019-04-22T10:02:43Z" level=info msg="sources: [https://quay.io/cnr/community-operators]" port=50051 type=appregistry
time="2019-04-22T10:02:43Z" level=info msg="packages: [etcd]" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="resolved the following packages: [community-operators/etcd:0.0.11]" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="downloading repository: community-operators/etcd:0.0.11 from https://quay.io/cnr" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="download complete - 1 repositories have been downloaded" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="decoding the downloaded operator manifest(s)" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="manifest format is - flattened" port=50051 repository="community-operators/etcd:0.0.11" type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="decoded successfully" port=50051 repository="community-operators/etcd:0.0.11" type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="merging all flattened manifests into a single configmap 'data' section" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="decoded 1 flattened and 0 nested operator manifest(s)" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="loading flattened operator manifest(s) into sqlite" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="using configmap loader to build sqlite database" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="loading CRDs" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="loading Bundles" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="loading Packages" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="extracting provided API information" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="serving registry" port=50051 type=appregistry
CC Abu and Nick; below is the key debugging information from them on Slack:

Abu:
The subscription object `couchbase-enterprise-certified` does not have any `status` descriptor. The catalog operator failed to sync the `catsrc` object associated with the subscription; in contrast, the package API server syncs it successfully. The latest entry in the catalog operator log dates to `2019-04-23T16:07:38Z`, whereas the last time it tried to sync the `installed-certified-test` catsrc was about 6 hours earlier.

package API server logs:
time="2019-04-23T15:42:38Z" level=info msg="attempting to add a new grpc connection" action="sync catalogsource" name=installed-certified-test namespace=test
time="2019-04-23T15:42:38Z" level=info msg="new grpc connection added" action="sync catalogsource" name=installed-certified-test namespace=test

catalog operator logs:
time="2019-04-23T09:39:55Z" level=info msg="building connection to registry" currentSource="{installed-certified-test test}" id=as8KJ source=installed-certified-test
time="2019-04-23T09:39:55Z" level=info msg="client hasn't yet become healthy, attempt a health check" currentSource="{installed-certified-test test}" id=as8KJ source=installed-certified-test

I can start the `catalog` operator in debug mode to get more context. I am not sure, though, why it won't try to sync it for this long; I thought the default `resync` interval is set to 15 minutes.

Nick:
You are correct about the default resync period. My guess is that there's some edge case being hit with registry gRPC connection health and Subscription reconciliation.
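The gap Abu describes can be checked directly from the two timestamps in the log excerpts above (a minimal sketch; the timestamps are copied from the catalog operator logs, and the variable names are illustrative):

```python
from datetime import datetime

# Timestamps taken from the catalog operator log excerpts above.
fmt = "%Y-%m-%dT%H:%M:%SZ"
last_sync_attempt = datetime.strptime("2019-04-23T09:39:55Z", fmt)
latest_log_entry = datetime.strptime("2019-04-23T16:07:38Z", fmt)

gap = latest_log_entry - last_sync_attempt
print(gap)  # 6:27:43 -- far longer than the expected 15-minute resync interval
```

That roughly 6.5-hour gap is what makes the missed resyncs suspicious: with a 15-minute default resync period, the catalog operator should have retried the catsrc sync dozens of times in that window.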
Hit this issue when we deployed elasticsearch into the openshift-operators namespace. This blocks our testing.

$ oc get pods -n openshift-marketplace
NAME                                                          READY   STATUS             RESTARTS   AGE
certified-operators-7c6cb74d76-khs55                          1/1     Running            0          6h22m
community-operators-6bbbcfbc64-tgbnm                          1/1     Running            0          34m
federation-system-68d5898f5b-f57hc                            0/1     CrashLoopBackOff   11         33m
installed-community-ansible-service-broker-6fd6bcd647-5mzfz   1/1     Running            0          106m
installed-community-openshift-logging-7cd4b76b7-rn9cn         1/1     Running            0          126m
installed-community-openshift-operators-66bfc6bfff-cfh5j      1/1     Running            0          11m
marketplace-operator-56f7d5f5c4-dwwct                         1/1     Running            0          6h22m
redhat-operators-77848897d7-lc929                             1/1     Running            0          6h22m

$ oc get is -n openshift-operators
No resources found.

$ oc get subscription elasticsearch-operator -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: 2019-04-28T09:08:32Z
  generation: 1
  labels:
    csc-owner-name: installed-community-openshift-operators
    csc-owner-namespace: openshift-marketplace
  name: elasticsearch-operator
  namespace: openshift-operators
  resourceVersion: "167086"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions/elasticsearch-operator
  uid: 3244b6e0-6995-11e9-b7cf-0aea0dfb1820
spec:
  channel: preview
  installPlanApproval: Automatic
  name: elasticsearch-operator
  source: installed-community-openshift-operators
  sourceNamespace: openshift-operators
  startingCSV: elasticsearch-operator.v4.1.0
$ oc get sub/InstallPlan
Error from server (NotFound): subscriptions.operators.coreos.com "InstallPlan" not found
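For context on that NotFound error: `oc get` (like `kubectl get`) parses a `TYPE/NAME` argument as a resource type followed by an object name, so `sub/InstallPlan` asks for a Subscription literally named "InstallPlan". A minimal shell illustration of that split (no cluster needed; the variable names are only for demonstration):

```shell
#!/bin/sh
# "oc get TYPE/NAME" treats the text before the slash as the resource type
# and the text after it as an object name, so "sub/InstallPlan" looks up a
# Subscription named "InstallPlan" rather than listing InstallPlan resources.
arg="sub/InstallPlan"
type="${arg%%/*}"   # -> sub
name="${arg#*/}"    # -> InstallPlan
echo "type=$type name=$name"
```

Listing the InstallPlan resources themselves would instead use the resource type alone, e.g. `oc get installplans`.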
Hi Jian,

I've followed the reproduction steps you described and don't see the issue. I can install and uninstall operators as many times as I want. I tried with just the prometheus operator, then I tried installing etcd, removing etcd, and installing etcd and couchbase as you described. Both get installed, and InstallPlans were generated every time.

Sometimes, when installing via OperatorHub, it can take a few moments for the install to succeed. But even when trying to make bad things happen (by installing two operators rapidly in an attempt to make OLM's gRPC connection go bad), everything resolved and installed within a few seconds.

Anping, I don't think your problems are related to this bug report. It looks like federation is crash-looping in your cluster. Because operators interact at the kube-api level, we don't install a new operator or upgrade existing ones until all the current operators are healthy. We're working on improving the status reporting so that it's more obvious when that's what's happening.

Additionally:

> oc get sub/InstallPlan
> Error from server (NotFound): subscriptions.operators.coreos.com "InstallPlan" not found

InstallPlans are a separate resource, not the name of a Subscription resource. Please try `oc get installplans` instead.

Closing as WORKSFORME; please reopen if you can reproduce and can provide steps for me to follow.
*** Bug 1700453 has been marked as a duplicate of this bug. ***
Hi Jian,

I am no longer able to reproduce it on my cluster. I followed these steps:
- Create a namespace `test`.
- Install etcd in `test` using the UI.
- Uninstall etcd from the UI.
- Install etcd and couchbase; both subscriptions are picked up by OLM.

My cluster version: https://openshift-release-artifacts.svc.ci.openshift.org/4.1.0-0.ci-2019-05-02-115733/
Hi Evan, Abu,

Thanks for your clarification! Yes, I understand this issue does not always reproduce. We will reopen it if we encounter it again.

@Evan
> Because operators interact at the kube-api level, we don't install a new operator or upgrade existing ones until all the current operators are healthy.

Does this also mean that users cannot install other operators if one operator in the cluster has crashed? If so, I don't think that is a good design, since the operators live in different namespaces.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days