Bug 1702540

Summary: Sometimes fails to create the InstallPlan object.
Product: OpenShift Container Platform
Reporter: Jian Zhang <jiazha>
Component: OLM
OLM sub component: OperatorHub
Assignee: Kevin Rizza <krizza>
QA Contact: Jian Zhang <jiazha>
Status: CLOSED WORKSFORME
Severity: medium
Priority: high
CC: akashem, anli, bandrade, chezhang, dyan, ecordell, jfan, jlee, krizza, scolange, zitang
Version: 4.1.0
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-04-30 13:01:14 UTC
Type: Bug

Description Jian Zhang 2019-04-24 05:21:18 UTC
Description of problem:
Reinstalling operators in a namespace where an operator was previously uninstalled sometimes fails; no InstallPlan is generated.

Version-Release number of selected component (if applicable):
Cluster version is 4.1.0-0.nightly-2019-04-22-005054
The OLM version is: io.openshift.build.commit.id=c718ec855bb26a111d66ba2ba193d30e54f7feb1

How reproducible:
sometimes

Steps to Reproduce:
1. Create a project called `test` and install the `etcd-operator` in it from the Web console.
2. Remove the `etcd-operator` from the Web console.
3. Install the `etcd-operator` and `couchbase` in the same project again.

Actual results:
Both the `etcd-operator` and the `couchbase` operator fail to install; no `InstallPlan` objects are generated.
mac:~ jianzhang$ oc get sub -n test
NAME                             PACKAGE                          SOURCE                     CHANNEL
couchbase-enterprise-certified   couchbase-enterprise-certified   installed-certified-test   preview
etcd                             etcd                             installed-custom-test      alpha
mac:~ jianzhang$ oc get ip -n test
No resources found.
mac:~ jianzhang$ oc get csv -n test
No resources found.
mac:~ jianzhang$ oc get catalogsource -n test
NAME                       NAME                  TYPE      PUBLISHER   AGE
installed-certified-test   Certified Operators   grpc      Certified   24m
installed-custom-test      Custom Operators      grpc      Custom      34m

Expected results:
The `etcd-operator` and `couchbase` operator are reinstalled successfully and `InstallPlan` objects are generated.

Additional info:
mac:~ jianzhang$ oc get pods -n openshift-marketplace
NAME                                         READY     STATUS    RESTARTS   AGE
certified-operators-7ff46f9885-knddq         1/1       Running   0          10h
community-operators-5b4db765fb-mktr6         1/1       Running   0          10h
installed-certified-test-544fb97654-g892x    1/1       Running   0          3h43m
installed-community-global-operators-ph4b7   1/1       Running   0          8h
installed-community-jian-5689d9889c-hnq8m    1/1       Running   0          7h20m
installed-custom-jian-65cfbd784c-mm77s       1/1       Running   0          5h7m
installed-custom-test-567bfd845c-85nq8       1/1       Running   0          3h53m
marketplace-operator-698d46f9d5-prm9j        1/1       Running   0          10h
redhat-operators-5cb954dc98-2n46r            1/1       Running   0          10h

mac:~ jianzhang$ oc logs installed-certified-test-544fb97654-g892x -n openshift-marketplace
time="2019-04-22T10:12:32Z" level=info msg="Using in-cluster kube client config" port=50051 type=appregistry
time="2019-04-22T10:12:32Z" level=info msg="operator source(s) specified are - [https://quay.io/cnr%7Ccertified-operators]" port=50051 type=appregistry
time="2019-04-22T10:12:32Z" level=info msg="package(s) specified are - couchbase-enterprise-certified" port=50051 type=appregistry
time="2019-04-22T10:12:32Z" level=info msg="input has been sanitized" port=50051 type=appregistry
time="2019-04-22T10:12:32Z" level=info msg="sources: [https://quay.io/cnr/certified-operators]" port=50051 type=appregistry
time="2019-04-22T10:12:32Z" level=info msg="packages: [couchbase-enterprise-certified]" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="resolved the following packages: [certified-operators/couchbase-enterprise-certified:1.0.0]" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="downloading repository: certified-operators/couchbase-enterprise-certified:1.0.0 from https://quay.io/cnr" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="download complete - 1 repositories have been downloaded" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="decoding the downloaded operator manifest(s)" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="manifest format is - flattened" port=50051 repository="certified-operators/couchbase-enterprise-certified:1.0.0" type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="decoded successfully" port=50051 repository="certified-operators/couchbase-enterprise-certified:1.0.0" type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="merging all flattened manifests into a single configmap 'data' section" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="decoded 1 flattened and 0 nested operator manifest(s)" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="loading flattened operator manifest(s) into sqlite" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="using configmap loader to build sqlite database" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="loading CRDs" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="loading Bundles" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="loading Packages" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="extracting provided API information" port=50051 type=appregistry
time="2019-04-22T10:12:33Z" level=info msg="serving registry" port=50051 type=appregistry

mac:~ jianzhang$ oc logs installed-custom-test-567bfd845c-85nq8 -n openshift-marketplace
time="2019-04-22T10:02:43Z" level=info msg="Using in-cluster kube client config" port=50051 type=appregistry
time="2019-04-22T10:02:43Z" level=info msg="operator source(s) specified are - [https://quay.io/cnr%7Ccommunity-operators]" port=50051 type=appregistry
time="2019-04-22T10:02:43Z" level=info msg="package(s) specified are - etcd" port=50051 type=appregistry
time="2019-04-22T10:02:43Z" level=info msg="input has been sanitized" port=50051 type=appregistry
time="2019-04-22T10:02:43Z" level=info msg="sources: [https://quay.io/cnr/community-operators]" port=50051 type=appregistry
time="2019-04-22T10:02:43Z" level=info msg="packages: [etcd]" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="resolved the following packages: [community-operators/etcd:0.0.11]" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="downloading repository: community-operators/etcd:0.0.11 from https://quay.io/cnr" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="download complete - 1 repositories have been downloaded" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="decoding the downloaded operator manifest(s)" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="manifest format is - flattened" port=50051 repository="community-operators/etcd:0.0.11" type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="decoded successfully" port=50051 repository="community-operators/etcd:0.0.11" type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="merging all flattened manifests into a single configmap 'data' section" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="decoded 1 flattened and 0 nested operator manifest(s)" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="loading flattened operator manifest(s) into sqlite" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="using configmap loader to build sqlite database" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="loading CRDs" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="loading Bundles" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="loading Packages" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="extracting provided API information" port=50051 type=appregistry
time="2019-04-22T10:02:44Z" level=info msg="serving registry" port=50051 type=appregistry

Comment 1 Jian Zhang 2019-04-24 05:31:51 UTC
CC Abu and Nick. Below is the key debugging info from them on Slack:

Abu:
The subscription object `couchbase-enterprise-certified` does not have any `status` descriptor. The catalog operator failed to sync the `catsrc` (CatalogSource) object associated with the subscription; in contrast, the package API server syncs it successfully. The latest entry in the catalog operator log dates to `2019-04-23T16:07:38Z`, whereas the last time it tried to sync the `installed-certified-test` catsrc was about 6 hours ago.

package api server logs:
time="2019-04-23T15:42:38Z" level=info msg="attempting to add a new grpc connection" action="sync catalogsource" name=installed-certified-test namespace=test
time="2019-04-23T15:42:38Z" level=info msg="new grpc connection added" action="sync catalogsource" name=installed-certified-test namespace=test

catalog operator logs:
time="2019-04-23T09:39:55Z" level=info msg="building connection to registry" currentSource="{installed-certified-test test}" id=as8KJ source=installed-certified-test
time="2019-04-23T09:39:55Z" level=info msg="client hasn't yet become healthy, attempt a health check" currentSource="{installed-certified-test test}" id=as8KJ source=installed-certified-test

I can start the `catalog` operator in debug mode to get more context. I am not sure, though, why it won't try to sync it for this long; I thought the default `resync` interval is set to 15 minutes.
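For reference, a minimal sketch of the commands one could use to gather the same diagnostics (the OLM namespace and deployment names are assumed for OCP 4.1 and were not part of the original Slack thread):

# Does the subscription have a populated .status?
oc get subscription couchbase-enterprise-certified -n test -o yaml

# Catalog operator and package API server logs; OLM components are assumed to
# run in the openshift-operator-lifecycle-manager namespace on OCP 4.1
oc logs deployment/catalog-operator -n openshift-operator-lifecycle-manager | tail -n 50
oc logs deployment/packageserver -n openshift-operator-lifecycle-manager | tail -n 50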

Nick:
You are correct about the default resync period. My guess is that there’s some edge case being hit with registry gRPC connection health and Subscription reconciliation.

Comment 2 Anping Li 2019-04-28 09:21:33 UTC
We hit this issue when we deployed elasticsearch into the openshift-operators namespace. It blocks our testing.

$ oc get pods -n openshift-marketplace
NAME                                                          READY   STATUS             RESTARTS   AGE
certified-operators-7c6cb74d76-khs55                          1/1     Running            0          6h22m
community-operators-6bbbcfbc64-tgbnm                          1/1     Running            0          34m
federation-system-68d5898f5b-f57hc                            0/1     CrashLoopBackOff   11         33m
installed-community-ansible-service-broker-6fd6bcd647-5mzfz   1/1     Running            0          106m
installed-community-openshift-logging-7cd4b76b7-rn9cn         1/1     Running            0          126m
installed-community-openshift-operators-66bfc6bfff-cfh5j      1/1     Running            0          11m
marketplace-operator-56f7d5f5c4-dwwct                         1/1     Running            0          6h22m
redhat-operators-77848897d7-lc929                             1/1     Running            0          6h22m

$ oc get is -n openshift-operators
No resources found.


$ oc get subscription elasticsearch-operator -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: 2019-04-28T09:08:32Z
  generation: 1
  labels:
    csc-owner-name: installed-community-openshift-operators
    csc-owner-namespace: openshift-marketplace
  name: elasticsearch-operator
  namespace: openshift-operators
  resourceVersion: "167086"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions/elasticsearch-operator
  uid: 3244b6e0-6995-11e9-b7cf-0aea0dfb1820
spec:
  channel: preview
  installPlanApproval: Automatic
  name: elasticsearch-operator
  source: installed-community-openshift-operators
  sourceNamespace: openshift-operators
  startingCSV: elasticsearch-operator.v4.1.0
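
For reference, a hedged way to dig further into why no InstallPlan appears for this subscription (commands are a sketch, not part of the original comment):

oc describe subscription elasticsearch-operator -n openshift-operators
oc get events -n openshift-operators --sort-by=.lastTimestamp | tail -n 20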

Comment 3 Anping Li 2019-04-28 09:48:19 UTC
oc get sub/InstallPlan
Error from server (NotFound): subscriptions.operators.coreos.com "InstallPlan" not found

Comment 5 Evan Cordell 2019-04-30 13:01:14 UTC
Hi Jian,

I've followed the reproduction steps you described and don't see the issue. I can install and uninstall operators as many times as I want. I tried with just the prometheus operator, then I tried installing etcd, removing etcd, and installing etcd and couchbase as you described. Both were installed, and InstallPlans were generated every time.

Sometimes, when installing via OperatorHub, it can take a few moments for the install to succeed. But even when trying to make bad things happen (by installing two operators rapidly in an attempt to make OLM's gRPC connection go bad), everything resolved and installed within a few seconds.

Anping,

I don't think your problems are related to this bug report. It looks like federation is crashlooping in your cluster. Because operators interact at the kube-api level, we don't install a new operator or upgrade existing ones until all the current operators are healthy. We're working on making the status for this better so that it's more obvious that's what's happening.
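
As a hedged illustration of what to check in that situation (commands assumed, not from the original comment), look for CSVs that are not in the Succeeded phase and for crashlooping pods before expecting a new InstallPlan:

oc get csv -n openshift-operators        # any CSV stuck in Pending/Installing/Failed?
oc get pods -n openshift-marketplace     # the CrashLoopBackOff federation pod shows up here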

Additionally:

> oc get sub/InstallPlan
> Error from server (NotFound): subscriptions.operators.coreos.com "InstallPlan" not found

InstallPlans are a separate resource, not the name of a subscription resource. Please try `oc get installplans` instead.
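
For example (namespace taken from the subscription in comment 2; output omitted):

oc get subscriptions -n openshift-operators
oc get installplans -n openshift-operators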

Closing as WORKSFORME; please re-open if you can reproduce and can provide steps for me to follow.

Comment 6 Evan Cordell 2019-04-30 13:06:16 UTC
*** Bug 1700453 has been marked as a duplicate of this bug. ***

Comment 7 Abu Kashem 2019-05-03 03:37:25 UTC
Hi Jian,
I am no longer able to reproduce it on my cluster. I followed the steps below:
- create a namespace `test`
- install etcd in `test` using the UI
- uninstall etcd from the UI
- install etcd and couchbase

Both subscriptions are picked up by OLM.

My cluster version: https://openshift-release-artifacts.svc.ci.openshift.org/4.1.0-0.ci-2019-05-02-115733/

Comment 8 Jian Zhang 2019-05-06 07:17:44 UTC
Hi, Evan, Abu

Thanks for your clarification! Yes, I understand that this issue does not always reproduce. We will reopen it when we encounter it again.

@Evan

> Because operators interact at the kube-api level, we don't install a new operator or upgrade existing ones until all the current operators are healthy.

Does this also mean that users cannot install other operators if one of the existing operators has crashed in the cluster? If so, I don't think this is a good approach, since they are in different namespaces.

Comment 10 Red Hat Bugzilla 2023-09-14 05:27:26 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days