Bug 1702540
| Summary: | Sometimes fail to create the InstallPlan object. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jian Zhang <jiazha> |
| Component: | OLM | Assignee: | Kevin Rizza <krizza> |
| OLM sub component: | OperatorHub | QA Contact: | Jian Zhang <jiazha> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | medium | | |
| Priority: | high | CC: | akashem, anli, bandrade, chezhang, dyan, ecordell, jfan, jlee, krizza, scolange, zitang |
| Version: | 4.1.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-30 13:01:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jian Zhang
2019-04-24 05:21:18 UTC
CC Abu and Nick. Below is the key debugging info from them on Slack:
Abu:
The subscription object `couchbase-enterprise-certified` does not have any `status` descriptor. The catalog operator failed to sync the `catsrc` (CatalogSource) object associated with the subscription; in contrast, the package API server syncs it successfully. The latest entry in the catalog operator log is from `2019-04-23T16:07:38Z`, whereas the last time it tried to sync the `installed-certified-test` catsrc was about 6 hours earlier.
package api server logs:
time="2019-04-23T15:42:38Z" level=info msg="attempting to add a new grpc connection" action="sync catalogsource" name=installed-certified-test namespace=test
time="2019-04-23T15:42:38Z" level=info msg="new grpc connection added" action="sync catalogsource" name=installed-certified-test namespace=test
catalog operator logs:
time="2019-04-23T09:39:55Z" level=info msg="building connection to registry" currentSource="{installed-certified-test test}" id=as8KJ source=installed-certified-test
time="2019-04-23T09:39:55Z" level=info msg="client hasn't yet become healthy, attempt a health check" currentSource="{installed-certified-test test}" id=as8KJ source=installed-certified-test
I can start the `catalog` operator in debug mode to get more context. I am not sure, though, why it won't try to sync it for this long. I thought the default `resync` interval was set to 15 minutes.
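(For reference, a minimal sketch of commands that could surface the state described above; the subscription/catsrc names and the `test` namespace come from the logs in this report, while the `openshift-operator-lifecycle-manager` namespace for the catalog operator deployment is an assumption based on a default 4.1 install.)
# Does the Subscription have a status block yet? (assumed resource names from the logs above)
$ oc get subscription couchbase-enterprise-certified -n test -o jsonpath='{.status}'
# Inspect the CatalogSource the subscription points at
$ oc get catalogsource installed-certified-test -n test -o yaml
# Look for recent sync attempts in the catalog operator log (assumes the default OLM namespace)
$ oc logs deployment/catalog-operator -n openshift-operator-lifecycle-manager | grep installed-certified-test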
Nick:
You are correct about the default resync period. My guess is that there’s some edge case being hit with registry gRPC connection health and Subscription reconciliation.
We hit this issue when we deployed elasticsearch into the openshift-operators namespace. That blocks our testing.
$ oc get pods -n openshift-marketplace
NAME READY STATUS RESTARTS AGE
certified-operators-7c6cb74d76-khs55 1/1 Running 0 6h22m
community-operators-6bbbcfbc64-tgbnm 1/1 Running 0 34m
federation-system-68d5898f5b-f57hc 0/1 CrashLoopBackOff 11 33m
installed-community-ansible-service-broker-6fd6bcd647-5mzfz 1/1 Running 0 106m
installed-community-openshift-logging-7cd4b76b7-rn9cn 1/1 Running 0 126m
installed-community-openshift-operators-66bfc6bfff-cfh5j 1/1 Running 0 11m
marketplace-operator-56f7d5f5c4-dwwct 1/1 Running 0 6h22m
redhat-operators-77848897d7-lc929 1/1 Running 0 6h22m
$ oc get is -n openshift-operators
No resources found.
$ oc get subscription elasticsearch-operator -o yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: 2019-04-28T09:08:32Z
  generation: 1
  labels:
    csc-owner-name: installed-community-openshift-operators
    csc-owner-namespace: openshift-marketplace
  name: elasticsearch-operator
  namespace: openshift-operators
  resourceVersion: "167086"
  selfLink: /apis/operators.coreos.com/v1alpha1/namespaces/openshift-operators/subscriptions/elasticsearch-operator
  uid: 3244b6e0-6995-11e9-b7cf-0aea0dfb1820
spec:
  channel: preview
  installPlanApproval: Automatic
  name: elasticsearch-operator
  source: installed-community-openshift-operators
  sourceNamespace: openshift-operators
  startingCSV: elasticsearch-operator.v4.1.0
$ oc get sub/InstallPlan
Error from server (NotFound): subscriptions.operators.coreos.com "InstallPlan" not found

Hi Jian,
I've followed the reproduction steps you described and don't see the issue. I can install and uninstall operators as many times as I want. I tried with just the prometheus operator, then I tried installing etcd, removing etcd, and installing etcd and couchbase as you described. Both get installed and InstallPlans are generated every time.
Sometimes, when installing via OperatorHub, it can take a few moments for the install to succeed. But even when trying to make bad things happen (by installing two operators rapidly in an attempt to make OLM's grpc connection go bad) everything resolved and installed within a few seconds.
Anping,
I don't think your problems are related to this bug report. It looks like federation is crashlooping in your cluster. Because operators interact at the kube-api level, we don't install a new operator or upgrade existing ones until all the current operators are healthy. We're working on making the status for this better so that it's more obvious that's what's happening.
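(As a rough illustration of that health requirement, not an official procedure: listing the CSVs in the target namespace shows whether every existing operator has reached the Succeeded phase; the `openshift-operators` namespace here is just an example.)
$ oc get csv -n openshift-operators
# Print each CSV with its phase; anything not in "Succeeded" can hold up new installs or upgrades
$ oc get csv -n openshift-operators -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'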
Additionally:
> oc get sub/InstallPlan
Error from server (NotFound): subscriptions.operators.coreos.com "InstallPlan" not found
InstallPlans are a separate resource, not the name of a subscription resource. Please try `oc get installplans` instead.
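For example (the namespace and plan name below are placeholders, not values from this report):
$ oc get installplans -n openshift-operators
# Inspect a specific plan, or approve it manually when installPlanApproval is Manual
$ oc get installplan <installplan-name> -n openshift-operators -o yaml
$ oc patch installplan <installplan-name> -n openshift-operators --type merge -p '{"spec":{"approved":true}}'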
Closing as WORKSFORME; please re-open if you can reproduce the issue and provide reproduction steps.
*** Bug 1700453 has been marked as a duplicate of this bug. ***

Hi Jian, I am no longer able to reproduce it on my cluster. I followed the steps below:
- create a namespace `test`
- install etcd on `test` using the UI
- uninstall etcd from the UI
- install etcd and couchbase

Both subscriptions are picked up by OLM.
My cluster version: https://openshift-release-artifacts.svc.ci.openshift.org/4.1.0-0.ci-2019-05-02-115733/

Hi, Evan, Abu
Thanks for your clarification! Yeah, I understand, this issue does not always reproduce. Anyway, we will reopen it when we encounter it again.
@Evan
> Because operators interact at the kube-api level, we don't install a new operator or upgrade existing ones until all the current operators are healthy.
Does this also mean that users cannot install other operators if one of the operators in the cluster has crashed? If so, I don't think this is a good approach, since they are in different namespaces.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days