Created attachment 1520677 [details]
catalog-operator-logs

Description of problem:
Failed to subscribe to mongodb-enterprise via OLM.

Version-Release number of selected component (if applicable):
cluster version: 4.0.0-0.1
OLM version: 0.8.1
OLM source commit: "io.openshift.source-repo-commit": "47482491fb29def1a3df05c3178b07de5761708f"

How reproducible:
Always

Steps to Reproduce:
1. Create a subscription in the openshift-operators project:
   $ oc create -f mongodb-sub.yaml
2. Check the CSV and CRD status.

Actual results:
No CSV is created.

Expected results:
The CSV exists and the subscription to mongodb-enterprise succeeds.

Additional info:
The packageserver's global namespace was changed from "openshift-operator-lifecycle-manager" to "openshift-marketplace", as below:

mac:aws-ocp jianzhang$ oc get deployment packageserver -o yaml | grep command -A 6
      - command:
        - /bin/package-server
        - -v=4
        - --secure-port
        - "5443"
        - --global-namespace
        - openshift-marketplace

cat mongodb-sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  generateName: mongodb-enterprise-
  namespace: openshift-operators
spec:
  source: certified-operators
  sourceNamespace: openshift-marketplace
  name: mongodb-enterprise
  startingCSV: mongodboperator.v0.3.2
  channel: preview
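For step 2, a minimal set of checks would look like the following (a sketch; it assumes the subscription was created in openshift-operators and that the generated CRD names contain "mongodb"):

$ oc get subscription -n openshift-operators
$ oc get installplan -n openshift-operators
$ oc get csv -n openshift-operators
$ oc get crd | grep mongodb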
Hi, Evan

I also encounter this issue when subscribing to the Service Catalog.

mac:aws-ocp jianzhang$ oc logs -f catalog-operator-7cc8654dd8-8t68m
...
time="2019-01-16T08:13:57Z" level=info msg="couldn't get from queue" key=wewang1/frontend-1-xh2r5 queue=pod
time="2019-01-16T08:14:57Z" level=info msg="retrying kube-service-catalog/svcat-wdjp2"
E0116 08:14:57.796000       1 queueinformer_operator.go:155] Sync "kube-service-catalog/svcat-wdjp2" failed: {svcat alpha {community-operators openshift-marketplace}} not found: CatalogSource {community-operators openshift-marketplace} not found
...

But we can get the "community-operators" CatalogSource object successfully:

mac:aws-ocp jianzhang$ oc get catalogsource -n openshift-marketplace
NAME                  NAME                  TYPE       PUBLISHER   AGE
certified-operators   Certified Operators   internal   Red Hat     5h
community-operators   Community Operators   internal   Red Hat     5h
redhat-operators      Red Hat Operators     internal   Red Hat     5h

Below are the detailed steps:

1. Change the global namespace to "openshift-marketplace" so that these packagemanifests can be listed:

mac:aws-ocp jianzhang$ oc get packagemanifest
NAME                      AGE
amq-streams               5h
packageserver             5h
couchbase-enterprise      53m
descheduler               53m
mongodb-enterprise        53m
mongodb-enterprise-test   53m
couchbase-enterprise      5h
dynatrace-monitoring      5h
mongodb-enterprise        5h
automationbroker          5h
cluster-logging           5h
descheduler               5h
etcd                      5h
federationv2              5h
jaeger                    5h
metering                  5h
prometheus                5h
svcat                     5h
templateservicebroker     5h

2. Create the "kube-service-catalog" project:

mac:aws-ocp jianzhang$ oc adm new-project kube-service-catalog
Created project kube-service-catalog

3. Create the OperatorGroup:

mac:aws-ocp jianzhang$ cat og-all.yaml
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: catalog-operators
  namespace: kube-service-catalog
spec:
  selector: {}
mac:aws-ocp jianzhang$ oc create -f og-all.yaml

4. Subscribe to the Service Catalog:

mac:aws-ocp jianzhang$ cat svcat.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  generateName: svcat-
  namespace: kube-service-catalog
spec:
  source: community-operators
  sourceNamespace: openshift-marketplace
  name: svcat
  startingCSV: svcat.v0.1.34
  channel: alpha

But no CSV object is generated:

mac:aws-ocp jianzhang$ oc get sub -n kube-service-catalog
NAME          PACKAGE   SOURCE                CHANNEL
svcat-qp84x   svcat     community-operators   alpha
mac:aws-ocp jianzhang$ oc get csv -n kube-service-catalog
No resources found.
mac:aws-ocp jianzhang$ oc get installplan -n kube-service-catalog
No resources found.
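Since the sync error claims the CatalogSource is "not found" while it is clearly listed, a few extra checks would help narrow this down (a sketch; the registry pod layout in openshift-marketplace is assumed to be the default marketplace deployment):

$ oc get catalogsource community-operators -n openshift-marketplace -o yaml
$ oc get pods -n openshift-marketplace
$ oc get svc -n openshift-marketplace

If the CatalogSource has no status or its backing registry pod/service is missing, the catalog-operator will report it as not found even though the object itself exists.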
Now I am using the latest version of the OLM; details in bug https://bugzilla.redhat.com/show_bug.cgi?id=1667027#c2

But I still encounter this issue:

E0118 09:25:58.774596       1 queueinformer_operator.go:155] Sync "service-catalog/etcd" failed: {etcd alpha {marketplace-enabled-operators-community service-catalog}} not found: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.30.20.73:50051: connect: no route to host"

Steps:
1. Create a project called "service-catalog".
2. Create an OperatorGroup, as below:

mac:project jianzhang$ cat og-default.yaml
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: catalog-operators
  namespace: service-catalog
spec:
  selector: {}

3. Access the web console, click "Catalog" -> "Operator Hub" -> "Show community operators" -> "etcd operator", and select the "catalog-operators" OperatorGroup.

Results: Only the Subscription is created; no corresponding CSV is created. Checking the catalog-operator logs shows the errors above.

So, adding the "TestBlocker" label since this is blocking the relevant testing.
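The error above is a gRPC dial failure against the registry backing the "marketplace-enabled-operators-community" CatalogSource. A sketch of how to confirm whether that registry Service and its pod are still present (assuming the registry resources live in the same namespace as the CatalogSource):

$ oc get catalogsource marketplace-enabled-operators-community -n service-catalog -o yaml
$ oc get svc,endpoints -n service-catalog
$ oc get pods -n service-catalog

If the Service has no endpoints (or the pod is gone), the "no route to host" error on port 50051 would be expected.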
Tested with the latest OLM image, but it still fails.

[core@ip-10-0-3-254 ~]$ oc image info registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-01-20-082408@sha256:04ce25d455d2ac2d424584a61c7b1ca5567ce6c561fae36208845ab3744645f9 | grep io.openshift.build.commit.id
io.openshift.build.commit.id=f3b9375590334b0a3bfc8a9acf13dc5bde05da58

Here are the error logs of the PackageServer when installing the Service Catalog:

I0121 06:59:19.896532       1 wrap.go:42] GET /healthz: (139.874µs) 200 [[kube-probe/1.11+] 10.130.0.1:47760]
E0121 06:59:21.069024       1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=23, ErrCode=NO_ERROR, debug=""
I0121 06:59:21.069074       1 reflector.go:428] github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:112: Watch close - *v1alpha1.CatalogSource total 0 items received
I0121 06:59:22.339121       1 wrap.go:42] GET /healthz: (156.955µs) 200 [[kube-probe/1.11+] 10.130.0.1:47898]
W0121 06:59:23.238273       1 reflector.go:341] github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:112: watch of *v1alpha1.CatalogSource ended with: too old resource version: 200093 (204266)
I0121 06:59:24.238535       1 reflector.go:240] Listing and watching *v1alpha1.CatalogSource from github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:112
time="2019-01-21T06:59:24Z" level=info msg="update detected, attempting to reset grpc connection" action="sync catalogsource" name=marketplace-enabled-operators-community namespace=kube-service-catalog
time="2019-01-21T06:59:24Z" level=info msg="grpc connection reset" action="sync catalogsource" name=marketplace-enabled-operators-community namespace=kube-service-catalog
time="2019-01-21T06:59:24Z" level=info msg="update detected, attempting to reset grpc connection" action="sync catalogsource" name=certified-operators namespace=openshift-marketplace
time="2019-01-21T06:59:24Z" level=info msg="grpc connection reset" action="sync catalogsource" name=certified-operators namespace=openshift-marketplace
...

Raising this to "Urgent" since it's blocking the OLM/ServiceCatalog/ASB testing. If you have any workaround, please let me know.
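For completeness, a sketch of how to check the packageserver's health alongside its logs (the openshift-operator-lifecycle-manager namespace is an assumption based on the default OLM install described earlier):

$ oc get apiservice | grep packages
$ oc -n openshift-operator-lifecycle-manager get deployment packageserver
$ oc -n openshift-operator-lifecycle-manager logs deployment/packageserver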
These commits addressed several bugs with CatalogSource syncing:

https://github.com/operator-framework/operator-lifecycle-manager/pull/670/commits

They merged on Friday and should be available in the latest master of openshift. I believe this issue should be fixed. If there's still an issue, we'll need more information to debug it - specifically the Subscription object and the InstallPlan object (if created).
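For reference, a minimal way to capture those objects would be (the namespace and subscription name below are placeholders):

$ oc -n <namespace> get subscription <subscription-name> -o yaml
$ oc -n <namespace> get installplan -o yaml
$ oc -n <namespace> get csv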
Evan,

> It's possible that there's an underlying issue in openshift based on this log. For OLM to work properly, basics like networking and dns need to work as well.

Yeah, I checked the networking and DNS; they seem to work well. And I guess this error comes from this vendored file: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go#L105

I think the marketplace team hit this issue yesterday too, here: https://bugzilla.redhat.com/show_bug.cgi?id=1669992#c2

I'm not sure which team this bug belongs to. Could you help transfer this bug to the appropriate team? Thanks very much!

> Can you please provide steps to reproduce from a fresh cluster?

This issue does not always occur, but it happens often. The steps:
1) Create a new OCP 4.0 cluster.
2) Subscribe to an operator via the OLM, such as etcd-operator. It will be created successfully.
3) After the cluster has been running for hours, subscribe to an operator again; the issue often occurs then.

> this was labelled a testblocker because it was blocking service catalog testing - I see in the latest update that service catalog is installed on the cluster. Is this still a testblocker?

Yes, I labelled it with "testblocker" since we hit this issue many times that day. I will remove it, but I will leave the "beta2blocker" label because we should transfer this bug to the appropriate team to find the root cause and solve it.
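For step 2, a minimal Subscription that exercises this path might look like the following. This is a sketch: the channel and source names follow what appears elsewhere in this bug, and startingCSV is omitted so OLM would install the channel head.

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  generateName: etcd-
  namespace: openshift-operators
spec:
  source: community-operators
  sourceNamespace: openshift-marketplace
  name: etcd
  channel: alpha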
I just found a similar fix submitted by Clayton to Kubernetes: https://github.com/kubernetes/kubernetes/pull/73277

@Clayton Could you help look into this issue? Thanks very much!
*** Bug 1669992 has been marked as a duplicate of this bug. ***
Hit the same issue with 4.0.0-0.alpha-2019-02-13-200735. It is a fresh cluster that has been running for about 1 hour.
This bug also impacts marketplace's installed CatalogSource and ConfigMap (the resources are deleted after a while).
Jian and everyone else:

I've been looking into "Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection" and from what I've been told, it's an inconsequential log and is hidden in newer client releases. https://coreos.slack.com/archives/CEKNRGF25/p1550177624280300

> This issue not always occurs, but often. The steps:
> 1) Create a new OCP 4.0
> 2) Subscribe an operator via the OLM, such as etcd-operator. It will be created successfully.

In what namespace? In what OperatorGroup? openshift-operators?

> 3) After the cluster running for hours subscribe to an operator again, will encounter this issue, often.

Subscribe where? What operator? These are important as we have not seen this issue once.

Anping Li
> Hit same issue with 4.0.0-0.alpha-2019-02-13-200735. It is a fresh cluster, it is running about 1 hour.

Which issue? You can't install an operator after an hour? Which operator and in what OperatorGroup? Does the CatalogSource the new Subscription is referencing still exist?

Fan Jia
> This bug also impact marketplace's installed catalogsource&cm (The resources will be deleted after a while )

OLM does not delete CatalogSources, so there's no chance OLM is causing marketplace's issue (if that's what you're trying to say here).

Please, if you are using marketplace to generate CatalogSources in the openshift-operators namespace (the checks are sketched as commands after this comment):

1. Make sure the CatalogSource has a status and that you can query for the packagemanifest you want to install in that namespace (`kubectl -n openshift-operators get packagemanifests`) *at the time you go to create a Subscription*. If marketplace's CatalogSources are being deleted for some totally unrelated reason, any new Subscriptions referencing those CatalogSources will not progress.
2. Ensure the Subscription's `source` and `sourceNamespace` fields point to the CatalogSource in the openshift-operators namespace.
3. Check for InstallPlans, Subscriptions, and CSVs after you create a Subscription.
4. Make sure you're not trying to create a Subscription to the same operator in the same namespace.

We need actual data (status and logs) if you are still seeing this.
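A rough translation of that checklist into commands (a sketch; the subscription name is a placeholder and openshift-operators is assumed to be the namespace in use):

$ oc -n openshift-operators get catalogsource
$ oc -n openshift-operators get packagemanifests
$ oc -n openshift-operators get subscription <sub-name> -o yaml
$ oc -n openshift-operators get installplan
$ oc -n openshift-operators get csv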
(In reply to Nick Hale from comment #21)

Hi Nick, in case you missed my message: the marketplace team has already found the mistake behind the situation I mentioned; it is not caused by this bug.
@Nick, I filed a new bug to describe my issue; see https://bugzilla.redhat.com/show_bug.cgi?id=1677524.
Nick, thanks for your information!

> I've been looking into "Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection" and from what I've been told, it's an inconsequential log and is hidden in newer client releases.

Agreed; it comes from a vendored package in the OLM component, which I pointed out in comment 13. But this is the only kind of ERROR log we can find from the OLM. It's odd that there is no other error information when the CSV cannot be installed; you can see details in comment 16. It seems more like a platform issue, which I pointed out in comment 14, but got no response. Correct me if I'm wrong.

> In what namespace? in what operator-group? openshift-operators?

openshift-operators, global-operators.

> Subscribe where? What operator? These are important as we have not seen this issue once.

You can subscribe on the web console (Operator Hub), or use a subscription YAML file. Any of the operators we provide: I tried etcd-operator, service-catalog, couchbase-operators, etc., and all failed.
Evan, thanks for your information!

> The stated problem of operators not installing is not something we can reproduce. We have used the installer to create clusters and successfully installed operators.

Yes, understood. We cannot always reproduce this issue either. For this issue, in short, only the Subscription object is generated and no CSV is generated, after following the basic steps to install an operator on the web console.

> We have discovered a separate issue which may be exacerbating this bug report. It appears that after some time (a couple of hours), something in an openshift cluster appears to delete some of our service accounts. We're trying to track down the source of this (it may be an issue in OLM or some other component.) It causes, among other things, the package server to fail.

Yes, we also encountered this issue on two clusters. I think it's better to create a new bug to track it. I also hit another issue where the APIServices are deleted after a couple of hours; maybe it has the same root cause, see bug 1678606.

For this bug 1666225, we will keep an eye on it; we can verify it if we cannot reproduce it in the coming test rounds. What do you suggest?
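A quick way to spot the symptom described above would be something like the following (a sketch; the namespace is an assumption based on the default OLM install, and the grep pattern just matches the packageserver APIService):

$ oc -n openshift-operator-lifecycle-manager get serviceaccounts
$ oc get apiservice | grep packages

If the OLM service accounts or the packageserver APIService disappear after a few hours, that would point at the separate deletion issue rather than the subscription flow itself.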
*** This bug has been marked as a duplicate of bug 1679309 ***