Bug 1666225
| Summary: | Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | OLM |
| Status: | CLOSED DUPLICATE |
| Severity: | urgent |
| Priority: | urgent |
| Version: | 4.1.0 |
| Target Release: | 4.1.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | aos-scalability-40 |
| Reporter: | Dongbo Yan <dyan> |
| Assignee: | Evan Cordell <ecordell> |
| QA Contact: | Jian Zhang <jiazha> |
| CC: | akashem, anli, aravindh, ccoleman, dageoffr, ecordell, jeder, jfan, jforrest, jiazha, jmatthew, nhale, nstielau, scuppett, sponnaga, zitang |
| Keywords: | TestBlocker |
| Type: | Bug |
| Last Closed: | 2019-02-20 21:05:30 UTC |
| Attachments: | catalog-operator-logs |
Hi, Evan
I also encountered this issue when subscribing to the Service Catalog.
mac:aws-ocp jianzhang$ oc logs -f catalog-operator-7cc8654dd8-8t68m
...
time="2019-01-16T08:13:57Z" level=info msg="couldn't get from queue" key=wewang1/frontend-1-xh2r5 queue=pod
time="2019-01-16T08:14:57Z" level=info msg="retrying kube-service-catalog/svcat-wdjp2"
E0116 08:14:57.796000 1 queueinformer_operator.go:155] Sync "kube-service-catalog/svcat-wdjp2" failed: {svcat alpha {community-operators openshift-marketplace}} not found: CatalogSource {community-operators openshift-marketplace} not found
...
But the "community-operators" CatalogSource object can be retrieved successfully:
mac:aws-ocp jianzhang$ oc get catalogsource -n openshift-marketplace
NAME NAME TYPE PUBLISHER AGE
certified-operators Certified Operators internal Red Hat 5h
community-operators Community Operators internal Red Hat 5h
redhat-operators Red Hat Operators internal Red Hat 5h
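The "not found" error above prints a (name, namespace)-style key, e.g. `{community-operators openshift-marketplace}`, which suggests the resolver looks catalog sources up by that compound key: a source that exists in one namespace is invisible under any other key, and a stale cache entry fails the lookup even though `oc get` shows the source. The following is a hypothetical sketch of that kind of keyed lookup, not the actual OLM code; all class and function names are invented for illustration.

```python
# Hypothetical sketch of a (name, namespace)-keyed catalog lookup, mirroring
# the error "CatalogSource {community-operators openshift-marketplace} not
# found". This is NOT the actual OLM implementation.

class CatalogNotFoundError(Exception):
    pass

class CatalogRegistry:
    def __init__(self):
        # Sources are keyed by (name, namespace); a source registered under
        # one namespace is invisible under any other key.
        self._sources = {}

    def register(self, name, namespace, packages):
        self._sources[(name, namespace)] = set(packages)

    def resolve(self, package, source, source_namespace):
        key = (source, source_namespace)
        if key not in self._sources:
            raise CatalogNotFoundError(
                f"CatalogSource {{{source} {source_namespace}}} not found")
        if package not in self._sources[key]:
            raise CatalogNotFoundError(f"package {package} not found in {key}")
        return key

registry = CatalogRegistry()
registry.register("community-operators", "openshift-marketplace", ["svcat", "etcd"])

# Resolution succeeds when the key matches a registered source...
assert registry.resolve("svcat", "community-operators", "openshift-marketplace")

# ...but fails when the key points at a namespace where the source was never
# registered (or where its cache entry was dropped).
try:
    registry.resolve("svcat", "community-operators", "kube-service-catalog")
except CatalogNotFoundError as e:
    print(e)  # -> CatalogSource {community-operators kube-service-catalog} not found
```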
Below are the detailed steps:
1, Change the global namespace to "openshift-marketplace", so that we can get these packagemanifests:
mac:aws-ocp jianzhang$ oc get packagemanifest
NAME AGE
amq-streams 5h
packageserver 5h
couchbase-enterprise 53m
descheduler 53m
mongodb-enterprise 53m
mongodb-enterprise-test 53m
couchbase-enterprise 5h
dynatrace-monitoring 5h
mongodb-enterprise 5h
automationbroker 5h
cluster-logging 5h
descheduler 5h
etcd 5h
federationv2 5h
jaeger 5h
metering 5h
prometheus 5h
svcat 5h
templateservicebroker 5h
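The reason changing the global namespace matters: the package server exposes, for a given namespace, the union of that namespace's own packages and those in the configured global namespace. A minimal sketch of that aggregation rule, assuming an invented `visible_packages` helper and invented data (this is not the package server's real data model):

```python
# Hypothetical sketch of global-namespace aggregation: packagemanifests
# visible in a namespace are the union of that namespace's own packages and
# those in the configured global namespace. Names are illustrative only.

def visible_packages(catalogs_by_namespace, namespace, global_namespace):
    own = catalogs_by_namespace.get(namespace, set())
    shared = catalogs_by_namespace.get(global_namespace, set())
    return sorted(own | shared)

catalogs = {
    "openshift-marketplace": {"svcat", "etcd", "prometheus"},
    "my-project": {"internal-operator"},
}

# With the global namespace set to openshift-marketplace, a project sees its
# own packages plus the marketplace packages.
print(visible_packages(catalogs, "my-project", "openshift-marketplace"))
# -> ['etcd', 'internal-operator', 'prometheus', 'svcat']
```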
2, Create the "kube-service-catalog" project:
mac:aws-ocp jianzhang$ oc adm new-project kube-service-catalog
Created project kube-service-catalog
3, Create the operatorgroup:
mac:aws-ocp jianzhang$ cat og-all.yaml
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: catalog-operators
  namespace: kube-service-catalog
spec:
  selector: {}
mac:aws-ocp jianzhang$ oc create -f og-all.yaml
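The `selector: {}` in the OperatorGroup above follows Kubernetes label-selector semantics, under which an empty selector matches everything, so this group targets all namespaces. A small sketch of that matching rule (a hypothetical helper, not OLM or Kubernetes code):

```python
# Hypothetical sketch of Kubernetes-style equality label-selector matching.
# An empty selector ({}) matches every object, which is why `selector: {}`
# makes the OperatorGroup select all namespaces.

def selector_matches(selector, labels):
    # Every key/value pair in the selector must be present in the labels;
    # an empty selector therefore matches unconditionally.
    return all(labels.get(k) == v for k, v in selector.items())

assert selector_matches({}, {"team": "olm"})              # empty selector: always matches
assert selector_matches({"team": "olm"}, {"team": "olm"})
assert not selector_matches({"team": "olm"}, {"team": "svcat"})
```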
4, Subscribe to the Service Catalog:
mac:aws-ocp jianzhang$ cat svcat.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  generateName: svcat-
  namespace: kube-service-catalog
spec:
  source: community-operators
  sourceNamespace: openshift-marketplace
  name: svcat
  startingCSV: svcat.v0.1.34
  channel: alpha
But no CSV object is generated.
mac:aws-ocp jianzhang$ oc get sub -n kube-service-catalog
NAME PACKAGE SOURCE CHANNEL
svcat-qp84x svcat community-operators alpha
mac:aws-ocp jianzhang$ oc get csv -n kube-service-catalog
No resources found.
mac:aws-ocp jianzhang$ oc get installplan -n kube-service-catalog
No resources found.
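A healthy install progresses Subscription, then InstallPlan, then CSV; here only the Subscription exists. The triage above can be sketched as a check that walks that chain and reports the first missing link. This is an illustrative helper over invented data, not real cluster state or an OLM API:

```python
# Hypothetical diagnostic sketch: given the objects present in a namespace,
# report the first missing link in the Subscription -> InstallPlan -> CSV
# chain. Object names are illustrative, not real cluster data.

def diagnose(objects):
    chain = ["Subscription", "InstallPlan", "ClusterServiceVersion"]
    for kind in chain:
        if not objects.get(kind):
            return f"stalled: no {kind} created"
    return "healthy: full chain present"

# Matches the situation above: a Subscription exists but nothing else does.
state = {"Subscription": ["svcat-qp84x"], "InstallPlan": [], "ClusterServiceVersion": []}
print(diagnose(state))  # -> stalled: no InstallPlan created
```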
Now I used the latest version of the OLM (details in bug https://bugzilla.redhat.com/show_bug.cgi?id=1667027#c2), but I still encounter this issue:

E0118 09:25:58.774596 1 queueinformer_operator.go:155] Sync "service-catalog/etcd" failed: {etcd alpha {marketplace-enabled-operators-community service-catalog}} not found: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.30.20.73:50051: connect: no route to host"

Steps:
1, Create a project called "service-catalog".
2, Create an operator group, as below:
mac:project jianzhang$ cat og-default.yaml
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: catalog-operators
  namespace: service-catalog
spec:
  selector: {}
3, Access the web console, click "Catalog" -> "Operator Hub" -> "Show community operators" -> "etcd operator", and select the "catalog-operators" OperatorGroup.

Results: only the subscription is created; no corresponding CSV is created. Checking the catalog-operator logs shows the errors above.

So I added the "TestBlocker" label since it is blocking relevant testing. I tested again with the latest OLM image, but it still fails.
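The "no route to host" failure above is a transient connection error; clients typically handle it by retrying with exponential backoff rather than failing the sync permanently. The following only illustrates that retry pattern under invented names; it is not how OLM or gRPC actually manage subconnections:

```python
# Hypothetical sketch of retry-with-exponential-backoff for transient
# connection errors such as "connect: no route to host". Illustrative only.
import time

class TransientFailure(Exception):
    pass

def call_with_backoff(call, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientFailure:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(delay)
            delay *= 2  # double the wait between attempts

# Simulate an endpoint that becomes reachable on the third attempt.
attempts = {"n": 0}
def flaky_dial():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientFailure("connect: no route to host")
    return "connected"

print(call_with_backoff(flaky_dial, sleep=lambda _: None))  # -> connected
```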
[core@ip-10-0-3-254 ~]$ oc image info registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-01-20-082408@sha256:04ce25d455d2ac2d424584a61c7b1ca5567ce6c561fae36208845ab3744645f9|grep io.openshift.build.commit.id
io.openshift.build.commit.id=f3b9375590334b0a3bfc8a9acf13dc5bde05da58
Here are the error logs of the PackageServer when installing the Service Catalog:
I0121 06:59:19.896532 1 wrap.go:42] GET /healthz: (139.874µs) 200 [[kube-probe/1.11+] 10.130.0.1:47760]
E0121 06:59:21.069024 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=23, ErrCode=NO_ERROR, debug=""
I0121 06:59:21.069074 1 reflector.go:428] github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:112: Watch close - *v1alpha1.CatalogSource total 0 items received
I0121 06:59:22.339121 1 wrap.go:42] GET /healthz: (156.955µs) 200 [[kube-probe/1.11+] 10.130.0.1:47898]
W0121 06:59:23.238273 1 reflector.go:341] github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:112: watch of *v1alpha1.CatalogSource ended with: too old resource version: 200093 (204266)
I0121 06:59:24.238535 1 reflector.go:240] Listing and watching *v1alpha1.CatalogSource from github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:112
time="2019-01-21T06:59:24Z" level=info msg="update detected, attempting to reset grpc connection" action="sync catalogsource" name=marketplace-enabled-operators-community namespace=kube-service-catalog
time="2019-01-21T06:59:24Z" level=info msg="grpc connection reset" action="sync catalogsource" name=marketplace-enabled-operators-community namespace=kube-service-catalog
time="2019-01-21T06:59:24Z" level=info msg="update detected, attempting to reset grpc connection" action="sync catalogsource" name=certified-operators namespace=openshift-marketplace
time="2019-01-21T06:59:24Z" level=info msg="grpc connection reset" action="sync catalogsource" name=certified-operators namespace=openshift-marketplace
...
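The log sequence above shows standard reflector behavior: a watch closed by GOAWAY is simply re-established from the same resource version, while a "too old resource version" error forces a full relist before watching again. A minimal model of that loop, assuming invented exception and function names (this is not client-go code):

```python
# Hypothetical sketch of the list-then-watch recovery visible in the logs:
# reconnect on a closed watch, relist on "too old resource version".

class TooOldResourceVersion(Exception):
    pass

class WatchClosed(Exception):
    pass

def reflector_step(resource_version, watch_once, relist):
    """Run one watch attempt and return the next resource version to use."""
    try:
        return watch_once(resource_version)  # normally advances the version
    except WatchClosed:
        return resource_version              # benign (e.g. GOAWAY): reconnect from same version
    except TooOldResourceVersion:
        return relist()                      # recover: full LIST, fresh version

# Simulate: watching from a stale version fails, triggering a relist,
# using the versions from the log line "200093 (204266)".
def watch_once(rv):
    if rv < 204266:
        raise TooOldResourceVersion(f"too old resource version: {rv} (204266)")
    return rv + 1

print(reflector_step(200093, watch_once, relist=lambda: 204266))  # -> 204266
```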
Changing the priority to "Urgent" since it is blocking OLM/ServiceCatalog/ASB testing. If you have any workaround, please let me know.
These commits addressed several bugs with CatalogSource syncing: https://github.com/operator-framework/operator-lifecycle-manager/pull/670/commits
It merged on Friday and should be available in the latest master of OpenShift. I believe this issue should be fixed. If there's still an issue, we'll need more information to debug it, specifically the subscription object and the installplan object (if created).

Evan,

> It's possible that there's an underlying issue in openshift based on this log. For OLM to work properly, basics like networking and dns need to work as well.

Yeah, I checked the networking and DNS; they seem to work well. And I guess this error comes from this vendored file: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go#L105
I think the marketplace team hit this issue yesterday too, here: https://bugzilla.redhat.com/show_bug.cgi?id=1669992#c2
I'm not sure which team this bug belongs to. Could you help transfer it to the appropriate team? Thanks very much!

> Can you please provide steps to reproduce from a fresh cluster?

This issue does not always occur, but it occurs often. The steps:
1) Create a new OCP 4.0 cluster.
2) Subscribe to an operator via the OLM, such as etcd-operator. It will be created successfully.
3) After the cluster has been running for hours, subscribe to an operator again; this issue often occurs.

> this was labelled a testblocker because it was blocking service catalog testing - I see in the latest update that service catalog is installed on the cluster. Is this still a testblocker?

Yes, I labeled it with "testblocker" since we hit this issue many times that day. I will remove it, but I will leave the "beta2blocker" label because we should transfer this bug to the appropriate team to find the root cause and solve it.
I just found a similar fixed PR submitted by Clayton on Kubernetes: https://github.com/kubernetes/kubernetes/pull/73277
@Clayton Could you help to process this issue? Thanks very much!

*** Bug 1669992 has been marked as a duplicate of this bug. ***

Hit the same issue with 4.0.0-0.alpha-2019-02-13-200735. It is a fresh cluster that has been running for about 1 hour.

This bug also impacts marketplace's installed CatalogSource and ConfigMap (the resources are deleted after a while).

Jian and everyone else:

I've been looking into "Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection" and, from what I've been told, it's an inconsequential log and is hidden in newer client releases. https://coreos.slack.com/archives/CEKNRGF25/p1550177624280300

> This issue not always occurs, but often. The steps:
> 1) Create a new OCP 4.0
> 2) Subscribe an operator via the OLM, such as etcd-operator. It will be created successfully.

In what namespace? In what operator group? openshift-operators?

> 3) After the cluster running for hours subscribe to an operator again, will encounter this issue, often.

Subscribe where? What operator? These are important, as we have not seen this issue once.

Anping Li
> Hit same issue with 4.0.0-0.alpha-2019-02-13-200735. It is a fresh cluster, it is running about 1 hour.

Which issue? You can't install an operator after an hour? Which operator, and in what operator group? Does the CatalogSource the new Subscription references still exist?

Fan Jia
> This bug also impact marketplace's installed catalogsource&cm (The resources will be deleted after a while )

OLM does not delete CatalogSources, so there's no chance OLM is causing marketplace's issue (if that's what you're trying to say here). Please, if you are using marketplace to generate CatalogSources in the openshift-operators namespace:
1. Make sure the CatalogSource has a status and that you can query for the packagemanifest you want to install in that namespace (`kubectl -n openshift-operators get packagemanifests`) *at the time you go to create a Subscription*. If marketplace's CatalogSources are being deleted for some totally unrelated reason, any new Subscriptions referencing those CatalogSources will not progress.
2. Ensure the Subscription's `sourceName` and `sourceNamespace` fields point to the CatalogSource in the openshift-operators namespace.
3. Check for InstallPlans, Subscriptions, and CSVs after you create a Subscription.
4. Make sure you're not trying to create a Subscription to the same operator in the same namespace.

We need actual data (status and logs) if you are still seeing this.

(In reply to Nick Hale from comment #21)

Hi Nick, in case you missed my message: the marketplace team have already found the mistake in the situation I mentioned; it is not caused by this bug.

@Nick, I filed a new bug to describe my issue; refer to https://bugzilla.redhat.com/show_bug.cgi?id=1677524.

Nick, thanks for your information!

> I've been looking into "Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection" and from what I've been told, it's an inconsequential log and is hidden in newer client releases.

Agreed; it is vendored code in the OLM component, which I pointed out in comment 13. But this is the only kind of ERROR log we can find from the OLM; it's strange that there is no other error info when the CSV cannot be installed. You can see details in comment 16.
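Nick's four-point checklist above can be sketched as a validation routine. The data shapes and the `validate_subscription` helper below are invented for illustration; the field names follow the Subscription specs shown earlier in the report, but this is not an OLM API:

```python
# Hypothetical sketch of the troubleshooting checklist as a validation
# routine. `catalog_sources` maps (name, namespace) -> set of packages served
# by a ready source; `existing_subs` is the set of (package, namespace) pairs
# already subscribed. Structure is illustrative only.

def validate_subscription(sub, catalog_sources, existing_subs):
    problems = []
    key = (sub["source"], sub["sourceNamespace"])
    packages = catalog_sources.get(key)
    if packages is None:
        # Checks 1 and 2: the referenced CatalogSource must exist and be ready.
        problems.append(f"CatalogSource {key} missing or has no status")
    elif sub["name"] not in packages:
        # Check 1: the package must be served by that source.
        problems.append(f"package {sub['name']} not served by {key}")
    if (sub["name"], sub["namespace"]) in existing_subs:
        # Check 4: no duplicate Subscription for the same operator/namespace.
        problems.append("duplicate Subscription for this operator/namespace")
    return problems

sub = {"name": "svcat", "namespace": "openshift-operators",
       "source": "community-operators", "sourceNamespace": "openshift-marketplace"}
sources = {("community-operators", "openshift-marketplace"): {"svcat", "etcd"}}

print(validate_subscription(sub, sources, existing_subs=set()))  # -> []
```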
It seems more like a platform issue; I pointed this out in comment 14, but got no response. Correct me if I'm wrong.

> In what namespace? in what operator-group? openshift-operators?

openshift-operators, global-operators.

> Subscribe where? What operator? These are important as we have not seen this issue once.

You can subscribe on the web console (Operator Hub), or use a subscription YAML file. Any of the operators we provide: I tried etcd-operator, service-catalog, couchbase-operators, etc., and all failed.

Evan, thanks for your information!

> The stated problem of operators not installing is not something we can reproduce. We have used the installer to create clusters and successfully installed operators.

Yes, understood. We cannot always reproduce this issue either. In short, only the subscription object is generated; no CSV is generated after following the basic steps to install an operator on the web console.

> We have discovered a separate issue which may be exacerbating this bug report. It appears that after some time (a couple of hours), something in an openshift cluster appears to delete some of our service accounts. We're trying to track down the source of this (it may be an issue in OLM or some other component.) It causes, among other things, the package server to fail.

Yes, we also encountered this issue on two clusters. I think it's better to create a new bug to track it. I also encountered another issue in which the APIServices are deleted after a couple of hours; maybe it has the same root cause (bug 1678606).

For this bug 1666225, we will keep an eye on it; we can verify it if we cannot reproduce it in the coming test rounds. What do you suggest?

*** This bug has been marked as a duplicate of bug 1679309 ***
Created attachment 1520677 [details]
catalog-operator-logs

Description of problem:
Failed to subscribe to mongodb-enterprise via OLM

Version-Release number of selected component (if applicable):
cluster version: 4.0.0-0.1
OLM version: 0.8.1
The OLM source code version: "io.openshift.source-repo-commit": "47482491fb29def1a3df05c3178b07de5761708f"

How reproducible:
Always

Steps to Reproduce:
1. Create a subscription in the openshift-operators project: $ oc create -f mongodb-sub.yaml
2. Check the CSV and CRD status.

Actual results:
The CSV does not exist.

Expected results:
The CSV exists; the subscription to mongodb-enterprise succeeds.

Additional info:
Change the watching namespace to "openshift-marketplace" from "openshift-operator-lifecycle-manager", as below:
mac:aws-ocp jianzhang$ oc get deployment packageserver -o yaml |grep command -A 6
      - command:
        - /bin/package-server
        - -v=4
        - --secure-port
        - "5443"
        - --global-namespace
        - openshift-marketplace

cat mongodb-sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  generateName: mongodb-enterprise-
  namespace: openshift-operators
spec:
  source: certified-operators
  sourceNamespace: openshift-marketplace
  name: mongodb-enterprise
  startingCSV: mongodboperator.v0.3.2
  channel: preview