Bug 1666225
| Summary: | Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | OLM |
| Status: | CLOSED DUPLICATE |
| Severity: | urgent |
| Priority: | urgent |
| Version: | 4.1.0 |
| Target Release: | 4.1.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | aos-scalability-40 |
| Reporter: | Dongbo Yan <dyan> |
| Assignee: | Evan Cordell <ecordell> |
| QA Contact: | Jian Zhang <jiazha> |
| CC: | akashem, anli, aravindh, ccoleman, dageoffr, ecordell, jeder, jfan, jforrest, jiazha, jmatthew, nhale, nstielau, scuppett, sponnaga, zitang |
| Keywords: | TestBlocker |
| Type: | Bug |
| Last Closed: | 2019-02-20 21:05:30 UTC |
| Attachments: | catalog-operator-logs |
Hi, Evan
I also encountered this issue when subscribing to the Service Catalog.
mac:aws-ocp jianzhang$ oc logs -f catalog-operator-7cc8654dd8-8t68m
...
time="2019-01-16T08:13:57Z" level=info msg="couldn't get from queue" key=wewang1/frontend-1-xh2r5 queue=pod
time="2019-01-16T08:14:57Z" level=info msg="retrying kube-service-catalog/svcat-wdjp2"
E0116 08:14:57.796000 1 queueinformer_operator.go:155] Sync "kube-service-catalog/svcat-wdjp2" failed: {svcat alpha {community-operators openshift-marketplace}} not found: CatalogSource {community-operators openshift-marketplace} not found
...
But the "community-operators" CatalogSource object can be retrieved successfully:
mac:aws-ocp jianzhang$ oc get catalogsource -n openshift-marketplace
NAME NAME TYPE PUBLISHER AGE
certified-operators Certified Operators internal Red Hat 5h
community-operators Community Operators internal Red Hat 5h
redhat-operators Red Hat Operators internal Red Hat 5h
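The "not found" error above prints a (name, namespace)-style key, e.g. `{community-operators openshift-marketplace}`, which suggests the resolver looks catalog sources up by that compound key: a source that exists in one namespace is invisible under any other key, and a stale cache entry fails the lookup even though `oc get` shows the source. The following is a hypothetical sketch of that kind of keyed lookup, not the actual OLM code; all class and function names are invented for illustration.

```python
# Hypothetical sketch of a (name, namespace)-keyed catalog lookup, mirroring
# the error "CatalogSource {community-operators openshift-marketplace} not
# found". This is NOT the actual OLM implementation.

class CatalogNotFoundError(Exception):
    pass

class CatalogRegistry:
    def __init__(self):
        # Sources are keyed by (name, namespace); a source registered under
        # one namespace is invisible under any other key.
        self._sources = {}

    def register(self, name, namespace, packages):
        self._sources[(name, namespace)] = set(packages)

    def resolve(self, package, source, source_namespace):
        key = (source, source_namespace)
        if key not in self._sources:
            raise CatalogNotFoundError(
                f"CatalogSource {{{source} {source_namespace}}} not found")
        if package not in self._sources[key]:
            raise CatalogNotFoundError(f"package {package} not found in {key}")
        return key

registry = CatalogRegistry()
registry.register("community-operators", "openshift-marketplace", ["svcat", "etcd"])

# Resolution succeeds when the key matches a registered source...
assert registry.resolve("svcat", "community-operators", "openshift-marketplace")

# ...but fails when the key points at a namespace where the source was never
# registered (or where its cache entry was dropped).
try:
    registry.resolve("svcat", "community-operators", "kube-service-catalog")
except CatalogNotFoundError as e:
    print(e)  # -> CatalogSource {community-operators kube-service-catalog} not found
```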
Below are the detailed steps:
1, Change the global namespace to "openshift-marketplace", so that we can get these packagemanifests:
mac:aws-ocp jianzhang$ oc get packagemanifest
NAME AGE
amq-streams 5h
packageserver 5h
couchbase-enterprise 53m
descheduler 53m
mongodb-enterprise 53m
mongodb-enterprise-test 53m
couchbase-enterprise 5h
dynatrace-monitoring 5h
mongodb-enterprise 5h
automationbroker 5h
cluster-logging 5h
descheduler 5h
etcd 5h
federationv2 5h
jaeger 5h
metering 5h
prometheus 5h
svcat 5h
templateservicebroker 5h
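The reason changing the global namespace matters: the package server exposes, for a given namespace, the union of that namespace's own packages and those in the configured global namespace. A minimal sketch of that aggregation rule, assuming an invented `visible_packages` helper and invented data (this is not the package server's real data model):

```python
# Hypothetical sketch of global-namespace aggregation: packagemanifests
# visible in a namespace are the union of that namespace's own packages and
# those in the configured global namespace. Names are illustrative only.

def visible_packages(catalogs_by_namespace, namespace, global_namespace):
    own = catalogs_by_namespace.get(namespace, set())
    shared = catalogs_by_namespace.get(global_namespace, set())
    return sorted(own | shared)

catalogs = {
    "openshift-marketplace": {"svcat", "etcd", "prometheus"},
    "my-project": {"internal-operator"},
}

# With the global namespace set to openshift-marketplace, a project sees its
# own packages plus the marketplace packages.
print(visible_packages(catalogs, "my-project", "openshift-marketplace"))
# -> ['etcd', 'internal-operator', 'prometheus', 'svcat']
```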
2, Create the "kube-service-catalog" project:
mac:aws-ocp jianzhang$ oc adm new-project kube-service-catalog
Created project kube-service-catalog
3, Create the operatorgroup:
mac:aws-ocp jianzhang$ cat og-all.yaml
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: catalog-operators
  namespace: kube-service-catalog
spec:
  selector: {}
mac:aws-ocp jianzhang$ oc create -f og-all.yaml
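The `selector: {}` in the OperatorGroup above follows Kubernetes label-selector semantics, under which an empty selector matches everything, so this group targets all namespaces. A small sketch of that matching rule (a hypothetical helper, not OLM or Kubernetes code):

```python
# Hypothetical sketch of Kubernetes-style equality label-selector matching.
# An empty selector ({}) matches every object, which is why `selector: {}`
# makes the OperatorGroup select all namespaces.

def selector_matches(selector, labels):
    # Every key/value pair in the selector must be present in the labels;
    # an empty selector therefore matches unconditionally.
    return all(labels.get(k) == v for k, v in selector.items())

assert selector_matches({}, {"team": "olm"})              # empty selector: always matches
assert selector_matches({"team": "olm"}, {"team": "olm"})
assert not selector_matches({"team": "olm"}, {"team": "svcat"})
```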
4, Subscribe to the Service Catalog:
mac:aws-ocp jianzhang$ cat svcat.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  generateName: svcat-
  namespace: kube-service-catalog
spec:
  source: community-operators
  sourceNamespace: openshift-marketplace
  name: svcat
  startingCSV: svcat.v0.1.34
  channel: alpha
But no CSV object is generated.
mac:aws-ocp jianzhang$ oc get sub -n kube-service-catalog
NAME PACKAGE SOURCE CHANNEL
svcat-qp84x svcat community-operators alpha
mac:aws-ocp jianzhang$ oc get csv -n kube-service-catalog
No resources found.
mac:aws-ocp jianzhang$ oc get installplan -n kube-service-catalog
No resources found.
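A healthy install progresses Subscription, then InstallPlan, then CSV; here only the Subscription exists. The triage above can be sketched as a check that walks that chain and reports the first missing link. This is an illustrative helper over invented data, not real cluster state or an OLM API:

```python
# Hypothetical diagnostic sketch: given the objects present in a namespace,
# report the first missing link in the Subscription -> InstallPlan -> CSV
# chain. Object names are illustrative, not real cluster data.

def diagnose(objects):
    chain = ["Subscription", "InstallPlan", "ClusterServiceVersion"]
    for kind in chain:
        if not objects.get(kind):
            return f"stalled: no {kind} created"
    return "healthy: full chain present"

# Matches the situation above: a Subscription exists but nothing else does.
state = {"Subscription": ["svcat-qp84x"], "InstallPlan": [], "ClusterServiceVersion": []}
print(diagnose(state))  # -> stalled: no InstallPlan created
```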
Now I used the latest version of the OLM (details in bug https://bugzilla.redhat.com/show_bug.cgi?id=1667027#c2), but I still encounter this issue:

E0118 09:25:58.774596 1 queueinformer_operator.go:155] Sync "service-catalog/etcd" failed: {etcd alpha {marketplace-enabled-operators-community service-catalog}} not found: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.30.20.73:50051: connect: no route to host"

Steps:
1, Create a project called "service-catalog".
2, Create an operator group, as below:
mac:project jianzhang$ cat og-default.yaml
apiVersion: operators.coreos.com/v1alpha2
kind: OperatorGroup
metadata:
  name: catalog-operators
  namespace: service-catalog
spec:
  selector: {}
3, Access the web console, click "Catalog" -> "Operator Hub" -> "Show community operators" -> "etcd operator", and select the "catalog-operators" OperatorGroup.

Results: only the subscription is created; no corresponding CSV is created. Checking the catalog-operator logs shows the errors above.

So I added the "TestBlocker" label since it is blocking relevant testing. I tested again with the latest OLM image, but it still fails.
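The "no route to host" failure above is a transient connection error; clients typically handle it by retrying with exponential backoff rather than failing the sync permanently. The following only illustrates that retry pattern under invented names; it is not how OLM or gRPC actually manage subconnections:

```python
# Hypothetical sketch of retry-with-exponential-backoff for transient
# connection errors such as "connect: no route to host". Illustrative only.
import time

class TransientFailure(Exception):
    pass

def call_with_backoff(call, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientFailure:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(delay)
            delay *= 2  # double the wait between attempts

# Simulate an endpoint that becomes reachable on the third attempt.
attempts = {"n": 0}
def flaky_dial():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientFailure("connect: no route to host")
    return "connected"

print(call_with_backoff(flaky_dial, sleep=lambda _: None))  # -> connected
```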
[core@ip-10-0-3-254 ~]$ oc image info registry.svc.ci.openshift.org/openshift/origin-v4.0-2019-01-20-082408@sha256:04ce25d455d2ac2d424584a61c7b1ca5567ce6c561fae36208845ab3744645f9|grep io.openshift.build.commit.id
io.openshift.build.commit.id=f3b9375590334b0a3bfc8a9acf13dc5bde05da58
Here are the error logs of the PackageServer when installing the Service Catalog:
I0121 06:59:19.896532 1 wrap.go:42] GET /healthz: (139.874µs) 200 [[kube-probe/1.11+] 10.130.0.1:47760]
E0121 06:59:21.069024 1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=23, ErrCode=NO_ERROR, debug=""
I0121 06:59:21.069074 1 reflector.go:428] github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:112: Watch close - *v1alpha1.CatalogSource total 0 items received
I0121 06:59:22.339121 1 wrap.go:42] GET /healthz: (156.955µs) 200 [[kube-probe/1.11+] 10.130.0.1:47898]
W0121 06:59:23.238273 1 reflector.go:341] github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:112: watch of *v1alpha1.CatalogSource ended with: too old resource version: 200093 (204266)
I0121 06:59:24.238535 1 reflector.go:240] Listing and watching *v1alpha1.CatalogSource from github.com/operator-framework/operator-lifecycle-manager/pkg/lib/queueinformer/queueinformer_operator.go:112
time="2019-01-21T06:59:24Z" level=info msg="update detected, attempting to reset grpc connection" action="sync catalogsource" name=marketplace-enabled-operators-community namespace=kube-service-catalog
time="2019-01-21T06:59:24Z" level=info msg="grpc connection reset" action="sync catalogsource" name=marketplace-enabled-operators-community namespace=kube-service-catalog
time="2019-01-21T06:59:24Z" level=info msg="update detected, attempting to reset grpc connection" action="sync catalogsource" name=certified-operators namespace=openshift-marketplace
time="2019-01-21T06:59:24Z" level=info msg="grpc connection reset" action="sync catalogsource" name=certified-operators namespace=openshift-marketplace
...
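The log sequence above shows standard reflector behavior: a watch closed by GOAWAY is simply re-established from the same resource version, while a "too old resource version" error forces a full relist before watching again. A minimal model of that loop, assuming invented exception and function names (this is not client-go code):

```python
# Hypothetical sketch of the list-then-watch recovery visible in the logs:
# reconnect on a closed watch, relist on "too old resource version".

class TooOldResourceVersion(Exception):
    pass

class WatchClosed(Exception):
    pass

def reflector_step(resource_version, watch_once, relist):
    """Run one watch attempt and return the next resource version to use."""
    try:
        return watch_once(resource_version)  # normally advances the version
    except WatchClosed:
        return resource_version              # benign (e.g. GOAWAY): reconnect from same version
    except TooOldResourceVersion:
        return relist()                      # recover: full LIST, fresh version

# Simulate: watching from a stale version fails, triggering a relist,
# using the versions from the log line "200093 (204266)".
def watch_once(rv):
    if rv < 204266:
        raise TooOldResourceVersion(f"too old resource version: {rv} (204266)")
    return rv + 1

print(reflector_step(200093, watch_once, relist=lambda: 204266))  # -> 204266
```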
Changing the priority to "Urgent" since it is blocking OLM/ServiceCatalog/ASB testing. If you have any workaround, please let me know.
These commits addressed several bugs with CatalogSource syncing: https://github.com/operator-framework/operator-lifecycle-manager/pull/670/commits
It merged on Friday and should be available in the latest master of OpenShift. I believe this issue should be fixed. If there's still an issue, we'll need more information to debug it, specifically the subscription object and the installplan object (if created).

Evan,

> It's possible that there's an underlying issue in openshift based on this log. For OLM to work properly, basics like networking and dns need to work as well.

Yeah, I checked the networking and DNS; they seem to work well. And I guess this error comes from this vendored file: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/vendor/k8s.io/apimachinery/pkg/watch/streamwatcher.go#L105
I think the marketplace team hit this issue yesterday too, here: https://bugzilla.redhat.com/show_bug.cgi?id=1669992#c2
I'm not sure which team this bug belongs to. Could you help transfer it to the appropriate team? Thanks very much!

> Can you please provide steps to reproduce from a fresh cluster?

This issue does not always occur, but it occurs often. The steps:
1) Create a new OCP 4.0 cluster.
2) Subscribe to an operator via the OLM, such as etcd-operator. It will be created successfully.
3) After the cluster has been running for hours, subscribe to an operator again; this issue often occurs.

> this was labelled a testblocker because it was blocking service catalog testing - I see in the latest update that service catalog is installed on the cluster. Is this still a testblocker?

Yes, I labeled it with "testblocker" since we hit this issue many times that day. I will remove it, but I will leave the "beta2blocker" label because we should transfer this bug to the appropriate team to find the root cause and solve it.
I just found a similar fixed PR submitted by Clayton on Kubernetes: https://github.com/kubernetes/kubernetes/pull/73277
@Clayton Could you help to process this issue? Thanks very much!

*** Bug 1669992 has been marked as a duplicate of this bug. ***

Hit the same issue with 4.0.0-0.alpha-2019-02-13-200735. It is a fresh cluster that has been running for about 1 hour.

This bug also impacts marketplace's installed CatalogSource and ConfigMap (the resources are deleted after a while).

Jian and everyone else:

I've been looking into "Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection" and, from what I've been told, it's an inconsequential log and is hidden in newer client releases. https://coreos.slack.com/archives/CEKNRGF25/p1550177624280300

> This issue not always occurs, but often. The steps:
> 1) Create a new OCP 4.0
> 2) Subscribe an operator via the OLM, such as etcd-operator. It will be created successfully.

In what namespace? In what operator group? openshift-operators?

> 3) After the cluster running for hours subscribe to an operator again, will encounter this issue, often.

Subscribe where? What operator? These are important, as we have not seen this issue once.

Anping Li
> Hit same issue with 4.0.0-0.alpha-2019-02-13-200735. It is a fresh cluster, it is running about 1 hour.

Which issue? You can't install an operator after an hour? Which operator, and in what operator group? Does the CatalogSource the new Subscription references still exist?

Fan Jia
> This bug also impact marketplace's installed catalogsource&cm (The resources will be deleted after a while )

OLM does not delete CatalogSources, so there's no chance OLM is causing marketplace's issue (if that's what you're trying to say here). Please, if you are using marketplace to generate CatalogSources in the openshift-operators namespace:
1. Make sure the CatalogSource has a status and that you can query for the packagemanifest you want to install in that namespace (`kubectl -n openshift-operators get packagemanifests`) *at the time you go to create a Subscription*. If marketplace's CatalogSources are being deleted for some totally unrelated reason, any new Subscriptions referencing those CatalogSources will not progress.
2. Ensure the Subscription's `sourceName` and `sourceNamespace` fields point to the CatalogSource in the openshift-operators namespace.
3. Check for InstallPlans, Subscriptions, and CSVs after you create a Subscription.
4. Make sure you're not trying to create a Subscription to the same operator in the same namespace.

We need actual data (status and logs) if you are still seeing this.

(In reply to Nick Hale from comment #21)

Hi Nick, in case you missed my message: the marketplace team have already found the mistake in the situation I mentioned; it is not caused by this bug.

@Nick, I filed a new bug to describe my issue; refer to https://bugzilla.redhat.com/show_bug.cgi?id=1677524.

Nick, thanks for your information!

> I've been looking into "Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection" and from what I've been told, it's an inconsequential log and is hidden in newer client releases.

Agreed; it is vendored code in the OLM component, which I pointed out in comment 13. But this is the only kind of ERROR log we can find from the OLM; it's strange that there is no other error info when the CSV cannot be installed. You can see details in comment 16.
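Nick's four-point checklist above can be sketched as a validation routine. The data shapes and the `validate_subscription` helper below are invented for illustration; the field names follow the Subscription specs shown earlier in the report, but this is not an OLM API:

```python
# Hypothetical sketch of the troubleshooting checklist as a validation
# routine. `catalog_sources` maps (name, namespace) -> set of packages served
# by a ready source; `existing_subs` is the set of (package, namespace) pairs
# already subscribed. Structure is illustrative only.

def validate_subscription(sub, catalog_sources, existing_subs):
    problems = []
    key = (sub["source"], sub["sourceNamespace"])
    packages = catalog_sources.get(key)
    if packages is None:
        # Checks 1 and 2: the referenced CatalogSource must exist and be ready.
        problems.append(f"CatalogSource {key} missing or has no status")
    elif sub["name"] not in packages:
        # Check 1: the package must be served by that source.
        problems.append(f"package {sub['name']} not served by {key}")
    if (sub["name"], sub["namespace"]) in existing_subs:
        # Check 4: no duplicate Subscription for the same operator/namespace.
        problems.append("duplicate Subscription for this operator/namespace")
    return problems

sub = {"name": "svcat", "namespace": "openshift-operators",
       "source": "community-operators", "sourceNamespace": "openshift-marketplace"}
sources = {("community-operators", "openshift-marketplace"): {"svcat", "etcd"}}

print(validate_subscription(sub, sources, existing_subs=set()))  # -> []
```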
It seems more like a platform issue; I pointed this out in comment 14, but got no response. Correct me if I'm wrong.

> In what namespace? in what operator-group? openshift-operators?

openshift-operators, global-operators.

> Subscribe where? What operator? These are important as we have not seen this issue once.

You can subscribe on the web console (Operator Hub), or use a subscription YAML file. Any of the operators we provide: I tried etcd-operator, service-catalog, couchbase-operators, etc., and all failed.

Evan, thanks for your information!

> The stated problem of operators not installing is not something we can reproduce. We have used the installer to create clusters and successfully installed operators.

Yes, understood. We cannot always reproduce this issue either. In short, only the subscription object is generated; no CSV is generated after following the basic steps to install an operator on the web console.

> We have discovered a separate issue which may be exacerbating this bug report. It appears that after some time (a couple of hours), something in an openshift cluster appears to delete some of our service accounts. We're trying to track down the source of this (it may be an issue in OLM or some other component.) It causes, among other things, the package server to fail.

Yes, we also encountered this issue on two clusters. I think it's better to create a new bug to track it. I also encountered another issue in which the APIServices are deleted after a couple of hours; maybe it has the same root cause (bug 1678606).

For this bug 1666225, we will keep an eye on it; we can verify it if we cannot reproduce it in the coming test rounds. What do you suggest?

*** This bug has been marked as a duplicate of bug 1679309 ***
Created attachment 1520677 [details]
catalog-operator-logs

Description of problem:
Failed to subscribe to mongodb-enterprise via OLM

Version-Release number of selected component (if applicable):
cluster version: 4.0.0-0.1
OLM version: 0.8.1
The OLM source code version: "io.openshift.source-repo-commit": "47482491fb29def1a3df05c3178b07de5761708f"

How reproducible:
Always

Steps to Reproduce:
1. Create a subscription in the openshift-operators project: $ oc create -f mongodb-sub.yaml
2. Check the CSV and CRD status.

Actual results:
The CSV does not exist.

Expected results:
The CSV exists; the subscription to mongodb-enterprise succeeds.

Additional info:
Change the watching namespace to "openshift-marketplace" from "openshift-operator-lifecycle-manager", as below:
mac:aws-ocp jianzhang$ oc get deployment packageserver -o yaml |grep command -A 6
      - command:
        - /bin/package-server
        - -v=4
        - --secure-port
        - "5443"
        - --global-namespace
        - openshift-marketplace

cat mongodb-sub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  generateName: mongodb-enterprise-
  namespace: openshift-operators
spec:
  source: certified-operators
  sourceNamespace: openshift-marketplace
  name: mongodb-enterprise
  startingCSV: mongodboperator.v0.3.2
  channel: preview