Bug 1979544

Summary:	olm Operator is in CrashLoopBackOff state with error "couldn't cleanup cross-namespace ownerreferences"
Product:	OpenShift Container Platform	Reporter:	Simon Reber <sreber>
Component:	OLM	Assignee:	Kevin Rizza <krizza>
OLM sub component:	OLM	QA Contact:	Jian Zhang <jiazha>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	akashem, davegord, dsover, kuiwang
Version:	4.7	Keywords:	FastFix
Target Milestone:	---
Target Release:	4.9.0
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:
Clones:	1982252		Environment:
Last Closed:	2021-10-18 17:38:02 UTC	Type:	Bug
Bug Blocks:	1982252

Description Simon Reber 2021-07-06 11:19:54 UTC

Description of problem:

The OLM Operator fails to start and ends up in the CrashLoopBackOff state. The reported error is as follows:

time="2021-07-05T14:26:12Z" level=fatal msg="couldn't cleanup cross-namespace ownerreferences" error="stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 263; INTERNAL_ERROR"

Looking at https://github.com/operator-framework/operator-lifecycle-manager/blob/20ded32d2260a8f1eeb594b9ec2147ad0134cfc6/cmd/olm/cleanup.go#L58, it seems we are querying all CSVs in all namespaces to run the cross-namespace ownerreference clean-up. Considering that the equivalent `oc get csv -A` takes about 49 seconds, we are wondering whether we are hitting a timeout.
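
If the clean-up really does list every CSV in a single request, one common way to keep each individual request short is the Kubernetes list API's `limit` and `continue` parameters (chunked listing). The Python sketch below only simulates that pattern against an in-memory list. `list_page` and `list_all` are hypothetical helpers, not OLM or client-go functions, the page size of 500 is an arbitrary assumption, and none of this is presented as the fix for this bug.

```python
def list_page(items, limit, continue_token=None):
    """Hypothetical stand-in for one LIST request using limit/continue."""
    start = int(continue_token) if continue_token else 0
    page = items[start:start + limit]
    end = start + len(page)
    # A real server returns an opaque token; here it is simply the next offset.
    next_token = str(end) if end < len(items) else None
    return page, next_token


def list_all(items, limit):
    """Collect every item page by page and count the requests made."""
    collected, requests, token = [], 0, None
    while True:
        page, token = list_page(items, limit, token)
        collected.extend(page)
        requests += 1
        if token is None:
            return collected, requests


# 17165 is the CSV count reported in the kube-apiserver trace in this bug.
csvs = [f"csv-{i}" for i in range(17165)]
collected, requests = list_all(csvs, limit=500)
print(len(collected), requests)  # 17165 35
```

Each of the 35 simulated requests returns at most 500 objects, so no single request has to carry the whole list.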

Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.7.11

How reproducible:

 - N/A

Steps to Reproduce:
1. N/A

Actual results:

No Operator can be installed, updated, or removed. This causes massive disruption because routine operations no longer work.

Expected results:

The OLM Operator should work no matter how many objects exist, so that Operator-related activity succeeds at all times.

Additional info:

Comment 3 Abu Kashem 2021-07-06 14:35:31 UTC

From the kube-apiserver log:

> 2021-07-05T14:24:24.798803918Z I0705 14:24:24.798763      21 trace.go:205] Trace[1543401196]: "List" url:/apis/operators.coreos.com/v1alpha1/clusterserviceversions,user-agent:olm/v0.0.0 (linux/amd64) kubernetes/$Format,client:100.72.3.230 (05-Jul-2021 14:23:24.795) (total time: 60003ms):
> 2021-07-05T14:24:24.798803918Z Trace[1543401196]: ---"Listing from storage done" 36101ms (14:24:00.896)
> 2021-07-05T14:24:24.798803918Z Trace[1543401196]: ---"Writing http response done" count:17165 23902ms (14:24:00.798)
> 2021-07-05T14:24:24.798803918Z Trace[1543401196]: [1m0.003158465s] [1m0.003158465s] END

- there are 17165 clusterserviceversions
- it takes 36s for the apiserver to get the list result from etcd.
- it takes ~24s for the apiserver to transform the response
- the apiserver has a hard timeout of 60s for non long running requests and "list/clusterserviceversions" is not a long running request.
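
The breakdown above can be reproduced directly from the trace lines. The following is a minimal Python sketch: `parse_trace` and `REQUEST_TIMEOUT_MS` are illustrative names rather than part of any Kubernetes tooling, and the trace text is a trimmed copy of the log quoted above.

```python
import re

# Trimmed copy of the kube-apiserver trace quoted above (timestamps removed).
TRACE = """
Trace[1543401196]: "List" url:/apis/operators.coreos.com/v1alpha1/clusterserviceversions (total time: 60003ms):
Trace[1543401196]: ---"Listing from storage done" 36101ms (14:24:00.896)
Trace[1543401196]: ---"Writing http response done" count:17165 23902ms (14:24:00.798)
"""

# The 60s limit for non-long-running requests mentioned above, in milliseconds.
REQUEST_TIMEOUT_MS = 60_000


def parse_trace(text):
    """Return (total_ms, {step_name: ms}, object_count) for one trace."""
    total = int(re.search(r"total time: (\d+)ms", text).group(1))
    steps = {
        name: int(ms)
        for name, ms in re.findall(r'---"([^"]+)"[^\n]*?(\d+)ms', text)
    }
    count = int(re.search(r"count:(\d+)", text).group(1))
    return total, steps, count


total, steps, count = parse_trace(TRACE)
print(steps["Listing from storage done"])   # 36101
print(steps["Writing http response done"])  # 23902
print(sum(steps.values()) == total)         # True
print(total >= REQUEST_TIMEOUT_MS)          # True
print(f"{total / count:.1f} ms per object")  # 3.5 ms per object
```

The two steps add up to the full 60003 ms, so the request used its entire budget before the response was completely written.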

Does a copied CSV have the same content size as the original CSV?

Comment 6 Kevin Rizza 2021-07-14 14:19:00 UTC

Upstream pull request https://github.com/operator-framework/operator-lifecycle-manager/pull/2241

This just landed downstream as part of rebasing master to upstream. Moving this to modified.

Comment 8 kuiwang 2021-07-15 00:58:37 UTC

Information from Kevin about the downstream PR:
--
There's no master pull request because we rebased downstream master with all of the upstream changes earlier today. See:

https://github.com/openshift/operator-framework-olm/pull/116
--

Comment 9 Jian Zhang 2021-07-15 09:23:23 UTC

[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-07-14-204159   True        False         12m     Cluster version is 4.9.0-0.nightly-2021-07-14-204159

[cloud-user@preserve-olm-env jian]$ oc adm release info registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-07-14-204159 --commits|grep lifecycle
  operator-lifecycle-manager                     https://github.com/openshift/operator-framework-olm                         8740cee32bc0973361238df1ae8af3f87f7d6588

[cloud-user@preserve-olm-env jian]$ oc exec olm-operator-745bb9b9ff-2zm2k -- olm --version
OLM version: 0.18.3
git commit: 8740cee32bc0973361238df1ae8af3f87f7d6588

1, Install some cluster-scoped operators.

[cloud-user@preserve-olm-env jian]$ oc get sub -A
NAMESPACE                    NAME                     PACKAGE                  SOURCE                CHANNEL
default                      etcd                     etcd                     max-operators         singlenamespace-alpha
openshift-logging            cluster-logging          cluster-logging          redhat-operators      5.0
openshift-operators-redhat   elasticsearch-operator   elasticsearch-operator   qe-app-registry       stable-5.1
openshift-operators          aws-efs-operator         aws-efs-operator         community-operators   stable
openshift-operators          eap                      eap                      redhat-operators      alpha
openshift-operators          nfd                      nfd                      qe-app-registry       4.9
openshift-operators          rhacs-operator           rhacs-operator           redhat-operators      latest
openshift-update-service     cincinnati-operator      cincinnati-operator      qe-app-registry       v1


2, Create 1000 namespaces; in the end, only 835 namespaces were created.

[cloud-user@preserve-olm-env jian]$ for l in {1..999}; do oc adm new-project "test$l";sleep 1; done;
Created project test768
Unable to connect to the server: net/http: TLS handshake timeout
Unable to connect to the server: net/http: TLS handshake timeout
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io test771)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (post projects.project.openshift.io)
Created project test773
Created project test774
Error from server: rpc error: code = Unavailable desc = transport is closing
...

[cloud-user@preserve-olm-env jian]$ oc get ns|wc -l 
835

About 2000 CSV objects.
[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l 
Error from server: rpc error: code = Unavailable desc = transport is closing
2001

OLM pods look good.
[cloud-user@preserve-olm-env jian]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5689bf895f-tklxq   1/1     Running   0          150m
olm-operator-745bb9b9ff-4htzg       1/1     Running   0          150m
packageserver-7458488579-4smsb      1/1     Running   0          145m
packageserver-7458488579-rsm4s      1/1     Running   0          145m

One master node is not ready.
[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME                                       STATUS     ROLES    AGE    VERSION
ci-ln-lhgyzzk-f76d1-5r5rn-master-0         Ready      master   152m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-master-1         NotReady   master   152m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-master-2         Ready      master   153m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-worker-a-nnw5l   Ready      worker   147m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-worker-b-kgz4v   Ready      worker   146m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-worker-c-f5bxw   Ready      worker   146m   v1.21.1+e8c1003

3, Delete the operators; the master crashed. Only 8 operators were installed, which should NOT lead to a master crash.
[cloud-user@preserve-olm-env jian]$ oc delete csv --all -n openshift-operators
Unable to connect to the server: net/http: TLS handshake timeout

We only removed the "cleanup" operation in https://github.com/operator-framework/operator-lifecycle-manager/pull/2241/files so that olm-operator can start up properly. However, the copied CSV design still causes too many requests to the master even when few resources are in use.
 

I will verify this bug since no olm-operator pods crashed (as below), but I suggest that we improve the copied CSV design to reduce unnecessary pressure on the master.

1, Install another cluster and repeat steps 1-3, but this time create 400 namespaces.

[cloud-user@preserve-olm-env jian]$ oc get sub -A
NAMESPACE                    NAME                              PACKAGE                           SOURCE             CHANNEL
openshift-compliance         compliance-operator               compliance-operator               qe-app-registry    4.6
openshift-logging            cluster-logging                   cluster-logging                   qe-app-registry    stable-5.1
openshift-operators-redhat   elasticsearch-operator            elasticsearch-operator            qe-app-registry    stable
openshift-operators          jaeger-product                    jaeger-product                    redhat-operators   stable
openshift-operators          nfd                               nfd                               qe-app-registry    4.9
openshift-operators          openshift-gitops-operator         openshift-gitops-operator         redhat-operators   stable
openshift-operators          openshift-pipelines-operator-rh   openshift-pipelines-operator-rh   redhat-operators   stable
openshift-operators          servicemeshoperator               servicemeshoperator               redhat-operators   stable

[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l
418


[cloud-user@preserve-olm-env jian]$ for l in {1..399}; do oc adm new-project "test$l";sleep 1; done;
Created project test1
...

[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l
2812
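
The two `wc -l` counts above give a rough check on how the number of CSVs grows with the number of namespaces. The Python sketch below assumes that all 399 projects in the loop were created and that every added row is a copied CSV; neither is stated in the transcript. `csv_rows_per_namespace` is an illustrative helper name.

```python
def csv_rows_per_namespace(lines_before, lines_after, namespaces_added):
    """Average number of CSV rows gained for each namespace added."""
    return (lines_after - lines_before) / namespaces_added


# Counts from the transcript above. Each `wc -l` figure includes one header
# line, which cancels out in the difference.
per_namespace = csv_rows_per_namespace(418, 2812, 399)
print(per_namespace)  # 6.0
```

Under those assumptions each new namespace gained six CSVs, which is consistent with six of the installed operators being copied into every namespace. That reading is an inference from the numbers, not something the transcript states.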


[cloud-user@preserve-olm-env jian]$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5689bf895f-nr6wg   1/1     Running   0          3h2m
olm-operator-745bb9b9ff-sqwzv       1/1     Running   0          3h2m
packageserver-58b8577d8d-gf2qc      1/1     Running   0          176m
packageserver-58b8577d8d-lbhjd      1/1     Running   0          176m

2, Install and uninstall some operators; the OLM pods work well and no pods crashed.

[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l
3280
[cloud-user@preserve-olm-env jian]$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5689bf895f-nr6wg   1/1     Running   0          3h5m
olm-operator-745bb9b9ff-sqwzv       1/1     Running   0          3h5m
packageserver-58b8577d8d-gf2qc      1/1     Running   0          179m
packageserver-58b8577d8d-lbhjd      1/1     Running   0          179m

Comment 12 errata-xmlrpc 2021-10-18 17:38:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759