Description of problem:

The OLM Operator is failing to start and ends up in CrashLoopBackOff state. The error reported is the following:

time="2021-07-05T14:26:12Z" level=fatal msg="couldn't cleanup cross-namespace ownerreferences" error="stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 263; INTERNAL_ERROR"

Looking at https://github.com/operator-framework/operator-lifecycle-manager/blob/20ded32d2260a8f1eeb594b9ec2147ad0134cfc6/cmd/olm/cleanup.go#L58 it seems we are querying all CSVs in all namespaces to run the cross-namespace ownerreference clean-up. Considering that this activity (`oc get csv -A`) takes about 49 seconds, we are wondering whether we are hitting a timeout.

Version-Release number of selected component (if applicable):
- OpenShift Container Platform 4.7.11

How reproducible:
- N/A

Steps to Reproduce:
1. N/A

Actual results:
No Operator can be installed, updated, or removed. This is causing massive disruption as generally expected operations are not working.

Expected results:
The OLM Operator works no matter how many objects are present, and Operator-related activity succeeds at all times.

Additional info:
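For context, here is a minimal sketch of what the cleanup path effectively does: a single un-paginated LIST of every CSV in every namespace, whose error is treated as fatal at startup. This uses a plain dynamic client for illustration and is not the actual OLM cleanup code:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	csvGVR := schema.GroupVersionResource{
		Group:    "operators.coreos.com",
		Version:  "v1alpha1",
		Resource: "clusterserviceversions",
	}

	// With many thousands of CSVs this single LIST can approach the
	// apiserver's request timeout; the stream is then reset and the
	// client sees "stream error ... INTERNAL_ERROR".
	csvs, err := client.Resource(csvGVR).Namespace(metav1.NamespaceAll).
		List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		// In OLM the analogous error is fatal, so the pod exits and
		// ends up in CrashLoopBackOff.
		log.Fatalf("couldn't cleanup cross-namespace ownerreferences: %v", err)
	}
	log.Printf("listed %d CSVs", len(csvs.Items))
}
```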
from kube-apiserver log:

> 2021-07-05T14:24:24.798803918Z I0705 14:24:24.798763 21 trace.go:205] Trace[1543401196]: "List" url:/apis/operators.coreos.com/v1alpha1/clusterserviceversions,user-agent:olm/v0.0.0 (linux/amd64) kubernetes/$Format,client:100.72.3.230 (05-Jul-2021 14:23:24.795) (total time: 60003ms):
> 2021-07-05T14:24:24.798803918Z Trace[1543401196]: ---"Listing from storage done" 36101ms (14:24:00.896)
> 2021-07-05T14:24:24.798803918Z Trace[1543401196]: ---"Writing http response done" count:17165 23902ms (14:24:00.798)
> 2021-07-05T14:24:24.798803918Z Trace[1543401196]: [1m0.003158465s] [1m0.003158465s] END

- there are 17165 clusterserviceversions
- it takes 36s for the apiserver to get the list result from etcd
- it takes ~24s for the apiserver to transform the response
- the apiserver has a hard timeout of 60s for non long-running requests, and "list/clusterserviceversions" is not a long-running request

Does the copied CSV have the same content size as the original CSV?
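To illustrate the 60s constraint: a client can page through a large collection with ListOptions.Limit/Continue so that no single request has to stream the entire 17k-item result within the apiserver's window. This is only a hypothetical sketch of that API mechanism, not what the affected OLM version does:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var csvGVR = schema.GroupVersionResource{
	Group:    "operators.coreos.com",
	Version:  "v1alpha1",
	Resource: "clusterserviceversions",
}

// listCSVsInChunks pages through all CSVs 500 at a time, so each request
// stays well under the apiserver's 60s limit for non long-running requests.
func listCSVsInChunks(ctx context.Context, client dynamic.Interface) (int, error) {
	total := 0
	opts := metav1.ListOptions{Limit: 500}
	for {
		page, err := client.Resource(csvGVR).Namespace(metav1.NamespaceAll).List(ctx, opts)
		if err != nil {
			return total, fmt.Errorf("listing clusterserviceversions: %w", err)
		}
		total += len(page.Items)
		// An empty continue token means the last page has been read.
		if page.GetContinue() == "" {
			return total, nil
		}
		opts.Continue = page.GetContinue()
	}
}
```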
Upstream pull request: https://github.com/operator-framework/operator-lifecycle-manager/pull/2241

This just landed downstream as part of rebasing master to upstream. Moving this to MODIFIED.
Information from Kevin on the downstream PR:

-- There's no master pull request because we rebased downstream master with all of the upstream changes earlier today. See: https://github.com/openshift/operator-framework-olm/pull/116 --
[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-07-14-204159   True        False         12m     Cluster version is 4.9.0-0.nightly-2021-07-14-204159

[cloud-user@preserve-olm-env jian]$ oc adm release info registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-07-14-204159 --commits|grep lifecycle
operator-lifecycle-manager   https://github.com/openshift/operator-framework-olm   8740cee32bc0973361238df1ae8af3f87f7d6588

[cloud-user@preserve-olm-env jian]$ oc exec olm-operator-745bb9b9ff-2zm2k -- olm --version
OLM version: 0.18.3
git commit: 8740cee32bc0973361238df1ae8af3f87f7d6588

1. Install some operators cluster-wide and in single namespaces.

[cloud-user@preserve-olm-env jian]$ oc get sub -A
NAMESPACE                    NAME                     PACKAGE                  SOURCE                CHANNEL
default                      etcd                     etcd                     max-operators         singlenamespace-alpha
openshift-logging            cluster-logging          cluster-logging          redhat-operators      5.0
openshift-operators-redhat   elasticsearch-operator   elasticsearch-operator   qe-app-registry       stable-5.1
openshift-operators          aws-efs-operator         aws-efs-operator         community-operators   stable
openshift-operators          eap                      eap                      redhat-operators      alpha
openshift-operators          nfd                      nfd                      qe-app-registry       4.9
openshift-operators          rhacs-operator           rhacs-operator           redhat-operators      latest
openshift-update-service     cincinnati-operator      cincinnati-operator      qe-app-registry       v1

2. Create 1000 namespaces; in the end, only 835 namespaces were created.

[cloud-user@preserve-olm-env jian]$ for l in {1..999}; do oc adm new-project "test$l";sleep 1; done;
Created project test768
Unable to connect to the server: net/http: TLS handshake timeout
Unable to connect to the server: net/http: TLS handshake timeout
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io test771)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (post projects.project.openshift.io)
Created project test773
Created project test774
Error from server: rpc error: code = Unavailable desc = transport is closing
...
[cloud-user@preserve-olm-env jian]$ oc get ns|wc -l
835

About 2000 CSV objects:

[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l
Error from server: rpc error: code = Unavailable desc = transport is closing
2001

OLM pods look good:

[cloud-user@preserve-olm-env jian]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5689bf895f-tklxq   1/1     Running   0          150m
olm-operator-745bb9b9ff-4htzg       1/1     Running   0          150m
packageserver-7458488579-4smsb      1/1     Running   0          145m
packageserver-7458488579-rsm4s      1/1     Running   0          145m

One master node is not ready:

[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME                                       STATUS     ROLES    AGE    VERSION
ci-ln-lhgyzzk-f76d1-5r5rn-master-0         Ready      master   152m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-master-1         NotReady   master   152m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-master-2         Ready      master   153m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-worker-a-nnw5l   Ready      worker   147m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-worker-b-kgz4v   Ready      worker   146m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-worker-c-f5bxw   Ready      worker   146m   v1.21.1+e8c1003

3. Delete operators, but the master crashed. Only 8 operators were installed; that should NOT lead to a master crash.
[cloud-user@preserve-olm-env jian]$ oc delete csv --all -n openshift-operators
Unable to connect to the server: net/http: TLS handshake timeout

We only removed the "cleanup" operation in https://github.com/operator-framework/operator-lifecycle-manager/pull/2241/files so that the olm-operator can start up correctly. However, this copied-CSV design still generates too many requests to the master even when few resources are actually in use. I will verify this bug since no olm-operator pods crashed (see below), but I agree/suggest that we should improve the copied-CSV design to reduce unnecessary pressure on the master.

1. Reinstall another cluster and repeat steps 1-3, but this time create 400 namespaces.

[cloud-user@preserve-olm-env jian]$ oc get sub -A
NAMESPACE                    NAME                              PACKAGE                           SOURCE             CHANNEL
openshift-compliance         compliance-operator               compliance-operator               qe-app-registry    4.6
openshift-logging            cluster-logging                   cluster-logging                   qe-app-registry    stable-5.1
openshift-operators-redhat   elasticsearch-operator            elasticsearch-operator            qe-app-registry    stable
openshift-operators          jaeger-product                    jaeger-product                    redhat-operators   stable
openshift-operators          nfd                               nfd                               qe-app-registry    4.9
openshift-operators          openshift-gitops-operator         openshift-gitops-operator         redhat-operators   stable
openshift-operators          openshift-pipelines-operator-rh   openshift-pipelines-operator-rh   redhat-operators   stable
openshift-operators          servicemeshoperator               servicemeshoperator               redhat-operators   stable

[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l
418

[cloud-user@preserve-olm-env jian]$ for l in {1..399}; do oc adm new-project "test$l";sleep 1; done;
Created project test1
...
[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l
2812

[cloud-user@preserve-olm-env jian]$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5689bf895f-nr6wg   1/1     Running   0          3h2m
olm-operator-745bb9b9ff-sqwzv       1/1     Running   0          3h2m
packageserver-58b8577d8d-gf2qc      1/1     Running   0          176m
packageserver-58b8577d8d-lbhjd      1/1     Running   0          176m

2. Install and uninstall some operators; the OLM pods work well, no pods crashed.

[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l
3280

[cloud-user@preserve-olm-env jian]$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5689bf895f-nr6wg   1/1     Running   0          3h5m
olm-operator-745bb9b9ff-sqwzv       1/1     Running   0          3h5m
packageserver-58b8577d8d-gf2qc      1/1     Running   0          179m
packageserver-58b8577d8d-lbhjd      1/1     Running   0          179m
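Rough arithmetic on the CSV growth above (assuming most of the 8 operators are installed for all namespaces, so their CSVs are copied into every namespace): about 8 copied CSVs per namespace × ~400 namespaces ≈ 3200 CSVs, which lines up with the ~3280 counted after the reinstall/uninstall step. That is why a handful of operators plus many namespaces quickly produces thousands of CSV objects for the apiserver to list.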
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759