Bug 1979544
Summary: | olm Operator is in CrashLoopBackOff state with error "couldn't cleanup cross-namespace ownerreferences" | | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Simon Reber <sreber> |
Component: | OLM | Assignee: | Kevin Rizza <krizza> |
OLM sub component: | OLM | QA Contact: | Jian Zhang <jiazha> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | | |
Priority: | urgent | CC: | akashem, davegord, dsover, kuiwang |
Version: | 4.7 | Keywords: | FastFix |
Target Milestone: | --- | | |
Target Release: | 4.9.0 | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | No Doc Update |
Doc Text: | | Story Points: | --- |
Clone Of: | | | |
: | 1982252 | Environment: | |
Last Closed: | 2021-10-18 17:38:02 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1982252 | | |
Description
Simon Reber
2021-07-06 11:19:54 UTC
From the kube-apiserver log:
> 2021-07-05T14:24:24.798803918Z I0705 14:24:24.798763 21 trace.go:205] Trace[1543401196]: "List" url:/apis/operators.coreos.com/v1alpha1/clusterserviceversions,user-agent:olm/v0.0.0 (linux/amd64) kubernetes/$Format,client:100.72.3.230 (05-Jul-2021 14:23:24.795) (total time: 60003ms):
> 2021-07-05T14:24:24.798803918Z Trace[1543401196]: ---"Listing from storage done" 36101ms (14:24:00.896)
> 2021-07-05T14:24:24.798803918Z Trace[1543401196]: ---"Writing http response done" count:17165 23902ms (14:24:00.798)
> 2021-07-05T14:24:24.798803918Z Trace[1543401196]: [1m0.003158465s] [1m0.003158465s] END
- There are 17165 clusterserviceversions (CSVs).
- It takes ~36s for the apiserver to get the list result from etcd.
- It takes ~24s for the apiserver to transform and write the response.
- The apiserver has a hard timeout of 60s for non-long-running requests, and "list/clusterserviceversions" is not a long-running request, so the 60003ms request above ran into that limit.
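The timing breakdown can be double-checked with a few lines of shell arithmetic. This is only a sketch: the variable names are illustrative, the millisecond values and the object count are copied from the trace lines above, and the 60000 ms limit is the hard timeout described in the last bullet.

```shell
# Durations copied from the kube-apiserver trace (milliseconds).
storage_ms=36101      # "Listing from storage done"
response_ms=23902     # "Writing http response done"
csv_count=17165       # "count:17165" in the trace
timeout_ms=60000      # hard timeout for non-long-running requests (per the analysis above)

# Sum of the two traced steps and the average cost per returned CSV.
total_ms=$((storage_ms + response_ms))
per_csv_us=$((total_ms * 1000 / csv_count))   # integer microseconds per CSV

echo "total: ${total_ms} ms (limit: ${timeout_ms} ms)"
echo "average cost per CSV: ~${per_csv_us} us"

# Compare the total against the hard timeout.
if [ "$total_ms" -ge "$timeout_ms" ]; then
  verdict="timeout"
else
  verdict="ok"
fi
echo "verdict: ${verdict}"
```

The two steps add up to the 60003 ms total reported in the first trace line, so a single unpaginated list of this many CSVs is enough to exhaust the request budget.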
Does the copied CSV have the same content size as the original CSV?
Upstream pull request: https://github.com/operator-framework/operator-lifecycle-manager/pull/2241

This just landed downstream as part of rebasing master to upstream. Moving this to MODIFIED.

Information from Kevin on the downstream PR:
-- There's no master pull request because we rebased downstream master with all of the upstream changes earlier today. See: https://github.com/openshift/operator-framework-olm/pull/116 --

[cloud-user@preserve-olm-env jian]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-07-14-204159   True        False         12m     Cluster version is 4.9.0-0.nightly-2021-07-14-204159

[cloud-user@preserve-olm-env jian]$ oc adm release info registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-07-14-204159 --commits|grep lifecycle
operator-lifecycle-manager   https://github.com/openshift/operator-framework-olm   8740cee32bc0973361238df1ae8af3f87f7d6588

[cloud-user@preserve-olm-env jian]$ oc exec olm-operator-745bb9b9ff-2zm2k -- olm --version
OLM version: 0.18.3
git commit: 8740cee32bc0973361238df1ae8af3f87f7d6588

1, Install some cluster-scoped operators.

[cloud-user@preserve-olm-env jian]$ oc get sub -A
NAMESPACE                    NAME                     PACKAGE                  SOURCE                CHANNEL
default                      etcd                     etcd                     max-operators         singlenamespace-alpha
openshift-logging            cluster-logging          cluster-logging          redhat-operators      5.0
openshift-operators-redhat   elasticsearch-operator   elasticsearch-operator   qe-app-registry       stable-5.1
openshift-operators          aws-efs-operator         aws-efs-operator         community-operators   stable
openshift-operators          eap                      eap                      redhat-operators      alpha
openshift-operators          nfd                      nfd                      qe-app-registry       4.9
openshift-operators          rhacs-operator           rhacs-operator           redhat-operators      latest
openshift-update-service     cincinnati-operator      cincinnati-operator      qe-app-registry       v1

2, Create 1000 namespaces; in the end, only 835 namespaces were created.
[cloud-user@preserve-olm-env jian]$ for l in {1..999}; do oc adm new-project "test$l";sleep 1; done;
Created project test768
Unable to connect to the server: net/http: TLS handshake timeout
Unable to connect to the server: net/http: TLS handshake timeout
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io test771)
Error from server (ServiceUnavailable): the server is currently unable to handle the request (post projects.project.openshift.io)
Created project test773
Created project test774
Error from server: rpc error: code = Unavailable desc = transport is closing
...

[cloud-user@preserve-olm-env jian]$ oc get ns|wc -l
835

There are about 2000 CSV objects.

[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l
Error from server: rpc error: code = Unavailable desc = transport is closing
2001

The OLM pods look good.

[cloud-user@preserve-olm-env jian]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5689bf895f-tklxq   1/1     Running   0          150m
olm-operator-745bb9b9ff-4htzg       1/1     Running   0          150m
packageserver-7458488579-4smsb      1/1     Running   0          145m
packageserver-7458488579-rsm4s      1/1     Running   0          145m

One master node is not ready.

[cloud-user@preserve-olm-env jian]$ oc get nodes
NAME                                       STATUS     ROLES    AGE    VERSION
ci-ln-lhgyzzk-f76d1-5r5rn-master-0         Ready      master   152m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-master-1         NotReady   master   152m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-master-2         Ready      master   153m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-worker-a-nnw5l   Ready      worker   147m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-worker-b-kgz4v   Ready      worker   146m   v1.21.1+e8c1003
ci-ln-lhgyzzk-f76d1-5r5rn-worker-c-f5bxw   Ready      worker   146m   v1.21.1+e8c1003

3, Delete the operators; the master crashed. Only 8 operators were installed, which should NOT lead to a master crash.
[cloud-user@preserve-olm-env jian]$ oc delete csv --all -n openshift-operators
Unable to connect to the server: net/http: TLS handshake timeout

We only removed the "cleanup" operation in https://github.com/operator-framework/operator-lifecycle-manager/pull/2241/files so that olm-operator can start up properly. However, the copied CSV design still causes too many requests to the master even when not many resources are in use. I will verify this bug since no olm-operator pods crashed (as below), but I suggest that we improve the copied CSV design to reduce unnecessary pressure on the master.

1, Install another cluster and repeat steps 1-3, but this time create 400 namespaces.

[cloud-user@preserve-olm-env jian]$ oc get sub -A
NAMESPACE                    NAME                              PACKAGE                           SOURCE             CHANNEL
openshift-compliance         compliance-operator               compliance-operator               qe-app-registry    4.6
openshift-logging            cluster-logging                   cluster-logging                   qe-app-registry    stable-5.1
openshift-operators-redhat   elasticsearch-operator            elasticsearch-operator            qe-app-registry    stable
openshift-operators          jaeger-product                    jaeger-product                    redhat-operators   stable
openshift-operators          nfd                               nfd                               qe-app-registry    4.9
openshift-operators          openshift-gitops-operator         openshift-gitops-operator         redhat-operators   stable
openshift-operators          openshift-pipelines-operator-rh   openshift-pipelines-operator-rh   redhat-operators   stable
openshift-operators          servicemeshoperator               servicemeshoperator               redhat-operators   stable

[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l
418

[cloud-user@preserve-olm-env jian]$ for l in {1..399}; do oc adm new-project "test$l";sleep 1; done;
Created project test1
...
[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l
2812

[cloud-user@preserve-olm-env jian]$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5689bf895f-nr6wg   1/1     Running   0          3h2m
olm-operator-745bb9b9ff-sqwzv       1/1     Running   0          3h2m
packageserver-58b8577d8d-gf2qc      1/1     Running   0          176m
packageserver-58b8577d8d-lbhjd      1/1     Running   0          176m

2, Install and uninstall some operators. The OLM pods work well; no pods crashed.

[cloud-user@preserve-olm-env jian]$ oc get csv -A|wc -l
3280

[cloud-user@preserve-olm-env jian]$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-5689bf895f-nr6wg   1/1     Running   0          3h5m
olm-operator-745bb9b9ff-sqwzv       1/1     Running   0          3h5m
packageserver-58b8577d8d-gf2qc      1/1     Running   0          179m
packageserver-58b8577d8d-lbhjd      1/1     Running   0          179m

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
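
As a cross-check of the second verification run, the growth in CSV objects can be worked out from the reported `wc -l` counts (418 before and 2812 after creating namespaces test1..test399). The sketch below uses illustrative variable names; reading the result as "CSVs copied into each new namespace" is an inference from the numbers, not something the report states.

```shell
# Line counts reported by `oc get csv -A | wc -l` in the second run.
# Both counts include the same header line, so it cancels out in the difference.
before_lines=418      # before the namespace loop
after_lines=2812      # after the namespace loop
new_namespaces=399    # for l in {1..399}

added=$((after_lines - before_lines))
per_namespace=$((added / new_namespaces))
remainder=$((added % new_namespaces))

echo "CSV objects added: ${added}"
echo "added per new namespace: ${per_namespace} (remainder ${remainder})"
```

The difference divides evenly by the number of new namespaces, which is consistent with the copied CSV behavior discussed above: the object count grows with the number of namespaces multiplied by the number of cluster-scoped operators, even though only a handful of operators are installed.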