Description of problem:
4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351 upgrade on a semi-loaded (750 projects) cluster gets stuck at 77% with olm-operator crash looping. It never makes further progress. The nodes are still at kube 1.14.

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.12    True        True          64m     Unable to apply 4.3.0-0.nightly-2019-12-18-105351: the cluster operator operator-lifecycle-manager has not yet successfully rolled out

NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-133-5.us-west-2.compute.internal     Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-134-174.us-west-2.compute.internal   Ready    master   178m   v1.14.6+cebabbf4a
ip-10-0-148-50.us-west-2.compute.internal    Ready    master   178m   v1.14.6+cebabbf4a
ip-10-0-149-188.us-west-2.compute.internal   Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-164-69.us-west-2.compute.internal    Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-172-6.us-west-2.compute.internal     Ready    master   179m   v1.14.6+cebabbf4a

[root@ip-172-31-53-199 ~]# oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS             RESTARTS   AGE
catalog-operator-656855fb7d-2rskq   1/1     Running            2          58m
olm-operator-67bb757fd6-k5p6w       0/1     CrashLoopBackOff   15         58m
packageserver-7fc6b44b44-5lj5d      1/1     Running            0          177m
packageserver-7fc6b44b44-krfvt      1/1     Running            0          177m

[root@ip-172-31-53-199 ~]# oc logs -n openshift-operator-lifecycle-manager olm-operator-67bb757fd6-k5p6w
time="2019-12-18T17:30:34Z" level=info msg="log level info"
time="2019-12-18T17:30:34Z" level=info msg="TLS keys set, using https for metrics"
time="2019-12-18T17:30:34Z" level=info msg="Using in-cluster kube client config"
time="2019-12-18T17:30:34Z" level=info msg="Using in-cluster kube client config"
time="2019-12-18T17:30:34Z" level=info msg="OpenShift Proxy API available - setting up watch for Proxy type"
time="2019-12-18T17:30:34Z" level=info msg="OpenShift Proxy query will be used to fetch cluster proxy configuration"
time="2019-12-18T17:30:34Z" level=info msg="connection established. cluster-version: v1.16.2"
time="2019-12-18T17:30:34Z" level=info msg="operator ready"
time="2019-12-18T17:30:34Z" level=info msg="starting informers..."
time="2019-12-18T17:30:34Z" level=info msg="informers started"
time="2019-12-18T17:30:34Z" level=info msg="waiting for caches to sync..."

Version-Release number of selected component (if applicable):
4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351

How reproducible:
Unknown. 1/1 so far. Other upgrades in progress.

Additional info:
Will update with must-gather location
We've upgraded larger clusters in past 4.x releases, but this does seem content-related. Upgrading an empty cluster works fine.

Project spec (per project):
6 builds
10 imagestreams
2 deployments (with 0 replicas)
10 secrets
5 routes
10 configmaps
plus the default OpenShift serviceaccounts, secrets, etc. created for each project (see the sketch below)
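For reference, a rough sketch of how a comparable load could be generated. The object contents here are placeholders, not the exact definitions used on this cluster, and only configmaps and secrets are shown:

#!/bin/bash
# Sketch only: create 750 projects, each with a handful of simple objects,
# to approximate the per-project content listed above.
for i in $(seq 1 750); do
  oc new-project "load-test-${i}"
  for j in $(seq 1 10); do
    oc -n "load-test-${i}" create configmap "cm-${j}" --from-literal=key=value
    oc -n "load-test-${i}" create secret generic "secret-${j}" --from-literal=key=value
  done
done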
Retried on the 750-project cluster and hit the same issue. Let me know if bisecting or getting logs at a higher debug level would help.
> reason: OOMKilled

We will need to adjust the resource limits. OLM has cache requirements that scale with the number of namespaces.
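For reference, a couple of commands that could confirm the OOMKill and show the current requests/limits on the deployment (a sketch; output formatting may differ by oc version):

oc -n openshift-operator-lifecycle-manager get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
oc -n openshift-operator-lifecycle-manager get deployment olm-operator -o jsonpath='{.spec.template.spec.containers[*].resources}'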
This is a blocker for upgrades at scale.
How much memory are you using at 750 namespaces?
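One way to capture this, assuming cluster metrics are available:

oc adm top pods -n openshift-operator-lifecycle-manager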
Verified on 4.3.0-0.nightly-2020-01-02-081435. The cluster upgrade described in this bz completed successfully from 4.2.12 -> 4.3.0-0.nightly-2020-01-02-081435, and after the upgrade the olm-operator deployment has no resource limits.
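One way to check this after the upgrade (a sketch; empty output is expected when no limits are set):

oc -n openshift-operator-lifecycle-manager get deployment olm-operator -o jsonpath='{.spec.template.spec.containers[*].resources.limits}'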
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062