Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1784915

Summary: [Upgrade] 4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351 upgrade stuck at 77% - olm-operator crash looping
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: OLM
Assignee: Evan Cordell <ecordell>
OLM sub component: OLM
QA Contact: Mike Fiedler <mifiedle>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: ccoleman, dsover, ecordell, erich, lmohanty, wking
Version: 4.3.0
Keywords: TestBlocker
Target Milestone: ---   
Target Release: 4.3.0   
Hardware: Unspecified   
OS: Unspecified   
Cloned As: 1785674 (view as bug list)
Last Closed: 2020-01-23 11:19:28 UTC
Type: Bug
Bug Depends On: 1785674    

Description Mike Fiedler 2019-12-18 17:35:41 UTC
Description of problem:

A 4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351 upgrade on a semi-loaded cluster (750 projects) gets stuck at 77% with the olm-operator crash looping. It never makes further progress.

The nodes are still at kube 1.14.

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.12    True        True          64m     Unable to apply 4.3.0-0.nightly-2019-12-18-105351: the cluster operator operator-lifecycle-manager has not yet successfully rolled out
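
The clusterversion message already names the blocking operator; its status conditions can be inspected directly with something like:

$ oc get clusteroperator operator-lifecycle-manager -o yaml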

# oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-133-5.us-west-2.compute.internal     Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-134-174.us-west-2.compute.internal   Ready    master   178m   v1.14.6+cebabbf4a
ip-10-0-148-50.us-west-2.compute.internal    Ready    master   178m   v1.14.6+cebabbf4a
ip-10-0-149-188.us-west-2.compute.internal   Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-164-69.us-west-2.compute.internal    Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-172-6.us-west-2.compute.internal     Ready    master   179m   v1.14.6+cebabbf4a


[root@ip-172-31-53-199 ~]# oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS             RESTARTS   AGE
catalog-operator-656855fb7d-2rskq   1/1     Running            2          58m
olm-operator-67bb757fd6-k5p6w       0/1     CrashLoopBackOff   15         58m
packageserver-7fc6b44b44-5lj5d      1/1     Running            0          177m
packageserver-7fc6b44b44-krfvt      1/1     Running            0          177m

[root@ip-172-31-53-199 ~]# oc logs -n openshift-operator-lifecycle-manager olm-operator-67bb757fd6-k5p6w
time="2019-12-18T17:30:34Z" level=info msg="log level info"
time="2019-12-18T17:30:34Z" level=info msg="TLS keys set, using https for metrics"
time="2019-12-18T17:30:34Z" level=info msg="Using in-cluster kube client config"
time="2019-12-18T17:30:34Z" level=info msg="Using in-cluster kube client config"
time="2019-12-18T17:30:34Z" level=info msg="OpenShift Proxy API  available - setting up watch for Proxy type"
time="2019-12-18T17:30:34Z" level=info msg="OpenShift Proxy query will be used to fetch cluster proxy configuration"
time="2019-12-18T17:30:34Z" level=info msg="connection established. cluster-version: v1.16.2"
time="2019-12-18T17:30:34Z" level=info msg="operator ready"
time="2019-12-18T17:30:34Z" level=info msg="starting informers..."
time="2019-12-18T17:30:34Z" level=info msg="informers started"
time="2019-12-18T17:30:34Z" level=info msg="waiting for caches to sync..."


Version-Release number of selected component (if applicable): 4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351


How reproducible: Unknown.  1/1 so far.  Other upgrades in progress.



Additional info:

Will update with must-gather location

Comment 3 Mike Fiedler 2019-12-18 19:58:00 UTC
We've upgraded larger clusters in past 4.x releases, but it does seem content-related. Upgrading an empty cluster works fine.

Project spec (per project):

6 builds
10 imagestreams
2 deployments (with 0 replicas)
10 secrets
5 routes
10 configmaps

plus the default OpenShift serviceaccounts, secrets, etc. created for each project
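
A rough sketch of populating a cluster to this shape (hypothetical project names, and only the configmaps and secrets are shown; the actual load tooling isn't captured here):

# create 750 projects, each with 10 configmaps and 10 secrets
for i in $(seq 1 750); do
  oc new-project "svt-$i"
  for j in $(seq 1 10); do
    oc -n "svt-$i" create configmap "cm-$j" --from-literal=key=value
    oc -n "svt-$i" create secret generic "secret-$j" --from-literal=key=value
  done
done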

Comment 4 Mike Fiedler 2019-12-18 20:56:52 UTC
Re-tried on the 750-project cluster and hit the same issue. Let me know if bisecting or gathering logs at a higher debug level would help.

Comment 6 Evan Cordell 2019-12-19 16:02:28 UTC
> reason: OOMKilled

We will need to adjust the resource limits.

OLM has cache requirements that scale with the number of namespaces.
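
As a sketch, dropping the limit on the deployment would look something like the line below; note that the deployment is managed by the cluster-version operator, which would revert a manual patch, so the real fix has to land in the release payload:

$ oc -n openshift-operator-lifecycle-manager patch deployment olm-operator \
    --type=json -p '[{"op":"remove","path":"/spec/template/spec/containers/0/resources/limits"}]'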

Comment 7 Mike Fiedler 2019-12-19 17:38:16 UTC
This is a blocker for upgrade at scale

Comment 8 Clayton Coleman 2019-12-19 18:12:27 UTC
How much memory are you using at 750 namespaces?
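
For reference, live usage can be read with something like the following, assuming cluster metrics are available:

$ oc adm top pod -n openshift-operator-lifecycle-manager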

Comment 13 Mike Fiedler 2020-01-02 15:58:42 UTC
Verified on 4.3.0-0.nightly-2020-01-02-081435
   - the cluster upgrade described in this bz completed successfully from 4.2.12 -> 4.3.0-0.nightly-2020-01-02-081435
   - verified after upgrade that the olm-operator deployment has no resource limits
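
The absence of limits can be confirmed with something like:

$ oc -n openshift-operator-lifecycle-manager get deployment olm-operator \
    -o jsonpath='{.spec.template.spec.containers[0].resources}'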

Comment 15 errata-xmlrpc 2020-01-23 11:19:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062