Bug 1784915 - [Upgrade] 4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351 upgrade stuck at 77% - olm-operator crash looping
Summary: [Upgrade] 4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351 upgrade stuck at 77% - olm-operator crash looping
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Evan Cordell
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On: 1785674
Blocks:
 
Reported: 2019-12-18 17:35 UTC by Mike Fiedler
Modified: 2020-01-23 11:19 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1785674
Environment:
Last Closed: 2020-01-23 11:19:28 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links:
  GitHub operator-framework/operator-lifecycle-manager pull 1208 (closed): [release-4.3] Bug 1784915: fix(deploy): remove resource limits (last updated 2020-01-31 13:44:29 UTC)
  Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:19:54 UTC)

Description Mike Fiedler 2019-12-18 17:35:41 UTC
Description of problem:

4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351 upgrade on a semi-loaded (750 projects) cluster gets stuck at 77% with olm-operator crash looping.  It never makes further progress.

The nodes are still at kube 1.14.

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.12    True        True          64m     Unable to apply 4.3.0-0.nightly-2019-12-18-105351: the cluster operator operator-lifecycle-manager has not yet successfully rolled out
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-133-5.us-west-2.compute.internal     Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-134-174.us-west-2.compute.internal   Ready    master   178m   v1.14.6+cebabbf4a
ip-10-0-148-50.us-west-2.compute.internal    Ready    master   178m   v1.14.6+cebabbf4a
ip-10-0-149-188.us-west-2.compute.internal   Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-164-69.us-west-2.compute.internal    Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-172-6.us-west-2.compute.internal     Ready    master   179m   v1.14.6+cebabbf4a


[root@ip-172-31-53-199 ~]# oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS             RESTARTS   AGE
catalog-operator-656855fb7d-2rskq   1/1     Running            2          58m
olm-operator-67bb757fd6-k5p6w       0/1     CrashLoopBackOff   15         58m
packageserver-7fc6b44b44-5lj5d      1/1     Running            0          177m
packageserver-7fc6b44b44-krfvt      1/1     Running            0          177m

[root@ip-172-31-53-199 ~]# oc logs -n openshift-operator-lifecycle-manager olm-operator-67bb757fd6-k5p6w
time="2019-12-18T17:30:34Z" level=info msg="log level info"
time="2019-12-18T17:30:34Z" level=info msg="TLS keys set, using https for metrics"
time="2019-12-18T17:30:34Z" level=info msg="Using in-cluster kube client config"
time="2019-12-18T17:30:34Z" level=info msg="Using in-cluster kube client config"
time="2019-12-18T17:30:34Z" level=info msg="OpenShift Proxy API  available - setting up watch for Proxy type"
time="2019-12-18T17:30:34Z" level=info msg="OpenShift Proxy query will be used to fetch cluster proxy configuration"
time="2019-12-18T17:30:34Z" level=info msg="connection established. cluster-version: v1.16.2"
time="2019-12-18T17:30:34Z" level=info msg="operator ready"
time="2019-12-18T17:30:34Z" level=info msg="starting informers..."
time="2019-12-18T17:30:34Z" level=info msg="informers started"
time="2019-12-18T17:30:34Z" level=info msg="waiting for caches to sync..."


Version-Release number of selected component (if applicable): 4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351


How reproducible: Unknown.  1/1 so far.  Other upgrades in progress.



Additional info:

Will update with must-gather location
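For reference, a typical must-gather collection looks like the following (the destination directory here is a placeholder, not the actual location referenced above):

# Collect cluster state for debugging
oc adm must-gather --dest-dir=./must-gather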

Comment 3 Mike Fiedler 2019-12-18 19:58:00 UTC
We've upgraded larger clusters in past 4.x releases, but it does seem content-related. Upgrading an empty cluster works fine.

Project spec (per project):

6 builds
10 imagestreams
2 deployments (with 0 replicas)
10 secrets
5 routes
10 configmaps

plus the default OpenShift service accounts, secrets, etc. created for each project (an illustrative loading sketch follows below)
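A rough sketch of how a cluster could be populated with a similar number of projects and objects (project names, object counts, and contents here are hypothetical; this is not the exact tooling or templates used for this test):

# Illustrative only: create 750 projects, each with several configmaps and secrets
for i in $(seq 1 750); do
  oc new-project "scale-test-${i}" >/dev/null
  for j in $(seq 1 10); do
    oc create configmap "cm-${j}" --from-literal=key=value -n "scale-test-${i}"
    oc create secret generic "secret-${j}" --from-literal=key=value -n "scale-test-${i}"
  done
done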

Comment 4 Mike Fiedler 2019-12-18 20:56:52 UTC
Re-tried on the 750-project cluster and hit the same issue. Let me know if bisecting or getting logs at a higher debug level would help.

Comment 6 Evan Cordell 2019-12-19 16:02:28 UTC
> reason: OOMKilled

We will need to adjust the resource limits.

OLM has cache requirements that scale with the number of namespaces.
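For reference, the OOMKilled termination reason can be confirmed directly on the crash-looping pod with something like the following (pod name taken from the oc get pods output above; this is a standard jsonpath query, not anything OLM-specific):

# Show the last termination reason of the olm-operator container
oc get pod olm-operator-67bb757fd6-k5p6w \
    -n openshift-operator-lifecycle-manager \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Prints OOMKilled when the container was killed by the kernel OOM killer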

Comment 7 Mike Fiedler 2019-12-19 17:38:16 UTC
This is a blocker for upgrade at scale

Comment 8 Clayton Coleman 2019-12-19 18:12:27 UTC
How much memory are you using at 750 namespaces?
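One way to answer that, assuming the cluster metrics stack is up, is to check live pod memory usage:

# Requires cluster metrics to be available
oc adm top pods -n openshift-operator-lifecycle-manager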

Comment 13 Mike Fiedler 2020-01-02 15:58:42 UTC
Verified on 4.3.0-0.nightly-2020-01-02-081435
   - cluster upgrade described in this bz upgraded successfully from 4.2.12 -> 4.3.0-0.nightly-2020-01-02-081435
   - verified after upgrade that the olm-operator deployment has no resource limits (see the verification sketch below)
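A minimal sketch of that verification (the resources stanza prints as an empty object when no requests or limits are set on the container):

# Show the resources stanza of the olm-operator container
oc get deployment olm-operator \
    -n openshift-operator-lifecycle-manager \
    -o jsonpath='{.spec.template.spec.containers[0].resources}'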

Comment 15 errata-xmlrpc 2020-01-23 11:19:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

