Bug 1784915 - [Upgrade] 4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351 upgrade stuck at 77% - olm-operator crash looping
Summary: [Upgrade] 4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351 upgrade stuck at 77% - olm-operator crash looping
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Evan Cordell
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On: 1785674
Blocks:
 
Reported: 2019-12-18 17:35 UTC by Mike Fiedler
Modified: 2020-01-23 11:19 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1785674
Environment:
Last Closed: 2020-01-23 11:19:28 UTC
Target Upstream Version:
Embargoed:


Attachments: None


Links:
  GitHub operator-framework/operator-lifecycle-manager pull 1208 (closed): [release-4.3] Bug 1784915: fix(deploy): remove resource limits (last updated 2020-01-31 13:44:29 UTC)
  Red Hat Product Errata RHBA-2020:0062 (last updated 2020-01-23 11:19:54 UTC)

Description Mike Fiedler 2019-12-18 17:35:41 UTC
Description of problem:

4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351 upgrade on a semi-loaded (750 projects) cluster gets stuck at 77% with olm-operator crash looping.  It never makes further progress.

The nodes are still at kube 1.14.

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.12    True        True          64m     Unable to apply 4.3.0-0.nightly-2019-12-18-105351: the cluster operator operator-lifecycle-manager has not yet successfully rolled out
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-133-5.us-west-2.compute.internal     Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-134-174.us-west-2.compute.internal   Ready    master   178m   v1.14.6+cebabbf4a
ip-10-0-148-50.us-west-2.compute.internal    Ready    master   178m   v1.14.6+cebabbf4a
ip-10-0-149-188.us-west-2.compute.internal   Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-164-69.us-west-2.compute.internal    Ready    worker   172m   v1.14.6+cebabbf4a
ip-10-0-172-6.us-west-2.compute.internal     Ready    master   179m   v1.14.6+cebabbf4a


[root@ip-172-31-53-199 ~]# oc get pods -n openshift-operator-lifecycle-manager
NAME                                READY   STATUS             RESTARTS   AGE
catalog-operator-656855fb7d-2rskq   1/1     Running            2          58m
olm-operator-67bb757fd6-k5p6w       0/1     CrashLoopBackOff   15         58m
packageserver-7fc6b44b44-5lj5d      1/1     Running            0          177m
packageserver-7fc6b44b44-krfvt      1/1     Running            0          177m

[root@ip-172-31-53-199 ~]# oc logs -n openshift-operator-lifecycle-manager olm-operator-67bb757fd6-k5p6w
time="2019-12-18T17:30:34Z" level=info msg="log level info"
time="2019-12-18T17:30:34Z" level=info msg="TLS keys set, using https for metrics"
time="2019-12-18T17:30:34Z" level=info msg="Using in-cluster kube client config"
time="2019-12-18T17:30:34Z" level=info msg="Using in-cluster kube client config"
time="2019-12-18T17:30:34Z" level=info msg="OpenShift Proxy API  available - setting up watch for Proxy type"
time="2019-12-18T17:30:34Z" level=info msg="OpenShift Proxy query will be used to fetch cluster proxy configuration"
time="2019-12-18T17:30:34Z" level=info msg="connection established. cluster-version: v1.16.2"
time="2019-12-18T17:30:34Z" level=info msg="operator ready"
time="2019-12-18T17:30:34Z" level=info msg="starting informers..."
time="2019-12-18T17:30:34Z" level=info msg="informers started"
time="2019-12-18T17:30:34Z" level=info msg="waiting for caches to sync..."


Version-Release number of selected component (if applicable): 4.2.12 -> 4.3.0-0.nightly-2019-12-18-105351


How reproducible: Unknown.  1/1 so far.  Other upgrades in progress.



Additional info:

Will update with must-gather location
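For reference, a typical must-gather collection looks like the following (the destination directory here is a placeholder, not the actual location referenced above):

# Collect cluster state for debugging
oc adm must-gather --dest-dir=./must-gather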

Comment 3 Mike Fiedler 2019-12-18 19:58:00 UTC
We've upgraded larger clusters in past 4.x releases, but it does seem content-related. Upgrading an empty cluster works fine.

Project spec (per project):

6 builds
10 imagestreams
2 deployments (with 0 replicas)
10 secrets
5 routes
10 configmaps

plus the default OpenShift service accounts, secrets, etc. created for each project (an illustrative loading sketch follows below)
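A rough sketch of how a cluster could be populated with a similar number of projects and objects (project names, object counts, and contents here are hypothetical; this is not the exact tooling or templates used for this test):

# Illustrative only: create 750 projects, each with several configmaps and secrets
for i in $(seq 1 750); do
  oc new-project "scale-test-${i}" >/dev/null
  for j in $(seq 1 10); do
    oc create configmap "cm-${j}" --from-literal=key=value -n "scale-test-${i}"
    oc create secret generic "secret-${j}" --from-literal=key=value -n "scale-test-${i}"
  done
done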

Comment 4 Mike Fiedler 2019-12-18 20:56:52 UTC
Re-tried on the 750-project cluster and hit the same issue. Let me know if bisecting or getting logs at a higher debug level would help.

Comment 6 Evan Cordell 2019-12-19 16:02:28 UTC
> reason: OOMKilled

We will need to adjust the resource limits.

OLM has cache requirements that scale with the number of namespaces.
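For reference, the OOMKilled termination reason can be confirmed directly on the crash-looping pod with something like the following (pod name taken from the oc get pods output above; this is a standard jsonpath query, not anything OLM-specific):

# Show the last termination reason of the olm-operator container
oc get pod olm-operator-67bb757fd6-k5p6w \
    -n openshift-operator-lifecycle-manager \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Prints OOMKilled when the container was killed by the kernel OOM killer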

Comment 7 Mike Fiedler 2019-12-19 17:38:16 UTC
This is a blocker for upgrade at scale

Comment 8 Clayton Coleman 2019-12-19 18:12:27 UTC
How much memory are you using at 750 namespaces?
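One way to answer that, assuming the cluster metrics stack is up, is to check live pod memory usage:

# Requires cluster metrics to be available
oc adm top pods -n openshift-operator-lifecycle-manager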

Comment 13 Mike Fiedler 2020-01-02 15:58:42 UTC
Verified on 4.3.0-0.nightly-2020-01-02-081435
   - cluster upgrade described in this bz upgraded successfully from 4.2.12 -> 4.3.0-0.nightly-2020-01-02-081435
   - verified after upgrade that the olm-operator deployment has no resource limits (see the verification sketch below)
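A minimal sketch of that verification (the resources stanza prints as an empty object when no requests or limits are set on the container):

# Show the resources stanza of the olm-operator container
oc get deployment olm-operator \
    -n openshift-operator-lifecycle-manager \
    -o jsonpath='{.spec.template.spec.containers[0].resources}'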

Comment 15 errata-xmlrpc 2020-01-23 11:19:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

