Bug 2006773

Summary: OLM unable to recover from failing operator release
Product: OpenShift Container Platform
Component: OLM
Sub Component: OLM
Assignee: Kevin Rizza <krizza>
Reporter: Nico Schieder <nschiede>
QA Contact: Jian Zhang <jiazha>
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
CC: jgwosdz, nschiede, patmarti, sirkal
Version: 4.8
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-09-23 13:14:08 UTC
Type: Bug
Regression: ---

Description Nico Schieder 2021-09-22 11:33:22 UTC
Description of problem:
OLM is unable to recover automatically after a release failed to roll out.

Context for the problem:
I am part of the MT-SRE team and we are rolling out operator updates to a fleet of OpenShift Dedicated clusters.
The operators that we are responsible for are managed and SRE-ed by Red Hat.

As we roll out updates to a whole fleet of clusters, requiring manual intervention on every single cluster could take days or weeks to recover from, so it is very important to us that this case is handled within OLM.

Within SRE teams this situation is commonly referred to as the "OLM Dance", the manual procedure required to get services back into an operational state.

Version-Release number of selected component (if applicable):
Provider: AWS
OpenShift version: 4.8.11
Update channel: stable-4.8

How reproducible: Always

Steps to Reproduce:
1. Successfully install an operator (e.g. v0.2.0) on a cluster, with the Subscription's install plan approval set to Automatic.
2. Release an Operator Bundle (e.g. v0.3.0) that uses wrong image digests or has readiness/liveness probes that never pass.
3. Observe OLM failing to roll out the new release.
4. Release a new version (e.g. v0.3.1) that skips the previous version (v0.3.0) and provides an upgrade path from the broken release.
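Step 1 can be reproduced with a Subscription like the following. This is a minimal sketch; the catalog source name is hypothetical, and the package/namespace names are taken from the log further below:

```shell
# Install the operator with automatic InstallPlan approval (step 1).
# "my-catalog" is a placeholder CatalogSource name.
oc apply -f - <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: addon-operator
  namespace: redhat-addon-operator
spec:
  channel: stable
  name: addon-operator
  source: my-catalog
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic
  startingCSV: addon-operator.v0.2.0
EOF
```

With `installPlanApproval: Automatic`, OLM approves and applies each new InstallPlan without operator intervention, which is what makes the fleet-wide rollout hands-off in the first place.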

Actual results:
OLM is stuck waiting for manual intervention to reinstall the Operator.

Expected results:
OLM rolls back to the previous working version (v0.2.0) after the progress deadline is exceeded.
OLM rolls forward once a new upgrade path from the failed v0.3.0 to v0.3.1 becomes available.
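The roll-forward in step 4 relies on the replacement chain declared in the CSV. A sketch of the relevant metadata for the fixed v0.3.1 release (names taken from the log below; the exact bundle layout is an assumption):

```shell
# Sketch of the CSV upgrade-graph metadata for the fixed release:
# "replaces" points at the last healthy version, "skips" covers the broken
# one, so both clusters stuck on v0.3.0 and clusters still on v0.2.0 can
# upgrade directly to v0.3.1.
cat <<'EOF'
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: addon-operator.v0.3.1
spec:
  version: 0.3.1
  replaces: addon-operator.v0.2.0
  skips:
    - addon-operator.v0.3.0
EOF
```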

Additional info:

There is a longer Google Doc that explains the whole situation:
https://docs.google.com/document/d/125zm8jEhpNF-z1IMoqMMd3IJSTVIwUMylPIgnPpgJLM

The affected OpenShift Dedicated cluster is also still available for debugging, if required.


Edit:

OLM explicitly asks for a manual reinstall in its logs:

$ oc logs -n openshift-operator-lifecycle-manager olm-operator-7bfd55d5c7-79jcw
time="2021-09-22T10:35:02Z" level=info msg="addon-operator.v0.2.0 replaced by addon-operator.v0.3.0"
time="2021-09-22T10:35:02Z" level=warning msg="needs reinstall: deployment addon-operator-manager not ready before timeout: deployment \"addon-operator-manager\" exceeded its progress deadline" csv=addon-operator.v0.3.0 id=6WSIG namespace=redhat-addon-operator phase=Failed strategy=deployment
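For reference, the manual recovery (the "OLM Dance") on an affected cluster looks roughly like this. A sketch only, assuming the namespace and CSV names from the log above:

```shell
# Manual recovery sketch: delete the failed CSV so OLM re-resolves the
# Subscription and installs the latest available version from the catalog.
NS=redhat-addon-operator

# Inspect the failed install
oc get csv -n "$NS"
oc get subscription -n "$NS"

# Delete the failed CSV; the Subscription remains, and OLM reinstalls
# from the catalog (picking up v0.3.1 once it is published).
oc delete csv addon-operator.v0.3.0 -n "$NS"

# Watch OLM reconcile
oc get csv -n "$NS" -w
```

Performing these steps on every cluster in a fleet is exactly the manual intervention this report asks OLM to make unnecessary.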