Bug 2006773 - OLM unable to recover from failing operator release
Summary: OLM unable to recover from failing operator release
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Kevin Rizza
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-22 11:33 UTC by Nico Schieder
Modified: 2021-09-24 14:00 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-23 13:14:08 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker MTSRE-177 0 None None None 2021-09-22 11:33:22 UTC
Red Hat Issue Tracker MTSRE-182 0 None None None 2021-09-22 11:33:22 UTC

Description Nico Schieder 2021-09-22 11:33:22 UTC
Description of problem:
OLM is unable to recover automatically after a release failed to roll out.

Context for the problem:
I am part of the MT-SRE team and we are rolling out operator updates to a fleet of OpenShift Dedicated clusters.
The operators that we are responsible for are managed and SRE-ed by Red Hat.

As we roll out updates to a whole fleet of clusters, recovering via manual intervention on every single cluster could take us days or weeks, so it is very important to us that OLM handles this case automatically.

Within SRE teams, the manual intervention required to get services back into an operational state is commonly referred to as the "OLM Dance".
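
For reference, the manual workaround behind that "dance" looks roughly like the sketch below on each affected cluster; the CSV name and namespace are taken from the log excerpt further down, and the exact steps can vary per operator:

# Delete the failed CSV so that OLM re-resolves the installation from the Subscription
$ oc delete csv addon-operator.v0.3.0 -n redhat-addon-operator

# If the associated InstallPlan is stuck as well, delete it too and let OLM create a new one
$ oc get installplan -n redhat-addon-operator
$ oc delete installplan <stuck-installplan> -n redhat-addon-operator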

Version-Release number of selected component (if applicable):
Provider: AWS
OpenShift version: 4.8.11
Update channel: stable-4.8

How reproducible: Always

Steps to Reproduce:
1. Successfully install an operator (e.g. v0.2.0) on a cluster with the Subscription set to Automatic approval.
2. Release a new Operator Bundle (e.g. v0.3.0) that uses the wrong image digests or has readiness/liveness probes that never pass.
3. Observe that OLM fails to roll out the new release.
4. Release a new version (e.g. v0.3.1) that skips the broken version (v0.3.0) and provides an upgrade path from it (see the manifest sketch after these steps).
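
For illustration, here is a minimal sketch of the manifests involved in steps 1 and 4. The namespace and CSV names follow the example versions above; the Subscription, channel, and catalog source names are made up for the sketch:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: addon-operator
  namespace: redhat-addon-operator
spec:
  channel: stable                      # hypothetical channel name
  name: addon-operator
  source: addon-operator-catalog       # hypothetical CatalogSource name
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic       # step 1: updates are applied without manual approval
---
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: addon-operator.v0.3.1
spec:
  replaces: addon-operator.v0.2.0      # keeps the upgrade path from the last working release
  skips:
    - addon-operator.v0.3.0            # step 4: lets clusters jump over the broken release
  # remaining CSV fields omitted

Alternatively, the olm.skipRange annotation on the new CSV can be used to skip a whole range of broken versions.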

Actual results:
OLM is stuck waiting for manual intervention to reinstall the Operator.
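
The stuck state shows up as a CSV that stays in the Failed phase, e.g. (illustrative output; the display name is a placeholder):

$ oc get csv -n redhat-addon-operator
NAME                    DISPLAY          VERSION   REPLACES                PHASE
addon-operator.v0.3.0   Addon Operator   0.3.0     addon-operator.v0.2.0   Failed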

Expected results:
OLM rolls back to the previous working version (v0.2.0) after the progressDeadline is exceeded.
OLM rolls forward once a new upgrade path from the failed v0.3.0 to v0.3.1 becomes available.

Additional info:

There is a longer Google Doc that explains the whole situation:
https://docs.google.com/document/d/125zm8jEhpNF-z1IMoqMMd3IJSTVIwUMylPIgnPpgJLM

The affected OpenShift Dedicated cluster is also still available for debugging, if required.


Edit:

OLM explicitly asks for a manual reinstall in its logs:

$ oc logs -n openshift-operator-lifecycle-manager olm-operator-7bfd55d5c7-79jcw
time="2021-09-22T10:35:02Z" level=info msg="addon-operator.v0.2.0 replaced by addon-operator.v0.3.0"
time="2021-09-22T10:35:02Z" level=warning msg="needs reinstall: deployment addon-operator-manager not ready before timeout: deployment \"addon-operator-manager\" exceeded its progress deadline" csv=addon-operator.v0.3.0 id=6WSIG namespace=redhat-addon-operator phase=Failed strategy=deployment

