Bug 2001505

Summary: Forever pending auto upgrade in case of breaking API changes
Product: OpenShift Container Platform
Reporter: fvaleri
Component: OLM
Assignee: Kevin Rizza <krizza>
OLM sub component: OLM
QA Contact: Jian Zhang <jiazha>
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Keywords: Reopened
Version: 4.7   
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Unspecified   
Last Closed: 2021-09-10 17:00:38 UTC
Type: Bug

Description fvaleri 2021-09-06 09:04:30 UTC
Description of problem:

Most customers use the default stable channel with automatic upgrades (even in production), and we are seeing many support cases where an operator upgrade is stuck pending due to breaking API changes (removal of old, deprecated CRD versions).

Now the OperatorHub correctly shows an error about this, and the existing custom resources are not affected because the upgrade remains pending. I think the upgrade process should not start at all in such cases, since manual conversion steps are required first. These steps are usually described in the product's documentation or release notes, and it would be good to surface that link along with the error message. We could provide an optional "manualStepsUrl" field inside the operator's manifest (CSV); that way, the middleware product team would be in charge of signaling that manual steps are required before upgrading. These conversion steps cannot be automated because the operator is not a singleton service.
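
For illustration only, a minimal sketch of how such a field might look inside a CSV. "manualStepsUrl" is just the name proposed above, not an existing OLM API, and the CSV name and URL are placeholders:

# Hypothetical CSV fragment showing the proposed field (not an existing OLM API):
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: amqstreams.v1.8.0                                  # example name, assumed
spec:
  displayName: AMQ Streams
  manualStepsUrl: https://example.com/amq-streams-upgrade  # proposed field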

Version-Release number of selected component (if applicable):

- OpenShift 4.7.4
- OLM 0.17.0

How reproducible:

Install AMQ Streams from the 1.7 channel using the OperatorHub, then change the Subscription to the stable channel; the upgrade becomes stuck pending after a few seconds.
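
For reference, the same channel switch can be done from the CLI; the Subscription name and namespace below are assumptions based on a default OperatorHub install:

# Assumed Subscription name/namespace; adjust to your install.
oc patch subscription amq-streams -n openshift-operators \
  --type merge -p '{"spec":{"channel":"stable"}}'

# The pending upgrade then shows up on the generated InstallPlan.
oc get installplan -n openshift-operators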

Actual results:

Forever pending upgrade with the following warning message:

risk of data loss updating kafkarebalances.kafka.strimzi.io: new CRD removes version v1alpha1 that is listed as a served version of the existing CRD
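
The same message is also visible from the CLI in the resource status; a minimal check, assuming the default openshift-operators namespace (the exact condition placement may vary by OLM version):

# The warning surfaces in the InstallPlan/Subscription status conditions.
oc describe installplan -n openshift-operators
oc get subscription amq-streams -n openshift-operators \
  -o jsonpath='{.status.conditions}'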

Expected results:

Only the warning is shown, without starting the upgrade process at all.

Additional info:

https://issues.redhat.com/browse/ENTMQST-3237

Comment 2 fvaleri 2021-09-08 16:01:58 UTC
Hi, I'm reopening this, please double check.

The bug is that the OLM upgrade process remains pending even after applying the required CR conversion steps (no resources still use the old CRD version, which can then be removed safely).
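
For context, a minimal sketch of how the conversion result can be verified before expecting the upgrade to proceed, assuming the kafkarebalances CRD from the warning above:

# Check which versions the CRD still serves and stores.
oc get crd kafkarebalances.kafka.strimzi.io \
  -o jsonpath='{range .spec.versions[*]}{.name} served={.served} storage={.storage}{"\n"}{end}'

# The stored versions recorded in status must no longer include v1alpha1.
oc get crd kafkarebalances.kafka.strimzi.io \
  -o jsonpath='{.status.storedVersions}'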

Comment 3 fvaleri 2021-09-08 16:04:41 UTC
The only way to recover from this is to uninstall both the old operator and the new (pending) one. After this, you need to reinstall the new one (a sketch follows at the end of this comment).

More details here:

https://access.redhat.com/solutions/6273981
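
A minimal sketch of those recovery steps, with assumed names and namespace; the linked solution above is the authoritative procedure:

# Remove the Subscription and the old and pending CSVs; names are assumed.
oc delete subscription amq-streams -n openshift-operators
oc delete csv amqstreams.v1.7.3 -n openshift-operators

# Reinstall by recreating the Subscription on the new channel.
cat <<'EOF' | oc apply -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: amq-streams
  namespace: openshift-operators
spec:
  channel: stable
  name: amq-streams
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF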

Comment 4 Kevin Rizza 2021-09-10 17:00:38 UTC
As I described before, this is because OLM does not, and cannot, know how to auto-recover in specific situations like this, due to the way the InstallPlan resource reconciles. An InstallPlan is a book-of-record, run-once operation that does not attempt to retry in the case of failures. This means that once an InstallPlan is in a failed state, OLM will not recover even after the cluster is put into a configuration that would allow the install to succeed. From OLM's perspective, it tried to install, got partway through the operation, and failed. Part of the reinstall process today involves undoing the existing OLM install steps and then reinstalling from scratch.
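
For illustration, what that terminal state looks like from the CLI; the InstallPlan name below is an assumption (OLM generates names like install-xxxxx):

# A failed InstallPlan stays in phase Failed and is never retried by OLM.
oc get installplan -n openshift-operators
oc get installplan install-abcde -n openshift-operators \
  -o jsonpath='{.status.phase}'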

We have some future proposals for making OLM more declarative, but we are not currently able to treat a failure condition like this as a bug that can be trivially fixed with the existing OLM control plane. Semantically, it requires new installation concepts and a new API. See https://github.com/operator-framework/rukpak#rukpak for the beginnings of some of that work, but it will most likely be several OpenShift releases before that replaces the current install workflow.