Bug 2001505 - Forever pending auto upgrade in case of breaking API changes
Summary: Forever pending auto upgrade in case of breaking API changes
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.7
Hardware: All
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Kevin Rizza
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-09-06 09:04 UTC by fvaleri
Modified: 2021-09-10 17:00 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-10 17:00:38 UTC
Target Upstream Version:
Embargoed:



Description fvaleri 2021-09-06 09:04:30 UTC
Description of problem:

Most customers use the default stable channel with automatic upgrades (even in production), and we are seeing many support cases opened because an operator upgrade is stuck pending due to breaking API changes (removal of old, deprecated CRD versions).

Currently, the OperatorHub correctly shows an error about this, and the existing custom resources are not affected because the upgrade remains pending. I think the upgrade process should not start at all in such cases, since manual conversion steps are required. These steps are usually described in the product's documentation or release notes, and it would be good to include that link alongside the error message. We could provide an optional "manualStepsUrl" field inside the operator's manifest (CSV), as sketched below. That way, the middleware product team would be in charge of declaring that manual steps are required before upgrading. These conversion steps cannot be automated because the operator is not a singleton service.
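
For illustration only (no such field exists in the ClusterServiceVersion API today, and the CSV name below is made up), the proposal would look roughly like this:

    apiVersion: operators.coreos.com/v1alpha1
    kind: ClusterServiceVersion
    metadata:
      name: amqstreams.v1.8.0
    spec:
      displayName: AMQ Streams
      version: 1.8.0
      # Hypothetical field: link to the manual conversion steps that must
      # be completed before OLM is allowed to start this upgrade.
      manualStepsUrl: <link to the product's conversion guide / release notes>

OLM could then refuse to create the upgrade InstallPlan at all and surface this link next to the warning in the OperatorHub.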

Version-Release number of selected component (if applicable):

- OpenShift 4.7.4
- OLM 0.17.0

How reproducible:

Install AMQ Streams from the 1.7 channel using the OperatorHub, then change the Subscription to the stable channel; the upgrade becomes pending after a few seconds.
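
The channel switch can also be done from the CLI; this sketch assumes the operator was installed cluster-wide and the Subscription is named amq-streams in openshift-operators (adjust the names to your environment):

    $ oc patch subscription amq-streams -n openshift-operators \
        --type merge -p '{"spec":{"channel":"stable"}}'
    # The Subscription state then reports the stuck upgrade:
    $ oc get subscription amq-streams -n openshift-operators \
        -o jsonpath='{.status.state}'
    UpgradePending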

Actual results:

Forever pending upgrade with the following warning message:

risk of data loss updating kafkarebalances.kafka.strimzi.io: new CRD removes version v1alpha1 that is listed in the stored versions on the existing CRD
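
The same message should also be retrievable from the pending/failed InstallPlan; grepping its YAML is one way to find it (install plan name is a placeholder):

    $ oc get installplan -n openshift-operators
    $ oc get installplan <install-plan-name> -n openshift-operators -o yaml \
        | grep -A 2 'risk of data loss'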

Expected results:

Only the warning is shown, and the upgrade process does not start at all.

Additional info:

https://issues.redhat.com/browse/ENTMQST-3237

Comment 2 fvaleri 2021-09-08 16:01:58 UTC
Hi, I'm reopening this, please double check.

The bug is that the OLM upgrade process remains pending even after applying the required CR conversion steps (no resources use the old CRD version anymore, so it can now be removed safely).
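
The data-loss check guards the CRD's stored versions, so one way to double-check that the conversion really completed is to confirm the old version is gone from status.storedVersions:

    $ oc get crd kafkarebalances.kafka.strimzi.io \
        -o jsonpath='{.status.storedVersions}'
    # After a completed conversion this should list only the new
    # version(s), e.g. ["v1beta2"], with v1alpha1 no longer present.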

Comment 3 fvaleri 2021-09-08 16:04:41 UTC
The only way to recover from this is to uninstall both the old operator and the new (pending) one, and then reinstall the new one.

More details here:

https://access.redhat.com/solutions/6273981
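
In shell terms, the recovery amounts to roughly the following (all resource names are illustrative; take the real ones from 'oc get subscription,csv -n openshift-operators'):

    # Remove the stuck Subscription and both the old and the pending CSVs
    $ oc delete subscription amq-streams -n openshift-operators
    $ oc delete csv amqstreams.v1.7.3 amqstreams.v1.8.0 -n openshift-operators

    # Reinstall on the desired channel
    $ cat <<EOF | oc apply -f -
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: amq-streams
      namespace: openshift-operators
    spec:
      channel: stable
      name: amq-streams
      source: redhat-operators
      sourceNamespace: openshift-marketplace
    EOF

Note that deleting the CSVs removes the operator Deployment but leaves the CRDs and custom resources in place, so existing workloads are not deleted.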

Comment 4 Kevin Rizza 2021-09-10 17:00:38 UTC
As I described before, this is because OLM does not and cannot know how to auto-recover in situations like this, due to the way the InstallPlan resource reconciles. An InstallPlan is a book-of-record, run-once operation that does not retry on failure. This means that once an InstallPlan is in a failed state, OLM will not recover even after the cluster is put into a configuration that would allow the install to succeed. From OLM's perspective, it tried to install, it got partway through the operation, and it failed. Today, part of the reinstall process involves undoing the existing OLM install steps and then reinstalling from scratch.
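
The run-once behavior is visible on the InstallPlan itself: once .status.phase reports Failed it never transitions back, even after the cluster is fixed. For example:

    $ oc get installplan -n openshift-operators \
        -o custom-columns=NAME:.metadata.name,CSV:.spec.clusterServiceVersionNames,PHASE:.status.phase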

We have some future proposals about how to make OLM more declarative, but with the existing OLM control plane we cannot treat a failure condition like this as a bug that can be trivially fixed. Semantically, it requires new installation concepts and a new API. See https://github.com/operator-framework/rukpak#rukpak for the beginnings of some of that work, but it will most likely be several OpenShift releases before that replaces the current install workflow.

