2097557 – can not upgrade. Incorrect reading of olm.maxOpenShiftVersion

Summary: can not upgrade. Incorrect reading of olm.maxOpenShiftVersion

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	OLM
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.12.0
Assignee:	Per da Silva
QA Contact:	kuiwang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2114574

Reported:	2022-06-16 00:47 UTC by jroche
Modified:	2023-01-17 19:50 UTC (History)
CC List:	13 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	* Previously, Operator Lifecycle Manager (OLM) would attempt to update namespaces to apply a label, even if the label was present on the namespace. Consequently, the update requests increased the workload in API and etcd services. With this update, OLM compares existing labels against the expected labels on a namespace before issuing an update. As a result, OLM no longer attempts to make unnecessary update requests on namespaces. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2105045[BZ#2105045])
Clone Of:
Environment:
Last Closed:	2023-01-17 19:50:02 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift operator-framework-olm pull 346	0	None	open	Bug 2097557: use env var for OCP version instead of clusterversion status	2022-08-02 15:34:03 UTC
Red Hat Product Errata	RHSA-2022:7399	0	None	None	None	2023-01-17 19:50:34 UTC

Description jroche 2022-06-16 00:47:25 UTC

Description of problem:
Cluster attempted to upgrade from 4.9.29 -> 4.10.15. The upgrade could not progress because of:

    Last Transition Time:  2022-06-15T16:08:33Z
    Message:               ClusterServiceVersions blocking cluster upgrade: redhat-rhoam-operator/managed-api-service.v1.22.0 is incompatible with OpenShift minor versions greater than 4.10
    Reason:                IncompatibleOperatorsInstalled
    Status:                False
    Type:                  Upgradeable

The mentioned CSV has olm.maxOpenShiftVersion of 4.10:
    {
      "type": "olm.maxOpenShiftVersion",
      "value": "4.10"
    }


Actual results:
The upgrade is blocked on this olm.maxOpenShiftVersion

Expected results:
The upgrade should not be blocked on this olm.maxOpenShiftVersion

Additional info:
The particular customer has successfully upgraded twice on different clusters on the same OCP edge and with managed-api-service.v1.22.0 installed

must-gather will be attached in a private comment

Comment 2 jroche 2022-06-16 01:03:26 UTC

One more question: should the operator-lifecycle-manager ClusterOperator be degraded when Upgradeable is false?

Comment 3 jroche 2022-06-16 02:49:11 UTC

We cancelled the upgrade using `oc adm upgrade --clear=true`, which has had the effect of resolving the Upgradeable: false condition.
The customer would like to reschedule the upgrade and would like to know what you think the issue was and whether a retry would work. Thank you.

Comment 4 W. Trevor King 2022-06-17 04:58:19 UTC

The issue here is that OLM is using ClusterVersion's status.desired [1] to compute the next 4.y [2]. So:

1. Cluster is running 4.9.
2. Update to 4.10 requested.
3. Cluster-version operator is mulling over whether 4.10 is a good idea. 4.9 CVOs set status.desired to point at the requested target while they do this. More recent CVOs, including 4.10.7 (tombstoned [3]), 4.10.8 and later, leave status.desired alone while considering an update request; see bug 2064991 and bug 1826115.
4. Cluster-version operator fails the first round of preconditions on EtcdRecentBackup, waiting for etcd to perform the pre-minor-update snapshot.
5. Meanwhile, OLM is looking at ClusterVersion's status.desired, notices the 4.10 version, knows it has some operators that are not compatible with 4.11, and sets Upgradeable=False in its ClusterOperator.
6. Cluster-version operator comes back around for a new round of precondition checks. Now etcd's ClusterOperator has RecentBackup=True, so we pass that precondition. But because OLM is Upgradeable=False, and the CVO interprets Upgradeable=False from a 4.9 operator as "please don't go to 4.10", we fail prechecks again.

So the 4.9 OLM is trying to say "don't go to 4.11", but the CVO is hearing "don't go to 4.10". But because we'll never ask a 4.9 OLM to go straight to 4.11, all the 4.9 OLM has to be concerned about is compat with 4.10. One possible fix would be similar to bug 2097431, pivoting to using the RELEASE_VERSION environment variable [3] to figure out which version OLM is, instead of looking at ClusterVersion.
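
The version mix-up above can be illustrated with a minimal sketch. This is not OLM's actual code (OLM is written in Go); the function names here are hypothetical, and the comparison rule (block when the operator's maximum version is below the next minor after the "current" version) is an assumption based on the Upgradeable message in this report. It only shows why deriving "current" from status.desired yields a 4.11 check, while deriving it from the running version yields a 4.10 check.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version such as '4.10' or '4.10.15'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def next_minor(version: str) -> tuple[int, int]:
    """Return the minor release that follows the given version."""
    major, minor = parse_minor(version)
    return major, minor + 1


def blocks_upgrade(current_version: str, max_openshift_version: str) -> bool:
    """Hypothetical check: True if max version < next minor after current."""
    return parse_minor(max_openshift_version) < next_minor(current_version)


# Values taken from this report.
MAX_OPENSHIFT_VERSION = "4.10"  # olm.maxOpenShiftVersion on the CSV
RUNNING_VERSION = "4.9.29"      # what the cluster is actually running
DESIRED_VERSION = "4.10.15"     # what a 4.9 CVO writes to status.desired

# Treating status.desired as "current" checks compatibility with 4.11.
buggy = blocks_upgrade(DESIRED_VERSION, MAX_OPENSHIFT_VERSION)
# Treating the running version as "current" checks compatibility with 4.10.
fixed = blocks_upgrade(RUNNING_VERSION, MAX_OPENSHIFT_VERSION)

print(f"current from status.desired:  blocked={buggy}")
print(f"current from running version: blocked={fixed}")
```

With the values from this bug, the first check blocks and the second does not, which matches the behavior described in steps 5 and 6.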

One possible way to unstick minor updates out of impacted versions is:

1. Request the update to 4.10.
2. Wait until 'oc get -o yaml clusteroperator etcd | grep -5 RecentBackup' shows a RecentBackup=True condition.
3. 'oc adm upgrade --clear' to give up on the update.
4. Give OLM time to cool off, checking 'oc get -o yaml clusterversion version | grep -5 Upgradeable' until there are no Upgradeable conditions (possibly taking steps to address any Upgradeable=False conditions that are not OLM complaining about 4.11).
5. Request the update to 4.10 again.

That should get you a fresh round of 4.10 preconditions while RecentBackup=True (etcd does not seem to expire RecentBackup=True very quickly once the backup is no longer recent, at least in 4.9.29). And you'll also be going through the preconditions before OLM has time to get worried about 4.11.
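
For anyone scripting the two waits in the steps above instead of using grep, here is a rough sketch of the condition checks. It assumes the resources were fetched as JSON (for example with 'oc get -o json clusteroperator etcd' and 'oc get -o json clusterversion version') and parsed into dictionaries; the helper names and the trimmed sample documents are hypothetical, not real cluster output.

```python
import json


def condition_status(resource: dict, condition_type: str):
    """Return the status string of the named condition, or None if absent."""
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") == condition_type:
            return cond.get("status")
    return None


def ready_to_retry(etcd_co: dict, cluster_version: dict) -> bool:
    """True once etcd reports RecentBackup=True and nothing says Upgradeable=False."""
    backup_ok = condition_status(etcd_co, "RecentBackup") == "True"
    # The steps wait for the Upgradeable condition to disappear; this sketch
    # treats any state other than an explicit "False" as not blocking.
    not_blocked = condition_status(cluster_version, "Upgradeable") != "False"
    return backup_ok and not_blocked


# Trimmed, made-up sample documents for illustration only.
etcd_co = json.loads(
    '{"status": {"conditions": [{"type": "RecentBackup", "status": "True"}]}}'
)
blocked_cv = json.loads(
    '{"status": {"conditions": [{"type": "Upgradeable", "status": "False",'
    ' "reason": "IncompatibleOperatorsInstalled"}]}}'
)
clear_cv = json.loads('{"status": {"conditions": []}}')

print("retry while OLM still complains:", ready_to_retry(etcd_co, blocked_cv))
print("retry after conditions clear:   ", ready_to_retry(etcd_co, clear_cv))
```

The check is kept as a pure function over parsed JSON so it can be exercised without a cluster; polling and the actual 'oc' invocations are left out.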

[1]: https://github.com/openshift/operator-framework-olm/blame/7f8ad598528b2d029fac23dac6d860c433cbf962/staging/operator-lifecycle-manager/pkg/controller/operators/openshift/helpers.go#L171-L189
[2]: https://github.com/openshift/operator-framework-olm/blame/7f8ad598528b2d029fac23dac6d860c433cbf962/staging/operator-lifecycle-manager/pkg/controller/operators/openshift/helpers.go#L132
[3]: https://github.com/openshift/operator-framework-olm/blob/7f8ad598528b2d029fac23dac6d860c433cbf962/manifests/0000_50_olm_07-olm-operator.deployment.yaml#L79-L80

Comment 5 jroche 2022-06-17 05:36:24 UTC

Thanks Trevor. So this is a fix for OLM.
Is it a matter of retrying the upgrade again for the customer?

Comment 14 Per da Silva 2022-06-24 13:40:07 UTC

setting to blocker- since we have a workaround and a path forward

Comment 23 errata-xmlrpc 2023-01-17 19:50:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399
