2091806 – Cluster upgrade stuck due to "resource deletions in progress"

Summary: Cluster upgrade stuck due to "resource deletions in progress"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.9
Hardware:	All
OS:	Other
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.9.z
Assignee:	Jack Ottofaro
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:	2064991
Blocks:

Reported:	2022-05-31 05:52 UTC by Paul Webster
Modified:	2022-08-09 14:01 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-09 14:00:58 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-version-operator pull 786	0	None	open	Bug 2091806: pkg/cvo: Separate payload load from payload apply	2022-06-06 16:30:42 UTC
Red Hat Product Errata	RHSA-2022:5879	0	None	None	None	2022-08-09 14:01:27 UTC

Description Paul Webster 2022-05-31 05:52:24 UTC

Description of problem:
While attempting to upgrade a cluster from 4.6.17 to 4.10.9 through the upgrade path detailed below, the final upgrade from 4.9.28 to 4.10.9 was blocked with the message:

Cluster minor level upgrades are not allowed while resource deletions are in progress; resources=PrometheusRule "openshift-authentication-operator/authentication-operator",rolebinding "openshift-machine-api/machine-api-termination-handler",PrometheusRule "openshift-kube-apiserver/kube-apiserver",role "openshift-machine-api/machine-api-termination-handler" 

The issue was eventually resolved by resetting the cluster upgrade using the command:

$ oc adm upgrade --clear

The upgrade from 4.9.28 to 4.10.9 was later reattempted and completed successfully.
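
The blocking message above packs the pending resources into a single comma-separated `resources=` list. As a minimal sketch for triage (not part of the Cluster Version Operator), the following Python splits such a message into kind, namespace, and name. The function name `parse_pending_deletions` is hypothetical, and the `kind "namespace/name"` layout is an assumption taken from the messages quoted in this bug.

```python
import re

# Assumed item layout: kind "namespace/name" (namespace optional), comma-separated.
_ITEM = re.compile(r'([^\s,"]+) "(?:([^"/]+)/)?([^"]+)"')


def parse_pending_deletions(message):
    """Return (kind, namespace, name) tuples listed after 'resources='.

    Returns an empty list when the message has no 'resources=' part.
    The namespace is None for an item without a 'namespace/' prefix.
    """
    _, sep, tail = message.partition("resources=")
    if not sep:
        return []
    return [m.groups() for m in _ITEM.finditer(tail)]


msg = (
    "Cluster minor level upgrades are not allowed while resource deletions "
    'are in progress; resources=PrometheusRule '
    '"openshift-authentication-operator/authentication-operator",'
    'rolebinding "openshift-machine-api/machine-api-termination-handler"'
)
for kind, namespace, name in parse_pending_deletions(msg):
    print(kind, namespace, name)
```

Each tuple can then be used to look up the corresponding object and see whether its deletion is still pending.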

Upgrade path taken:
4.6.17 -> 4.6.41 -> 4.7.43 -> 4.8.36 -> 4.9.28 -> 4.10.9
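
As a rough sanity check on a path like the one above, and assuming the usual rule that each hop stays within one major version and advances by at most one minor version, the hops can be checked with a short Python sketch. The helper names are hypothetical, and the check does not consult the actual update graph, so a path that passes here may still not be a recommended one.

```python
def parse_version(text):
    """Split 'X.Y.Z' into a tuple of integers so 4.10 sorts after 4.9."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def skipped_minor_hops(path):
    """Return the (source, target) hops that change major or jump minors.

    A hop is accepted when the major version is unchanged and the minor
    version stays the same or increases by exactly one.
    """
    versions = [item.strip() for item in path.split("->")]
    bad = []
    for src, dst in zip(versions, versions[1:]):
        s, d = parse_version(src), parse_version(dst)
        if d[0] != s[0] or d[1] - s[1] not in (0, 1):
            bad.append((src, dst))
    return bad


path = "4.6.17 -> 4.6.41 -> 4.7.43 -> 4.8.36 -> 4.9.28 -> 4.10.9"
print(skipped_minor_hops(path))
```

Comparing the components as integers matters here: as strings, "4.10" would sort before "4.9".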

Version-Release number of the following components:
OCP 4.9.28
vSphere

How reproducible:
N/A

Steps to Reproduce:
Customer upgraded cluster as per the upgrade path described above

Actual results:
See description above

Expected results:
Cluster upgrade to progress normally

Additional info:
Must-gathers from before and after the upgrade are available from the case.

Comment 1 Jack Ottofaro 2022-06-01 13:56:27 UTC

(In reply to Paul Webster from comment #0)
> Description of problem:
> While attempting to upgrade cluster from 4.6.17 to 4.10.9 through the
> upgrade path detailed below, the final upgrade from 4.9.28 to 4.10.9 was
> blocked with message:
> 
> Cluster minor level upgrades are not allowed while resource deletions are in
> progress; resources=PrometheusRule
> "openshift-authentication-operator/authentication-operator",rolebinding
> "openshift-machine-api/machine-api-termination-handler",PrometheusRule
> "openshift-kube-apiserver/kube-apiserver",role
> "openshift-machine-api/machine-api-termination-handler" 
> 
> The issue was eventually resolved by resetting the cluster upgrade using the
> command:
> 
> $ oc adm upgrade --clear
> 
Any idea how long they waited on the first upgrade request before "clear"ing it?

Comment 2 Jack Ottofaro 2022-06-01 17:10:14 UTC

This is a known issue and will require a backport of https://bugzilla.redhat.com/show_bug.cgi?id=1822752 to fix.

Comment 4 liujia 2022-07-22 04:21:03 UTC

Reproduced on path 4.8.36 -> 4.9.28 -> 4.10.9

1. Trigger an upgrade from 4.8.36 to 4.9.28.

2. Monitor that upgrade. As soon as it finishes, trigger a new upgrade to 4.10 (without --force) while the Upgradeable=False condition is still present. This is a very short window before the cluster runs into the ResourceDeletesInProgress status; if the upgrade is not triggered within this window, the issue does not occur.

3. After the upgrade is triggered without --force while Upgradeable=False, no upgrade happens, as expected, and the `it may not be safe to apply this update` error is reported because of Upgradeable=False.
# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.28    True        True          23m     Unable to apply 4.10.23: it may not be safe to apply this update

4. Wait for the resource deletions without taking any action (>30 min). The deletions do not complete and the status above remains stuck (unexpected).
# ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.10.9: it may not be safe to apply this update

Upgradeable=False

  Reason: ResourceDeletesInProgress
  Message: Cluster minor level upgrades are not allowed while resource deletions are in progress; resources=PrometheusRule "openshift-kube-apiserver/kube-apiserver"

5. Run `oc adm upgrade --clear` to cancel the update to 4.10.9 that is blocked by Upgradeable=False, then re-trigger the update. The upgrade starts successfully.
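
The timing window in step 2 depends on the Upgradeable condition reported by the ClusterVersion object. A minimal Python sketch for checking it before triggering a minor update follows. It assumes the JSON from `oc get clusterversion version -o json` exposes the condition under `status.conditions`, and all helper names are hypothetical.

```python
import json
import subprocess


def find_condition(cluster_version, cond_type):
    """Return the condition dict of the given type, or None if absent."""
    for cond in cluster_version.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


def minor_update_blocked(cluster_version):
    """True only when an Upgradeable condition exists with status "False"."""
    cond = find_condition(cluster_version, "Upgradeable")
    return cond is not None and cond.get("status") == "False"


def fetch_cluster_version():
    """Read the ClusterVersion object via the oc CLI (needs cluster access)."""
    result = subprocess.run(
        ["oc", "get", "clusterversion", "version", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Hand-written sample mirroring the status shown in step 4.
sample = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True"},
            {
                "type": "Upgradeable",
                "status": "False",
                "reason": "ResourceDeletesInProgress",
            },
        ]
    }
}
print(minor_update_blocked(sample))
```

On a live cluster, `minor_update_blocked(fetch_cluster_version())` would give the same check against the real object.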

Comment 5 liujia 2022-07-25 01:22:40 UTC

Verified on 4.8.46 -> 4.9.0-0.nightly-2022-07-21-221241 -> 4.10.24

At the beginning, the error is still reported because the upgrade was triggered while Upgradeable=False.
# ./oc adm upgrade
info: An upgrade is in progress. Unable to apply 4.10.24: it may not be safe to apply this update

Upgradeable=False

  Reason: ResourceDeletesInProgress
  Message: Cluster minor level upgrades are not allowed while resource deletions are in progress; resources=PrometheusRule "openshift-kube-apiserver/kube-apiserver"

Without any further action, the resource deletions complete after several minutes, the upgrade starts successfully, and it eventually succeeds.
# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2022-07-21-221241   True        True          4m57s   Working towards 4.10.24: 95 of 773 done (12% complete)

Comment 8 errata-xmlrpc 2022-08-09 14:00:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.9.45 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5879
