1974598 – Sub-optimal cluster destroy strategy

Summary: Sub-optimal cluster destroy strategy

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.9.0
Assignee:	Martin André
QA Contact:	Udi Shkalim
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-06-22 07:07 UTC by Martin André
Modified:	2021-10-18 17:36 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Destroy functions used a sub-optimal strategy: each destroy function exited on the first conflict and relied on a later iteration to retry. Consequence: The destroy command was unnecessarily expensive and, in case of a conflict, could leave behind resources that the installer could otherwise have cleaned up. Fix: Destroy functions now try to delete all resources and skip the ones that have conflicts. Result: Cluster deletion is faster and removes every resource it can remove.
Clone Of:
Environment:
Last Closed:	2021-10-18 17:35:54 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift installer pull 5004	0	None	open	Bug 1974598: OpenStack: Optimize cluster deletion	2021-06-22 07:07:35 UTC
Red Hat Product Errata	RHSA-2021:3759	0	None	None	None	2021-10-18 17:36:15 UTC

Description Martin André 2021-06-22 07:07:01 UTC

The strategy used for destroying a cluster on the OpenStack platform is sub-optimal.
Each destroy function exits on the first conflict and expects to be retried on the next iteration, hoping that in the meantime the conflict that prevented removal of the resource has been resolved.

This strategy can be quite expensive because there is an exponential backoff between retries.

This can also be problematic with Kuryr deployments, where conflicts happen more frequently and often require manual intervention (due to OpenStack bugs). This means that until someone looks at the cluster and fixes the conflicts, there may be *a lot* of leftover resources.

By adopting a different strategy, where destroy functions try to delete all resources and ignore the ones that have conflicts, we can solve these two issues. On the next iteration there will be fewer conflicts, and hopefully fewer iterations in total. It also means that the destroy command removes everything that it can remove in the case of a stuck Kuryr deployment, and wastes fewer resources.

Comment 3 Udi Shkalim 2021-06-29 12:38:37 UTC

Verified on Kuryr 1UPI:
[cloud-user@installer-host ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-06-28-221420   True        False         102m    Cluster version is 4.9.0-0.nightly-2021-06-28-221420

(shiftstack) [cloud-user@installer-host ~]$ openshift-install --log-level debug destroy cluster --dir ostest/
DEBUG OpenShift Installer 4.9.0-0.nightly-2021-06-28-221420
.
.
.
INFO Time elapsed: 14m40s

Log attached.

Comment 8 errata-xmlrpc 2021-10-18 17:35:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
