Bug 1974598 - Sub-optimal cluster destroy strategy
Summary: Sub-optimal cluster destroy strategy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Martin André
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-06-22 07:07 UTC by Martin André
Modified: 2021-10-18 17:36 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: destroy functions used a sub-optimal strategy: each destroy function exited on the first conflict and was expected to retry on the next iteration.
Consequence: the destroy command was unnecessarily expensive and, in case of a conflict, could leave behind resources that the installer would otherwise have cleaned up.
Fix: destroy functions try to delete all resources and ignore the ones that have conflicts.
Result: cluster deletion is faster and removes all the resources it can remove.
Clone Of:
Environment:
Last Closed: 2021-10-18 17:35:54 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 5004 | 0 | None | open | Bug 1974598: OpenStack: Optimize cluster deletion | 2021-06-22 07:07:35 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:36:15 UTC

Description Martin André 2021-06-22 07:07:01 UTC
The strategy used for destroying a cluster on the OpenStack platform is sub-optimal.
Each destroy function exits on the first conflict and is expected to retry on the next iteration, in the hope that the conflict that prevented removal of the resource has been resolved in the meantime.

This strategy can be quite expensive, as there is an exponential backoff between retries.

This can also be problematic with Kuryr deployments, where conflicts happen more frequently and often require manual intervention (due to OpenStack bugs). Until someone looks at the cluster and fixes the conflicts, there may be *a lot* of leftover resources.

By adopting a different strategy, where destroy functions try to delete all resources and ignore the ones that have conflicts, we can solve both issues. On the next iteration there will be fewer conflicts, and hopefully fewer iterations in total. It also means that the destroy command removes everything it can remove when a Kuryr deployment is stuck, and consumes fewer resources needlessly.

Comment 3 Udi Shkalim 2021-06-29 12:38:37 UTC
Verified on Kuryr 1UPI:
[cloud-user@installer-host ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-06-28-221420   True        False         102m    Cluster version is 4.9.0-0.nightly-2021-06-28-221420

(shiftstack) [cloud-user@installer-host ~]$ openshift-install --log-level debug destroy cluster --dir ostest/
DEBUG OpenShift Installer 4.9.0-0.nightly-2021-06-28-221420
.
.
.
INFO Time elapsed: 14m40s

Log attached.

Comment 8 errata-xmlrpc 2021-10-18 17:35:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

