Bug 1974598

Summary: Sub-optimal cluster destroy strategy
Product: OpenShift Container Platform Reporter: Martin André <m.andre>
Component: Installer Assignee: Martin André <m.andre>
Installer sub component: OpenShift on OpenStack QA Contact: Udi Shkalim <ushkalim>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: rlobillo, ushkalim
Version: 4.6 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The destroy functions used a sub-optimal strategy: each one exited on the first conflict and relied on a later iteration to retry.
Consequence: The destroy command was unnecessarily expensive and, when a conflict occurred, could leave behind resources that the installer would otherwise have been able to clean up.
Fix: The destroy functions now try to delete all resources and skip the ones that have conflicts.
Result: Cluster deletion is faster and removes all the resources it can remove.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-10-18 17:35:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Martin André 2021-06-22 07:07:01 UTC
The strategy used for destroying a cluster on the OpenStack platform is sub-optimal.
Each destroy function exits on the first conflict and is expected to retry on the next iteration, hoping that in the meantime the conflict that prevented removal of the resource has been resolved.

This strategy can be quite expensive, as there is an exponential backoff between retries.
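
To give a rough sense of why retries get expensive, here is a minimal Python sketch of the cumulative wait under exponential backoff. The parameters (1 second initial delay, doubling factor, optional cap) are assumptions chosen for illustration; they are not the installer's actual backoff settings, and the function name is hypothetical.

```python
def total_backoff_wait(retries, initial=1.0, factor=2.0, cap=None):
    """Return the cumulative sleep time in seconds across `retries` retries.

    All defaults are illustrative assumptions, not the installer's values.
    """
    total = 0.0
    delay = initial
    for _ in range(retries):
        # Each retry sleeps for the current delay (bounded by `cap`, if set).
        total += delay if cap is None else min(delay, cap)
        # The delay grows geometrically for the next retry.
        delay *= factor
    return total
```

With these assumed values, every extra iteration forced by a single conflict roughly doubles the time already spent waiting, which is why reducing the number of iterations matters more than speeding up any one pass.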

This can also be problematic with Kuryr deployments, where conflicts happen more frequently and often require manual intervention (due to OpenStack bugs). This means that until someone looks at the cluster and fixes the conflicts, there may be *a lot* of leftover resources.

By adopting a different strategy, where destroy functions try to delete all resources and ignore the ones that have conflicts, we can solve these two issues. On the next iteration there will be fewer conflicts, and hopefully fewer iterations in total. It also means that the destroy command removes everything that it can remove in the case of a stuck Kuryr deployment, and wastes fewer resources.
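
The difference between the two strategies described above can be sketched as follows. This is an illustrative Python model only, not the installer's code (the installer is written in Go); `Conflict`, `destroy_fail_fast`, `destroy_best_effort`, and `delete_stub` are hypothetical names, and the stub simply pretends that one resource is stuck.

```python
class Conflict(Exception):
    """Hypothetical stand-in for a 'resource in use' error from the cloud API."""


def destroy_fail_fast(resources, delete):
    """Old strategy: stop at the first conflict and let the caller retry later."""
    for res in list(resources):
        try:
            delete(res)
        except Conflict:
            return False  # not done; everything after this point is untouched
        resources.remove(res)
    return True


def destroy_best_effort(resources, delete):
    """New strategy: attempt every resource and skip the ones in conflict."""
    done = True
    for res in list(resources):
        try:
            delete(res)
        except Conflict:
            done = False  # a retry is still needed, but keep going
            continue
        resources.remove(res)
    return done


def delete_stub(res):
    """Hypothetical delete call: one resource is permanently in conflict."""
    if res == "stuck-port":
        raise Conflict(res)
```

Starting from `["stuck-port", "net-1", "net-2"]`, the fail-fast version removes nothing because the conflict comes first, while the best-effort version leaves only `"stuck-port"` behind. Both return `False` to signal that another iteration is needed, but the second one has far less left to do.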

Comment 3 Udi Shkalim 2021-06-29 12:38:37 UTC
Verified on Kuryr 1UPI:
[cloud-user@installer-host ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-06-28-221420   True        False         102m    Cluster version is 4.9.0-0.nightly-2021-06-28-221420

(shiftstack) [cloud-user@installer-host ~]$ openshift-install --log-level debug destroy cluster --dir ostest/
DEBUG OpenShift Installer 4.9.0-0.nightly-2021-06-28-221420
.
.
.
INFO Time elapsed: 14m40s

Log attached.

Comment 8 errata-xmlrpc 2021-10-18 17:35:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759