Bug 1740935

Summary: AWS 'destroy cluster' does not attempt to re-delete the same ARN
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Installer
Installer sub component: openshift-installer
Assignee: Abhinav Dahiya <adahiya>
QA Contact: Johnny Liu <jialiu>
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Version: 4.2.0
Target Milestone: ---
Target Release: 4.3.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-08-19 17:57:59 UTC
Type: Bug

Description W. Trevor King 2019-08-13 21:54:59 UTC
Abhinav pointed out that the installer's deletion code records already-deleted ARNs and does not attempt to re-delete them [1].  That's fine for instances and similar resources, whose ARNs contain an AWS-generated, unique-to-each-account ID that will not turn up again.  But S3 buckets, IAM users and roles, and possibly other resources have ARNs derived from user-chosen names rather than AWS-generated unique IDs, so the same ARN can recur.
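For illustration, here's a minimal Go sketch of that pattern (not the installer's actual code; the deleteFn callback and the ARNs are hypothetical): once an ARN lands in the deleted set, it is never retried, even if a name-based resource with the same ARN is re-created.

// Minimal sketch of the deletion pattern described above, not the
// installer's actual code: each ARN is deleted at most once.
package main

import "fmt"

// deleteAll deletes each ARN at most once; deleteFn stands in for the
// per-resource deletion logic (hypothetical).
func deleteAll(arns []string, deleteFn func(string) error) {
	deleted := map[string]bool{}
	for _, arn := range arns {
		if deleted[arn] {
			continue // already deleted once; never re-deleted
		}
		if err := deleteFn(arn); err != nil {
			fmt.Printf("failed to delete %s: %v\n", arn, err)
			continue
		}
		deleted[arn] = true
	}
}

func main() {
	// S3 bucket ARNs are name-based, so the same ARN can show up
	// again after a re-create; the second occurrence is skipped.
	arns := []string{
		"arn:aws:s3:::example-cluster-image-registry",
		"arn:aws:s3:::example-cluster-image-registry",
	}
	deleteAll(arns, func(arn string) error {
		fmt.Println("deleting", arn)
		return nil
	})
}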

We should consider whether we want to query for and re-delete those unrandomized ARNs.  For example, if the race in bug 1740933 had gone a bit further before the instance termination, it might have played out like this:

1. The installer removes a bucket.
2. The still-running registry operator tries to self-heal and creates a new bucket.
3. The registry operator tags the new bucket (this diverges from the bug 1740933 case).
4. The installer terminates the instance where the registry operator was running.
5. The installer leaks the new, tagged bucket.

Terminating cluster-owned instances first (my proposed fix for bug 1740933) provides some protection from this issue.  But there is still a window of exposure to resources created by actors besides cluster-owned instances (e.g. a cluster admin could manually create a resource with a cluster-owned tag and an ARN matching a just-deleted resource).  If we do want the teardown logic to attempt to re-delete such a resource, we'd need to make additional Describe* calls to the AWS API to check for re-created resources, and then delete them if they exist.
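A hypothetical sketch of that extra check, using the AWS SDK for Go with S3 as the example service (the bucket name and region are made up, and a real implementation would also have to empty the bucket before deleting it):

// Hypothetical sketch of the re-check proposed above: after teardown,
// ask the owning service whether a name-based resource was re-created
// and, if so, delete it again.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func redeleteIfRecreated(svc *s3.S3, bucket string) error {
	// HeadBucket is the cheap existence check; a missing bucket
	// returns an error, in which case there is nothing to do.
	if _, err := svc.HeadBucket(&s3.HeadBucketInput{Bucket: aws.String(bucket)}); err != nil {
		return nil
	}
	fmt.Printf("bucket %s was re-created after deletion; deleting again\n", bucket)
	// Note: DeleteBucket fails on non-empty buckets; a real
	// implementation would empty the bucket first.
	_, err := svc.DeleteBucket(&s3.DeleteBucketInput{Bucket: aws.String(bucket)})
	return err
}

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	if err := redeleteIfRecreated(s3.New(sess), "example-cluster-image-registry"); err != nil {
		fmt.Println("re-delete failed:", err)
	}
}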

We cannot rely on the tagging API's results alone: due to AWS's eventual-consistency approach, they lag behind the individual services, so the tag API's responses may include resources that no longer exist (and may perhaps omit just-tagged resources that do exist).
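To make the consistency gap concrete, an illustrative Go sketch (the cluster name is hypothetical, following the usual kubernetes.io/cluster/<name>=owned tag convention) that verifies each tag-API hit against the owning service before trusting it:

// Illustrative sketch of why tag-API results need verification:
// GetResources is eventually consistent, so each hit is confirmed
// against the owning service (HeadBucket for S3) before acting on it.
package main

import (
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	rgt "github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	tagAPI := rgt.New(sess)
	s3API := s3.New(sess)

	out, err := tagAPI.GetResources(&rgt.GetResourcesInput{
		TagFilters: []*rgt.TagFilter{{
			Key:    aws.String("kubernetes.io/cluster/example-cluster"),
			Values: []*string{aws.String("owned")},
		}},
	})
	if err != nil {
		fmt.Println("tag lookup failed:", err)
		return
	}
	for _, m := range out.ResourceTagMappingList {
		arn := aws.StringValue(m.ResourceARN)
		if !strings.HasPrefix(arn, "arn:aws:s3:::") {
			continue // only S3 buckets are verified in this sketch
		}
		bucket := strings.TrimPrefix(arn, "arn:aws:s3:::")
		// The tag API may still report an already-deleted bucket,
		// so confirm with the S3 service before re-deleting.
		if _, err := s3API.HeadBucket(&s3.HeadBucketInput{Bucket: aws.String(bucket)}); err != nil {
			fmt.Println("stale tag entry (bucket already gone):", arn)
			continue
		}
		fmt.Println("bucket still exists and is tagged:", arn)
	}
}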

Is it worth (optionally?) paying the cost of those additional Describe* calls to protect ourselves from leaking resources that are created by actors other than cluster-owned instances and whose ARNs match a resource the teardown logic just deleted?

This issue is for 4.2, but the underlying race also impacts 4.1.z (at least as of 4.1.11).

[1]: https://github.com/openshift/installer/pull/2169#issuecomment-518869011

Comment 1 Scott Dodson 2019-08-19 17:57:59 UTC
Obviated by Bug 1725287, marking as a dupe.

*** This bug has been marked as a duplicate of bug 1740933 ***