Bug 1740935 - AWS 'destroy cluster' does not attempt to re-delete the same ARN
Summary: AWS 'destroy cluster' does not attempt to re-delete the same ARN
Status: CLOSED DUPLICATE of bug 1740933
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Abhinav Dahiya
QA Contact: Johnny Liu
Depends On:
Reported: 2019-08-13 21:54 UTC by W. Trevor King
Modified: 2019-08-19 17:57 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2019-08-19 17:57:59 UTC
Target Upstream Version:


Description W. Trevor King 2019-08-13 21:54:59 UTC
Abhinav pointed out that the installer's deletion code records already-deleted ARNs and does not attempt to re-delete them [1].  That's fine for instances and similar resources, whose ARNs contain an AWS-generated unique ID that will not turn up again.  But S3 buckets, IAM users and roles, and possibly other resources have ARNs which do not contain such AWS-generated unique IDs, so a re-created resource can end up with the same ARN as the one that was just deleted.
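
To make the distinction concrete, here is a minimal Python sketch (not the installer's Go code) that guesses whether an ARN could recur.  The helper name arn_can_recur and the example ARNs are hypothetical, and treating every non-S3, non-IAM service as carrying a generated ID is an assumption made only for illustration.

```python
# Hypothetical helper, not taken from the installer: guess whether an ARN
# names a resource by a caller-chosen name (can be re-created with an
# identical ARN) or by an AWS-generated ID (unlikely to turn up again).
def arn_can_recur(arn: str) -> bool:
    # General ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    service, resource = parts[2], parts[5]
    if service == "s3":
        return True  # bucket names are chosen by the creator
    if service == "iam":
        # IAM users and roles are addressed by name
        return resource.startswith(("user/", "role/"))
    # Assumption for this sketch: everything else embeds a generated ID.
    return False


examples = [
    "arn:aws:s3:::example-image-registry",
    "arn:aws:iam::123456789012:role/example-master-role",
    "arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0",
]
for arn in examples:
    print(arn_can_recur(arn), arn)
```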

We should consider whether or not we want to query for and re-delete those unrandomized ARNs.  For example, if the race in bug 1740933 had gone a bit further before the instance termination, it might have been:

1. The installer removes a bucket.
2. The still-running registry operator tries to self-heal and creates a new bucket.
3. The registry operator tags the new bucket (this diverges from the bug 1740933 case).
4. The installer terminates the instance where the registry operator was running.
5. The installer leaks the new, tagged bucket.
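
The five steps above can be reproduced with a small in-memory simulation.  This is a sketch under stated assumptions: FakeCloud, destroy, registry_operator, the tag string, and the ARNs are all hypothetical stand-ins, and the loop only imitates the "record deleted ARNs, never retry them" behavior described in [1] rather than copying the installer's code.

```python
TAG = "kubernetes.io/cluster/example=owned"  # hypothetical cluster tag
BUCKET = "arn:aws:s3:::example-image-registry"
INSTANCE = "arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0"


class FakeCloud:
    """In-memory stand-in for AWS: maps ARN -> set of tags."""

    def __init__(self):
        self.resources = {}

    def create(self, arn, tags):
        self.resources[arn] = set(tags)

    def delete(self, arn):
        self.resources.pop(arn, None)

    def tagged(self, tag):
        return [arn for arn, tags in self.resources.items() if tag in tags]


def destroy(cloud, tag, after_delete=lambda arn: None):
    """Delete tagged resources one at a time, skipping ARNs already deleted."""
    deleted = set()
    while True:
        pending = [arn for arn in cloud.tagged(tag) if arn not in deleted]
        if not pending:
            return deleted
        arn = pending[0]
        cloud.delete(arn)
        deleted.add(arn)
        after_delete(arn)  # lets other actors react between deletions


cloud = FakeCloud()
cloud.create(BUCKET, {TAG})
cloud.create(INSTANCE, {TAG})


def registry_operator(arn):
    # Steps 2 and 3: while its instance is still running, the operator
    # re-creates and tags the bucket that the teardown just removed.
    if arn == BUCKET and INSTANCE in cloud.resources:
        cloud.create(BUCKET, {TAG})


destroy(cloud, TAG, after_delete=registry_operator)
leaked = cloud.tagged(TAG)
print(leaked)  # the re-created bucket is still present and tagged
```

Because the bucket's ARN is already in the deleted set, the loop filters it out of every later pass and returns while the new bucket still exists.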

Terminating cluster-owned instances first (my proposed fix for bug 1740933) provides some protection from this issue.  But there is still a window of exposure to resources created by actors besides cluster-owned instances (e.g. a cluster admin could manually create a resource with a cluster-owned tag and an ARN matching a just-deleted resource).  If we do want the teardown logic to attempt to re-delete such a resource, we'd need to make additional Describe* calls to the AWS API to check for re-created resources, and then delete any that exist.
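
A rough sketch of what that extra pass might look like, again in Python against an in-memory fake rather than the real AWS API.  FakeCloud.describe stands in for whatever per-service Describe*-style call would be used, and recheck_and_redelete, the can_recur predicate, the tag string, and the ARNs are hypothetical names introduced for this example.

```python
TAG = "kubernetes.io/cluster/example=owned"  # hypothetical cluster tag
BUCKET = "arn:aws:s3:::example-image-registry"
INSTANCE = "arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0"


class FakeCloud:
    """In-memory stand-in for AWS that counts the extra lookups."""

    def __init__(self):
        self.resources = {}      # ARN -> set of tags
        self.describe_calls = 0  # the additional API cost being discussed

    def describe(self, arn):
        """Stand-in for a per-service Describe*-style existence check."""
        self.describe_calls += 1
        return self.resources.get(arn)  # tags, or None if absent


def recheck_and_redelete(cloud, deleted_arns, cluster_tag, can_recur):
    """Re-delete already-deleted ARNs that exist again with the cluster tag."""
    redeleted = []
    for arn in sorted(deleted_arns):
        if not can_recur(arn):
            continue  # generated IDs should not come back; skip the call
        tags = cloud.describe(arn)
        if tags is not None and cluster_tag in tags:
            del cloud.resources[arn]
            redeleted.append(arn)
    return redeleted


cloud = FakeCloud()
cloud.resources[BUCKET] = {TAG}  # re-created after the first deletion
already_deleted = {BUCKET, INSTANCE}

# Assumption: only S3 and IAM ARNs are treated as able to recur.
redeleted = recheck_and_redelete(
    cloud,
    already_deleted,
    TAG,
    can_recur=lambda arn: ":s3:" in arn or ":iam:" in arn,
)
print(redeleted, cloud.describe_calls)
```

Restricting the check to ARNs that can recur keeps the added cost to one lookup per name-addressed resource instead of one per deleted resource.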

We cannot rely on the tagging API's results, because those lag behind the other services due to AWS's eventual-consistency approach, so the tag API's responses may include resources that no longer exist (and may perhaps also not include just-tagged resources which do exist).
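
As a toy model of that lag (an illustration only, not how AWS implements tagging), the Python sketch below keeps a tag index that sees service changes only when sync() is called.  LaggyTagIndex and the ARNs are hypothetical.

```python
TAG = "kubernetes.io/cluster/example=owned"  # hypothetical cluster tag
OLD = "arn:aws:iam::123456789012:role/example-old-role"
NEW = "arn:aws:iam::123456789012:role/example-new-role"


class LaggyTagIndex:
    """Tag index that only observes service changes when sync() runs."""

    def __init__(self, service):
        self.service = service         # authoritative state: ARN -> tags
        self.snapshot = dict(service)  # what the tag query reports

    def sync(self):
        self.snapshot = dict(self.service)

    def tagged(self, tag):
        return {arn for arn, tags in self.snapshot.items() if tag in tags}


service = {OLD: {TAG}}
index = LaggyTagIndex(service)

del service[OLD]      # resource deleted in the owning service
service[NEW] = {TAG}  # a different resource created and tagged

stale = index.tagged(TAG)  # still lists OLD and does not yet list NEW
index.sync()
fresh = index.tagged(TAG)  # now matches the owning service
print(stale, fresh)
```

Until the index catches up, a teardown loop driven only by tag queries can both chase a resource that is gone and miss one that exists, which is why a direct check against the owning service is needed.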

Is it worth (optionally?) paying the cost of the additional Describe* calls to protect ourselves from leaking resources that are created by actors other than cluster-owned instances and whose ARNs match a resource the teardown logic just deleted?

This issue is for 4.2, but the underlying race also impacts 4.1.z (at least as of 4.1.11).

[1]: https://github.com/openshift/installer/pull/2169#issuecomment-518869011

Comment 1 Scott Dodson 2019-08-19 17:57:59 UTC
Obviated by Bug 1725287, marking as a dupe.

*** This bug has been marked as a duplicate of bug 1740933 ***
