Bug 1740933 - AWS 'destroy cluster' can leak resources due to races between cluster deletion and in-cluster resource creation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: W. Trevor King
QA Contact: sheng.lao
URL:
Whiteboard:
Duplicates: 1740935 1743313
Depends On:
Blocks:
 
Reported: 2019-08-13 21:41 UTC by W. Trevor King
Modified: 2019-10-16 06:36 UTC (History)
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:35:53 UTC
Target Upstream Version:
Embargoed:




Links
Github: openshift/installer pull 2169 (last updated 2019-08-13 21:42:02 UTC)
Red Hat Product Errata: RHBA-2019:2922 (last updated 2019-10-16 06:36:06 UTC)

Description W. Trevor King 2019-08-13 21:41:21 UTC
CI has turned up some races like:

1. The installer removes a bucket
2. The still-running registry operator tries to self-heal and creates a new bucket, but before it can push tags...
3. The installer terminates the instance where the registry operator was running.
4. The installer leaks the new, untagged bucket.

This has been happening in CI jobs like [1,2], where CloudFormation logs show:

  12:56:32, registry operator creates the bucket
  14:31:10, installer deletes the bucket
  14:31:11, registry operator creates the bucket again
  14:31:11, installer starts requesting instance termination

This race could theoretically affect any resource (not just buckets) that we discover by tag but that is not tagged atomically on creation.  This issue is for 4.2, but the underlying race also impacts 4.1.z (at least as of 4.1.11).
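
For illustration, creating a bucket and tagging it are separate S3 API calls, so there is always a window in which a freshly created bucket carries no cluster tag at all.  A minimal sketch of that two-step pattern (not the registry operator's actual code; the helper name and arguments are made up):

package example

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
)

// createTaggedBucket is a hypothetical helper illustrating the leak window:
// CreateBucket and PutBucketTagging are two separate calls, so if the
// instance running this code is terminated between them, the bucket exists
// but carries no "kubernetes.io/cluster/<infra-id>: owned" tag and a
// tag-based destroyer never finds it.
func createTaggedBucket(client *s3.S3, bucket, infraID string) error {
	if _, err := client.CreateBucket(&s3.CreateBucketInput{
		Bucket: aws.String(bucket),
	}); err != nil {
		return err
	}
	// ...the bucket is untagged here; termination in this window leaks it...
	_, err := client.PutBucketTagging(&s3.PutBucketTaggingInput{
		Bucket: aws.String(bucket),
		Tagging: &s3.Tagging{
			TagSet: []*s3.Tag{{
				Key:   aws.String("kubernetes.io/cluster/" + infraID),
				Value: aws.String("owned"),
			}},
		},
	})
	return err
}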

The suggested fix is to terminate instances first, and only move on to deleting other resources once no cluster-owned instances remain running or shutting down [3].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/232/pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade/454
[2]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/operator-framework_operator-marketplace/232/pull-ci-operator-framework-operator-marketplace-master-e2e-aws-upgrade/454/artifacts/e2e-aws-upgrade/installer/.openshift_install.log
[3]: https://github.com/openshift/installer/pull/2169
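
As a rough sketch of that ordering (assumed names, not the installer's actual implementation), the destroyer would keep terminating and re-polling tagged instances until EC2 reports none left alive, and only then move on to the tag-based sweep of everything else:

package destroy

import (
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// deleteInstancesFirst terminates every instance carrying the cluster's
// "owned" tag, then keeps polling until none of them is left running or
// shutting down.  Only once this returns should the destroyer sweep the
// other tagged resources, so a still-running operator cannot recreate them
// behind its back.
func deleteInstancesFirst(client *ec2.EC2, infraID string) error {
	for {
		out, err := client.DescribeInstances(&ec2.DescribeInstancesInput{
			Filters: []*ec2.Filter{
				{
					Name:   aws.String("tag:kubernetes.io/cluster/" + infraID),
					Values: []*string{aws.String("owned")},
				},
				{
					// Anything not yet "terminated" may still be doing work.
					Name: aws.String("instance-state-name"),
					Values: []*string{
						aws.String("pending"), aws.String("running"),
						aws.String("stopping"), aws.String("stopped"),
						aws.String("shutting-down"),
					},
				},
			},
		})
		if err != nil {
			return err
		}
		var ids []*string
		for _, reservation := range out.Reservations {
			for _, instance := range reservation.Instances {
				ids = append(ids, instance.InstanceId)
			}
		}
		if len(ids) == 0 {
			return nil // no live instances left; safe to delete buckets, AMIs, etc.
		}
		if _, err := client.TerminateInstances(&ec2.TerminateInstancesInput{
			InstanceIds: ids,
		}); err != nil {
			return err
		}
		log.Printf("waiting for %d instances to terminate", len(ids))
		time.Sleep(15 * time.Second)
	}
}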

Comment 1 W. Trevor King 2019-08-19 16:32:39 UTC
*** Bug 1743313 has been marked as a duplicate of this bug. ***

Comment 2 Scott Dodson 2019-08-19 17:57:59 UTC
*** Bug 1740935 has been marked as a duplicate of this bug. ***

Comment 4 W. Trevor King 2019-09-05 03:30:00 UTC
I'm not sure what validation looks like for this bug.  Ideally we stop leaking registry buckets in CI (at least from 4.2 clusters that will have the new code).  But narrowly it might just be "I looked and saw the instances terminated first", like [1]:

time="2019-09-05T00:15:08Z" level=debug msg="OpenShift Installer unreleased-master-1687-g9c59c82b8f8631d082eaa4276e1595c95a581c4a-dirty"
time="2019-09-05T00:15:08Z" level=debug msg="Built from commit 9c59c82b8f8631d082eaa4276e1595c95a581c4a"
time="2019-09-05T00:15:08Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"kubernetes.io/cluster/ci-op-j16mjwc5-1d3f3-gpzqh\":\"owned\"}"
time="2019-09-05T00:15:08Z" level=debug msg=Terminated instance=i-01d82808356254e87
time="2019-09-05T00:15:09Z" level=info msg=Disassociated instance=i-06a364cefe8e22247 name=ci-op-j16mjwc5-1d3f3-gpzqh-master-profile role=ci-op-j16mjwc5-1d3f3-gpzqh-master-role
time="2019-09-05T00:15:09Z" level=info msg=Deleted InstanceProfileName=ci-op-j16mjwc5-1d3f3-gpzqh-master-profile arn="arn:aws:iam::460538899914:instance-profile/ci-op-j16mjwc5-1d3f3-gpzqh-master-profile" instance=i-06a364cefe8e22247
time="2019-09-05T00:15:09Z" level=info msg=Terminating instance=i-06a364cefe8e22247
time="2019-09-05T00:15:09Z" level=info msg=Terminating instance=i-0d1f65cc7f2141bae
time="2019-09-05T00:15:09Z" level=info msg=Disassociated instance=i-0f5205e4c5e40e39d name=ci-op-j16mjwc5-1d3f3-gpzqh-worker-profile role=ci-op-j16mjwc5-1d3f3-gpzqh-worker-role
time="2019-09-05T00:15:09Z" level=info msg=Deleted InstanceProfileName=ci-op-j16mjwc5-1d3f3-gpzqh-worker-profile arn="arn:aws:iam::460538899914:instance-profile/ci-op-j16mjwc5-1d3f3-gpzqh-worker-profile" instance=i-0f5205e4c5e40e39d
time="2019-09-05T00:15:09Z" level=info msg=Terminating instance=i-0f5205e4c5e40e39d
time="2019-09-05T00:15:09Z" level=info msg=Terminating instance=i-0102c7a760fd719a2
time="2019-09-05T00:15:10Z" level=info msg=Terminating instance=i-0a6d5ce45bedab363
time="2019-09-05T00:15:10Z" level=info msg=Terminating instance=i-0a57eb385a5c0226a
time="2019-09-05T00:15:10Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"openshiftClusterID\":\"c106fc46-8afb-4285-b3b2-5faf76fbda72\"}"
...
time="2019-09-05T00:15:40Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"kubernetes.io/cluster/ci-op-j16mjwc5-1d3f3-gpzqh\":\"owned\"}"
time="2019-09-05T00:15:40Z" level=debug msg=Terminated instance=i-06a364cefe8e22247
time="2019-09-05T00:15:40Z" level=debug msg=Terminated instance=i-0d1f65cc7f2141bae
time="2019-09-05T00:15:40Z" level=debug msg=Terminated instance=i-0f5205e4c5e40e39d
time="2019-09-05T00:15:40Z" level=debug msg=Terminated instance=i-0a6d5ce45bedab363
time="2019-09-05T00:15:40Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"openshiftClusterID\":\"c106fc46-8afb-4285-b3b2-5faf76fbda72\"}"
...
time="2019-09-05T00:17:00Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"kubernetes.io/cluster/ci-op-j16mjwc5-1d3f3-gpzqh\":\"owned\"}"
time="2019-09-05T00:17:00Z" level=debug msg=Terminated instance=i-0a57eb385a5c0226a
time="2019-09-05T00:17:00Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"openshiftClusterID\":\"c106fc46-8afb-4285-b3b2-5faf76fbda72\"}"
time="2019-09-05T00:17:10Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"kubernetes.io/cluster/ci-op-j16mjwc5-1d3f3-gpzqh\":\"owned\"}"
time="2019-09-05T00:17:11Z" level=debug msg=Terminated instance=i-0102c7a760fd719a2
time="2019-09-05T00:17:11Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"openshiftClusterID\":\"c106fc46-8afb-4285-b3b2-5faf76fbda72\"}"
time="2019-09-05T00:17:11Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/ci-op-j16mjwc5-1d3f3-gpzqh\":\"owned\"}"
time="2019-09-05T00:17:11Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:image/ami-04c4612436f0cde76" id=ami-04c4612436f0cde76
...

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2169/pull-ci-openshift-installer-master-e2e-aws/7522/artifacts/e2e-aws/installer/.openshift_install.log
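
For a narrow check of that ordering, something along these lines could scan a saved .openshift_install.log and confirm that no instance is still being handled once the generic resources-by-tag sweep has started (a sketch; the matched strings are just the messages visible in the log above):

package verify

import (
	"bufio"
	"os"
	"strings"
)

// instancesGoneBeforeResourceSweep is a hypothetical check against a saved
// .openshift_install.log: it reports whether every instance
// "Terminating"/"Terminated" message appears before the destroyer's first
// "search for and delete matching resources by tag" line, i.e. whether the
// instances were fully gone before any other tagged resource was touched.
func instancesGoneBeforeResourceSweep(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	sweepStarted := false
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "search for and delete matching resources by tag") {
			sweepStarted = true
		}
		if sweepStarted && (strings.Contains(line, "msg=Terminating") ||
			strings.Contains(line, "msg=Terminated")) {
			return false, nil // instance activity after the resource sweep began
		}
	}
	if err := scanner.Err(); err != nil {
		return false, err
	}
	return true, nil
}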

Comment 6 sheng.lao 2019-09-06 09:39:46 UTC
The PR has not been merged into today's nightly build.

Comment 7 sheng.lao 2019-09-09 05:25:25 UTC
Verified with version 4.2.0-0.nightly-2019-09-08-232045

Comment 8 errata-xmlrpc 2019-10-16 06:35:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

