Bug 1747519 - Installer gets stuck deleting owned VPCs on AWS when there are untagged subnets or security groups
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: W. Trevor King
QA Contact: sheng.lao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-30 16:59 UTC by W. Trevor King
Modified: 2019-10-16 06:39 UTC
1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:39:32 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift installer pull 2214 None None None 2019-08-30 16:59:57 UTC
Github openshift installer pull 2346 None None None 2019-09-11 21:37:35 UTC
Red Hat Product Errata RHBA-2019:2922 None None None 2019-10-16 06:39:41 UTC

Description W. Trevor King 2019-08-30 16:59:06 UTC
Description of problem:

Sometimes CI leaks untagged security groups and subnets (I'm not clear on how). Because we are allowed to remove all resources from within a cluster-owned VPC, we should have *ByVPC walkers to remove these indirectly-owned resources.
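The *ByVPC walk enumerates resources by their parent VPC rather than by cluster tag, so untagged leftovers are still found. A rough CLI-level illustration of the idea (not the installer's actual Go implementation in installer#2214; the VPC ID below is the example from the log that follows):

```shell
# Enumerate all subnets belonging to a cluster-owned VPC, ignoring tags,
# and delete each one. This mirrors what a *ByVPC walker would do.
VPC_ID=vpc-000d53425754ba1b5   # example VPC ID, taken from the log below
for SUBNET in $(aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=${VPC_ID}" \
    --query 'Subnets[].SubnetId' --output text); do
  aws ec2 delete-subnet --subnet-id "${SUBNET}"
done
```

The same pattern extends to other VPC-scoped resources such as security groups.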

[1] is an example of teardown that gets stuck on this.  Removing that cluster with the code from installer#2214 gives:

$ AWS_PROFILE=ci ./vpc-delete-via-installer 
destroy {"Name":"ci-op-lz6psxgq-60667-kcdnr-vpc","expirationDate":"2019-08-30T09:12+0000","kubernetes.io/cluster/ci-op-lz6psxgq-60667-kcdnr":"owned","name":"ci-op-lz6psxgq-60667-kcdnr","openshift_creationDate":"2019-08-30T05:20:05.181253+00:00"}
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-097641f89b0988267
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-003f9ca4e80d830d7
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-0909490297aebc554
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-0171e9662756b3ee9
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:dhcp-options/dopt-06f969a9588954be7" id=dopt-06f969a9588954be7

showing that the issue was the blocking subnets, and that #2214 allows successful deletion of those subnets.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ipi-deprovision/8147

Comment 2 W. Trevor King 2019-09-10 03:54:19 UTC
Verification for this probably looks like:

1. Create an AWS cluster.
2. Remove tags from a subnet inside the cluster.
3. Remove tags from a security group inside the cluster.
4. Run 'openshift-install destroy cluster' and see that it succeeds without getting hung up on the untagged resources.
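Steps 2 and 3 can be done with the AWS CLI; a hypothetical sketch (the resource IDs and infra ID below are placeholders, not from this bug):

```shell
INFRA_ID=example-cluster-abcde   # placeholder cluster infra ID
# Strip the cluster-ownership tag from one subnet and one security group.
aws ec2 delete-tags --resources subnet-0123456789abcdef0 \
    --tags "Key=kubernetes.io/cluster/${INFRA_ID}"
aws ec2 delete-tags --resources sg-0123456789abcdef0 \
    --tags "Key=kubernetes.io/cluster/${INFRA_ID}"
# Destroy should now succeed despite the untagged resources.
openshift-install --dir wking destroy cluster
```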

Comment 5 W. Trevor King 2019-09-11 20:07:45 UTC
Sheng points out this gets stuck on:

time="2019-09-11T04:06:20-04:00" level=debug msg="deleting EC2 security group sg-036d0959040324247: DependencyViolation: resource sg-036d0959040324247 has a dependent object\n\tstatus code: 400, request id: d3f0a9b0-3b9f-406e-b769-3fc9c1738d33" arn="arn:aws:ec2:ap-northeast-1:301721915996:vpc/vpc-0603e90cd7607e32c"

But it works for me:

$ openshift-install --dir wking create cluster
# manually remove the kubernetes.io/cluster/... tag from a security group.
$ aws ec2 describe-security-groups --group-ids sg-016e5d178041ae01d | jq -r '.SecurityGroups[].Tags[]'
{
  "Value": "wking-tqgwc-master-sg",
  "Key": "Name"
}
$ openshift-install --dir wking destroy cluster
INFO Disassociated                                 instance=i-0fd71078121225b41 name=wking-tqgwc-bootstrap-profile role=wking-tqgwc-bootstrap-role
...
INFO Deleted                                       arn="arn:aws:ec2:us-west-2:269733383066:security-group/sg-01f1419b07006ffbc" id=sg-01f1419b07006ffbc
...
INFO Deleted                                       arn="arn:aws:ec2:us-west-2:269733383066:vpc/vpc-028809e7b57efd24f" id=vpc-028809e7b57efd24f table=rtb-00e31e33b92f3089e
INFO Deleted                                       arn="arn:aws:ec2:us-west-2:269733383066:vpc/vpc-028809e7b57efd24f" id=vpc-028809e7b57efd24f security group=sg-016e5d178041ae01d
INFO Deleted                                       arn="arn:aws:ec2:us-west-2:269733383066:vpc/vpc-028809e7b57efd24f" id=vpc-028809e7b57efd24f security group=sg-0dd6861ba70f34a36
...
INFO Deleted                                       arn="arn:aws:ec2:us-west-2:269733383066:dhcp-options/dopt-0ef63583430009721" id=dopt-0ef63583430009721
$ echo $?
0

I'll try again with an installer extracted from registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-11-012246 and all the subnet/security-group tags removed...

Comment 6 W. Trevor King 2019-09-11 21:37:04 UTC
I reproduced by removing kubernetes.io/cluster/... tags from *all* my security groups.  Patch submitted to fix this case.

Comment 8 sheng.lao 2019-09-16 06:40:18 UTC
It passed; I verified it with 4.2.0-0.nightly-2019-09-15-221449.

Comment 9 errata-xmlrpc 2019-10-16 06:39:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

