Bug 1747519

Summary: Installer gets stuck deleting owned VPCs on AWS when there are untagged subnets or security groups
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Installer Assignee: W. Trevor King <wking>
Installer sub component: openshift-installer QA Contact: sheng.lao <shlao>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: bleanhar
Version: 4.2.0   
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-10-16 06:39:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description W. Trevor King 2019-08-30 16:59:06 UTC
Description of problem:

Sometimes CI leaks untagged security groups and subnets (I'm not clear on how). Because we are allowed to remove all resources from within a cluster-owned VPC, we should have *ByVPC walkers to remove these indirectly-owned resources.
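
A by-VPC walker could look roughly like the following AWS CLI sketch. It is an illustration only, not the installer's implementation; the function names subnets_by_vpc, security_groups_by_vpc, and delete_vpc_contents are made up for this example. It finds subnets and security groups with the vpc-id filter instead of by tag, so untagged resources inside the VPC are still found. The default security group is skipped because AWS does not allow deleting it.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not code from the installer.

# Print the ID of every subnet inside the given VPC, one per line.
subnets_by_vpc() {
  aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$1" \
    --query 'Subnets[].SubnetId' --output text | tr '\t' '\n'
}

# Print the ID of every non-default security group inside the given VPC.
security_groups_by_vpc() {
  aws ec2 describe-security-groups \
    --filters "Name=vpc-id,Values=$1" \
    --query "SecurityGroups[?GroupName!='default'].GroupId" \
    --output text | tr '\t' '\n'
}

# Delete the subnets and security groups found above, then the VPC itself.
delete_vpc_contents() {
  vpc="$1"
  for subnet in $(subnets_by_vpc "$vpc"); do
    aws ec2 delete-subnet --subnet-id "$subnet"
  done
  for group in $(security_groups_by_vpc "$vpc"); do
    aws ec2 delete-security-group --group-id "$group"
  done
  aws ec2 delete-vpc --vpc-id "$vpc"
}
```

Calling delete_vpc_contents with a VPC ID makes a single pass. A real walker would also need to retry on errors such as DependencyViolation, which this sketch does not attempt.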

[1] is an example of a teardown that gets stuck on this.  Removing that cluster with the code from installer#2214 gives:

$ AWS_PROFILE=ci ./vpc-delete-via-installer 
destroy {"Name":"ci-op-lz6psxgq-60667-kcdnr-vpc","expirationDate":"2019-08-30T09:12+0000","kubernetes.io/cluster/ci-op-lz6psxgq-60667-kcdnr":"owned","name":"ci-op-lz6psxgq-60667-kcdnr","openshift_creationDate":"2019-08-30T05:20:05.181253+00:00"}
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-097641f89b0988267
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-003f9ca4e80d830d7
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-0909490297aebc554
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-0171e9662756b3ee9
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5
INFO Deleted                                       arn="arn:aws:ec2:us-east-1:460538899914:dhcp-options/dopt-06f969a9588954be7" id=dopt-06f969a9588954be7

showing that the issue was blocking subnets, and that #2214 allows for successful deletion of those subnets.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ipi-deprovision/8147

Comment 2 W. Trevor King 2019-09-10 03:54:19 UTC
Verification for this probably looks like:

1. Create an AWS cluster.
2. Remove tags from a subnet inside the cluster.
3. Remove tags from a security group inside the cluster.
4. Run 'openshift-install destroy cluster' and see that it succeeds without getting hung up on the untagged resources.
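
Steps 2 through 4 could be scripted along these lines. This is a sketch under assumptions: the helper names untag_cluster_resources and verify_destroy are hypothetical, the infrastructure ID, resource IDs, and install directory must be supplied by the caller, and only the kubernetes.io/cluster/... ownership tag is removed rather than every tag.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for the verification steps.

# Remove the cluster ownership tag from the given resources.
# Usage: untag_cluster_resources <infra-id> <resource-id>...
untag_cluster_resources() {
  infra_id="$1"
  shift
  # A tag key given without a value deletes the tag whatever its value is.
  aws ec2 delete-tags \
    --resources "$@" \
    --tags "Key=kubernetes.io/cluster/${infra_id}"
}

# Untag the resources, then destroy the cluster.
# The exit status of the destroy command is the result of the check.
# Usage: verify_destroy <install-dir> <infra-id> <resource-id>...
verify_destroy() {
  dir="$1"
  infra_id="$2"
  shift 2
  untag_cluster_resources "$infra_id" "$@" &&
    openshift-install --dir "$dir" destroy cluster
}
```

For example, verify_destroy followed by the install directory, the infrastructure ID, one subnet ID, and one security group ID would cover steps 2 and 3 before running the destroy in step 4.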

Comment 5 W. Trevor King 2019-09-11 20:07:45 UTC
Sheng points out that this is still getting stuck on:

time="2019-09-11T04:06:20-04:00" level=debug msg="deleting EC2 security group sg-036d0959040324247: DependencyViolation: resource sg-036d0959040324247 has a dependent object\n\tstatus code: 400, request id: d3f0a9b0-3b9f-406e-b769-3fc9c1738d33" arn="arn:aws:ec2:ap-northeast-1:301721915996:vpc/vpc-0603e90cd7607e32c"

But it works for me:

$ openshift-install --dir wking create cluster
# manually remove the kubernetes.io/cluster/... tag from a security group.
$ aws ec2 describe-security-groups --group-ids sg-016e5d178041ae01d | jq -r '.SecurityGroups[].Tags[]'
{
  "Value": "wking-tqgwc-master-sg",
  "Key": "Name"
}
$ openshift-install --dir wking destroy cluster
INFO Disassociated                                 instance=i-0fd71078121225b41 name=wking-tqgwc-bootstrap-profile role=wking-tqgwc-bootstrap-role
...
INFO Deleted                                       arn="arn:aws:ec2:us-west-2:269733383066:security-group/sg-01f1419b07006ffbc" id=sg-01f1419b07006ffbc
...
INFO Deleted                                       arn="arn:aws:ec2:us-west-2:269733383066:vpc/vpc-028809e7b57efd24f" id=vpc-028809e7b57efd24f table=rtb-00e31e33b92f3089e
INFO Deleted                                       arn="arn:aws:ec2:us-west-2:269733383066:vpc/vpc-028809e7b57efd24f" id=vpc-028809e7b57efd24f security group=sg-016e5d178041ae01d
INFO Deleted                                       arn="arn:aws:ec2:us-west-2:269733383066:vpc/vpc-028809e7b57efd24f" id=vpc-028809e7b57efd24f security group=sg-0dd6861ba70f34a36
...
INFO Deleted                                       arn="arn:aws:ec2:us-west-2:269733383066:dhcp-options/dopt-0ef63583430009721" id=dopt-0ef63583430009721
$ echo $?
0

I'll try again with an installer extracted from registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-11-012246 and all the subnet/security-group tags removed...

Comment 6 W. Trevor King 2019-09-11 21:37:04 UTC
I reproduced by removing kubernetes.io/cluster/... tags from *all* my security groups.  Patch submitted to fix this case.

Comment 8 sheng.lao 2019-09-16 06:40:18 UTC
It passed. I verified it with 4.2.0-0.nightly-2019-09-15-221449.

Comment 9 errata-xmlrpc 2019-10-16 06:39:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922