Bug 1747519
| Summary: | Installer gets stuck deleting owned VPCs on AWS when there are untagged subnets or security groups | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Installer | Assignee: | W. Trevor King <wking> |
| Installer sub component: | openshift-installer | QA Contact: | sheng.lao <shlao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | bleanhar |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:39:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Verification for this probably looks like:

1. Create an AWS cluster.
2. Remove tags from a subnet inside the cluster.
3. Remove tags from a security group inside the cluster.
4. Run 'openshift-install destroy cluster' and see that it succeeds without getting hung up on the untagged resources.

Sheng points out that this gets stuck on:
time="2019-09-11T04:06:20-04:00" level=debug msg="deleting EC2 security group sg-036d0959040324247: DependencyViolation: resource sg-036d0959040324247 has a dependent object\n\tstatus code: 400, request id: d3f0a9b0-3b9f-406e-b769-3fc9c1738d33" arn="arn:aws:ec2:ap-northeast-1:301721915996:vpc/vpc-0603e90cd7607e32c"
But it works for me:
$ openshift-install --dir wking create cluster
# manually remove the kubernetes.io/cluster/... tag from a security group.
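# e.g. (an assumption, not the command actually run: the tag key ends with the
# cluster's infra ID, which the resource names suggest is wking-tqgwc):
#   aws ec2 delete-tags --resources sg-016e5d178041ae01d --tags Key=kubernetes.io/cluster/wking-tqgwc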
$ aws ec2 describe-security-groups --group-ids sg-016e5d178041ae01d | jq -r '.SecurityGroups[].Tags[]'
{
"Value": "wking-tqgwc-master-sg",
"Key": "Name"
}
$ openshift-install --dir wking destroy cluster
INFO Disassociated instance=i-0fd71078121225b41 name=wking-tqgwc-bootstrap-profile role=wking-tqgwc-bootstrap-role
...
INFO Deleted arn="arn:aws:ec2:us-west-2:269733383066:security-group/sg-01f1419b07006ffbc" id=sg-01f1419b07006ffbc
...
INFO Deleted arn="arn:aws:ec2:us-west-2:269733383066:vpc/vpc-028809e7b57efd24f" id=vpc-028809e7b57efd24f table=rtb-00e31e33b92f3089e
INFO Deleted arn="arn:aws:ec2:us-west-2:269733383066:vpc/vpc-028809e7b57efd24f" id=vpc-028809e7b57efd24f security group=sg-016e5d178041ae01d
INFO Deleted arn="arn:aws:ec2:us-west-2:269733383066:vpc/vpc-028809e7b57efd24f" id=vpc-028809e7b57efd24f security group=sg-0dd6861ba70f34a36
...
INFO Deleted arn="arn:aws:ec2:us-west-2:269733383066:dhcp-options/dopt-0ef63583430009721" id=dopt-0ef63583430009721
$ echo $?
0
I'll try again with an installer extracted from registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-11-012246 and all the subnet/security-group tags removed...
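(Extracting the installer from a release image would look something like the following sketch, assuming oc is logged in with pull access to the CI registry:)

$ oc adm release extract --command=openshift-install --to=. registry.svc.ci.openshift.org/ocp/release:4.2.0-0.nightly-2019-09-11-012246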
I reproduced by removing kubernetes.io/cluster/... tags from *all* my security groups. Patch submitted to fix this case.

It's passed; I verified it with 4.2.0-0.nightly-2019-09-15-221449.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
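(Re the reproduction step above: one way to strip the tag from every security group in the cluster's VPC, sketched with placeholder $VPC and $INFRA_ID rather than values taken from this report:)

$ aws ec2 describe-security-groups --filters Name=vpc-id,Values="$VPC" --query 'SecurityGroups[].GroupId' --output text | xargs aws ec2 delete-tags --tags "Key=kubernetes.io/cluster/$INFRA_ID" --resources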
Description of problem:

Sometimes CI leaks untagged security groups and subnets (I'm not clear on how). Because we are allowed to remove all resources from within a cluster-owned VPC, we should have *ByVPC walkers to remove these indirectly-owned resources (a sketch of the manual equivalent is at the end of this report). [1] is an example of teardown that gets stuck on this.

Removing that cluster with the code from installer#2214 gives:

$ AWS_PROFILE=ci ./vpc-delete-via-installer destroy
{"Name":"ci-op-lz6psxgq-60667-kcdnr-vpc","expirationDate":"2019-08-30T09:12+0000","kubernetes.io/cluster/ci-op-lz6psxgq-60667-kcdnr":"owned","name":"ci-op-lz6psxgq-60667-kcdnr","openshift_creationDate":"2019-08-30T05:20:05.181253+00:00"}
INFO Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-097641f89b0988267
INFO Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-003f9ca4e80d830d7
INFO Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-0909490297aebc554
INFO Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5 subnet=subnet-0171e9662756b3ee9
INFO Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-000d53425754ba1b5" id=vpc-000d53425754ba1b5
INFO Deleted arn="arn:aws:ec2:us-east-1:460538899914:dhcp-options/dopt-06f969a9588954be7" id=dopt-06f969a9588954be7

showing that the issue was blocking subnets, and that #2214 allows for successful deletion of those subnets.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ipi-deprovision/8147
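(For the record, a sketch of the manual equivalent of such *ByVPC walkers, with $VPC as a placeholder; the installer itself implements this in Go. The idea is to enumerate by a vpc-id filter instead of by tag, then delete:)

$ aws ec2 describe-subnets --filters Name=vpc-id,Values="$VPC" --query 'Subnets[].SubnetId' --output text | xargs -n1 aws ec2 delete-subnet --subnet-id
$ aws ec2 describe-security-groups --filters Name=vpc-id,Values="$VPC" --query "SecurityGroups[?GroupName!='default'].GroupId" --output text | xargs -n1 aws ec2 delete-security-group --group-id

(The default group is skipped because AWS removes it together with the VPC.)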