Created attachment 1723269
OpenShift install log (full log install/multiple destroy attempts)

Version:
$ openshift-install version
4.4.8 (but also verified on 4.4.27 and seen on 4.6 RC4)

Platform: aws

Please specify:
* IPI

What happened?
* Ran `openshift-install destroy cluster`. That went well until it got to deleting networking. Log at that point:

time="2020-10-13T22:21:00Z" level=info msg=Deleted NAT gateway=nat-0026a87b668041282 arn="arn:aws:ec2:us-east-1:719622469867:vpc/vpc-0aa36367dfe03e5e6" id=vpc-0aa36367dfe03e5e6
time="2020-10-13T22:21:00Z" level=info msg=Deleted NAT gateway=nat-028cbecd03bdf1626 arn="arn:aws:ec2:us-east-1:719622469867:vpc/vpc-0aa36367dfe03e5e6" id=vpc-0aa36367dfe03e5e6
time="2020-10-13T22:21:00Z" level=debug msg="deleting EC2 network interface eni-057add48dda1dde52: InvalidParameterValue: Network interface 'eni-057add48dda1dde52' is currently in use.\n\tstatus code: 400, request id: 7f943940-8f0e-4ba3-bb48-d14863b081b8" arn="arn:aws:ec2:us-east-1:719622469867:vpc/vpc-0aa36367dfe03e5e6"
time="2020-10-13T22:21:00Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"21394bad-2813-4b12-ab18-ae80a3d3270a\"}"
time="2020-10-13T22:21:00Z" level=debug msg="search for IAM roles"
time="2020-10-13T22:21:01Z" level=debug msg="search for IAM users"
time="2020-10-13T22:21:08Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/cluster-johnson-c643-8qgs5\":\"owned\"}"
time="2020-10-13T22:21:08Z" level=debug msg="DependencyViolation: The dhcpOptions 'dopt-08305742d568fa01e' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 12e799b2-e69c-452f-a4d2-6eafe3512b88" arn="arn:aws:ec2:us-east-1:719622469867:dhcp-options/dopt-08305742d568fa01e"
time="2020-10-13T22:21:08Z" level=debug msg="detaching from vpc-0aa36367dfe03e5e6: DependencyViolation: Network vpc-0aa36367dfe03e5e6 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.\n\tstatus code: 400, request id: 2139d449-05c6-498a-a874-9e5b5277eef4" arn="arn:aws:ec2:us-east-1:719622469867:internet-gateway/igw-0abbc8088b0edfe2d"
time="2020-10-13T22:21:08Z" level=debug msg="DependencyViolation: The subnet 'subnet-0d6c2d80a4e83af8e' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 6a0fcd4a-457e-4b74-b990-8697efe6e5d8" arn="arn:aws:ec2:us-east-1:719622469867:subnet/subnet-0d6c2d80a4e83af8e"

Ran the destroy again. Same problem. Ran it again some time later: Different error. Also ran it using 4.4.27. Same error:

time="2020-10-21T17:25:07Z" level=debug msg="OpenShift Installer 4.4.27"
time="2020-10-21T17:25:07Z" level=debug msg="Built from commit 7aa8003c040735c125ca750774a6d8a49189570f"
time="2020-10-21T17:25:07Z" level=info msg="Credentials loaded from the \"default\" profile in file \"/home/ec2-user/.aws/credentials\""
time="2020-10-21T17:25:07Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"kubernetes.io/cluster/cluster-johnson-c643-8qgs5\":\"owned\"}"
time="2020-10-21T17:25:07Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"openshiftClusterID\":\"21394bad-2813-4b12-ab18-ae80a3d3270a\"}"
time="2020-10-21T17:25:07Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/cluster-johnson-c643-8qgs5\":\"owned\"}"
time="2020-10-21T17:25:07Z" level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z04530461JO5HLHD0RO9Y\n\tstatus code: 404, request id: 8982874d-1080-4544-b53b-cd6d78da55e0" arn="arn:aws:route53:::hostedzone/Z04530461JO5HLHD0RO9Y"
time="2020-10-21T17:25:07Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"21394bad-2813-4b12-ab18-ae80a3d3270a\"}"
time="2020-10-21T17:25:07Z" level=debug msg="search for IAM roles"
time="2020-10-21T17:26:46Z" level=debug msg="search for IAM users"
time="2020-10-21T17:31:31Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/cluster-johnson-c643-8qgs5\":\"owned\"}"
time="2020-10-21T17:31:31Z" level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z04530461JO5HLHD0RO9Y\n\tstatus code: 404, request id: 9d893a6a-3dfb-4e99-9191-2b368622417c" arn="arn:aws:route53:::hostedzone/Z04530461JO5HLHD0RO9Y"
time="2020-10-21T17:31:31Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"21394bad-2813-4b12-ab18-ae80a3d3270a\"}"
Staebler pointed out that the NoSuchHostedZone error comes from stale tags or an in-progress deletion, and that the fix is to add code at [1] that treats NoSuchHostedZone as "already done, success". Compare [2].

[1]: https://github.com/openshift/installer/blob/31dba4362530c3afaacf20e3a7971a10c1d9f288/pkg/destroy/aws/aws.go#L1912-L1917
[2]: https://github.com/openshift/installer/blob/31dba4362530c3afaacf20e3a7971a10c1d9f288/pkg/destroy/aws/aws.go#L774-L775
> time="2020-10-21T17:31:31Z" level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z04530461JO5HLHD0RO9Y\n\tstatus code: 404, request id: 9d893a6a-3dfb-4e99-9191-2b368622417c" arn="arn:aws:route53:::hostedzone/Z04530461JO5HLHD0RO9Y"

The resource tagging API is returning a resource that no longer seems to exist. This causes the installer to retry the delete and fail continuously. I think we already treat not-found errors as success elsewhere, and we can probably do the same here.

> time="2020-10-13T22:21:08Z" level=debug msg="DependencyViolation: The dhcpOptions 'dopt-08305742d568fa01e' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 12e799b2-e69c-452f-a4d2-6eafe3512b88" arn="arn:aws:ec2:us-east-1:719622469867:dhcp-options/dopt-08305742d568fa01e"
> time="2020-10-13T22:21:08Z" level=debug msg="detaching from vpc-0aa36367dfe03e5e6: DependencyViolation: Network vpc-0aa36367dfe03e5e6 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.\n\tstatus code: 400, request id: 2139d449-05c6-498a-a874-9e5b5277eef4" arn="arn:aws:ec2:us-east-1:719622469867:internet-gateway/igw-0abbc8088b0edfe2d"
> time="2020-10-13T22:21:08Z" level=debug msg="DependencyViolation: The subnet 'subnet-0d6c2d80a4e83af8e' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 6a0fcd4a-457e-4b74-b990-8697efe6e5d8" arn="arn:aws:ec2:us-east-1:719622469867:subnet/subnet-0d6c2d80a4e83af8e"

The installer is trying to delete certain networking resources and cannot, because AWS thinks there are still dependent resources that need to be cleaned up before those networking resources can be removed. The installer can't help here. The best thing is to wait for AWS to eventually remove the resources at its own pace, so you need to let the installer keep retrying for as long as it takes.
There are also cases where user-created resources can block cleanup, and the user is expected to remove those resources using the AWS console or similar before the installer can move forward successfully.

^^ This is not a bug, just how things work today due to AWS limitations.

Setting this to medium with a 4.7 target for the first part, which concerns the Route 53 resource.
@Wolfgang Per your original title, it seems the issue happened only occasionally. I ran the following steps [1] three times today but could not reproduce this issue against v4.4.8.

[1] Reproduction steps:
1. Trigger a normal IPI install on AWS; it succeeds.
2. Destroy the cluster using the `openshift-install destroy cluster --dir xxx` command; it succeeds.
3. Destroy the cluster again using the same command; it succeeds.

Are the above steps correct? Please correct me if something is not right. Thanks.
The race that got closed is pretty narrow, since the hosted zone being deleted still had to exist at [1] but be gone by the time we finish deleting record sets. You could try to delete the private zone when you see the first "deleting public zone..." or "deleting record set..." message logged, and hope you win the race and remove the zone before the installer tries to delete the private zone. Or you could just say the comment 4 test was sufficient to rule out serious breakage, verify this bug, and we'll open a new one if we see NoSuchHostedZone crop up again.

[1]: https://github.com/openshift/installer/pull/4477/files#diff-e66de401616d33d0c7efd3816f0495936ac3a4592a6a84b7981f3b0d0bc47831R1823
I still cannot reproduce this issue on OCP 4.4. Following the steps in comment 4, I removed the private hosted zone before executing the destroy command, and there was no `NoSuchHostedZone` error in the logs against OCP 4.7. Per comment 5, marking this bug as VERIFIED.

OCP version: 4.7.0-0.nightly-2020-12-20-031835
@yunjiang I haven't seen this in a while. But yes, the steps outlined are what would have triggered it.
Wolfgang, that's great. As noted in my previous comment, this bug has been marked VERIFIED. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633