Bug 1890228 - AWS: destroy stuck on route53 hosted zone not found
Summary: AWS: destroy stuck on route53 hosted zone not found
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Russell Teague
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-10-21 17:43 UTC by Wolfgang Kulhanek
Modified: 2021-02-24 15:27 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The 'destroy cluster' command of openshift-install must remove the cluster objects that the installer initially created, in this instance hosted zones. Consequence: If a hosted zone has already been removed, the installer hangs while repeatedly attempting to remove it. Fix: Added logic to skip the removal of an object that is already gone. Result: openshift-install destroy cluster completes successfully and does not hang when deleting hosted zone objects.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:27:22 UTC
Target Upstream Version:


Attachments
OpenShift install log (full log install/multiple destroy attempts) (3.50 MB, text/plain)
2020-10-21 17:43 UTC, Wolfgang Kulhanek


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 4477 0 None closed Bug 1890228: pkg/destroy/aws: Pass destroy if HostedZone does not exist 2021-02-17 18:29:28 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:27:46 UTC

Description Wolfgang Kulhanek 2020-10-21 17:43:16 UTC
Created attachment 1723269
OpenShift install log (full log install/multiple destroy attempts)

Version:

$ openshift-install version
4.4.8 (but also verified on 4.4.27 and seen on 4.6 RC4)

Platform:

aws

Please specify:
* IPI

What happened?

* Ran `openshift-install destroy cluster`. That went well until it got to deleting networking.

Log at that point:

time="2020-10-13T22:21:00Z" level=info msg=Deleted NAT gateway=nat-0026a87b668041282 arn="arn:aws:ec2:us-east-1:719622469867:vpc/vpc-0aa36367dfe03e5e6" id=vpc-0aa36367dfe03e5e6
time="2020-10-13T22:21:00Z" level=info msg=Deleted NAT gateway=nat-028cbecd03bdf1626 arn="arn:aws:ec2:us-east-1:719622469867:vpc/vpc-0aa36367dfe03e5e6" id=vpc-0aa36367dfe03e5e6
time="2020-10-13T22:21:00Z" level=debug msg="deleting EC2 network interface eni-057add48dda1dde52: InvalidParameterValue: Network interface 'eni-057add48dda1dde52' is currently in use.\n\tstatus code: 400, request id: 7f943940-8f0e-4ba3-bb48-d14863b081b8" arn="arn:aws:ec2:us-east-1:719622469867:vpc/vpc-0aa36367dfe03e5e6"
time="2020-10-13T22:21:00Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"21394bad-2813-4b12-ab18-ae80a3d3270a\"}"
time="2020-10-13T22:21:00Z" level=debug msg="search for IAM roles"
time="2020-10-13T22:21:01Z" level=debug msg="search for IAM users"
time="2020-10-13T22:21:08Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/cluster-johnson-c643-8qgs5\":\"owned\"}"
time="2020-10-13T22:21:08Z" level=debug msg="DependencyViolation: The dhcpOptions 'dopt-08305742d568fa01e' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 12e799b2-e69c-452f-a4d2-6eafe3512b88" arn="arn:aws:ec2:us-east-1:719622469867:dhcp-options/dopt-08305742d568fa01e"
time="2020-10-13T22:21:08Z" level=debug msg="detaching from vpc-0aa36367dfe03e5e6: DependencyViolation: Network vpc-0aa36367dfe03e5e6 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.\n\tstatus code: 400, request id: 2139d449-05c6-498a-a874-9e5b5277eef4" arn="arn:aws:ec2:us-east-1:719622469867:internet-gateway/igw-0abbc8088b0edfe2d"
time="2020-10-13T22:21:08Z" level=debug msg="DependencyViolation: The subnet 'subnet-0d6c2d80a4e83af8e' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 6a0fcd4a-457e-4b74-b990-8697efe6e5d8" arn="arn:aws:ec2:us-east-1:719622469867:subnet/subnet-0d6c2d80a4e83af8e"



Ran the destroy again. Same problem.
Ran it again some time later: Different error. Also ran it using 4.4.27. Same error:

time="2020-10-21T17:25:07Z" level=debug msg="OpenShift Installer 4.4.27"
time="2020-10-21T17:25:07Z" level=debug msg="Built from commit 7aa8003c040735c125ca750774a6d8a49189570f"
time="2020-10-21T17:25:07Z" level=info msg="Credentials loaded from the \"default\" profile in file \"/home/ec2-user/.aws/credentials\""
time="2020-10-21T17:25:07Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"kubernetes.io/cluster/cluster-johnson-c643-8qgs5\":\"owned\"}"
time="2020-10-21T17:25:07Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"openshiftClusterID\":\"21394bad-2813-4b12-ab18-ae80a3d3270a\"}"
time="2020-10-21T17:25:07Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/cluster-johnson-c643-8qgs5\":\"owned\"}"
time="2020-10-21T17:25:07Z" level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z04530461JO5HLHD0RO9Y\n\tstatus code: 404, request id: 8982874d-1080-4544-b53b-cd6d78da55e0" arn="arn:aws:route53:::hostedzone/Z04530461JO5HLHD0RO9Y"
time="2020-10-21T17:25:07Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"21394bad-2813-4b12-ab18-ae80a3d3270a\"}"
time="2020-10-21T17:25:07Z" level=debug msg="search for IAM roles"
time="2020-10-21T17:26:46Z" level=debug msg="search for IAM users"
time="2020-10-21T17:31:31Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/cluster-johnson-c643-8qgs5\":\"owned\"}"
time="2020-10-21T17:31:31Z" level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z04530461JO5HLHD0RO9Y\n\tstatus code: 404, request id: 9d893a6a-3dfb-4e99-9191-2b368622417c" arn="arn:aws:route53:::hostedzone/Z04530461JO5HLHD0RO9Y"
time="2020-10-21T17:31:31Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"21394bad-2813-4b12-ab18-ae80a3d3270a\"}"
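
To see how often the same resource keeps failing across destroy attempts, log lines like the ones above can be tallied by ARN. The following Go sketch is only an illustration and is not part of the installer: `countByARN` and `arnRE` are hypothetical names, and it assumes the `key="value"` log layout shown in these excerpts.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// arnRE captures the value of the arn="..." field on a log line.
// Assumption: the field layout matches the excerpts in this report.
var arnRE = regexp.MustCompile(`arn="([^"]+)"`)

// countByARN (hypothetical helper) counts, per ARN, the log lines whose
// text contains the given AWS error code followed by a colon.
func countByARN(log, code string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, code+":") {
			continue
		}
		if m := arnRE.FindStringSubmatch(line); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Three lines copied from the second log excerpt in this report.
	log := `time="2020-10-21T17:25:07Z" level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z04530461JO5HLHD0RO9Y\n\tstatus code: 404, request id: 8982874d-1080-4544-b53b-cd6d78da55e0" arn="arn:aws:route53:::hostedzone/Z04530461JO5HLHD0RO9Y"
time="2020-10-21T17:25:07Z" level=debug msg="search for IAM roles"
time="2020-10-21T17:31:31Z" level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z04530461JO5HLHD0RO9Y\n\tstatus code: 404, request id: 9d893a6a-3dfb-4e99-9191-2b368622417c" arn="arn:aws:route53:::hostedzone/Z04530461JO5HLHD0RO9Y"`

	for arn, n := range countByARN(log, "NoSuchHostedZone") {
		fmt.Println(arn, n)
	}
}
```

A count that keeps growing for a single hosted zone ARN across passes suggests the destroy loop is retrying a delete that can never succeed, which is the pattern reported here.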

Comment 1 W. Trevor King 2020-10-21 17:48:45 UTC
Staebler pointed out that the NoSuchHostedZone error comes from stale tags or an in-progress deletion, and that the fix is to add code at [1] that treats NoSuchHostedZone as "already done, success".  Compare [2].

[1]: https://github.com/openshift/installer/blob/31dba4362530c3afaacf20e3a7971a10c1d9f288/pkg/destroy/aws/aws.go#L1912-L1917
[2]: https://github.com/openshift/installer/blob/31dba4362530c3afaacf20e3a7971a10c1d9f288/pkg/destroy/aws/aws.go#L774-L775

Comment 2 Abhinav Dahiya 2020-10-21 17:54:59 UTC
> time="2020-10-21T17:31:31Z" level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z04530461JO5HLHD0RO9Y\n\tstatus code: 404, request id: 9d893a6a-3dfb-4e99-9191-2b368622417c" arn="arn:aws:route53:::hostedzone/Z04530461JO5HLHD0RO9Y"

The resource tagging API is returning a resource that no longer seems to exist. This causes the installer to retry the delete and fail continuously. I think we already treat not-found errors as success elsewhere, and we can probably do the same here.

> time="2020-10-13T22:21:08Z" level=debug msg="DependencyViolation: The dhcpOptions 'dopt-08305742d568fa01e' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 12e799b2-e69c-452f-a4d2-6eafe3512b88" arn="arn:aws:ec2:us-east-1:719622469867:dhcp-options/dopt-08305742d568fa01e"
> time="2020-10-13T22:21:08Z" level=debug msg="detaching from vpc-0aa36367dfe03e5e6: DependencyViolation: Network vpc-0aa36367dfe03e5e6 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.\n\tstatus code: 400, request id: 2139d449-05c6-498a-a874-9e5b5277eef4" arn="arn:aws:ec2:us-east-1:719622469867:internet-gateway/igw-0abbc8088b0edfe2d"
> time="2020-10-13T22:21:08Z" level=debug msg="DependencyViolation: The subnet 'subnet-0d6c2d80a4e83af8e' has dependencies and cannot be deleted.\n\tstatus code: 400, request id: 6a0fcd4a-457e-4b74-b990-8697efe6e5d8" arn="arn:aws:ec2:us-east-1:719622469867:subnet/subnet-0d6c2d80a4e83af8e"

The installer is trying to delete certain networking resources and cannot, because AWS reports that other resources still depend on them and must be cleaned up first. The installer can't help here; the best thing is to wait for AWS to remove those resources at its own pace, so you need to let the installer keep retrying for as long as it takes. There are also cases where user-created resources block cleanup, and the user is expected to remove those resources using the AWS console or similar before the installer can move forward successfully.

^^ This is not a bug, just how things work today due to AWS limitations.


Setting this to medium with a 4.7 target for the first section, which relates to the Route 53 resource.

Comment 4 Yunfei Jiang 2020-12-14 09:52:46 UTC
@Wolfgang

Per your original title, it seems the issue happened only occasionally. I ran the following steps [1] three times today, but could not reproduce the issue against v4.4.8.

[1] Steps to reproduce:
1. Trigger a normal IPI install on AWS; it succeeds.
2. Destroy the cluster using the `openshift-install destroy cluster --dir xxx` command; it succeeds.
3. Destroy the cluster again using the same command; it succeeds.


Are the above steps correct? Please correct me if something is not right. Thanks.

Comment 5 W. Trevor King 2020-12-15 04:55:48 UTC
The race that got closed is pretty narrow, since the hosted zone being deleted still had to exist at [1] but be gone by the time we finish deleting record sets.  You could try and delete the private zone when you see the first "deleting public zone..." or "deleting record set..." message logged, and hope you win the race and remove the zone before the installer tries to delete the private zone.  Or probably just say the comment 4 test was sufficient to rule out serious breakage, verify this bug, and we'll open a new one if we see NoSuchHostedZone crop up again.

[1]: https://github.com/openshift/installer/pull/4477/files#diff-e66de401616d33d0c7efd3816f0495936ac3a4592a6a84b7981f3b0d0bc47831R1823

Comment 7 Yunfei Jiang 2020-12-23 09:06:05 UTC
Still cannot reproduce this issue on OCP 4.4.

Following the steps in comment 4, I removed the private hosted zone before executing the destroy command, and there was no `NoSuchHostedZone` error in the logs against OCP 4.7. Per comment 5, marking this bug as VERIFIED.

OCP version: 4.7.0-0.nightly-2020-12-20-031835

Comment 8 Wolfgang Kulhanek 2021-01-05 15:23:56 UTC
@yunjiang@redhat.com I haven't seen this in a while. But yes, the steps outlined are what would have triggered it.

Comment 9 Yunfei Jiang 2021-01-06 01:10:01 UTC
Wolfgang, that's great. As noted in my previous comment, this bug has been marked VERIFIED. Thanks.

Comment 11 errata-xmlrpc 2021-02-24 15:27:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

