Bug 1817201 - deprovision failure loop: "NoSuchHostedZone: The specified hosted zone does not exist"
Summary: deprovision failure loop: "NoSuchHostedZone: The specified hosted zone does n...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Abhinav Dahiya
QA Contact: wang lin
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-03-25 19:11 UTC by Greg Sheremeta
Modified: 2020-07-13 17:24 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The AWS API used to fetch a cluster's resources is extremely slow to reflect deletions, so the destroy would attempt to delete hosted zones that had already been deleted, which caused failures. Consequence: The destroy would loop until the AWS API removed the hosted zone from its response. Fix: Skip the not-found error for hosted zones. Result: The destroy exits faster and doesn't loop unnecessarily.
Clone Of:
Environment:
Last Closed: 2020-07-13 17:23:43 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 3359 0 None closed Bug 1817201: Fix intermittent deprovision loop on NoSuchHostedZone error 2020-12-02 10:24:58 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:23:59 UTC

Description Greg Sheremeta 2020-03-25 19:11:06 UTC
Description of problem:

We're seeing failed deprovisions in Hive when it's doing a reinstall attempt. Three people have duplicated it. It started happening after we revendored here: https://github.com/openshift/hive/commit/006bd95a0b0f1a584660d273c834208c81312daa


Version-Release number of the following components:
name = "github.com/openshift/installer"
revision = "344e38f31fb65dfd27184bee420e6ec0043618b7"

How reproducible:
seen 3 times in 7 days

Steps to Reproduce:
1. run a deprovision from hive

Actual results:

Example log:


time="2020-03-25T08:18:08Z" level=info msg="cleaning up from past install attempts" installID=rp47fzkv
time="2020-03-25T08:18:08Z" level=debug msg="object does not exist" installID=rp47fzkv object=uhc-production-1c4c58srvi32mupraikj2ab9mpf46l41/ci-cluster-v4-3-1-2lhgz-admin-kubeconfig
time="2020-03-25T08:18:08Z" level=debug msg="object does not exist" installID=rp47fzkv object=uhc-production-1c4c58srvi32mupraikj2ab9mpf46l41/ci-cluster-v4-3-1-2lhgz-admin-password
time="2020-03-25T08:18:08Z" level=info msg="InfraID set from failed install, running deprovison" installID=rp47fzkv
time="2020-03-25T08:18:08Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"kubernetes.io/cluster/ci-cluster-v4-3-tzd2g\":\"owned\"}" installID=rp47fzkv
time="2020-03-25T08:18:08Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/ci-cluster-v4-3-tzd2g\":\"owned\"}" installID=rp47fzkv
time="2020-03-25T08:18:09Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: cc18e157-9eed-4974-a557-e71daef2f207" arn="arn:aws:route53:::hostedzone/Z08948482KQ2Y8KA4WGV4" installID=rp47fzkv
time="2020-03-25T08:18:09Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:000251746788:natgateway/nat-0884a2ce3eb28c4d4" id=nat-0884a2ce3eb28c4d4 installID=rp47fzkv
time="2020-03-25T08:18:09Z" level=debug msg="search for IAM roles" installID=rp47fzkv
time="2020-03-25T08:18:09Z" level=debug msg="search for IAM users" installID=rp47fzkv
time="2020-03-25T08:18:19Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/ci-cluster-v4-3-tzd2g\":\"owned\"}" installID=rp47fzkv
time="2020-03-25T08:18:19Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: 32e1f670-4eee-4343-a103-9f0e17044fb7" arn="arn:aws:route53:::hostedzone/Z08948482KQ2Y8KA4WGV4" installID=rp47fzkv
time="2020-03-25T08:18:19Z" level=debug msg="search for IAM roles" installID=rp47fzkv
time="2020-03-25T08:18:19Z" level=debug msg="search for IAM users" installID=rp47fzkv
time="2020-03-25T08:18:29Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/ci-cluster-v4-3-tzd2g\":\"owned\"}" installID=rp47fzkv
time="2020-03-25T08:18:29Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: 22037362-b722-45fa-86b4-672bd275c27b" arn="arn:aws:route53:::hostedzone/Z08948482KQ2Y8KA4WGV4" installID=rp47fzkv

"NoSuchHostedZone: The specified hosted zone does not exist" loops infinitely

Matthew Staebler: "Could this be a hostedzone that is found during the tag search but has already been deleted? Is this another place where the code should be handling a not-found resource by ignoring it?
https://github.com/openshift/installer/blob/master/pkg/destroy/aws/aws.go#L431-L436 "

Expected results:
successful deprovision.

Comment 1 Matthew Staebler 2020-03-25 20:10:36 UTC
Another example:

time="2020-03-25T18:54:17Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: b268c983-3fc7-4713-ab42-b0e08e2d1f68" arn="arn:aws:route53:::hostedzone/Z10077981CCB0R8V0ZW67"
time="2020-03-25T18:54:17Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"082606cb-eb7f-4252-9eea-789829c4cbb9\"}"
time="2020-03-25T18:54:17Z" level=debug msg="search for IAM roles"
time="2020-03-25T18:54:17Z" level=debug msg="search for IAM users"
time="2020-03-25T18:54:26Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/maggie-aws-hive01-9db6j\":\"owned\"}"
time="2020-03-25T18:54:27Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: a61b9e3f-cc10-4b2a-b466-8d12a0f4ac8d" arn="arn:aws:route53:::hostedzone/Z10077981CCB0R8V0ZW67"
time="2020-03-25T18:54:27Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"082606cb-eb7f-4252-9eea-789829c4cbb9\"}"
time="2020-03-25T18:54:27Z" level=debug msg="search for IAM roles"
time="2020-03-25T18:54:27Z" level=debug msg="search for IAM users"
time="2020-03-25T18:54:36Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/maggie-aws-hive01-9db6j\":\"owned\"}"
time="2020-03-25T18:54:37Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: 2c0164df-3bce-47f4-bb54-08d0e217421b" arn="arn:aws:route53:::hostedzone/Z10077981CCB0R8V0ZW67"
time="2020-03-25T18:54:37Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"082606cb-eb7f-4252-9eea-789829c4cbb9\"}"




$ aws resourcegroupstaggingapi get-resources --tag-filters Key=kubernetes.io/cluster/maggie-aws-hive01-9db6j --region=us-east-1
RESOURCETAGMAPPINGLIST  arn:aws:ec2:us-east-1:902449478968:snapshot/snap-03650d08781456a74
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned
RESOURCETAGMAPPINGLIST  arn:aws:ec2:us-east-1:902449478968:route-table/rtb-0abd09f2b8e4cc86f
TAGS    Name    maggie-aws-hive01-9db6j-public
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned
RESOURCETAGMAPPINGLIST  arn:aws:ec2:us-east-1:902449478968:natgateway/nat-070ff7c5e9b7380e2
TAGS    Name    maggie-aws-hive01-9db6j-nat-us-east-1d
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned
RESOURCETAGMAPPINGLIST  arn:aws:ec2:us-east-1:902449478968:volume/vol-088d16b32b5dafaae
TAGS    Name    maggie-aws-hive01-9db6j-master-2-vol
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned
RESOURCETAGMAPPINGLIST  arn:aws:ec2:us-east-1:902449478968:volume/vol-0f965732c61b43c4a
TAGS    Name    maggie-aws-hive01-9db6j-bootstrap-vol
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned
RESOURCETAGMAPPINGLIST  arn:aws:ec2:us-east-1:902449478968:volume/vol-0926f7c3c691adc23
TAGS    Name    maggie-aws-hive01-9db6j-master-1-vol
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned
RESOURCETAGMAPPINGLIST  arn:aws:ec2:us-east-1:902449478968:instance/i-0e91cded8df0af9ee
TAGS    Name    maggie-aws-hive01-9db6j-bootstrap
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned
RESOURCETAGMAPPINGLIST  arn:aws:ec2:us-east-1:902449478968:instance/i-0493dcf0d679aa52e
TAGS    Name    maggie-aws-hive01-9db6j-master-2
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned
RESOURCETAGMAPPINGLIST  arn:aws:ec2:us-east-1:902449478968:instance/i-002683c885b7e79cb
TAGS    Name    maggie-aws-hive01-9db6j-master-0
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned
RESOURCETAGMAPPINGLIST  arn:aws:ec2:us-east-1:902449478968:volume/vol-0a62695d21df07e8b
TAGS    Name    maggie-aws-hive01-9db6j-master-0-vol
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned
RESOURCETAGMAPPINGLIST  arn:aws:route53:::hostedzone/Z10077981CCB0R8V0ZW67
TAGS    Name    maggie-aws-hive01-9db6j-int
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned
RESOURCETAGMAPPINGLIST  arn:aws:ec2:us-east-1:902449478968:instance/i-04b4a0ccdf625d6aa
TAGS    Name    maggie-aws-hive01-9db6j-master-1
TAGS    kubernetes.io/cluster/maggie-aws-hive01-9db6j   owned



$ aws route53 list-hosted-zones
HOSTEDZONES     327B2AED-3A9A-F755-8A1E-D1ED81EB372B    /hostedzone/ZNCXRQW8GCNFO       dev09.red-chesterfield.com.     28
CONFIG  Public hosted zone for dev09 account subdomain. False
HOSTEDZONES     terraform-20200310214858064400000005    /hostedzone/Z04457922WSW2NJCM5D8S       magchen-ocp-cluster.dev09.red-chesterfield.com. 9
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200323153442763800000005    /hostedzone/Z01751702KHU8R3KTZ9AN       dhaiduce-mycluster01.dev09.red-chesterfield.com.        9
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200323183646202600000005    /hostedzone/Z01884941HOPIVM5VN2NC       dhaiduce-mycluster02.dev09.red-chesterfield.com.        9
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200325030641672200000005    /hostedzone/Z08614613AAXYZC326PKF       cqu-ocp44-aws.dev09.red-chesterfield.com.       9
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-2020031020493091290000000e    /hostedzone/Z0482200299NVTA2Z6IQ4       ecai-ocp435.dev09.red-chesterfield.com. 9
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200324212348082800000005    /hostedzone/Z08967342FKQACEKL2VGF       steady-lioness.dev09.red-chesterfield.com.      9
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200309175852882600000005    /hostedzone/Z03892471RYCFZKC81WKK       stcannon-mycluster.dev09.red-chesterfield.com.  9
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200310220008668400000005    /hostedzone/Z05197701TPZOTT7LZSG9       maggiec-ocp44.dev09.red-chesterfield.com.       8
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200316082206994800000005    /hostedzone/Z05370655Z30889NN0LZ        acm01-song.dev09.red-chesterfield.com.  8
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200316101034502800000005    /hostedzone/Z052942330E35O6PZMS0R       song-acmhub.dev09.red-chesterfield.com. 8
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200317035743455300000005    /hostedzone/Z061340325HT5B1IHND1E       song-acm01.dev09.red-chesterfield.com.  8
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200319095805476300000005    /hostedzone/Z0353991ETTWZ1RTJX8N        cquklu-ocp44-aws.dev09.red-chesterfield.com.    8
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200312180050702200000005    /hostedzone/Z01201501DU3UGR2GQ0ZL       magchen-acm-ocp.dev09.red-chesterfield.com.     9
CONFIG  Managed by Terraform    True
HOSTEDZONES     terraform-20200324014009102700000005    /hostedzone/Z02178123FNMOVIEQQ6NN       changliang-ocp44-aws.dev09.red-chesterfield.com.        9
CONFIG  Managed by Terraform    True



$ aws route53 get-hosted-zone --id=Z10077981CCB0R8V0ZW67
An error occurred (NoSuchHostedZone) when calling the GetHostedZone operation: No hosted zone found with ID: Z10077981CCB0R8V0ZW67

Comment 2 Matthew Staebler 2020-03-25 20:11:38 UTC
This does look to me like a scenario where the hosted zone has been deleted but is still showing up in the list of tagged resources.
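
The mismatch between the two CLI outputs above can be expressed as a small check. The following Go sketch is only an illustration under stated assumptions: `zoneIDFromARN`, `staleZones`, and the `zoneExists` callback are hypothetical names, and the callback stands in for a real lookup such as the `get-hosted-zone` call shown above. It is not code from the installer.

```go
package main

import (
	"fmt"
	"strings"
)

// hostedZonePrefix is the ARN prefix seen in the tag listing above.
const hostedZonePrefix = "arn:aws:route53:::hostedzone/"

// zoneIDFromARN extracts the hosted zone ID from a Route 53 hosted zone ARN.
// The second result is false when the ARN is not a hosted zone ARN.
func zoneIDFromARN(arn string) (string, bool) {
	if !strings.HasPrefix(arn, hostedZonePrefix) {
		return "", false
	}
	id := strings.TrimPrefix(arn, hostedZonePrefix)
	return id, id != ""
}

// staleZones returns the hosted zone ARNs from a tag listing whose zone no
// longer exists. zoneExists is a hypothetical placeholder for a real lookup.
func staleZones(taggedARNs []string, zoneExists func(id string) bool) []string {
	var stale []string
	for _, arn := range taggedARNs {
		id, ok := zoneIDFromARN(arn)
		if ok && !zoneExists(id) {
			stale = append(stale, arn)
		}
	}
	return stale
}

func main() {
	tagged := []string{
		"arn:aws:ec2:us-east-1:902449478968:natgateway/nat-070ff7c5e9b7380e2",
		"arn:aws:route53:::hostedzone/Z10077981CCB0R8V0ZW67",
	}
	// Simulated lookup: the zone is gone, as get-hosted-zone reported above.
	none := func(string) bool { return false }
	fmt.Println(staleZones(tagged, none))
}
```

Non-hosted-zone ARNs are skipped rather than reported, since this sketch only addresses the Route 53 entry that remains in the tag listing after deletion.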

Comment 3 Devan Goodwin 2020-03-27 12:50:26 UTC
I have filed a PR with an *attempt* to fix: https://github.com/openshift/installer/pull/3359
I cannot actually reproduce the problem.

Comment 6 Devan Goodwin 2020-04-01 12:54:37 UTC
We noticed something this morning: Hive is seeing this problem quite often in our stage cluster, but we are not seeing it in prod. In stage we have this commit to pick up deprovision code from the installer:

https://github.com/openshift/hive/commit/006bd95a0b0f1a584660d273c834208c81312daa

I suspect that somewhere in that update we picked up an installer change that introduced this issue. This may indicate that it's not just AWS returning mismatched results across API queries, though the raw AWS commands above seem to indicate that it is. In any case, something may have been introduced in the installer between the commits in the vendoring referenced above.

I was able to reproduce the problem locally yesterday and tested the fix in the above PR, which has merged, and it was successful.

Comment 7 Matthew Staebler 2020-04-01 13:18:48 UTC
Hive is seeing this problem where other users are not because there are circumstances where Hive runs the uninstaller multiple times for the same cluster. Most users run the uninstaller once, the cluster is cleaned up, and the user never attempts to run the uninstaller again. Hive, on the other hand, runs the uninstaller once after a failed installation and again before starting a new installation.

When the uninstaller is run the first time, the HostedZone is deleted successfully. The uninstaller records the ARN for the HostedZone as one that has been successfully deleted. The uninstaller then ignores that ARN even though it continues to be returned in the list of tagged resources. That is the scenario that most users experience. When Hive runs the uninstaller a subsequent time, the new uninstaller instance does not have the record of successfully deleted ARNs and so attempts to delete the HostedZone again. This leads to a NoSuchHostedZone error until AWS finally fully cleans up the HostedZone.

This error is also seen on CI, where I suspect that they also run the uninstall a second time for the cluster as part of final cleanup.
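
The behavior described in this comment, together with the idea of ignoring the not-found error, can be sketched in Go. This is a minimal illustration and not the code from the installer or from the PR: the `codedError` interface, the `apiError` type, the `destroyer` struct, and the `deleteZone` callback are all hypothetical stand-ins chosen so the example needs no external packages. Only the error code string `NoSuchHostedZone` is taken from the logs.

```go
package main

import (
	"errors"
	"fmt"
)

// codedError mirrors the shape of an SDK error that exposes an error code.
// It is a hypothetical stand-in, not a type from any real library.
type codedError interface {
	error
	Code() string
}

// apiError is a hypothetical error type used only for this illustration.
type apiError struct {
	code, msg string
}

func (e *apiError) Error() string { return e.code + ": " + e.msg }
func (e *apiError) Code() string  { return e.code }

// isNoSuchHostedZone reports whether err carries the NoSuchHostedZone code
// seen in the logs.
func isNoSuchHostedZone(err error) bool {
	var ce codedError
	return errors.As(err, &ce) && ce.Code() == "NoSuchHostedZone"
}

// destroyer remembers the ARNs it has already removed. A second uninstaller
// run starts with an empty record, which is the situation Hive hits.
type destroyer struct {
	deleted map[string]bool
}

// remove deletes one hosted zone ARN and returns true when the ARN can be
// considered gone. A NoSuchHostedZone error counts as "already deleted".
func (d *destroyer) remove(arn string, deleteZone func(arn string) error) bool {
	if d.deleted[arn] {
		return true // stale entry in the tag listing; nothing to do
	}
	if err := deleteZone(arn); err != nil && !isNoSuchHostedZone(err) {
		return false // a real failure; the caller retries on the next pass
	}
	d.deleted[arn] = true
	return true
}

func main() {
	arn := "arn:aws:route53:::hostedzone/Z10077981CCB0R8V0ZW67"
	// Simulates a zone that a previous uninstaller run already deleted.
	alreadyGone := func(string) error {
		return &apiError{code: "NoSuchHostedZone", msg: "The specified hosted zone does not exist."}
	}
	d := &destroyer{deleted: map[string]bool{}}
	fmt.Println(d.remove(arn, alreadyGone)) // prints "true"
}
```

With this approach a fresh uninstaller run records the zone as deleted on the first not-found response instead of retrying until the tag listing catches up. Any other error still returns false, so real failures continue to be retried.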

Comment 8 Noam Manos 2020-04-02 15:28:39 UTC
I'm seeing a similar failure on OCP Installer 4.3.1 while running openshift-install destroy cluster:
NoSuchHostedZone: No hosted zone found with ID status code: 404

The main side effect is that this error loops forever, and the installer process does not exit.

./openshift-install destroy cluster --log-level debug --dir nmanos-cluster-a

level=debug msg="OpenShift Installer v4.3.1"
level=debug msg="Built from commit 2055609f95b19322ee6cfdd0bea73399297c4a3e"
level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"kubernetes.io/cluster/nmanos-cluster-a-qzmqd\":\"owned\"}"
level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"openshiftClusterID\":\"57aece2a-7654-4548-bcbf-22b0e6dca823\"}"
level=debug msg="search for and delete matching resources by tag in us-east-2 matching aws.Filter{\"kubernetes.io/cluster/nmanos-cluster-a-qzmqd\":\"owned\"}"
level=debug msg="search for and delete matching resources by tag in us-east-2 matching aws.Filter{\"openshiftClusterID\":\"57aece2a-7654-4548-bcbf-22b0e6dca823\"}"
level=debug msg="no deletions from us-east-2, removing client"
level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/nmanos-cluster-a-qzmqd\":\"owned\"}"
level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z01217872DWDAM7GGKXXF\n\tstatus code: 404, request id: 6c102ad7-48be-4939-bed7-3ebb0c120af3" arn="arn:aws:route53:::hostedzone/Z01217872DWDAM7GGKXXF"
level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"57aece2a-7654-4548-bcbf-22b0e6dca823\"}"
level=debug msg="search for IAM roles"
level=debug msg="search for IAM users"
level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/nmanos-cluster-a-qzmqd\":\"owned\"}"
level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z01217872DWDAM7GGKXXF\n\tstatus code: 404, request id: 96887ba2-044d-4f4b-af9a-a785d8c3cf9b" arn="arn:aws:route53:::hostedzone/Z01217872DWDAM7GGKXXF"
level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"57aece2a-7654-4548-bcbf-22b0e6dca823\"}"
level=debug msg="search for IAM roles"
level=debug msg="search for IAM users"
level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/nmanos-cluster-a-qzmqd\":\"owned\"}"
level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z01217872DWDAM7GGKXXF\n\tstatus code: 404, request id: b17429f0-353d-47e9-bdf6-d38dee064107" arn="arn:aws:route53:::hostedzone/Z01217872DWDAM7GGKXXF"
...
...
[Loops forever ...]

Comment 9 wang lin 2020-04-14 08:58:39 UTC
The issue has been fixed.

Comment 11 errata-xmlrpc 2020-07-13 17:23:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

