Description of problem:
We're seeing failed deprovisions in Hive when it's doing a reinstall attempt. Three people have duplicated it. It started happening after we revendored here: https://github.com/openshift/hive/commit/006bd95a0b0f1a584660d273c834208c81312daa

Version-Release number of the following components:
name = "github.com/openshift/installer"
revision = "344e38f31fb65dfd27184bee420e6ec0043618b7"

How reproducible:
seen 3 times in 7 days

Steps to Reproduce:
1. run a deprovision from hive

Actual results:
Example log:

time="2020-03-25T08:18:08Z" level=info msg="cleaning up from past install attempts" installID=rp47fzkv
time="2020-03-25T08:18:08Z" level=debug msg="object does not exist" installID=rp47fzkv object=uhc-production-1c4c58srvi32mupraikj2ab9mpf46l41/ci-cluster-v4-3-1-2lhgz-admin-kubeconfig
time="2020-03-25T08:18:08Z" level=debug msg="object does not exist" installID=rp47fzkv object=uhc-production-1c4c58srvi32mupraikj2ab9mpf46l41/ci-cluster-v4-3-1-2lhgz-admin-password
time="2020-03-25T08:18:08Z" level=info msg="InfraID set from failed install, running deprovison" installID=rp47fzkv
time="2020-03-25T08:18:08Z" level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"kubernetes.io/cluster/ci-cluster-v4-3-tzd2g\":\"owned\"}" installID=rp47fzkv
time="2020-03-25T08:18:08Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/ci-cluster-v4-3-tzd2g\":\"owned\"}" installID=rp47fzkv
time="2020-03-25T08:18:09Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: cc18e157-9eed-4974-a557-e71daef2f207" arn="arn:aws:route53:::hostedzone/Z08948482KQ2Y8KA4WGV4" installID=rp47fzkv
time="2020-03-25T08:18:09Z" level=info msg=Deleted arn="arn:aws:ec2:us-east-1:000251746788:natgateway/nat-0884a2ce3eb28c4d4" id=nat-0884a2ce3eb28c4d4 installID=rp47fzkv
time="2020-03-25T08:18:09Z" level=debug msg="search for IAM roles" installID=rp47fzkv
time="2020-03-25T08:18:09Z" level=debug msg="search for IAM users" installID=rp47fzkv
time="2020-03-25T08:18:19Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/ci-cluster-v4-3-tzd2g\":\"owned\"}" installID=rp47fzkv
time="2020-03-25T08:18:19Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: 32e1f670-4eee-4343-a103-9f0e17044fb7" arn="arn:aws:route53:::hostedzone/Z08948482KQ2Y8KA4WGV4" installID=rp47fzkv
time="2020-03-25T08:18:19Z" level=debug msg="search for IAM roles" installID=rp47fzkv
time="2020-03-25T08:18:19Z" level=debug msg="search for IAM users" installID=rp47fzkv
time="2020-03-25T08:18:29Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/ci-cluster-v4-3-tzd2g\":\"owned\"}" installID=rp47fzkv
time="2020-03-25T08:18:29Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: 22037362-b722-45fa-86b4-672bd275c27b" arn="arn:aws:route53:::hostedzone/Z08948482KQ2Y8KA4WGV4" installID=rp47fzkv

"NoSuchHostedZone: The specified hosted zone does not exist" loops infinitely

Matthew Staebler: "Could this be a hostedzone that is found during the tag search but has already been deleted? Is this another place where the code should be handling a not-found resource by ignoring it? https://github.com/openshift/installer/blob/master/pkg/destroy/aws/aws.go#L431-L436"

Expected results:
successful deprovision.
Another example:

time="2020-03-25T18:54:17Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: b268c983-3fc7-4713-ab42-b0e08e2d1f68" arn="arn:aws:route53:::hostedzone/Z10077981CCB0R8V0ZW67"
time="2020-03-25T18:54:17Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"082606cb-eb7f-4252-9eea-789829c4cbb9\"}"
time="2020-03-25T18:54:17Z" level=debug msg="search for IAM roles"
time="2020-03-25T18:54:17Z" level=debug msg="search for IAM users"
time="2020-03-25T18:54:26Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/maggie-aws-hive01-9db6j\":\"owned\"}"
time="2020-03-25T18:54:27Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: a61b9e3f-cc10-4b2a-b466-8d12a0f4ac8d" arn="arn:aws:route53:::hostedzone/Z10077981CCB0R8V0ZW67"
time="2020-03-25T18:54:27Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"082606cb-eb7f-4252-9eea-789829c4cbb9\"}"
time="2020-03-25T18:54:27Z" level=debug msg="search for IAM roles"
time="2020-03-25T18:54:27Z" level=debug msg="search for IAM users"
time="2020-03-25T18:54:36Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/maggie-aws-hive01-9db6j\":\"owned\"}"
time="2020-03-25T18:54:37Z" level=debug msg="NoSuchHostedZone: The specified hosted zone does not exist.\n\tstatus code: 404, request id: 2c0164df-3bce-47f4-bb54-08d0e217421b" arn="arn:aws:route53:::hostedzone/Z10077981CCB0R8V0ZW67"
time="2020-03-25T18:54:37Z" level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"082606cb-eb7f-4252-9eea-789829c4cbb9\"}"

$ aws resourcegroupstaggingapi get-resources --tag-filters Key=kubernetes.io/cluster/maggie-aws-hive01-9db6j --region=us-east-1
RESOURCETAGMAPPINGLIST arn:aws:ec2:us-east-1:902449478968:snapshot/snap-03650d08781456a74
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned
RESOURCETAGMAPPINGLIST arn:aws:ec2:us-east-1:902449478968:route-table/rtb-0abd09f2b8e4cc86f
TAGS Name maggie-aws-hive01-9db6j-public
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned
RESOURCETAGMAPPINGLIST arn:aws:ec2:us-east-1:902449478968:natgateway/nat-070ff7c5e9b7380e2
TAGS Name maggie-aws-hive01-9db6j-nat-us-east-1d
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned
RESOURCETAGMAPPINGLIST arn:aws:ec2:us-east-1:902449478968:volume/vol-088d16b32b5dafaae
TAGS Name maggie-aws-hive01-9db6j-master-2-vol
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned
RESOURCETAGMAPPINGLIST arn:aws:ec2:us-east-1:902449478968:volume/vol-0f965732c61b43c4a
TAGS Name maggie-aws-hive01-9db6j-bootstrap-vol
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned
RESOURCETAGMAPPINGLIST arn:aws:ec2:us-east-1:902449478968:volume/vol-0926f7c3c691adc23
TAGS Name maggie-aws-hive01-9db6j-master-1-vol
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned
RESOURCETAGMAPPINGLIST arn:aws:ec2:us-east-1:902449478968:instance/i-0e91cded8df0af9ee
TAGS Name maggie-aws-hive01-9db6j-bootstrap
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned
RESOURCETAGMAPPINGLIST arn:aws:ec2:us-east-1:902449478968:instance/i-0493dcf0d679aa52e
TAGS Name maggie-aws-hive01-9db6j-master-2
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned
RESOURCETAGMAPPINGLIST arn:aws:ec2:us-east-1:902449478968:instance/i-002683c885b7e79cb
TAGS Name maggie-aws-hive01-9db6j-master-0
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned
RESOURCETAGMAPPINGLIST arn:aws:ec2:us-east-1:902449478968:volume/vol-0a62695d21df07e8b
TAGS Name maggie-aws-hive01-9db6j-master-0-vol
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned
RESOURCETAGMAPPINGLIST arn:aws:route53:::hostedzone/Z10077981CCB0R8V0ZW67
TAGS Name maggie-aws-hive01-9db6j-int
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned
RESOURCETAGMAPPINGLIST arn:aws:ec2:us-east-1:902449478968:instance/i-04b4a0ccdf625d6aa
TAGS Name maggie-aws-hive01-9db6j-master-1
TAGS kubernetes.io/cluster/maggie-aws-hive01-9db6j owned

$ aws route53 list-hosted-zones
HOSTEDZONES 327B2AED-3A9A-F755-8A1E-D1ED81EB372B /hostedzone/ZNCXRQW8GCNFO dev09.red-chesterfield.com. 28
CONFIG Public hosted zone for dev09 account subdomain. False
HOSTEDZONES terraform-20200310214858064400000005 /hostedzone/Z04457922WSW2NJCM5D8S magchen-ocp-cluster.dev09.red-chesterfield.com. 9
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200323153442763800000005 /hostedzone/Z01751702KHU8R3KTZ9AN dhaiduce-mycluster01.dev09.red-chesterfield.com. 9
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200323183646202600000005 /hostedzone/Z01884941HOPIVM5VN2NC dhaiduce-mycluster02.dev09.red-chesterfield.com. 9
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200325030641672200000005 /hostedzone/Z08614613AAXYZC326PKF cqu-ocp44-aws.dev09.red-chesterfield.com. 9
CONFIG Managed by Terraform True
HOSTEDZONES terraform-2020031020493091290000000e /hostedzone/Z0482200299NVTA2Z6IQ4 ecai-ocp435.dev09.red-chesterfield.com. 9
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200324212348082800000005 /hostedzone/Z08967342FKQACEKL2VGF steady-lioness.dev09.red-chesterfield.com. 9
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200309175852882600000005 /hostedzone/Z03892471RYCFZKC81WKK stcannon-mycluster.dev09.red-chesterfield.com. 9
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200310220008668400000005 /hostedzone/Z05197701TPZOTT7LZSG9 maggiec-ocp44.dev09.red-chesterfield.com. 8
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200316082206994800000005 /hostedzone/Z05370655Z30889NN0LZ acm01-song.dev09.red-chesterfield.com. 8
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200316101034502800000005 /hostedzone/Z052942330E35O6PZMS0R song-acmhub.dev09.red-chesterfield.com. 8
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200317035743455300000005 /hostedzone/Z061340325HT5B1IHND1E song-acm01.dev09.red-chesterfield.com. 8
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200319095805476300000005 /hostedzone/Z0353991ETTWZ1RTJX8N cquklu-ocp44-aws.dev09.red-chesterfield.com. 8
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200312180050702200000005 /hostedzone/Z01201501DU3UGR2GQ0ZL magchen-acm-ocp.dev09.red-chesterfield.com. 9
CONFIG Managed by Terraform True
HOSTEDZONES terraform-20200324014009102700000005 /hostedzone/Z02178123FNMOVIEQQ6NN changliang-ocp44-aws.dev09.red-chesterfield.com. 9
CONFIG Managed by Terraform True

$ aws route53 get-hosted-zone --id=Z10077981CCB0R8V0ZW67
An error occurred (NoSuchHostedZone) when calling the GetHostedZone operation: No hosted zone found with ID: Z10077981CCB0R8V0ZW67
This does look to me like a scenario where the hosted zone has been deleted but is still showing up in the list of tagged resources.
I have filed a PR with an *attempt* at a fix:
https://github.com/openshift/installer/pull/3359
I have not actually been able to reproduce the problem.
We noticed something this morning: Hive is hitting this problem quite often in our stage cluster, but we are not seeing it in prod. In stage we have this commit, which picks up the deprovision code from the installer: https://github.com/openshift/hive/commit/006bd95a0b0f1a584660d273c834208c81312daa

I suspect that somewhere in that update we picked up an installer change that introduced this issue. That may indicate that it's not just AWS returning mismatched results across API queries, though the raw AWS commands above seem to indicate that it is. In any case, something may have been introduced in the installer between the commits covered by the vendoring referenced above.

I was able to reproduce the problem locally yesterday and tested the fix in the above PR, which has merged, and it was successful.
Hive is seeing this problem where other users are not because there are circumstances where Hive runs the uninstaller multiple times for the same cluster. Most users run the uninstaller once, the cluster is cleaned up, and the user never attempts to run the uninstaller again. Hive, on the other hand, runs the uninstaller once after a failed installation and again before starting a new installation.

When the uninstaller is run the first time, the HostedZone is deleted successfully. The uninstaller records the ARN for the HostedZone as one that has been successfully deleted. The uninstaller then ignores that ARN even though it continues to be returned in the list of tagged resources. That is the scenario that most users experience.

When Hive runs the uninstaller a subsequent time, the new uninstaller instance does not have the record of successfully deleted ARNs and so attempts to delete the HostedZone again. This leads to a NoSuchHostedZone error on every pass until AWS finally fully cleans up the HostedZone. This error is also seen on CI, where I suspect the uninstaller is also run a second time for the cluster as part of final cleanup.
I'm seeing a similar failure with OCP Installer 4.3.1 when running openshift-install destroy cluster:

NoSuchHostedZone: No hosted zone found with ID
status code: 404

The main side effect is that this error loops forever and the installer process never exits.

./openshift-install destroy cluster --log-level debug --dir nmanos-cluster-a
level=debug msg="OpenShift Installer v4.3.1"
level=debug msg="Built from commit 2055609f95b19322ee6cfdd0bea73399297c4a3e"
level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"kubernetes.io/cluster/nmanos-cluster-a-qzmqd\":\"owned\"}"
level=debug msg="search for and delete matching instances by tag matching aws.Filter{\"openshiftClusterID\":\"57aece2a-7654-4548-bcbf-22b0e6dca823\"}"
level=debug msg="search for and delete matching resources by tag in us-east-2 matching aws.Filter{\"kubernetes.io/cluster/nmanos-cluster-a-qzmqd\":\"owned\"}"
level=debug msg="search for and delete matching resources by tag in us-east-2 matching aws.Filter{\"openshiftClusterID\":\"57aece2a-7654-4548-bcbf-22b0e6dca823\"}"
level=debug msg="no deletions from us-east-2, removing client"
level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/nmanos-cluster-a-qzmqd\":\"owned\"}"
level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z01217872DWDAM7GGKXXF\n\tstatus code: 404, request id: 6c102ad7-48be-4939-bed7-3ebb0c120af3" arn="arn:aws:route53:::hostedzone/Z01217872DWDAM7GGKXXF"
level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"57aece2a-7654-4548-bcbf-22b0e6dca823\"}"
level=debug msg="search for IAM roles"
level=debug msg="search for IAM users"
level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/nmanos-cluster-a-qzmqd\":\"owned\"}"
level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z01217872DWDAM7GGKXXF\n\tstatus code: 404, request id: 96887ba2-044d-4f4b-af9a-a785d8c3cf9b" arn="arn:aws:route53:::hostedzone/Z01217872DWDAM7GGKXXF"
level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"openshiftClusterID\":\"57aece2a-7654-4548-bcbf-22b0e6dca823\"}"
level=debug msg="search for IAM roles"
level=debug msg="search for IAM users"
level=debug msg="search for and delete matching resources by tag in us-east-1 matching aws.Filter{\"kubernetes.io/cluster/nmanos-cluster-a-qzmqd\":\"owned\"}"
level=debug msg="NoSuchHostedZone: No hosted zone found with ID: Z01217872DWDAM7GGKXXF\n\tstatus code: 404, request id: b17429f0-353d-47e9-bdf6-d38dee064107" arn="arn:aws:route53:::hostedzone/Z01217872DWDAM7GGKXXF"
...
[Loops forever ...]
The issue has been fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409