Bug 1672374

Summary: destroy cluster with other region in AWS
Product: OpenShift Container Platform
Reporter: jooho lee <jlee>
Component: Installer
Assignee: W. Trevor King <wking>
Installer sub component: openshift-installer
QA Contact: Johnny Liu <jialiu>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: unspecified
CC: crawford, jlee, mifiedle, wking
Version: 4.1.0
Keywords: Reopened
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: installer destroy attempts to delete resources from a cluster with the same name in us-east-1
Consequence: installer deletes resources that it should not delete and fails when attempting to delete others
Fix: only filter resources to delete based on the openshiftClusterID and not on the cluster name
Result: only resources for the cluster being destroyed are deleted, and the installer does not block on deleting resources from other regions
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:42:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1664187    

Description jooho lee 2019-02-04 19:10:43 UTC
Description of problem:

I created a cluster in us-east-2 and tried to destroy it. From the log, I found that the installer tries to find objects in us-east-1 to delete.

The installer never finishes.

~~~
DEBUG deleting arn:aws:ec2:us-east-1:694280550618:natgateway/nat-0fc9cf4d52070a706: NatGatewayNotFound: The Nat Gateway nat-0fc9cf4d52070a706 was not found
	status code: 400, request id: 8224afdb-f9b4-4def-bb92-5cd767548427 
~~~

metadata.json
~~~
{"clusterName":"ocp4","clusterID":"02917b1f-2752-45d2-b511-f3f25762eacb","aws":{"region":"us-east-2","identifier":[{"openshiftClusterID":"02917b1f-2752-45d2-b511-f3f25762eacb"},{"kubernetes.io/cluster/ocp4":"owned"}]}}

~~~
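
For reference, here is a minimal Go sketch (not the installer's own code or types, just an illustration) of parsing the metadata.json above; the field names come from the file shown, everything else is an assumption:

~~~
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// clusterMetadata mirrors only the fields visible in the metadata.json above.
type clusterMetadata struct {
	ClusterName string `json:"clusterName"`
	ClusterID   string `json:"clusterID"`
	AWS         struct {
		Region     string              `json:"region"`
		Identifier []map[string]string `json:"identifier"`
	} `json:"aws"`
}

func main() {
	raw, err := os.ReadFile("metadata.json")
	if err != nil {
		panic(err)
	}
	var md clusterMetadata
	if err := json.Unmarshal(raw, &md); err != nil {
		panic(err)
	}
	// The destroy command can take the region from here, so AWS_REGION is not required.
	fmt.Println(md.AWS.Region)     // us-east-2
	fmt.Println(md.AWS.Identifier) // tag sets used to locate the cluster's resources
}
~~~
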
Version-Release number of the following components:

How reproducible:

Steps to Reproduce:
1. Create a cluster in us-east-2.
2. Destroy the cluster.

Actual results:
The installer tries to find and delete objects in us-east-1.

Expected results:
The installer does not look for any objects in us-east-1.

Additional info:

Comment 4 Alex Crawford 2019-02-13 22:25:59 UTC
Have you specified the AWS_REGION when destroying the cluster? I believe that is necessary.

Comment 5 Alex Crawford 2019-02-13 22:27:25 UTC
Matthew, can you take a look at this? Ideally, we'd remember in which region we installed the cluster so that destroy doesn't require AWS_REGION.

Comment 6 Matthew Staebler 2019-02-13 22:53:20 UTC
Fixed by https://github.com/openshift/installer/pull/1170.

Comment 7 Matthew Staebler 2019-02-14 02:32:55 UTC
The underlying issue is that the destroyer used to search for all resources tagged with the cluster name. The destroyer always has to search us-east-1, since that is the only way to find resources that are global rather than tied to a region. So if there were clusters with the same name in both us-east-2 and us-east-1, the destroyer would attempt to delete some resources that belonged to the cluster in us-east-1. The destroyer has since been changed to search only for resources tagged with the appropriate openshiftClusterID.
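
For illustration, a rough Go sketch of that approach, assuming aws-sdk-go v1 and the Resource Groups Tagging API; this is not necessarily how the installer implements it internally:

~~~
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
)

// findClusterResources searches the cluster's own region plus us-east-1 (for
// global resources) and keeps only ARNs tagged with the cluster's
// openshiftClusterID, ignoring the cluster-name tag that other clusters may share.
func findClusterResources(clusterRegion, clusterID string) ([]string, error) {
	var arns []string
	for _, region := range []string{clusterRegion, "us-east-1"} {
		sess := session.Must(session.NewSession(aws.NewConfig().WithRegion(region)))
		client := resourcegroupstaggingapi.New(sess)
		err := client.GetResourcesPages(
			&resourcegroupstaggingapi.GetResourcesInput{
				TagFilters: []*resourcegroupstaggingapi.TagFilter{{
					Key:    aws.String("openshiftClusterID"),
					Values: []*string{aws.String(clusterID)},
				}},
			},
			func(page *resourcegroupstaggingapi.GetResourcesOutput, lastPage bool) bool {
				for _, m := range page.ResourceTagMappingList {
					arns = append(arns, aws.StringValue(m.ResourceARN))
				}
				return !lastPage
			},
		)
		if err != nil {
			return nil, err
		}
	}
	return arns, nil
}

func main() {
	arns, err := findClusterResources("us-east-2", "02917b1f-2752-45d2-b511-f3f25762eacb")
	fmt.Println(arns, err)
}
~~~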

Comment 8 Matthew Staebler 2019-02-14 02:34:17 UTC
> Have you specified the AWS_REGION when destroying the cluster? I believe that is necessary.

It is not necessary to specify the region. The region is determined from the metadata.json file.

Comment 9 Johnny Liu 2019-02-14 03:05:32 UTC
QE have also hit similar issue in https://bugzilla.redhat.com/show_bug.cgi?id=1674440#c0

Comment 10 W. Trevor King 2019-02-14 07:26:34 UTC
> Fixed by https://github.com/openshift/installer/pull/1170.

With that code released with installer 0.12.0, can we close this?

Comment 11 jooho lee 2019-02-14 14:29:24 UTC
This issue happens with 0.12.0.

As Matthew said, the metadata.json file has the AWS region information. My question is why the installer also checks another region when it destroys the cluster.

From my understanding, the installer does not need to check other regions, because the region information is in the metadata.json file.

Comment 12 W. Trevor King 2019-02-14 14:44:20 UTC
> From my understanding, the installer does not need to check other regions, because the region information is in the metadata.json file.

As mentioned in comment 7, the destroyer needs to check us-east-1 too for cross-region resources like Route 53 zones.  I don't know why your account has NAT gateways in another region matching your cluster name or ID, though.  Still, we should be able to add a NatGatewayNotFound handler to deleteEC2NATGateway (like [1], but for NAT gateways).

[1]: https://github.com/openshift/installer/pull/1250
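
For illustration, a rough Go sketch of such a handler, assuming aws-sdk-go v1; the function below is hypothetical and not the installer's actual deleteEC2NATGateway:

~~~
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// deleteNATGateway deletes a NAT gateway and treats "already gone" as success,
// so the destroy loop does not block forever on a 400 NatGatewayNotFound error.
func deleteNATGateway(client *ec2.EC2, id string) error {
	_, err := client.DeleteNatGateway(&ec2.DeleteNatGatewayInput{
		NatGatewayId: aws.String(id),
	})
	if aerr, ok := err.(awserr.Error); ok && aerr.Code() == "NatGatewayNotFound" {
		fmt.Printf("NAT gateway %s is already gone, nothing to delete\n", id)
		return nil
	}
	return err
}

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	if err := deleteNATGateway(ec2.New(sess), "nat-0fc9cf4d52070a706"); err != nil {
		fmt.Println("delete failed:", err)
	}
}
~~~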

Comment 13 Matthew Staebler 2019-02-16 00:02:45 UTC
This bug is not fixed. The installer still attempts to delete resources from us-east-1 that have the tag "kubernetes.io/cluster/<cluster-name>: owned".

Comment 14 Matthew Staebler 2019-02-18 19:20:10 UTC
Tracked by https://jira.coreos.com/projects/CORS/issues/CORS-922

Comment 15 W. Trevor King 2019-02-27 05:46:15 UTC
[1], which just went out with v0.13.0 [2], uses uniquified cluster names when creating resources and tags.  The deleter will still look in us-east-1 as well as the cluster's region, for the reasons given in comment 7, but it should no longer accidentally match resources belonging to other clusters that use the same cluster name.

[1]: https://github.com/openshift/installer/pull/1280
[2]: https://github.com/openshift/installer/releases/tag/v0.13.0
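
For illustration, a minimal Go sketch of the general idea behind IDs like qe-jialiu-wcpnw seen in comment 17 below (cluster name plus a short random suffix); this is an assumption about the approach, not the installer's exact algorithm:

~~~
package main

import (
	"fmt"
	"math/rand"
)

const alphanum = "abcdefghijklmnopqrstuvwxyz0123456789"

// infraID appends a 5-character random suffix so two clusters that share a
// name still tag their AWS resources (kubernetes.io/cluster/<infraID>) uniquely.
func infraID(clusterName string) string {
	suffix := make([]byte, 5)
	for i := range suffix {
		suffix[i] = alphanum[rand.Intn(len(alphanum))]
	}
	return fmt.Sprintf("%s-%s", clusterName, suffix)
}

func main() {
	fmt.Println(infraID("qe-jialiu")) // e.g. qe-jialiu-x7k2p
}
~~~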

Comment 17 Johnny Liu 2019-03-07 10:40:26 UTC
Verified this bug with the v4.0.16-1-dirty installer extracted from 4.0.0-0.nightly-2019-03-06-074438, and it passed.

Installed two clusters (cluster-1 and cluster-2) with the same cluster name.
The installer names all resources using a unique string that includes the infraID.

[root@preserve-jialiu-ansible 20190307]# cat demo1/metadata.json 
{"clusterName":"qe-jialiu","clusterID":"073cf6c0-126e-45c5-afc7-96ded57458c4","infraID":"qe-jialiu-wcpnw","aws":{"region":"us-east-2","identifier":[{"kubernetes.io/cluster/qe-jialiu-wcpnw":"owned"},{"openshiftClusterID":"073cf6c0-126e-45c5-afc7-96ded57458c4"}]}}

[root@preserve-jialiu-ansible 20190307]# cat  demo2/metadata.json 
{"clusterName":"qe-jialiu","clusterID":"769e5dbc-6b67-486e-ab63-d49e6d14aec6","infraID":"qe-jialiu-hzfxg","aws":{"region":"us-east-2","identifier":[{"kubernetes.io/cluster/qe-jialiu-hzfxg":"owned"},{"openshiftClusterID":"769e5dbc-6b67-486e-ab63-d49e6d14aec6"}]}} 

Destroyed cluster-2, then ran oc commands against cluster-1; cluster-1 is still working well.

Comment 20 errata-xmlrpc 2019-06-04 10:42:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758