Bug 2070744

Summary: openshift-install destroy in us-gov-west-1 results in infinite loop - AWS govcloud
Product: OpenShift Container Platform Reporter: Mike Murphy <micmurph>
Component: Installer Assignee: Aditya Narayanaswamy <anarayan>
Installer sub component: openshift-installer QA Contact: Yunfei Jiang <yunjiang>
Status: CLOSED ERRATA Docs Contact: Mike Pytlak <mpytlak>
Severity: low    
Priority: low CC: anarayan, apjagtap, aygarg, dfitzmau, mpytlak, padillon, rerussel, vlaad, yunjiang
Version: 4.9   
Target Milestone: ---   
Target Release: 4.13.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
* Previously, uninstalling an AWS cluster that was deployed to the `us-gov-west-1` region failed because AWS resources could not be untagged. This resulted in the process going into an infinite loop, where the installation program tried to untag the resources. This update prevents the retry. As a result, uninstalling the cluster succeeds. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2070744[*BZ#2070744*])
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-05-17 22:46:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Mike Murphy 2022-03-31 19:58:45 UTC
Description of problem:

When destroying the cluster with the openshift-install binary, the installer enters an infinite loop while untagging resources for Route53.

INFO untag shared resources: InvalidParameterException: Invocation of UntagResources for this resource is not supported in this region
DEBUG Search for and remove tags in us-gov-west-1 matching kubernetes.io/cluster/test-cluster-bsmt4: shared

Version-Release number of selected component (if applicable):
openshift-install version
./openshift-install 4.9.25

How reproducible:


Steps to Reproduce:
1. Deploy a cluster in us-gov-west-1
2. The install-config specifies a hostedZone pointing to a Route53 record that already exists



Actual results:

Stuck in a loop and will not go past trying to untag:

INFO untag shared resources: InvalidParameterException: Invocation of UntagResources for this resource is not supported in this region
DEBUG Search for and remove tags in us-gov-west-1 matching kubernetes.io/cluster/test-cluster-bsmt4: shared


Expected results:

Untag the route53 hosted zone and continue destroying the cluster.

Additional info:

The installer is able to tag the resource, but it cannot destroy the cluster because it hangs while untagging the Route53 hosted zone. We have to untag the Route53 hosted zone manually (or use the AWS CLI) before it can move on with the tear-down of the cluster.

When the hostedZone is specified, the installer always gets stuck in a loop trying to untag the Route53 record. If we don't specify the hostedZone (i.e. the installer creates the hosted zone), the destroy succeeds. However, this does not work for the customer's case, because the hosted zone needs to be created in advance and tied to their internal DNS.

Code snippet from installer:

[1]https://github.com/openshift/installer/blob/beefeacda123ed41ad8f486aa5f7435e2133e8ee/pkg/destroy/aws/aws.go#L731
[2]https://github.com/openshift/installer/blob/beefeacda123ed41ad8f486aa5f7435e2133e8ee/pkg/destroy/aws/aws.go#L184

Comment 1 Mike Murphy 2022-04-01 15:12:01 UTC
As for the openshift-installer, the specific infinite loop is in this block here: https://github.com/openshift/installer/blob/beefeacda123ed41ad8f486aa5f7435e2133e8ee/pkg/destroy/aws/shared.go#L113
On line 113, it gets the InvalidParameterException seen above, logs it as an INFO message, and continues the loop. As a result, the resource never gets untagged, and execution never leaves the loop on line 59.

Comment 2 Apoorva Jagtap 2022-04-18 01:00:21 UTC
Hello,

On ticket 03187907, the team has received a response from AWS support regarding this issue. According to their analysis, untagging resources in the ‘us-gov-west-1’ AWS region via the SDK fails with the error [0], whereas untagging via the AWS CLI succeeds. This behavior is due to a known bug on AWS's side.

However, the team would still like the infinite loop to be resolved in the `openshift-installer` binary. It might be more efficient to report an error instead of the loop going forever.

[0] 
~~~
InvalidParameterException: Invocation of UntagResources for this resource is not supported in this region
~~~

Comment 5 Yunfei Jiang 2022-10-08 03:12:23 UTC
Destroy process still went into an infinite loop:

level=debug msg=listing AWS hosted zones "yunjiang-bz1.qe.devcluster.openshift.com." (page 0) arn=arn:aws-us-gov:route53:::hostedzone/Z10189021N8AASF3CAGVR id=Z10189021N8AASF3CAGVR
level=debug msg=listing AWS hosted zones "qe.devcluster.openshift.com." (page 0) arn=arn:aws-us-gov:route53:::hostedzone/Z10189021N8AASF3CAGVR id=Z10189021N8AASF3CAGVR
level=debug msg=listing AWS hosted zones "devcluster.openshift.com." (page 0) arn=arn:aws-us-gov:route53:::hostedzone/Z10189021N8AASF3CAGVR id=Z10189021N8AASF3CAGVR
level=debug msg=listing AWS hosted zones "openshift.com." (page 0) arn=arn:aws-us-gov:route53:::hostedzone/Z10189021N8AASF3CAGVR id=Z10189021N8AASF3CAGVR
level=debug msg=listing AWS hosted zones "com." (page 0) arn=arn:aws-us-gov:route53:::hostedzone/Z10189021N8AASF3CAGVR id=Z10189021N8AASF3CAGVR
level=info msg=Cleaned record sets from hosted zone arn=arn:aws-us-gov:route53:::hostedzone/Z10189021N8AASF3CAGVR id=Z10189021N8AASF3CAGVR
level=debug msg=Nothing to clean for shared ec2 resource arn=arn:aws-us-gov:ec2:us-gov-west-1:225746144451:subnet/subnet-0798181b46953e88f
level=debug msg=Nothing to clean for shared ec2 resource arn=arn:aws-us-gov:ec2:us-gov-west-1:225746144451:subnet/subnet-0bf50e17442d665f1
level=info msg=untag shared resources: InvalidParameterException: Invocation of UntagResources for this resource is not supported in this region
level=debug msg=Search for and remove tags in us-gov-west-1 matching kubernetes.io/cluster/yunjiang-bz1-fb5nr: shared
level=debug msg=Nothing to clean for shared ec2 resource arn=arn:aws-us-gov:ec2:us-gov-west-1:225746144451:vpc/vpc-00295a4ddaf8a691a


OCP Version 4.12.0-0.nightly-2022-09-26-111919

Comment 6 Rex Russell 2022-12-19 02:23:05 UTC
Is this bug still being worked on? Can we please get a status?
Thank you.

Comment 12 Yunfei Jiang 2023-02-08 05:26:53 UTC
Verified on 4.13.0-0.nightly-2023-02-07-064924, PASS.

Comment 14 errata-xmlrpc 2023-05-17 22:46:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.13.0 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:1326