Bug 1713336 - openshift-install destroy cluster hangs indefinitely
Summary: openshift-install destroy cluster hangs indefinitely
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: Jeremiah Stuever
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-23 12:30 UTC by Michael Gugino
Modified: 2020-07-13 17:11 UTC
CC: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:11:03 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/installer pull 3196 (closed): Bug 1713336: aws destroy: show warnings when things fail to delete after 5 minutes. Last updated 2020-11-11 15:39:12 UTC
Red Hat Product Errata RHBA-2020:2409. Last updated 2020-07-13 17:11:19 UTC

Description Michael Gugino 2019-05-23 12:30:56 UTC
Description of problem:
When calling `openshift-install destroy cluster` after a policy has been attached to the machine-api IAM user, the installer hangs indefinitely with no output (other components are deleted, but the installer gets stuck on IAM).

Version-Release number of selected component (if applicable):
$ ./openshift-install version
./openshift-install v4.1.0-201905171742-dirty
built from commit 6ba66dbb6c2c53e1901a6d167d1c813bbbf27f4d
release image quay.io/openshift-release-dev/ocp-release@sha256:dc67ad5edd91ca48402309fe0629593e5ae3333435ef8d0bc52c2b62ca725021

How reproducible:
100%

Steps to Reproduce:
1.  Create cluster
2.  Attach IAM policy to user
3.  Attempt to delete cluster

Actual results:
No output from the installer after a certain point.


Expected results:
Even if we can't handle modified IAM users, we should fail after a reasonable number of retries and show a message.  I left the destroy running overnight and it never exited.

Additional info:
Logs:
time="2019-05-23T08:21:09-04:00" level=debug msg="search for IAM roles"
time="2019-05-23T08:21:09-04:00" level=debug msg="search for IAM users"
time="2019-05-23T08:21:10-04:00" level=debug msg="delete IAM roles and users"
time="2019-05-23T08:21:11-04:00" level=debug msg="DeleteConflict: Cannot delete entity, must detach all policies first.\n\tstatus code: 409, request id: 3ff7b07c-7d55-11e9-999c-f3a267015376" arn="arn:aws:iam::269733383066:user/mgugino-dev-rxlx7-openshift-machine-api-8jtff"

Comment 3 Jeremiah Stuever 2020-02-26 22:27:52 UTC
We will not be changing the infinite looping behavior. I recommend you use the timeout command to limit this as needed. A patch to bubble up the errors as warnings after 5 minutes is in the works.

timeout 30m openshift-install destroy cluster
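
For example, a simple wrapper (a sketch; the asset directory is a placeholder, and 124 is the exit status GNU timeout reports when it kills the command):

timeout 30m openshift-install destroy cluster --dir <assets-dir> --log-level debug
if [ $? -eq 124 ]; then
    # The destroy timed out rather than failing outright; leftover IAM policies
    # or other externally attached resources are a likely cause.
    echo "destroy did not finish within 30 minutes"
fi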

Comment 7 Johnny Liu 2020-03-09 04:46:12 UTC
Reproduced this on 4.4.
[root@preserve-jialiu-ansible ~]# openshift-install version
openshift-install 4.4.0-0.nightly-2020-03-08-235004
built from commit f371355517f9da267c295e11c01cd3dfc54b39d4
release image registry.svc.ci.openshift.org/ocp/release@sha256:edaa0ec52c2d42bbcce96873f40fc343446786fde8b42ef29bfe6964c2f72658

DEBUG OpenShift Installer 4.4.0-0.nightly-2020-03-08-235004 
DEBUG Built from commit f371355517f9da267c295e11c01cd3dfc54b39d4 
INFO Credentials loaded from the "default" profile in file "/root/.aws/credentials" 
DEBUG search for and delete matching instances by tag matching aws.Filter{"kubernetes.io/cluster/geliu450309-whsk6":"owned"} 
DEBUG Terminated                                    instance=i-00f60cbc4133437b4
DEBUG Terminated                                    instance=i-0a610c9fb41ab452e
DEBUG Terminated                                    instance=i-0994045ee8c68335e
DEBUG Terminated                                    instance=i-008049cfac85bdff5
DEBUG Terminated                                    instance=i-03015ec27bd9114a7
DEBUG Terminated                                    instance=i-0495fb55c11e67f2c
DEBUG search for and delete matching resources by tag in us-east-2 matching aws.Filter{"kubernetes.io/cluster/geliu450309-whsk6":"owned"} 
INFO Deleted                                       arn="arn:aws:ec2:us-east-2:301721915996:natgateway/nat-0f0179e84f4df3d76" id=nat-0f0179e84f4df3d76
INFO Deleted                                       arn="arn:aws:ec2:us-east-2:301721915996:natgateway/nat-0de988808b6ac202c" id=nat-0de988808b6ac202c
INFO Deleted                                       arn="arn:aws:ec2:us-east-2:301721915996:natgateway/nat-0ec620659cabc1d3a" id=nat-0ec620659cabc1d3a
DEBUG search for and delete matching resources by tag in us-east-1 matching aws.Filter{"kubernetes.io/cluster/geliu450309-whsk6":"owned"} 
DEBUG no deletions from us-east-1, removing client 
DEBUG search for IAM roles                         
DEBUG search for IAM users                         
DEBUG delete IAM roles and users                   
DEBUG DeleteConflict: Cannot delete entity, must detach all policies first.
	status code: 409, request id: 2540a068-0015-4515-b0a6-cbe7da8b5b64  arn="arn:aws:iam::301721915996:user/geliu450309-whsk6-openshift-machine-api-aws-qrpl2"
DEBUG search for and delete matching resources by tag in us-east-2 matching aws.Filter{"kubernetes.io/cluster/geliu450309-whsk6":"owned"} 
DEBUG no deletions from us-east-2, removing client 
DEBUG search for IAM roles                         
DEBUG search for IAM users                         
DEBUG delete IAM roles and users                   
DEBUG DeleteConflict: Cannot delete entity, must detach all policies first.
	status code: 409, request id: 35342b2c-5ae4-4c11-800b-d99ef680e8c0  arn="arn:aws:iam::301721915996:user/geliu450309-whsk6-openshift-machine-api-aws-qrpl2"
DEBUG search for IAM roles                         
DEBUG search for IAM users                         
DEBUG delete IAM roles and users                   
DEBUG DeleteConflict: Cannot delete entity, must detach all policies first.
	status code: 409, request id: ca7e11ae-8d2d-4188-94eb-2e575e189482  arn="arn:aws:iam::301721915996:user/geliu450309-whsk6-openshif


It hangs here forever without any WARNING message, only DEBUG messages.

Verified this bug with 4.5.0-0.nightly-2020-03-06-190457.

DEBUG OpenShift Installer 4.5.0-0.nightly-2020-03-06-190457 
DEBUG Built from commit c9fb0963e6a058137ca07817cc181d49cf931a59 
<--snip-->
DEBUG search for IAM roles                         
DEBUG search for IAM users                         
DEBUG delete IAM roles and users                   
DEBUG DeleteConflict: Cannot delete entity, must detach all policies first.
	status code: 409, request id: d1469848-5ddb-4ceb-9642-b86e1267c677  arn="arn:aws:iam::301721915996:user/geliu450309-whsk6-openshift-machine-api-aws-qrpl2"
DEBUG search for IAM roles                         
DEBUG search for IAM users                         
DEBUG delete IAM roles and users                   
DEBUG DeleteConflict: Cannot delete entity, must detach all policies first.
	status code: 409, request id: 5de6aa5d-92e1-460e-9f35-a529db07bbb4  arn="arn:aws:iam::301721915996:user/geliu450309-whsk6-openshift-machine-api-aws-qrpl2"
DEBUG search for IAM roles                         
DEBUG search for IAM users                         
DEBUG delete IAM roles and users                   
DEBUG DeleteConflict: Cannot delete entity, must detach all policies first.
	status code: 409, request id: ba5f3cb5-13c5-461a-bfd9-78b7878ab109  arn="arn:aws:iam::301721915996:user/geliu450309-whsk6-openshift-machine-api-aws-qrpl2"
DEBUG search for IAM roles                         
DEBUG search for IAM users                         
DEBUG delete IAM roles and users                   
DEBUG DeleteConflict: Cannot delete entity, must detach all policies first.
	status code: 409, request id: fffee2c6-42ed-42bc-bcc9-0a56cb7ca9d9  arn="arn:aws:iam::301721915996:user/geliu450309-whsk6-openshift-machine-api-aws-qrpl2"
DEBUG search for IAM roles                         
DEBUG search for IAM users                         
DEBUG delete IAM roles and users                   
DEBUG DeleteConflict: Cannot delete entity, must detach all policies first.
	status code: 409, request id: 6c5e269c-b5f3-4ac0-9244-f3578c3031b8  arn="arn:aws:iam::301721915996:user/geliu450309-whsk6-openshift-machine-api-aws-qrpl2"
DEBUG search for IAM roles                         
DEBUG search for IAM users                         
DEBUG delete IAM roles and users                   
WARNING DeleteConflict: Cannot delete entity, must detach all policies first.
	status code: 409, request id: a205639a-d138-4dae-905f-cae693686a62  arn="arn:aws:iam::301721915996:user/geliu450309-whsk6-openshift-machine-api-aws-qrpl2"
DEBUG search for IAM roles                         
DEBUG search for IAM users 

Though it still hangs here forever, it now prints a WARNING message.

Comment 8 To Hung Sze 2020-06-17 13:23:45 UTC
I ran into a different variation of this bug:
I have a 4.5 RC1 GCP IPI cluster with a firewall rule set up (but disabled).
The destroy cluster script continuously loops and displays this error (I have debug on):
DEBUG Deleting network tszegp-gt4q5-network        
DEBUG Networks: failed to delete network tszegp-gt4q5-network with error: RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The network resource 'projects/openshift-qe/global/networks/tszegp-gt4q5-network' is already being used by 'projects/openshift-qe/global/firewalls/tsze-blocks-quayio'
After I deleted the firewall rule, the script continued and finished.
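
For reference, that manual cleanup can be done with gcloud (a sketch; the rule and project names are the ones from the error message above):

# Find firewall rules still referencing the cluster network, then delete the
# offending rule so the installer can remove the network.
gcloud compute firewall-rules list --project openshift-qe | grep tszegp-gt4q5-network
gcloud compute firewall-rules delete tsze-blocks-quayio --project openshift-qe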

Comment 9 To Hung Sze 2020-06-19 15:05:58 UTC
Current GCP behavior (4.5) when destroy is kicked off without specifying a log level and a firewall rule is associated with the cluster:
INFO Deleted route default-route-2296075b8a62a19a 
INFO Deleted route default-route-912519603c8403c4 
INFO Deleted target pool a3bc35fd3f5aa4401b32a3e466ea060a 
INFO Deleted target pool tszegcp61820c-76kff-api  
INFO Deleted backend service tszegcp61820c-76kff-api-internal 
INFO Deleted subnetwork tszegcp61820c-76kff-master-subnet 
INFO Deleted subnetwork tszegcp61820c-76kff-worker-subnet 
INFO Deleted instance group tszegcp61820c-76kff-master-us-central1-b 
INFO Deleted instance group tszegcp61820c-76kff-master-us-central1-c 
INFO Deleted instance group tszegcp61820c-76kff-master-us-central1-a 
INFO Deleted health check tszegcp61820c-76kff-api-internal 
INFO Deleted HTTP health check a3bc35fd3f5aa4401b32a3e466ea060a 
INFO Deleted HTTP health check tszegcp61820c-76kff-api 
(appears to be stuck here)





(note: after the firewall associated with this cluster was deleted separately, the installer continued)
INFO Deleted network tszegcp61820c-76kff-network  
INFO Time elapsed: 1h24m25s

Comment 10 Jeremiah Stuever 2020-06-19 16:48:55 UTC
We only addressed AWS with this BZ. GCP was handled with an RFE, which has yet to merge.

https://github.com/openshift/installer/pull/3749

Comment 12 errata-xmlrpc 2020-07-13 17:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

