
Bug 1662813

Summary: Local metadata and terraform files are not removed after run destroy cluster command
Product: OpenShift Container Platform
Component: Installer
Installer sub component: openshift-installer
Version: 4.1.0
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Reporter: liujia <jiajliu>
Assignee: Alex Crawford <crawford>
QA Contact: Johnny Liu <jialiu>
CC: wking
Type: Bug
Last Closed: 2019-01-23 18:23:07 UTC

Description liujia 2019-01-02 04:27:31 UTC
Description of problem:
Hit a limit error when creating a cluster with the installer, so ran the "destroy cluster" command to clean up all created files and resources. The destroy command completed successfully, but local files were still left behind, which caused re-creating the cluster to fail. Re-running the destroy command did not resolve it.

[root@preserve-installer 080]# ./openshift-install destroy cluster --dir demo
INFO Removed role jliu-master-role from instance profile jliu-master-profile 
INFO deleted profile jliu-master-profile          
INFO deleted role jliu-master-role                
INFO Removed role jliu-worker-role from instance profile jliu-worker-profile 
INFO deleted profile jliu-worker-profile          
INFO deleted role jliu-worker-role                
INFO Removed role jliu-bootstrap-role from instance profile jliu-bootstrap-profile 
INFO deleted profile jliu-bootstrap-profile       
INFO deleted role jliu-bootstrap-role             
INFO Emptied bucket                                name=terraform-20190102034846078300000001
INFO Deleted bucket                                name=terraform-20190102034846078300000001

[root@preserve-installer 080]# ./openshift-install create cluster --dir demo
FATAL failed to fetch Cluster: failed to load asset "Cluster": "terraform.tfstate" already exists.  There may already be a running cluster 

[root@preserve-installer 080]# ls -la demo/
total 1660
drwxr-xr-x. 3 root root   4096 Jan  2 04:03 .
drwxr-xr-x. 3 root root   4096 Jan  2 03:48 ..
drwxr-xr-x. 2 root root   4096 Jan  2 04:03 auth
-rw-r--r--. 1 root root   1351 Jan  2 02:18 kube-system-configmap-etcd-serving-ca.yaml
-rw-r--r--. 1 root root   1296 Jan  2 02:18 kube-system-configmap-root-ca.yaml
-rw-r--r--. 1 root root    167 Jan  2 03:48 metadata.json
-rw-r--r--. 1 root root 511013 Jan  2 04:03 .openshift_install.log
-rw-r--r--. 1 root root 830635 Jan  2 04:03 .openshift_install_state.json
-rw-r--r--. 1 root root 172311 Jan  2 03:48 terraform.tfstate
-rw-r--r--. 1 root root 153668 Jan  2 04:03 terraform.tfvars


Version-Release number of the following components:
./openshift-install v0.8.0

How reproducible:
100%

Steps to Reproduce:
1. Run "./openshift-install create cluster --dir demo"; creating the new cluster fails due to VpcLimitExceeded:
INFO Creating cluster...                          
ERROR                                              
ERROR Error: Error applying plan:                  
ERROR                                              
ERROR 1 error occurred:                            
ERROR 	* module.vpc.aws_vpc.new_vpc: 1 error occurred: 
ERROR 	* aws_vpc.new_vpc: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached. 
ERROR 	status code: 400, request id: f3ccedc2-a419-4faa-883c-f4f06ac05f86

2. Run "./openshift-install destroy cluster --dir demo" to destroy the broken cluster.

Actual results:
Local files on the disk are not removed.

Expected results:
Local files should be removed after destroy cluster.

Additional info:
Debug info for destroy command.
<--snip-->
DEBUG error getting tags for bucket us-west-1-226b53e58e0c418eb473ce68438afbde-2fd6: AuthorizationHeaderMalformed: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-1'
	status code: 400, request id: F60F46B22FFEDC04, host id: KzkGH0cvOj3yFn8xB/oRaOm2BE0L/RTdJiS6rrYHvaZZcSUK8w8HLAUDFGowwwxeHWppL+AYTfw=, skipping... 
DEBUG from 53 total s3 buckets, 0 match filters    
DEBUG Exiting deleting buckets (map[kubernetes.io/cluster/jliu:owned]) 
DEBUG goroutine deleteS3Buckets complete (1 left)  
DEBUG error getting tags for bucket us-west-1-226b53e58e0c418eb473ce68438afbde-2fd6: AuthorizationHeaderMalformed: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-1'
	status code: 400, request id: 0BA9BC7F52722D47, host id: SiGY9bJgcxh69/JMhwipC1OvXez5GRMqqr3+VeaFey6o4MguoBlv8MzVCmLmuNCFBMO1vmw4DTg=, skipping... 
DEBUG from 53 total s3 buckets, 0 match filters    
DEBUG Exiting deleting buckets (map[openshiftClusterID:3a5ef664-2baf-475b-8681-27cbbe4e0b0b]) 
DEBUG goroutine deleteS3Buckets complete (0 left)  
DEBUG Purging asset "Terraform Variables" from disk 
DEBUG Purging asset "Kubeconfig Admin" from disk

Comment 1 Alex Crawford 2019-01-05 00:42:17 UTC
> Local files should be removed after destroy cluster.

I disagree. Those assets and logs are immensely useful for debugging. Our docs suggest using a separate asset directory for each cluster to avoid this issue.
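
For illustration (the directory names here are only examples), that looks like:

# mkdir cluster-a cluster-b
# ./openshift-install create cluster --dir cluster-a
# ./openshift-install create cluster --dir cluster-b

Each --dir then holds only that cluster's assets, logs, and Terraform state, so a failed or destroyed cluster never interferes with any other.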

Comment 2 liujia 2019-01-07 02:33:42 UTC
Contrast with another destroy scenario ("destroy cluster" against a normal, fully created cluster):
1. Run "create cluster" to create a cluster successfully.
2. Run "destroy cluster" to destroy the above cluster.
# ls -la auto/
total 1280
drwxr-xr-x. 2 root root   4096 Jan  7 01:45 .
drwxr-xr-x. 5 root root  20480 Jan  4 07:31 ..
-rw-r--r--. 1 root root 666288 Jan  7 01:44 .openshift_install.log
-rw-r--r--. 1 root root 611654 Jan  7 01:44 .openshift_install_state.json
Only .openshift_install.log and .openshift_install_state.json are left after "destroy cluster" runs successfully (this was regarded as a complete destroy).

Regarding the issue in this bug: "destroy cluster" did not clean up the remaining local files after the resources were cleaned, the way it does in the scenario above.
1) Users who choose "destroy" do not want the cluster any more. If they want to keep the assets, they will debug or back them up before running "destroy". Since all resources have been cleaned up during "destroy", the leftover files are useless.
2) If we think "Those assets and logs are immensely useful for debugging.", then why are the local files cleaned up in the scenario above? What is the difference between these two "destroy" runs? The inconsistency will mislead users into re-running destroy, because they will think the destroy failed if local files (other than the hidden .* files) are left behind.
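
For reference, until that changes, the only way to reuse the same asset directory after this kind of "destroy cluster" seems to be cleaning it up by hand (a workaround I am assuming here, not documented behavior). Either start over with a fresh directory:
# rm -rf demo
or delete the leftover generated files from the listing in the description, e.g.:
# rm -f demo/metadata.json demo/terraform.tfstate demo/terraform.tfvars demo/kube-system-configmap-*.yaml
# rm -rf demo/auth
and then re-run "./openshift-install create cluster --dir demo".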

Comment 3 W. Trevor King 2019-01-08 01:04:25 UTC
> Only .openshift_install.log and .openshift_install_state.json are left after "destroy cluster" runs successfully (this was regarded as a complete destroy).

When you successfully destroy the cluster, there are no longer any remote cluster resources around.  But when creation fails, there might still be remote resources.  This is why having the Terraform state around after a failed 'create cluster' is useful, while having the Terraform state around after a successful 'destroy cluster' is not.  .openshift_install.log is still useful after a successful 'destroy cluster' to address "but I still see $RESOURCE, why didn't 'destroy cluster' remove it?" issues.  I personally don't care one way or the other about whether .openshift_install_state.json survives a successful 'destroy cluster', but see [1], which Abhinav said was intentional (although he didn't make that comment in the public GitHub PR).

[1]: https://github.com/openshift/installer/pull/547#issuecomment-435565108

Comment 4 liujia 2019-01-08 01:52:52 UTC
> When you successfully destroy the cluster, there are no longer any remote
> cluster resources around.  But when creation fails, there might still be
> remote resources.  This is why having the Terraform state around after a
> failed 'create cluster' is useful, while having the Terraform state around
> after a successful 'destroy cluster' is not. 

Thanks for your reply. I agree with what you said, but my scenario is the following:
1) Creation fails, and the Terraform state is left behind (as expected).
Users then have to run 'destroy cluster' because the creation failed and the cluster is only half built.
2) 'destroy cluster' is run, and the command finishes without errors.
# ./openshift-install destroy cluster --dir demo
INFO Removed role jliu-master-role from instance profile jliu-master-profile 
INFO deleted profile jliu-master-profile          
INFO deleted role jliu-master-role                
INFO Removed role jliu-worker-role from instance profile jliu-worker-profile 
INFO deleted profile jliu-worker-profile          
INFO deleted role jliu-worker-role                
INFO Removed role jliu-bootstrap-role from instance profile jliu-bootstrap-profile 
INFO deleted profile jliu-bootstrap-profile       
INFO deleted role jliu-bootstrap-role             
INFO Emptied bucket                                name=terraform-20190102034846078300000001
INFO Deleted bucket                                name=terraform-20190102034846078300000001

My concern is then: can this be regarded as a successful 'destroy'? If yes (it destroyed successfully), the installer should behave the same as any other successful destroy, just like "while having the Terraform state around after a successful 'destroy cluster' is not". If no (the destroy failed), there is no explicit hint or error output telling users that the destroy failed.

Comment 5 W. Trevor King 2019-01-18 07:12:00 UTC
https://github.com/openshift/installer/pull/1086 is in flight to empty the state on a successful 'destroy cluster'.
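
If that lands as described, a successful 'destroy cluster' after a failed creation should presumably leave the asset directory looking like the one in comment 2, i.e. something like:

# ls -A demo/
.openshift_install.log  .openshift_install_state.json

(the exact leftover files are my assumption based on comment 2, not taken from the PR).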

> If not(destroy fail), there is not any explicit hint or error output to note users the destroy fails.

If the installer exits zero, it's a successful destroy (just like every other command-line command).  For some failure modes, the installer will log a fatal error (both to .openshift_install.log and the terminal) when it fails.  For "I tried to delete $RESOURCE but couldn't" errors, the installer will continue to attempt those deletions forever.  The user will be clued in to the failed deletion by the fact that they need to kill the 'destroy cluster' process in order to get their terminal back.
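
A trivial way to check that (plain shell, nothing installer-specific) is the exit status:

# ./openshift-install destroy cluster --dir demo
# echo $?
0

or, in a script: if ./openshift-install destroy cluster --dir demo; then echo "destroy succeeded"; else echo "destroy failed, see demo/.openshift_install.log"; fi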

Comment 6 Alex Crawford 2019-01-23 18:23:07 UTC
This should be fixed in 0.10.1.