Bug 1801968

Summary:

GCP cluster destroy gets stuck in a loop of network deleting

Product:

OpenShift Container Platform

Reporter:

Yang Yang <yanyang>

Component:

Installer

Assignee:

aos-install

Installer sub component:

openshift-installer

QA Contact:

Yang Yang <yanyang>

Status:

CLOSED DUPLICATE

Docs Contact:

Severity:

medium

Priority:

medium

CC:

adahiya, bleanhar, erich, gpei, jiajliu

Version:

4.4

Keywords:

Reopened

Target Milestone:

---

Target Release:

4.7.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2020-10-12 17:47:38 UTC

Type:

Bug

Regression:

---

Attachments:

Description	Flags
openshift_intsall.log	none
openshift_install_1.log	none
metadata.json	none

Description Yang Yang 2020-02-12 02:08:08 UTC

Description of problem:

The error message below shows up repeatedly while the cluster is being destroyed.

Networks: failed to delete network yybz3-qg2qj-network with error: RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The network resource 'projects/openshift-qe/global/networks/yybz3-qg2qj-network' is already being used by 'projects/openshift-qe/global/firewalls/k8s-a86199278446349158fc901ebd106be2-http-hc'

The root cause is that the installer's destroy only lists firewall rules whose names contain the infra_id, but some firewall rules do not have the infra_id in their names. Those rules are left behind, so the Google API reports that the associated network is still in use when the installer attempts to delete it.

func (o *ClusterUninstaller) listFirewalls() ([]cloudResource, error) { 
        return o.listFirewallsWithFilter("items(name),nextPageToken", o.clusterIDFilter(), nil) 
} 
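
To illustrate the gap described above, here is a minimal sketch of a name-based filter. It assumes a plain prefix match on the infra_id, which is a simplification and not the installer's actual filter expression; the function name splitByInfraID and the rule name yybz3-qg2qj-api are hypothetical. The k8s-* name is taken from the error message in this report.

```go
package main

import (
	"fmt"
	"strings"
)

// splitByInfraID partitions firewall rule names into those that a
// name-based cluster filter would find and those it would miss.
// Assumption: the filter is modeled as a simple "<infraID>-" prefix match.
func splitByInfraID(names []string, infraID string) (matched, missed []string) {
	for _, name := range names {
		if strings.HasPrefix(name, infraID+"-") {
			matched = append(matched, name)
		} else {
			missed = append(missed, name)
		}
	}
	return matched, missed
}

func main() {
	names := []string{
		"yybz3-qg2qj-api", // hypothetical installer-created rule
		"k8s-a86199278446349158fc901ebd106be2-http-hc", // rule named in the error above
	}
	matched, missed := splitByInfraID(names, "yybz3-qg2qj")
	fmt.Println("matched by name filter:", matched)
	fmt.Println("missed by name filter: ", missed)
}
```

Any rule that lands in the missed set stays attached to the network, which is consistent with the RESOURCE_IN_USE_BY_ANOTHER_RESOURCE error on network deletion.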


Version-Release number of the following components:
4.4.0-0.nightly-2020-02-11-060435

How reproducible:
Always

Steps to Reproduce:
1. Create an IPI cluster on GCP
2. Destroy cluster using openshift-install

Actual results:
DEBUG Listing instance groups                       
DEBUG Listing forwarding rules                      
DEBUG Listing backend services                      
DEBUG Listing health checks                         
DEBUG Listing HTTP health checks                    
DEBUG Listing routers                               
DEBUG Listing subnetworks                           
DEBUG Listing networks                              
DEBUG Found network: yybz3-qg2qj-network            
DEBUG Listing routes                                
DEBUG Deleting network yybz3-qg2qj-network          
DEBUG Networks: failed to delete network yybz3-qg2qj-network with error: RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The network resource 'projects/openshift-qe/global/networks/yybz3-qg2qj-network' is already being used by 'projects/openshift-qe/global/firewalls/k8s-a86199278446349158fc901ebd106be2-http-hc'  


Expected results:
Installer destroyed all cluster resources


Comment 1 Yang Yang 2020-02-12 07:40:19 UTC

It is not 100% reproducible.

Comment 2 Abhinav Dahiya 2020-02-12 17:35:32 UTC

(In reply to yangyang from comment #0)

Can you attach the complete .openshift_install.log?

Comment 3 Yang Yang 2020-02-13 02:08:31 UTC

Created attachment 1662841 [details]
openshift_intsall.log

Comment 4 Yang Yang 2020-02-13 02:17:14 UTC

(In reply to yangyang from comment #3)
> Created attachment 1662841 [details]
> openshift_intsall.log

That is the wrong log, so please ignore it.

Comment 5 Yang Yang 2020-02-13 02:19:48 UTC

Created attachment 1662843 [details]
openshift_install_1.log

Comment 6 Yang Yang 2020-02-13 03:37:24 UTC

It appears to happen when the destroy process is stopped while the firewall rules are being deleted, before all of them have been removed.

Comment 7 Yang Yang 2020-02-13 04:39:15 UTC

Updated steps to reproduce:
1. Create an IPI cluster on GCP
2. Destroy the cluster using openshift-install
3. Terminate the destroy process while it is deleting the firewall rules
4. Destroy the cluster again using openshift-install

Comment 8 Yang Yang 2020-02-13 08:03:18 UTC

The environment is being kept around in case you need to debug the issue. Run openshift-install destroy to reproduce it. I am attaching the metadata.json.

Comment 9 Yang Yang 2020-02-13 08:03:53 UTC

Created attachment 1662898 [details]
metadata.json

Comment 10 Abhinav Dahiya 2020-02-13 17:18:18 UTC

(In reply to yangyang from comment #7)
> Updated steps to reproduce:
> 1. Create an IPI cluster on GCP
> 2. Destroy the cluster using openshift-install
> 3. Terminate the destroy process while it is deleting the firewall rules
> 4. Destroy the cluster again using openshift-install

Don't terminate the destroy. The GCP k8s cloud provider creates randomly named resources that can only be identified while the instances are present. Terminating the destroy partway through means we lose all the internal state collected from the instances, and by the next run the instances are gone.

There isn't a whole lot of improvement we can make here.
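
As a rough illustration of why the instances matter, the sketch below links firewall rules to a cluster through the network tags carried by its instances. This is an assumption made for illustration only and is not confirmed by this report as the installer's mechanism; the types firewallRule and instance, the function rulesTargetingInstances, and the tag yybz3-qg2qj-worker are all hypothetical.

```go
package main

import "fmt"

// firewallRule and instance are hypothetical, simplified stand-ins
// for the corresponding GCP API objects.
type firewallRule struct {
	Name       string
	TargetTags []string
}

type instance struct {
	Name string
	Tags []string
}

// rulesTargetingInstances returns the names of rules that share at
// least one tag with any of the given instances.
func rulesTargetingInstances(rules []firewallRule, instances []instance) []string {
	tags := map[string]bool{}
	for _, inst := range instances {
		for _, t := range inst.Tags {
			tags[t] = true
		}
	}
	var found []string
	for _, r := range rules {
		for _, t := range r.TargetTags {
			if tags[t] {
				found = append(found, r.Name)
				break
			}
		}
	}
	return found
}

func main() {
	rules := []firewallRule{{
		Name:       "k8s-a86199278446349158fc901ebd106be2-http-hc",
		TargetTags: []string{"yybz3-qg2qj-worker"}, // hypothetical tag
	}}
	instances := []instance{{
		Name: "yybz3-qg2qj-w-a-z6n69",
		Tags: []string{"yybz3-qg2qj-worker"}, // hypothetical tag
	}}

	// First run: instances still exist, so the rule can be discovered.
	fmt.Println(rulesTargetingInstances(rules, instances))
	// Re-run after the instances are gone: nothing links the rule to the cluster.
	fmt.Println(rulesTargetingInstances(rules, nil))
}
```

Under this model, a second destroy run that starts after the instances have been deleted has no way to rediscover the randomly named rule.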

Comment 11 Yang Yang 2020-02-14 02:18:34 UTC

> Don't terminate the destroy

Understood, but we can't always expect the destroy to run smoothly.

Comment 12 Yang Yang 2020-03-25 13:56:29 UTC

Hi Abhinav,

I retried the scenario in comment #7, but the issue did not occur. However, it does happen when destroying a cluster that contains a machine whose name does not include the infra_id. Cluster destroy leaks the firewall rules projects/openshift-qe/global/firewalls/k8s-a86199278446349158fc901ebd106be2-http-hc and projects/openshift-qe/global/firewalls/k8s-a86199278446349158fc901ebd106be2. I am updating the steps to reproduce below. Please let me know if you need a separate bug. I am also fine with closing this one.

Steps to reproduce:
1. Create an IPI cluster on GCP
2. Add a machine to the cluster such that the new instance name does not contain the infra_id, e.g.
# cat machine.yaml
name: yanyang-machine

# oc create -f machine.yaml

# oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
yanyang-machine Running n1-standard-4 us-central1 us-central1-a 92m
yybz3-qg2qj-m-0 Running n1-standard-4 us-central1 us-central1-a 28h
yybz3-qg2qj-m-1 Running n1-standard-4 us-central1 us-central1-b 28h
yybz3-qg2qj-m-2 Running n1-standard-4 us-central1 us-central1-c 28h
yybz3-qg2qj-w-a-z6n69 Running n1-standard-4 us-central1 us-central1-a 28h
yybz3-qg2qj-w-b-zfzg9 Running n1-standard-4 us-central1 us-central1-b 28h
yybz3-qg2qj-w-c-rjgzb Running n1-standard-4 us-central1 us-central1-c 28h

3. Destroy the cluster. The destroy leaks firewall rules, so it gets stuck in a loop of network deletion:
level=debug msg="Deleting network yybz3-qg2qj-network"
level=debug msg="Networks: failed to delete network yanyan-4ldrz-network with error: RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The network resource 'projects/openshift-qe/global/networks/yybz3-qg2qj-network' is already being used by 'projects/openshift-qe/global/firewalls/k8s-a7bd0dc51cf3c4b70bfb906c861a0dad-http-hc'"

Comment 13 Abhinav Dahiya 2020-05-11 18:13:02 UTC

> However, it happens when destroying a cluster in which there exists a machine that does not contain the infra_id.

The installer will not delete any machines that are not prefixed with the infra-id, because a missing prefix implies that the machine does not belong to the cluster.

This also causes the installer to skip resources such as instance groups, since it is not safe to delete them when they contain non-cluster machines.

This is expected behaviour and not a bug.

Comment 14 Yang Yang 2020-05-12 03:38:33 UTC

> the installer will not delete any machines not prefixed with infra-id

Since the PR https://github.com/openshift/installer/pull/3059 was introduced, the installer can delete instances that carry the owned label.

The problem is that the k8s-* firewall rules are leaked, which blocks the destroy.

Comment 20 Eric Rich 2020-09-01 20:45:13 UTC

(In reply to Abhinav Dahiya from comment #13)
> > However, it happens when destroying a cluster in which there exists a machine that does not contain the infra_id.
> 
> the installer will not delete any machines not prefixed with infra-id,
> because it implies that the machine is not for the cluster.
> 
> This will cause the installer to skip resources like instance groups, as
> it's not safe to delete those too since they have non-cluster machines.
> 
> This is expected behaviour and not a bug.

If customers hit this, what actions should they take? That is, what is the workaround (deleting things manually in the GCP console)?
If so, how do we identify the resources created by the OCP installer so that we can guide customers through deleting them?

Comment 22 Abhinav Dahiya 2020-10-12 17:47:38 UTC

I think for now users will have to manually look for health checks and firewall rules in the project and network that are no longer in use and delete them using the console. The k8s GCP controller adds no identifiable markers to these resources, so no easy filtering is possible.
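
One starting point for manual cleanup is the error message itself, which names the resource that blocks the network deletion. The following is a minimal sketch that extracts that resource path from a RESOURCE_IN_USE_BY_ANOTHER_RESOURCE message. It assumes the message keeps the wording seen in this report; the names blockerPattern and blockingResource are hypothetical and not part of the installer.

```go
package main

import (
	"fmt"
	"regexp"
)

// blockerPattern captures the quoted resource path that follows
// "is already being used by" in the error text.
// Assumption: the wording matches the messages quoted in this report.
var blockerPattern = regexp.MustCompile(`is already\s+being used by\s+'([^']+)'`)

// blockingResource returns the resource path named as the blocker,
// or false if the message does not match the expected wording.
func blockingResource(msg string) (string, bool) {
	m := blockerPattern.FindStringSubmatch(msg)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	msg := "RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The network resource " +
		"'projects/openshift-qe/global/networks/yybz3-qg2qj-network' is already being used by " +
		"'projects/openshift-qe/global/firewalls/k8s-a86199278446349158fc901ebd106be2-http-hc'"
	if res, ok := blockingResource(msg); ok {
		fmt.Println("blocking resource:", res)
	}
}
```

The extracted path only identifies the resource reported in that particular error. Other leaked rules may surface one at a time on later attempts, so each one should be checked before it is deleted.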

*** This bug has been marked as a duplicate of bug 1875511 ***