Description of problem:

The error message below shows up repeatedly during cluster destroy:

Networks: failed to delete network yybz3-qg2qj-network with error: RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The network resource 'projects/openshift-qe/global/networks/yybz3-qg2qj-network' is already being used by 'projects/openshift-qe/global/firewalls/k8s-a86199278446349158fc901ebd106be2-http-hc'

The root cause is that the installer's destroy code only matches firewall rules that have the infra_id in their name, but some firewall rules do not include the infra_id. In that case the Google API complains that the associated network is still in use when the installer attempts to delete it:

func (o *ClusterUninstaller) listFirewalls() ([]cloudResource, error) {
	return o.listFirewallsWithFilter("items(name),nextPageToken", o.clusterIDFilter(), nil)
}

Version-Release number of the following components:
4.4.0-0.nightly-2020-02-11-060435

How reproducible:
Always

Steps to Reproduce:
1. Create an IPI cluster on GCP
2. Destroy the cluster using openshift-install

Actual results:
DEBUG Listing instance groups
DEBUG Listing forwarding rules
DEBUG Listing backend services
DEBUG Listing health checks
DEBUG Listing HTTP health checks
DEBUG Listing routers
DEBUG Listing subnetworks
DEBUG Listing networks
DEBUG Found network: yybz3-qg2qj-network
DEBUG Listing routes
DEBUG Deleting network yybz3-qg2qj-network
DEBUG Networks: failed to delete network yybz3-qg2qj-network with error: RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The network resource 'projects/openshift-qe/global/networks/yybz3-qg2qj-network' is already being used by 'projects/openshift-qe/global/firewalls/k8s-a86199278446349158fc901ebd106be2-http-hc'

Expected results:
Installer destroys all cluster resources
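A minimal sketch (not the installer's actual code) of the filtering gap described above: the uninstaller considers a resource cluster-owned only when its name starts with the cluster's infra ID, so firewall rules named by the in-cluster GCP cloud provider (k8s-<hash>-...) never match and are left behind, which in turn blocks the network deletion.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesClusterFilter mimics the name-based filter used by listFirewalls:
// a resource is treated as cluster-owned only if its name starts with the
// cluster's infra ID (e.g. "yybz3-qg2qj").
func matchesClusterFilter(name, infraID string) bool {
	return strings.HasPrefix(name, infraID+"-")
}

func main() {
	infraID := "yybz3-qg2qj"
	resources := []string{
		"yybz3-qg2qj-api",                              // created by the installer: matches
		"k8s-a86199278446349158fc901ebd106be2-http-hc", // created by the cloud provider: does NOT match
	}
	for _, name := range resources {
		fmt.Printf("%s owned=%v\n", name, matchesClusterFilter(name, infraID))
	}
}
```

The second firewall rule is skipped by the filter, yet it still references the cluster network, producing the RESOURCE_IN_USE_BY_ANOTHER_RESOURCE error.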
Correction: it is not 100% reproducible.
(In reply to yangyang from comment #0)
> Below error message shows up during cluster destroying. [...]

Can you attach the complete .openshift_install.log?
Created attachment 1662841 [details] openshift_intsall.log
(In reply to yangyang from comment #3)
> Created attachment 1662841 [details]
> openshift_intsall.log

That is the wrong log; please ignore it.
Created attachment 1662843 [details] openshift_install_1.log
It appears to happen when the destroy process is interrupted while firewall rules are being deleted, before all of them have been removed.
Updating the steps to reproduce:
1. Create an IPI cluster on GCP
2. Destroy the cluster using openshift-install
3. Interrupt the destroy process while the firewall rules are being deleted
4. Re-run the destroy using openshift-install
The environment is kept around in case you need to debug the issue. Run openshift-install destroy to reproduce it. Attaching the metadata.json.
Created attachment 1662898 [details] metadata.json
(In reply to yangyang from comment #7)
> Updating the reproduced steps: [...]
> 3, Terminate the destroy progress when destroying the firewall rules

Don't terminate the destroy. The GCP k8s cloud provider creates randomly named resources that can only be identified while the instances are present. Terminating the destroy midway means we lose all the internal state collected from the instances, and on the next run the instances are already gone. There isn't a whole lot of improvement we can do here.
> Don't terminate the destroy

Understood, but we can't always expect the destroy to run smoothly.
Hi Abhinav,

I re-tried the scenario in comment#7 but the issue did not occur. However, it does happen when destroying a cluster that contains a machine whose name does not include the infra_id. The destroy leaks the firewall rules projects/openshift-qe/global/firewalls/k8s-a86199278446349158fc901ebd106be2-http-hc and projects/openshift-qe/global/firewalls/k8s-a86199278446349158fc901ebd106be2. I am updating the steps to reproduce below. Please let me know if you need a separate bug; it's okay for me to close this one.

Steps to reproduce:
1. Create an IPI cluster on GCP
2. Add a machine to the cluster such that the new instance name does not contain the infra_id, e.g.

# cat machine.yaml
name: yanyang-machine
# oc create -f machine.yaml
# oc get machine -n openshift-machine-api
NAME                    PHASE     TYPE            REGION        ZONE            AGE
yanyang-machine         Running   n1-standard-4   us-central1   us-central1-a   92m
yybz3-qg2qj-m-0         Running   n1-standard-4   us-central1   us-central1-a   28h
yybz3-qg2qj-m-1         Running   n1-standard-4   us-central1   us-central1-b   28h
yybz3-qg2qj-m-2         Running   n1-standard-4   us-central1   us-central1-c   28h
yybz3-qg2qj-w-a-z6n69   Running   n1-standard-4   us-central1   us-central1-a   28h
yybz3-qg2qj-w-b-zfzg9   Running   n1-standard-4   us-central1   us-central1-b   28h
yybz3-qg2qj-w-c-rjgzb   Running   n1-standard-4   us-central1   us-central1-c   28h

3. Destroy the cluster. The destroy leaks the firewall rules, so it gets stuck in a loop trying to delete the network:

level=debug msg="Deleting network yybz3-qg2qj-network"
level=debug msg="Networks: failed to delete network yanyan-4ldrz-network with error: RESOURCE_IN_USE_BY_ANOTHER_RESOURCE: The network resource 'projects/openshift-qe/global/networks/yybz3-qg2qj-network' is already being used by 'projects/openshift-qe/global/firewalls/k8s-a7bd0dc51cf3c4b70bfb906c861a0dad-http-hc'"
> However, it happens when destroying a cluster in which there exists a machine that does not contain the infra_id.

The installer will not delete any machines not prefixed with the infra_id, because that implies the machine does not belong to the cluster.

This will also cause the installer to skip resources like instance groups, as it's not safe to delete those either, since they contain non-cluster machines.

This is expected behaviour and not a bug.
> the installer will not delete any machines not prefixed with infra-id

Since PR https://github.com/openshift/installer/pull/3059 was introduced, the installer can delete instances that have the owned label. The problem is that the k8s-* firewall rules are leaked, which blocks the destroy.
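A sketch of the label-based ownership check the PR relies on, as opposed to the name-prefix filter. The label key format shown (kubernetes-io-cluster-<infraID>) is an assumption based on the GCP cluster-ownership labeling convention, not code copied from the installer; the point is that instances can be identified by label even when their name lacks the infra_id, whereas the leaked firewall rules carry neither the name prefix nor such a label.

```go
package main

import "fmt"

// ownedByCluster sketches a label-based ownership check: an instance is
// considered cluster-owned if it carries the kubernetes-io-cluster-<infraID>
// label with value "owned", regardless of its name. (Label key format is an
// assumption for illustration.)
func ownedByCluster(labels map[string]string, infraID string) bool {
	return labels["kubernetes-io-cluster-"+infraID] == "owned"
}

func main() {
	infraID := "yybz3-qg2qj"
	// Instance named without the infra_id prefix, but labeled as owned.
	instance := map[string]string{"kubernetes-io-cluster-yybz3-qg2qj": "owned"}
	fmt.Println(ownedByCluster(instance, infraID)) // deleted despite the non-matching name
}
```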
(In reply to Abhinav Dahiya from comment #13)
> This is expected behaviour and not a bug.

If customers hit this, what actions should they take? I.e., what is the workaround (delete things manually in the GCP console)? If so, how do we identify the installer-created resources to guide them through deleting them?
I think for now users will have to manually look for health checks and firewall rules in the project and network that are no longer in use and delete them using the console. The k8s GCP controller does not add identifiable markers to these, so no easy filtering is possible.

*** This bug has been marked as a duplicate of bug 1875511 ***
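As a sketch of the manual cleanup, the leaked rules can also be found from the gcloud CLI by filtering on the network they reference (the project and names below are the examples from this bug; verify each rule belongs to the deleted cluster before removing it):

```sh
# List firewall rules that reference the cluster's network, including the
# leaked k8s-* rules that lack the infra_id prefix.
gcloud compute firewall-rules list \
    --project openshift-qe \
    --filter="network ~ yybz3-qg2qj-network"

# After confirming a rule belongs to the deleted cluster, remove it, then
# re-run 'openshift-install destroy cluster' so the network can be deleted.
gcloud compute firewall-rules delete k8s-a86199278446349158fc901ebd106be2-http-hc \
    --project openshift-qe
```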