Bug 2090049
Summary: | destroying GCP cluster which has a compute node without infra id in name would fail to delete 2 k8s firewall-rules and VPC network | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Jianli Wei <jiwei>
Component: | Installer | Assignee: | Brent Barbachem <bbarbach>
Installer sub component: | openshift-installer | QA Contact: | Jianli Wei <jiwei>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium | |
Priority: | unspecified | CC: | bbarbach, gpei
Version: | 4.11 | |
Target Milestone: | --- | |
Target Release: | 4.11.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-08-10 11:14:09 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | |
Description
Jianli Wei
2022-05-25 02:23:24 UTC
Brent Barbachem

Hello, I am opening up a discussion based on the results that I have found investigating this BZ. When a Machine is created without the infra ID as the prefix of its name, the name is compared with the cluster name to determine whether the resource should be deleted. It finds that these do NOT match and thus stops the deletion of this resource. In fact, it appears to stop the deletion of all of the TargetPools for GCP. (Question #1: should we delete the resources that are definitely part of the cluster on destroy, even though some may fail the current checks?)

The reason these resources (firewalls, etc.) are unable to be destroyed is that they are attached to the resource(s) that could not be removed. They will never be able to be removed while they are still in use. (Question #2: should we delete resources that were created on day 2?) If an operator creates resources, the installer shouldn't make assumptions about their use and remove them, should it?

Brent Barbachem

@jianli wei, forgot to CC you on that comment
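As a rough illustration of the naming check described above, here is a minimal sketch (hypothetical names and simplified logic, not the installer's actual code):

    # A resource survives destroy unless its name carries the infra-ID
    # prefix or, failing that, matches the cluster name.
    infra_id="jiwei-openshift-gtsth"
    cluster_name="jiwei-openshift"   # assumed; the infra ID is the cluster name plus a random suffix
    for name in jiwei-openshift-gtsth-worker-a-xqlqm hello-world-wjg6h; do
      case "$name" in
        "$infra_id"*)     echo "$name: carries the infra ID, deleted on destroy" ;;
        "$cluster_name"*) echo "$name: matches the cluster name, deleted on destroy" ;;
        *)                echo "$name: matches neither, skipped on destroy" ;;
      esac
    done

A skipped instance keeps its TargetPool in use, which in turn appears to be what leaves the two k8s firewall rules and the VPC network named in the summary undeletable.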
Jianli Wei

Tested with the build (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1533626873861378048) generated by the Slack app "Cluster Bot" for the PR https://github.com/openshift/installer/pull/5965; the issue no longer occurs.

> 1. launch the cluster https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/109236/ (SUCCESS)
LAUNCHER_VARS installer_payload_image: registry.build01.ci.openshift.org/ci-ln-zz49qtk/release:latest

> 2. scale up using a machineset's yaml to launch one additional compute node whose name doesn't have the cluster infra ID
$ oc get clusterversion
NAME      VERSION                                                    AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.ci.test-2022-06-06-025004-ci-ln-zz49qtk-latest    True        False         9m41s   Cluster version is 4.11.0-0.ci.test-2022-06-06-025004-ci-ln-zz49qtk-latest
$ oc get nodes
NAME                                                           STATUS   ROLES    AGE   VERSION
jiwei-openshift-gtsth-master-0.c.openshift-qe.internal         Ready    master   35m   v1.24.0+bb9c2f1
jiwei-openshift-gtsth-master-1.c.openshift-qe.internal         Ready    master   34m   v1.24.0+bb9c2f1
jiwei-openshift-gtsth-master-2.c.openshift-qe.internal         Ready    master   34m   v1.24.0+bb9c2f1
jiwei-openshift-gtsth-worker-a-xqlqm.c.openshift-qe.internal   Ready    worker   19m   v1.24.0+bb9c2f1
jiwei-openshift-gtsth-worker-b-7m86p.c.openshift-qe.internal   Ready    worker   19m   v1.24.0+bb9c2f1
jiwei-openshift-gtsth-worker-c-7g5x8.c.openshift-qe.internal   Ready    worker   19m   v1.24.0+bb9c2f1
$ oc get machinesets -n openshift-machine-api
NAME                             DESIRED   CURRENT   READY   AVAILABLE   AGE
jiwei-openshift-gtsth-worker-a   1         1         1       1           35m
jiwei-openshift-gtsth-worker-b   1         1         1       1           35m
jiwei-openshift-gtsth-worker-c   1         1         1       1           35m
jiwei-openshift-gtsth-worker-f   0         0                             35m
$ oc get machinesets jiwei-openshift-gtsth-worker-a -n openshift-machine-api -oyaml > /tmp/ms1.yaml
$ sed -i 's/jiwei-openshift-gtsth-worker-a/hello-world/g' /tmp/ms1.yaml
$ vim /tmp/ms1.yaml
$ oc create -f /tmp/ms1.yaml
machineset.machine.openshift.io/hello-world created
$ oc get machinesets -n openshift-machine-api
NAME                             DESIRED   CURRENT   READY   AVAILABLE   AGE
hello-world                      1         1                             59s
jiwei-openshift-gtsth-worker-a   1         1         1       1           38m
jiwei-openshift-gtsth-worker-b   1         1         1       1           38m
jiwei-openshift-gtsth-worker-c   1         1         1       1           38m
jiwei-openshift-gtsth-worker-f   0         0                             38m
$ oc get machines -n openshift-machine-api
NAME                                   PHASE         TYPE            REGION        ZONE            AGE
hello-world-wjg6h                      Provisioned   n1-standard-4   us-central1   us-central1-a   68s
jiwei-openshift-gtsth-master-0         Running       n1-standard-4   us-central1   us-central1-a   38m
jiwei-openshift-gtsth-master-1         Running       n1-standard-4   us-central1   us-central1-b   38m
jiwei-openshift-gtsth-master-2         Running       n1-standard-4   us-central1   us-central1-c   38m
jiwei-openshift-gtsth-worker-a-xqlqm   Running       n1-standard-4   us-central1   us-central1-a   34m
jiwei-openshift-gtsth-worker-b-7m86p   Running       n1-standard-4   us-central1   us-central1-b   34m
jiwei-openshift-gtsth-worker-c-7g5x8   Running       n1-standard-4   us-central1   us-central1-c   34m
$
$ oc get machines -n openshift-machine-api
NAME                                   PHASE     TYPE            REGION        ZONE            AGE
hello-world-wjg6h                      Running   n1-standard-4   us-central1   us-central1-a   3m24s
jiwei-openshift-gtsth-master-0         Running   n1-standard-4   us-central1   us-central1-a   40m
jiwei-openshift-gtsth-master-1         Running   n1-standard-4   us-central1   us-central1-b   40m
jiwei-openshift-gtsth-master-2         Running   n1-standard-4   us-central1   us-central1-c   40m
jiwei-openshift-gtsth-worker-a-xqlqm   Running   n1-standard-4   us-central1   us-central1-a   36m
jiwei-openshift-gtsth-worker-b-7m86p   Running   n1-standard-4   us-central1   us-central1-b   36m
jiwei-openshift-gtsth-worker-c-7g5x8   Running   n1-standard-4   us-central1   us-central1-c   36m
$ oc get nodes
NAME                                                           STATUS   ROLES    AGE   VERSION
hello-world-wjg6h.c.openshift-qe.internal                      Ready    worker   57s   v1.24.0+bb9c2f1
jiwei-openshift-gtsth-master-0.c.openshift-qe.internal         Ready    master   40m   v1.24.0+bb9c2f1
jiwei-openshift-gtsth-master-1.c.openshift-qe.internal         Ready    master   40m   v1.24.0+bb9c2f1
jiwei-openshift-gtsth-master-2.c.openshift-qe.internal         Ready    master   39m   v1.24.0+bb9c2f1
jiwei-openshift-gtsth-worker-a-xqlqm.c.openshift-qe.internal   Ready    worker   25m   v1.24.0+bb9c2f1
jiwei-openshift-gtsth-worker-b-7m86p.c.openshift-qe.internal   Ready    worker   24m   v1.24.0+bb9c2f1
jiwei-openshift-gtsth-worker-c-7g5x8.c.openshift-qe.internal   Ready    worker   25m   v1.24.0+bb9c2f1
$

> 3. destroy the cluster https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-destroy/101055/ (SUCCESS)

> 4. check the cluster's resources on GCP and got nothing left over
$ ./gcp_res_check.sh jiwei-openshift-gtsth
>>gcloud compute instances list | grep jiwei-openshift-gtsth
>>gcloud compute instance-groups list | grep jiwei-openshift-gtsth
>>gcloud compute disks list | grep jiwei-openshift-gtsth
>>gcloud compute networks list | grep jiwei-openshift-gtsth
>>gcloud compute networks subnets list | grep jiwei-openshift-gtsth
>>gcloud compute routers list | grep jiwei-openshift-gtsth
>>gcloud compute firewall-rules list | grep jiwei-openshift-gtsth
To show all fields of the firewall, please show in JSON format: --format=json
To show all fields in table format, please see the examples in --help.
>>gcloud compute health-checks list | grep jiwei-openshift-gtsth
>>gcloud compute http-health-checks list | grep jiwei-openshift-gtsth
>>gcloud compute forwarding-rules list | grep jiwei-openshift-gtsth
>>gcloud compute addresses list | grep jiwei-openshift-gtsth
>>gcloud compute target-pools list | grep jiwei-openshift-gtsth
>>gcloud compute backend-services list | grep jiwei-openshift-gtsth
>>gcloud dns managed-zones list | grep jiwei-openshift-gtsth
>>gcloud dns record-sets list --zone qe | grep jiwei-openshift-gtsth
>>gcloud iam service-accounts list | grep jiwei-openshift-gtsth
>>gcloud compute images list | grep jiwei-openshift-gtsth
>>gsutil ls | grep jiwei-openshift-gtsth
>>gcloud deployment-manager deployments list | grep jiwei-openshift-gtsth
Mon Jun 6 14:14:32 CST 2022
$
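gcp_res_check.sh itself is not attached to this bug; judging from the echoed ">>" lines above, it is presumably a thin wrapper along these lines (a reconstruction under that assumption, not the actual QE script):

    #!/bin/bash
    # Grep every relevant GCP resource listing for the given infra ID,
    # echoing each command with a ">>" prefix before running it.
    infra_id="$1"
    check() {
      echo ">>$* | grep $infra_id"
      "$@" | grep "$infra_id"
    }
    check gcloud compute instances list
    check gcloud compute instance-groups list
    check gcloud compute disks list
    check gcloud compute networks list
    check gcloud compute networks subnets list
    check gcloud compute routers list
    check gcloud compute firewall-rules list
    check gcloud compute health-checks list
    check gcloud compute http-health-checks list
    check gcloud compute forwarding-rules list
    check gcloud compute addresses list
    check gcloud compute target-pools list
    check gcloud compute backend-services list
    check gcloud dns managed-zones list
    check gcloud dns record-sets list --zone qe
    check gcloud iam service-accounts list
    check gcloud compute images list
    check gsutil ls
    check gcloud deployment-manager deployments list
    date

Empty grep output for every listing means no cluster resources were left behind.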
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069