Version: 4.10.0-0.nightly-2021-11-04-001635
Platform: vsphere ipi

What happened?

Destroying an IPI cluster by running "openshift-install destroy cluster" failed with the error below:

11-08 13:48:36.108 level=info msg=Destroyed Tag=jimaqeci-29061b-9kw7l
11-08 13:48:36.108 level=debug msg=Delete tag category
11-08 13:49:32.371 level=error msg=get category urn:vmomi:InventoryServiceCategory:89f52fc4-c238-439e-8e5c-237dd8c84931:GLOBAL: GET https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/tagging/category/id:urn:vmomi:InventoryServiceCategory:89f52fc4-c238-439e-8e5c-237dd8c84931:GLOBAL: 404 Not Found TagCategory=openshift-jimaqeci-29061b-9kw7l
11-08 13:49:32.371 level=fatal msg=Failed to destroy cluster: get category urn:vmomi:InventoryServiceCategory:89f52fc4-c238-439e-8e5c-237dd8c84931:GLOBAL: GET https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/tagging/category/id:urn:vmomi:InventoryServiceCategory:89f52fc4-c238-439e-8e5c-237dd8c84931:GLOBAL: 404 Not Found

The tag category "openshift-jimaqeci-29061b-9kw7l" does exist on VMC, but its ID is eec66951-b773-4c79-b070-7a2bd0f00232, which is not the ID reported in the error log. It looks as if "openshift-install destroy cluster" is using the wrong ID to look up the tag category.

$ govc tags.category.ls -json | grep -b2 openshift-jimaqeci-29061b-9kw7l
289974-  {
289978-    "id": "urn:vmomi:InventoryServiceCategory:eec66951-b773-4c79-b070-7a2bd0f00232:GLOBAL",
290070:    "name": "openshift-jimaqeci-29061b-9kw7l",
290117-    "description": "Added by openshift-install do not remove",
290180-    "cardinality": "SINGLE",

In the terraform.pre-bootstrap.tfstate file:

    {
      "mode": "managed",
      "type": "vsphere_tag_category",
      "name": "category",
      "provider": "provider.vsphere",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "associable_types": [
              "Datastore",
              "Folder",
              "ResourcePool",
              "StoragePod",
              "VirtualMachine"
            ],
            "cardinality": "SINGLE",
            "description": "Added by openshift-install do not remove",
            "id": "urn:vmomi:InventoryServiceCategory:eec66951-b773-4c79-b070-7a2bd0f00232:GLOBAL",
            "name": "openshift-jimaqeci-29061b-9kw7l"
          },
          "private": "bnVsbA=="
        }
      ]
    },

What did you expect to happen?

"openshift-install destroy cluster" finishes without error.

How to reproduce it (as minimally and precisely as possible)?

Not always reproducible; similar failures can also be found in CI jobs:
https://search.ci.openshift.org/?search=404+Not+Found+TagCategory%3D&maxAge=336h&context=0&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Anything else we need to know?

Attached files under install-dir.
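For reference, the comparison above (tfstate ID vs. the ID in the destroy error) can be scripted. Below is a small standalone Go helper, purely illustrative and not part of the installer, that reads the vsphere_tag_category resource out of terraform.pre-bootstrap.tfstate and prints its ID and name; the struct only covers the fields shown in the snippet above.

// Illustrative helper (not shipped with the installer): print the ID and name
// of every vsphere_tag_category resource in a Terraform state file, so it can
// be compared against the category ID reported in the destroy error.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

type tfState struct {
	Resources []struct {
		Type      string `json:"type"`
		Instances []struct {
			Attributes struct {
				ID   string `json:"id"`
				Name string `json:"name"`
			} `json:"attributes"`
		} `json:"instances"`
	} `json:"resources"`
}

func main() {
	data, err := os.ReadFile("terraform.pre-bootstrap.tfstate")
	if err != nil {
		log.Fatal(err)
	}
	var state tfState
	if err := json.Unmarshal(data, &state); err != nil {
		log.Fatal(err)
	}
	for _, r := range state.Resources {
		if r.Type != "vsphere_tag_category" {
			continue
		}
		for _, inst := range r.Instances {
			fmt.Printf("%s  %s\n", inst.Attributes.ID, inst.Attributes.Name)
		}
	}
}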
As far as I can tell, the destroyer is doing the correct thing. It appears that the call to GetCategory is using the correct category name but returning the incorrect ID. One note here is that the vSphere destroyer is incorrectly exiting when it encounters an error. Instead, the destroyer should continue attempting to destroy resources until the user cancels the destroy.
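To illustrate the continue-on-error behavior, here is a minimal Go sketch: instead of exiting on the first failure, the loop logs the error, keeps retrying the failed stages, and only stops when the context is cancelled. The stage type and the runDestroy helper are hypothetical, not the installer's actual code.

// Illustrative only: a destroy loop that keeps retrying failed stages
// instead of aborting on the first error.
package main

import (
	"context"
	"log"
	"time"
)

type stage struct {
	name string
	run  func(context.Context) error
}

// runDestroy retries every failed stage until all succeed or ctx is cancelled.
func runDestroy(ctx context.Context, stages []stage) error {
	remaining := stages
	for len(remaining) > 0 {
		var failed []stage
		for _, s := range remaining {
			if err := s.run(ctx); err != nil {
				// Log and retry later instead of exiting the destroy.
				log.Printf("stage %q failed, will retry: %v", s.name, err)
				failed = append(failed, s)
			}
		}
		if len(failed) == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			// The user cancelled the destroy (for example with Ctrl+C).
			return ctx.Err()
		case <-time.After(10 * time.Second):
		}
		remaining = failed
	}
	return nil
}

func main() {
	stages := []stage{
		{name: "delete tag", run: func(ctx context.Context) error { return nil }},
		{name: "delete tag category", run: func(ctx context.Context) error { return nil }},
	}
	if err := runDestroy(context.Background(), stages); err != nil {
		log.Fatal(err)
	}
}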
The ID mismatch in this case is misleading and is not the cause of the problem. When the destroyer looks up a category by name, govmomi issues a GET request for every existing category ID [1], and one of those requests is returning the 404. But the way that error is reported [2] makes it look as if the `openshift-*` category itself was the one that failed.

[1] https://github.com/vmware/govmomi/blob/master/vapi/tags/categories.go#L151-L160
[2] https://github.com/openshift/installer/blob/master/pkg/destroy/vsphere/vsphere.go#L277-L279
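For comparison, a name-based lookup that tolerates this race would skip categories that disappear between listing the IDs and fetching each one. Below is a minimal govmomi-based sketch, not the installer's actual fix; it assumes an already logged-in rest.Client, the function name findCategoryByName is hypothetical, and the string-based 404 check is a simplification of however the error is actually detected.

// A minimal sketch of a category-by-name lookup that skips categories
// deleted out from under it, e.g. by a concurrent destroy.
package tagsutil

import (
	"context"
	"fmt"
	"strings"

	"github.com/vmware/govmomi/vapi/rest"
	"github.com/vmware/govmomi/vapi/tags"
)

func findCategoryByName(ctx context.Context, c *rest.Client, name string) (*tags.Category, error) {
	m := tags.NewManager(c)

	ids, err := m.ListCategories(ctx)
	if err != nil {
		return nil, err
	}

	for _, id := range ids {
		cat, err := m.GetCategory(ctx, id)
		if err != nil {
			// Another destroy may have deleted this category after it was
			// listed; skip it rather than failing the whole lookup.
			if strings.Contains(err.Error(), "404") {
				continue
			}
			return nil, err
		}
		if cat.Name == name {
			return cat, nil
		}
	}
	return nil, fmt.Errorf("category %q not found", name)
}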
The issue happens frequently on the QE side when two or more clusters are destroyed at the same time.

Verified on 4.11.0-0.ci-2022-02-28-224450 and passed, moving bug to VERIFIED.

1. Install two clusters (A and B).
2. Destroy cluster A by running ./openshift-install destroy cluster --dir ...
3. When the installer has found all objects attached to the related tag on cluster A, start destroying cluster B.
4. While the installer deletes the tag on cluster A, the destroy process for cluster B reaches "Find attached objects on tag". After the tag on cluster A is deleted, the destroy process for cluster B finds the objects attached to its tag, continues to delete resources, and no longer throws the "404 Not Found" error.

Destroy log for cluster A:

03-01 13:44:40.130 level=debug msg=OpenShift Installer 4.11.0-0.ci-2022-02-28-224450
03-01 13:44:40.130 level=debug msg=Built from commit 5171f6b9ad5def883839990054f5068278232dd5
03-01 13:44:41.090 level=debug msg=Find attached objects on tag
03-01 13:46:02.645 level=debug msg=No VirtualMachines found
03-01 13:46:02.645 level=debug msg=No managed Folder found
03-01 13:46:02.645 level=debug msg=Delete tag
03-01 13:47:10.442 level=info msg=Destroyed Tag=jima0301bug01-559fk
03-01 13:47:10.442 level=debug msg=Delete tag category
03-01 13:49:31.937 level=info msg=Destroyed TagCategory=openshift-jima0301bug01-559fk
03-01 13:49:31.937 level=debug msg=Purging asset "Metadata" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Master Ignition Customization Check" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Worker Ignition Customization Check" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Terraform Variables" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Kubeconfig Admin Client" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Kubeadmin Password" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Certificate (journal-gatewayd)" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Cluster" from disk
03-01 13:49:31.937 level=info msg=Time elapsed: 4m39s

Destroy log for cluster B:

03-01 13:46:21.866 level=debug msg=OpenShift Installer 4.11.0-0.ci-2022-02-28-224450
03-01 13:46:21.866 level=debug msg=Built from commit 5171f6b9ad5def883839990054f5068278232dd5
03-01 13:46:22.790 level=debug msg=Find attached objects on tag
03-01 13:47:44.222 level=debug msg=Find VirtualMachine objects
03-01 13:47:44.222 level=debug msg=Delete VirtualMachines
03-01 13:47:44.222 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-rhcos
03-01 13:47:44.222 level=debug msg=Powered off VirtualMachine=jima0301bug02-xsnj6-master-0
03-01 13:47:44.222 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-master-0
03-01 13:47:44.222 level=debug msg=Powered off VirtualMachine=jima0301bug02-xsnj6-master-2
03-01 13:47:44.222 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-master-2
03-01 13:47:44.222 level=debug msg=Powered off VirtualMachine=jima0301bug02-xsnj6-master-1
03-01 13:47:44.222 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-master-1
03-01 13:47:44.781 level=debug msg=Powered off VirtualMachine=jima0301bug02-xsnj6-worker-rf4m9
03-01 13:47:45.340 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-worker-rf4m9
03-01 13:47:46.700 level=debug msg=Powered off VirtualMachine=jima0301bug02-xsnj6-worker-wknnv
03-01 13:47:47.270 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-worker-wknnv
03-01 13:47:47.270 level=debug msg=Find Folder objects
03-01 13:47:47.270 level=debug msg=Delete Folder
03-01 13:47:47.836 level=info msg=Destroyed Folder=jima0301bug02-xsnj6
03-01 13:47:50.345 level=info msg=Destroyed StoragePolicy=openshift-storage-policy-jima0301bug02-xsnj6
03-01 13:47:50.345 level=debug msg=Delete tag
03-01 13:49:11.743 level=info msg=Destroyed Tag=jima0301bug02-xsnj6
03-01 13:49:11.743 level=debug msg=Delete tag category
03-01 13:51:18.147 level=info msg=Destroyed TagCategory=openshift-jima0301bug02-xsnj6
03-01 13:51:18.147 level=debug msg=Purging asset "Metadata" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Master Ignition Customization Check" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Worker Ignition Customization Check" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Terraform Variables" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Kubeconfig Admin Client" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Kubeadmin Password" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Certificate (journal-gatewayd)" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Cluster" from disk
03-01 13:51:18.147 level=info msg=Time elapsed: 4m54s
Is this planned to be backported to previous releases? The IPI cluster destroy job fails frequently in QE CI because of this issue and leaves resources behind on VMC.
(In reply to jima from comment #7)
> Is this planned to be backported to previous releases? The IPI cluster
> destroy job fails frequently in QE CI because of this issue and leaves
> resources behind on VMC.

Yes. Thanks for the reminder. This should be backported.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069