Version: 4.10.0-0.nightly-2021-11-04-001635
Platform: vsphere ipi

What happened?

Destroying an IPI cluster by running "openshift-install destroy cluster" failed with the error below:

11-08 13:48:36.108 level=info msg=Destroyed Tag=jimaqeci-29061b-9kw7l
11-08 13:48:36.108 level=debug msg=Delete tag category
11-08 13:49:32.371 level=error msg=get category urn:vmomi:InventoryServiceCategory:89f52fc4-c238-439e-8e5c-237dd8c84931:GLOBAL: GET https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/tagging/category/id:urn:vmomi:InventoryServiceCategory:89f52fc4-c238-439e-8e5c-237dd8c84931:GLOBAL: 404 Not Found TagCategory=openshift-jimaqeci-29061b-9kw7l
11-08 13:49:32.371 level=fatal msg=Failed to destroy cluster: get category urn:vmomi:InventoryServiceCategory:89f52fc4-c238-439e-8e5c-237dd8c84931:GLOBAL: GET https://vcenter.sddc-44-236-21-251.vmwarevmc.com/rest/com/vmware/cis/tagging/category/id:urn:vmomi:InventoryServiceCategory:89f52fc4-c238-439e-8e5c-237dd8c84931:GLOBAL: 404 Not Found

The tag category "openshift-jimaqeci-29061b-9kw7l" does exist on VMC, but its ID is eec66951-b773-4c79-b070-7a2bd0f00232, which is not the ID reported in the error log. It looks as if "openshift-install destroy cluster" is using the wrong ID to look up the tag category.

$ govc tags.category.ls -json | grep -b2 openshift-jimaqeci-29061b-9kw7l
289974-  {
289978-    "id": "urn:vmomi:InventoryServiceCategory:eec66951-b773-4c79-b070-7a2bd0f00232:GLOBAL",
290070:    "name": "openshift-jimaqeci-29061b-9kw7l",
290117-    "description": "Added by openshift-install do not remove",
290180-    "cardinality": "SINGLE",

In the terraform.pre-bootstrap.tfstate file:

    {
      "mode": "managed",
      "type": "vsphere_tag_category",
      "name": "category",
      "provider": "provider.vsphere",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "associable_types": [
              "Datastore",
              "Folder",
              "ResourcePool",
              "StoragePod",
              "VirtualMachine"
            ],
            "cardinality": "SINGLE",
            "description": "Added by openshift-install do not remove",
            "id": "urn:vmomi:InventoryServiceCategory:eec66951-b773-4c79-b070-7a2bd0f00232:GLOBAL",
            "name": "openshift-jimaqeci-29061b-9kw7l"
          },
          "private": "bnVsbA=="
        }
      ]
    },

What did you expect to happen?

"openshift-install destroy cluster" finishes without error.

How to reproduce it (as minimally and precisely as possible)?

Not always reproducible; similar failures can also be found in CI jobs:
https://search.ci.openshift.org/?search=404+Not+Found+TagCategory%3D&maxAge=336h&context=0&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Anything else we need to know?

Attached files under install-dir.
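For reference, the comparison above (tfstate ID vs. the ID in the destroy error) can be scripted. Below is a small standalone Go helper, purely illustrative and not part of the installer, that reads the vsphere_tag_category resource out of terraform.pre-bootstrap.tfstate and prints its ID and name; the struct only covers the fields shown in the snippet above.

// Illustrative helper (not shipped with the installer): print the ID and name
// of every vsphere_tag_category resource in a Terraform state file, so it can
// be compared against the category ID reported in the destroy error.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

type tfState struct {
	Resources []struct {
		Type      string `json:"type"`
		Instances []struct {
			Attributes struct {
				ID   string `json:"id"`
				Name string `json:"name"`
			} `json:"attributes"`
		} `json:"instances"`
	} `json:"resources"`
}

func main() {
	data, err := os.ReadFile("terraform.pre-bootstrap.tfstate")
	if err != nil {
		log.Fatal(err)
	}
	var state tfState
	if err := json.Unmarshal(data, &state); err != nil {
		log.Fatal(err)
	}
	for _, r := range state.Resources {
		if r.Type != "vsphere_tag_category" {
			continue
		}
		for _, inst := range r.Instances {
			fmt.Printf("%s  %s\n", inst.Attributes.ID, inst.Attributes.Name)
		}
	}
}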
As far as I can tell, the destroyer is doing the correct thing. It appears that the call to GetCategory is using the correct category name but returning the incorrect ID. One note here is that the vSphere destroyer is incorrectly exiting when it encounters an error. Instead, the destroyer should continue attempting to destroy resources until the user cancels the destroy.
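To illustrate the continue-on-error behavior, here is a minimal Go sketch: instead of exiting on the first failure, the loop logs the error, keeps retrying the failed stages, and only stops when the context is cancelled. The stage type and the runDestroy helper are hypothetical, not the installer's actual code.

// Illustrative only: a destroy loop that keeps retrying failed stages
// instead of aborting on the first error.
package main

import (
	"context"
	"log"
	"time"
)

type stage struct {
	name string
	run  func(context.Context) error
}

// runDestroy retries every failed stage until all succeed or ctx is cancelled.
func runDestroy(ctx context.Context, stages []stage) error {
	remaining := stages
	for len(remaining) > 0 {
		var failed []stage
		for _, s := range remaining {
			if err := s.run(ctx); err != nil {
				// Log and retry later instead of exiting the destroy.
				log.Printf("stage %q failed, will retry: %v", s.name, err)
				failed = append(failed, s)
			}
		}
		if len(failed) == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			// The user cancelled the destroy (for example with Ctrl+C).
			return ctx.Err()
		case <-time.After(10 * time.Second):
		}
		remaining = failed
	}
	return nil
}

func main() {
	stages := []stage{
		{name: "delete tag", run: func(ctx context.Context) error { return nil }},
		{name: "delete tag category", run: func(ctx context.Context) error { return nil }},
	}
	if err := runDestroy(context.Background(), stages); err != nil {
		log.Fatal(err)
	}
}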
The ID mismatch in this case is misleading and is not the cause of the problem. When the destroyer looks up a category by name, govmomi issues a GET request for every existing category ID [1], and one of those requests is returning the 404. But the way that error is reported [2] makes it look as if the `openshift-*` category itself was the one that failed.

[1] https://github.com/vmware/govmomi/blob/master/vapi/tags/categories.go#L151-L160
[2] https://github.com/openshift/installer/blob/master/pkg/destroy/vsphere/vsphere.go#L277-L279
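For comparison, a name-based lookup that tolerates this race would skip categories that disappear between listing the IDs and fetching each one. Below is a minimal govmomi-based sketch, not the installer's actual fix; it assumes an already logged-in rest.Client, the function name findCategoryByName is hypothetical, and the string-based 404 check is a simplification of however the error is actually detected.

// A minimal sketch of a category-by-name lookup that skips categories
// deleted out from under it, e.g. by a concurrent destroy.
package tagsutil

import (
	"context"
	"fmt"
	"strings"

	"github.com/vmware/govmomi/vapi/rest"
	"github.com/vmware/govmomi/vapi/tags"
)

func findCategoryByName(ctx context.Context, c *rest.Client, name string) (*tags.Category, error) {
	m := tags.NewManager(c)

	ids, err := m.ListCategories(ctx)
	if err != nil {
		return nil, err
	}

	for _, id := range ids {
		cat, err := m.GetCategory(ctx, id)
		if err != nil {
			// Another destroy may have deleted this category after it was
			// listed; skip it rather than failing the whole lookup.
			if strings.Contains(err.Error(), "404") {
				continue
			}
			return nil, err
		}
		if cat.Name == name {
			return cat, nil
		}
	}
	return nil, fmt.Errorf("category %q not found", name)
}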
The issue happens frequently on the QE side when two or more clusters are destroyed at the same time.

Verified on 4.11.0-0.ci-2022-02-28-224450 and passed, moving bug to VERIFIED.

1. Install two clusters (A and B).
2. Destroy cluster A by running ./openshift-install destroy cluster --dir ...
3. When the installer has found all objects attached to the related tag on cluster A, start destroying cluster B.
4. While the installer deletes the tag on cluster A, the destroy process for cluster B reaches "Find attached objects on tag". After the tag on cluster A is deleted, the destroy process for cluster B finds the objects attached to its tag, continues to delete resources, and no longer throws the "404 Not Found" error.

Destroy log for cluster A:

03-01 13:44:40.130 level=debug msg=OpenShift Installer 4.11.0-0.ci-2022-02-28-224450
03-01 13:44:40.130 level=debug msg=Built from commit 5171f6b9ad5def883839990054f5068278232dd5
03-01 13:44:41.090 level=debug msg=Find attached objects on tag
03-01 13:46:02.645 level=debug msg=No VirtualMachines found
03-01 13:46:02.645 level=debug msg=No managed Folder found
03-01 13:46:02.645 level=debug msg=Delete tag
03-01 13:47:10.442 level=info msg=Destroyed Tag=jima0301bug01-559fk
03-01 13:47:10.442 level=debug msg=Delete tag category
03-01 13:49:31.937 level=info msg=Destroyed TagCategory=openshift-jima0301bug01-559fk
03-01 13:49:31.937 level=debug msg=Purging asset "Metadata" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Master Ignition Customization Check" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Worker Ignition Customization Check" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Terraform Variables" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Kubeconfig Admin Client" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Kubeadmin Password" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Certificate (journal-gatewayd)" from disk
03-01 13:49:31.937 level=debug msg=Purging asset "Cluster" from disk
03-01 13:49:31.937 level=info msg=Time elapsed: 4m39s

Destroy log for cluster B:

03-01 13:46:21.866 level=debug msg=OpenShift Installer 4.11.0-0.ci-2022-02-28-224450
03-01 13:46:21.866 level=debug msg=Built from commit 5171f6b9ad5def883839990054f5068278232dd5
03-01 13:46:22.790 level=debug msg=Find attached objects on tag
03-01 13:47:44.222 level=debug msg=Find VirtualMachine objects
03-01 13:47:44.222 level=debug msg=Delete VirtualMachines
03-01 13:47:44.222 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-rhcos
03-01 13:47:44.222 level=debug msg=Powered off VirtualMachine=jima0301bug02-xsnj6-master-0
03-01 13:47:44.222 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-master-0
03-01 13:47:44.222 level=debug msg=Powered off VirtualMachine=jima0301bug02-xsnj6-master-2
03-01 13:47:44.222 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-master-2
03-01 13:47:44.222 level=debug msg=Powered off VirtualMachine=jima0301bug02-xsnj6-master-1
03-01 13:47:44.222 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-master-1
03-01 13:47:44.781 level=debug msg=Powered off VirtualMachine=jima0301bug02-xsnj6-worker-rf4m9
03-01 13:47:45.340 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-worker-rf4m9
03-01 13:47:46.700 level=debug msg=Powered off VirtualMachine=jima0301bug02-xsnj6-worker-wknnv
03-01 13:47:47.270 level=info msg=Destroyed VirtualMachine=jima0301bug02-xsnj6-worker-wknnv
03-01 13:47:47.270 level=debug msg=Find Folder objects
03-01 13:47:47.270 level=debug msg=Delete Folder
03-01 13:47:47.836 level=info msg=Destroyed Folder=jima0301bug02-xsnj6
03-01 13:47:50.345 level=info msg=Destroyed StoragePolicy=openshift-storage-policy-jima0301bug02-xsnj6
03-01 13:47:50.345 level=debug msg=Delete tag
03-01 13:49:11.743 level=info msg=Destroyed Tag=jima0301bug02-xsnj6
03-01 13:49:11.743 level=debug msg=Delete tag category
03-01 13:51:18.147 level=info msg=Destroyed TagCategory=openshift-jima0301bug02-xsnj6
03-01 13:51:18.147 level=debug msg=Purging asset "Metadata" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Master Ignition Customization Check" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Worker Ignition Customization Check" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Terraform Variables" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Kubeconfig Admin Client" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Kubeadmin Password" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Certificate (journal-gatewayd)" from disk
03-01 13:51:18.147 level=debug msg=Purging asset "Cluster" from disk
03-01 13:51:18.147 level=info msg=Time elapsed: 4m54s
Is this planned to be backported to previous releases? The IPI cluster destroy job fails frequently in QE CI because of this issue and leaves resources behind on VMC.
(In reply to jima from comment #7)
> Is this planned to be backported to previous releases? The IPI cluster
> destroy job fails frequently in QE CI because of this issue and leaves
> resources behind on VMC.

Yes. Thanks for the reminder. This should be backported.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069