Bug 2035324 - Trying to uninstall CNV with `uninstallStrategy: RemoveWorkloads` and existing workloads leaves the system in a corrupted state
Summary: Trying to uninstall CNV with `uninstallStrategy: RemoveWorkloads` and existing workloads leaves the system in a corrupted state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: David Vossel
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-12-23 15:38 UTC by Simone Tiraboschi
Modified: 2022-03-16 16:05 UTC (History)
4 users

Fixed In Version: virt-operator-container-v4.10.0-203 hco-bundle-registry-container-v4.10.0-638
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-16 16:05:38 UTC
Target Upstream Version:
Embargoed:


Attachments
logs (51.96 KB, application/gzip)
2021-12-23 15:55 UTC, Simone Tiraboschi
Kubevirt components logs (39.14 KB, application/gzip)
2022-01-11 11:42 UTC, Simone Tiraboschi
kubevirt-must-gather (5.06 MB, application/gzip)
2022-01-11 12:16 UTC, Simone Tiraboschi


Links:
Red Hat Product Errata RHSA-2022:0947 - Closed - RHEL EUS Errata Documentation - Last Updated 2022-06-16 06:26:55 UTC

Description Simone Tiraboschi 2021-12-23 15:38:33 UTC
Description of problem:
Setting `uninstallStrategy: RemoveWorkloads` on the HCO CR, having it correctly propagated to the KubeVirt and CDI CRs,
and then trying to force a one-click product removal that cascade-deletes the existing workloads does not work as expected.
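For reference, a minimal way to set this up from the command line (a sketch; it assumes the HCO CR is named kubevirt-hyperconverged in the kubevirt-hyperconverged namespace, as in the outputs below):

$ oc patch hyperconvergeds.hco.kubevirt.io kubevirt-hyperconverged -n kubevirt-hyperconverged \
    --type merge -p '{"spec":{"uninstallStrategy":"RemoveWorkloads"}}'
# confirm the setting got propagated to the KubeVirt and CDI CRs
$ oc get kubevirt -A -o jsonpath='{.items[0].spec.uninstallStrategy}'
$ oc get cdi -A -o jsonpath='{.items[0].spec.uninstallStrategy}'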

It seems that kubevirt-operator deletes the VM CRD first and then fails to delete the VMI CRD because something is missing.

In the end it gets stuck with:

$ oc get kubevirt -A
NAMESPACE                 NAME                               AGE   PHASE
kubevirt-hyperconverged   kubevirt-kubevirt-hyperconverged   17m   Deleting

$ oc get vms -A
error: the server doesn't have a resource type "vms"

$ oc get vmis -A

NAMESPACE                 NAME                         AGE   PHASE     IP            NODENAME                                          READY
kubevirt-hyperconverged   fedora-minimal-grasshopper   18m   Running   10.129.2.26   ci-ln-t854jt2-1d09d-ckc59-worker-eastus22-xzcsq   False

$ oc get datavolumes -A
error: the server doesn't have a resource type "datavolumes"


Checking the CRDs:
$ oc get crd | grep kubevirt
cdis.cdi.kubevirt.io                                              2021-12-23T15:00:18Z
hostpathprovisioners.hostpathprovisioner.kubevirt.io              2021-12-23T15:00:19Z
hyperconvergeds.hco.kubevirt.io                                   2021-12-23T15:00:18Z
kubevirts.kubevirt.io                                             2021-12-23T15:00:19Z
networkaddonsconfigs.networkaddonsoperator.network.kubevirt.io    2021-12-23T15:00:17Z
nodemaintenances.nodemaintenance.kubevirt.io                      2021-12-23T15:00:19Z
ssps.ssp.kubevirt.io                                              2021-12-23T15:00:17Z
virtualmachineinstances.kubevirt.io                               2021-12-23T15:04:46Z

and in virt-operator logs:

E1223 15:18:04.371676       1 reflector.go:138] pkg/controller/virtinformers.go:291: Failed to watch *v1.VirtualMachineInstanceReplicaSet: failed to list *v1.VirtualMachineInstanceReplicaSet: the server could not find the requested resource (get virtualmachineinstancereplicasets.kubevirt.io)
E1223 15:18:07.501012       1 reflector.go:138] pkg/controller/virtinformers.go:291: Failed to watch *v1alpha1.VirtualMachineClusterFlavor: failed to list *v1alpha1.VirtualMachineClusterFlavor: the server could not find the requested resource (get virtualmachineclusterflavors.flavor.kubevirt.io)
E1223 15:18:08.060590       1 reflector.go:138] pkg/controller/virtinformers.go:291: Failed to watch *v1alpha1.VirtualMachineRestore: failed to list *v1alpha1.VirtualMachineRestore: the server could not find the requested resource (get virtualmachinerestores.snapshot.kubevirt.io)
E1223 15:18:09.158669       1 reflector.go:138] pkg/controller/virtinformers.go:291: Failed to watch *v1alpha1.VirtualMachineFlavor: failed to list *v1alpha1.VirtualMachineFlavor: the server could not find the requested resource (get virtualmachineflavors.flavor.kubevirt.io)
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance kubevirt-hyperconverged/fedora-minimal-grasshopper","pos":"vmi.go:257","reason":"patching of vmi conditions and activePods failed: Internal error occurred: failed calling webhook \"virtualmachineinstances-mutator.kubevirt.io\": failed to call webhook: Post \"https://virt-api.kubevirt-hyperconverged.svc:443/virtualmachineinstances-mutate?timeout=10s\": no endpoints available for service \"virt-api\"","timestamp":"2021-12-23T15:18:10.354341Z"}




Version-Release number of selected component (if applicable):


How reproducible:
1 out of 2 tests; there is probably a race condition here

Steps to Reproduce:
1. Deploy CNV and set `uninstallStrategy: RemoveWorkloads` on the HCO CR
2. Create and start a VM
3. Remove the HCO CR (see the command sketch below)
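A command-level sketch of steps 2 and 3 (my-vm.yaml is a placeholder for any VM manifest; the VM name reuses fedora-minimal-grasshopper from the outputs above):

$ oc apply -f my-vm.yaml -n kubevirt-hyperconverged
$ oc patch vm fedora-minimal-grasshopper -n kubevirt-hyperconverged \
    --type merge -p '{"spec":{"running":true}}'
$ oc wait vmi fedora-minimal-grasshopper -n kubevirt-hyperconverged --for=condition=Ready
$ oc delete hyperconvergeds.hco.kubevirt.io kubevirt-hyperconverged -n kubevirt-hyperconverged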

Actual results:
The KubeVirt CR is stuck in the Deleting phase,
the VM and DataVolume CRDs are gone,
but the VMI CRD and the VMI instances are still there

Expected results:
Everything is always cascade-deleted.

Additional info:

Comment 1 Simone Tiraboschi 2021-12-23 15:50:22 UTC
On the VMI instance I see:

  - lastProbeTime: "2021-12-23T15:15:20Z"
    lastTransitionTime: "2021-12-23T15:15:20Z"
    message: virt-launcher pod is terminating
    reason: PodTerminating
    status: "False"
    type: Ready


but I don't find any virt-launcher pod in my environment:

$ oc get pods -A | grep virt-launcher
$


I suspect that if a virt-launcher instance doesn't die quickly enough (note that in my test I had a single VM), something gets executed in the wrong order and gets stuck.

Comment 2 Simone Tiraboschi 2021-12-23 15:55:40 UTC
Created attachment 1847604 [details]
logs

Comment 3 David Vossel 2022-01-05 15:02:45 UTC
grasshopper","pos":"vmi.go:257","reason":"patching of vmi conditions and activePods failed: Internal error occurred: failed calling webhook \"virtualmachineinstances-mutator.kubevirt.io\": failed to call webhook: Post \"https://virt-api.kubevirt-hyperconverged.svc:443/virtualmachineinstances-mutate?timeout=10s\": no endpoints available for service \"virt-api\"","timestamp":"2021-12-23T15:18:10.354341Z"}

^^^ This indicates that updates to VMIs are not possible due to some issue with virt-api.

Because virt-api's webhooks are not responsive, virt-controller is not able to remove the finalizers on the VMIs, which is likely what is causing this problem.
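A quick way to confirm this state on the cluster (a sketch, reusing the names from this report):

$ oc get endpoints virt-api -n kubevirt-hyperconverged
# empty/missing endpoints -> the virt-api webhooks are unreachable
$ oc get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep kubevirt
$ oc get vmi fedora-minimal-grasshopper -n kubevirt-hyperconverged -o jsonpath='{.metadata.finalizers}'
# expect to still see the kubevirt finalizer here, so the VMI cannot go away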

We need logs of virt-api and ideally logs of all the kubevirt components if possible.

Comment 4 David Vossel 2022-01-07 18:25:33 UTC
Waiting on KubeVirt component logs to further diagnose this.

Comment 5 Simone Tiraboschi 2022-01-11 11:42:13 UTC
Created attachment 1850065 [details]
Kubevirt components logs

Comment 6 Simone Tiraboschi 2022-01-11 12:16:49 UTC
Created attachment 1850074 [details]
kubevirt-must-gather

Comment 7 David Vossel 2022-01-11 22:06:33 UTC
Thanks for the logs. I understand better what is occurring now. This issue exposes a flaw in our KubeVirt tear down logic. I'll explain.


=====================
--- What's happening.
=====================

To understand what's occurring I need to point out a few things about how the system works.

1. We register mutating/validating webhooks for our CRDs, and these webhooks are served by virt-api.
2. Update/Patch calls to our KubeVirt APIs fail when virt-api is unavailable, due to these webhook registrations (see the check below).
3. virt-api fails to initialize if not all CRDs are registered, because it cannot sync its informer caches.
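The registration behind point 2 can be inspected directly; a hedged check using the webhook name from the log above:

$ oc get mutatingwebhookconfigurations -o yaml \
    | grep -B4 -A10 'virtualmachineinstances-mutator.kubevirt.io'
# expect a clientConfig pointing at the virt-api service; while virt-api has
# "no endpoints available", every Update/Patch on VMIs is rejected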

What we have going on in this BZ is that during teardown virt-api re-initialized (most likely due to the CDI APIs being removed) and was unable to come back online because some of our CRDs were no longer registered. Since virt-api hosts the webhooks for all our CRDs, removing the finalizers from any remaining KubeVirt CRs in the cluster fails, because we can't call Patch/Update without the webhooks being online.

The result is a deadlock of sorts. We can't remove objects with finalizers because we can't call Update/Patch on those objects; we can't call Update/Patch because virt-api is down; and virt-api is down because some KubeVirt APIs have already been removed, which prevents its informer cache from syncing.

virt-controller is subject to the same issue: if some CRDs are no longer registered, virt-controller also fails to initialize.

===================================
--- Logs showing sequence of events
===================================

virt-api re-initializes because the DataSource API from CDI has been removed.

2022-01-11T11:15:52.647206472Z {"component":"virt-api","level":"info","msg":"Reinitialize virt-api, cdi DataSource api has been removed","pos":"api.go:985","timestamp":"2022-01-11T11:15:52.647130Z"}

virt-api is stuck trying to come back online but can't sync its informers:

2022-01-11T11:53:19.372804960Z E0111 11:53:19.372752       1 reflector.go:138] pkg/controller/virtinformers.go:305: Failed to watch *v1alpha1.VirtualMachineRestore: failed to list *v1alpha1.VirtualMachineRestore: the server could not find the requested resource (get virtualmachinerestores.snapshot.kubevirt.io)
2022-01-11T11:53:34.820951778Z E0111 11:53:34.820897       1 reflector.go:138] pkg/controller/virtinformers.go:305: Failed to watch *v1.VirtualMachineInstancePreset: failed to list *v1.VirtualMachineInstancePreset: the server could not find the requested resource (get virtualmachineinstancepresets.kubevirt.io)

virt-controller can't patch the VMI object because virt-api isn't initialized:

2022-01-11T11:53:52.792661238Z {"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance kubevirt-hyperconverged/fedora-responsible-boar","pos":"vmi.go:272","reason":"patching of vmi conditions and activePods failed: Internal error occurred: failed calling webhook \"virtualmachineinstances-mutator.kubevirt.io\": failed to call webhook: Post \"https://virt-api.kubevirt-hyperconverged.svc:443/virtualmachineinstances-mutate?timeout=10s\": no endpoints available for service \"virt-api\"","timestamp":"2022-01-11T11:53:52.792516Z"}

==================
--- How we fix it.
==================

We need to change the deletion order from this...

- Delete all CRDs
- Delete Webhook registrations

to something like this...

- Block all creation of KubeVirt API objects by registering a temporary validating webhook that points to a non-existent endpoint.
- Delete all CRs across the entire cluster for every CRD registered in the install strategy.
- Delete all CRDs.
- Delete all registered webhooks (including the temporary one used to block new creations).

The theory behind this method is that we freeze the ability to create new objects during teardown, which allows us to do a controlled deletion of the KubeVirt CRs before un-registering the CRDs and webhooks. By destroying the CRDs last, we ensure that our control plane remains operational to remove all the CRs gracefully.
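Purely as an illustration of the blocking technique (not the actual virt-operator implementation; all names below are hypothetical), such a temporary webhook could look like:

$ cat <<'EOF' | oc apply -f -
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kubevirt-teardown-block          # hypothetical name
webhooks:
- name: teardown-block.kubevirt.io       # hypothetical name
  failurePolicy: Fail                    # reject the request if the webhook can't be reached
  sideEffects: None
  admissionReviewVersions: ["v1"]
  rules:
  - apiGroups: ["kubevirt.io"]
    apiVersions: ["*"]
    operations: ["CREATE"]
    resources: ["*"]
  clientConfig:
    service:
      name: no-such-service              # intentionally points at nothing
      namespace: kubevirt-hyperconverged
      path: /block
EOF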


===============
--- References
===============

The technique of using a dummy validating webhook to block an API isn't new. We already do it today when new APIs are introduced during an update: we block usage of the new API until all the control plane components come online to process it.

See createDummyWebhookValidator() function in pkg/virt-operator/resource/apply/reconcile.go.

We'd simply do something similar during teardown to prevent new creations while we tear down all CRs.

Comment 8 David Vossel 2022-01-12 18:43:30 UTC
PR posted: https://github.com/kubevirt/kubevirt/pull/7064 

I took a slightly different approach to the fix than what I outlined in comment 7. Instead of using a webhook to block creations, I'm using a finalizer on the CRDs. This accomplishes the same task: when a CRD is marked for deletion while it carries a finalizer, the CRD can't disappear and no new CRs of that type can be created.
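The mechanics can be reproduced with plain oc commands; a hedged illustration using one of the CRDs from this report (the finalizer name here is made up, not the one the PR adds):

$ oc patch crd virtualmachineinstances.kubevirt.io --type=json \
    -p '[{"op":"add","path":"/metadata/finalizers","value":["kubevirt.io/example-block-deletion"]}]'
$ oc delete crd virtualmachineinstances.kubevirt.io --wait=false
# the CRD is now terminating: existing VMIs can still be updated/deleted
# (so their finalizers can be removed), but creating new VMIs is rejected
$ oc get crd virtualmachineinstances.kubevirt.io -o jsonpath='{.metadata.deletionTimestamp}'
# once every VMI is gone, dropping the finalizer lets the CRD disappear
$ oc patch crd virtualmachineinstances.kubevirt.io --type=json \
    -p '[{"op":"remove","path":"/metadata/finalizers"}]'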


So, with the new change, I changed the deletion order from this...

- Delete all CRDs
- Delete Webhook registrations

to this...

- Add finalizers to all CRDs
- Delete all CRDs (the finalizer prevents them from being completely removed)
- Remove all CRD finalizers after waiting for every CRD to report that its CRs are removed
- Delete webhook registrations

Comment 9 Kedar Bidarkar 2022-02-05 14:00:55 UTC
Setting `uninstallStrategy: RemoveWorkloads` on the HCO CR and then forcing a one-click product removal that cascade-deletes the existing workloads now works as expected.

1. Deployed CNV and set `uninstallStrategy: RemoveWorkloads` on the HCO CR.
2. Created and started a VM; the VM was in the Running state.
3. Removed the HCO CR and the openshift-cnv namespace successfully in all 4 attempts (see the checks below).
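The checks to confirm a clean removal could look like this (a sketch; the exact commands used were not recorded in this comment):

$ oc get kubevirt -A
$ oc get vmi -A
$ oc get ns openshift-cnv
# expected: no KubeVirt CR stuck in Deleting, no leftover VMI resources, and the namespace gone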

This was tested 4 times and the issue was no longer seen.

As this was caused by a race condition and is hard to reproduce, please feel free to reopen this issue if it is seen again.

VERIFIED with hco-bundle-registry-container-v4.10.0-648

Comment 14 errata-xmlrpc 2022-03-16 16:05:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947

