Description of problem:

Setting `uninstallStrategy: RemoveWorkloads` on the HCO CR (and having it correctly propagated to the KubeVirt and CDI CRs), and then trying to force a one-click product removal that cascade-deletes existing workloads, is not working as expected. It seems that kubevirt-operator deletes the VM CRD first and then fails to delete the VMI one because something is missing. In the end it gets stuck with:

$ oc get kubevirt -A
NAMESPACE                 NAME                               AGE   PHASE
kubevirt-hyperconverged   kubevirt-kubevirt-hyperconverged   17m   Deleting

$ oc get vms -A
error: the server doesn't have a resource type "vms"

$ oc get vmis -A
NAMESPACE                 NAME                         AGE   PHASE     IP            NODENAME                                          READY
kubevirt-hyperconverged   fedora-minimal-grasshopper   18m   Running   10.129.2.26   ci-ln-t854jt2-1d09d-ckc59-worker-eastus22-xzcsq   False

$ oc get datavolumes -A
error: the server doesn't have a resource type "datavolumes"

Checking the CRDs:

$ oc get crd | grep kubevirt
cdis.cdi.kubevirt.io                                             2021-12-23T15:00:18Z
hostpathprovisioners.hostpathprovisioner.kubevirt.io             2021-12-23T15:00:19Z
hyperconvergeds.hco.kubevirt.io                                  2021-12-23T15:00:18Z
kubevirts.kubevirt.io                                            2021-12-23T15:00:19Z
networkaddonsconfigs.networkaddonsoperator.network.kubevirt.io   2021-12-23T15:00:17Z
nodemaintenances.nodemaintenance.kubevirt.io                     2021-12-23T15:00:19Z
ssps.ssp.kubevirt.io                                             2021-12-23T15:00:17Z
virtualmachineinstances.kubevirt.io                              2021-12-23T15:04:46Z

and in virt-operator logs:

E1223 15:18:04.371676       1 reflector.go:138] pkg/controller/virtinformers.go:291: Failed to watch *v1.VirtualMachineInstanceReplicaSet: failed to list *v1.VirtualMachineInstanceReplicaSet: the server could not find the requested resource (get virtualmachineinstancereplicasets.kubevirt.io)
E1223 15:18:07.501012       1 reflector.go:138] pkg/controller/virtinformers.go:291: Failed to watch *v1alpha1.VirtualMachineClusterFlavor: failed to list *v1alpha1.VirtualMachineClusterFlavor: the server could not find the requested resource (get virtualmachineclusterflavors.flavor.kubevirt.io)
E1223 15:18:08.060590       1 reflector.go:138] pkg/controller/virtinformers.go:291: Failed to watch *v1alpha1.VirtualMachineRestore: failed to list *v1alpha1.VirtualMachineRestore: the server could not find the requested resource (get virtualmachinerestores.snapshot.kubevirt.io)
E1223 15:18:09.158669       1 reflector.go:138] pkg/controller/virtinformers.go:291: Failed to watch *v1alpha1.VirtualMachineFlavor: failed to list *v1alpha1.VirtualMachineFlavor: the server could not find the requested resource (get virtualmachineflavors.flavor.kubevirt.io)
{"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance kubevirt-hyperconverged/fedora-minimal-grasshopper","pos":"vmi.go:257","reason":"patching of vmi conditions and activePods failed: Internal error occurred: failed calling webhook \"virtualmachineinstances-mutator.kubevirt.io\": failed to call webhook: Post \"https://virt-api.kubevirt-hyperconverged.svc:443/virtualmachineinstances-mutate?timeout=10s\": no endpoints available for service \"virt-api\"","timestamp":"2021-12-23T15:18:10.354341Z"}

Version-Release number of selected component (if applicable):

How reproducible:
1 out of 2 tests; there is probably a race condition here.

Steps to Reproduce:
1. Deploy CNV and set `uninstallStrategy: RemoveWorkloads` on the HCO CR
2. Create and start a VM
3. Remove the HCO CR

Actual results:
The KubeVirt CR is stuck in Deleting, the VM CRD is gone, the DataVolume CRD is also gone, and the VMI CRD and VMI instances are still there.

Expected results:
Everything is always cascade-deleted.

Additional info:
On the VMI instance I see:

- lastProbeTime: "2021-12-23T15:15:20Z"
  lastTransitionTime: "2021-12-23T15:15:20Z"
  message: virt-launcher pod is terminating
  reason: PodTerminating
  status: "False"
  type: Ready

but I don't find any virt-launcher pod in my environment:

$ oc get pods -A | grep virt-launcher
$

I suspect that if a virt-launcher instance doesn't die quickly enough (note that in my test I had a single VM), something gets executed in the wrong order and the deletion becomes stuck.
Created attachment 1847604 [details] logs
grasshopper","pos":"vmi.go:257","reason":"patching of vmi conditions and activePods failed: Internal error occurred: failed calling webhook \"virtualmachineinstances-mutator.kubevirt.io\": failed to call webhook: Post \"https://virt-api.kubevirt-hyperconverged.svc:443/virtualmachineinstances-mutate?timeout=10s\": no endpoints available for service \"virt-api\"","timestamp":"2021-12-23T15:18:10.354341Z"}

^^^ This indicates that updates to VMIs are not possible due to some issue with virt-api. Because virt-api's webhooks are not responsive, virt-controller is not able to remove the finalizer on the VMIs, which is likely what is causing this problem.

We need logs of virt-api and ideally logs of all the KubeVirt components if possible.
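The suspected failure mode — finalizer removal requires an Update/Patch call, and every Update/Patch on a KubeVirt CR must pass through a mutating webhook served by virt-api — can be sketched as a toy model. All class and method names below are illustrative, not real KubeVirt or client-go code:

```python
# Toy model of the suspected deadlock: while the admission webhook backend
# (virt-api) is down, no Update/Patch can succeed, so finalizers cannot be
# cleared and the object can never be garbage-collected.

class WebhookUnavailable(Exception):
    pass

class ApiServer:
    def __init__(self):
        self.webhook_online = True
        self.objects = {}  # object name -> set of finalizers

    def update_finalizers(self, name, finalizers):
        # Admission webhooks intercept every update; when virt-api is down the
        # call fails, mirroring 'no endpoints available for service "virt-api"'.
        if not self.webhook_online:
            raise WebhookUnavailable("failed calling webhook: no endpoints available for virt-api")
        self.objects[name] = set(finalizers)

    def delete(self, name):
        # Kubernetes only removes an object once its finalizer list is empty.
        if self.objects.get(name):
            return "stuck: finalizers present"
        self.objects.pop(name, None)
        return "deleted"

api = ApiServer()
api.objects["fedora-vmi"] = {"kubevirt.io/virtualMachineControllerFinalize"}

# virt-api goes down mid-teardown.
api.webhook_online = False

# virt-controller tries to clear the finalizer so the VMI can go away...
try:
    api.update_finalizers("fedora-vmi", [])
except WebhookUnavailable as e:
    print("patch failed:", e)

# ...so the VMI stays stuck in Terminating.
print(api.delete("fedora-vmi"))  # prints "stuck: finalizers present"
```

This matches the observed state: the VMI object persists and the KubeVirt CR stays in Deleting, because nothing can clear the finalizer while the webhook service has no endpoints.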
Waiting on KubeVirt component logs to further diagnose this.
Created attachment 1850065 [details] Kubevirt components logs
Created attachment 1850074 [details] kubevirt-must-gather
Thanks for the logs. I understand better what is occurring now. This issue exposes a flaw in our KubeVirt teardown logic. I'll explain.

=====================
--- What's happening.
=====================

To understand what's occurring I need to point out a few things about how the system works.

1. We register mutation/validation webhooks for our CRDs, which are served by virt-api.
2. Update/Patch calls to our KubeVirt APIs fail when virt-api is unavailable, due to these webhook registrations.
3. virt-api fails to initialize if not all CRDs are registered, because it cannot sync its informer caches.

What we have going on in this BZ is that during teardown, virt-api re-initialized (most likely due to the CDI APIs being removed) and was unable to come back online because some of our CRDs were no longer registered. Since virt-api hosts the webhooks for all our CRDs, any remaining KubeVirt CRs in the cluster that require the removal of finalizers cannot be removed, because we can't call Patch/Update without the webhooks being online.

The result is a deadlock of sorts. We can't remove objects with finalizers because we can't call Update/Patch on those objects; we can't call Update/Patch because virt-api is down; and virt-api is down because some KubeVirt APIs have already been removed, which prevents the informer cache from syncing.

virt-controller is subject to this same issue. If some CRDs are no longer registered, virt-controller will fail to initialize in that situation as well.

===================================
--- Logs showing sequence of events
===================================

virt-api re-initializes because the DataSource object from CDI is removed:
2022-01-11T11:15:52.647206472Z {"component":"virt-api","level":"info","msg":"Reinitialize virt-api, cdi DataSource api has been removed","pos":"api.go:985","timestamp":"2022-01-11T11:15:52.647130Z"}

virt-api is stuck trying to come back online but can't sync informers:

2022-01-11T11:53:19.372804960Z E0111 11:53:19.372752       1 reflector.go:138] pkg/controller/virtinformers.go:305: Failed to watch *v1alpha1.VirtualMachineRestore: failed to list *v1alpha1.VirtualMachineRestore: the server could not find the requested resource (get virtualmachinerestores.snapshot.kubevirt.io)
2022-01-11T11:53:34.820951778Z E0111 11:53:34.820897       1 reflector.go:138] pkg/controller/virtinformers.go:305: Failed to watch *v1.VirtualMachineInstancePreset: failed to list *v1.VirtualMachineInstancePreset: the server could not find the requested resource (get virtualmachineinstancepresets.kubevirt.io)

virt-controller can't patch the VMI object because virt-api isn't initialized:

2022-01-11T11:53:52.792661238Z {"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance kubevirt-hyperconverged/fedora-responsible-boar","pos":"vmi.go:272","reason":"patching of vmi conditions and activePods failed: Internal error occurred: failed calling webhook \"virtualmachineinstances-mutator.kubevirt.io\": failed to call webhook: Post \"https://virt-api.kubevirt-hyperconverged.svc:443/virtualmachineinstances-mutate?timeout=10s\": no endpoints available for service \"virt-api\"","timestamp":"2022-01-11T11:53:52.792516Z"}

==================
--- How we fix it.
==================

We need to change the deletion order from this...

- Delete all CRDs
- Delete webhook registrations

to something like this...

- Block all creations of KubeVirt API objects by registering a temporary validation webhook that points to a non-existent endpoint
- Delete all CRs across the entire cluster for every registered CRD in the install strategy
- Delete all CRDs
- Delete all registered webhooks (including the temporary one used to block new creations)

The theory behind this method is that we're freezing the ability to create new objects during teardown, which allows us to do a controlled deletion of the KubeVirt CRs before un-registering the CRDs and webhooks. By destroying the CRDs last, we ensure that our control plane remains operational to remove all the CRs gracefully.

===============
--- References
===============

The technique of using a dummy validation webhook to block an API isn't new. We already do it today when new APIs are introduced during an update: we block usage of the new API until all the control plane components come online to process it. See the createDummyWebhookValidator() function in pkg/virt-operator/resource/apply/reconcile.go. We'd simply be doing something similar during teardown to prevent new creations while we tear down all CRs.
PR posted: https://github.com/kubevirt/kubevirt/pull/7064

I took a slightly different approach to the fix than what I outlined in comment 7. Instead of using a webhook to block creations, I'm using a finalizer on the CRDs. This accomplishes the same task: when a CRD is marked for deletion with a finalizer, the CRD can't disappear and no new creations for that CRD can occur.

So, with the new change, I changed the deletion order from this...

- Delete all CRDs
- Delete webhook registrations

to this...

- Add finalizers to all CRDs
- Delete all CRDs (the finalizer prevents them from being completely removed)
- Remove all CRD finalizers after waiting for all CRDs to report that their CRs are removed
- Delete webhook registrations
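The finalizer-based ordering can be sketched the same way. Names here are illustrative only; the real behavior relied on is that Kubernetes rejects new objects for a CRD that is already terminating:

```python
# Toy model of the finalizer-based fix: a CRD marked for deletion with a
# finalizer still exists (so the control plane stays healthy), but the API
# server rejects new creations against it.

class CRD:
    def __init__(self, name):
        self.name = name
        self.finalizers = []
        self.deletion_timestamp = None
        self.instances = []

    def create_instance(self, name):
        # Kubernetes rejects creation of objects for a terminating CRD.
        if self.deletion_timestamp:
            raise RuntimeError(f"{self.name} is being deleted; creation blocked")
        self.instances.append(name)

vmi_crd = CRD("virtualmachineinstances.kubevirt.io")
vmi_crd.instances = ["fedora-vmi"]

# Step 1: add a teardown finalizer (hypothetical name), then delete the CRD.
vmi_crd.finalizers.append("example.kubevirt.io/teardown-finalizer")
vmi_crd.deletion_timestamp = "2022-01-11T12:00:00Z"

# New creations are now blocked, but the CRD (and its API) remain available
# until the finalizer is removed, so existing CRs can be cleaned up gracefully.
try:
    vmi_crd.create_instance("new-vmi")
except RuntimeError as e:
    print("creation rejected:", e)

# Step 2: controllers finish deleting the remaining CRs...
vmi_crd.instances.clear()

# Step 3: ...then the finalizer is removed and the CRD actually disappears.
vmi_crd.finalizers.clear()
```

Compared with the temporary-webhook approach, this reuses a built-in Kubernetes mechanism instead of adding and later removing an extra webhook configuration.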
Setting `uninstallStrategy: RemoveWorkloads` on the HCO CR and then forcing a one-click product removal that cascade-deletes existing workloads is now working as expected.

1. Deployed CNV and set `uninstallStrategy: RemoveWorkloads` on the HCO CR.
2. Created and started a VM; the VM was in the Running state.
3. Removed the HCO CR and the openshift-cnv namespace successfully, for all 4 attempts.

This was tested 4 times and the issue was no longer seen. As this was caused by a race condition and is hard to reproduce, please feel free to reopen this issue if it is seen again.

VERIFIED with hco-bundle-registry-container-v4.10.0-648
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0947