Bug 2132873

Summary: VM is removed before virt-launcher pod exits, new VM with same name points to old VMI/virt-launcher pod still terminating
Product: Container Native Virtualization (CNV)
Reporter: Germano Veit Michel <gveitmic>
Component: Virtualization
Assignee: ffossemo
Status: CLOSED ERRATA
QA Contact: zhe peng <zpeng>
Severity: medium
Priority: urgent
Version: 4.11.0
CC: acardace, fdeutsch, ffossemo, ycui
Keywords: Reopened
Target Release: 4.13.0
Hardware: x86_64
OS: Linux
Fixed In Version: hco-bundle-registry-container-v4.13.0.rhel9-1639
Last Closed: 2023-05-18 02:55:41 UTC
Type: Bug

Description Germano Veit Michel 2022-10-07 04:58:28 UTC
Description of problem:

The duplicate name check does not work after the VM is deleted: the customer is able to create a new VM with the same name as the old VM that still has its virt-launcher pod terminating.

The problem is the new VM is suddenly assigned to the old VMI and its terminating virt-launcher pod.

Version-Release number of selected component (if applicable):
OCP 4.11.7
CNV 4.11.0

How reproducible:
Always

Steps to Reproduce:
1. Create a new Windows VM with a blank PVC, so it fails to boot (a minimal manifest sketch for this follows the steps below)
2. Delete the VM
3. Check the VM is gone, but virt-launcher and VMI are still there

NAME                                               READY   STATUS        RESTARTS        AGE
virt-launcher-win10-thin-albatross-8hzzh           1/1     Terminating   0               7m

% oc get vmi            
NAME                        AGE     PHASE     IP            NODENAME                   READY
win10-thin-albatross        8m27s   Running   10.129.2.19   shift-dcv2q-worker-2c76s   False

% oc get vm
No resources found in default namespace.

4. Given the VM is gone, you can now create a new VM with the same name. Do so, but there is no need to start it; leave it off.

5. You can now connect to the console and do other things on the terminating VM from step 1. 

% oc get vm
NAME                   AGE     STATUS    READY
win10-thin-albatross   5m44s   Stopped   False

% oc get vmi
NAME                        AGE   PHASE     IP            NODENAME                   READY
win10-thin-albatross        15m   Running   10.129.2.19   shift-dcv2q-worker-2c76s   False

It's connecting back to the old terminating pod.
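
For reference, step 1 and 2 can be scripted roughly like this. This is only a sketch: the VM name matches the one above, but the memory, storage size, and blank DataVolume source are illustrative and assume CDI is available in the cluster.

```
cat <<'EOF' | oc create -f -
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: win10-thin-albatross
spec:
  running: false
  dataVolumeTemplates:
  - metadata:
      name: win10-thin-albatross-rootdisk
    spec:
      source:
        blank: {}          # blank image, so the guest has nothing to boot from
      pvc:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 40Gi
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 4Gi
        devices:
          disks:
          - name: rootdisk
            disk:
              bus: virtio
      volumes:
      - name: rootdisk
        dataVolume:
          name: win10-thin-albatross-rootdisk
EOF

# Start it so a VMI and virt-launcher pod exist, then delete the VM (steps 1-2)
virtctl start win10-thin-albatross
oc delete vm win10-thin-albatross
```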

Comment 1 Germano Veit Michel 2022-10-07 05:02:21 UTC
(In reply to Germano Veit Michel from comment #0)
> The duplicate name check does not work after the VM is deleted: the customer
> is able to create a new VM with the same name as the old VM that still has
> its virt-launcher pod terminating.

This ended up quite confusing.

Let me rephrase:
* After a VM is deleted, one is able to create a new VM with the same name.
* But the VM being deleted does not mean the VMI and virt-launcher are already gone.
* If one creates a new VM with the same name as an old VMI that is still terminating, they get associated.

Hopefully that's better :)

Comment 2 sgott 2022-10-12 12:10:51 UTC
This is intended behavior overall. From OpenShift Virtualization's point of view, we're relying on the OwnerReference to trigger a cascading delete, which means we're relying on the Kubernetes garbage collector to clean up the remaining objects, and that is not an atomic operation.

For the workflow given in the description, we suggest adding the "--cascade=foreground" flag to the VM delete command. This will result in the desired behavior.
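
For illustration, the foreground variant could look like this (a sketch; the VM name is the one from the description and the timeout is arbitrary):

```
# Foreground deletion: the VM object is only removed after its dependents
# (VMI and virt-launcher pod) are gone, so the name cannot be reused too early.
oc delete vm win10-thin-albatross --cascade=foreground

# Alternatively, after a regular (background) delete, wait until the VMI is
# gone before creating a new VM with the same name.
oc wait --for=delete vmi/win10-thin-albatross --timeout=5m
```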

Comment 3 Germano Veit Michel 2022-10-12 21:56:02 UTC
(In reply to sgott from comment #2)
> This is intended behavior overall. From OpenShift Virtualization's point of
> view, we're relying on the OwnerReference to trigger a cascading delete,
> which means we're relying on the Kubernetes garbage collector to clean up
> the remaining objects, and that is not an atomic operation.

I disagree with this. From a user's point of view this is horrible, and we have
a customer complaining about it.

While the cleanup is being done, a new VM should get its own VMI and so on.
It should not link back to an old VMI that is completely unrelated except for its name.

The cleanup of the old objects is one thing, and it's fine that they are still being deleted;
the problem is relating a new object to old objects that have nothing to do with it.

> 
> For the workflow given in the description, we suggest adding the
> "--cascade=foreground" flag to the VM delete command. This will result in
> the desired behavior.

Can you consider doing this on the Web UI then?

Comment 6 Fabian Deutsch 2022-11-08 11:25:18 UTC
WRT deletion, I agree that the system behaves as expected and follows the same principles as other objects in the cluster.
A foreground deletion (kubectl delete --cascade=foreground) can mitigate this problem.

However, the fact that the new VM is linked to the old VMI instance is a bug. This must not happen.

Consider the following flow
1. Create VM A
2. Start VM A
3. Delete VM A
4. Create VM A
5. Start VM A

Then we should fail in step #4, or at the latest in step #5, assuming that the old VMI is still running.

Because VMs will create VMIs with the same name, it might make sense to block step #4 whenever we detect a VMI with the same name as the VM.
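
Until such a check exists server-side, a client-side guard along these lines can avoid the race (a sketch; VM_NAME and vm.yaml are placeholders):

```
VM_NAME=win10-thin-albatross

# Refuse to recreate the VM while a leftover VMI with the same name exists.
if oc get vmi "$VM_NAME" >/dev/null 2>&1; then
  echo "A VMI named $VM_NAME still exists (probably terminating), waiting..."
  oc wait --for=delete "vmi/$VM_NAME" --timeout=5m
fi
oc create -f vm.yaml
```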

Comment 7 sgott 2022-11-09 13:38:10 UTC
Given that a new VM will most likely conflict with a VMI's settings in some way, it does make sense to block the creation of the VM in step 4 of comment #6.

Comment 8 ffossemo 2022-12-05 09:35:03 UTC
@fdeutsch @sgott 
For uniformity, we should do the same in case of creation of VMIs. Let's take an example:
1. Creation of the VM `vm-test`
2. Creation of the VMI `vm-test`
3. Start of the VMI `vm-test`
Currently, the behaviour is that the VMI created at point 2 will be deleted as soon as it is created, due to the reconciliation of the VM created at point 1.
So, it results in the following:

```
fossedihelm@fedora ~/w/kubevirt (name_conflict_vm_vmi) [1]> k create -f examples/vmi-windows.yaml
virtualmachineinstance.kubevirt.io/windows-11-test created
fossedihelm@fedora ~/w/kubevirt (name_conflict_vm_vmi)> k get vmi
No resources found in default namespace.
```

Is it OK if the creation of a VMI checks that there is no VM with the same name? Thanks

Comment 9 Fabian Deutsch 2022-12-05 09:49:24 UTC
This is not enough I suppose, and VMIs do not have the problem. The problem exists for VMs only.

Consider the following flow
1. Create VM A
2. Start VM A (VMI A is getting created)
3. Delete VM A (VMI A is started to be deleted)
4. Create VM A
5. Start VM A (old VMI A was not deleted and is now bound to VM A again)

The bug is in step #5: VMI A, which got created in step #2, was marked for deletion in step #3; however, VMI A was NOT deleted before step #5.
Thus, just before step #5,
VM A exists and VMI A exists.
In step #5,
VM A is bound to VMI A.

This is the bug, because VMI A (in step #5) comes from the VM A of step #1, and not from the instance of VM A created in step #4.

Comment 11 Fabian Deutsch 2022-12-14 09:36:24 UTC
As part of this bug, please take care to create a new testcase in tier1 (upstream) to test for this scenario/bug.

Assumption is:
1. VM created and eventually running
2. Delete VM (background deletion)
3. Quickly create a new VM with the same name

Expected outcome:
The new VM will eventually be running; it has to wait for the old VMI to be gone (see the sketch below)
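
Expressed as a rough shell sequence (only a sketch of the scenario; the actual test case belongs in the upstream tier1 suite, and testvm/vm.yaml are placeholders):

```
# 1. VM created and eventually running
oc create -f vm.yaml                       # defines VM "testvm"
virtctl start testvm
oc wait --for=condition=Ready vm/testvm --timeout=10m

# 2. Delete the VM (default background deletion)
oc delete vm testvm

# 3. Quickly create a new VM with the same name
oc create -f vm.yaml

# Expected: the new VM can only be started and become Ready once the old VMI
# is gone; until then, a start attempt should be rejected.
oc wait --for=delete vmi/testvm --timeout=10m
virtctl start testvm
oc wait --for=condition=Ready vm/testvm --timeout=10m
```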

Comment 13 zhe peng 2023-03-07 07:34:02 UTC
I can reproduce this bug.
Verified with build: CNV-v4.13.0.rhel9-1639

Steps:
1: Create a new Windows VM with a blank PVC
2: Delete vm with --cascade=orphan
$ kubectl delete vm vm-win10 --cascade=orphan
virtualmachine.kubevirt.io "vm-win10" deleted

3: Check the VM, VMI and virt-launcher pod
$ oc get vm
No resources found in default namespace.
$ oc get vmi
NAME       AGE   PHASE     IP             NODENAME                             READY
vm-win10   25m   Running   10.129.3.160   c01-zpeng-413-sv526-worker-0-8dp6m   False

$ oc get pods
NAME                           READY   STATUS        RESTARTS   AGE
virt-launcher-vm-win10-h2dhq   2/2     Terminating   0          49m


4: Create a new Windows VM using the same name as in step 1 and start it
$ virtctl start vm-win10
Error starting VirtualMachine Operation cannot be fulfilled on virtualmachine.kubevirt.io "vm-win10": VM is already running

The new VM is not bound to the old VMI.
Moving to VERIFIED.
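
One additional way to confirm that the new VM is not bound to the leftover VMI is to compare object UIDs (a sketch using the names from this comment):

```
# UID of the newly created VM
oc get vm vm-win10 -o jsonpath='{.metadata.uid}'; echo

# Owner UID recorded on the leftover VMI; it should still point at the deleted
# VM and must not match the new VM's UID.
oc get vmi vm-win10 -o jsonpath='{.metadata.ownerReferences[0].uid}'; echo
```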

Comment 17 errata-xmlrpc 2023-05-18 02:55:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205