Bug 1914900

Summary: [CNV][Chaos] VM is not migrated to another node if the original virt-launcher pod is killed immediately
Product: Container Native Virtualization (CNV)
Component: Virtualization
Version: 2.6.0
Reporter: Guohua Ouyang <gouyang>
Assignee: sgott
QA Contact: Israel Pinto <ipinto>
CC: cnv-qe-bugs, gouyang, pkliczew
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Type: Bug
Regression: ---
Embargoed:
Last Closed: 2021-01-13 13:10:32 UTC
Bug Depends On:
Bug Blocks: 1908661

Description Guohua Ouyang 2021-01-11 12:59:47 UTC
Description of problem:
Migrate a running VM. Once the migration is in progress, kill the VM's original pod immediately; both pods are then terminated. After that, a new pod comes up and the VM is scheduled back onto the original node (it can be any node).

If you instead wait a few seconds at step 2, so the new pod gets into a better state before the original pod is killed, the new pod can continue and the VM is migrated to another node.

1. Get VM information
$ oc get vmi
NAME         AGE     PHASE     IP             NODENAME
vm-example   2m30s   Running   10.129.3.127   sys01-pwk5k-worker-0-wq9rh
$ oc get pod
NAME                                             READY   STATUS        RESTARTS   AGE
virt-launcher-vm-example-tlgr6                   2/2     Running       0          2m57s

2. Migrate the VM
$ virtctl migrate vm-example
VM vm-example was scheduled to migrate

3. Kill the original pod immediately
$ oc delete pod virt-launcher-vm-example-tlgr6
pod "virt-launcher-vm-example-tlgr6" deleted

4. Monitor the pods in another tab; both the original pod and the new pod are terminating
$ oc get pod
NAME                                             READY   STATUS        RESTARTS   AGE
virt-launcher-vm-example-tlgr6                   2/2     Terminating   0          3m18s
virt-launcher-vm-example-wwkqj                   2/2     Terminating   0          10s

5. The VM is eventually running on the original node.
$ oc get vmi
NAME         AGE     PHASE     IP             NODENAME
vm-example   4m38s   Running   10.129.3.128   sys01-pwk5k-worker-0-wq9rh
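
For reference, virtctl migrate creates a VirtualMachineInstanceMigration object, whose state can be inspected alongside the pods. A sketch of doing that, where "vmim" is the resource shortname and <migration-name> stands for whatever name virtctl generated:

$ oc get vmim
$ oc describe vmim <migration-name>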


Version-Release number of selected component (if applicable):
CNV 2.6

How reproducible:
100%

Comment 1 sgott 2021-01-11 19:11:01 UTC
I don't believe this is a valid test scenario. Killing the original pod should result in what you described. Can you please explain why you think it's a bug that a VM was restarted when you explicitly killed it?

Comment 2 Guohua Ouyang 2021-01-12 03:23:07 UTC
(In reply to sgott from comment #1)
> I don't believe this is a valid test scenario. Killing the original pod
> should result in what you described. Can you please explain why you think
> it's a bug that a VM was restarted when you explicitly killed it?

It was added as a disruptive scenario for CNV chaos testing [1]. As we can see, it can be successful if we just wait a few more seconds before killing the pod, so it looks like it depends on the state of the 2nd pod.

I'm not sure whether this is a valid disruptive test scenario.

@pkliczew, can you jump in here?

[1] https://issues.redhat.com/browse/CNV-8366

Comment 3 Piotr Kliczewski 2021-01-12 08:06:25 UTC
@Stu, we work on chaos scenarios. We come up with different ways to break the cluster to see how resilient the code is in handling rare issues. If we can assume that the virt-launcher pod will always be there, then it is not a bug, but I am afraid it can be terminated/evicted for many reasons. The idea is to make sure that the user understands what happened and that the code handles the situation gracefully.

Comment 4 sgott 2021-01-13 13:10:32 UTC
Completely understood about the scenario. As far as I can see, this is behaving exactly as expected. If you kill the source pod, a migration simply cannot happen. The VM will then be re-started or not based on the runStrategy of the VM.

In other words, the existence of a migration object is not a guarantee that a migration can happen.

I am going to close this as not a bug. If you are sure I'm missing something, please re-open it.
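
A quick way to check which run strategy applies to the VM (a sketch, reusing the vm-example name from the description; the field may be empty if the VM uses the older spec.running field instead of runStrategy):

$ oc get vm vm-example -o jsonpath='{.spec.runStrategy}'
# possible values include Always, RerunOnFailure, Manual, and Halted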

Comment 5 Piotr Kliczewski 2021-01-13 16:20:27 UTC
Guohua, do you see any inconsistencies, like the migration object still being in progress, or any other failures? I agree with Stu that if we fail gracefully it is not a bug.

Comment 6 Guohua Ouyang 2021-01-18 03:01:52 UTC
The thing that makes the results different is when the original pod is killed:
If the 1st pod is killed after the 2nd pod has reached the Running state, the migration can complete:

    $ virtctl migrate vm1
    VM vm1 was scheduled to migrate
    $ oc get pod
    NAME                      READY   STATUS              RESTARTS   AGE
    virt-launcher-vm1-4nc7v   1/1     Running             0          4m19s
    virt-launcher-vm1-shjmz   0/1     ContainerCreating   0          16s

    $ oc get pod
    NAME                      READY   STATUS    RESTARTS   AGE
    virt-launcher-vm1-4nc7v   1/1     Running   0          4m24s
    virt-launcher-vm1-shjmz   1/1     Running   0          21s
    $ oc delete pod virt-launcher-vm1-4nc7v
    pod "virt-launcher-vm1-4nc7v" deleted
    $ oc get pod
    NAME                      READY   STATUS    RESTARTS   AGE
    virt-launcher-vm1-shjmz   1/1     Running   0          107s


If the 1st pod is killed immediately, the migration cannot complete:

    $ virtctl migrate vm1
    VM vm1 was scheduled to migrate
    $ oc get pod
    NAME                      READY   STATUS              RESTARTS   AGE
    virt-launcher-vm1-96vfx   0/1     ContainerCreating   0          4s
    virt-launcher-vm1-shjmz   1/1     Running             0          2m
    $ oc delete pod virt-launcher-vm1-shjmz
    pod "virt-launcher-vm1-shjmz" deleted
    $ oc get pod
    NAME                      READY   STATUS    RESTARTS   AGE
    virt-launcher-vm1-bh5gh   1/1     Running   0          15s

Comment 7 Piotr Kliczewski 2021-01-18 08:35:49 UTC
Guohua, in the second case, what is the status of the migration? Is there any information about why the migration failed?

Comment 8 Guohua Ouyang 2021-01-18 09:04:27 UTC
(In reply to Piotr Kliczewski from comment #7)
> Guohua, in second case what is the status of the migration. Is there any
> information why migration failed?

It failed because the pod brought up by the migration was killed along with the original pod.
A new pod always comes up afterwards, but it is no longer relevant to the migration.
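
The terminal state of the migration object itself can be confirmed for the second case (a sketch; <migration-name> is a placeholder for the name virtctl generated):

$ oc get vmim <migration-name> -o jsonpath='{.status.phase}'
# a migration whose launcher pods were deleted mid-flight is expected to report the Failed phase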