Description of problem:
Migrate a running VM; as soon as the migration starts, kill the original virt-launcher pod. Both pods are then terminated. Afterwards a new pod comes up and the VM is scheduled back onto the original node (it can be any node). If, at step 3, you wait a few seconds so the new pod reaches a better state before killing the original pod, the new pod continues and the VM is migrated to another node.

1. Get the VM information
$ oc get vmi
NAME         AGE     PHASE     IP             NODENAME
vm-example   2m30s   Running   10.129.3.127   sys01-pwk5k-worker-0-wq9rh
$ oc get pod
NAME                             READY   STATUS    RESTARTS   AGE
virt-launcher-vm-example-tlgr6   2/2     Running   0          2m57s

2. Migrate the VM
$ virtctl migrate vm-example
VM vm-example was scheduled to migrate

3. Kill the original pod immediately
$ oc delete pod virt-launcher-vm-example-tlgr6
pod "virt-launcher-vm-example-tlgr6" deleted

4. Monitor pods in another tab; both the original pod and the new pod are being terminated
$ oc get pod
NAME                             READY   STATUS        RESTARTS   AGE
virt-launcher-vm-example-tlgr6   2/2     Terminating   0          3m18s
virt-launcher-vm-example-wwkqj   2/2     Terminating   0          10s

5. The VM is eventually running on the original node
$ oc get vmi
NAME         AGE     PHASE     IP             NODENAME
vm-example   4m38s   Running   10.129.3.128   sys01-pwk5k-worker-0-wq9rh

Version-Release number of selected component (if applicable):
CNV 2.6

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I don't believe this is a valid test scenario. Killing the original pod should result in what you described. Can you please explain why you think it's a bug that a VM was restarted when you explicitly killed it?
(In reply to sgott from comment #1)
> I don't believe this is a valid test scenario. Killing the original pod
> should result in what you described. Can you please explain why you think
> it's a bug that a VM was restarted when you explicitly killed it?

It was added as a disruptive scenario for CNV chaos testing [1]. As we can see, the migration can succeed if we just wait a few more seconds before killing the pod, so it looks like the outcome depends on the state of the 2nd pod. I'm not sure whether this is a valid disruptive test scenario. @pkliczew, can you jump in here?

[1] https://issues.redhat.com/browse/CNV-8366
@Stu, we work on chaos scenarios. We come up with different ways to break the cluster to see how resilient the code is when handling rare issues. If we can assume that the virt-launcher pod will always be there, then it is not a bug, but I am afraid it can be terminated/evicted for many reasons. The idea is to make sure that the user understands what happened and that the code handles the situation gracefully.
Completely understood about the scenario. As far as I can see, this is behaving exactly as expected. If you kill the source pod, a migration simply cannot happen. The VM will then be re-started or not based on the runStrategy of the VM. In other words, the existence of a migration object is not a guarantee that a migration can happen. I am going to close this as not a bug. If you are sure I'm missing something, please re-open it.
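For anyone reproducing this: the restart behaviour described above is controlled by the VM's run strategy. A hedged sketch of how to inspect it (the jsonpath fields are from the KubeVirt VirtualMachine API; this requires a live cluster, and older VM specs may set the boolean `spec.running` instead of `spec.runStrategy`):

```shell
# Show the run strategy that decides whether the VM restarts after its pod dies.
# vm-example is the VM from this report.
oc get vm vm-example -o jsonpath='{.spec.runStrategy}'

# Fallback: some VM specs use the boolean running field instead.
oc get vm vm-example -o jsonpath='{.spec.running}'
```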
Guohua, do you see any inconsistencies, like the migration object being stuck in progress, or any other failures? I agree with Stu: if we fail gracefully, it is not a bug.
What makes the results different is when the original pod is killed.

If the 1st pod is killed after the 2nd pod has reached the Running state, the migration can complete:

$ virtctl migrate vm1
VM vm1 was scheduled to migrate
$ oc get pod
NAME                      READY   STATUS              RESTARTS   AGE
virt-launcher-vm1-4nc7v   1/1     Running             0          4m19s
virt-launcher-vm1-shjmz   0/1     ContainerCreating   0          16s
$ oc get pod
NAME                      READY   STATUS    RESTARTS   AGE
virt-launcher-vm1-4nc7v   1/1     Running   0          4m24s
virt-launcher-vm1-shjmz   1/1     Running   0          21s
$ oc delete pod virt-launcher-vm1-4nc7v
pod "virt-launcher-vm1-4nc7v" deleted
$ oc get pod
NAME                      READY   STATUS    RESTARTS   AGE
virt-launcher-vm1-shjmz   1/1     Running   0          107s

If the 1st pod is killed immediately, the migration cannot complete:

$ virtctl migrate vm1
VM vm1 was scheduled to migrate
$ oc get pod
NAME                      READY   STATUS              RESTARTS   AGE
virt-launcher-vm1-96vfx   0/1     ContainerCreating   0          4s
virt-launcher-vm1-shjmz   1/1     Running             0          2m
$ oc delete pod virt-launcher-vm1-shjmz
pod "virt-launcher-vm1-shjmz" deleted
$ oc get pod
NAME                      READY   STATUS    RESTARTS   AGE
virt-launcher-vm1-bh5gh   1/1     Running   0          15s
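The timing dependence above could be handled in the chaos script with a small guard: only delete the source pod once the newest pod's listing shows it Running with all containers ready. This is an illustrative sketch, not KubeVirt code; the `target_ready` helper is hypothetical, and it assumes the column layout of the `oc get pod` output above (NAME READY STATUS RESTARTS AGE, newest pod last).

```shell
#!/bin/sh
# Sketch: given `oc get pod`-style output (without the header line), check
# whether the newest pod (assumed to be the last line) is Running with all
# of its containers ready (e.g. READY shows 2/2, not 0/1).
target_ready() {
  line=$(printf '%s\n' "$1" | tail -n 1)
  ready=$(printf '%s\n' "$line" | awk '{print $2}')
  status=$(printf '%s\n' "$line" | awk '{print $3}')
  # Pod must be in the Running phase...
  [ "$status" = "Running" ] || return 1
  # ...and both READY numbers must match (ready count == container count).
  [ "${ready%/*}" = "${ready#*/}" ]
}
```

In a chaos run this would be polled (e.g. against `oc get pod --no-headers`) and the source pod deleted only after it returns success, which matches the first transcript above where the migration completed.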
Guohua, in the second case, what is the status of the migration? Is there any information on why the migration failed?
(In reply to Piotr Kliczewski from comment #7)
> Guohua, in second case what is the status of the migration. Is there any
> information why migration failed?

It failed because the pod brought up by the migration was killed along with the original pod. A new pod always comes up afterwards, but it is no longer related to the migration.
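For the record, the failure should also be visible on the migration object itself. A hedged way to check (assuming the `vmim` short name for VirtualMachineInstanceMigration is available in this CNV version; `<migration-name>` is a placeholder for the object `virtctl migrate` created):

```shell
# List VirtualMachineInstanceMigration objects and their phase.
oc get vmim

# Inspect one migration object's status for failure details.
oc get vmim <migration-name> -o yaml
```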