Description of problem:
When a user cancels a VM migration, the Status of the VM is displayed as Migrating for as long as the vmim resource exists. Deletion of the vmim resource, however, takes a long time, meaning that the VM stays in the Migrating state for an unexpectedly long time after the Virtual Machine Migration was canceled.
This is not a serious issue, but it is a poor user experience for sure.
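For context, the Migrating indicator is tied to the existence of a VirtualMachineInstanceMigration (vmim) object. A minimal way to observe this from the CLI, assuming a VM named example in the current namespace (the name is a placeholder, not taken from this report):

# Any migration object still present for the namespace keeps the
# console showing "Migrating".
oc get vmim

# Compare with the phase reported by the VMI itself.
oc get vmi example -o jsonpath='{.status.phase}{"\n"}'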
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create a live migratable VM
2. Migrate the VM
3. Cancel the VM migration
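A rough CLI equivalent of these steps, assuming virtctl is available and the VM is named example (both are placeholders, not taken from this report):

# Trigger a live migration of the running VM.
virtctl migrate example

# A VirtualMachineInstanceMigration (vmim) object is created for it.
oc get vmim

# Cancelling amounts to the vmim object being deleted; from the CLI
# this can be done directly (the migration name is a placeholder).
oc delete vmim <migration-name>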
I believe the correct way to detect that the VM is being migrated is to check for the existence of the vmim. Moving to the virt component to investigate the time it takes to remove the vmim.
Petr, can you investigate?
I have investigated the code and played with the migration cancellation.
The cancellation completed almost instantly. I have also discussed this with Radim.
Radim followed the reproduction steps and could not reproduce the issue.
Therefore I am in favor of closing this and reopening it if it appears again.
Radim, are you okay with that?
Since I wasn't able to reproduce the issue, I agree with closing this.
We can reopen if we start hitting this in the future.
Created attachment 1650441 [details]
I apologize for the mess; I need to reopen this.
I misinterpreted our tests when I thought that the VM was stuck in the 'Migrating' status.
The actual problem is that the VM Overview page displays an incorrect Node name.
The page displays the migration target node instead of the source node the VM was originally running on. The VMI contains the correct Node (the original one).
Interestingly, after about 5 or 6 minutes, the VM Status suddenly becomes 'Stopping' without any action being invoked. The VMI again reports the correct phase, Running.
Updated reproduction steps:
1. Create a Cirros VM from a URL
2. Start the VM
- make a note of the node the VM is running on, in my case working-9lt8v-worker-0-n7plb
3. Start the migration
4. Cancel the migration right away
5. Wait until the VM is back in Running
At this point, the Node displayed on the Overview and Dashboard pages is working-9lt8v-worker-0-fvbl4, which is the designated target node of the migration.
After a few more minutes (5-6), the VM Status suddenly becomes Stopping.
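To confirm which node the workload is actually on, independent of the UI, the VMI status can be queried directly (a sketch; the VM name is a placeholder, and .status.nodeName is the field behind the NODENAME column of oc get vmi):

# The VMI reports the node it is actually running on; after the
# canceled migration this should still be the original source node.
oc get vmi <vm-name> -o jsonpath='{.status.nodeName}{"\n"}'

# Phase and node side by side, for comparison with the Overview and
# Dashboard pages.
oc get vmi <vm-name> -o custom-columns=PHASE:.status.phase,NODE:.status.nodeName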
Video sample of the whole process: http://file.rdu.redhat.com/rhrazdil/migrationcancellation-2020-01-07_16.20.59.mkv
It is a bit long; the important times are:
00:27 VM is in Running state
00:36 Migration is canceled
01:11 VM is back in Running state, with incorrectly displayed Node
02:38 VMI content
06:07 VM suddenly goes to Stopping State
Moving back to User Interface
*** Bug 1787551 has been marked as a duplicate of this bug. ***
On my system, after stopping and starting migrations, I get strange things:
I have one vm, one vmi, two vmim (state Scheduling) and three pods (one in Running state).
Is my cluster broken?
[yzamir@dhcp-2-187 ~]$ oc get vm
NAME AGE RUNNING VOLUME
example 57m true
[yzamir@dhcp-2-187 ~]$ oc get vmi
NAME AGE PHASE IP NODENAME
example 57m Running 10.130.2.12 working-xb67k-worker-0-jbqql
[yzamir@dhcp-2-187 ~]$ oc get vmim
[yzamir@dhcp-2-187 ~]$ oc get pods
NAME READY STATUS RESTARTS AGE
virt-launcher-example-2v9sh 0/2 Completed 0 16m
virt-launcher-example-44dm8 2/2 Running 0 57m
virt-launcher-example-lfp6v 1/2 Error 0 56m
Moving to POST - VM status is not fixed in case we have two pods of the same VM.
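For what it's worth, when several virt-launcher pods for the same VM are left around, the one actually backing the VMI can usually be picked out by its phase and the kubevirt.io/created-by label, which carries the VMI UID (a sketch, assuming the vm/vmi is named example):

# Grab the UID of the VMI.
UID=$(oc get vmi example -o jsonpath='{.metadata.uid}')

# Show only the launcher pod that is in phase Running and labelled
# with that UID; Completed/Error pods are leftovers.
oc get pods -l kubevirt.io/created-by=$UID --field-selector=status.phase=Running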
Verified on 4.4.0-0.nightly-2020-03-06-170328: after canceling a VM migration, it does not take long for the VM to go back to Running, and it stays on the previous node.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.