Bug 1785344

Summary: Failed / cancelled migration shows incorrect status and node in the UI
Product: OpenShift Container Platform
Reporter: Radim Hrazdil <rhrazdil>
Component: Console Kubevirt Plugin
Assignee: Yaacov Zamir <yzamir>
Status: CLOSED ERRATA
QA Contact: Nelly Credi <ncredi>
Severity: medium
Priority: medium
Version: 4.3.0
CC: aos-bugs, cnv-qe-bugs, gouyang, tjelinek, yzamir, zpeng
Target Milestone: ---
Keywords: Reopened
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: A failed migration was not handled correctly in the UI.
Consequence: If the migration was cancelled or failed, the UI showed the migration's target node for the VM, which was very confusing for the user since it looked as if the migration had actually succeeded.
Fix: The handling of cancelled/failed migrations has been fixed.
Result: The UI reports the actual state of the VM even in the case of a failed/cancelled migration.
Story Points: ---
Clone Of: ---
Clones: 1792847, 1806974 (view as bug list)
Last Closed: 2020-05-04 11:20:55 UTC
Type: Bug
Bug Depends On: 1806974
Bug Blocks: 1792847
Attachments:
VMI.yaml (flags: none)

Description Radim Hrazdil 2019-12-19 18:03:04 UTC
Description of problem:
When a user cancels a VM migration, the Status of the VM is displayed as Migrating for as long as the vmim resource exists. Deletion of the vmim resource, however, takes a long time, so the VM stays in the Migrating state for an unexpectedly long time after the Virtual Machine Migration was canceled.

This is not a serious issue, but it certainly makes for a poor user experience.

Version-Release number of selected component (if applicable):
OCP 4.3
CNV 2.2

How reproducible:
100%

Steps to Reproduce:
1. Create a live migratable VM
2. Migrate the VM
3. Cancel the VM migration (see the CLI sketch below)
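For reference, a minimal CLI sketch of these steps, assuming a live-migratable VM named example in the current namespace (the VM name and the generated migration name are placeholders):

```
# Trigger a live migration of the VM; this creates a
# VirtualMachineInstanceMigration (vmim) object.
virtctl migrate example

# List the migration objects to find the generated name.
oc get vmim

# Cancel the migration by deleting its vmim object
# (the generated suffix will differ on your cluster).
oc delete vmim example-migration-xxxxx
```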

Actual results:

Expected results:

Additional info:

Comment 1 Tomas Jelinek 2020-01-02 10:07:49 UTC
I believe the correct way to detect that the VM is being migrated is to check for the existence of the vmim. Moving to the virt team to investigate the time it takes to remove the vmim.
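As an illustration only (not necessarily the exact check the console plugin performs), the migration object and its progress can be inspected directly; the namespace and migration name are placeholders:

```
# A VirtualMachineInstanceMigration (vmim) object existing for the VM
# indicates a migration is, or is still considered to be, in progress.
oc get vmim -n <namespace>

# The object's status shows whether the migration completed, failed,
# or was aborted.
oc get vmim <migration-name> -n <namespace> -o yaml
```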

Comment 2 sgott 2020-01-06 18:38:47 UTC
Petr, can you investigate?

Comment 4 Petr Kotas 2020-01-07 14:08:19 UTC
I have investigated the code and experimented with migration cancellation.
The cancellation occurred almost instantly. I also discussed this with Radim,
who repeated the steps and could not reproduce the issue.

Therefore I am for closing this and reopening if it appears again.

Radim, are you okay with that?

Comment 5 Radim Hrazdil 2020-01-07 14:20:06 UTC
Since I wasn't able to reproduce the issue, I agree with closing this.
We can reopen if we start hitting this in the future.

Comment 6 Radim Hrazdil 2020-01-07 15:44:48 UTC
Created attachment 1650441 [details]
VMI.yaml

I apologize for the mess; I need to reopen this.
I misinterpreted our tests when I thought that the VM was stuck in the 'Migrating' status.
The actual problem is that the VM Overview page displays an incorrect Node name:
it shows the migration target node instead of the source node the VM was originally running on. The VMI contains the correct Node (the original one).
Interestingly, after about 5 or 6 minutes, the VM Status suddenly becomes 'Stopping' without any action being invoked, while the VMI still reads the correct phase, Running.


Updated reproduction steps:
1. Create a Cirros VM from URL
2. Start the VM
 - make a note of the node the VM is running on, in my case working-9lt8v-worker-0-n7plb
3. Start the migration
4. Cancel the migration right away
5. Wait until the VM is back in Running


At this point, the node displayed on the Overview and Dashboard pages is working-9lt8v-worker-0-fvbl4, which is the designated target node of the migration.
After a couple of minutes (5-6), the VM Status suddenly becomes Stopping.
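To cross-check what the VMI itself reports against what the UI shows, something like the following can be used (the VM name example is a placeholder):

```
# The NODENAME column comes from the VMI status and, per the report above,
# still shows the original source node after the cancelled migration.
oc get vmi example

# Read the node name field directly.
oc get vmi example -o jsonpath='{.status.nodeName}'
```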

Video sample of the whole process: http://file.rdu.redhat.com/rhrazdil/migrationcancellation-2020-01-07_16.20.59.mkv
It is a bit long; the important timestamps are:
00:27 VM is in Running state
00:36 Migration is canceled
01:11 VM is back in Running state, with an incorrectly displayed Node
02:38 VMI content
06:07 VM suddenly goes to Stopping state

Comment 7 Radim Hrazdil 2020-01-07 15:47:59 UTC
Moving back to User Interface

Comment 8 Tomas Jelinek 2020-01-15 11:41:22 UTC
*** Bug 1787551 has been marked as a duplicate of this bug. ***

Comment 12 Yaacov Zamir 2020-02-23 16:13:00 UTC
Radim hi,

On my system, after stopping and starting migrations, I get strange results:

I have one VM, one VMI, two VMIMs (state Scheduling) and three pods (one in Running state).

Is my cluster broken?

```
[yzamir@dhcp-2-187 ~]$ oc get vm
NAME      AGE       RUNNING   VOLUME
example   57m       true      
[yzamir@dhcp-2-187 ~]$ oc get vmi
NAME      AGE       PHASE     IP            NODENAME
example   57m       Running   10.130.2.12   working-xb67k-worker-0-jbqql
[yzamir@dhcp-2-187 ~]$ oc get vmim
NAME                      AGE
example-migration-b6gmk   55m
example-migration-qsfmq   16m
[yzamir@dhcp-2-187 ~]$ oc get pods
NAME                          READY     STATUS      RESTARTS   AGE
virt-launcher-example-2v9sh   0/2       Completed   0          16m
virt-launcher-example-44dm8   2/2       Running     0          57m
virt-launcher-example-lfp6v   1/2       Error       0          56m
```
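One way to tell whether those leftover vmim objects are stale or still active is to look at their reported phase; this is just a sketch, assuming the standard VirtualMachineInstanceMigration status fields:

```
# Show each migration object's phase (e.g. Scheduling, Running,
# Succeeded, Failed) next to its name.
oc get vmim -o custom-columns=NAME:.metadata.name,PHASE:.status.phase
```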

Comment 16 Yaacov Zamir 2020-02-27 12:06:52 UTC
Moving to POST - the VM status is not fixed in the case where we have two pods for the same VM.

Comment 18 Guohua Ouyang 2020-03-12 07:57:52 UTC
Verified on 4.4.0-0.nightly-2020-03-06-170328: after cancelling the VM migration, it does not take much time for the VM to get back to Running, and it stays on the previous node.

Comment 20 errata-xmlrpc 2020-05-04 11:20:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581