Bug 1968433 - [DR] Failover / Failback HA VM Fails to be started due to 'VM XXX is being imported'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.3
Target Release: ---
Assignee: Arik
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-07 12:03 UTC by Ilan Zuckerman
Modified: 2022-11-16 12:17 UTC
CC: 8 users

Fixed In Version: ovirt-engine-4.5.3.1
Doc Type: Bug Fix
Doc Text:
Previously, attempts to start highly available virtual machines during failover or failback flows sometimes failed with an error "Cannot run VM. VM X is being imported", resulting in the virtual machines staying down. In this release, virtual machines are no longer started by the disaster-recovery scripts while being imported.
Clone Of:
Environment:
Last Closed: 2022-11-16 12:17:27 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
ovirt dr + engine logs (81.29 KB, application/x-xz), 2021-06-07 12:03 UTC, Ilan Zuckerman


Links
Github oVirt ovirt-engine pull 655: Make register-vm a sync operation (open), last updated 2022-09-18 08:21:39 UTC
Red Hat Knowledge Base (Solution) 6131621, last updated 2021-06-27 15:22:39 UTC
Red Hat Product Errata RHSA-2022:8502, last updated 2022-11-16 12:17:37 UTC

Description Ilan Zuckerman 2021-06-07 12:03:15 UTC
Created attachment 1789207 [details]
ovirt dr + engine logs

Description of problem:

When invoking the DR failover/failback flows while there is a running HA VM on the 'primary' site (if failback) or on the 'secondary' site (if failover), the VM fails to start [1] (from the ovirt-dr log).
The 'failover' flow appears to complete successfully, but the VM is left in the 'stopped' state on the 'secondary' site.
The error occurs during the 'TASK [redhat.rhv.disaster_recovery : Run VMs]' Ansible task.
The flow below describes the issue as seen with 'failover', but the same behavior can be observed with 'failback' as well.

DR schema:
Master - drives the DR scripts - storage-ge-13
Primary - the environment where the disaster occurs - storage-ge-15
Secondary - the environment the assets are migrated to - storage-ge-16


State of the environments prior to testing:

1. Primary site containing:
Active data center
Active cluster
Active hosts
One active and attached NFS storage domain
One template
No VMs

2. Secondary site containing:
Active data center
Active cluster
Active hosts
NO VMs, templates or attached storage domains


[1]:
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot run VM. VM test is being imported.]". HTTP response code is 409.
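
For reference, a minimal sketch of what the 'Run VMs' step effectively does through the Python SDK, and the failure seen while the VM is still locked by the import (connection details are placeholders, not taken from this environment):

import ovirtsdk4 as sdk

# Placeholder connection details for the 'secondary' engine.
connection = sdk.Connection(
    url='https://storage-ge-16/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
try:
    vms_service = connection.system_service().vms_service()
    vm = vms_service.list(search='name=test')[0]
    # Raises HTTP 409 while the VM is still locked by the
    # asynchronous register/import operation.
    vms_service.vm_service(vm.id).start()
except sdk.Error as e:
    print(e)  # Fault detail: "Cannot run VM. VM test is being imported."
finally:
    connection.close()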


Version-Release number of selected component (if applicable):
rhv-release-4.4.6-9-001.noarch

How reproducible:
100%

Steps to Reproduce:
- Create an HA VM on the 'primary' site and start it
- Generate the mappings file: ./ovirt-dr generate
- Update the OVF store for the storage domain on the 'primary' site
- Mount the primary and secondary storages:
mount -t nfs mantis-nfs-xxx.com:/nas01/ge_storage16_nfs_1 /mnt/secondary/ge_storage16_nfs_1
mount -t nfs mantis-nfs-xxx.com:/nas01/ge_storage15_nfs_1 /mnt/primary/ge_storage15_nfs_1/

- Make sure that the secondary mount point is empty:
rm -rf /mnt/secondary/ge_storage16_nfs_1/*

- To create a replica, rsync the primary storage content to the secondary one:
[root@storage-ge-13 files]# rsync -azvh /mnt/primary/ge_storage15_nfs_1/* /mnt/secondary/ge_storage16_nfs_1

- Change ownership of the replicated storage mount folder and its contents:
chown -R vdsm:kvm /mnt/secondary/ge_storage16_nfs_1/

- Run failover: ./ovirt-dr failover


Actual results:
The secondary storage domain is attached and active
The template was imported
The VM was imported BUT NOT running


Expected results:
Wait for failover to finish and verify that (see the sketch after this list):
The secondary storage domain is attached and active
The template was imported
The VM was imported and running
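
A sketch of how the expected state could be verified through the Python SDK; the connection details are placeholders, the storage domain name is assumed from the mount path above, and 'test' is the VM name from the error message:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details for the 'secondary' engine.
connection = sdk.Connection(
    url='https://storage-ge-16/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
system = connection.system_service()

# Storage domain attached and active in the data center.
dc = system.data_centers_service().list()[0]
attached = system.data_centers_service().data_center_service(dc.id).storage_domains_service().list()
assert any(sd.name == 'ge_storage16_nfs_1'
           and sd.status == types.StorageDomainStatus.ACTIVE
           for sd in attached)

# Template imported (anything besides the built-in Blank template).
assert any(t.name != 'Blank' for t in system.templates_service().list())

# VM imported and running.
vm = system.vms_service().list(search='name=test')[0]
assert vm.status == types.VmStatus.UP

connection.close()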


Additional info:
Attaching the ovirt-dr log, the engine log of the 'secondary' site (where the VM should be up), and the vdsm log

Comment 1 Eyal Shenitzky 2021-06-27 15:22:39 UTC
*** Bug 1974535 has been marked as a duplicate of this bug. ***

Comment 5 Avihai 2022-01-19 08:56:07 UTC
Raising to high severity: a VM not running after DR failover/failback is not a medium-severity bug, and a running HA VM seems like a basic DR feature expectation.
Although the issue does not appear to be a regression from recent 4.4 builds, starting an HA VM is still a basic requirement that should be fixed and given more attention.

Comment 7 Marina Kalinin 2022-04-13 22:27:39 UTC
Changing to d/w, since it has a customer ticket attached.

Comment 8 Arik 2022-05-10 07:01:28 UTC
This should be solved by the fix for bz 2074112: import-vm-from-configuration is now synchronous, so we would not reach RunVm while the VM is locked by the ImportVmFromConfiguration command.

Comment 11 Arik 2022-06-02 16:15:27 UTC
(In reply to Arik from comment #8)
> This should be solved by the fix for bz 2074112: import-vm-from-configuration
> is now synchronous, so we would not reach RunVm while the VM is locked by the
> ImportVmFromConfiguration command.

The above is true for import-vm [1] but not for register-vm [2], which is used by those scripts.
We could change [2] in a similar way, but it would be better for the client to set the operation to async=False instead, and only for VMs that it is going to run.

[1] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.5.1/backend/manager/modules/restapi/jaxrs/src/main/java/org/ovirt/engine/api/restapi/resource/BackendVmsResource.java#L402-L407
[2] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.5.1/backend/manager/modules/restapi/jaxrs/src/main/java/org/ovirt/engine/api/restapi/resource/BackendStorageDomainVmResource.java#L120
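
A sketch of the client-side approach suggested here, assuming the register action accepts the SDK's usual async_ flag (if it does not, polling the VM until it leaves the IMAGE_LOCKED state before starting it achieves the same ordering); the connection details, domain name, and cluster name are placeholders:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder connection details for the 'secondary' engine.
connection = sdk.Connection(
    url='https://storage-ge-16/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
system = connection.system_service()

# Find the unregistered VM on the recovered storage domain.
sd = system.storage_domains_service().list(search='name=ge_storage16_nfs_1')[0]
sd_vms = system.storage_domains_service().storage_domain_service(sd.id).vms_service()
unreg_vm = sd_vms.list(unregistered=True)[0]

# Register synchronously: with async_=False the call should only return
# once ImportVmFromConfiguration has released its lock on the VM.
sd_vms.vm_service(unreg_vm.id).register(
    cluster=types.Cluster(name='secondary-cluster'),  # placeholder name
    async_=False,
)

# RunVm no longer races with the import.
vm = system.vms_service().list(search='name=%s' % unreg_vm.name)[0]
system.vms_service().vm_service(vm.id).start()

connection.close()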

Comment 12 Casper (RHV QE bot) 2022-09-18 08:30:40 UTC
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align the relevant severity, flags, and keywords to raise PM_Score, or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).

Comment 16 sshmulev 2022-10-25 10:08:25 UTC
Verified.
The VM was up after both failover and failback in the same flow.

Versions:
RHV 4.5.3-3
ovirt-engine-4.5.3.1-2
vdsm-4.50.3.4-1

Comment 20 errata-xmlrpc 2022-11-16 12:17:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.3] bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:8502

