Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1441322

Summary: HA VMs running in two hosts at a time after restoring backup of RHV-M
Product: Red Hat Enterprise Virtualization Manager
Reporter: Julio Entrena Perez <jentrena>
Component: ovirt-engine
Assignee: Lev Veyde <lveyde>
Status: CLOSED ERRATA
QA Contact: Jiri Belka <jbelka>
Severity: high
Docs Contact:
Priority: urgent
Version: 4.0.6
CC: ahadas, apinnick, bmcclain, lsurette, mkalinin, nashok, pstehlik, rbalakri, Rhev-m-bugs, srevivo, stirabos, tjelinek, ykaul, ylavi
Target Milestone: ovirt-4.2.0
Keywords: ZStream
Target Release: ---
Flags: ylavi: testing_plan_complete?
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when the Engine was restored from a backup, it sometimes tried to start a virtual machine that was already running, believing that the virtual machine was down. This resulted in a second instance of the virtual machine being started on a different host. In the current release, the virtual machine will not be restarted automatically after restoration if it is already running somewhere else.
Story Points: ---
Clone Of:
: 1446055 (view as bug list)
Environment:
Last Closed: 2018-05-15 17:41:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Integration
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1446055    

Description Julio Entrena Perez 2017-04-11 16:40:50 UTC
Description of problem:
When a backup of RHV-M is restored and VMs flagged as HA are running on different hosts than they were when the backup was taken, a race condition results in RHV-M starting an already-running VM on the host where it was running when the backup was taken.
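The race can be illustrated with a small, purely hypothetical Python simulation; plain dicts stand in for the engine's persisted VM placement (the real engine tracks this in the vm_dynamic table), and all names are invented for illustration:

```python
# Hypothetical simulation of the restore race described above.
# A dict mapping vm_name -> host stands in for the engine database.

def restore_race(backup_state, actual_state):
    """Return the HA VMs the engine would wrongly restart after a restore.

    After restoring the backup, the engine believes backup_state. Before
    it has polled every host, it sees a VM missing from the host recorded
    in the backup, assumes the HA VM is down, and triggers recovery there,
    even though the VM is actually running elsewhere (actual_state).
    """
    duplicates = []
    for vm, backup_host in backup_state.items():
        actual_host = actual_state.get(vm)
        if actual_host is not None and actual_host != backup_host:
            # Second instance started on backup_host -> VM runs on two hosts.
            duplicates.append(vm)
    return duplicates

# The reproduction scenario: two HA VMs backed up while on host1,
# then live-migrated to host2 before the backup is restored.
backup = {"ha_vm1": "host1", "ha_vm2": "host1"}
actual = {"ha_vm1": "host2", "ha_vm2": "host2"}
print(restore_race(backup, actual))  # both VMs get a duplicate instance
```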

Version-Release number of selected component (if applicable):
ovirt-engine-4.0.6.3-0.1.el7ev

How reproducible:
Very frequently (2 attempts so far, both reproducing the issue)

Steps to Reproduce:
1. Have 2 VMs flagged as HA running on host 1.
2. Shut down RHV-M.
3. Take a snapshot of RHV-M or make a backup.
4. Start RHV-M again.
5. Live migrate both VMs to host 2.
6. Shut down RHV-M.
7. Restore the snapshot or the backup taken in step 3.
8. Start RHV-M again.

Actual results:
RHV-M does not find the VMs running on host 1, where it expected them to be, and starts recovery of one or more VMs before noticing that they are running on the other host.

Expected results:
RHV-M does not start recovery of HA VMs until it has the state of all hosts.

Additional info:
This affects the RHEV-M 3.6 to RHV-M 4.0 upgrade: once RHV-M 4.0 has been started with the backup of the 3.6 RHEV-M, it is no longer safe to stop RHV-M 4.0 and start RHEV-M 3.6 again as a rollback strategy if something goes wrong.
Therefore this should be fixed in RHEV-M 3.6 too.

Comment 2 Arik 2017-04-12 15:13:08 UTC
*** Bug 1419649 has been marked as a duplicate of this bug. ***

Comment 3 Arik 2017-04-12 15:29:34 UTC
Best solution imho would be to set the VM as Down (this clears its run_on_vds) and give it a special exit-reason while restoring the backup. Initially, those VMs will be reported with Unknown status.

Positive flow: the VMs are detected either on the original host or on any other host, and they are handled and updated accordingly.

Negative flow: the VMs are not reported on any host (the host they run on is non-responsive or has been rebooted); then for 5 minutes after the engine starts these VMs are reported back to clients with Unknown status and the user cannot do anything with them. After 5 minutes these VMs are reported as Down. The user can then start them (it is the user's responsibility not to start such a VM if it may be running on a different host).
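A minimal sketch of this proposed flow, with invented names and the 5-minute grace period hard-coded; this is an illustration of the proposal, not the actual engine implementation:

```python
# Hypothetical sketch of the proposed handling of HA VMs that were
# reset to Down/Unknown during a backup restore.
import time

GRACE_SECONDS = 5 * 60  # VMs stay Unknown for 5 minutes after engine start

def reported_status(engine_start, now, seen_on_host):
    """Status reported to clients for an HA VM reset during restore.

    Positive flow: the VM is detected on some host -> report it Up there.
    Negative flow: the VM is not reported by any host -> Unknown during
    the grace period (user cannot act), then Down (user may start it,
    at their own risk that it runs elsewhere).
    """
    if seen_on_host is not None:
        return f"Up on {seen_on_host}"
    if now - engine_start < GRACE_SECONDS:
        return "Unknown"
    return "Down"

start = time.time()
print(reported_status(start, start + 60, "host2"))  # positive flow
print(reported_status(start, start + 60, None))     # within grace period
print(reported_status(start, start + 400, None))    # after grace period
```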

Simone, we discussed this as a possible solution for bz 1419649 - would you be able to adjust the restore process?

Comment 4 Arik 2017-04-13 08:40:27 UTC
With the posted patch, the logic after restoring a backup should be:
if a VM is highly available (auto_startup='t') and is not set with a vm-lease (lease_sd_id IS NULL), then set it to Down (status=0) with Unknown exit_status (exit_status=2) and Unknown exit_reason (exit_reason=-1):

UPDATE vm_dynamic
SET status=0, exit_status=2, exit_reason=-1
WHERE vm_guid IN
       (SELECT vm_guid
        FROM vm_static
        WHERE auto_startup='t' AND lease_sd_id IS NULL);
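Note that in SQL a comparison against NULL with `=` never matches any row, so the lease condition must be written with `IS NULL`. A self-contained sketch of the update against an in-memory SQLite database (schema reduced to the columns involved; purely illustrative, not the engine's real schema setup):

```python
# Run the reset-to-Down update against a toy copy of the two tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vm_static  (vm_guid TEXT, auto_startup TEXT, lease_sd_id TEXT);
CREATE TABLE vm_dynamic (vm_guid TEXT, status INT, exit_status INT, exit_reason INT);
-- One HA VM without a lease, one HA VM with a lease, one non-HA VM,
-- all currently marked as running (status=1).
INSERT INTO vm_static VALUES ('ha-no-lease', 't', NULL),
                             ('ha-lease',    't', 'sd1'),
                             ('plain',       'f', NULL);
INSERT INTO vm_dynamic VALUES ('ha-no-lease', 1, 0, 0),
                              ('ha-lease',    1, 0, 0),
                              ('plain',       1, 0, 0);
UPDATE vm_dynamic
SET status=0, exit_status=2, exit_reason=-1
WHERE vm_guid IN
       (SELECT vm_guid
        FROM vm_static
        WHERE auto_startup='t' AND lease_sd_id IS NULL);
""")
rows = conn.execute(
    "SELECT vm_guid, status FROM vm_dynamic ORDER BY vm_guid").fetchall()
print(rows)  # only 'ha-no-lease' is reset to Down (status=0)
```

With `lease_sd_id = NULL` instead of `IS NULL`, the subquery would select nothing and no VM would be reset.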

Comment 5 Simone Tiraboschi 2017-04-13 13:07:33 UTC
Why cover just the HA VMs?
In theory the user could face the same issue if, just after a restore, he explicitly tries to start a non-HA VM that is actually running somewhere else.

Comment 6 Arik 2017-04-13 13:59:40 UTC
(In reply to Simone Tiraboschi from comment #5)
> Why cover just the HA VMs?
> In theory the user could face the same issue if, just after a restore, he
> explicitly tries to start a non-HA VM that is actually running somewhere else.

That's true, but we should look at it in the broader scope.
The ideal solution would probably be to use vm-leases: once that feature is complete, users can use it for all their HA VMs and we won't need such defensive handling.
Given that vm-leases will avoid this problem, and that the probability of having an HA VM running on a non-responsive host after restoring a backup is extremely low, we would prefer to concentrate on the most important and painful issue (which is also what happened in this particular case): the automatic restart of HA VMs.
We are actually considering changing the solution described in comment 3 so that the VM is not reported to clients with status Unknown and the user is not blocked from running the VM in the first 5 minutes after engine startup; that part may well be over-engineering.
I would suggest starting only with HA VMs and addressing the automatic restart of the VM; that would most probably be enough for any real-world case.

Comment 7 Julio Entrena Perez 2017-04-13 14:31:23 UTC
(In reply to Arik from comment #6)

> the
> probablity of having a HA VM running on a non-responsive host after
> restoring a backup is extremely low, 

Sorry to be a party killer, but the above problem reproduces with all hosts being responsive; there wasn't any non-responsive host either in the customer's report or in my reproduction of the problem.

Comment 8 Arik 2017-04-13 15:33:51 UTC
(In reply to Julio Entrena Perez from comment #7)
> (In reply to Arik from comment #6)
> Sorry to be a party killer, but the above problem reproduces with all hosts
> being responsive; there wasn't any non-responsive host either in the
> customer's report or in my reproduction of the problem.

Right, that's exactly my point - in 99.9% of cases the hosts will be responsive, so we can introduce the simple solution described above for that scenario rather than something more complicated.

Comment 15 Jiri Belka 2017-11-13 12:50:05 UTC
ok, ovirt-engine-4.2.0-0.0.master.20171112130303.git8bc889c.el7.centos.noarch / ovirt-engine-tools-backup-4.2.0-0.0.master.20171112130303.git8bc889c.el7.centos.noarch

Comment 18 errata-xmlrpc 2018-05-15 17:41:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 19 Franta Kust 2019-05-16 13:04:33 UTC
BZ<2>Jira Resync