Description of problem:
When a backup of RHV-M is restored and VMs flagged as HA are running on different hosts than they were when the backup was taken, a race condition results in RHV-M starting an already running VM on the host where it was running when the backup was taken.

Version-Release number of selected component (if applicable):
ovirt-engine-4.0.6.3-0.1.el7ev

How reproducible:
Very frequently (so far only 2 attempts, both resulting in a successful reproduction).

Steps to Reproduce:
1. Have 2 VMs flagged as HA running on host 1.
2. Shut down RHV-M.
3. Either take a snapshot of RHV-M or make a backup.
4. Start RHV-M again.
5. Live migrate both VMs to host 2.
6. Shut down RHV-M.
7. Restore the snapshot or the backup taken in step 3.
8. Start RHV-M again.

Actual results:
RHV-M does not find the VMs running on host 1, where it expected them to be, and starts recovery of one or more VMs before noticing that they are running on the other host.

Expected results:
RHV-M does not start recovery of HA VMs until it has the state of all hosts.

Additional info:
This affects the RHEV-M 3.6 to RHV-M 4.0 upgrade: once RHV-M 4.0 has been started with the backup of the 3.6 RHEV-M, it is no longer safe to stop RHV-M 4.0 and start RHEV-M 3.6 again as a rollback strategy if something goes wrong. Therefore this should be fixed in RHEV-M 3.6 too.
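As an illustration of the stale state the race is based on, below is a minimal query sketch for inspecting which host the restored database still records each HA VM as running on. The table and column names (vm_static, vm_dynamic, vds_static, run_on_vds) follow my understanding of the engine schema and should be treated as assumptions; it would be run against the restored engine database (e.g. via psql) before starting the engine.

-- Hypothetical inspection query: lists HA VMs together with the host the
-- restored database still believes they run on (stale run_on_vds).
SELECT s.vm_name,
       d.status,
       h.vds_name AS recorded_host
  FROM vm_static s
  JOIN vm_dynamic d ON d.vm_guid = s.vm_guid
  LEFT JOIN vds_static h ON h.vds_id = d.run_on_vds
 WHERE s.auto_startup = 't';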
*** Bug 1419649 has been marked as a duplicate of this bug. ***
Best solution imho would be to set the VM as Down (this clears its run_on_vds) and to set it with a special exit-reason while restoring the backup. Initially, those VMs will be reported with Unknown status.

Positive flow: the VMs are detected either on the original host or on any other host; they will be handled and updated accordingly.

Negative flow: the VMs are not reported on any host (the host they run on is non-responsive or the host has been rebooted). In that case, for 5 minutes after the engine starts these VMs are reported back to clients with Unknown status and the user cannot do anything with them. After 5 minutes these VMs are reported as Down. The user can then start them (it is the user's responsibility not to start such a VM if it may be running on a different host).

Simone, we discussed this as a possible solution for bz 1419649 - would you be able to adjust the restore process?
With the posted patch, the logic after restoring a backup should be: if a VM is highly available (auto_startup='t') and is not set with a VM lease (lease_sd_id IS NULL), then set it to Down (status=0) with Unknown exit_status (exit_status=2) and Unknown exit_reason (exit_reason=-1):

UPDATE vm_dynamic
   SET status=0, exit_status=2, exit_reason=-1
 WHERE vm_guid IN (SELECT vm_guid
                     FROM vm_static
                    WHERE auto_startup='t'
                      AND lease_sd_id IS NULL);
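For example, the set of VMs that the above UPDATE would touch can be previewed first (a hypothetical check, assuming the same vm_static/vm_dynamic schema as above):

-- Hypothetical preview: HA VMs without a VM lease, i.e. the rows the UPDATE
-- above would set to Down with Unknown exit_status/exit_reason.
SELECT s.vm_name, d.status, d.exit_status, d.exit_reason
  FROM vm_static s
  JOIN vm_dynamic d ON d.vm_guid = s.vm_guid
 WHERE s.auto_startup = 't'
   AND s.lease_sd_id IS NULL;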
Why cover just the HA VMs? In theory the user could face the same issue if, just after a restore, he explicitly tries to start a non-HA VM that is actually still running somewhere.
(In reply to Simone Tiraboschi from comment #5)
> Why cover just the HA VMs? In theory the user could face the same issue if,
> just after a restore, he explicitly tries to start a non-HA VM that is
> actually still running somewhere.

That's true, but we should look at it in the broader scope. The ideal solution for that would probably be to use VM leases; when that feature is complete, users can use it for all their HA VMs and we won't need such defensive handling.

In light of the ability to avoid this problem with VM leases, and since the probability of having an HA VM running on a non-responsive host after restoring a backup is extremely low, we would prefer to concentrate on the most important and painful issue (which is also what happened in this particular case): the automatic restart of HA VMs.

We are actually considering changing the solution described in comment 3 so that the VM won't be reported with status Unknown to clients and the user won't be blocked from running the VM in the first 5 minutes after engine startup. It may well be over-engineering. I would suggest starting only with HA VMs and addressing the automatic restart of the VM; that would most probably be enough for any real-world case.
(In reply to Arik from comment #6)
> the probability of having an HA VM running on a non-responsive host after
> restoring a backup is extremely low,

Sorry to be a party killer, but the above problem reproduces with all hosts being responsive; there wasn't any non-responsive host, either in the customer's report or in my reproduction of the problem.
(In reply to Julio Entrena Perez from comment #7)
> (In reply to Arik from comment #6)
> Sorry to be a party killer, but the above problem reproduces with all hosts
> being responsive; there wasn't any non-responsive host, either in the
> customer's report or in my reproduction of the problem.

Right, that's exactly my point - in 99.9% of the cases the hosts will be responsive, so we can introduce the simple solution described above for that scenario rather than something more complicated.
ok, ovirt-engine-4.2.0-0.0.master.20171112130303.git8bc889c.el7.centos.noarch / ovirt-engine-tools-backup-4.2.0-0.0.master.20171112130303.git8bc889c.el7.centos.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488