Description of problem:
When a backup of RHV-M is restored and VMs flagged as HA are running on different hosts than they were when the backup was taken, a race condition results in RHV-M starting an already running VM on the host where it was running when the backup was taken.

Version-Release number of selected component (if applicable):
ovirt-engine-4.0.6.3-0.1.el7ev

How reproducible:
Very frequently (so far only 2 attempts, both resulting in a successful reproduction).

Steps to Reproduce:
1. Have 2 VMs flagged as HA running on host 1.
2. Shut down RHV-M.
3. Either take a snapshot of RHV-M or make a backup.
4. Start RHV-M again.
5. Live migrate both VMs to host 2.
6. Shut down RHV-M.
7. Restore the snapshot or the backup taken in step 3.
8. Start RHV-M again.

Actual results:
RHV-M does not find the VMs running on host 1, where it expected them to be, and starts recovery of one or more VMs before noticing that they are running on the other host.

Expected results:
RHV-M does not start recovery of HA VMs until it has the state of all hosts.

Additional info:
This affects the RHEV-M 3.6 to RHV-M 4.0 upgrade: once RHV-M 4.0 has been started with the backup of the 3.6 RHEV-M, it is no longer safe to stop RHV-M 4.0 and start RHEV-M 3.6 again as a rollback strategy if something goes wrong. Therefore this should be fixed in RHEV-M 3.6 too.
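As an illustration of the stale state the race is based on, below is a minimal query sketch for inspecting which host the restored database still records each HA VM as running on. The table and column names (vm_static, vm_dynamic, vds_static, run_on_vds) follow my understanding of the engine schema and should be treated as assumptions; it would be run against the restored engine database (e.g. via psql) before starting the engine.

-- Hypothetical inspection query: lists HA VMs together with the host the
-- restored database still believes they run on (stale run_on_vds).
SELECT s.vm_name,
       d.status,
       h.vds_name AS recorded_host
  FROM vm_static s
  JOIN vm_dynamic d ON d.vm_guid = s.vm_guid
  LEFT JOIN vds_static h ON h.vds_id = d.run_on_vds
 WHERE s.auto_startup = 't';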
*** Bug 1419649 has been marked as a duplicate of this bug. ***
Best solution imho would be to set the VM as Down (this clears its run_on_vds) and to set it with a special exit-reason while restoring the backup. Initially, those VMs will be reported with Unknown status.

Positive flow: the VMs are detected either on the original host or on any other host; they will be handled and updated accordingly.

Negative flow: the VMs are not reported on any host (the host they run on is non-responsive or the host has been rebooted). In that case, for 5 minutes after the engine starts these VMs are reported back to clients with Unknown status and the user cannot do anything with them. After 5 minutes these VMs are reported as Down. The user can then start them (it is the user's responsibility not to start such a VM if it may be running on a different host).

Simone, we discussed this as a possible solution for bz 1419649 - would you be able to adjust the restore process?
With the posted patch, the logic after restoring a backup should be: if a VM is highly available (auto_startup='t') and is not set with a VM lease (lease_sd_id IS NULL), then set it to Down (status=0) with Unknown exit_status (exit_status=2) and Unknown exit_reason (exit_reason=-1):

UPDATE vm_dynamic
   SET status=0, exit_status=2, exit_reason=-1
 WHERE vm_guid IN (SELECT vm_guid
                     FROM vm_static
                    WHERE auto_startup='t'
                      AND lease_sd_id IS NULL);
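For example, the set of VMs that the above UPDATE would touch can be previewed first (a hypothetical check, assuming the same vm_static/vm_dynamic schema as above):

-- Hypothetical preview: HA VMs without a VM lease, i.e. the rows the UPDATE
-- above would set to Down with Unknown exit_status/exit_reason.
SELECT s.vm_name, d.status, d.exit_status, d.exit_reason
  FROM vm_static s
  JOIN vm_dynamic d ON d.vm_guid = s.vm_guid
 WHERE s.auto_startup = 't'
   AND s.lease_sd_id IS NULL;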
Why cover just the HA VMs? In theory the user could face the same issue if, just after a restore, he explicitly tries to start a non-HA VM that is actually still running somewhere.
(In reply to Simone Tiraboschi from comment #5)
> Why cover just the HA VMs? In theory the user could face the same issue if,
> just after a restore, he explicitly tries to start a non-HA VM that is
> actually still running somewhere.

That's true, but we should look at it in the broader scope. The ideal solution for that would probably be to use VM leases; when that feature is complete, users can use it for all their HA VMs and we won't need such defensive handling.

In light of the ability to avoid this problem with VM leases, and since the probability of having an HA VM running on a non-responsive host after restoring a backup is extremely low, we would prefer to concentrate on the most important and painful issue (which is also what happened in this particular case): the automatic restart of HA VMs.

We are actually considering changing the solution described in comment 3 so that the VM won't be reported with status Unknown to clients and the user won't be blocked from running the VM in the first 5 minutes after engine startup. It may well be over-engineering. I would suggest starting only with HA VMs and addressing the automatic restart of the VM; that would most probably be enough for any real-world case.
(In reply to Arik from comment #6)
> the probability of having an HA VM running on a non-responsive host after
> restoring a backup is extremely low,

Sorry to be a party killer, but the above problem reproduces with all hosts being responsive; there wasn't any non-responsive host, either in the customer's report or in my reproduction of the problem.
(In reply to Julio Entrena Perez from comment #7)
> (In reply to Arik from comment #6)
> Sorry to be a party killer, but the above problem reproduces with all hosts
> being responsive; there wasn't any non-responsive host, either in the
> customer's report or in my reproduction of the problem.

Right, that's exactly my point - in 99.9% of the cases the hosts will be responsive, so we can introduce the simple solution described above for that scenario rather than something more complicated.
ok, ovirt-engine-4.2.0-0.0.master.20171112130303.git8bc889c.el7.centos.noarch / ovirt-engine-tools-backup-4.2.0-0.0.master.20171112130303.git8bc889c.el7.centos.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:1488