Bug 1446055 - [downstream clone - 4.1.3] HA VMs running in two hosts at a time after restoring backup of RHV-M
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.0.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.1.3
Target Release: ---
Assignee: Arik
QA Contact: Jiri Belka
URL:
Whiteboard:
Depends On: 1441322
Blocks:
 
Reported: 2017-04-27 07:46 UTC by rhev-integ
Modified: 2020-12-14 08:34 UTC
CC List: 20 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, after restoring from a backup, the Manager occasionally attempted to automatically start an already running virtual machine again, because the Manager believed the virtual machine was down; this caused a second instance of the virtual machine to start on a different host. Now, after restoring from a backup, the virtual machine status is set to Down and the exit status is set to Unknown. This status is updated as soon as the virtual machine runs, which means the automatic restart of the virtual machine is no longer attempted.
Clone Of: 1441322
Environment:
Last Closed: 2017-07-06 07:30:42 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1450835 0 high CLOSED [Docs][Admin][SHE] Add a warning that backup and restore is supported via the engine-backup tool only, and 3rd party too... 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 5283211 0 None None None 2020-08-03 08:43:36 UTC
Red Hat Product Errata RHEA-2017:1692 0 normal SHIPPED_LIVE Red Hat Virtualization Manager (ovirt-engine) 4.1.3 2017-07-06 11:24:35 UTC
oVirt gerrit 75510 0 master MERGED core: add vm exit reason of type unknown 2020-08-03 07:41:44 UTC
oVirt gerrit 75904 0 ovirt-engine-4.1 MERGED core: add vm exit reason of type unknown 2020-08-03 07:41:44 UTC
oVirt gerrit 76166 0 ovirt-engine-4.1 MERGED restore: forcing status on HA VMs at restore time 2020-08-03 07:41:44 UTC
oVirt gerrit 76867 0 ovirt-engine-4.1 MERGED core: fix the adjustment of ha vms at restore time 2020-08-03 07:41:44 UTC

Internal Links: 1450835

Description rhev-integ 2017-04-27 07:46:49 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1441322 +++
======================================================================

Description of problem:
When a backup of RHV-M is restored while VMs flagged as HA are running on different hosts than they were when the backup was taken, a race condition results in RHV-M starting an already running VM on the host where it was running when the backup was taken.

Version-Release number of selected component (if applicable):
ovirt-engine-4.0.6.3-0.1.el7ev

How reproducible:
Very frequently (so far only 2 attempts, both resulting in a successful reproduction)

Steps to Reproduce:
1. Have 2 VMs flagged as HA running on host 1.
2. Shut down RHV-M.
3. Either take a snapshot of RHV-M or make a backup.
4. Start RHV-M again.
5. Live migrate both VMs to host 2.
6. Shut down RHV-M.
7. Restore the snapshot or the backup taken in step 3.
8. Start RHV-M again.

Actual results:
RHV-M does not find the VMs running on host 1, where it expected them to be, and starts recovery of (one or more) VMs before noticing that the VMs are running on the other host.

Expected results:
RHV-M does not start recovery of HA VMs until it has the state of all hosts.

Additional info:
This affects the RHEV-M 3.6 to RHV-M 4.0 upgrade: once RHV-M 4.0 has been started with the backup of the 3.6 RHEV-M, it is no longer safe to stop RHV-M 4.0 and bring RHEV-M 3.6 back as a rollback strategy if something goes wrong.
Therefore this should be fixed in RHEV-M 3.6 too.

(Originally by Julio Entrena Perez)

Comment 3 rhev-integ 2017-04-27 07:47:05 UTC
*** Bug 1419649 has been marked as a duplicate of this bug. ***

(Originally by Arik Hadas)

Comment 4 rhev-integ 2017-04-27 07:47:11 UTC
The best solution, IMHO, would be to set the VM as Down (this clears its run_on_vds) and to give it a special exit reason while restoring the backup. Initially, those VMs will be reported with Unknown status.

Positive flow: the VMs are detected either on the original host or on any other host, and they are handled and updated accordingly.

Negative flow: the VMs are not reported on any host (the host they ran on is non-responsive or has been rebooted). For 5 minutes after the engine starts, these VMs are reported to clients with Unknown status and the user cannot do anything with them. After 5 minutes these VMs are reported as Down. The user can then start them (it is the user's responsibility not to start such a VM if it may be running on a different host).

Simone, we discussed this as a possible solution for bz 1419649 - would you be able to adjust the restore process?

(Originally by Arik Hadas)

Comment 5 rhev-integ 2017-04-27 07:47:18 UTC
With the posted patch, the logic after restoring a backup should be:
if a VM is highly available (auto_startup='t') and has no VM lease (lease_sd_id IS NULL), then set it to Down (status=0) with Unknown exit_status (exit_status=2) and Unknown exit_reason (exit_reason=-1):

UPDATE vm_dynamic
SET status=0, exit_status=2, exit_reason=-1
WHERE vm_guid IN
       (SELECT vm_guid
        FROM vm_static
        WHERE auto_startup='t' AND lease_sd_id IS NULL);

(Originally by Arik Hadas)
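For illustration, here is a minimal sketch of how the set of affected VMs could be previewed before applying the update above. It assumes direct psql access to the engine database as the engine DB user (the same invocation style the restore script uses in comment 29); the join between vm_dynamic and vm_static on vm_guid and the selected columns are taken from the queries and fields discussed in this bug.

~~~
# Hypothetical manual check: list HA VMs without a VM lease, together with their
# current status, exit status/reason, and the host they are recorded as running on.
psql -w -U engine -h localhost -p 5432 engine -c "
    SELECT d.vm_guid, d.status, d.exit_status, d.exit_reason, d.run_on_vds
    FROM vm_dynamic d
    JOIN vm_static s ON s.vm_guid = d.vm_guid
    WHERE s.auto_startup = 't' AND s.lease_sd_id IS NULL;"
~~~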

Comment 6 rhev-integ 2017-04-27 07:47:24 UTC
Why cover just the HA VMs?
In theory the user could face the same issue if, just after a restore, he explicitly tries to start a non-HA VM that is in fact already running somewhere.

(Originally by Simone Tiraboschi)

Comment 7 rhev-integ 2017-04-27 07:47:30 UTC
(In reply to Simone Tiraboschi from comment #5)
> Why cover just the HA VMs?
> In theory the user could face the same issue if, just after a restore, he
> explicitly tries to start a non-HA VM that is in fact already running somewhere.

That's true, but we should look at it in a broader scope.
The ideal solution would probably be to use VM leases; when that feature is complete, users can use it for all their HA VMs and we won't need such defensive handling.
Given that VM leases can avoid this problem, and that the probability of having an HA VM running on a non-responsive host after restoring a backup is extremely low, we would prefer to concentrate on the most important and painful issue (which is also what happened in this particular case): the automatic restart of HA VMs.
We are actually thinking of changing the solution described in comment 3 so that the VM is not reported with Unknown status to clients and the user is not blocked from running the VM in the first 5 minutes after engine startup - that part may well be over-engineering.
I would suggest starting only with HA VMs and addressing the automatic restart of the VM; that would most probably be enough for any real-world case.

(Originally by Arik Hadas)

Comment 8 rhev-integ 2017-04-27 07:47:37 UTC
(In reply to Arik from comment #6)

> the
> probability of having an HA VM running on a non-responsive host after
> restoring a backup is extremely low,

Sorry to be a party killer, but the above problem reproduces with all hosts responsive; there wasn't any non-responsive host either in the customer's report or in my reproduction of the problem.

(Originally by Julio Entrena Perez)

Comment 9 rhev-integ 2017-04-27 07:47:43 UTC
(In reply to Julio Entrena Perez from comment #7)
> (In reply to Arik from comment #6)
> Sorry to be a party killer, but the above problem reproduces with all hosts
> responsive; there wasn't any non-responsive host either in the customer's
> report or in my reproduction of the problem.

Right, that's exactly my point: in 99.9% of cases the hosts will be responsive, so we can introduce the simple solution described above for that scenario rather than something more complicated.

(Originally by Arik Hadas)

Comment 14 Jiri Belka 2017-05-11 13:43:08 UTC
fail, ovirt-engine-4.1.2.1-0.1.el7.noarch

2017-05-11 15:37:32,071+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler2) [] VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) was unexpectedly detected as 'Up' on VDS '351b4656-68a1-4a4b-82e8-2f11855b4f1f'(dell-r210ii-04) (expected on '2320e034-3d7f-4a6a-881f-47bc3091da91')
^^ the engine complains that the VM is unexpectedly running, after it itself started test1 a second time...

The HA VMs were running on host13 at the time of the snapshot, then were migrated to host4 while the engine was down.

# ps -eo comm,start_time | grep qemu
qemu-kvm        15:15
qemu-kvm        15:15

After previewing the snapshot and starting the engine, one of the HA VMs is running twice, on both nodes:

# ps -eo comm,start_time | grep qemu-kvm
qemu-kvm        15:27

2017-05-11 15:27:18,410+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler3) [] Fetched 0 VMs from VDS '2320e034-3d7f-4a6a-881f-47bc3091da91'
^^ 0 VMs
2017-05-11 15:27:18,471+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [] VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) is running in db and not running on VDS '2320e034-3d7f-4a6a-881f-47bc3091da91'(dell-r210ii-13)
2017-05-11 15:27:18,539+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler9) [] Fetched 2 VMs from VDS '351b4656-68a1-4a4b-82e8-2f11855b4f1f'
^^ 2 VMs
2017-05-11 15:27:18,559+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler9) [] VM 'b44f645a-12b9-4252-b36e-b58126a5079d'(test2) was unexpectedly detected as 'Up' on VDS '351b4656-68a1-4a4b-82e8-2f11855b4f1f'(dell-r210ii-04) (expected on '2320e034-3d7f-4a6a-881f-47bc3091da91')
2017-05-11 15:27:18,578+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [] add VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) to HA rerun treatment
2017-05-11 15:27:19,311+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [] EVENT_ID: HA_VM_FAILED(9,602), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM test1 failed. It will be restarted automatically.
^^ here we fire up test1 again
2017-05-11 15:27:19,310+02 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler9) [] VMs initialization finished for Host: 'dell-r210ii-04:351b4656-68a1-4a4b-82e8-2f11855b4f1f'
2017-05-11 15:27:19,311+02 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler3) [] Highly Available VM went down. Attempting to restart. VM Name 'test1', VM Id '6c1c6b66-7da0-4628-a05a-985eea12af15'
2017-05-11 15:27:19,321+02 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler3) [] VMs initialization finished for Host: 'dell-r210ii-13:2320e034-3d7f-4a6a-881f-47bc3091da91'


qemu     26779  0.8  0.6 2655920 48116 ?       Sl   15:15   0:10 /usr/libexec/qemu-kvm -name guest=test1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-8-test1/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu SandyBridge -m size=2097152k,slots=16,maxmem=8388608k -realtime mlock=off -smp 2,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-1,mem=2048 -uuid 6c1c6b66-7da0-4628-a05a-985eea12af15 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.3-1.1.el7,serial=4C4C4544-0034-5310-8052-B3C04F4A354A,uuid=6c1c6b66-7da0-4628-a05a-985eea12af15 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-8-test1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-05-11T13:15:46,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000001-0001-0001-0001-00000000017c/d93e6782-f200-4e5e-9713-2a53ceca3c49/images/1c60424e-e3a3-44e3-bde0-0e51b8121287/0c8aab4b-40cd-41d3-811a-aa6055d7a534,format=raw,if=none,id=drive-scsi0-0-0-1,serial=1c60424e-e3a3-44e3-bde0-0e51b8121287,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1,bootindex=1 -netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e1:3f:00,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5901,addr=10.34.63.223,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on

qemu     15528  0.8  0.6 2638576 48744 ?       Sl   15:27   0:04 /usr/libexec/qemu-kvm -name guest=test1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-15-test1/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu SandyBridge -m size=2097152k,slots=16,maxmem=8388608k -realtime mlock=off -smp 2,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-1,mem=2048 -uuid 6c1c6b66-7da0-4628-a05a-985eea12af15 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.3-1.1.el7,serial=4C4C4544-0034-5310-8052-B3C04F4A354A,uuid=6c1c6b66-7da0-4628-a05a-985eea12af15 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-15-test1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-05-11T13:27:20,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000001-0001-0001-0001-00000000017c/d93e6782-f200-4e5e-9713-2a53ceca3c49/images/1c60424e-e3a3-44e3-bde0-0e51b8121287/0c8aab4b-40cd-41d3-811a-aa6055d7a534,format=raw,if=none,id=drive-scsi0-0-0-1,serial=1c60424e-e3a3-44e3-bde0-0e51b8121287,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e1:3f:00,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5900,addr=10.34.62.205,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on
root     16305  0.0  0.0 112652   960 pts/0    S+   15:37   0:00 grep --color=auto qemu-kvm

Comment 16 Arik 2017-05-14 12:09:17 UTC
(In reply to Jiri Belka from comment #14)
It seems that the database was not modified as it should have been; the engine knew, when it was started, that the VM was running on host13:
2017-05-11 15:27:18,471+02 VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) is running in db and not running on VDS '2320e034-3d7f-4a6a-881f-47bc3091da91'(dell-r210ii-13)

The script should have moved the VM to Down and cleared its run_on_vds field.
Did you restore the backup using the script provided for this? Did you see "- Resetting HA VM status" in the output?

Comment 17 Jiri Belka 2017-05-15 08:10:12 UTC
(In reply to Arik from comment #16)
> (In reply to Jiri Belka from comment #14)
> It seems that the database was not modified as it should have been; the
> engine knew, when it was started, that the VM was running on host13:
> 2017-05-11 15:27:18,471+02 VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1)
> is running in db and not running on VDS
> '2320e034-3d7f-4a6a-881f-47bc3091da91'(dell-r210ii-13)
> 
> The script should have moved the VM to Down and cleared its run_on_vds field.
> Did you restore the backup using the script provided for this? Did you see
> "- Resetting HA VM status" in the output?

I did:
- run the env with both HA running on same host
- stop engine
- make snapshot of the engine VM with ram
- start engine
- move both HA to other host
- poweroff the engine VM
- preview the snapshot with ram
- start engine

Thus no backup/restore.

Comment 18 Arik 2017-05-15 08:30:15 UTC
(In reply to Jiri Belka from comment #17) 
> I did:
> - run the env with both HA running on same host
> - stop engine
> - make snapshot of the engine VM with ram
> - start engine
> - move both HA to other host
> - poweroff the engine VM
> - preview the snapshot with ram
> - start engine
> 
> Thus no backup/restore.

I see. The fix is specific to the reported scenario, in which the supported procedure for restoring an engine backup is followed. The process you describe is not covered by this fix - but that's a different issue.

The verification of this bug should include making a backup of the engine and restoring that backup using the provided scripts.

Comment 19 Julio Entrena Perez 2017-05-15 08:44:00 UTC
(In reply to Arik from comment #18)

> The verification of this bug should include making a backup of the engine
> and restoring that backup using the provided scripts.

Some live migrations should happen after taking the backup and before restoring it, to ensure that the HA VMs are no longer running on the same host they were on when the backup was taken.
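For reference, a rough outline of the backup/restore part of that verification on the Manager machine, assuming the standard engine-backup tool (the file names are placeholders and the exact options may differ between versions; the live migrations in between are done from the Administration Portal or the REST API, not from this shell):

~~~
# 1. With the HA VMs running on host 1, take a backup of the engine.
engine-backup --mode=backup --file=/root/engine-backup.tar.bz2 --log=/root/engine-backup.log

# 2. Live migrate the HA VMs to host 2 (UI/API), then stop the engine.
systemctl stop ovirt-engine

# 3. Restore the backup taken in step 1 and start the engine again.
#    (Restoring over an existing environment typically requires running engine-cleanup
#    first; see the engine-backup documentation for the exact procedure and options.)
engine-backup --mode=restore --file=/root/engine-backup.tar.bz2 --log=/root/engine-restore.log
systemctl start ovirt-engine
~~~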

Comment 20 Jiri Belka 2017-05-15 09:22:09 UTC
(In reply to Arik from comment #18)
> (In reply to Jiri Belka from comment #17) 
> > I did:
> > - run the env with both HA running on same host
> > - stop engine
> > - make snapshot of the engine VM with ram
> > - start engine
> > - move both HA to other host
> > - poweroff the engine VM
> > - preview the snapshot with ram
> > - start engine
> > 
> > Thus no backup/restore.
> 
> I see. The fix is specific to the reported scenario, in which the supported
> procedure for restoring an engine backup is followed. The process you
> describe is not covered by this fix - but that's a different issue.
> 
> The verification of this bug should include making a backup of the engine
> and restoring that backup using the provided scripts.

OK, I filed a separate BZ for the above scenario: https://bugzilla.redhat.com/show_bug.cgi?id=1450835

Comment 23 Michal Skrivanek 2017-05-15 12:36:53 UTC
The fix is specific to engine DB restore; for cases where that is not performed, it is not going to work.
I suggest moving back to ON_QA, as the description talks about "when a backup of RHV-M is restored". That should work.

Comment 24 Jiri Belka 2017-05-15 14:01:23 UTC
ovirt-engine-4.1.2.2-0.1.el7.noarch
ovirt-engine-tools-backup-4.1.2.2-0.1.el7.noarch

I re-tested, now with the restore scenario, and I do not see HA VM(s) running twice after the restore.

But there is something odd in the current solution - I see events which suggest something is wrong inside the engine:

~~~
# tail -f /var/log/ovirt-engine/engine.log | grep ERROR
2017-05-15 15:42:30,952+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [] EVENT_ID: HA_VM_FAILED(9,602), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM test2 failed. It will be restarted automatically.
2017-05-15 15:42:31,216+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [] EVENT_ID: HA_VM_FAILED(9,602), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM test1 failed. It will be restarted automatically.
2017-05-15 15:42:31,219+02 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler8) [] Failed to invoke scheduled method poll: null
~~~

IMO, if the current solution were a real fix, such odd events should not be present - what do these events tell us? Keep in mind the HA VMs are running fine, although they run on a different host than at the time of the backup.

IMO this BZ could be considered verified if it is understood as a workaround, and the real issue and the events above would then be addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1450835.

Comment 27 Michal Skrivanek 2017-05-15 16:09:00 UTC
(In reply to Jiri Belka from comment #24)
> ovirt-engine-4.1.2.2-0.1.el7.noarch
> ovirt-engine-tools-backup-4.1.2.2-0.1.el7.noarch
> 
> I re-tested, now with the restore scenario, and I do not see HA VM(s)
> running twice after the restore.
...
> IMO, if the current solution were a real fix, such odd events should not be
> present - what do these events tell us? Keep in mind the HA VMs are running
> fine, although they run on a different host than at the time of the backup.

It doesn't look like it's working correctly. Can you please add full logs of the restore, and ideally also the content of the DB showing the status and exit reason of the VM - it doesn't look like it was set correctly by the restore tool. Did you see the "Resetting HA VM status" log line?

Also, can anyone please clarify the "other" scenario with restore from snapshot? How is that even allowed?
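As a sketch, a check along these lines would show whether the restore tool actually reset the VM; it reuses the psql invocation and the vm_dynamic columns quoted elsewhere in this bug, and the VM id is the test1 guid from comment 14:

~~~
# Hypothetical check of test1's row after the restore: status should be 0 (Down),
# exit_status 2 (Unknown), exit_reason -1 (Unknown), and ideally run_on_vds NULL.
psql -w -U engine -h localhost -p 5432 engine -c "
    SELECT status, exit_status, exit_reason, run_on_vds
    FROM vm_dynamic
    WHERE vm_guid = '6c1c6b66-7da0-4628-a05a-985eea12af15';"
~~~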

Comment 29 Jiri Belka 2017-05-16 07:08:04 UTC
(In reply to Michal Skrivanek from comment #27)
> (In reply to Jiri Belka from comment #24)
> > ovirt-engine-4.1.2.2-0.1.el7.noarch
> > ovirt-engine-tools-backup-4.1.2.2-0.1.el7.noarch
> > 
> > I re-tested, now with the restore scenario, and I do not see HA VM(s)
> > running twice after the restore.
> ...
> > IMO, if the current solution were a real fix, such odd events should not be
> > present - what do these events tell us? Keep in mind the HA VMs are running
> > fine, although they run on a different host than at the time of the backup.
> 
> It doesn't look like it's working correctly. Can you please add full logs of
> the restore, and ideally also the content of the DB showing the status and
> exit reason of the VM - it doesn't look like it was set correctly by the
> restore tool. Did you see the "Resetting HA VM status" log line?

Is the engine.log in https://bugzilla.redhat.com/attachment.cgi?id=1279006 not enough? I can redo the whole flow if needed... (please state exactly what you need - DEBUG logs, a DB backup?)

2017-05-15 15:38:02 12951: OUTPUT:   - Resetting HA VM status
2017-05-15 15:38:02 12951: pg_cmd running: psql -w -U engine -h localhost -p 5432  engine -t -c UPDATE vm_dynamic
          SET status=0, exit_status=2, exit_reason=-1
          WHERE vm_guid IN
             (SELECT vm_guid
              FROM vm_static
              WHERE auto_startup='t' AND lease_sd_id is NULL)
UPDATE 2

> Also, can anyone please clarify the "other" scenario with restore from
> snapshot? How is that even allowed?

discussed here https://bugzilla.redhat.com/show_bug.cgi?id=1450835

Comment 30 Arik 2017-05-16 08:28:07 UTC
(In reply to Jiri Belka from comment #29)
> Is the engine.log in https://bugzilla.redhat.com/attachment.cgi?id=1279006
> not enough? I can redo the whole flow if needed... (please state exactly
> what you need - DEBUG logs, a DB backup?)
> 
> 2017-05-15 15:38:02 12951: OUTPUT:   - Resetting HA VM status
> 2017-05-15 15:38:02 12951: pg_cmd running: psql -w -U engine -h localhost -p
> 5432  engine -t -c UPDATE vm_dynamic
>           SET status=0, exit_status=2, exit_reason=-1
>           WHERE vm_guid IN
>              (SELECT vm_guid
>               FROM vm_static
>               WHERE auto_startup='t' AND lease_sd_id is NULL)
> UPDATE 2
> 

Alright, so now it's clear.

I'll modify that SQL query; it doesn't clear the run_on_vds field - that's my bad.
It is not that critical, though: the monitoring still expects the VM to be running on the wrong host, and when it finds that the VM is not running there, it issues a log message and an audit log event (those are the odd events mentioned above). But the mechanism that restarts highly available VMs checks their status and exit status, and since the exit status is not 'Error' (but 'Unknown'), it won't try to restart them - so the split brain is actually prevented.
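A corrected restore-time update would presumably also clear that field. A sketch under that assumption (the only change from the query shown in comment 29 is the added run_on_vds = NULL):

~~~
UPDATE vm_dynamic
SET status = 0, exit_status = 2, exit_reason = -1, run_on_vds = NULL
WHERE vm_guid IN
    (SELECT vm_guid
     FROM vm_static
     WHERE auto_startup = 't' AND lease_sd_id IS NULL);
~~~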

Not sure about the state of 4.1's z-streams, but I think this issue should not block anything.

Comment 32 Jiri Belka 2017-06-16 09:36:00 UTC
ok, ovirt-engine-4.1.3.2-0.1.el7.noarch

no more starting of a "3rd" HA VM, the HA VMs run fine
no more odd events (comment 24)

Comment 34 errata-xmlrpc 2017-07-06 07:30:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1692

