Bug 1446055
Summary: | [downstream clone - 4.1.3] HA VMs running in two hosts at a time after restoring backup of RHV-M | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | rhev-integ
Component: | ovirt-engine | Assignee: | Arik <ahadas>
Status: | CLOSED ERRATA | QA Contact: | Jiri Belka <jbelka>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.0.6 | CC: | ahadas, bmcclain, jbelka, jentrena, lsurette, lveyde, melewis, michal.skrivanek, mkalinin, mperina, nashok, pstehlik, rbalakri, Rhev-m-bugs, srevivo, stirabos, tjelinek, usurse, ykaul, ylavi
Target Milestone: | ovirt-4.1.3 | Keywords: | ZStream
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Previously, after restoring from a backup, the Manager occasionally believed that an already running virtual machine was down and automatically attempted to start it again, which caused a second instance of the virtual machine to start on a different host. Now, after restoring from a backup, the virtual machine status is set to Down and the exit status to Unknown; this status is updated as soon as the virtual machine is detected as running, so the automatic restart is no longer attempted.
|
Story Points: | --- | |
Clone Of: | 1441322 | Environment: |
Last Closed: | 2017-07-06 07:30:42 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1441322 | |
Bug Blocks: | | |
Description
rhev-integ
2017-04-27 07:46:49 UTC
*** Bug 1419649 has been marked as a duplicate of this bug. ***

(Originally by Arik Hadas)

Best solution imho would be to set the VM as Down (this clears its run_on_vds) and to set it with a special exit-reason while restoring the backup. Initially, those VMs will be reported with Unknown status.

Positive flow: the VMs are detected either on the original host or on any other host, and are handled and updated accordingly.

Negative flow: the VMs are not reported on any host (the host they ran on is non-responsive or has been rebooted). Then, for 5 minutes after the engine starts, these VMs are reported back to clients with Unknown status and the user cannot do anything with them. After 5 minutes these VMs are reported as Down. The user can then start them (it is the user's responsibility not to start such a VM if it may be running on a different host).

Simone, we discussed this as a possible solution for bz 1419649 - would you be able to adjust the restore process?

(Originally by Arik Hadas)

With the posted patch, the logic after restoring a backup should be: if a VM is highly available (auto_startup='t') and has no vm-lease (lease_sd_id IS NULL), set it to Down (status=0) with Unknown exit_status (exit_status=2) and Unknown exit_reason (exit_reason=-1):

~~~
UPDATE vm_dynamic
SET status=0, exit_status=2, exit_reason=-1
WHERE vm_guid IN
  (SELECT vm_guid
   FROM vm_static
   WHERE auto_startup='t' AND lease_sd_id IS NULL);
~~~

(Originally by Arik Hadas)
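For reference, the effect of this reset can be checked directly in the engine database. A minimal sketch, assuming the same psql connection parameters that appear in the restore log quoted later in this bug:

~~~
# Hypothetical post-restore check: list HA VMs without a lease together with
# their dynamic state. After a successful restore they should show status=0
# (Down), exit_status=2 (Unknown) and exit_reason=-1 (Unknown).
psql -w -U engine -h localhost -p 5432 engine -t -c "
  SELECT vm_guid, status, exit_status, exit_reason, run_on_vds
  FROM vm_dynamic
  WHERE vm_guid IN (SELECT vm_guid
                    FROM vm_static
                    WHERE auto_startup = 't' AND lease_sd_id IS NULL);"
~~~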
Why covering just the HA VMs? In theory the user could face the same issue if, just after a restore, he explicitly tries to start a non-HA VM that is instead running somewhere.

(Originally by Simone Tiraboschi)

(In reply to Simone Tiraboschi from comment #5)
> Why covering just the HA VMs?
> In theory the user could face the same issue if, just after a restore, he
> explicitly tries to start a non-HA VM that is instead running somewhere.

That's true, but we should look at it in the broader scope. The ideal solution would probably be to use vm-leases; when that feature is complete, users can use it for all their HA VMs and we won't need such defensive handling. Given that the problem can be avoided with vm-leases, and that the probability of having an HA VM running on a non-responsive host after restoring a backup is extremely low, we would prefer to concentrate on the most important and painful issue (which is also what happened in this particular case): the automatic restart of HA VMs. We are actually thinking of changing the solution described in comment 3 so that the VM is not reported with status Unknown to clients and the user is not blocked from running the VM in the first 5 minutes after engine startup - the original proposal may well be over-engineering. I suggest starting with HA VMs only and addressing the automatic restart of the VM; that would most probably be enough for any real-world case.

(Originally by Arik Hadas)

(In reply to Arik from comment #6)
> the probability of having an HA VM running on a non-responsive host after
> restoring a backup is extremely low

Sorry to be a party killer, but the above problem reproduces with all hosts responsive; there wasn't any non-responsive host in the customer's report nor in my reproduction of the problem.

(Originally by Julio Entrena Perez)

(In reply to Julio Entrena Perez from comment #7)
> Sorry to be a party killer, but the above problem reproduces with all hosts
> responsive; there wasn't any non-responsive host in the customer's report
> nor in my reproduction of the problem.

Right, that's exactly my point - in 99.9% of cases the hosts will be responsive, so we can introduce the simple solution described above for that scenario rather than something more complicated.

(Originally by Arik Hadas)

fail, ovirt-engine-4.1.2.1-0.1.el7.noarch

~~~
2017-05-11 15:37:32,071+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler2) [] VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) was unexpectedly detected as 'Up' on VDS '351b4656-68a1-4a4b-82e8-2f11855b4f1f'(dell-r210ii-04) (expected on '2320e034-3d7f-4a6a-881f-47bc3091da91')
^^ the engine complains that test1 is running - after it itself started that VM a second time...
~~~

The HA VMs were running on host13 at the time of the snapshot, then migrated to host04 while the engine was down.

~~~
# ps -eo comm,start_time | grep qemu
qemu-kvm 15:15
qemu-kvm 15:15
~~~

After previewing the snapshot and starting the engine, one of the HA VMs is running twice, on both nodes:

~~~
# ps -eo comm,start_time | grep qemu-kvm
qemu-kvm 15:27
~~~

~~~
2017-05-11 15:27:18,410+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler3) [] Fetched 0 VMs from VDS '2320e034-3d7f-4a6a-881f-47bc3091da91'
^^ 0 VMs
2017-05-11 15:27:18,471+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [] VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) is running in db and not running on VDS '2320e034-3d7f-4a6a-881f-47bc3091da91'(dell-r210ii-13)
2017-05-11 15:27:18,539+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler9) [] Fetched 2 VMs from VDS '351b4656-68a1-4a4b-82e8-2f11855b4f1f'
^^ 2 VMs
2017-05-11 15:27:18,559+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler9) [] VM 'b44f645a-12b9-4252-b36e-b58126a5079d'(test2) was unexpectedly detected as 'Up' on VDS '351b4656-68a1-4a4b-82e8-2f11855b4f1f'(dell-r210ii-04) (expected on '2320e034-3d7f-4a6a-881f-47bc3091da91')
2017-05-11 15:27:18,578+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [] add VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) to HA rerun treatment
2017-05-11 15:27:19,311+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [] EVENT_ID: HA_VM_FAILED(9,602), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM test1 failed. It will be restarted automatically.
^^ here the engine fires up test1 again
2017-05-11 15:27:19,310+02 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler9) [] VMs initialization finished for Host: 'dell-r210ii-04:351b4656-68a1-4a4b-82e8-2f11855b4f1f'
2017-05-11 15:27:19,311+02 INFO [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler3) [] Highly Available VM went down. Attempting to restart. VM Name 'test1', VM Id '6c1c6b66-7da0-4628-a05a-985eea12af15'
2017-05-11 15:27:19,321+02 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler3) [] VMs initialization finished for Host: 'dell-r210ii-13:2320e034-3d7f-4a6a-881f-47bc3091da91'
~~~

Both test1 processes (note the identical -uuid and the 15:15 vs 15:27 start times):

~~~
qemu 26779 0.8 0.6 2655920 48116 ?
Sl 15:15 0:10 /usr/libexec/qemu-kvm -name guest=test1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-8-test1/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu SandyBridge -m size=2097152k,slots=16,maxmem=8388608k -realtime mlock=off -smp 2,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-1,mem=2048 -uuid 6c1c6b66-7da0-4628-a05a-985eea12af15 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.3-1.1.el7,serial=4C4C4544-0034-5310-8052-B3C04F4A354A,uuid=6c1c6b66-7da0-4628-a05a-985eea12af15 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-8-test1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-05-11T13:15:46,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000001-0001-0001-0001-00000000017c/d93e6782-f200-4e5e-9713-2a53ceca3c49/images/1c60424e-e3a3-44e3-bde0-0e51b8121287/0c8aab4b-40cd-41d3-811a-aa6055d7a534,format=raw,if=none,id=drive-scsi0-0-0-1,serial=1c60424e-e3a3-44e3-bde0-0e51b8121287,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1,bootindex=1 -netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e1:3f:00,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5901,addr=10.34.63.223,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on qemu 15528 0.8 0.6 2638576 48744 ? 
Sl 15:27 0:04 /usr/libexec/qemu-kvm -name guest=test1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-15-test1/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu SandyBridge -m size=2097152k,slots=16,maxmem=8388608k -realtime mlock=off -smp 2,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-1,mem=2048 -uuid 6c1c6b66-7da0-4628-a05a-985eea12af15 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.3-1.1.el7,serial=4C4C4544-0034-5310-8052-B3C04F4A354A,uuid=6c1c6b66-7da0-4628-a05a-985eea12af15 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-15-test1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-05-11T13:27:20,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000001-0001-0001-0001-00000000017c/d93e6782-f200-4e5e-9713-2a53ceca3c49/images/1c60424e-e3a3-44e3-bde0-0e51b8121287/0c8aab4b-40cd-41d3-811a-aa6055d7a534,format=raw,if=none,id=drive-scsi0-0-0-1,serial=1c60424e-e3a3-44e3-bde0-0e51b8121287,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e1:3f:00,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5900,addr=10.34.62.205,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on
root 16305 0.0 0.0 112652 960 pts/0 S+ 15:37 0:00 grep --color=auto qemu-kvm
~~~

(In reply to Jiri Belka from comment #14)

It seems that the database was not modified as it should have been; the engine knew when it started that the VM was running on host13:

~~~
2017-05-11 15:27:18,471+02 VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) is running in db and not running on VDS '2320e034-3d7f-4a6a-881f-47bc3091da91'(dell-r210ii-13)
~~~

The script should have moved the VM to Down and cleared its run_on_vds field. Did you restore the backup using the script provided for this?
Did you see "- Resetting HA VM status" in the output? (In reply to Arik from comment #16) > (In reply to Jiri Belka from comment #14) > It seems that the database was not modified as it should have been, the > engine knew when it was started that the VM was running on host13: > 2017-05-11 15:27:18,471+02 VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) > is running in db and not running on VDS > '2320e034-3d7f-4a6a-881f-47bc3091da91'(dell-r210ii-13) > > The script should have moved the VM to down and clear its run_on_vds field. > Did you restore the backup using the script provided for this? Did you see > "- Resetting HA VM status" in the output? I did: - run the env with both HA running on same host - stop engine - make snapshot of the engine VM with ram - start engine - move both HA to other host - poweroff the engine VM - preview the snapshot with ram - start engine Thus no backup/restore. (In reply to Jiri Belka from comment #17) > I did: > - run the env with both HA running on same host > - stop engine > - make snapshot of the engine VM with ram > - start engine > - move both HA to other host > - poweroff the engine VM > - preview the snapshot with ram > - start engine > > Thus no backup/restore. I see. The fix is specific to the reported scenario in which the supported procedure for restoring a backup of the engine is done. The process you describe is not covered by this fix - but that's a different issue. The verification of this bug should include making a backup of the engine and restoring that backup using the provided scripts. (In reply to Arik from comment #18) > The verification of this bug should include making a backup of the engine > and restoring that backup using the provided scripts. Some live migrations should happen after taking the backup and before restoring it to ensure that HA vms are no longer running in the same host that they were when the backup was taken. (In reply to Arik from comment #18) > (In reply to Jiri Belka from comment #17) > > I did: > > - run the env with both HA running on same host > > - stop engine > > - make snapshot of the engine VM with ram > > - start engine > > - move both HA to other host > > - poweroff the engine VM > > - preview the snapshot with ram > > - start engine > > > > Thus no backup/restore. > > I see. The fix is specific to the reported scenario in which the supported > procedure for restoring a backup of the engine is done. The process you > describe is not covered by this fix - but that's a different issue. > > The verification of this bug should include making a backup of the engine > and restoring that backup using the provided scripts. OK, I made separate BZ for above scenarion https://bugzilla.redhat.com/show_bug.cgi?id=1450835 the fix is specific to engine db restore. For cases where this is not performed it's not going to work. I suggest to move back to ON_QA as the description talks about "when a backup of RHV-M is restored". Taht should work. ovirt-engine-4.1.2.2-0.1.el7.noarch ovirt-engine-tools-backup-4.1.2.2-0.1.el7.noarch I re-tested again, now with restore scenario and I do not see HA VM(s) running twice after restore. But... 
(In reply to Arik from comment #18)
> I see. The fix is specific to the reported scenario, in which the supported
> procedure for restoring a backup of the engine is performed. The process you
> describe is not covered by this fix - but that's a different issue.
>
> The verification of this bug should include making a backup of the engine
> and restoring that backup using the provided scripts.

OK, I filed a separate BZ for the above scenario: https://bugzilla.redhat.com/show_bug.cgi?id=1450835

The fix is specific to the engine DB restore; for cases where that is not performed, it is not going to work. I suggest moving back to ON_QA, as the description talks about "when a backup of RHV-M is restored". That should work.

ovirt-engine-4.1.2.2-0.1.el7.noarch
ovirt-engine-tools-backup-4.1.2.2-0.1.el7.noarch

I re-tested, now with the restore scenario, and I do not see HA VM(s) running twice after the restore. But there is something odd in the current solution - I see events which reveal that something is wrong inside the engine:

~~~
# tail -f /var/log/ovirt-engine/engine.log | grep ERROR
2017-05-15 15:42:30,952+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [] EVENT_ID: HA_VM_FAILED(9,602), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM test2 failed. It will be restarted automatically.
2017-05-15 15:42:31,216+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [] EVENT_ID: HA_VM_FAILED(9,602), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM test1 failed. It will be restarted automatically.
2017-05-15 15:42:31,219+02 ERROR [org.ovirt.engine.core.utils.timer.SchedulerUtilQuartzImpl] (DefaultQuartzScheduler8) [] Failed to invoke scheduled method poll: null
~~~

IMO, if the current solution were a real fix, such odd events would not be present - what do these events tell us? Keep in mind the HA VMs are running fine, although on a different host than at the time of the backup. IMO this BZ can be considered verified if the fix is understood as a workaround, and the real issue and the above events are addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1450835.

(In reply to Jiri Belka from comment #24)
> ovirt-engine-4.1.2.2-0.1.el7.noarch
> ovirt-engine-tools-backup-4.1.2.2-0.1.el7.noarch
>
> I re-tested, now with the restore scenario, and I do not see HA VM(s)
> running twice after the restore.
> ...
> IMO, if the current solution were a real fix, such odd events would not be
> present - what do these events tell us? Keep in mind the HA VMs are running
> fine, although on a different host than at the time of the backup.

It doesn't look like it is working correctly. Can you please add full logs of the restore, ideally also the content of the DB showing the status and exit reason of the VM - it doesn't look like they were set correctly by the restore tool. Did you see the "Resetting HA VM status" log line?

Also, can anyone please clarify the "other" scenario with restore from snapshot? How is that even allowed?

(In reply to Michal Skrivanek from comment #27)
> It doesn't look like it is working correctly. Can you please add full logs
> of the restore, ideally also the content of the DB showing the status and
> exit reason of the VM - it doesn't look like they were set correctly by the
> restore tool. Did you see the "Resetting HA VM status" log line?

Isn't the engine.log in https://bugzilla.redhat.com/attachment.cgi?id=1279006 enough? I can redo the whole flow if needed (please state exactly what you need - logs in DEBUG, a DB backup?).

~~~
2017-05-15 15:38:02 12951: OUTPUT: - Resetting HA VM status
2017-05-15 15:38:02 12951: pg_cmd running: psql -w -U engine -h localhost -p 5432 engine -t -c UPDATE vm_dynamic
SET status=0, exit_status=2, exit_reason=-1
WHERE vm_guid IN
(SELECT vm_guid
FROM vm_static
WHERE auto_startup='t' AND lease_sd_id is NULL)
UPDATE 2
~~~

> Also, can anyone please clarify the "other" scenario with restore from
> snapshot? How is that even allowed?

Discussed here: https://bugzilla.redhat.com/show_bug.cgi?id=1450835

(In reply to Jiri Belka from comment #29)
> 2017-05-15 15:38:02 12951: OUTPUT: - Resetting HA VM status
> 2017-05-15 15:38:02 12951: pg_cmd running: psql -w -U engine -h localhost -p
> 5432 engine -t -c UPDATE vm_dynamic
> SET status=0, exit_status=2, exit_reason=-1
> WHERE vm_guid IN
> (SELECT vm_guid
> FROM vm_static
> WHERE auto_startup='t' AND lease_sd_id is NULL)
> UPDATE 2

Alright, so now it's clear. I'll modify that SQL query; it doesn't clear the run_on_vds field - that's my bad. It is not that critical, though: the monitoring still expects the VM to be running on the wrong host, and when it finds that the VM is not running there it issues a log line and an audit-log event (those are the odd events mentioned above). But the mechanism that restarts highly available VMs checks their status and exit status, and since the exit status is not 'Error' (but 'Unknown'), it won't try to restart them - so the split brain is actually prevented. Not sure about the state of the 4.1 z-streams, but I think this issue should not block anything.
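The corrected query itself is not shown in this bug; judging from the comment above, it would presumably also clear run_on_vds, along these lines (a sketch reusing the psql invocation from the restore log; the shipped query may differ):

~~~
# Hypothetical corrected reset, as implied by the comment above -
# run_on_vds=NULL is the added piece, so the monitoring no longer
# expects the VM on a stale host after the restore.
psql -w -U engine -h localhost -p 5432 engine -t -c "
  UPDATE vm_dynamic
  SET status=0, exit_status=2, exit_reason=-1, run_on_vds=NULL
  WHERE vm_guid IN
    (SELECT vm_guid
     FROM vm_static
     WHERE auto_startup='t' AND lease_sd_id IS NULL);"
~~~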
ok, ovirt-engine-4.1.3.2-0.1.el7.noarch

- no more starting of a "3rd" HA VM; the HA VMs run fine
- no more odd events (comment #24)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1692