Bug 1450835 - [Docs][Admin][SHE] Add a warning that backup and restore is supported via the engine-backup tool only, and 3rd party tools can be used to back up the resulting tarball
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: Documentation
Version: 4.1.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.1.11
Assignee: Avital Pinnick
QA Contact: Billy Burmester
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-15 09:21 UTC by Jiri Belka
Modified: 2019-05-07 12:50 UTC
CC: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-24 07:21:25 UTC
oVirt Team: Docs
Target Upstream Version:
Embargoed:




Links
System ID: Red Hat Bugzilla 1446055
Priority: high
Status: CLOSED
Summary: [downstream clone - 4.1.3] HA VMs running in two hosts at a time after restoring backup of RHV-M
Last Updated: 2021-02-22 00:41:40 UTC

Internal Links: 1446055

Description Jiri Belka 2017-05-15 09:21:37 UTC
Description of problem:

This is a split from BZ1446055: BZ1446055 is about the backup-migrate-restore case, whereas this BZ does not involve the restore step that BZ1446055 tries to solve.

If you run two HA VMs on host 'foo', move the VMs to host 'bar', and then confront the engine with this situation (e.g. stop the engine, snapshot the engine VM, start the engine, move the HA VMs, restore the previously made snapshot), the engine gets confused and starts an HA VM once more, i.e. an HA VM ends up running twice, on two hosts.

~~~ https://bugzilla.redhat.com/show_bug.cgi?id=1446055#c14 ~~~
2017-05-11 15:37:32,071+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler2) [] VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) was unexpectedly detected as 'Up' on VDS '351b4656-68a1-4a4b-82e8-2f11855b4f1f'(dell-r210ii-04) (expected on '2320e034-3d7f-4a6a-881f-47bc3091da91')
^^ the engine complains that test1 is unexpectedly running, after it had itself started the VM a second time...

The HA VMs were running on host13 at the time of the snapshot, then were migrated to host4 while the engine was down.

# ps -eo comm,start_time | grep qemu
qemu-kvm        15:15
qemu-kvm        15:15

After previewing the snapshot and starting the engine, one of the HA VMs is running twice, on both nodes:

# ps -eo comm,start_time | grep qemu-kvm
qemu-kvm        15:27

2017-05-11 15:27:18,410+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler3) [] Fetched 0 VMs from VDS '2320e034-3d7f-4a6a-881f-47bc3091da91'
^^ 0 VMs
2017-05-11 15:27:18,471+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [] VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) is running in db and not running on VDS '2320e034-3d7f-4a6a-881f-47bc3091da91'(dell-r210ii-13)
2017-05-11 15:27:18,539+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler9) [] Fetched 2 VMs from VDS '351b4656-68a1-4a4b-82e8-2f11855b4f1f'
^^ 2 VMs
2017-05-11 15:27:18,559+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler9) [] VM 'b44f645a-12b9-4252-b36e-b58126a5079d'(test2) was unexpectedly detected as 'Up' on VDS '351b4656-68a1-4a4b-82e8-2f11855b4f1f'(dell-r210ii-04) (expected on '2320e034-3d7f-4a6a-881f-47bc3091da91')
2017-05-11 15:27:18,578+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler3) [] add VM '6c1c6b66-7da0-4628-a05a-985eea12af15'(test1) to HA rerun treatment
2017-05-11 15:27:19,311+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler3) [] EVENT_ID: HA_VM_FAILED(9,602), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Highly Available VM test1 failed. It will be restarted automatically.
^^ here the engine fires up test1 again
2017-05-11 15:27:19,310+02 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler9) [] VMs initialization finished for Host: 'dell-r210ii-04:351b4656-68a1-4a4b-82e8-2f11855b4f1f'
2017-05-11 15:27:19,311+02 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (DefaultQuartzScheduler3) [] Highly Available VM went down. Attempting to restart. VM Name 'test1', VM Id '6c1c6b66-7da0-4628-a05a-985eea12af15'
2017-05-11 15:27:19,321+02 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler3) [] VMs initialization finished for Host: 'dell-r210ii-13:2320e034-3d7f-4a6a-881f-47bc3091da91'


qemu     26779  0.8  0.6 2655920 48116 ?       Sl   15:15   0:10 /usr/libexec/qemu-kvm -name guest=test1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-8-test1/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu SandyBridge -m size=2097152k,slots=16,maxmem=8388608k -realtime mlock=off -smp 2,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-1,mem=2048 -uuid 6c1c6b66-7da0-4628-a05a-985eea12af15 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.3-1.1.el7,serial=4C4C4544-0034-5310-8052-B3C04F4A354A,uuid=6c1c6b66-7da0-4628-a05a-985eea12af15 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-8-test1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-05-11T13:15:46,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000001-0001-0001-0001-00000000017c/d93e6782-f200-4e5e-9713-2a53ceca3c49/images/1c60424e-e3a3-44e3-bde0-0e51b8121287/0c8aab4b-40cd-41d3-811a-aa6055d7a534,format=raw,if=none,id=drive-scsi0-0-0-1,serial=1c60424e-e3a3-44e3-bde0-0e51b8121287,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1,bootindex=1 -netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e1:3f:00,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5901,addr=10.34.63.223,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on

qemu     15528  0.8  0.6 2638576 48744 ?       Sl   15:27   0:04 /usr/libexec/qemu-kvm -name guest=test1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-15-test1/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu SandyBridge -m size=2097152k,slots=16,maxmem=8388608k -realtime mlock=off -smp 2,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0-1,mem=2048 -uuid 6c1c6b66-7da0-4628-a05a-985eea12af15 -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=7.3-1.1.el7,serial=4C4C4544-0034-5310-8052-B3C04F4A354A,uuid=6c1c6b66-7da0-4628-a05a-985eea12af15 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-15-test1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-05-11T13:27:20,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-center/00000001-0001-0001-0001-00000000017c/d93e6782-f200-4e5e-9713-2a53ceca3c49/images/1c60424e-e3a3-44e3-bde0-0e51b8121287/0c8aab4b-40cd-41d3-811a-aa6055d7a534,format=raw,if=none,id=drive-scsi0-0-0-1,serial=1c60424e-e3a3-44e3-bde0-0e51b8121287,cache=none,werror=stop,rerror=stop,aio=threads -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e1:3f:00,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/6c1c6b66-7da0-4628-a05a-985eea12af15.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel2,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=3,chardev=charchannel2,id=channel2,name=com.redhat.spice.0 -spice tls-port=5900,addr=10.34.62.205,x509-dir=/etc/pki/vdsm/libvirt-spice,tls-channel=default,tls-channel=main,tls-channel=display,tls-channel=inputs,tls-channel=cursor,tls-channel=playback,tls-channel=record,tls-channel=smartcard,tls-channel=usbredir,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=33554432,vram64_size_mb=0,vgamem_mb=16,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on
root     16305  0.0  0.0 112652   960 pts/0    S+   15:37   0:00 grep --color=auto qemu-kvm
~~~
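
For reference, a quick one-liner to confirm a duplicated guest by name on a host (a sketch, assuming the libvirt-style "guest=<name>" argument visible in the qemu-kvm command lines above):

# pgrep -a qemu-kvm | grep -o 'guest=[^,]*'

Running this on both hosts lists one "guest=<name>" entry per qemu-kvm process; seeing the same guest name on two hosts at the same time confirms the duplicate.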

~~~ This case was not handled in the original BZ, see https://bugzilla.redhat.com/show_bug.cgi?id=1446055#c18
I did:
- run the env with both HA running on same host
- stop engine
- make snapshot of the engine VM with ram
- start engine
- move both HA to other host
- poweroff the engine VM
- preview the snapshot with ram
- start engine

Thus no backup/restore.
~~~

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.2.1-0.1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
- run the environment with both HA VMs on the same host
- stop the engine
- take a snapshot of the engine VM with RAM
- start the engine
- move both HA VMs to the other host
- power off the engine VM
- preview the snapshot with RAM
- start the engine (see the shell sketch below)
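
A shell-level sketch of this flow, assuming the engine runs as the ovirt-engine service and the engine VM is itself a guest of another RHV environment whose Admin Portal is used for the snapshot operations (as clarified in comment 11):

# systemctl stop ovirt-engine     # stop the engine service on the engine VM
  (take a snapshot of the running engine VM, with memory, from the outer Admin Portal)
# systemctl start ovirt-engine
  (migrate both HA VMs to the other host, then power off the engine VM)
  (preview the memory snapshot of the engine VM and start the engine VM)
# systemctl start ovirt-engine    # the service was down when the snapshot was taken, so start it again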

Actual results:
After previewing the snapshot and starting the engine, one of the HA VMs is running twice, on both nodes.

Expected results:
Engine monitoring should be able to handle HA VMs being marked as running on a different host than they actually are, i.e. it should not run any HA VM twice.

Additional info:
It would be nice to test this with host pinning or VM affinity as well.

Comment 1 Arik 2017-05-15 20:45:01 UTC
The fundamental question here is whether this scenario is supported. I think that every restoration of a previous database state should be done using the script provided for that; otherwise it is equivalent to powering off the engine and modifying the database while it is down. The engine currently assumes that this cannot happen and thus relies on the data it reads from the database when it starts - changing that is likely to break things.

Comment 2 Michal Skrivanek 2017-05-16 04:41:28 UTC
Indeed. How do you perform the snapshot operations when engine is down?

Comment 3 Yaniv Kaul 2017-05-16 05:42:23 UTC
Also, was this performed using the new HA with leases?

Comment 4 Jiri Belka 2017-05-16 06:27:18 UTC
(In reply to Michal Skrivanek from comment #2)
> Indeed. How do you perform the snapshot operations when engine is down?

engine down = service down.

Comment 5 Jiri Belka 2017-05-16 06:27:52 UTC
(In reply to Yaniv Kaul from comment #3)
> Also, was this performed using the new HA with leases?

No, I used the default settings in the HA section of the VM properties.

Comment 6 Michal Skrivanek 2017-05-16 06:34:17 UTC
(In reply to Jiri Belka from comment #4)
> (In reply to Michal Skrivanek from comment #2)
> > Indeed. How do you perform the snapshot operations when engine is down?
> 
> engine down = service down.

what service? engine service? How do you perform the snapshot operations when engine is down?

Comment 7 Jiri Belka 2017-05-16 06:36:11 UTC
(In reply to Michal Skrivanek from comment #6)
> (In reply to Jiri Belka from comment #4)
> > (In reply to Michal Skrivanek from comment #2)
> > > Indeed. How do you perform the snapshot operations when engine is down?
> > 
> > engine down = service down.
> 
> what service? engine service? How do you perform the snapshot operations
> when engine is down?

yes, ovirt-engine service was down during snapshot operation.

Comment 8 Jiri Belka 2017-05-16 06:37:45 UTC
(In reply to Arik from comment #1)
> The fundamental question here is whether this scenario is supported. I think
> that every restoration of a previous database state should be done using the
> script provided for that, otherwise it is equivalent to powering off the
> engine and modifying the database while it is down. The engine currently
> assumes that it cannot happen and thus rely on the data it reads from the
> database when it starts - changing that is likely to break stuff.

IMO the logic in the code is technically wrong, and an "assumption" doesn't solve anything. The decision is a policy one: either fix it, or keep it and hide it behind external restrictions.

Comment 9 Jiri Belka 2017-05-16 06:39:01 UTC
In any case, if this BZ is closed as WONTFIX, I'll open a new one for the odd events that appeared with the supported flow as in https://bugzilla.redhat.com/show_bug.cgi?id=1446055

Comment 10 Michal Skrivanek 2017-05-16 06:55:10 UTC
Ok, got it on the service status. But I'm still not clear on exactly how you created and restored the snapshot. What tools/steps did you use?

Comment 11 Jiri Belka 2017-05-16 07:14:36 UTC
(In reply to Michal Skrivanek from comment #10)
> Ok. Got it on the service status. But I'm still not clear on how exactly did
> you create and restore snapshot. What tool/steps you used?

Steps to Reproduce:
- run the env with both HA running on same host
- stop engine _service_ (systemctl stop ovirt-engine)
- make snapshot of the _running_ engine VM with ram from Admin Portal
  (our engine host is just another VM running on RHV)
- start engine _service_ (systemctl start ovirt-engine)
- move both HA VMs to the other host (i.e. the HA VMs now run on a different host than
  at the time of the snapshot of the whole engine VM, including its DB)
- poweroff the currently running engine VM
- preview the snapshot of the engine VM with ram
- start engine VM
- systemctl start ovirt-engine (as engine service was down during original
  snapshot)

This tries to mimic a situation where the engine's information about where the HA VMs are running (I don't know the internals) differs from reality, to see how the engine resolves it. The engine fails to resolve it: both HA VMs are running fine, yet the engine starts another instance of an HA VM on the "original" host.

Is it clear now?

Comment 12 Arik 2017-05-16 08:03:35 UTC
(In reply to Jiri Belka from comment #8)
> IMO technically the logic in the code is wrong and "assumption" doesn't
> solve anything. The decision is political, to repair it or keep it and hide
> it behind external restrictions.

Well, don't underestimate assumptions when it comes to a large-scale, complex system - you have to make some at some point. And we currently rely heavily on an assumption someone made a while ago: that the database holds the latest data.

Imagine you're starting a live storage migration: initially you take a snapshot and rely on having the commands and the tasks in the database. Now, if you restore a previous state of the database, you lose the command and the tasks. At best, the live storage migration will stop and you'll just have an unused volume in the disk's chain. At worst, further operations on the disks will hit conflicts (for example, because you may end up with several leaves for a disk).

So when you restore a previous state of the database, you need to make some adjustments. Otherwise, you have to design the system so that it fetches the current state of the environment on startup - in the live storage migration case, examining the volumes of the disks and comparing them with what is in the database. That carries a performance penalty and extra complexity. So if this scenario is unrealistic in practice, I would recommend lowering its priority to the minimum or closing it as WONTFIX.

Comment 13 Julio Entrena Perez 2017-05-16 08:33:31 UTC
There are scenarios where customers may not be using the engine-backup script to back up RHV-M, but other backup mechanisms such as a snapshot (e.g. RHV-M running on VMware), or other backup tools such as Relax-and-Recover (which is included in RHEL), or other 3rd party backup products performing a bare-metal backup and restore.
Therefore the fix in the engine-backup script does not cover all backup and restore scenarios.

Comment 14 Arik 2017-05-16 09:21:50 UTC
(In reply to Julio Entrena Perez from comment #13)
> There are scenarios where customers may not be using engine-backup script to
> backup RHV-M but other backup mechanisms such as a snapshot (e.g. RHV-M
> running on VMware) or other backup tools such as Relax-and-Recover, which is
> included in RHEL, or other 3rd party backup products to perform a bare metal
> backup and restore.

I would imagine us telling users/customers: "Listen, you can use whatever tools you like; it may work, but we cannot guarantee it. If you want to be on the safe side, you need to use the scripts we provide for this." Much like what smartphone companies say to someone who replaces their screen themselves instead of handing the phone to an official lab. Another option is an RFE to extract the required adjustments so that users can run them after using other tools for backup and restore. Alternatively, an RFE could be opened for seamless restoration of any backup that can be taken of the engine. I don't think the latter is realistic.

> Therefore the fix in engine-backup script does not cover all backup and
> restore scenarios.

Right, that's why we separated this to a different bug.

Comment 15 Michal Skrivanek 2017-05-16 17:11:30 UTC
None of the approaches in comment #13 are supported, and I hope we are not recommending them. We can't really prevent them, though. I would suggest documenting that.

Putting aside the artificial QE case, is this what customers realistically do? If so, we have to stop them ASAP and, if needed, provide a software solution to prevent it, or even support such cases. Those would be RFEs.

Comment 23 Michal Skrivanek 2017-08-07 09:56:24 UTC
We need to document and emphasize the need to use engine-backup for both backup and restore, and to warn about unattended 3rd party snapshot/restore of a live engine VM.

Comment 24 Sandro Bonazzola 2017-12-20 14:00:15 UTC
oVirt 4.2.0 was released on Dec 20th 2017. Please consider re-targeting this bug to the next milestone.

Comment 25 Michal Skrivanek 2017-12-20 18:16:14 UTC
This is a Docs item, not a code change. Redirecting to Lucy for further assignment.

Comment 26 Lucy Bopf 2018-03-26 00:47:31 UTC
Moving to the downstream product, and clearing targets to allow proper triage.

Comment 27 Lucy Bopf 2018-03-26 01:13:57 UTC
Updating the summary to be a bit more specific about the ask in comment 23.

We have backup and restore via engine-backup already documented for SHE and non-SHE setups. To that, we must add a warning with the following information, from comment 18:

- The provided engine-backup tool must be used for backup and restore purposes. If a 3rd party backup tool is in place (archiving, ageing, off-site, tape library robot, and so on), the 3rd party tool should back up the tarball produced by engine-backup, resulting in combined usage of both (see the sketch below).
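
A minimal sketch of that combined flow, assuming typical engine-backup options (exact flags can vary between versions) and a hypothetical /backup path that the 3rd party tool is configured to archive:

# engine-backup --mode=backup --scope=all --file=/backup/engine-backup.tar.bz2 --log=/backup/engine-backup.log
  (the 3rd party tool then archives /backup/engine-backup.tar.bz2: off-site copy, tape, retention, etc.)
# engine-backup --mode=restore --file=/backup/engine-backup.tar.bz2 --log=/backup/engine-restore.log --restore-permissions
  (depending on the restore target, additional options such as --provision-db may be needed)

The key point is that the tarball, not the live engine VM or its database, is what the 3rd party tool handles.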


Related: The SHE backup and restore procedure has been reviewed and will be updated as part of bug 1420604.

Comment 28 Lucy Bopf 2018-04-05 00:48:28 UTC
Flagging this for 4.1 and 4.2. The warning should be added for both SHE and non-SHE backup and restore procedures for both versions.

