Bug 1450835
Summary: | [Docs][Admin][SHE] Add a warning that backup and restore is supported via the engine-backup tool only, and 3rd party tools can be used to back up the resulting tarball | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Jiri Belka <jbelka> |
Component: | Documentation | Assignee: | Avital Pinnick <apinnick> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Billy Burmester <bburmest> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 4.1.2 | CC: | adahms, ahadas, amureini, apinnick, bugs, jbelka, jentrena, lbopf, lsurette, michal.skrivanek, rbalakri, sbonazzo, srevivo, ykaul, ylavi |
Target Milestone: | ovirt-4.1.11 | Keywords: | Triaged |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2018-04-24 07:21:25 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | Docs | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Jiri Belka
2017-05-15 09:21:37 UTC
The fundamental question here is whether this scenario is supported. I think that every restoration of a previous database state should be done using the script provided for that; otherwise it is equivalent to powering off the engine and modifying the database while it is down. The engine currently assumes that this cannot happen and thus relies on the data it reads from the database when it starts. Changing that is likely to break stuff.

Indeed. How do you perform the snapshot operations when engine is down?

Also, was this performed using the new HA with leases?

(In reply to Michal Skrivanek from comment #2)
> Indeed. How do you perform the snapshot operations when engine is down?

engine down = service down.

(In reply to Yaniv Kaul from comment #3)
> Also, was this performed using the new HA with leases?

No, I used the default settings in the HA part of the VM properties.

(In reply to Jiri Belka from comment #4)
> (In reply to Michal Skrivanek from comment #2)
> > Indeed. How do you perform the snapshot operations when engine is down?
>
> engine down = service down.

What service? The engine service? How do you perform the snapshot operations when engine is down?

(In reply to Michal Skrivanek from comment #6)
> (In reply to Jiri Belka from comment #4)
> > (In reply to Michal Skrivanek from comment #2)
> > > Indeed. How do you perform the snapshot operations when engine is down?
> >
> > engine down = service down.
>
> What service? The engine service? How do you perform the snapshot operations
> when engine is down?

Yes, the ovirt-engine service was down during the snapshot operation.

(In reply to Arik from comment #1)
> The fundamental question here is whether this scenario is supported. I think
> that every restoration of a previous database state should be done using the
> script provided for that; otherwise it is equivalent to powering off the
> engine and modifying the database while it is down. The engine currently
> assumes that this cannot happen and thus relies on the data it reads from the
> database when it starts. Changing that is likely to break stuff.

IMO, technically the logic in the code is wrong, and the "assumption" doesn't solve anything. The decision is political: to repair it, or to keep it and hide it behind external restrictions. In any case, if this BZ is closed as WONTFIX, I'll open a new one for the odd events that appeared with the supported flow, as in https://bugzilla.redhat.com/show_bug.cgi?id=1446055

Ok. Got it on the service status. But I'm still not clear on how exactly you created and restored the snapshot. What tool/steps did you use?

(In reply to Michal Skrivanek from comment #10)
> Ok. Got it on the service status. But I'm still not clear on how exactly you
> created and restored the snapshot. What tool/steps did you use?

Steps to Reproduce:
- run the environment with both HA VMs running on the same host
- stop the engine _service_ (systemctl stop ovirt-engine)
- from the Admin Portal, take a snapshot with RAM of the _running_ engine VM (our engine host is just another VM running on RHV)
- start the engine _service_ (systemctl start ovirt-engine)
- move both HA VMs to another host (i.e. the HA VMs now run on a different host than at the time of the snapshot of the whole engine VM, including its DB)
- power off the currently running engine VM
- preview the snapshot of the engine VM with RAM
- start the engine VM
- systemctl start ovirt-engine (the engine service was down when the original snapshot was taken)

Is it clear? It tries to mimic a situation where the engine's information about where the HA VMs are running (I don't know the internals) differs from reality, and to see how the engine resolves it. The engine fails to resolve this: both HA VMs are running fine, yet the engine starts another instance of an HA VM on the "original" host. Is it clear now?

(In reply to Jiri Belka from comment #8)
> IMO, technically the logic in the code is wrong, and the "assumption" doesn't
> solve anything. The decision is political: to repair it, or to keep it and
> hide it behind external restrictions.

Well, don't underestimate assumptions when it comes to a large-scale, complex system; you have to make some at some point. We currently rely heavily on an assumption made a while ago: that the database holds the latest data. Imagine that you're starting a live storage migration. Initially, you take a snapshot and rely on having the commands and the tasks in the database. Now, if you restore a previous state of the database, you lose the commands and the tasks. At best, the live storage migration stops and you are left with an unused volume in the disk's chain. At worst, further operations on the disks run into conflicts (for example, because you may end up with several leaves for a disk). So when you restore a previous state of the database, you need to make some adjustments. Otherwise, you need to design your system so that it fetches the current state of the system on startup. In the live storage migration case, that means examining the volumes of the disks and comparing them with what's in the database, which carries a performance penalty and extra complexity. So if this scenario is unrealistic in practice, I would still recommend lowering its priority to the minimum or closing it as WONTFIX.

There are scenarios where customers may not be using the engine-backup script to back up RHV-M, but other backup mechanisms such as a snapshot (e.g. RHV-M running on VMware), or other backup tools such as Relax-and-Recover, which is included in RHEL, or other 3rd party backup products that perform a bare-metal backup and restore. Therefore the fix in the engine-backup script does not cover all backup and restore scenarios.

(In reply to Julio Entrena Perez from comment #13)
> There are scenarios where customers may not be using the engine-backup script
> to back up RHV-M, but other backup mechanisms such as a snapshot (e.g. RHV-M
> running on VMware), or other backup tools such as Relax-and-Recover, which is
> included in RHEL, or other 3rd party backup products that perform a
> bare-metal backup and restore.

I would imagine us telling users/customers: "Listen, you can use whatever tools you like, and it may work, but we cannot guarantee that. If you want to be on the safe side, you need to use the scripts we provide for that." Much like what smartphone companies tell someone who replaces their screen without going through an official lab. Another option is an RFE to extract the required adjustments so that users can execute them after using other tools for the backup and restore. Alternatively, an RFE could be opened for a seamless restoration from any backup that can be taken of the engine. I don't think the latter is realistic.

> Therefore the fix in the engine-backup script does not cover all backup and
> restore scenarios.

Right, that's why we separated this into a different bug.

None of the approaches in comment #13 are supported, and I hope we are not recommending them. We can't really prevent them, though. I would suggest to document

Putting aside the artificial QE case, is this what customers realistically do? If so, we have to stop them ASAP and, if needed, provide a software solution to prevent it, or even support such cases if needed. Those would be RFEs we need to document and emphasize the need to use engine-backup/engine-restore and warn about unattended 3rd party snapshot/restore of a live engine VM.

oVirt 4.2.0 was released on Dec 20th, 2017. Please consider re-targeting this bug to the next milestone.

This is a Docs item, not a code change. Redirecting to Lucy for further assignment.

Moving to the downstream product, and clearing targets to allow proper triage.

Updating the summary to be a bit more specific about the ask in comment 23. We have backup and restore via engine-backup already documented for SHE and non-SHE setups. To that, we must add a warning with the following information, from comment 18:

- The provided engine-backup tool must be used for backup and restore purposes and, if a 3rd party backup tool is in place (archiving, ageing, off-site, tape library robot...), then the 3rd party backup tool should back up the tarball produced by engine-backup, resulting in combined usage of both.

Related: The SHE backup and restore procedure has been reviewed and will be updated as part of bug 1420604.

Flagging this for 4.1 and 4.2. The warning should be added for both SHE and non-SHE backup and restore procedures for both versions.

Updated documents:
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html-single/self-hosted_engine_guide/#sect-Restoring_SHE_bkup
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html-single/self-hosted_engine_guide/#Backing_up_the_Self-Hosted_Engine_Manager_Virtual_Machine
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html-single/administration_guide/index#Restoring_a_Backup_to_a_Fresh_Installation
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html-single/administration_guide/index#Restoring_a_Backup_to_Overwrite_an_Existing_Installation
https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html-single/administration_guide/index#sect-Backing_Up_and_Restoring_the_Red_Hat_Enterprise_Virtualization_Manager
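The "combined usage" described in the warning can be sketched as a small shell script. In this workflow engine-backup produces the tarball, and the 3rd party tool only ever handles that finished file. It never snapshots or copies the live engine VM or its database. This is a minimal sketch, not the documented procedure. The staging directory, the file naming, the DRY_RUN switch, and the `third-party-backup-tool` command are hypothetical placeholders. The engine-backup options used here (`--scope`, `--mode`, `--file`, `--log`) are taken from the engine-backup tool itself.

```shell
#!/bin/sh
# Sketch only: engine-backup creates the backup tarball; a 3rd party tool
# then archives that tarball (combined usage of both).
# By default (DRY_RUN=1) the commands are only printed, not executed.
set -eu

DRY_RUN="${DRY_RUN:-1}"
# Hypothetical staging directory and file naming; adjust to the site.
BACKUP_DIR="${BACKUP_DIR:-/var/tmp/engine-backups}"
STAMP="${STAMP:-$(date +%Y%m%d-%H%M%S)}"
BACKUP_FILE="$BACKUP_DIR/engine-backup-$STAMP.tar"
LOG_FILE="$BACKUP_DIR/engine-backup-$STAMP.log"
# Hypothetical placeholder for the archiving/off-site/tape tool in use.
ARCHIVE_CMD="${ARCHIVE_CMD:-third-party-backup-tool}"

run() {
    # In dry-run mode print the command line; otherwise execute it.
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

run mkdir -p "$BACKUP_DIR"

# Step 1 (supported): engine-backup captures configuration and databases.
run engine-backup --scope=all --mode=backup --file="$BACKUP_FILE" --log="$LOG_FILE"

# Step 2: the 3rd party tool backs up only the finished tarball.
# ARCHIVE_CMD is left unquoted on purpose so it may contain arguments.
run $ARCHIVE_CMD "$BACKUP_FILE"
```

Running the script as-is only prints the three commands. On the Manager machine, setting `DRY_RUN=0` and pointing `ARCHIVE_CMD` at the real tool would execute them. Restoring works in the reverse order. First retrieve the tarball with the 3rd party tool, then restore it with `engine-backup --mode=restore` as described in the updated documents. Do not restore a snapshot of the engine VM or its database directly.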