Bug 1108334 - [RFE] Disaster Recovery Plan for Hosted Engine
Summary: [RFE] Disaster Recovery Plan for Hosted Engine
Keywords:
Status: CLOSED DUPLICATE of bug 1420604
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: RFEs
Version: 3.4.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Scott Herold
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On: 1116469 1232136
Blocks:
 
Reported: 2014-06-11 18:08 UTC by James W. Mills
Modified: 2021-09-09 11:37 UTC (History)
14 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-29 13:50:31 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:
sherold: Triaged+
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1181636 1 None None None 2021-01-20 06:05:38 UTC
Red Hat Issue Tracker RHV-43424 0 None None None 2021-09-09 11:37:47 UTC
Red Hat Knowledge Base (Solution) 916853 0 None None None Never

Internal Links: 1181636

Description James W. Mills 2014-06-11 18:08:07 UTC
Description of problem:  The introduction of hosted-engine brought some much-needed HA functionality to the RHEV Manager.

However, there is no disaster recovery plan, and customers are already raising this.

Imagine a scenario where hosted engine is deployed on a rack of servers, and the server rack fails permanently.  We have no plan in place for a customer to re-deploy hosted engine.

My teammates and I have done some research and experimentation, and we have a rough article on recovering from a disaster of this nature here:

https://access.redhat.com/site/articles/912863

I would love to see hosted-engine incorporate the ability to:

* When an already-defined local engine SD is detected, offer the option to reuse the VM that exists there, in addition to the existing options of becoming a new "slave" host or destroying the contents of the share and starting over

* When connecting to the database for the first time, detect and remove old host/VM configurations before adding the new ones.

These two changes would make disaster recovery much more approachable by the customer.

Comment 1 Itamar Heim 2014-06-30 16:24:03 UTC
James - AFAIK, the hosted engine installer allows connecting to an existing hosted engine storage domain when you add a new host.
On the other hand, if you lose the hosted engine, 3.5 adds 'import data domain'.

can you please elaborate a bit on what you want?

thanks,
   Itamar

Comment 2 James W. Mills 2014-06-30 21:28:26 UTC
Itamar,

You are correct: if I already have HE deployed, adding a new host to it is supported (yay!).  However, the scenario we were investigating is more of a disaster recovery one.

Imagine a simple setup like this:

* Two HE hosts
* One "private" HE SD (where the manager lives)
* Normal RHEV environment (DCs/SDs/VMs)

Scenario 1 - If you lost the two HE hosts, but everything else was still intact.

In this scenario, a manager VM still exists.  When re-installing the initial HE host using the same private SD, setup will detect that the private SD exists.  When it does, it gives you two options: "become a slave" or "remove everything and begin again".  Since this is the first host, it cannot "become a slave".  It would be very nice if a third option, "use the existing VM configuration", were there.  This would allow setup to reuse the existing VM instead of forcing the user to reinstall the OS and RHEV and restore a database backup.

Scenario 2 - If you lost the two HE hosts *and* the private HE SD.

In this scenario, everything is gone and we will need to begin anew.  However, assuming we had a recent "engine-backup", we could reinstall the OS/RHEV, then restore from the backup.  It would be beneficial if the HE installation could automatically clean up the existing host/manager VM entries in the DB before adding the new host/manager VM entries.
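For reference, the backup/restore cycle described above can be sketched roughly as follows with engine-backup (file names are hypothetical, and the exact restore flags vary by RHEV version):

```shell
# On the running engine VM: take a regular backup and copy it off-host.
engine-backup --mode=backup --file=engine-backup.tar.gz --log=engine-backup.log

# After reinstalling the OS and RHEV-M packages on the replacement engine VM,
# restore the configuration and database from the backup.
engine-backup --mode=restore --file=engine-backup.tar.gz --log=engine-restore.log
# Depending on the version, database flags such as --provision-db or
# --change-db-credentials may also be required before running engine-setup.
```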


I hope this helps, and please let me know if you need further clarification!
~james

Comment 5 Sandro Bonazzola 2014-09-11 11:11:57 UTC
(In reply to James W. Mills from comment #2)
> Itamar,
> 
> You are correct, if I already have HE deployed, adding a new host to it is
> supported (yay!).  However, the scenario we were investigating is more of a
> disaster recovery one.
> 
> Imagine a simple setup like this:
> 
> * Two HE hosts
> * One "private" HE SD (where the manager lives)
> * Normal RHEV environment (DCs/SDs/VMs)
> 
> Scenario 1 - If you lost the two HE hosts, but everything else was still
> intact.
> 
> In this scenario, a manager VM exists.  When re-installing the initial HE
> host using the same private SD, it will detect that the private SD exists. 
> When it does this, it gives you two options, "become a slave" or "remove
> everything and begin again".  Since this is the first host, it cannot
> "become a slave".  It would be very nice if a third option "use the existing
> VM configuration" was there.  This would allow the setup to reuse the
> existing VM and not force the user to reinstall the OS/RHEV/database backup.

For this scenario, if you have a backup of the original answer file used for deploying the host, and the storage is not damaged, you can:
- re-deploy the first host, passing --config-append=answerfile.conf
- when it asks to confirm that the OS is installed, confirm, and use --vm-poweroff to shut down the VM
- when it asks to confirm that the engine is installed, use --check-liveliness to ensure the engine is responding
- move the system to global maintenance
- drop the host from the engine
- confirm that the engine is up

The above sequence should work, but it needs to be verified, and anything that causes the procedure to fail should be fixed.
Keeping needinfo on me as a reminder to verify the process.
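For the record, the sequence above corresponds roughly to the following commands on the replacement host (the answer-file path is hypothetical, and dropping the stale host entry is done through the Administration Portal or REST API):

```shell
# Re-deploy the first host, reusing the original answer file.
hosted-engine --deploy --config-append=answerfile.conf

# When setup asks to confirm that the OS is installed, confirm,
# then shut the engine VM down cleanly:
hosted-engine --vm-poweroff

# When setup asks to confirm that the engine is installed,
# verify that the engine is responding:
hosted-engine --check-liveliness

# Move the system to global maintenance:
hosted-engine --set-maintenance --mode=global

# Then drop the stale host entry from the engine (Admin Portal / REST API)
# and confirm that the engine is up.
```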


> 
> Scenario 2 - If you lost the two HE hosts *and* the private HE SD.
> 
> In this scenario, everything is gone and we will need to begin anew. 
> However, assuming we had a recent "engine-backup", we could reinstall the
> IS/RHEV, then restore from the backup.  It would be beneficial if the HE
> installation could automatically "clean up" the existing host/manager VM
> entries in the DB before trying to add the new host/manager VM entries.

Scenario 2 really looks like a migration from a physical setup to a hosted-engine one, and should already be covered by the documentation.


> 
> 
> I hope this helps, and please let me know if you need further clarification!
> ~james

Comment 6 Yedidyah Bar David 2014-09-11 11:57:14 UTC
(In reply to Sandro Bonazzola from comment #5)
> Scenario 2 looks really like a migration from a physical setup to the hosted
> engine one and should be already covered by the documentation.

Not exactly - similar, but as James noted, the engine db will already include HE-related data which might need to be cleaned.

Comment 7 Sandro Bonazzola 2014-11-03 13:22:48 UTC
didi, can you take this?

Comment 13 Yaniv Kaul 2015-11-17 13:20:25 UTC
Sandro - what is the work to be done here from our side? (for 4.0)

Comment 14 Sandro Bonazzola 2015-12-14 15:44:23 UTC
As per comment #5 and comment #6, verify that the proposed disaster recovery plan works, or provide a different one, and move it to the doc team.

Comment 15 Martin Sivák 2016-01-05 13:32:51 UTC
Another idea we just had with Simone:

We might offer a periodic snapshot of the engine VM. We could probably do this efficiently every time the VM goes down cleanly.

That would give us a way to recover a broken VM, but it does not fit the storage domain we currently require/use (we require only 20 GiB, IIRC).

Comment 16 Simone Tiraboschi 2016-01-05 13:40:41 UTC
(In reply to Martin Sivák from comment #15)
> Another idea we just had with Simone:
> 
> We might offer a periodic snapshot of the engine VM. We can probably do it
> efficiently every time the VM goes down properly.

We could extend it even further by adding to ha-agent the capability to auto-recover to the 'last working snapshot', assuming that the current liveliness check is enough to determine whether a specific snapshot is sane.

Then, of course, we need to add to the hosted-engine tool all the capabilities to create, recover, and delete engine VM snapshots according to available SD space.
The user could probably also use the engine to create a snapshot, but the engine might not be available after a fault, which is exactly when the user needs to recover it.
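As a hedged sketch of the engine-side snapshot idea: a snapshot of the engine VM can already be created through the oVirt Python SDK (ovirt-engine-sdk4), assuming the engine is reachable, which, as noted above, may not hold after the very fault we want to recover from. The URL, credentials, and VM name below are hypothetical.

```python
# Sketch: create a "last working" recovery snapshot of the engine VM
# via ovirt-engine-sdk4. Assumes a live, reachable engine.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # hypothetical
    username='admin@internal',
    password='secret',  # hypothetical
    insecure=True,  # demo only; verify the CA certificate in production
)

# Find the hosted engine VM (name is an assumption) and its snapshots service.
vms_service = connection.system_service().vms_service()
engine_vm = vms_service.list(search='name=HostedEngine')[0]
snapshots_service = vms_service.vm_service(engine_vm.id).snapshots_service()

# Create a snapshot that could serve as an auto-recovery point.
snapshots_service.add(
    types.Snapshot(description='pre-shutdown recovery point'),
)

connection.close()
```

The limitation Simone raises still applies: this path depends on the engine itself, so an ha-agent-driven mechanism would have to create and roll back snapshots without it.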

Comment 17 Yaniv Lavi 2017-04-03 09:54:08 UTC
I think this is fixed in 4.1 with the storage migration options. Should we move it to 4.1.2?

Comment 18 Simone Tiraboschi 2017-04-04 12:08:59 UTC
(In reply to Yaniv Dary from comment #17)
> I think this is fixed in 4.1 with the storage migration options. Should we
> move it to 4.1.2?


Yes, I think so.

Comment 19 Yaniv Lavi 2017-05-29 13:50:31 UTC

*** This bug has been marked as a duplicate of bug 1420604 ***

