Bug 1108334 - [RFE] Disaster Recovery Plan for Hosted Engine
Summary: [RFE] Disaster Recovery Plan for Hosted Engine
Keywords:
Status: CLOSED DUPLICATE of bug 1420604
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: RFEs
Version: 3.4.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Scott Herold
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On: 1116469 1232136
Blocks:
 
Reported: 2014-06-11 18:08 UTC by James W. Mills
Modified: 2021-09-09 11:37 UTC (History)
14 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-29 13:50:31 UTC
oVirt Team: Integration
Target Upstream Version:
Embargoed:
sherold: Triaged+
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1181636 1 None None None 2021-01-20 06:05:38 UTC
Red Hat Issue Tracker RHV-43424 0 None None None 2021-09-09 11:37:47 UTC
Red Hat Knowledge Base (Solution) 916853 0 None None None Never

Internal Links: 1181636

Description James W. Mills 2014-06-11 18:08:07 UTC
Description of problem:  The introduction of hosted-engine brought some much-needed HA functionality to the RHEV Manager.

However, there is no disaster recovery plan, and customers are already raising this.

Imagine a scenario where hosted engine is deployed on a rack of servers, and the server rack fails permanently.  We have no plan in place for a customer to re-deploy hosted engine.

My teammates and I have done some research and experimentation, and we have a rough article on recovering from a disaster of this nature here:

https://access.redhat.com/site/articles/912863

I would love to see hosted-engine incorporate the ability to:

* When an already-defined local engine SD is detected, offer the option to reuse the VM that exists there, in addition to the existing options of becoming a new "slave" host or destroying the contents of the share and starting over

* When connecting to the database for the first time, detect and remove old host/VM configurations before adding the new ones.

These two changes would make disaster recovery much more approachable by the customer.

Comment 1 Itamar Heim 2014-06-30 16:24:03 UTC
James - AFAIK, the hosted engine installer allows connecting to an existing hosted engine storage domain when you add a new host.
On the other hand, if you lose the hosted engine, 3.5 adds 'import data domain'.

can you please elaborate a bit on what you want?

thanks,
   Itamar

Comment 2 James W. Mills 2014-06-30 21:28:26 UTC
Itamar,

You are correct: if I already have HE deployed, adding a new host to it is supported (yay!).  However, the scenario we were investigating is more of a disaster recovery one.

Imagine a simple setup like this:

* Two HE hosts
* One "private" HE SD (where the manager lives)
* Normal RHEV environment (DCs/SDs/VMs)

Scenario 1 - If you lost the two HE hosts, but everything else was still intact.

In this scenario, a manager VM still exists.  When re-installing the initial HE host using the same private SD, setup will detect that the private SD exists.  When it does, it gives you two options: "become a slave" or "remove everything and begin again".  Since this is the first host, it cannot "become a slave".  It would be very nice if a third option, "use the existing VM configuration", were there.  This would allow setup to reuse the existing VM instead of forcing the user to reinstall the OS and RHEV and restore a database backup.

Scenario 2 - If you lost the two HE hosts *and* the private HE SD.

In this scenario, everything is gone and we will need to begin anew.  However, assuming we had a recent "engine-backup", we could reinstall the OS/RHEV, then restore from the backup.  It would be beneficial if the HE installation could automatically clean up the existing host/manager VM entries in the DB before adding the new host/manager VM entries.
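For reference, the backup/restore cycle described above can be sketched roughly as follows with engine-backup (file names are hypothetical, and the exact restore flags vary by RHEV version):

```shell
# On the running engine VM: take a regular backup and copy it off-host.
engine-backup --mode=backup --file=engine-backup.tar.gz --log=engine-backup.log

# After reinstalling the OS and RHEV-M packages on the replacement engine VM,
# restore the configuration and database from the backup.
engine-backup --mode=restore --file=engine-backup.tar.gz --log=engine-restore.log
# Depending on the version, database flags such as --provision-db or
# --change-db-credentials may also be required before running engine-setup.
```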


I hope this helps, and please let me know if you need further clarification!
~james

Comment 5 Sandro Bonazzola 2014-09-11 11:11:57 UTC
(In reply to James W. Mills from comment #2)
> Itamar,
> 
> You are correct, if I already have HE deployed, adding a new host to it is
> supported (yay!).  However, the scenario we were investigating is more of a
> disaster recovery one.
> 
> Imagine a simple setup like this:
> 
> * Two HE hosts
> * One "private" HE SD (where the manager lives)
> * Normal RHEV environment (DCs/SDs/VMs)
> 
> Scenario 1 - If you lost the two HE hosts, but everything else was still
> intact.
> 
> In this scenario, a manager VM exists.  When re-installing the initial HE
> host using the same private SD, it will detect that the private SD exists. 
> When it does this, it gives you two options, "become a slave" or "remove
> everything and begin again".  Since this is the first host, it cannot
> "become a slave".  It would be very nice if a third option "use the existing
> VM configuration" was there.  This would allow the setup to reuse the
> existing VM and not force the user to reinstall the OS/RHEV/database backup.

For this scenario, if you have a backup of the original answer file used for deploying the host, and the storage is not damaged, you can:
- re-deploy the first host, passing --config-append=answerfile.conf
- when it asks to confirm that the OS is installed, confirm, and use --vm-poweroff to shut down the VM
- when it asks to confirm that the engine is installed, use --check-liveliness to ensure the engine is responding
- move the system to global maintenance
- drop the host from the engine
- confirm that the engine is up

The above sequence should work, but it needs to be verified, and anything that causes the procedure to fail should be fixed.
Keeping needinfo on me as a reminder to verify the process.
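For the record, the sequence above corresponds roughly to the following commands on the replacement host (the answer-file path is hypothetical, and dropping the stale host entry is done through the Administration Portal or REST API):

```shell
# Re-deploy the first host, reusing the original answer file.
hosted-engine --deploy --config-append=answerfile.conf

# When setup asks to confirm that the OS is installed, confirm,
# then shut the engine VM down cleanly:
hosted-engine --vm-poweroff

# When setup asks to confirm that the engine is installed,
# verify that the engine is responding:
hosted-engine --check-liveliness

# Move the system to global maintenance:
hosted-engine --set-maintenance --mode=global

# Then drop the stale host entry from the engine (Admin Portal / REST API)
# and confirm that the engine is up.
```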


> 
> Scenario 2 - If you lost the two HE hosts *and* the private HE SD.
> 
> In this scenario, everything is gone and we will need to begin anew. 
> However, assuming we had a recent "engine-backup", we could reinstall the
> IS/RHEV, then restore from the backup.  It would be beneficial if the HE
> installation could automatically "clean up" the existing host/manager VM
> entries in the DB before trying to add the new host/manager VM entries.

Scenario 2 really looks like a migration from a physical setup to a hosted-engine one, and should already be covered by the documentation.


> 
> 
> I hope this helps, and please let me know if you need further clarification!
> ~james

Comment 6 Yedidyah Bar David 2014-09-11 11:57:14 UTC
(In reply to Sandro Bonazzola from comment #5)
> Scenario 2 looks really like a migration from a physical setup to the hosted
> engine one and should be already covered by the documentation.

Not exactly - similar, but as James noted, the engine db will already include HE-related data which might need to be cleaned.

Comment 7 Sandro Bonazzola 2014-11-03 13:22:48 UTC
didi, can you take this?

Comment 13 Yaniv Kaul 2015-11-17 13:20:25 UTC
Sandro - what is the work to be done here from our side? (for 4.0)

Comment 14 Sandro Bonazzola 2015-12-14 15:44:23 UTC
As per comment #5 and comment #6, verify that the proposed disaster recovery plan works, or provide a different one, and move it to the doc team.

Comment 15 Martin Sivák 2016-01-05 13:32:51 UTC
Another idea we just had with Simone:

We might offer a periodic snapshot of the engine VM. We could probably do this efficiently every time the VM goes down cleanly.

That would give us a way to recover a broken VM, but it does not fit the storage domain we currently require/use (we require only 20 GiB, IIRC).

Comment 16 Simone Tiraboschi 2016-01-05 13:40:41 UTC
(In reply to Martin Sivák from comment #15)
> Another idea we just had with Simone:
> 
> We might offer a periodic snapshot of the engine VM. We can probably do it
> efficiently every time the VM goes down properly.

We could extend it even further by adding to ha-agent the capability to auto-recover to the 'last working snapshot', assuming that the current liveliness check is enough to determine whether a specific snapshot is sane.

Then, of course, we need to add to the hosted-engine tool all the capabilities to create, recover, and delete engine VM snapshots according to available SD space.
The user could probably also use the engine to create a snapshot, but the engine might not be available after a fault, which is exactly when the user needs to recover it.
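As a hedged sketch of the engine-side snapshot idea: a snapshot of the engine VM can already be created through the oVirt Python SDK (ovirt-engine-sdk4), assuming the engine is reachable, which, as noted above, may not hold after the very fault we want to recover from. The URL, credentials, and VM name below are hypothetical.

```python
# Sketch: create a "last working" recovery snapshot of the engine VM
# via ovirt-engine-sdk4. Assumes a live, reachable engine.
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # hypothetical
    username='admin@internal',
    password='secret',  # hypothetical
    insecure=True,  # demo only; verify the CA certificate in production
)

# Find the hosted engine VM (name is an assumption) and its snapshots service.
vms_service = connection.system_service().vms_service()
engine_vm = vms_service.list(search='name=HostedEngine')[0]
snapshots_service = vms_service.vm_service(engine_vm.id).snapshots_service()

# Create a snapshot that could serve as an auto-recovery point.
snapshots_service.add(
    types.Snapshot(description='pre-shutdown recovery point'),
)

connection.close()
```

The limitation Simone raises still applies: this path depends on the engine itself, so an ha-agent-driven mechanism would have to create and roll back snapshots without it.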

Comment 17 Yaniv Lavi 2017-04-03 09:54:08 UTC
I think this is fixed in 4.1 with the storage migration options. Should we move it to 4.1.2?

Comment 18 Simone Tiraboschi 2017-04-04 12:08:59 UTC
(In reply to Yaniv Dary from comment #17)
> I think this is fixed in 4.1 with the storage migration options. Should we
> move it to 4.1.2?


Yes, I think so.

Comment 19 Yaniv Lavi 2017-05-29 13:50:31 UTC

*** This bug has been marked as a duplicate of bug 1420604 ***

