Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2014790

Summary:	[Doc] Disaster Recovery Guide fails to highlight that failover cannot occur when the SPM is the point of failure and Power Management is not configured
Product:	Red Hat Enterprise Virtualization Manager	Reporter:	Allie DeVolder <adevolder>
Component:	Documentation	Assignee:	Steve Goodman <sgoodman>
Status:	CLOSED CURRENTRELEASE	QA Contact:	rhev-docs <rhev-docs>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.4.9	CC:	ahadas, ctomasko, gianluca.cecchi, lsurette, mavital, mhicks, pbar, rhoch, sgoodman, srevivo
Target Milestone:	ovirt-4.4.10	Keywords:	Documentation, NoDocsQEReview
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-01-04 09:33:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	Storage	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Allie DeVolder 2021-10-16 18:44:03 UTC

Description of problem:

The Disaster Recovery Guide [https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html/disaster_recovery_guide/] omits that if the SPM fails, failover cannot occur without Power Management configured. HA VM leases alone do not allow failover in this particular scenario.


Version-Release number of selected component (if applicable):
4.4.x

How reproducible:
100%

Steps to Reproduce:
1. Follow Disaster Recovery Guide to configure failover environment, omitting Power Management due to existence of VM leases
2. Test environment by triggering a failure on the SPM

Actual results:
Failover doesn't occur because a new SPM cannot be elected without a fence event, nor are HA VMs started on another host because the data center is down without an SPM

Expected results:
Instruction to configure Power Management for this scenario, or alternatively, explanation that this will not work without it.

Additional info:

Comment 1 Gianluca Cecchi 2021-10-25 15:28:28 UTC

Actually I would add these considerations that are very important:
. you configure SPM priority as high on the primary site hosts and SPM priority as low on the secondary site hosts
. this means that the SPM role is certain to be held by one of the hosts at the primary site
. this means that if you have a primary site failure (e.g. power loss), it possibly also affects the network devices inside the primary site, so the fencing device for the SPM host cannot be reached
. this means that VMs do fail over, but operations that need the SPM role in place cannot be executed (adding new disks, extending existing disks, exporting VMs, ...). You need to select "Confirm host has been rebooted" for the SPM host.
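The considerations above can be sketched as a small decision model. This is only an illustration of the reasoning in this comment, not oVirt engine code or an oVirt API: the class, field, and function names are all hypothetical, and the model assumes HA VMs use VM leases so that they can restart without the SPM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimarySiteFailure:
    """Hypothetical description of a primary site outage (not an oVirt type)."""
    fence_device_reachable: bool   # can the engine fence the failed SPM host?
    confirmed_rebooted: bool       # did an admin select "Confirm host has been rebooted"?


def spm_can_move(failure: PrimarySiteFailure) -> bool:
    # The SPM role moves only once the old SPM host is known to be down:
    # either a fence event succeeded, or an administrator confirmed it manually.
    return failure.fence_device_reachable or failure.confirmed_rebooted


def available_operations(failure: PrimarySiteFailure) -> list[str]:
    # Assumption: HA VMs with leases fail over regardless of SPM state.
    operations = ["HA VM failover"]
    if spm_can_move(failure):
        operations += ["add disk", "extend disk", "export VM"]
    return operations


# Power loss that also takes down the fencing device, no manual confirmation yet.
outage = PrimarySiteFailure(fence_device_reachable=False, confirmed_rebooted=False)
print(available_operations(outage))
```

In this model the manual confirmation is the only way to regain SPM operations when the fencing device is unreachable, which is why it should be given only after verifying that the old SPM host is really powered off.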

Comment 2 Steve Goodman 2021-12-08 16:15:50 UTC

Gianluca, Allie, please review:

I've created a PR for this at:

https://github.com/oVirt/ovirt-site/pull/2653

Links to previews of edited topics are there.

Comment 3 Gianluca Cecchi 2021-12-08 23:08:03 UTC

Hi, thanks for the PR, Steve.
In the Active-Passive configuration the SPM considerations don't apply, because the hosts at the secondary site are active but managed by a second Manager, which is different from the primary site Manager, and the DC is also a different one. Their storage (the replicated one) is not active during normal operations and is activated only during recovery through Ansible playbooks, as a manual step by the sysadmin. Once the replicated storage domain has been activated, one of the hosts at the secondary site acquires the SPM role in that environment.

Also, the phrase "Doing so prepares for recovery if you have a primary site failure that impacts network devices inside the primary site, preventing the fencing device for the SPM host from being reachable, such as power loss." is misleading, because it seems to tell users that if they configure SPM priority as high at the primary site and low at the secondary site, they are OK. This is reinforced by the following sentence: "If you do not take this precaution..."
The point is to tell the user that there is a weakness: in some scenarios of primary site loss (such as a power loss where the fencing devices of the primary site hosts are not reachable), the hosts at the secondary site are not able to take over the SPM role, and the user has to take a manual step to get full functionality, namely select "Confirm host has been rebooted" on the host that had the SPM role. Of course this manual step must be done diligently: beforehand, you must have had the chance to inspect the primary site and make sure that the host is indeed powered off and has released its resources.

Comment 4 Steve Goodman 2021-12-14 15:43:37 UTC

Thanks for the comment 3, Gianluca. I made the following change in the PR:

-----
Set the SPM role on a host at the primary site to have precedence. To do so, configure SPM priority as high in the primary site hosts and SPM priority as low on secondary site hosts. If you have a primary site failure that impacts network devices inside the primary site, preventing the fencing device for the SPM host from being reachable, such as power loss, the hosts in the secondary site are not able to take over the SPM role.

In such a scenario virtual machines do a failover, but operations that require the SPM role in place cannot be executed, including adding new disks, extending existing disks, and exporting virtual machines.

To restore full functionality, select *Confirm host has been rebooted* for the SPM host. Perform this step, even if you are not able to confirm that the host is powered off or that it has released its resources.
-----

Does this capture what you said?

Comment 5 Gianluca Cecchi 2021-12-14 16:10:03 UTC

It seems all ok to me, except for last sentence: "To restore full functionality, select *Confirm host has been rebooted* for the SPM host. Perform this step, even if you are not able to confirm that the host is powered off or that it has released its resources."

I don't agree with "even if you are not able to confirm that the host is powered off or that it has released its resources.". In fact this would potentially lead to data corruption if the host in primary site where you lost your connection/access has its storage functions intact.

I think that Disaster Recovery concepts in general refer to regaining a base (typically reduced) level of functionality in production, and the base failover of the VMs already accomplishes this task.
The actions that need an SPM role assigned are indeed important, but they can be separated from the recovery after a disaster has happened.
So in my opinion it is important to write that the customer has to investigate and determine the actual nature of the disaster, and only then, knowingly, select "Confirm host has been rebooted" and so also regain the production-level capabilities such as creating new disks, extending existing ones, exporting VMs, etc.

Gianluca

Comment 6 Steve Goodman 2021-12-16 10:31:44 UTC

I updated the PR as follows:

----
Set the SPM role on a host at the primary site to have precedence. To do so, configure SPM priority as high in the primary site hosts and SPM priority as low on secondary site hosts. If you have a primary site failure that impacts network devices inside the primary site, preventing the fencing device for the SPM host from being reachable, such as power loss, the hosts in the secondary site are not able to take over the SPM role.

In such a scenario virtual machines do a failover, but operations that require the SPM role in place cannot be executed, including adding new disks, extending existing disks, and exporting virtual machines.

To restore full functionality, detect the actual nature of the disaster and after fixing the root cause and rebooting the SPM host, select *Confirm 'Host has been Rebooted'* for the SPM host.
----

Comment 7 Gianluca Cecchi 2021-12-16 18:26:26 UTC

It seems OK to me. It would be nice to have someone else chime in and give an opinion.
Gianluca

Comment 8 Arik 2021-12-19 17:05:37 UTC

Pavel, can you please review the changes?

Comment 9 Pavel Bar 2021-12-27 11:20:37 UTC

Left some comments. Most are just style related.

Comment 10 Steve Goodman 2021-12-29 10:47:54 UTC

Comments addressed.

Comment 11 Steve Goodman 2021-12-29 10:49:08 UTC

Richard, can you please do a peer review?

Comment 12 Richard Hoch 2021-12-29 11:44:53 UTC

Steve: a number of minor comments, but otherwise, LGTM!