Bug 2014790 - [Doc] Disaster Recovery Guide fails to highlight that failovers cannot occur if the SPM is the point of failure and Power Management is not configured
Summary: [Doc] Disaster Recovery Guide fails to highlight that failovers cannot o...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: Documentation
Version: 4.4.9
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.4.10
Target Release: ---
Assignee: Steve Goodman
QA Contact: rhev-docs@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-10-16 18:44 UTC by Allie DeVolder
Modified: 2022-03-17 11:27 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-04 09:33:51 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43816 0 None None None 2021-10-16 18:44:59 UTC

Description Allie DeVolder 2021-10-16 18:44:03 UTC
Description of problem:

The Disaster Recovery Guide [https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html/disaster_recovery_guide/] omits that if the SPM fails, failover cannot occur without Power Management configured. HA VM leases alone do not allow failover in this particular scenario.


Version-Release number of selected component (if applicable):
4.4.x

How reproducible:
100%

Steps to Reproduce:
1. Follow Disaster Recovery Guide to configure failover environment, omitting Power Management due to existence of VM leases
2. Test environment by triggering a failure on the SPM

Actual results:
Failover doesn't occur because a new SPM cannot be elected without a fence event. HA VMs are not started on another host either, because the data center is down without an SPM.

Expected results:
Instruction to configure Power Management for this scenario, or alternatively, explanation that this will not work without it.

Additional info:

Comment 1 Gianluca Cecchi 2021-10-25 15:28:28 UTC
Actually, I would add these considerations, which are very important:
. You configure SPM priority as high on the primary site hosts and SPM priority as low on the secondary site hosts.
. This means that the SPM role is certain to be held by one of the hosts at the primary site.
. This means that if you have a primary site failure (e.g. power loss), it possibly also affects the network devices inside the primary site, so the fencing device for the SPM host cannot be reached.
. This means that VMs do fail over, but operations that need the SPM role cannot be executed (adding new disks, extending existing disks, exporting VMs, ...). You need to select "Confirm host has been rebooted" for the SPM host.
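
The considerations above can be sketched as a small decision model. This is an illustration only: the class, field, and function names are hypothetical and are not part of any oVirt or RHV API. It assumes, as this comment describes, that VMs fail over regardless, while SPM-dependent operations return only once the old SPM host has been fenced or an operator has selected "Confirm host has been rebooted".

```python
from dataclasses import dataclass

# Operations named in the comment as requiring the SPM role.
SPM_OPERATIONS = ("add disk", "extend disk", "export VM")


@dataclass
class SiteFailureState:
    """Hypothetical model of a primary site failure; not an oVirt API."""
    fencing_device_reachable: bool      # can the SPM host's fencing device be reached?
    manually_confirmed_rebooted: bool   # did an operator select "Confirm host has been rebooted"?


def spm_can_move(state: SiteFailureState) -> bool:
    """The SPM role can move to a secondary site host only after a fence
    event or a manual confirmation that the old SPM host was rebooted."""
    return state.fencing_device_reachable or state.manually_confirmed_rebooted


def available_capabilities(state: SiteFailureState) -> list:
    """Return what the environment can do after the primary site fails."""
    capabilities = ["HA VM failover"]  # assumed to work without the SPM, per this comment
    if spm_can_move(state):
        capabilities.extend(SPM_OPERATIONS)
    return capabilities


if __name__ == "__main__":
    # Power loss took out the fencing device and nobody has confirmed a reboot yet.
    print(available_capabilities(SiteFailureState(False, False)))
    # After the operator confirms that the old SPM host was rebooted.
    print(available_capabilities(SiteFailureState(False, True)))
```

The model makes the weakness explicit: setting SPM priorities alone does not change the outcome of the first case.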

Comment 2 Steve Goodman 2021-12-08 16:15:50 UTC
Gianluca, Allie, please review:

I've created a PR for this at:

https://github.com/oVirt/ovirt-site/pull/2653

Links to previews of edited topics are there.

Comment 3 Gianluca Cecchi 2021-12-08 23:08:03 UTC
Hi, thanks for the PR, Steve.
In the Active-Passive configuration the SPM considerations don't apply, because the hosts at the secondary site are active but managed by a second Manager, different from the primary site Manager, and the DC is also a different one. Their storage (the replicated one) is not active during normal operations and is activated only during recovery through Ansible playbooks, as a manual step by the sysadmin. Once the replicated storage domain has been activated, one of the hosts at the secondary site acquires the SPM role in that environment.

Also, the phrase "Doing so prepares for recovery if you have a primary site failure that impacts network devices inside the primary site, preventing the fencing device for the SPM host from being reachable, such as power loss." is misleading, because it seems to tell the user that if they configure SPM priority as high at the primary site and low at the secondary site, they are OK. This is reinforced by the following sentence: "If you do not take this precaution..."
The point is to tell the user that there is a weakness: in some scenarios of loss of the primary site (such as a power loss where the fencing devices of the primary site hosts are not reachable), the hosts at the secondary site are not able to take over the SPM role, and the user has to take a manual step to get full functionality: take the host that had the SPM role and select "Confirm host has been rebooted". But of course this manual step must be done diligently, meaning that you first inspected the primary site and made sure that the host is indeed powered off and has released its resources.

Comment 4 Steve Goodman 2021-12-14 15:43:37 UTC
Thanks for comment 3, Gianluca. I made the following change in the PR:

-----
Set the SPM role on a host at the primary site to have precedence. To do so, configure SPM priority as high in the primary site hosts and SPM priority as low on secondary site hosts. If you have a primary site failure that impacts network devices inside the primary site, preventing the fencing device for the SPM host from being reachable, such as power loss, the hosts in the secondary site are not able to take over the SPM role.

In such a scenario virtual machines do a failover, but operations that require the SPM role in place cannot be executed, including adding new disks, extending existing disks, and exporting virtual machines.

To restore full functionality, select *Confirm host has been rebooted* for the SPM host. Perform this step, even if you are not able to confirm that the host is powered off or that it has released its resources.
-----

Does this capture what you said?

Comment 5 Gianluca Cecchi 2021-12-14 16:10:03 UTC
It seems all ok to me, except for last sentence: "To restore full functionality, select *Confirm host has been rebooted* for the SPM host. Perform this step, even if you are not able to confirm that the host is powered off or that it has released its resources."

I don't agree with "even if you are not able to confirm that the host is powered off or that it has released its resources.". In fact this would potentially lead to data corruption if the host in primary site where you lost your connection/access has its storage functions intact.

I think that Disaster Recovery in general is about regaining a basic (typically reduced) level of functionality in production, and the basic failover of the VMs already accomplishes this.
The actions that need an SPM role assigned are indeed important, but they can be separated from the recovery from the disaster itself.
So in my opinion it is important to write that the customer has to investigate and determine the actual nature of the disaster, and only then deliberately select "Confirm host has been rebooted" to also regain production-level capabilities such as creating new disks, extending existing ones, exporting VMs, etc.

Gianluca
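
The caution raised in this comment can be sketched as a pre-check that an operator runbook might apply before selecting "Confirm host has been rebooted". This is a minimal sketch under stated assumptions: the class and function names are hypothetical, and nothing here calls the Manager or any oVirt API. It only encodes the rule that both findings must be established first.

```python
from dataclasses import dataclass


@dataclass
class SpmHostCheck:
    """Operator findings about the failed SPM host (hypothetical fields)."""
    verified_powered_off: bool      # someone confirmed that the host is really down
    storage_access_released: bool   # the host no longer holds its storage resources


def safe_to_confirm_rebooted(check: SpmHostCheck) -> bool:
    """Return True only when both findings hold.

    If the old SPM host is merely unreachable but its storage functions are
    intact, confirming a reboot could lead to data corruption.
    """
    return check.verified_powered_off and check.storage_access_released


if __name__ == "__main__":
    # Host is unreachable, but nobody has inspected the primary site yet.
    print(safe_to_confirm_rebooted(SpmHostCheck(False, False)))
    # Primary site inspected: host is off and has released its resources.
    print(safe_to_confirm_rebooted(SpmHostCheck(True, True)))
```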

Comment 6 Steve Goodman 2021-12-16 10:31:44 UTC
I updated the PR as follows:

----
Set the SPM role on a host at the primary site to have precedence. To do so, configure SPM priority as high in the primary site hosts and SPM priority as low on secondary site hosts. If you have a primary site failure that impacts network devices inside the primary site, preventing the fencing device for the SPM host from being reachable, such as power loss, the hosts in the secondary site are not able to take over the SPM role.

In such a scenario virtual machines do a failover, but operations that require the SPM role in place cannot be executed, including adding new disks, extending existing disks, and exporting virtual machines.

To restore full functionality, detect the actual nature of the disaster and after fixing the root cause and rebooting the SPM host, select *Confirm 'Host has been Rebooted'* for the SPM host.
----

Comment 7 Gianluca Cecchi 2021-12-16 18:26:26 UTC
It seems OK to me. It would be nice to have someone else chime in and give their opinion.
Gianluca

Comment 8 Arik 2021-12-19 17:05:37 UTC
Pavel, can you please review the changes?

Comment 9 Pavel Bar 2021-12-27 11:20:37 UTC
Left some comments. Most are just style related.

Comment 10 Steve Goodman 2021-12-29 10:47:54 UTC
Comments addressed.

Comment 11 Steve Goodman 2021-12-29 10:49:08 UTC
Richard, can you please do a peer review?

Comment 12 Richard Hoch 2021-12-29 11:44:53 UTC
Steve: a number of minor comments, but otherwise, LGTM!

