Bug 987675 - [RFE] Implement an automatic way to detect the host is powered down and remove the need for Manual Fencing
Summary: [RFE] Implement an automatic way to detect the host is powered down and remov...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: RFEs
Version: 3.3.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Allon Mureinik
QA Contact: yeylon@redhat.com
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2013-07-23 20:59 UTC by Marina Kalinin
Modified: 2022-04-17 08:11 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-11 16:24:47 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:
sherold: Triaged+


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 804272 1 None None None 2021-09-09 11:32:29 UTC
Red Hat Issue Tracker RHV-45777 0 None None None 2022-04-17 08:11:56 UTC
Red Hat Knowledge Base (Solution) 423273 0 None None None Never

Internal Links: 804272

Description Marina Kalinin 2013-07-23 20:59:01 UTC
Use case:
RHEV-M is in location A.
Hosts are in locations B and C, part of same RHEV DataCenter (and cluster).
Storage - replicated (by manufacturer) in both locations.
Location B experiences a power outage; the host becomes non-responsive and its VMs go into "Unknown" status on RHEV-M.
Until someone manually intervenes and confirms that the host has been rebooted, the VMs are stuck and cannot be used by the end user.
Request: detect the power outage automatically and perform the "manual fence" programmatically, so that the VMs are "released" and can be started in location C.

Suggested algorithm:
1) After a number of failed fence attempts (TBD what this number is; it should also be configurable), decide the host is down and perform the "manual fence" command from RHEV-M.
2) Use sanlock to lock each volume for extra protection, in case the host is not down and VMs are still running on it.
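
For illustration only, a rough sketch of what such engine-side logic could look like. The helper names are hypothetical, not the actual RHEV-M code, and the retry limit stands in for the configurable value mentioned in step 1:

# Illustrative sketch only -- hypothetical helper names, not the real RHEV-M API.
import time

MAX_FENCE_ATTEMPTS = 3          # "TBD" above: should be a configurable value
FENCE_RETRY_DELAY_SEC = 30

def try_power_fence(host):
    """Hypothetical: ask power management to fence 'host'; True on confirmed power-off."""
    raise NotImplementedError

def storage_lease_is_released(host):
    """Hypothetical: True if the host's sanlock/storage lease has expired,
    i.e. the host has stopped writing to shared storage."""
    raise NotImplementedError

def mark_host_rebooted(host):
    """Hypothetical: same effect as the manual 'Host has been rebooted'
    confirmation -- releases the VMs so they can be restarted elsewhere."""
    raise NotImplementedError

def auto_confirm_host_down(host):
    for attempt in range(MAX_FENCE_ATTEMPTS):
        try:
            if try_power_fence(host):
                mark_host_rebooted(host)
                return True
        except Exception:
            pass                      # power management unreachable
        time.sleep(FENCE_RETRY_DELAY_SEC)
    # All fence attempts failed; only proceed if storage also confirms
    # the host stopped renewing its lease (step 2 of the suggestion).
    if storage_lease_is_released(host):
        mark_host_rebooted(host)
        return True
    return False                      # otherwise keep waiting for manual confirmation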

Comment 3 Andrew Cathrow 2013-07-23 23:31:04 UTC
Is the subject line correct: "Implement an automatic way to detect the host is powered down and remove the need for Manual Fencing"?


The only way we can validate if a host is down is through power management - which validates that it's really dead.
If this is the requirement then we need to close the RFE.

But if we want to implement fencing through storage and/or sanlock then it's a different matter.

Marina - is the latter the case?
Sean, do we have an open RFE for sanlock based fencing?

Comment 6 Miguel González 2013-07-24 15:40:10 UTC
Suppose the following scenario: 3 hosts in the same RHEV cluster, regardless of their physical location.
 
Host A: fails due to a power problem, or its management port on the switch fails
Host B: attempts to fence host A, and the answer is "unknown"
Host C: tries to fence host A, and the answer is "unknown"
RHEV-M: tries to fence host A, and the answer is "unknown"

What I propose to include in the roadmap is as follows:

If we count host B, host C and RHEV-M, we have 3 failed votes, so RHEV-M should apply "Host has been rebooted" automatically, so that the virtual machines resident on the failed host A can be started on the 2 other hosts in the cluster.
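
To make the voting explicit, a minimal sketch of the idea (hypothetical helper names only, not an actual implementation):

# Illustrative sketch of the voting idea -- hypothetical helpers only.

VOTES_NEEDED = 3   # host B + host C + RHEV-M in the example above

def fence_verdict(observer, target):
    """Hypothetical: 'observer' tries to fence/reach 'target' and returns
    'ok', 'alive' or 'unknown'."""
    raise NotImplementedError

def should_auto_confirm(target, observers):
    unknown_votes = sum(
        1 for obs in observers if fence_verdict(obs, target) == "unknown"
    )
    return unknown_votes >= VOTES_NEEDED

# e.g. should_auto_confirm("host_a", ["host_b", "host_c", "rhev_m"])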

Thanks

Comment 7 Pablo Iranzo Gómez 2013-07-24 15:47:07 UTC
(In reply to Miguel González from comment #6)
> Suppose the following scenario: 3 hosts in the same RHEV cluster,
> regardless of their physical location.
>  
> Host A: fails due to a power problem, or its management port on the switch
> fails
> Host B: attempts to fence host A, and the answer is "unknown"
> Host C: tries to fence host A, and the answer is "unknown"
> RHEV-M: tries to fence host A, and the answer is "unknown"
> 
> What I propose to include in the roadmap is as follows:
> 
> If we count host B, host C and RHEV-M, we have 3 failed votes, so RHEV-M
> should apply "Host has been rebooted" automatically, so that the virtual
> machines resident on the failed host A can be started on the 2 other hosts
> in the cluster.
> 
> Thanks


If it's a stretch cluster, that could be problematic:

Physical data centers CPD1 and CPD2, with 4 hosts each, configured as ONE RHEV DC and cluster.
RHEV-Manager provided as a clustered (RHCS) service over HA-LVM with one disk on each CPD's SAN.

In the event of a split brain, RHEV could think that four hosts have failed (for example the ones in CPD2): the 4 hosts in CPD1 (host1.cpd1, host2.cpd1, host3.cpd1 and host4.cpd1) would report that the 'fence' result for the "host*.cpd2" hosts is 'unknown', and it could then try to start VMs that are still running on the host*.cpd2 hosts.

We must be very careful in these situations to avoid data loss.
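
To make the risk concrete, a toy example of how a simple vote count behaves during such a split brain (purely illustrative):

# Toy example: a clean network split between CPD1 and CPD2, nothing more.
cpd1 = ["host1.cpd1", "host2.cpd1", "host3.cpd1", "host4.cpd1"]
cpd2 = ["host1.cpd2", "host2.cpd2", "host3.cpd2", "host4.cpd2"]

def fence_verdict(observer, target):
    # During the split, every host can only reach its own site.
    same_site = observer.split(".")[1] == target.split(".")[1]
    return "alive" if same_site else "unknown"

votes_against_cpd2 = sum(fence_verdict(obs, "host1.cpd2") == "unknown" for obs in cpd1)
votes_against_cpd1 = sum(fence_verdict(obs, "host1.cpd1") == "unknown" for obs in cpd2)
print(votes_against_cpd2, votes_against_cpd1)   # -> 4 4: both sites reach the threshold
# Each site would declare the other one dead and restart its VMs,
# while those VMs are in fact still running on the other side.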

Comment 8 Miguel González 2013-07-24 16:09:52 UTC
This is very understandable and I agree with you; however, we must give the customer the ability to choose, and the management console should offer this functionality.

Some customers prefer to preserve the integrity of the data and can wait to recover the failed host, while others prefer to maintain the availability of the service and want the machines started on another host quickly (and not to wait indefinitely until the failed host comes back without problems).

I think it would make sense for the customer to have this functionality available and to activate it when they see fit.

Comment 9 Pablo Iranzo Gómez 2013-07-24 16:19:03 UTC
(In reply to Miguel González from comment #8)
> This is very understandable and I agree with you; however, we must give the
> customer the ability to choose, and the management console should offer this
> functionality.
> 
> Some customers prefer to preserve the integrity of the data and can wait to
> recover the failed host, while others prefer to maintain the availability of
> the service and want the machines started on another host quickly (and not
> to wait indefinitely until the failed host comes back without problems).
> 
> I think it would make sense for the customer to have this functionality
> available and to activate it when they see fit.

I agree, this can be useful, but we also need to be careful:

Could this be implemented using a custom rules engine with some 'sample' behaviours?

In this way, each customer could check a box per 'cluster' to enable or disable this behaviour, and could tune the number of votes, a regexp of hostnames, etc. to better select the hosts that will decide.
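
Something along these lines, for example; a minimal sketch with made-up field names, just to show what a per-cluster policy could expose:

# Illustrative per-cluster policy -- made-up field names, not a real schema.
import re
from dataclasses import dataclass

@dataclass
class AutoFencePolicy:
    enabled: bool = False        # the per-cluster checkbox
    min_votes: int = 3           # how many "unknown" verdicts are required
    voter_regexp: str = r".*"    # which hosts are allowed to vote

    def voters(self, cluster_hosts):
        pattern = re.compile(self.voter_regexp)
        return [h for h in cluster_hosts if pattern.match(h)]

policy = AutoFencePolicy(enabled=True, min_votes=3,
                         voter_regexp=r"host[0-9]+\.cpd1")
print(policy.voters(["host1.cpd1", "host2.cpd1", "host1.cpd2"]))
# -> ['host1.cpd1', 'host2.cpd1']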

Regards,
Pablo

Comment 10 Andrew Cathrow 2013-07-25 23:56:59 UTC
If status == unknown then we can't make any assumptions.
Storage based fencing is the best approach.
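
In other words, base the decision on the storage side (sanlock-style lease renewal) rather than on an "unknown" network verdict. A very rough illustration, with a hypothetical helper for reading the lease:

# Illustration of the storage-fencing idea: trust the lease, not the network.
# 'read_lease_timestamp' is hypothetical; with sanlock this information lives
# in the host's lease on the shared storage lockspace.
import time

LEASE_EXPIRY_SEC = 80   # example value only, not the real sanlock timeout

def read_lease_timestamp(host):
    """Hypothetical: last time 'host' renewed its lease on shared storage."""
    raise NotImplementedError

def host_is_safely_down(host):
    # An "unknown" answer over the network proves nothing; an expired storage
    # lease proves the host can no longer write to the shared storage.
    return time.time() - read_lease_timestamp(host) > LEASE_EXPIRY_SEC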

