Bug 987675 - [RFE] Implement an automatic way to detect the host is powered down and remove the need in Manual Fencing
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: RFEs
Version: 3.3.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Allon Mureinik
QA Contact: yeylon@redhat.com
Whiteboard: storage
Keywords: FutureFeature
Reported: 2013-07-23 16:59 EDT by Marina
Modified: 2016-04-18 02:55 EDT (History)
CC: 10 users

Doc Type: Enhancement
Last Closed: 2015-11-11 11:24:47 EST
Type: Bug
sherold: Triaged+

External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 423273 None None None Never

Description Marina 2013-07-23 16:59:01 EDT
Use case:
RHEV-M is in location A.
Hosts are in locations B and C, part of same RHEV DataCenter (and cluster).
Storage is replicated (by the manufacturer) to both locations.
Location B experiences a power outage; the host becomes non-responsive and its VMs show "Unknown" status in RHEV-M.
Until someone manually intervenes and confirms that the host was rebooted, the VMs are stuck and cannot be used by the end user.
Request: detect the power outage automatically and perform the "manual fence" operation programmatically, so that the VMs are "released" and can be started in location C.

Suggested algorithm:
1) After a configurable number of failed fence attempts (the exact default is TBD), decide the host is down and issue the "manual fence" command from RHEV-M.
2) Use sanlock to lock each volume for extra protection, in case the host is not actually down and VMs are still running on it.
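The retry-threshold part of step 1 could be sketched roughly as follows. This is a minimal illustration only; the function and parameter names are hypothetical, not actual RHEV-M code:

```python
# Hypothetical sketch of the threshold logic in step 1 above.
# 'max_attempts' stands in for the configurable retry limit (TBD).

def should_auto_fence(failed_attempts, max_attempts=3):
    """Decide whether to fall back to an automatic "manual fence".

    Returns True once the number of failed power-management fence
    attempts reaches the configured limit.
    """
    return failed_attempts >= max_attempts

# Example: after 3 failed fence attempts, RHEV-M would mark the
# host as down and release its VMs to start elsewhere.
```

Step 2 (the sanlock volume locks) would then act as the safety net if this decision turns out to be wrong.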
Comment 3 Andrew Cathrow 2013-07-23 19:31:04 EDT
Is the subject line correct: "Implement an automatic way to detect the host is powered down and remove the need in Manual Fencing"?


The only way we can validate if a host is down is through power management - which validates that it's really dead.
If this is the requirement then we need to close the RFE.

But if we want to implement fencing through storage and/or sanlock then it's a different matter.

Marina - is the latter the case?
Sean, do we have an open RFE for sanlock based fencing?
Comment 6 Miguel González 2013-07-24 11:40:10 EDT
Suppose the following scenario: 3 hosts in the same RHEV cluster, regardless of their physical location.
 
Host A: fails, due to power problems or a failed management port on the switch
Host B: attempts to fence host A; the answer is "unknown"
Host C: attempts to fence host A; the answer is "unknown"
RHEV-M: attempts to fence host A; the answer is "unknown"

What I propose to include in the roadmap is the following:

If we count host B, host C and RHEV-M, we have 3 failed votes, so RHEV-M should apply "Host has been rebooted" automatically, so that the virtual machines resident on the failed host A can be started on the 2 other hosts in the cluster.

Thanks
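The 3-vote idea above could be sketched as follows. This is purely an illustration of how the votes might be counted; none of these names correspond to an actual RHEV-M mechanism:

```python
# Illustrative vote-counting sketch for the proposal above.
# 'required_votes' and the function name are assumptions.

def host_considered_down(fence_results, required_votes=3):
    """fence_results: outcomes reported by the other hosts and by
    RHEV-M, e.g. ["unknown", "unknown", "unknown"].

    The host is declared down only when at least 'required_votes'
    voters failed to reach it.
    """
    failed = sum(1 for result in fence_results if result == "unknown")
    return failed >= required_votes

# Host B, host C and RHEV-M all report "unknown" -> 3 failed votes,
# so "Host has been rebooted" would be applied automatically.
```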
Comment 7 Pablo Iranzo Gómez 2013-07-24 11:47:07 EDT
(In reply to Miguel González from comment #6)

If it's a stretch cluster that could be problematic:

Physical sites CPD1 and CPD2 with 4 hosts each, configured as ONE RHEV DC and cluster.
The RHEV Manager is provided as a clustered (RHCS) service over HA-LVM, with one disk on each CPD's SAN.

In the event of a split-brain, RHEV could conclude that four hosts have failed (for example, those in CPD2): the 4 hosts host1.cpd1, host2.cpd1, host3.cpd1 and host4.cpd1 will all report that the 'fence' result is 'unknown' for the "host*.cpd2" hosts, and RHEV could then try to start VMs that are still running on the host*.cpd2 hosts.

We must be very careful in these situations to avoid data loss.
Comment 8 Miguel González 2013-07-24 12:09:52 EDT
This is very understandable and I agree with you; however, we should give the customer the ability to choose, and the management console should offer this functionality.

Some customers prefer to preserve data integrity and are willing to wait for the failed host to recover, while others prefer to maintain service availability and have the machines started on another host quickly (rather than waiting indefinitely for the failed hosts to come back).

I think it would make sense for the customer to have this functionality available and enable it when they see fit.
Comment 9 Pablo Iranzo Gómez 2013-07-24 12:19:03 EDT
(In reply to Miguel González from comment #8)
> This is very understandable and I agree with you, however, we must give you
> the ability to choose the client and management console should offer this
> functionality.
> 
> There are some customers who prefer to preserve the integrity of the data
> and can expect to recover the failed host, and there are others who prefer
> to maintain the availability of the service and that the machines are
> started on another host quickly (And do not wait forever until you return
> the hosts without problems).
> 
> I think it would make sense that the client could have the functionality
> available and activate when it sees fit.

I agree, this can be useful, but we also need to be careful:

Could this be implemented using a custom rules engine with some 'sample' behaviours?

In this way, each customer could check a box per cluster to enable or disable this behaviour, and could tune the number of votes, a regexp of hostnames, etc., to better select the hosts that get to decide.

Regards,
Pablo
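The per-cluster tunables Pablo describes could look something like the sketch below. Every field name here is a hypothetical illustration, not an existing RHEV-M setting:

```python
# Sketch of a per-cluster auto-fencing policy with the tunables
# mentioned above (checkbox, vote count, hostname regexp).
# All names are assumptions for illustration only.
import re

cluster_policy = {
    "auto_manual_fence": True,          # the per-cluster checkbox
    "required_votes": 3,                # votes needed to declare a host down
    "voter_pattern": r"host.*\.cpd1",   # regexp selecting the deciding hosts
}

def eligible_voters(hostnames, policy):
    """Return the hosts allowed to vote, per the policy's regexp."""
    pattern = re.compile(policy["voter_pattern"])
    return [h for h in hostnames if pattern.fullmatch(h)]

hosts = ["host1.cpd1", "host2.cpd1", "host1.cpd2"]
# Only the CPD1 hosts match the pattern, so only they get a vote.
```

A rules-engine approach would let each customer pick the trade-off between data integrity and availability discussed in comment 8.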
Comment 10 Andrew Cathrow 2013-07-25 19:56:59 EDT
If the status is "unknown" then we can't make any assumptions.
Storage based fencing is the best approach.
