Use case: RHEV-M is in location A. Hosts are in locations B and C, which are part of the same RHEV data center (and cluster). Storage is replicated (by the array manufacturer) to both locations. Location B experiences a power outage, the host becomes non-responsive, and its VMs show "Unknown" status in RHEV-M. Until someone manually intervenes and confirms that the host was rebooted, the VMs are stuck and cannot be used by the end user.

Request: detect the power outage automatically and perform a "manual fence" programmatically, so that the VMs are released and can start in location C.

Suggested algorithm:
1) After a number of failed fence attempts (TBD; this number should also be configurable), decide the host is down and issue the "manual fence" command from RHEV-M.
2) Use sanlock to lock each volume for extra protection, in case the host is not actually down and VMs are still running on it.
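A minimal sketch of the suggested algorithm in Python. Everything here is a hypothetical stand-in for the real RHEV-M and sanlock interfaces (`fence_host`, `sanlock_acquire`, and `manual_fence` are injected callables, not existing APIs), and the attempt threshold is the configurable TBD value from the request:

```python
MAX_FENCE_ATTEMPTS = 3  # TBD in the RFE; should be configurable


def release_vms_of_dead_host(host, volumes, fence_host, sanlock_acquire, manual_fence):
    """Retry power fencing; on repeated failure, protect the shared storage
    with per-volume sanlock leases and only then issue the 'manual fence'."""
    for _ in range(MAX_FENCE_ATTEMPTS):
        if fence_host(host):          # normal power-management fence succeeded
            return "fenced"
    # Host presumed down. Take a sanlock lease on every volume first, so a
    # host that is secretly still alive cannot keep writing to shared storage.
    for vol in volumes:
        if not sanlock_acquire(vol):
            return "aborted"          # lease held elsewhere: host may be alive
    manual_fence(host)                # the equivalent of "Host has been rebooted"
    return "manually-fenced"
```

Note the ordering: the leases are taken before the manual fence is confirmed, so a false negative from power management cannot lead to two hosts writing the same volumes.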
Is the subject line correct: "Implement an automatic way to detect the host is powered down and remove the need for Manual Fencing"? The only way we can validate that a host is down is through power management, which confirms it is really dead. If that is the requirement, then we need to close the RFE. But if we want to implement fencing through storage and/or sanlock, that is a different matter. Marina, is the latter the case? Sean, do we have an open RFE for sanlock-based fencing?
Suppose the following scenario: 3 hosts in the same RHEV cluster, regardless of their physical location.

Host A: fails due to a power problem, or its management port fails on the switch.
Host B: attempts to fence host A; the answer is "unknown".
Host C: attempts to fence host A; the answer is "unknown".
RHEV-M: attempts to fence host A; the answer is "unknown".

What I propose to include in the roadmap is as follows: if we add up host B, host C, and RHEV-M, we have 3 failed votes, so RHEV-M should apply "Host has been rebooted" automatically, so that the virtual machines resident on the failed host A can be started on the 2 other hosts in the cluster. Thanks
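The voting rule described above can be sketched in a few lines of Python; the function name and the result map are illustrative, not an existing RHEV-M API:

```python
def host_presumed_down(fence_results, required_votes=3):
    """Return True if enough voters (peer hosts plus RHEV-M) reported an
    'unknown' fence result for the target host, e.g. B, C and RHEV-M all
    failing to reach host A."""
    failed_votes = sum(1 for result in fence_results.values() if result == "unknown")
    return failed_votes >= required_votes
```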
(In reply to Miguel González from comment #6)
> Suppose the following scenario: 3 hosts in the same RHEV cluster,
> regardless of their physical location.
>
> Host A: fails due to a power problem, or its management port fails on the
> switch.
> Host B: attempts to fence host A; the answer is "unknown".
> Host C: attempts to fence host A; the answer is "unknown".
> RHEV-M: attempts to fence host A; the answer is "unknown".
>
> What I propose to include in the roadmap is as follows:
>
> If we add up host B, host C, and RHEV-M, we have 3 failed votes, so RHEV-M
> should apply "Host has been rebooted" automatically, so that the virtual
> machines resident on the failed host A can be started on the 2 other hosts
> in the cluster.
>
> Thanks

That could be problematic in a stretch cluster: physical sites CPD1 and CPD2 with 4 hosts each, configured as ONE RHEV data center and cluster, with RHEV-Manager provided as a clustered (RHCS) service over HA-LVM with one disk on each CPD's SAN. In the event of a split brain, RHEV could conclude that four hosts have failed (for example, CPD2): the 4 hosts host1.cpd1, host2.cpd1, host3.cpd1, and host4.cpd1 would all report that fencing the "host*.cpd2" hosts returned "unknown", and RHEV could then try to start VMs that are still running on the host*.cpd2 hosts. We must be very careful in these situations to avoid data loss.
This is very understandable and I agree with you; however, the customer must be given the ability to choose, and the management console should offer this functionality. Some customers prefer to preserve data integrity and are willing to wait for the failed host to recover, while others prefer to maintain service availability and have the machines started on another host quickly (rather than waiting indefinitely for the host to come back healthy). I think it would make sense for the customer to have this functionality available and enable it when they see fit.
(In reply to Miguel González from comment #8)
> This is very understandable and I agree with you; however, the customer
> must be given the ability to choose, and the management console should
> offer this functionality.
>
> Some customers prefer to preserve data integrity and are willing to wait
> for the failed host to recover, while others prefer to maintain service
> availability and have the machines started on another host quickly (rather
> than waiting indefinitely for the host to come back healthy).
>
> I think it would make sense for the customer to have this functionality
> available and enable it when they see fit.

I agree, this can be useful, but we also need to be careful. Could this be implemented using a custom rules engine with some 'sample' behaviours? That way, each customer could check a box per cluster to enable or disable this behaviour, and could tune the number of votes, a regexp of hostnames, etc., to better select the hosts that will decide.

Regards, Pablo
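A possible shape for the per-cluster policy Pablo describes, sketched in Python. The policy dict, its keys, and its defaults are assumptions for illustration, not an existing RHEV-M schema:

```python
import re


def auto_fence_allowed(policy, fence_results):
    """Apply a per-cluster policy: an enable/disable checkbox, a regexp
    selecting which hosts may vote, and a minimum number of failed votes."""
    if not policy.get("enabled", False):
        return False
    voter_rx = re.compile(policy.get("voter_regexp", ".*"))
    failed = sum(1 for host, result in fence_results.items()
                 if voter_rx.match(host) and result == "unknown")
    return failed >= policy.get("min_votes", 3)
```

With `voter_regexp` a customer could, for instance, restrict the vote to hosts of a single site, addressing the stretch-cluster concern: votes from one side of a split brain alone would not be enough to reach quorum.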
If the status is "unknown" then we can't make any assumptions. Storage-based fencing is the best approach.
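A rough illustration of why storage-based fencing can decide where power management cannot: a live host must keep renewing its storage lease, so a stale renewal timestamp is positive evidence that the host has lost access to the storage. The function and its parameters are a sketch only; the 80-second default is an assumption loosely modelled on sanlock's usual lease expiry, not a value read from any RHEV configuration:

```python
import time


def holds_storage_lease(last_renewal, now=None, lease_timeout=80):
    """True while the host's lease renewal is recent enough to be valid.
    Once the lease has expired, the host provably can no longer write to
    the shared storage, so its VMs can be safely restarted elsewhere."""
    now = time.time() if now is None else now
    return (now - last_renewal) < lease_timeout
```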