Bug 1840407

Summary: RFE: A way to implement a more dynamic fence delay so that in the event of a network split, the cluster will fence the node running the fewest resources
Product: Red Hat Enterprise Linux 7
Component: pacemaker
Version: 7.8
Hardware: Unspecified
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: high
Reporter: Ken Gaillot <kgaillot>
Assignee: Ken Gaillot <kgaillot>
QA Contact: cluster-qe <cluster-qe>
CC: cluster-maint, cluster-qe, cnewsom, nwahl, phagara, sbradley, slevine
Keywords: FutureFeature
Target Milestone: rc
Target Release: 7.9
Doc Type: No Doc Update
Clone Of: 1784601
Bug Depends On: 1784601
Last Closed: 2020-05-27 16:29:12 UTC

Description Ken Gaillot 2020-05-26 21:35:29 UTC
+++ This bug was initially created as a clone of Bug #1784601 +++

1. Proposed title of this feature request

A way to implement a more dynamic fence delay so that in the event of a network split, the cluster will fence the node running the fewest resources.

2. What is the nature and description of the request?

The customer has a fence delay to avoid fence races.

The customer described a scenario in which a network split would result in the fencing of the node hosting the majority of cluster resources, if those resources happen to be running on the node whose stonith resource is not configured with a delay. Currently, the delay attribute must be configured ahead of time to decide the winner of a fence race. The customer has requested that we file an RFE to look into ways for the delay to apply to the node hosting the majority of resources, in order to minimize the impact of a fence event that results from a network split.
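For reference, the static delay in use today is typically configured on just one node's fence device, along these lines (a rough sketch only; the device and node names are hypothetical and the fence-agent-specific options are omitted):

  # Two-node cluster: only the device that fences node1 gets a delay, so in a
  # fence race node2 is shot first and node1 survives, regardless of which
  # node is actually running the bulk of the resources.
  pcs stonith create fence-node1 fence_ipmilan pcmk_host_list=node1 pcmk_delay_base=15s <agent options>
  pcs stonith create fence-node2 fence_ipmilan pcmk_host_list=node2 <agent options>

This is exactly the limitation described above: the delay is attached to a fixed node ahead of time, not to whichever node happens to be running the most resources when the split occurs.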

--- Additional comment from Ken Gaillot on 2019-12-17 20:29:51 UTC ---

Coincidentally, there was a discussion upstream about this topic recently.[1] The proposed solution design is:

* Users would use the existing "priority" resource meta-attribute to weight each resource.

* A new cluster property "priority-fencing-delay" would set the specified delay on all cluster-initiated fencing targeting the node with the highest cumulative priority of all resources active on it. (It would not apply to manual fencing via stonith_admin, or externally initiated fencing by something like DLM.)

For a simple resource count as described here, it would be sufficient to give every resource a priority of 1. Of course, if some resources are more important, they can be given a higher priority.
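As a rough sketch of how resources might be weighted with pcs under this design (the resource names are hypothetical):

  # The database counts for 10 and the web server for 1, so whichever node
  # hosts the database accumulates the higher combined priority and is the
  # one more likely to survive a fence race.
  pcs resource meta db-primary priority=10
  pcs resource meta webserver priority=1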

[1] https://github.com/ClusterLabs/fence-agents/pull/308

--- Additional comment from Ken Gaillot ---

This has been fixed upstream as of commit 3dec930.

This is really only useful for 2-node clusters. If the new property is set, and the "priority" meta-attribute is configured for at least one resource, then in a split-brain situation, the node with the highest combined priority of all resources running on it will be more likely to survive.

As an example, with:

  pcs resource defaults priority=1
  pcs property set priority-fencing-delay=15s

and no other priorities, then the node running the most resources will be more likely to survive, because the other node will wait 15 seconds before initiating fencing. If a particular resource is more important than the rest, you can give it a higher priority.

The node running the master role of a promotable clone will get an extra 1 point if a priority has been configured for that clone.
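For a promotable (master/slave) resource, the priority is likewise set as a meta-attribute on the clone; a hypothetical sketch:

  # Hypothetical promotable resource with priority 5: per the behavior above,
  # the node running the master role counts 5 plus the extra 1 point, so it
  # is favored over the node running only the slave instance.
  pcs resource meta drbd-data-master priority=5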

Comment 2 Patrik Hagara 2020-05-27 08:25:49 UTC
qa_ack+, feature described in comment#0

Comment 3 Ken Gaillot 2020-05-27 16:28:58 UTC
After further consideration of the upgrade issues this would cause, this feature will not be included in the upstream release; therefore we will not include it in RHEL either.

The main problem is that using this feature in a RHEL 7.9 cluster would cause the feature to stop working on an upgrade to a RHEL 8.0-8.2 cluster. We don't want to require a compatibility matrix for RHEL 7->8 upgrades.