Bug 1715426

Summary: Default the concurrent-fencing cluster property to true
Product: Red Hat Enterprise Linux 8 Reporter: Ken Gaillot <kgaillot>
Component: pacemaker Assignee: Ken Gaillot <kgaillot>
Status: CLOSED ERRATA QA Contact: pkomarov
Severity: medium Docs Contact: Steven J. Levine <slevine>
Priority: high    
Version: 8.0 CC: abeekhof, cluster-maint, cluster-qe, lmanasko, marjones, pasik, pkomarov, sbradley
Target Milestone: pre-dev-freeze   
Target Release: 8.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: pacemaker-2.0.2-1.el8 Doc Type: Enhancement
Doc Text:
.Pacemaker now defaults the `concurrent-fencing` cluster property to `true`
If multiple cluster nodes need to be fenced at the same time, and they use different configured fence devices, Pacemaker will now execute the fencing simultaneously, rather than serially as before. This can greatly speed up recovery in a large cluster when multiple nodes must be fenced.
Story Points: ---
Clone Of: 1710422 Environment:
Last Closed: 2019-11-05 20:57:48 UTC Type: Enhancement
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1710422    
Bug Blocks:    

Description Ken Gaillot 2019-05-30 10:24:18 UTC
+++ This bug was initially created as a clone of Bug #1710422 +++

Description of problem:
A customer hard-killed multiple remote nodes and a controller (cluster) node in an OpenStack cluster. The remote nodes and the controller node then needed to be fenced.

The serialized fencing caused an issue where compute-15 was ordered to be fenced before controller-0. Fencing of compute-15 could not complete because the keystone IP resource (which is required for fence_compute) could not be started until controller-0 was fenced, and controller-0 could not be fenced until compute-15 was fenced.


Steps to Reproduce:
1. Hard kill multiple remote nodes and a controller (cluster) node for
   an OpenStack cluster.

Actual results:
The cluster was stuck in a loop that it could not get out of because of serialized fencing. The serialized fencing caused an issue where compute-15 was ordered to be fenced before controller-0. Fencing of compute-15 could not complete because the keystone IP resource could not be started until controller-0 was fenced, and controller-0 could not be fenced until compute-15 was fenced.

Expected results:
The cluster should not get stuck in a recovery/fencing loop that it cannot get out of.

--- Additional comment from Ken Gaillot on 2019-05-16 15:48:19 UTC ---

A workaround for this issue is to set the concurrent-fencing cluster property to true.
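
For example, with pcs, something like:

pcs property set concurrent-fencing=true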

Note that the particular problem in the customer's environment that led to this BZ is possible only when successful fencing depends on some cluster resource being active. That is an inherently risky configuration, though it should be acceptable if (as in OpenStack's case) the fence device is restricted to targets that are not allowed to run the resource being depended on.

I see three potential approaches for fixing this on the pacemaker side:

* Change concurrent-fencing to default to true. Upon investigation, it appears that serialized fencing was necessary only for pacemaker's original fence daemon that was long ago replaced.

* Order remote node fencing (i.e. compute nodes) after cluster node fencing. This is a purely heuristic approach, as it seems more likely that remote nodes would require such a dependent fence configuration. However, this would not cover all possible cases and might be undesirable if the DC is the cluster node that needs fencing.

* Order retries of failed fencing attempts after first fencing attempts for other targets, when concurrent-fencing is false. This is definitely a good idea, but the implementation would be more difficult because the scheduler doesn't currently have information about fencing failures. Adding that information would impose a small storage increase on all configurations for the benefit of (rare) situations similar to this one. If concurrent-fencing is defaulted to true, the situation will be even rarer.

At this point, I plan on going with solely the first option to address this BZ, with the last option kept on the radar as a possible future enhancement.

Comment 1 Ken Gaillot 2019-05-30 10:32:37 UTC
QA: It is not necessary to reproduce the customer's issue, just verify that fencing is now parallelized by default. If more than one node needs to be fenced at the same time (e.g. kill power to 2 nodes in a 5-node cluster), by default pacemaker would wait until one fencing completed before executing the next. Now, as long as the fence targets have different fence devices (e.g. individual IPMI), pacemaker will execute the fencing in parallel.

This is controlled by the concurrent-fencing cluster property, which previously defaulted to false, and now defaults to true.
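
To verify, the effective default can be checked with something like:

pcs property --defaults | grep concurrent-fencing

which should now show concurrent-fencing: true when the property has not been set explicitly.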

Comment 2 Ken Gaillot 2019-06-06 19:33:40 UTC
A compile-time option for this behavior was added in upstream commit 463eb8e

Comment 3 Ken Gaillot 2019-06-08 19:58:25 UTC
Upon further investigation, defaulting concurrent-fencing to true is a good idea but only a partial solution.

Even with concurrent-fencing, if the DC needs to be fenced, the DC's fencing is ordered after any other fencing. This avoids repeated DC elections if multiple cluster nodes need to be fenced, and minimizes the chance of a new DC taking over and missing the need for some of the fencing. Unfortunately that means the problem scenario here can arise if the DC is hosting the IP needed for the remote node fencing.

The proposal to order failed fencing attempts after new fencing attempts is not a good idea because there are other scenarios where that ordering causes problems just as easily. For example, if the cluster node hosting the IP fails, a fencing attempt on it fails, and then a new transition schedules fencing of both the cluster node and the remote node, we would get the deadlock with that design.

The proposal to order remote node fencing after cluster node fencing seems iffy. On the plus side, remote nodes can't be elected DC, so there's no risk of repeated DC elections, and it would never make sense to have cluster node fencing depend on a resource running on remote nodes. On the minus side, if a remote node is being fenced because of a failed connection resource on a cluster node that is also being fenced, and the DC is being fenced, then fencing the cluster nodes first would mean that the connection failure history would be lost before the new DC ran the scheduler.

I'll have to think more about what a full solution might look like, but this is a good step.

QA: Note that fencing will be done in parallel with concurrent-fencing only for non-DC nodes.

Comment 5 Ken Gaillot 2019-06-10 21:52:00 UTC
As an immediate workaround, in addition to setting concurrent-fencing=true, it is possible to ban the dependent resource from the DC. The syntax is:

pcs constraint location <resource-fencing-depends-on> rule score=-200000 "#is_dc" ne string true

I'm suggesting a finite score (as opposed to -INFINITY) so the DC can run the IP as a last resort if no other node can run it for any reason.
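
For reference, the constraint that command creates should look roughly like the following in the CIB (the IDs here are hypothetical, and the exact XML pcs generates may differ slightly):

<rsc_location id="location-ip-not-dc" rsc="<resource-fencing-depends-on>">
  <rule id="location-ip-not-dc-rule" score="-200000">
    <expression id="location-ip-not-dc-expr" attribute="#is_dc" operation="ne" value="true" type="string"/>
  </rule>
</rsc_location>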

To do any better than that, we'll likely need to implement new syntax to indicate fencing dependencies.

Comment 6 Ken Gaillot 2019-06-11 14:08:21 UTC
(In reply to Ken Gaillot from comment #5)
> As an immediate workaround, in addition to setting concurrent-fencing=true,
> it is possible to ban the dependent resource from the DC. The syntax is:
> 
> pcs constraint location <resource-fencing-depends-on> rule score=-200000
> "#is_dc" ne string true

Hold off on this recommendation for a bit, we're still working through the implications.

Thankfully, the problem space is much narrower than thought. If the (original) DC is being fenced because the whole node stopped being functional, then another node has already taken over DC, and concurrent-fencing is sufficient. If the DC is scheduling itself for fencing because of some other problem (e.g. a failed resource stop), then the IP resource may still be functional, and the remote fencing can proceed. The only problem case is when the DC schedules itself for fencing and the IP is nonfunctional.

Comment 7 Ken Gaillot 2019-06-18 17:09:26 UTC
To recap, concurrent-fencing is sufficient to avoid the problem in all but this situation:

* The cluster node that is DC is functional, yet has to schedule itself for fencing (typically this would be for a failed resource stop, or a failure of an action with on-fail=fence configured); and,

* The keystone IP address is running on the DC but not functional.

The suggested workaround for that situation in Comment 5 (a rule keeping the keystone IP off the DC) should work, but with some trade-offs: the IP's resource-stickiness must be lowered below the constraint score, and the IP may move unnecessarily when the DC changes normally. Given the narrow risk, we've decided not to recommend it as a general policy.
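
For example, if the keystone IP resource were named ip-keystone (a hypothetical name), its stickiness could be lowered with something like:

pcs resource meta ip-keystone resource-stickiness=100000

so that the -200000 rule score outweighs stickiness when the node hosting the IP becomes DC.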

If a user is more concerned with the risk than the trade-offs, they can try the workaround.

A potential complete solution, that would work regardless of the value of concurrent-fencing, has been suggested as a new feature in Bug 1721603. However, the work would be significant and the benefit small, so it is a low priority at this time.

Comment 10 pkomarov 2019-09-03 21:13:36 UTC
Verified, 

[stack@undercloud-0 ~]$ ansible controller -b -mshell -a'rpm -q pacemaker'

controller-0 | CHANGED | rc=0 >>
pacemaker-2.0.2-1.el8.x86_64

controller-1 | CHANGED | rc=0 >>
pacemaker-2.0.2-1.el8.x86_64

controller-2 | CHANGED | rc=0 >>
pacemaker-2.0.2-1.el8.x86_64

[stack@undercloud-0 ~]$ ansible controller -b -mshell -a'pcs property --defaults|grep concurrent-fencing'

controller-1 | CHANGED | rc=0 >>
 concurrent-fencing: true

controller-0 | CHANGED | rc=0 >>
 concurrent-fencing: true

controller-2 | CHANGED | rc=0 >>
 concurrent-fencing: true

Comment 11 Steven J. Levine 2019-09-24 21:24:28 UTC
Slight reformat of doc text for release note to isolate release note entry title (which involved small edit to remaining text).

Comment 13 errata-xmlrpc 2019-11-05 20:57:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3385