1710422 – Default the concurrent-fencing cluster property to true

Bug 1710422 - Default the concurrent-fencing cluster property to true

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	pacemaker
Sub Component:
Version:	7.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	pre-dev-freeze
Target Release:	7.8
Assignee:	Ken Gaillot
QA Contact:	cluster-qe@redhat.com
Docs Contact:	Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks:	1715426

Reported:	2019-05-15 14:06 UTC by Shane Bradley
Modified:	2020-09-21 09:31 UTC
CC List:	6 users
Fixed In Version:	pacemaker-1.1.21-1.el7
Doc Type:	Enhancement
Doc Text:	.Default value of Pacemaker `concurrent-fencing` cluster property now set to `true`
Pacemaker now defaults the `concurrent-fencing` cluster property to `true`. If multiple nodes need to be fenced at the same time and they use different configured fence devices, Pacemaker executes the fencing simultaneously rather than serially, as before. This can greatly speed up recovery in a large cluster when multiple nodes must be fenced.
Clone Of:
Clones:	1715426
Environment:
Last Closed:	2020-03-31 19:41:51 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	4147151	None	None	None	2019-05-15 14:06:40 UTC
Red Hat Knowledge Base (Solution)	4874351	None	None	None	2020-03-04 16:35:10 UTC
Red Hat Product Errata	RHBA-2020:1032	None	None	None	2020-03-31 19:42:35 UTC

Comment 3 Ken Gaillot 2019-05-16 15:48:19 UTC

A workaround for this issue is to set the concurrent-fencing cluster property to true.

Note that the particular problem in the customer's environment that led to this BZ is possible only when successful fencing depends on some cluster resource being active. That is an inherently risky configuration, though it should be acceptable if (as in OpenStack's case) the fence device is restricted to targets that are not allowed to run the resource being depended on.

I see three potential approaches for fixing this on the pacemaker side:

* Change concurrent-fencing to default to true. Upon investigation, it appears that serialized fencing was necessary only for pacemaker's original fence daemon, which was replaced long ago.

* Order remote node fencing (i.e. compute nodes) after cluster node fencing. This is a purely heuristic approach, as it seems more likely that remote nodes would require such a dependent fence configuration. However, this would not cover all possible cases and might be undesirable if the DC is the cluster node that needs fencing.

* Order retries of failed fencing attempts after first fencing attempts for other targets, when concurrent-fencing is false. This is definitely a good idea, but the implementation would be more difficult because the scheduler doesn't currently have information about fencing failures. Adding that information would add a small storage increase to all configurations for the benefit of (rare) situations similar to this one. If concurrent-fencing is defaulted to true, the situation will be even rarer.

At this point, I plan on going with solely the first option to address this BZ, with the last option kept on the radar as a possible future enhancement.

Comment 4 Ken Gaillot 2019-05-30 10:32:41 UTC

QA: It is not necessary to reproduce the customer's issue, just verify that fencing is now parallelized by default. Previously, if more than one node needed to be fenced at the same time (e.g. kill power to 2 nodes in a 5-node cluster), pacemaker by default would wait until one fencing completed before executing the next. Now, as long as the fence targets have different fence devices (e.g. individual IPMI), pacemaker will execute the fencing in parallel.

This is controlled by the concurrent-fencing cluster property, which previously defaulted to false, and now defaults to true.
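
To illustrate the difference described above, here is a rough toy model in Python. It is not pacemaker code: the function name, the node and device names, and the fixed 15-second duration per fencing operation are all assumptions made for the sketch, as is the idea that targets sharing one fence device still wait for each other when concurrent fencing is enabled.

```python
from collections import defaultdict


def fencing_makespan(targets, concurrent, seconds_per_fence=15.0):
    """Estimate wall-clock time to fence all targets in a toy model.

    targets: dict mapping node name -> fence device name (hypothetical names).
    concurrent: stands in for the concurrent-fencing cluster property.
    seconds_per_fence: assumed fixed duration of one fencing operation.
    """
    if not concurrent:
        # Serialized: each operation waits for the previous one to complete.
        return len(targets) * seconds_per_fence

    # Parallel: targets with different devices are fenced at the same time.
    # Assumption of this sketch: targets sharing a device queue on it.
    per_device = defaultdict(int)
    for device in targets.values():
        per_device[device] += 1
    return max(per_device.values(), default=0) * seconds_per_fence


# Two nodes, each with its own (hypothetical) IPMI fence device.
two_nodes = {"node1": "ipmi-node1", "node2": "ipmi-node2"}
print(fencing_makespan(two_nodes, concurrent=False))  # 30.0
print(fencing_makespan(two_nodes, concurrent=True))   # 15.0
```

In this model the serialized time grows with the number of targets, while the parallel time depends only on the busiest fence device.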

Comment 6 Steven J. Levine 2019-08-22 19:28:00 UTC

Added title and made slight edit to release note description.

Comment 7 Patrik Hagara 2020-01-31 18:58:18 UTC

> [root@f09-h29-b04-5039ms ~]# rpm -q pacemaker
> pacemaker-1.1.21-4.el7.x86_64

concurrent-fencing now defaults to true:
> [root@f09-h29-b04-5039ms ~]# pcs property --all | grep concurrent-fencing
>  concurrent-fencing: true

aftermath of killing 15 nodes in a 32-node cluster at once (using `halt -f`):
> [root@f09-h29-b04-5039ms ~]# crm_mon -m -1
> Stack: corosync
> Current DC: f09-h20-b03-5039ms (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Fri Jan 31 19:40:11 2020
> Last change: Fri Jan 31 19:12:58 2020 by root via cibadmin on f09-h29-b04-5039ms
> 
> 32 nodes configured
> 32 resources configured
> 
> Node f09-h17-b03-5039ms: pending
> Node f09-h17-b06-5039ms: pending
> Node f09-h17-b07-5039ms: pending
> Node f09-h20-b01-5039ms: pending
> Node f09-h20-b02-5039ms: pending
> Node f09-h20-b05-5039ms: pending
> Node f09-h20-b06-5039ms: pending
> Node f09-h20-b07-5039ms: pending
> Node f09-h23-b01-5039ms: pending
> Node f09-h23-b02-5039ms: pending
> Node f09-h23-b05-5039ms: pending
> Node f09-h23-b07-5039ms: pending
> Node f09-h26-b01-5039ms: pending
> Node f09-h26-b03-5039ms: pending
> Node f09-h29-b08-5039ms: pending
> Online: [ f09-h17-b02-5039ms f09-h17-b04-5039ms f09-h17-b05-5039ms f09-h17-b08-5039ms f09-h20-b03-5039ms f09-h20-b04-5039ms f09-h20-b08-5039ms f09-h23-b03-5039ms f09-h23-b04-5039ms f09-h23-b06-5039ms f09-h23-b08-5039ms f09-h26-b02-5039ms f09-h26-b04-5039ms f09-h29-b04-5039ms f09-h29-b05-5039ms f09-h29-b06-5039ms f09-h29-b07-5039ms ]
> 
> Active resources:
> 
>  fence-f09-h29-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b02-5039ms
>  fence-f09-h29-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b04-5039ms
>  fence-f09-h29-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b05-5039ms
>  fence-f09-h29-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b08-5039ms
>  fence-f09-h29-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b03-5039ms
>  fence-f09-h17-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b04-5039ms
>  fence-f09-h17-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b08-5039ms
>  fence-f09-h17-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b03-5039ms
>  fence-f09-h17-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b04-5039ms
>  fence-f09-h17-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b06-5039ms
>  fence-f09-h17-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b08-5039ms
>  fence-f09-h17-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b02-5039ms
>  fence-f09-h20-b01-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b04-5039ms
>  fence-f09-h20-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b05-5039ms
>  fence-f09-h20-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b06-5039ms
>  fence-f09-h20-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b04-5039ms
>  fence-f09-h20-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b07-5039ms
>  fence-f09-h20-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b02-5039ms
>  fence-f09-h20-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b05-5039ms
>  fence-f09-h20-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b04-5039ms
>  fence-f09-h23-b01-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b08-5039ms
>  fence-f09-h23-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b03-5039ms
>  fence-f09-h23-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b04-5039ms
>  fence-f09-h23-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b08-5039ms
>  fence-f09-h23-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b03-5039ms
>  fence-f09-h23-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b04-5039ms
>  fence-f09-h23-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b06-5039ms
>  fence-f09-h23-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b08-5039ms
>  fence-f09-h26-b01-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b02-5039ms
>  fence-f09-h26-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b04-5039ms
>  fence-f09-h26-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b04-5039ms
>  fence-f09-h26-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b07-5039ms
> 
> Fencing History:
> * reboot of f09-h17-b07-5039ms successful: delegate=f09-h23-b08-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h23-b02-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h23-b01-5039ms successful: delegate=f09-h23-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h29-b08-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h20-b01-5039ms successful: delegate=f09-h26-b02-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h17-b06-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h26-b01-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h23-b07-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b06-5039ms successful: delegate=f09-h17-b02-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h26-b03-5039ms successful: delegate=f09-h23-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h17-b03-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b07-5039ms successful: delegate=f09-h17-b05-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b05-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b02-5039ms successful: delegate=f09-h29-b05-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:27 2020'
> * reboot of f09-h23-b05-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:27 2020'

result: all nodes fenced within a 3-second interval


disabling concurrent fencing:
> [root@f09-h29-b04-5039ms ~]# pcs property set concurrent-fencing=false
> [root@f09-h29-b04-5039ms ~]# pcs property --all | grep concurrent-fencing
>  concurrent-fencing: false

and again killing 15 out of 32 nodes at once:
> [root@f09-h17-b04-5039ms ~]# crm_mon -m -1
> Stack: corosync
> Current DC: f09-h17-b04-5039ms (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Fri Jan 31 19:55:13 2020
> Last change: Fri Jan 31 19:51:21 2020 by root via cibadmin on f09-h29-b04-5039ms
> 
> 32 nodes configured
> 32 resources configured
> 
> Node f09-h23-b02-5039ms: UNCLEAN (offline)
> Node f09-h23-b03-5039ms: UNCLEAN (offline)
> Node f09-h23-b04-5039ms: UNCLEAN (offline)
> Node f09-h23-b07-5039ms: UNCLEAN (offline)
> Node f09-h26-b03-5039ms: UNCLEAN (offline)
> Node f09-h29-b04-5039ms: UNCLEAN (offline)
> Node f09-h29-b05-5039ms: UNCLEAN (offline)
> Node f09-h29-b08-5039ms: UNCLEAN (offline)
> Online: [ f09-h17-b02-5039ms f09-h17-b03-5039ms f09-h17-b04-5039ms f09-h17-b05-5039ms f09-h17-b07-5039ms f09-h17-b08-5039ms f09-h20-b01-5039ms f09-h20-b07-5039ms f09-h20-b08-5039ms f09-h23-b05-5039ms f09-h23-b06-5039ms f09-h23-b08-5039ms f09-h26-b01-5039ms f09-h26-b02-5039ms f09-h26-b04-5039ms f09-h29-b06-5039ms f09-h29-b07-5039ms ]
> OFFLINE: [ f09-h17-b06-5039ms f09-h20-b02-5039ms f09-h20-b03-5039ms f09-h20-b04-5039ms f09-h20-b05-5039ms f09-h20-b06-5039ms f09-h23-b01-5039ms ]
> 
> Active resources:
> 
>  fence-f09-h29-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b02-5039ms
>  fence-f09-h29-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b04-5039ms
>  fence-f09-h29-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b05-5039ms
>  fence-f09-h29-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b08-5039ms
>  fence-f09-h29-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b03-5039ms
>  fence-f09-h17-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b07-5039ms
>  fence-f09-h17-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b08-5039ms
>  fence-f09-h17-b04-5039ms	(stonith:fence_ipmilan):	Started[ f09-h23-b03-5039ms f09-h20-b01-5039ms ]
>  fence-f09-h17-b05-5039ms	(stonith:fence_ipmilan):	Started[ f09-h23-b04-5039ms f09-h20-b07-5039ms ]
>  fence-f09-h17-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b06-5039ms
>  fence-f09-h17-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b08-5039ms
>  fence-f09-h17-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b02-5039ms
>  fence-f09-h20-b01-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b04-5039ms
>  fence-f09-h20-b02-5039ms	(stonith:fence_ipmilan):	Started[ f09-h23-b05-5039ms f09-h29-b05-5039ms ]
>  fence-f09-h20-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b06-5039ms
>  fence-f09-h20-b04-5039ms	(stonith:fence_ipmilan):	Started[ f09-h26-b01-5039ms f09-h29-b04-5039ms ]
>  fence-f09-h20-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b07-5039ms
>  fence-f09-h20-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b03-5039ms
>  fence-f09-h20-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b02-5039ms
>  fence-f09-h20-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b07-5039ms
>  fence-f09-h23-b01-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b01-5039ms
>  fence-f09-h23-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b04-5039ms
>  fence-f09-h23-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b05-5039ms
>  fence-f09-h23-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b08-5039ms
>  fence-f09-h23-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b07-5039ms
>  fence-f09-h23-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b08-5039ms
>  fence-f09-h23-b07-5039ms	(stonith:fence_ipmilan):	Started[ f09-h23-b02-5039ms f09-h23-b05-5039ms ]
>  fence-f09-h23-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b06-5039ms
>  fence-f09-h26-b01-5039ms	(stonith:fence_ipmilan):	Started[ f09-h23-b08-5039ms f09-h23-b07-5039ms ]
>  fence-f09-h26-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b01-5039ms
>  fence-f09-h26-b03-5039ms	(stonith:fence_ipmilan):	Started[ f09-h26-b02-5039ms f09-h26-b03-5039ms ]
>  fence-f09-h26-b04-5039ms	(stonith:fence_ipmilan):	Started[ f09-h29-b08-5039ms f09-h26-b04-5039ms ]
> 
> Fencing History:
> * reboot of f09-h23-b02-5039ms pending: client=crmd.16632, origin=f09-h17-b04-5039ms
> * reboot of f09-h23-b01-5039ms successful: delegate=f09-h20-b01-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:59 2020'
> * reboot of f09-h20-b06-5039ms successful: delegate=f09-h17-b03-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:44 2020'
> * reboot of f09-h20-b05-5039ms successful: delegate=f09-h29-b07-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:29 2020'
> * reboot of f09-h20-b04-5039ms successful: delegate=f09-h26-b01-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:13 2020'
> * reboot of f09-h20-b03-5039ms successful: delegate=f09-h29-b06-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:53:58 2020'
> * reboot of f09-h20-b02-5039ms successful: delegate=f09-h23-b05-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:53:43 2020'
> * reboot of f09-h17-b06-5039ms successful: delegate=f09-h23-b06-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:53:27 2020'

result: reverted to old behavior -- fencing is serialized (~15s/node)
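
The timing conclusions above can be checked mechanically. The following is a minimal Python sketch, assuming the fencing history has been saved as text in the `crm_mon -m -1` format shown in this comment; the function names and the sample variable are made up for illustration, and the timestamp parsing assumes an English locale.

```python
import re
from datetime import datetime

# Matches the last-successful='...' field in crm_mon fencing history lines.
STAMP = re.compile(r"last-successful='([^']+)'")


def fencing_times(crm_mon_text):
    """Return the sorted completion times found in the fencing history."""
    return sorted(
        datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        for stamp in STAMP.findall(crm_mon_text)
    )


def span_and_gaps(times):
    """Return (total span in seconds, gaps between consecutive completions)."""
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    span = (times[-1] - times[0]).total_seconds() if times else 0.0
    return span, gaps


# Three entries copied from the serialized run above.
SERIAL_SAMPLE = """
* reboot of f09-h20-b03-5039ms successful: delegate=f09-h29-b06-5039ms,
    last-successful='Fri Jan 31 19:53:58 2020'
* reboot of f09-h20-b02-5039ms successful: delegate=f09-h23-b05-5039ms,
    last-successful='Fri Jan 31 19:53:43 2020'
* reboot of f09-h17-b06-5039ms successful: delegate=f09-h23-b06-5039ms,
    last-successful='Fri Jan 31 19:53:27 2020'
"""

print(span_and_gaps(fencing_times(SERIAL_SAMPLE)))  # (31.0, [16.0, 15.0])
```

Gaps of roughly 15 seconds between completions are what the serialized run shows, whereas the concurrent run's timestamps all fall within a few seconds of each other.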

marking verified in 1.1.21-4.el7

Comment 9 errata-xmlrpc 2020-03-31 19:41:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1032
