Bug 1710422 - Default the concurrent-fencing cluster property to true
Summary: Default the concurrent-fencing cluster property to true
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: pre-dev-freeze
Target Release: 7.8
Assignee: Ken Gaillot
QA Contact: cluster-qe@redhat.com
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Depends On:
Blocks: 1715426
 
Reported: 2019-05-15 14:06 UTC by Shane Bradley
Modified: 2020-03-31 19:42 UTC (History)

Fixed In Version: pacemaker-1.1.21-1.el7
Doc Type: Enhancement
Doc Text:
.Default value of Pacemaker `concurrent-fencing` cluster property now set to `true`

Pacemaker now defaults the `concurrent-fencing` cluster property to `true`. If multiple nodes need to be fenced at the same time and they use different configured fence devices, Pacemaker now executes the fencing operations simultaneously rather than serially, as it did before. This can greatly speed up recovery in a large cluster when multiple nodes must be fenced.
Clone Of:
Clones: 1715426
Environment:
Last Closed: 2020-03-31 19:41:51 UTC
Target Upstream Version:


Links
System                             ID              Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)  4147151         None      None    None     2019-05-15 14:06:40 UTC
Red Hat Knowledge Base (Solution)  4874351         None      None    None     2020-03-04 16:35:10 UTC
Red Hat Product Errata             RHBA-2020:1032  None      None    None     2020-03-31 19:42:35 UTC

Comment 3 Ken Gaillot 2019-05-16 15:48:19 UTC
A workaround for this issue is to set the concurrent-fencing cluster property to true.

Note that the particular problem in the customer's environment that led to this BZ is possible only when successful fencing depends on some cluster resource being active. That is an inherently risky configuration, though it should be acceptable if (as in OpenStack's case) the fence device is restricted to targets that are not allowed to run the resource being depended on.

I see three potential approaches for fixing this on the pacemaker side:

* Change concurrent-fencing to default to true. Upon investigation, it appears that serialized fencing was necessary only for pacemaker's original fence daemon, which was replaced long ago.

* Order remote node fencing (i.e. compute nodes) after cluster node fencing. This is a purely heuristic approach, as it seems more likely that remote nodes would require such a dependent fence configuration. However, this would not cover all possible cases, and it might be undesirable if the DC is the cluster node that needs fencing.

* Order retries of failed fencing attempts after first fencing attempts for other targets, when concurrent-fencing is false. This is definitely a good idea, but the implementation would be more difficult because the scheduler doesn't currently have information about fencing failures. Adding that information would slightly increase storage use for all configurations, for the benefit of (rare) situations similar to this one. If concurrent-fencing is defaulted to true, the situation will be even rarer.

At this point, I plan on going with solely the first option to address this BZ, with the last option kept on the radar as a possible future enhancement.
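
For illustration only, the ordering idea behind the last option might look roughly like the sketch below. This is not pacemaker scheduler code. The function name, the node names, and the failure_counts mapping are all hypothetical, and it assumes the scheduler had per-target failure counts, which the comment above notes it currently does not.

```python
def order_serialized_fencing(targets, failure_counts):
    """Return fence targets in the order they would be attempted.

    Hypothetical sketch: targets with no recorded fencing failures go
    first, and retries of previously failed targets follow. sorted() is
    stable, so targets with equal failure counts keep their input order.
    """
    return sorted(targets, key=lambda target: failure_counts.get(target, 0))


# Hypothetical example: fencing of "compute-1" has already failed twice,
# so the first attempt for "controller-1" is ordered ahead of the retry.
pending = ["compute-1", "controller-1"]
failures = {"compute-1": 2}
print(order_serialized_fencing(pending, failures))
```

With concurrent-fencing set to true, this kind of ordering matters less, since a failing retry for one target no longer holds up the first attempt for another.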

Comment 4 Ken Gaillot 2019-05-30 10:32:41 UTC
QA: It is not necessary to reproduce the customer's issue, just verify that fencing is now parallelized by default. Previously, if more than one node needed to be fenced at the same time (e.g. kill power to 2 nodes in a 5-node cluster), pacemaker would by default wait until one fencing completed before executing the next. Now, as long as the fence targets have different fence devices (e.g. individual IPMI), pacemaker will execute the fencing in parallel.

This is controlled by the concurrent-fencing cluster property, which previously defaulted to false, and now defaults to true.
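
As a rough way to picture the difference, the sketch below estimates total fencing time under both settings. It is a simplified model, not a description of pacemaker internals: the function name, device names, and per-operation durations are made up for illustration, and it assumes that targets sharing a fence device are still handled one after another.

```python
from collections import defaultdict


def estimated_fencing_time(targets, concurrent):
    """Estimate seconds needed to fence all targets.

    targets: list of (node, device, seconds) tuples (hypothetical values).
    concurrent=False models the old default: one fencing at a time.
    concurrent=True models the new default: different devices run in
    parallel, while targets on the same device are assumed to be serial.
    """
    if not concurrent:
        return sum(seconds for _, _, seconds in targets)
    per_device = defaultdict(int)
    for _, device, seconds in targets:
        per_device[device] += seconds
    return max(per_device.values(), default=0)


# Two nodes, each with its own IPMI device, 15 seconds per fencing
# (an illustrative figure, not a measured one).
targets = [("node1", "ipmi-node1", 15), ("node2", "ipmi-node2", 15)]
print(estimated_fencing_time(targets, concurrent=False))
print(estimated_fencing_time(targets, concurrent=True))
```

In this model, the serialized case grows with the number of targets, while the concurrent case is bounded by the busiest single device.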

Comment 6 Steven J. Levine 2019-08-22 19:28:00 UTC
Added title and made slight edit to release note description.

Comment 7 Patrik Hagara 2020-01-31 18:58:18 UTC
> [root@f09-h29-b04-5039ms ~]# rpm -q pacemaker
> pacemaker-1.1.21-4.el7.x86_64

concurrent-fencing now defaults to true:
> [root@f09-h29-b04-5039ms ~]# pcs property --all | grep concurrent-fencing
>  concurrent-fencing: true

aftermath of killing 15 nodes in a 32-node cluster at once (using `halt -f`):
> [root@f09-h29-b04-5039ms ~]# crm_mon -m -1
> Stack: corosync
> Current DC: f09-h20-b03-5039ms (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Fri Jan 31 19:40:11 2020
> Last change: Fri Jan 31 19:12:58 2020 by root via cibadmin on f09-h29-b04-5039ms
> 
> 32 nodes configured
> 32 resources configured
> 
> Node f09-h17-b03-5039ms: pending
> Node f09-h17-b06-5039ms: pending
> Node f09-h17-b07-5039ms: pending
> Node f09-h20-b01-5039ms: pending
> Node f09-h20-b02-5039ms: pending
> Node f09-h20-b05-5039ms: pending
> Node f09-h20-b06-5039ms: pending
> Node f09-h20-b07-5039ms: pending
> Node f09-h23-b01-5039ms: pending
> Node f09-h23-b02-5039ms: pending
> Node f09-h23-b05-5039ms: pending
> Node f09-h23-b07-5039ms: pending
> Node f09-h26-b01-5039ms: pending
> Node f09-h26-b03-5039ms: pending
> Node f09-h29-b08-5039ms: pending
> Online: [ f09-h17-b02-5039ms f09-h17-b04-5039ms f09-h17-b05-5039ms f09-h17-b08-5039ms f09-h20-b03-5039ms f09-h20-b04-5039ms f09-h20-b08-5039ms f09-h23-b03-5039ms f09-h23-b04-5039ms f09-h23-b06-5039ms f09-h23-b08-5039ms f09-h26-b02-5039ms f09-h26-b04-5039ms f09-h29-b04-5039ms f09-h29-b05-5039ms f09-h29-b06-5039ms f09-h29-b07-5039ms ]
> 
> Active resources:
> 
>  fence-f09-h29-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b02-5039ms
>  fence-f09-h29-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b04-5039ms
>  fence-f09-h29-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b05-5039ms
>  fence-f09-h29-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b08-5039ms
>  fence-f09-h29-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b03-5039ms
>  fence-f09-h17-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b04-5039ms
>  fence-f09-h17-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b08-5039ms
>  fence-f09-h17-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b03-5039ms
>  fence-f09-h17-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b04-5039ms
>  fence-f09-h17-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b06-5039ms
>  fence-f09-h17-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b08-5039ms
>  fence-f09-h17-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b02-5039ms
>  fence-f09-h20-b01-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b04-5039ms
>  fence-f09-h20-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b05-5039ms
>  fence-f09-h20-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b06-5039ms
>  fence-f09-h20-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b04-5039ms
>  fence-f09-h20-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b07-5039ms
>  fence-f09-h20-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b02-5039ms
>  fence-f09-h20-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b05-5039ms
>  fence-f09-h20-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b04-5039ms
>  fence-f09-h23-b01-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b08-5039ms
>  fence-f09-h23-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b03-5039ms
>  fence-f09-h23-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b04-5039ms
>  fence-f09-h23-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b08-5039ms
>  fence-f09-h23-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b03-5039ms
>  fence-f09-h23-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b04-5039ms
>  fence-f09-h23-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b06-5039ms
>  fence-f09-h23-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b08-5039ms
>  fence-f09-h26-b01-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b02-5039ms
>  fence-f09-h26-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b04-5039ms
>  fence-f09-h26-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b04-5039ms
>  fence-f09-h26-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b07-5039ms
> 
> Fencing History:
> * reboot of f09-h17-b07-5039ms successful: delegate=f09-h23-b08-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h23-b02-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h23-b01-5039ms successful: delegate=f09-h23-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h29-b08-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h20-b01-5039ms successful: delegate=f09-h26-b02-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h17-b06-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h26-b01-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h23-b07-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b06-5039ms successful: delegate=f09-h17-b02-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h26-b03-5039ms successful: delegate=f09-h23-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h17-b03-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b07-5039ms successful: delegate=f09-h17-b05-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b05-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b02-5039ms successful: delegate=f09-h29-b05-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:27 2020'
> * reboot of f09-h23-b05-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:27 2020'

result: all nodes fenced within a 3-second interval


disabling concurrent fencing:
> [root@f09-h29-b04-5039ms ~]# pcs property set concurrent-fencing=false
> [root@f09-h29-b04-5039ms ~]# pcs property --all | grep concurrent-fencing
>  concurrent-fencing: false

and again killing 15 out of 32 nodes at once:
> [root@f09-h17-b04-5039ms ~]# crm_mon -m -1
> Stack: corosync
> Current DC: f09-h17-b04-5039ms (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Fri Jan 31 19:55:13 2020
> Last change: Fri Jan 31 19:51:21 2020 by root via cibadmin on f09-h29-b04-5039ms
> 
> 32 nodes configured
> 32 resources configured
> 
> Node f09-h23-b02-5039ms: UNCLEAN (offline)
> Node f09-h23-b03-5039ms: UNCLEAN (offline)
> Node f09-h23-b04-5039ms: UNCLEAN (offline)
> Node f09-h23-b07-5039ms: UNCLEAN (offline)
> Node f09-h26-b03-5039ms: UNCLEAN (offline)
> Node f09-h29-b04-5039ms: UNCLEAN (offline)
> Node f09-h29-b05-5039ms: UNCLEAN (offline)
> Node f09-h29-b08-5039ms: UNCLEAN (offline)
> Online: [ f09-h17-b02-5039ms f09-h17-b03-5039ms f09-h17-b04-5039ms f09-h17-b05-5039ms f09-h17-b07-5039ms f09-h17-b08-5039ms f09-h20-b01-5039ms f09-h20-b07-5039ms f09-h20-b08-5039ms f09-h23-b05-5039ms f09-h23-b06-5039ms f09-h23-b08-5039ms f09-h26-b01-5039ms f09-h26-b02-5039ms f09-h26-b04-5039ms f09-h29-b06-5039ms f09-h29-b07-5039ms ]
> OFFLINE: [ f09-h17-b06-5039ms f09-h20-b02-5039ms f09-h20-b03-5039ms f09-h20-b04-5039ms f09-h20-b05-5039ms f09-h20-b06-5039ms f09-h23-b01-5039ms ]
> 
> Active resources:
> 
>  fence-f09-h29-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b02-5039ms
>  fence-f09-h29-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b04-5039ms
>  fence-f09-h29-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b05-5039ms
>  fence-f09-h29-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b08-5039ms
>  fence-f09-h29-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b03-5039ms
>  fence-f09-h17-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b07-5039ms
>  fence-f09-h17-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b08-5039ms
>  fence-f09-h17-b04-5039ms	(stonith:fence_ipmilan):	Started[ f09-h23-b03-5039ms f09-h20-b01-5039ms ]
>  fence-f09-h17-b05-5039ms	(stonith:fence_ipmilan):	Started[ f09-h23-b04-5039ms f09-h20-b07-5039ms ]
>  fence-f09-h17-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b06-5039ms
>  fence-f09-h17-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b08-5039ms
>  fence-f09-h17-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b02-5039ms
>  fence-f09-h20-b01-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b04-5039ms
>  fence-f09-h20-b02-5039ms	(stonith:fence_ipmilan):	Started[ f09-h23-b05-5039ms f09-h29-b05-5039ms ]
>  fence-f09-h20-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b06-5039ms
>  fence-f09-h20-b04-5039ms	(stonith:fence_ipmilan):	Started[ f09-h26-b01-5039ms f09-h29-b04-5039ms ]
>  fence-f09-h20-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h29-b07-5039ms
>  fence-f09-h20-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b03-5039ms
>  fence-f09-h20-b07-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b02-5039ms
>  fence-f09-h20-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b07-5039ms
>  fence-f09-h23-b01-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b01-5039ms
>  fence-f09-h23-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b04-5039ms
>  fence-f09-h23-b03-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b05-5039ms
>  fence-f09-h23-b04-5039ms	(stonith:fence_ipmilan):	Started f09-h17-b08-5039ms
>  fence-f09-h23-b05-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b07-5039ms
>  fence-f09-h23-b06-5039ms	(stonith:fence_ipmilan):	Started f09-h20-b08-5039ms
>  fence-f09-h23-b07-5039ms	(stonith:fence_ipmilan):	Started[ f09-h23-b02-5039ms f09-h23-b05-5039ms ]
>  fence-f09-h23-b08-5039ms	(stonith:fence_ipmilan):	Started f09-h23-b06-5039ms
>  fence-f09-h26-b01-5039ms	(stonith:fence_ipmilan):	Started[ f09-h23-b08-5039ms f09-h23-b07-5039ms ]
>  fence-f09-h26-b02-5039ms	(stonith:fence_ipmilan):	Started f09-h26-b01-5039ms
>  fence-f09-h26-b03-5039ms	(stonith:fence_ipmilan):	Started[ f09-h26-b02-5039ms f09-h26-b03-5039ms ]
>  fence-f09-h26-b04-5039ms	(stonith:fence_ipmilan):	Started[ f09-h29-b08-5039ms f09-h26-b04-5039ms ]
> 
> Fencing History:
> * reboot of f09-h23-b02-5039ms pending: client=crmd.16632, origin=f09-h17-b04-5039ms
> * reboot of f09-h23-b01-5039ms successful: delegate=f09-h20-b01-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:59 2020'
> * reboot of f09-h20-b06-5039ms successful: delegate=f09-h17-b03-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:44 2020'
> * reboot of f09-h20-b05-5039ms successful: delegate=f09-h29-b07-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:29 2020'
> * reboot of f09-h20-b04-5039ms successful: delegate=f09-h26-b01-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:13 2020'
> * reboot of f09-h20-b03-5039ms successful: delegate=f09-h29-b06-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:53:58 2020'
> * reboot of f09-h20-b02-5039ms successful: delegate=f09-h23-b05-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:53:43 2020'
> * reboot of f09-h17-b06-5039ms successful: delegate=f09-h23-b06-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:53:27 2020'

result: reverted to old behavior -- fencing is serialized (~15s/node)
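
To compare the two runs, the completion times from both fencing histories can be checked with a short script. This is only a sketch: the timestamps are copied from the crm_mon output above, while the helper name span_and_gaps is made up for illustration.

```python
from datetime import datetime

# Timestamp format used by the last-successful fields in crm_mon output.
FMT = "%a %b %d %H:%M:%S %Y"

# Completion times from the run with concurrent-fencing=true (15 nodes).
CONCURRENT = (
    ["Fri Jan 31 19:35:29 2020"] * 5
    + ["Fri Jan 31 19:35:28 2020"] * 8
    + ["Fri Jan 31 19:35:27 2020"] * 2
)

# Completion times from the run with concurrent-fencing=false
# (the 7 fencing operations that had finished when crm_mon was run).
SERIALIZED = [
    "Fri Jan 31 19:54:59 2020",
    "Fri Jan 31 19:54:44 2020",
    "Fri Jan 31 19:54:29 2020",
    "Fri Jan 31 19:54:13 2020",
    "Fri Jan 31 19:53:58 2020",
    "Fri Jan 31 19:53:43 2020",
    "Fri Jan 31 19:53:27 2020",
]


def span_and_gaps(stamps):
    """Return (seconds from first to last completion, gaps between completions)."""
    times = sorted(datetime.strptime(stamp, FMT) for stamp in stamps)
    gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
    span = int((times[-1] - times[0]).total_seconds())
    return span, gaps


print(span_and_gaps(CONCURRENT)[0])
print(span_and_gaps(SERIALIZED))
```

For the data above, this gives a 2-second spread between the first and last completion in the concurrent run, and gaps of 15 to 16 seconds between consecutive completions in the serialized run.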

marking verified in 1.1.21-4.el7

Comment 9 errata-xmlrpc 2020-03-31 19:41:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1032

