A workaround for this issue is to set the concurrent-fencing cluster property to true (a pcs example is shown after this comment).

Note that the particular problem in the customer's environment that led to this BZ is possible only when successful fencing depends on some cluster resource being active. That is an inherently risky configuration, though it should be acceptable if (as in OpenStack's case) the fence device is restricted to targets that are not allowed to run the depended-on resource.

I see three potential approaches for fixing this on the pacemaker side:

* Change concurrent-fencing to default to true. Upon investigation, it appears that serialized fencing was necessary only for pacemaker's original fence daemon, which was replaced long ago.

* Order remote node fencing (i.e. compute nodes) after cluster node fencing. This is a purely heuristic approach, based on the assumption that remote nodes are more likely to require such a dependent fence configuration. However, it would not cover all possible cases and might be undesirable if the DC is the cluster node that needs fencing.

* When concurrent-fencing is false, order retries of failed fencing attempts after the first fencing attempts for other targets. This is definitely a good idea, but the implementation would be more difficult because the scheduler does not currently have information about fencing failures. Adding that information would impose a small storage increase on all configurations for the benefit of (rare) situations similar to this one, and if concurrent-fencing defaults to true, such situations become even rarer.

At this point, I plan to go with only the first option to address this BZ, with the last option kept on the radar as a possible future enhancement.
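For reference, a minimal sketch of applying and checking the workaround, assuming a cluster managed with pcs (the same commands appear in the verification output later in this bug):

    # On a release where concurrent-fencing still defaults to false,
    # allow independent fence targets to be fenced in parallel:
    pcs property set concurrent-fencing=true

    # Confirm the effective value:
    pcs property --all | grep concurrent-fencing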
QA: It is not necessary to reproduce the customer's issue; just verify that fencing is now parallelized by default. Previously, if more than one node needed to be fenced at the same time (e.g. after killing power to 2 nodes in a 5-node cluster), pacemaker by default waited until one fencing operation completed before executing the next. Now, as long as the fence targets have different fence devices (e.g. individual IPMI), pacemaker executes the fencing operations in parallel. This behavior is controlled by the concurrent-fencing cluster property, which previously defaulted to false and now defaults to true.
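In condensed form, a sketch of the verification flow used below (node names and counts are illustrative; any hard kill that takes several nodes down simultaneously will do):

    # 1. Confirm the new default:
    pcs property --all | grep concurrent-fencing

    # 2. Hard-kill several nodes at roughly the same time (run on each victim node):
    halt -f

    # 3. From a surviving node, check the fencing history; with concurrent-fencing=true
    #    the last-successful timestamps should cluster within a few seconds rather than
    #    being spaced roughly one fence-device timeout apart:
    crm_mon -m -1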
Added title and made slight edit to release note description.
> [root@f09-h29-b04-5039ms ~]# rpm -q pacemaker
> pacemaker-1.1.21-4.el7.x86_64

concurrent-fencing now defaults to true:

> [root@f09-h29-b04-5039ms ~]# pcs property --all | grep concurrent-fencing
> concurrent-fencing: true

aftermath of killing 15 nodes in a 32-node cluster at once (using `halt -f`):

> [root@f09-h29-b04-5039ms ~]# crm_mon -m -1
> Stack: corosync
> Current DC: f09-h20-b03-5039ms (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Fri Jan 31 19:40:11 2020
> Last change: Fri Jan 31 19:12:58 2020 by root via cibadmin on f09-h29-b04-5039ms
>
> 32 nodes configured
> 32 resources configured
>
> Node f09-h17-b03-5039ms: pending
> Node f09-h17-b06-5039ms: pending
> Node f09-h17-b07-5039ms: pending
> Node f09-h20-b01-5039ms: pending
> Node f09-h20-b02-5039ms: pending
> Node f09-h20-b05-5039ms: pending
> Node f09-h20-b06-5039ms: pending
> Node f09-h20-b07-5039ms: pending
> Node f09-h23-b01-5039ms: pending
> Node f09-h23-b02-5039ms: pending
> Node f09-h23-b05-5039ms: pending
> Node f09-h23-b07-5039ms: pending
> Node f09-h26-b01-5039ms: pending
> Node f09-h26-b03-5039ms: pending
> Node f09-h29-b08-5039ms: pending
> Online: [ f09-h17-b02-5039ms f09-h17-b04-5039ms f09-h17-b05-5039ms f09-h17-b08-5039ms f09-h20-b03-5039ms f09-h20-b04-5039ms f09-h20-b08-5039ms f09-h23-b03-5039ms f09-h23-b04-5039ms f09-h23-b06-5039ms f09-h23-b08-5039ms f09-h26-b02-5039ms f09-h26-b04-5039ms f09-h29-b04-5039ms f09-h29-b05-5039ms f09-h29-b06-5039ms f09-h29-b07-5039ms ]
>
> Active resources:
>
> fence-f09-h29-b04-5039ms (stonith:fence_ipmilan): Started f09-h17-b02-5039ms
> fence-f09-h29-b05-5039ms (stonith:fence_ipmilan): Started f09-h17-b04-5039ms
> fence-f09-h29-b06-5039ms (stonith:fence_ipmilan): Started f09-h17-b05-5039ms
> fence-f09-h29-b07-5039ms (stonith:fence_ipmilan): Started f09-h17-b08-5039ms
> fence-f09-h29-b08-5039ms (stonith:fence_ipmilan): Started f09-h20-b03-5039ms
> fence-f09-h17-b02-5039ms (stonith:fence_ipmilan): Started f09-h20-b04-5039ms
> fence-f09-h17-b03-5039ms (stonith:fence_ipmilan): Started f09-h20-b08-5039ms
> fence-f09-h17-b04-5039ms (stonith:fence_ipmilan): Started f09-h23-b03-5039ms
> fence-f09-h17-b05-5039ms (stonith:fence_ipmilan): Started f09-h23-b04-5039ms
> fence-f09-h17-b06-5039ms (stonith:fence_ipmilan): Started f09-h23-b06-5039ms
> fence-f09-h17-b07-5039ms (stonith:fence_ipmilan): Started f09-h23-b08-5039ms
> fence-f09-h17-b08-5039ms (stonith:fence_ipmilan): Started f09-h26-b02-5039ms
> fence-f09-h20-b01-5039ms (stonith:fence_ipmilan): Started f09-h26-b04-5039ms
> fence-f09-h20-b02-5039ms (stonith:fence_ipmilan): Started f09-h29-b05-5039ms
> fence-f09-h20-b03-5039ms (stonith:fence_ipmilan): Started f09-h29-b06-5039ms
> fence-f09-h20-b04-5039ms (stonith:fence_ipmilan): Started f09-h29-b04-5039ms
> fence-f09-h20-b05-5039ms (stonith:fence_ipmilan): Started f09-h29-b07-5039ms
> fence-f09-h20-b06-5039ms (stonith:fence_ipmilan): Started f09-h17-b02-5039ms
> fence-f09-h20-b07-5039ms (stonith:fence_ipmilan): Started f09-h17-b05-5039ms
> fence-f09-h20-b08-5039ms (stonith:fence_ipmilan): Started f09-h17-b04-5039ms
> fence-f09-h23-b01-5039ms (stonith:fence_ipmilan): Started f09-h17-b08-5039ms
> fence-f09-h23-b02-5039ms (stonith:fence_ipmilan): Started f09-h20-b03-5039ms
> fence-f09-h23-b03-5039ms (stonith:fence_ipmilan): Started f09-h20-b04-5039ms
> fence-f09-h23-b04-5039ms (stonith:fence_ipmilan): Started f09-h20-b08-5039ms
> fence-f09-h23-b05-5039ms (stonith:fence_ipmilan): Started f09-h23-b03-5039ms
> fence-f09-h23-b06-5039ms (stonith:fence_ipmilan): Started f09-h23-b04-5039ms
> fence-f09-h23-b07-5039ms (stonith:fence_ipmilan): Started f09-h23-b06-5039ms
> fence-f09-h23-b08-5039ms (stonith:fence_ipmilan): Started f09-h23-b08-5039ms
> fence-f09-h26-b01-5039ms (stonith:fence_ipmilan): Started f09-h26-b02-5039ms
> fence-f09-h26-b02-5039ms (stonith:fence_ipmilan): Started f09-h26-b04-5039ms
> fence-f09-h26-b03-5039ms (stonith:fence_ipmilan): Started f09-h29-b04-5039ms
> fence-f09-h26-b04-5039ms (stonith:fence_ipmilan): Started f09-h29-b07-5039ms
>
> Fencing History:
> * reboot of f09-h17-b07-5039ms successful: delegate=f09-h23-b08-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h23-b02-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h23-b01-5039ms successful: delegate=f09-h23-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h29-b08-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h20-b01-5039ms successful: delegate=f09-h26-b02-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:29 2020'
> * reboot of f09-h17-b06-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h26-b01-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h23-b07-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b06-5039ms successful: delegate=f09-h17-b02-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h26-b03-5039ms successful: delegate=f09-h23-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h17-b03-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b07-5039ms successful: delegate=f09-h17-b05-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b05-5039ms successful: delegate=f09-h20-b03-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:28 2020'
> * reboot of f09-h20-b02-5039ms successful: delegate=f09-h29-b05-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:27 2020'
> * reboot of f09-h23-b05-5039ms successful: delegate=f09-h26-b04-5039ms, client=crmd.16434, origin=f09-h20-b03-5039ms,
>     last-successful='Fri Jan 31 19:35:27 2020'

result: all nodes fenced within a 3-second interval

disabling concurrent fencing:

> [root@f09-h29-b04-5039ms ~]# pcs property set concurrent-fencing=false
> [root@f09-h29-b04-5039ms ~]# pcs property --all | grep concurrent-fencing
> concurrent-fencing: false

and again killing 15 out of 32 nodes at once:

> [root@f09-h17-b04-5039ms ~]# crm_mon -m -1
> Stack: corosync
> Current DC: f09-h17-b04-5039ms (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
> Last updated: Fri Jan 31 19:55:13 2020
> Last change: Fri Jan 31 19:51:21 2020 by root via cibadmin on f09-h29-b04-5039ms
>
> 32 nodes configured
> 32 resources configured
>
> Node f09-h23-b02-5039ms: UNCLEAN (offline)
> Node f09-h23-b03-5039ms: UNCLEAN (offline)
> Node f09-h23-b04-5039ms: UNCLEAN (offline)
> Node f09-h23-b07-5039ms: UNCLEAN (offline)
> Node f09-h26-b03-5039ms: UNCLEAN (offline)
> Node f09-h29-b04-5039ms: UNCLEAN (offline)
> Node f09-h29-b05-5039ms: UNCLEAN (offline)
> Node f09-h29-b08-5039ms: UNCLEAN (offline)
> Online: [ f09-h17-b02-5039ms f09-h17-b03-5039ms f09-h17-b04-5039ms f09-h17-b05-5039ms f09-h17-b07-5039ms f09-h17-b08-5039ms f09-h20-b01-5039ms f09-h20-b07-5039ms f09-h20-b08-5039ms f09-h23-b05-5039ms f09-h23-b06-5039ms f09-h23-b08-5039ms f09-h26-b01-5039ms f09-h26-b02-5039ms f09-h26-b04-5039ms f09-h29-b06-5039ms f09-h29-b07-5039ms ]
> OFFLINE: [ f09-h17-b06-5039ms f09-h20-b02-5039ms f09-h20-b03-5039ms f09-h20-b04-5039ms f09-h20-b05-5039ms f09-h20-b06-5039ms f09-h23-b01-5039ms ]
>
> Active resources:
>
> fence-f09-h29-b04-5039ms (stonith:fence_ipmilan): Started f09-h17-b02-5039ms
> fence-f09-h29-b05-5039ms (stonith:fence_ipmilan): Started f09-h17-b04-5039ms
> fence-f09-h29-b06-5039ms (stonith:fence_ipmilan): Started f09-h17-b05-5039ms
> fence-f09-h29-b07-5039ms (stonith:fence_ipmilan): Started f09-h17-b08-5039ms
> fence-f09-h29-b08-5039ms (stonith:fence_ipmilan): Started f09-h17-b03-5039ms
> fence-f09-h17-b02-5039ms (stonith:fence_ipmilan): Started f09-h17-b07-5039ms
> fence-f09-h17-b03-5039ms (stonith:fence_ipmilan): Started f09-h20-b08-5039ms
> fence-f09-h17-b04-5039ms (stonith:fence_ipmilan): Started[ f09-h23-b03-5039ms f09-h20-b01-5039ms ]
> fence-f09-h17-b05-5039ms (stonith:fence_ipmilan): Started[ f09-h23-b04-5039ms f09-h20-b07-5039ms ]
> fence-f09-h17-b06-5039ms (stonith:fence_ipmilan): Started f09-h23-b06-5039ms
> fence-f09-h17-b07-5039ms (stonith:fence_ipmilan): Started f09-h23-b08-5039ms
> fence-f09-h17-b08-5039ms (stonith:fence_ipmilan): Started f09-h26-b02-5039ms
> fence-f09-h20-b01-5039ms (stonith:fence_ipmilan): Started f09-h26-b04-5039ms
> fence-f09-h20-b02-5039ms (stonith:fence_ipmilan): Started[ f09-h23-b05-5039ms f09-h29-b05-5039ms ]
> fence-f09-h20-b03-5039ms (stonith:fence_ipmilan): Started f09-h29-b06-5039ms
> fence-f09-h20-b04-5039ms (stonith:fence_ipmilan): Started[ f09-h26-b01-5039ms f09-h29-b04-5039ms ]
> fence-f09-h20-b05-5039ms (stonith:fence_ipmilan): Started f09-h29-b07-5039ms
> fence-f09-h20-b06-5039ms (stonith:fence_ipmilan): Started f09-h17-b03-5039ms
> fence-f09-h20-b07-5039ms (stonith:fence_ipmilan): Started f09-h17-b02-5039ms
> fence-f09-h20-b08-5039ms (stonith:fence_ipmilan): Started f09-h17-b07-5039ms
> fence-f09-h23-b01-5039ms (stonith:fence_ipmilan): Started f09-h20-b01-5039ms
> fence-f09-h23-b02-5039ms (stonith:fence_ipmilan): Started f09-h17-b04-5039ms
> fence-f09-h23-b03-5039ms (stonith:fence_ipmilan): Started f09-h17-b05-5039ms
> fence-f09-h23-b04-5039ms (stonith:fence_ipmilan): Started f09-h17-b08-5039ms
> fence-f09-h23-b05-5039ms (stonith:fence_ipmilan): Started f09-h20-b07-5039ms
> fence-f09-h23-b06-5039ms (stonith:fence_ipmilan): Started f09-h20-b08-5039ms
> fence-f09-h23-b07-5039ms (stonith:fence_ipmilan): Started[ f09-h23-b02-5039ms f09-h23-b05-5039ms ]
> fence-f09-h23-b08-5039ms (stonith:fence_ipmilan): Started f09-h23-b06-5039ms
> fence-f09-h26-b01-5039ms (stonith:fence_ipmilan): Started[ f09-h23-b08-5039ms f09-h23-b07-5039ms ]
> fence-f09-h26-b02-5039ms (stonith:fence_ipmilan): Started f09-h26-b01-5039ms
> fence-f09-h26-b03-5039ms (stonith:fence_ipmilan): Started[ f09-h26-b02-5039ms f09-h26-b03-5039ms ]
> fence-f09-h26-b04-5039ms (stonith:fence_ipmilan): Started[ f09-h29-b08-5039ms f09-h26-b04-5039ms ]
>
> Fencing History:
> * reboot of f09-h23-b02-5039ms pending: client=crmd.16632, origin=f09-h17-b04-5039ms
> * reboot of f09-h23-b01-5039ms successful: delegate=f09-h20-b01-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:59 2020'
> * reboot of f09-h20-b06-5039ms successful: delegate=f09-h17-b03-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:44 2020'
> * reboot of f09-h20-b05-5039ms successful: delegate=f09-h29-b07-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:29 2020'
> * reboot of f09-h20-b04-5039ms successful: delegate=f09-h26-b01-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:54:13 2020'
> * reboot of f09-h20-b03-5039ms successful: delegate=f09-h29-b06-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:53:58 2020'
> * reboot of f09-h20-b02-5039ms successful: delegate=f09-h23-b05-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:53:43 2020'
> * reboot of f09-h17-b06-5039ms successful: delegate=f09-h23-b06-5039ms, client=crmd.16632, origin=f09-h17-b04-5039ms,
>     last-successful='Fri Jan 31 19:53:27 2020'

result: reverted to old behavior -- fencing is serialized (~15s/node)

marking verified in 1.1.21-4.el7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1032