Bug 1781820 - Option to skip resource recovery after clean node shutdown (Zero Touch)
Summary: Option to skip resource recovery after clean node shutdown (Zero Touch)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pacemaker
Version: 7.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: rc
Target Release: 7.8
Assignee: Ken Gaillot
QA Contact: Markéta Smazová
Docs Contact: Steven J. Levine
URL:
Whiteboard:
Depends On: 1712584
Blocks:
 
Reported: 2019-12-10 16:20 UTC by Chris Feist
Modified: 2021-06-08 15:21 UTC
CC List: 9 users

Fixed In Version: pacemaker-1.1.21-4.el7
Doc Type: Enhancement
Doc Text:
.Pacemaker support for configuring resources to remain stopped on clean node shutdown
When a cluster node shuts down, Pacemaker’s default response is to stop all resources running on that node and recover them elsewhere. Some users prefer to have high availability only for failures, and to treat clean shutdowns as scheduled outages. To address this, Pacemaker now supports the `shutdown-lock` and `shutdown-lock-limit` cluster properties to specify that resources active on a node when it shuts down should remain stopped until the node next rejoins. Users can now use clean shutdowns as scheduled outages without any manual intervention. For information on configuring resources to remain stopped on a clean node shutdown, see link:https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html-single/high_availability_add-on_reference/index?lb_target=production#s1-shutdown-lock-HAAR[Configuring Resources to Remain Stopped on Clean Node Shutdown].
Clone Of: 1712584
Environment:
Last Closed: 2020-03-31 19:41:51 UTC
Target Upstream Version:
Embargoed:




Links
  Red Hat Knowledge Base (Solution) 5225251    Last Updated: 2020-09-21 21:17:53 UTC
  Red Hat Knowledge Base (Solution) 5227211    Last Updated: 2020-09-21 21:18:31 UTC
  Red Hat Product Errata RHBA-2020:1032        Last Updated: 2020-03-31 19:42:35 UTC

Comment 4 Ken Gaillot 2020-01-16 20:35:53 UTC
This has been implemented as of https://github.com/ClusterLabs/pacemaker/pull/1974 for the upstream 2.0 series used in RHEL 8. The feature will not be backported to the upstream 1.1 line due to mixed-version cluster issues; instead, it will be backported directly to the source used in RHEL 7.8.

The interface is via two new cluster properties:

shutdown-lock: The default of false is the current behavior (active resources can be recovered elsewhere when their node is cleanly shut down). If this option is true, resources active on a node when it is cleanly shut down are kept "locked" to that node (not allowed to run elsewhere) until they start again on that node after it rejoins (or for at most shutdown-lock-limit, if set). Stonith resources and Pacemaker Remote connections are never locked. Clone and bundle instances and the master role of promotable clones are currently never locked, though support could be added in a future release.

shutdown-lock-limit: If shutdown-lock is true, and this is set to a nonzero time duration, locked resources will be allowed to start after this much time has passed since the node shutdown was initiated, even if the node has not rejoined.
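
For example, the properties can be set and checked with pcs (the same commands used in the verification below; the 5-minute limit is only an illustrative value):

    pcs property set shutdown-lock=true
    pcs property set shutdown-lock-limit=5min
    pcs property list --all | grep shutdown-lock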

Locks can be manually cleared using "pcs resource refresh <resource> --node <node>". Both resource and node must be specified.

Shutdown locks work with remote nodes as well as cluster nodes, but lock expiration and manual clearing work only if the remote node's connection resource was disabled. If a remote node is shut down without disabling the resource, the lock will remain in effect until the remote node is brought back up regardless of any shutdown-lock-limit or manual refresh.

Because there is no way to prevent mixed-version cluster issues, this feature should not be used (and is not supported) until all nodes in a cluster have been upgraded to a version that supports it. Otherwise the feature may arbitrarily be effective or not at any given time.

Comment 6 Ken Gaillot 2020-01-17 02:37:00 UTC
QE: To test:

1. Configure a cluster of at least two nodes (so you can shut down one and retain quorum) with at least one resource.
2. If shutdown-lock is not specified or specified as false, shutting down a node running a resource will result in the resource being recovered on another node.
3. If shutdown-lock is specified as true, shutting down a node running a resource will result in the resource being stopped.
- "pcs status" will show the resource as "LOCKED" while the node is down.
- Starting the node again will result in the resource starting there again. (If the cluster has multiple resources, load balancing could result in it moving afterward, unless stickiness or a location preference is used.)
- Running "pcs resource refresh <resource> --node <node>" while the node is down and the resource is locked will result in the lock being removed, and the resource starting elsewhere.
- If shutdown-lock-limit is specified as an amount of time (e.g. 5min), then after that much time has passed since the shutdown was initiated (within the granularity of the cluster-recheck-interval), the resource will start elsewhere even if the node remains down.

shutdown-lock works for both cluster nodes and remote nodes, but not guest nodes. shutdown-lock-limit and manual refresh will work with remote nodes only if the remote node is stopped via "systemctl stop pacemaker_remote" on the remote node, followed by "pcs resource disable <remote-connection-resource>", followed by the lock expiration or refresh.
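
For illustration, that remote-node sequence would look like the following (a sketch only; <remote-connection-resource>, <resource>, and <remote-node> are placeholders for the actual names in the cluster):

    systemctl stop pacemaker_remote                        # run on the remote node itself
    pcs resource disable <remote-connection-resource>      # run from a cluster node
    pcs resource refresh <resource> --node <remote-node>   # manual clear, or instead wait for shutdown-lock-limit to expire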

Comment 8 Markéta Smazová 2020-02-07 13:55:27 UTC
CASE 1
-------
Property "shutdown-lock" is not specified or is specified as false. Shutting down a node running a resource will result in the resource being recovered on another node (current behavior).

>   [root@virt-003 ~]# rpm -q pacemaker
>   pacemaker-1.1.21-4.el7.x86_64

Verify that the "shutdown-lock" property is false.

>   [root@virt-003 ~]# pcs property list --all | grep shutdown-lock
>    shutdown-lock: false
>    shutdown-lock-limit: 0

There are two resources "second" and "fifth" on virt-004.

>   [root@virt-003 ~]# pcs status
>   Cluster name: STSRHTS2974
>   Stack: corosync
>   Current DC: virt-003 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Fri Jan 31 11:02:10 2020
>   Last change: Fri Jan 31 10:56:40 2020 by root via cibadmin on virt-003

>   3 nodes configured
>   15 resources configured

>   Online: [ virt-003 virt-004 virt-005 ]

>   Full list of resources:
>
>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-004
>    fence-virt-005	(stonith:fence_xvm):	Started virt-005
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-004 virt-005 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-004 virt-005 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-003
>    second	(ocf::pacemaker:Dummy):	Started virt-004
>    third	(ocf::pacemaker:Dummy):	Started virt-005
>    fourth	(ocf::pacemaker:Dummy):	Started virt-003
>    fifth	(ocf::pacemaker:Dummy):	Started virt-004
>    sixth	(ocf::pacemaker:Dummy):	Started virt-005

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Shutting down virt-004.

>   [root@virt-003 ~]# qarsh virt-004 pcs cluster stop
>   Stopping Cluster (pacemaker)...
>   Stopping Cluster (corosync)...

Node virt-004 is offline, resources "second" and "fifth" are recovered on other nodes.

>   [root@virt-003 ~]# pcs status
>   Cluster name: STSRHTS2974
>   Stack: corosync
>   Current DC: virt-003 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Fri Jan 31 11:03:35 2020
>   Last change: Fri Jan 31 10:56:40 2020 by root via cibadmin on virt-003

>   3 nodes configured
>   15 resources configured

>   Online: [ virt-003 virt-005 ]
>   OFFLINE: [ virt-004 ]

>   Full list of resources:
>
>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-005
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-003
>    second	(ocf::pacemaker:Dummy):	Started virt-003
>    third	(ocf::pacemaker:Dummy):	Started virt-005
>    fourth	(ocf::pacemaker:Dummy):	Started virt-003
>    fifth	(ocf::pacemaker:Dummy):	Started virt-005
>    sixth	(ocf::pacemaker:Dummy):	Started virt-005

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Start cluster on the node again so that it rejoins the cluster.

>   [root@virt-003 ~]# qarsh virt-004 pcs cluster start
>   Starting Cluster (corosync)...
>   Starting Cluster (pacemaker)...

>   [root@virt-003 ~]# pcs status
>   Cluster name: STSRHTS2974
>   Stack: corosync
>   Current DC: virt-003 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Fri Jan 31 11:04:00 2020
>   Last change: Fri Jan 31 10:56:40 2020 by root via cibadmin on virt-003

>   3 nodes configured
>   15 resources configured

>   Online: [ virt-003 virt-004 virt-005 ]

>   Full list of resources:
>
>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-004
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-004 virt-005 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-004 virt-005 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-003
>    second	(ocf::pacemaker:Dummy):	Started virt-004
>    third	(ocf::pacemaker:Dummy):	Started virt-005
>    fourth	(ocf::pacemaker:Dummy):	Started virt-003
>    fifth	(ocf::pacemaker:Dummy):	Started virt-005
>    sixth	(ocf::pacemaker:Dummy):	Started virt-004

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


CASE 2
-------
Property "shutdown-lock" is specified as true, shutting down a node running a resource will result in the resource being stopped. "pcs status" will show the resource as "LOCKED" while the node is down. Starting the node again will result in the resource starting there again. 

Set "shutdown-lock" to true and verify.

>   [root@virt-003 ~]# pcs property set shutdown-lock=true
>   [root@virt-003 ~]# pcs property list --all | grep shutdown-lock
>    shutdown-lock: true
>    shutdown-lock-limit: 0

There are resources "second" and "sixth" on virt-004.

>   [root@virt-003 ~]# pcs status
>   [...output omitted...]

>   Full list of resources:
>
>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-004
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-004 virt-005 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-004 virt-005 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-003
>    second	(ocf::pacemaker:Dummy):	Started virt-004
>    third	(ocf::pacemaker:Dummy):	Started virt-005
>    fourth	(ocf::pacemaker:Dummy):	Started virt-003
>    fifth	(ocf::pacemaker:Dummy):	Started virt-005
>    sixth	(ocf::pacemaker:Dummy):	Started virt-004

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Shutting down virt-004.

>   [root@virt-003 ~]# qarsh virt-004 pcs cluster stop
>   Stopping Cluster (pacemaker)...
>   Stopping Cluster (corosync)...

Node virt-004 is offline, resources "second" and "sixth" are marked as (LOCKED).

>   [root@virt-003 ~]# pcs status
>   [...output omitted...]

>   Online: [ virt-003 virt-005 ]
>   OFFLINE: [ virt-004 ]

>   Full list of resources:
>
>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-003
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-005
>    second	(ocf::pacemaker:Dummy):	Stopped virt-004 (LOCKED)
>    third	(ocf::pacemaker:Dummy):	Started virt-005
>    fourth	(ocf::pacemaker:Dummy):	Started virt-003
>    fifth	(ocf::pacemaker:Dummy):	Started virt-005
>    sixth	(ocf::pacemaker:Dummy):	Stopped virt-004 (LOCKED)

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Start the cluster on the node again so that it rejoins the cluster. Locked resources should start again, not necessarily on the same node (due to, e.g., load balancing).

>   [root@virt-003 ~]# qarsh virt-004 pcs cluster start
>   Starting Cluster (corosync)...
>   Starting Cluster (pacemaker)...

Resources "second" and "sixth" are recovered on virt-004.

>   [root@virt-003 ~]# pcs status
>   [...output omitted...]

>   Online: [ virt-003 virt-004 virt-005 ]

>   Full list of resources:
>
>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-004
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-004 virt-005 ]
>    Clone Set: clvmd-clone [clvmd]
>        clvmd	(ocf::heartbeat:clvm):	Starting virt-004
>        Started: [ virt-003 virt-005 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-005
>    second	(ocf::pacemaker:Dummy):	Started virt-004
>    third	(ocf::pacemaker:Dummy):	Started virt-003
>    fourth	(ocf::pacemaker:Dummy):	Started virt-003
>    fifth	(ocf::pacemaker:Dummy):	Started virt-005
>    sixth	(ocf::pacemaker:Dummy):	Started virt-004

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


CASE 3
-------
Property "shutdown-lock" is specified as true. Shutting down a node and running "pcs resource refresh <resource> --node <node>" (both resource and node must be specified) while the node is down and the resource is locked, will result in the lock being removed, and the resource starting elsewhere.

Verify that the "shutdown-lock" property is true.

>   [root@virt-003 ~]# pcs property list --all | grep shutdown-lock
>    shutdown-lock: true
>    shutdown-lock-limit: 0
>
>   [root@virt-003 ~]# pcs status
>   [...output omitted...]

>   Online: [ virt-003 virt-004 virt-005 ]

>   Full list of resources:

>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-004
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-004 virt-005 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-004 virt-005 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-005
>    second	(ocf::pacemaker:Dummy):	Started virt-004
>    third	(ocf::pacemaker:Dummy):	Started virt-003
>    fourth	(ocf::pacemaker:Dummy):	Started virt-003
>    fifth	(ocf::pacemaker:Dummy):	Started virt-005
>    sixth	(ocf::pacemaker:Dummy):	Started virt-004

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Shutting down virt-004.

>   [root@virt-003 ~]# qarsh virt-004 pcs cluster stop
>   Stopping Cluster (pacemaker)...
>   Stopping Cluster (corosync)...

Node virt-004 is offline, resources "second" and "sixth" are marked as (LOCKED).

>   [root@virt-003 ~]# pcs status
>   [...output omitted...]

>   Online: [ virt-003 virt-005 ]
>   OFFLINE: [ virt-004 ]

>   Full list of resources:

>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-003
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-005
>    second	(ocf::pacemaker:Dummy):	Stopped virt-004 (LOCKED)
>    third	(ocf::pacemaker:Dummy):	Started virt-003
>    fourth	(ocf::pacemaker:Dummy):	Started virt-005
>    fifth	(ocf::pacemaker:Dummy):	Started virt-005
>    sixth	(ocf::pacemaker:Dummy):	Stopped virt-004 (LOCKED)

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Refreshing resource "second", and checking the status. Node virt-004 is still offline, resource "second" restarted on virt-003.

>   [root@virt-003 ~]# pcs resource refresh second --node virt-004
>   Cleaned up second on virt-004

>     * 'second' is locked to node virt-004 due to shutdown
>   Waiting for 1 reply from the CRMd. OK

>   [root@virt-003 ~]# pcs status
>   [...output omitted...]

>   Online: [ virt-003 virt-005 ]
>   OFFLINE: [ virt-004 ]

>   Full list of resources:

>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-003
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-005
>    second	(ocf::pacemaker:Dummy):	Started virt-003
>    third	(ocf::pacemaker:Dummy):	Started virt-005
>    fourth	(ocf::pacemaker:Dummy):	Started virt-005
>    fifth	(ocf::pacemaker:Dummy):	Started virt-003
>    sixth	(ocf::pacemaker:Dummy):	Stopped virt-004 (LOCKED)

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Refreshing resource "sixth" and checking the status. Node virt-004 is still offline, resource "sixth" restarted on virt-003.

>   [root@virt-003 ~]# pcs resource refresh sixth --node virt-004
>   Cleaned up sixth on virt-004

>     * 'sixth' is locked to node virt-004 due to shutdown
>   Waiting for 1 reply from the CRMd. OK

>   [root@virt-003 ~]# pcs status
>   [...output omitted...]

>   Online: [ virt-003 virt-005 ]
>   OFFLINE: [ virt-004 ]

>   Full list of resources:

>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-003
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-005
>    second	(ocf::pacemaker:Dummy):	Started virt-003
>    third	(ocf::pacemaker:Dummy):	Started virt-005
>    fourth	(ocf::pacemaker:Dummy):	Started virt-005
>    fifth	(ocf::pacemaker:Dummy):	Started virt-003
>    sixth	(ocf::pacemaker:Dummy):	Started virt-003

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Start cluster on the node again so that it rejoins the cluster.

>   [root@virt-003 ~]# qarsh virt-004 pcs cluster start
>   Starting Cluster (corosync)...
>   Starting Cluster (pacemaker)...

>   [root@virt-003 ~]# pcs status
>   [...output omitted...]

>   Online: [ virt-003 virt-004 virt-005 ]

>   Full list of resources:

>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-004
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-004 virt-005 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-004 virt-005 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-005
>    second	(ocf::pacemaker:Dummy):	Started virt-003
>    third	(ocf::pacemaker:Dummy):	Started virt-004
>    fourth	(ocf::pacemaker:Dummy):	Started virt-005
>    fifth	(ocf::pacemaker:Dummy):	Started virt-003
>    sixth	(ocf::pacemaker:Dummy):	Started virt-004

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


CASE 4
-------
Property "shutdown-lock" is specified as true, "shutdown-lock-limit" is specified as an amount of time and the node shutdown is initiated. After the "shutdown-lock-limit" expires, locked resources will get unlocked and started elsewhere, while the node remains down. It can take up to one "cluster-recheck-interval" longer than the configured "shutdown-lock-limit" to start the resource recovery (worst-case scenario, RHEL-8.2's "dynamic cluster recheck interval" feature removes this limitation).

Verify that the "shutdown-lock" property is true and check the setting of the "shutdown-lock-limit" property.

>   [root@virt-003 ~]# pcs property list --all | grep shutdown-lock
>    shutdown-lock: true
>    shutdown-lock-limit: 0

Set "shutdown-lock-limit" to 5 minutes and verify.

>   [root@virt-003 ~]# pcs property set shutdown-lock-limit=5min
>   [root@virt-003 ~]# pcs property list --all | grep shutdown-lock
>    shutdown-lock: true
>    shutdown-lock-limit: 5min

Verify "cluster-recheck-interval" setting

>   [root@virt-003 ~]# pcs property list --all | grep cluster-recheck-interval
>    cluster-recheck-interval: 15min

>   [root@virt-003 11:58:24 ~]# pcs status
>   Cluster name: STSRHTS2974
>   Stack: corosync
>   Current DC: virt-003 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Fri Jan 31 11:58:33 2020
>   Last change: Fri Jan 31 11:58:14 2020 by root via cibadmin on virt-003

>   3 nodes configured
>   15 resources configured

>   Online: [ virt-003 virt-004 virt-005 ]

>   Full list of resources:

>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-004
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-004 virt-005 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-004 virt-005 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-005
>    second	(ocf::pacemaker:Dummy):	Started virt-003
>    third	(ocf::pacemaker:Dummy):	Started virt-004
>    fourth	(ocf::pacemaker:Dummy):	Started virt-005
>    fifth	(ocf::pacemaker:Dummy):	Started virt-003
>    sixth	(ocf::pacemaker:Dummy):	Started virt-004

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Shutting down node virt-004.

>   [root@virt-003 11:58:33 ~]# qarsh virt-004 pcs cluster stop
>   Stopping Cluster (pacemaker)...
>   Stopping Cluster (corosync)...

Verifying that the node is down, and resources are marked as (LOCKED).

>   [root@virt-003 11:58:54 ~]# pcs status
>   Cluster name: STSRHTS2974
>   Stack: corosync
>   Current DC: virt-003 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Fri Jan 31 11:58:59 2020
>   Last change: Fri Jan 31 11:58:14 2020 by root via cibadmin on virt-003

>   3 nodes configured
>   15 resources configured

>   Online: [ virt-003 virt-005 ]
>   OFFLINE: [ virt-004 ]

>   Full list of resources:

>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-003
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-005
>    second	(ocf::pacemaker:Dummy):	Started virt-003
>    third	(ocf::pacemaker:Dummy):	Stopped virt-004 (LOCKED)
>    fourth	(ocf::pacemaker:Dummy):	Started virt-005
>    fifth	(ocf::pacemaker:Dummy):	Started virt-003
>    sixth	(ocf::pacemaker:Dummy):	Stopped virt-004 (LOCKED)

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

After at most "shutdown-lock-limit" + "cluster-recheck-interval", the resources are recovered on other nodes and the original node remains down.

>   [root@virt-003 12:03:59 ~]# pcs status
>   Cluster name: STSRHTS2974
>   Stack: corosync
>   Current DC: virt-003 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Fri Jan 31 12:04:02 2020
>   Last change: Fri Jan 31 11:58:14 2020 by root via cibadmin on virt-003

>   3 nodes configured
>   15 resources configured

>   Online: [ virt-003 virt-005 ]
>   OFFLINE: [ virt-004 ]

>   Full list of resources:

>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-003
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-005 ]
>        Stopped: [ virt-004 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-005
>    second	(ocf::pacemaker:Dummy):	Started virt-003
>    third	(ocf::pacemaker:Dummy):	Started virt-005
>    fourth	(ocf::pacemaker:Dummy):	Started virt-005
>    fifth	(ocf::pacemaker:Dummy):	Started virt-003
>    sixth	(ocf::pacemaker:Dummy):	Started virt-003

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled

Start cluster on the node again so that it rejoins the cluster.

>   [root@virt-003 12:07:03 ~]# qarsh virt-004 pcs cluster start
>   Starting Cluster (corosync)...
>   Starting Cluster (pacemaker)...

>   [root@virt-003 12:07:13 ~]# pcs status
>   Cluster name: STSRHTS2974
>   Stack: corosync
>   Current DC: virt-003 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
>   Last updated: Fri Jan 31 12:07:25 2020
>   Last change: Fri Jan 31 11:58:14 2020 by root via cibadmin on virt-003

>   3 nodes configured
>   15 resources configured

>   Online: [ virt-003 virt-004 virt-005 ]

>   Full list of resources:

>    fence-virt-003	(stonith:fence_xvm):	Started virt-003
>    fence-virt-004	(stonith:fence_xvm):	Started virt-005
>    fence-virt-005	(stonith:fence_xvm):	Started virt-004
>    Clone Set: dlm-clone [dlm]
>        Started: [ virt-003 virt-004 virt-005 ]
>    Clone Set: clvmd-clone [clvmd]
>        Started: [ virt-003 virt-004 virt-005 ]
>    first	(ocf::pacemaker:Dummy):	Started virt-005
>    second	(ocf::pacemaker:Dummy):	Started virt-003
>    third	(ocf::pacemaker:Dummy):	Started virt-004
>    fourth	(ocf::pacemaker:Dummy):	Started virt-005
>    fifth	(ocf::pacemaker:Dummy):	Started virt-003
>    sixth	(ocf::pacemaker:Dummy):	Started virt-004

>   Daemon Status:
>     corosync: active/enabled
>     pacemaker: active/enabled
>     pcsd: active/enabled


Marking verified in pacemaker-1.1.21-4.el7.

Comment 12 errata-xmlrpc 2020-03-31 19:41:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1032

