Bug 1184763 - pcs cluster stop behavior is not optimal and can lead to fencing nodes
Summary: pcs cluster stop behavior is not optimal and can lead to fencing nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: pcs
Version: 6.6
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: rc
Target Release: ---
Assignee: Tomas Jelinek
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On: 1180506
Blocks:
 
Reported: 2015-01-22 09:08 UTC by Tomas Jelinek
Modified: 2015-07-22 06:16 UTC
CC: 4 users

Fixed In Version: pcs-0.9.139-2.el6
Doc Type: Bug Fix
Doc Text:
* Previously, pcs stopped cluster nodes sequentially one at a time, which caused the cluster resources to be moved from one node to another pointlessly. Consequently, the stop operation took a long time to finish. Also, losing the quorum during the process could result in node fencing. With this update, pcs stops the nodes simultaneously, preventing the resources from being moved around pointlessly and speeding up the stop operation. In addition, pcs prints a warning if stopping the nodes would cause the cluster to lose the quorum. To stop the nodes in this situation, the user is required to add the "--force" option. (BZ#1174801, BZ#1184763)
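
The quorum guard described above can be sketched as follows. This is an illustrative model only, not pcs's actual code; the function name, arguments, and error handling are hypothetical. The idea matches the behavior shown in the test transcripts below: with 3 nodes and a quorum of 2, stopping one node is allowed, but stopping a second is refused without "--force".

```python
# Hypothetical sketch of the quorum-loss check pcs performs before
# stopping nodes. Names are illustrative; this is not pcs's internal API.

def check_quorum_loss(online_nodes, nodes_to_stop, quorum, force=False):
    """Refuse to stop nodes if doing so would drop the cluster below quorum.

    online_nodes:  set of currently online node names
    nodes_to_stop: set of node names about to be stopped
    quorum:        minimum votes needed (cman_tool's "Quorum:" value)
    force:         mirrors pcs's --force override
    """
    remaining = online_nodes - set(nodes_to_stop)
    if len(remaining) < quorum and not force:
        # pcs prints this error and exits with status 1
        raise SystemExit(
            "Error: Stopping the node(s) will cause a loss of the quorum, "
            "use --force to override"
        )
```

With a 3-node cluster and quorum 2, stopping one node passes the check and stopping a second raises the error, matching the transcripts in comment 3.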
Clone Of: 1180506
Environment:
Last Closed: 2015-07-22 06:16:00 UTC


Attachments
proposed fix - loss of quorum warning (12.30 KB, patch)
2015-01-22 09:23 UTC, Tomas Jelinek


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:1446 normal SHIPPED_LIVE pcs bug fix and enhancement update 2015-07-20 18:43:57 UTC

Comment 1 Tomas Jelinek 2015-01-22 09:13:56 UTC
Stopping nodes one by one, which resulted in services being moved around the cluster, has been fixed by bz1174801.

Loss of the quorum during cluster shutdown has been fixed by upstream patches:
https://github.com/feist/pcs/commit/1ab2dd1b13839df7e5e9809cde25ac1dbae42c3d
https://github.com/feist/pcs/commit/5885c90faca010e3bceb3028638629fb69dca36e
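
The simultaneous-stop behavior fixed by bz1174801 can be modeled roughly as below. This is a sketch under stated assumptions, not pcs's implementation: the "After Fix" transcript in comment 5 shows pacemaker being stopped on all nodes before cman is stopped anywhere, which suggests two parallel phases. The helper names here are invented for illustration.

```python
# Illustrative sketch (not pcs's actual code) of stopping cluster daemons
# on all nodes in parallel, in two phases: pacemaker everywhere first,
# then cman everywhere, as the fixed "pcs cluster stop --all" output shows.
from concurrent.futures import ThreadPoolExecutor

def stop_service(node, service):
    # Real pcs would contact the node over the network; here we just
    # return the message it would print.
    return f"{node}: Stopping Cluster ({service})..."

def stop_cluster_all(nodes):
    messages = []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        for service in ("pacemaker", "cman"):
            # All nodes stop this service concurrently; the next phase
            # begins only after every node finishes the current one.
            messages.extend(
                pool.map(stop_service, nodes, [service] * len(nodes))
            )
    return messages
```

Stopping all nodes at once is what avoids pointless resource migration: no surviving node ever looks like a failover target, so resources are simply stopped in place rather than bounced between nodes during a sequential shutdown.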

Comment 2 Tomas Jelinek 2015-01-22 09:23:05 UTC
Created attachment 982649 [details]
proposed fix - loss of quorum warning

related upstream patch: https://github.com/feist/pcs/commit/200559d8ca0b834f90d4f2ba70e8f7ce403b9726

Comment 3 Tomas Jelinek 2015-01-22 10:09:30 UTC
Test:

[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 3
Expected votes: 3
Quorum: 2
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Offline: 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Standby: 
 Offline: 
[root@rh66-node1:~]# pcs cluster stop rh66-node3
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 
 Offline: rh66-node3 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 
 Standby: 
 Offline: rh66-node3 
[root@rh66-node1:~]# pcs cluster stop rh66-node2
Error: Stopping the node(s) will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop rh66-node1
Error: Stopping the node(s) will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop
Error: Stopping the node will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop --force
Stopping Cluster (pacemaker)... Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status
Error: cluster is not currently running on this node

Comment 5 Tomas Jelinek 2015-01-27 14:18:06 UTC
Before Fix:
[root@rh66-node1 ~]# rpm -q pcs
pcs-0.9.123-9.el6.x86_64
[root@rh66-node1:~]# pcs resource create delay1 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay2 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay3 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs status | grep delay
 delay1 (ocf::heartbeat:Delay): Started rh66-node1 
 delay2 (ocf::heartbeat:Delay): Started rh66-node2 
 delay3 (ocf::heartbeat:Delay): Started rh66-node3
[root@rh66-node1:~]# time pcs cluster stop --all
rh66-node1: Stopping Cluster...
rh66-node2: Stopping Cluster...
rh66-node3: Stopping Cluster...

real    1m27.598s
user    0m0.092s
sys     0m0.017s


[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 3
Expected votes: 3
Quorum: 2  
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Offline: 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Standby: 
 Offline: 
[root@rh66-node1:~]# pcs cluster stop rh66-node3
rh66-node3: Stopping Cluster...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 
 Offline: rh66-node3 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 
 Standby: 
 Offline: rh66-node3 
[root@rh66-node1:~]# pcs cluster stop rh66-node2
rh66-node2: Stopping Cluster...
[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 1
Expected votes: 3
Quorum: 2 Activity blocked
[root@rh66-node1:~]# pcs cluster start rh66-node2
rh66-node2: Starting Cluster...
[root@rh66-node1:~]# pcs cluster stop 
Stopping Cluster...
[root@rh66-node1:~]# pcs status
Error: cluster is not currently running on this node



After Fix:
[root@rh66-node1:~]# rpm -q pcs
pcs-0.9.138-1.el6.x86_64
[root@rh66-node1:~]# pcs resource create delay1 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay2 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay3 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs status | grep delay
 delay1 (ocf::heartbeat:Delay): Started rh66-node1 
 delay2 (ocf::heartbeat:Delay): Started rh66-node2 
 delay3 (ocf::heartbeat:Delay): Started rh66-node3
[root@rh66-node1:~]# time pcs cluster stop --all
rh66-node1: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node2: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (cman)...
rh66-node2: Stopping Cluster (cman)...
rh66-node1: Stopping Cluster (cman)...

real    0m24.300s
user    0m0.205s
sys     0m0.045s


[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 3
Expected votes: 3
Quorum: 2  
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Offline: 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Standby: 
 Offline: 
[root@rh66-node1:~]# pcs cluster stop rh66-node3
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 
 Offline: rh66-node3 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 
 Standby: 
 Offline: rh66-node3 
[root@rh66-node1:~]# pcs cluster stop rh66-node2
Error: Stopping the node(s) will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop
Error: Stopping the node will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop --force
Stopping Cluster (pacemaker)... Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status
Error: cluster is not currently running on this node

Comment 13 Tomas Jelinek 2015-03-02 15:08:23 UTC
patch in upstream: https://github.com/feist/pcs/commit/513661834a0c096ccb6490bba38a31c7273af329

Comment 14 Tomas Jelinek 2015-03-02 16:50:50 UTC
Before fix:

[root@rh66-node1:~]# rpm -q pcs
pcs-0.9.139-1.el6.x86_64
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Offline:
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Standby:
 Offline:
[root@rh66-node1:~]# pcs cluster stop rh66-node2
rh66-node2: Stopping Cluster (pacemaker)...
rh66-node2: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node3
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1 rh66-node3
 Standby:
 Offline: rh66-node2
[root@rh66-node1:~]# pcs cluster node remove rh66-node3
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Successfully destroyed cluster
rh66-node1: Corosync updated
rh66-node2: Corosync updated
[root@rh66-node1:~]# echo $?
0
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1
 Standby:
 Offline: rh66-node2


After fix:

[root@rh66-node1:~]# rpm -q pcs
pcs-0.9.139-2.el6.x86_64
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Offline:
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Standby:
 Offline:
[root@rh66-node1:~]# pcs cluster stop rh66-node2
rh66-node2: Stopping Cluster (pacemaker)...
rh66-node2: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node3
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1 rh66-node3
 Standby:
 Offline: rh66-node2
[root@rh66-node1:~]# pcs cluster node remove rh66-node3
Error: Removing the node will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node3
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1 rh66-node3
 Standby:
 Offline: rh66-node2
[root@rh66-node1:~]# pcs cluster node remove rh66-node3 --force
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Successfully destroyed cluster
rh66-node1: Corosync updated
rh66-node2: Corosync updated
[root@rh66-node1:~]# echo $?
0
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1
 Standby:
 Offline: rh66-node2

Comment 19 errata-xmlrpc 2015-07-22 06:16:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1446.html

