Bug 1184763

Summary: pcs cluster stop behavior is not optimal and can lead to fencing nodes
Product: Red Hat Enterprise Linux 6
Component: pcs
Version: 6.6
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Tomas Jelinek <tojeline>
Assignee: Tomas Jelinek <tojeline>
QA Contact: cluster-qe <cluster-qe>
CC: cfeist, cluster-maint, rsteiger, tojeline
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pcs-0.9.139-2.el6
Doc Type: Bug Fix
Doc Text:
* Previously, pcs stopped cluster nodes sequentially one at a time, which caused the cluster resources to be moved from one node to another pointlessly. Consequently, the stop operation took a long time to finish. Also, losing the quorum during the process could result in node fencing. With this update, pcs stops the nodes simultaneously, preventing the resources from being moved around pointlessly and speeding up the stop operation. In addition, pcs prints a warning if stopping the nodes would cause the cluster to lose the quorum. To stop the nodes in this situation, the user is required to add the "--force" option. (BZ#1174801, BZ#1184763)
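In practice (see the verification transcripts in the comments below): running "pcs cluster stop" on a node whose departure would break quorum now fails with "Error: Stopping the node will cause a loss of the quorum, use --force to override" and exits with status 1, while "pcs cluster stop --force" proceeds with the shutdown.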
Clone Of: 1180506
Last Closed: 2015-07-22 06:16:00 UTC
Type: Bug
Bug Depends On: 1180506    
Attachments: proposed fix - loss of quorum warning

Comment 1 Tomas Jelinek 2015-01-22 09:13:56 UTC
Stopping nodes one by one, which resulted in services being moved around the cluster, has been fixed by bz1174801.
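For reference, a minimal sketch of the parallel approach (an illustration only -- pcs actually coordinates the stop through the pcsd daemons rather than over ssh). The point is the two-phase ordering visible in the "After Fix" transcript in comment 5: pacemaker is stopped on all nodes first, and cman only after that, so the cluster never tries to recover resources on nodes that are still up:

  nodes="rh66-node1 rh66-node2 rh66-node3"
  # Phase 1: stop pacemaker everywhere in parallel, so resources are simply
  # stopped instead of failing over to still-running nodes.
  for node in $nodes; do
      ssh "$node" 'service pacemaker stop' &
  done
  wait
  # Phase 2: only then take down the membership layer (cman) everywhere.
  for node in $nodes; do
      ssh "$node" 'service cman stop' &
  done
  wait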

The loss of quorum during cluster shutdown has been addressed by the following upstream patches:
https://github.com/feist/pcs/commit/1ab2dd1b13839df7e5e9809cde25ac1dbae42c3d
https://github.com/feist/pcs/commit/5885c90faca010e3bceb3028638629fb69dca36e
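The check itself is simple arithmetic: refuse the stop if the number of cluster members remaining afterwards would fall below the quorum threshold. A rough shell equivalent (an illustration only, assuming one vote per node and the "Nodes:" and "Quorum:" fields of cman_tool status as in the test below, with the nodes to stop passed as arguments -- the actual patches implement this inside pcs itself):

  # Illustration only: would stopping the given nodes lose quorum?
  nodes=$(cman_tool status | awk '/^Nodes:/ {print $2}')
  quorum=$(cman_tool status | awk '/^Quorum:/ {print $2}')
  if [ $((nodes - $#)) -lt "$quorum" ]; then
      echo "Error: Stopping the node(s) will cause a loss of the quorum, use --force to override" >&2
      exit 1
  fi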

Comment 2 Tomas Jelinek 2015-01-22 09:23:05 UTC
Created attachment 982649 [details]
proposed fix - loss of quorum warning

related upstream patch: https://github.com/feist/pcs/commit/200559d8ca0b834f90d4f2ba70e8f7ce403b9726

Comment 3 Tomas Jelinek 2015-01-22 10:09:30 UTC
Test:

[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 3
Expected votes: 3
Quorum: 2
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Offline: 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Standby: 
 Offline: 
[root@rh66-node1:~]# pcs cluster stop rh66-node3
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 
 Offline: rh66-node3 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 
 Standby: 
 Offline: rh66-node3 
[root@rh66-node1:~]# pcs cluster stop rh66-node2
Error: Stopping the node(s) will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop rh66-node1
Error: Stopping the node(s) will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop
Error: Stopping the node will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop --force
Stopping Cluster (pacemaker)... Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status
Error: cluster is not currently running on this node
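
(For the record: with three single-vote nodes, the quorum threshold is floor(3/2)+1 = 2. Stopping rh66-node3 leaves 2 members, which still meets the threshold; stopping either remaining node would leave 1 < 2, hence the errors and the exit status of 1 above.)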

Comment 5 Tomas Jelinek 2015-01-27 14:18:06 UTC
Before Fix:
[root@rh66-node1 ~]# rpm -q pcs
pcs-0.9.123-9.el6.x86_64
[root@rh66-node1:~]# pcs resource create delay1 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay2 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay3 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs status | grep delay
 delay1 (ocf::heartbeat:Delay): Started rh66-node1 
 delay2 (ocf::heartbeat:Delay): Started rh66-node2 
 delay3 (ocf::heartbeat:Delay): Started rh66-node3
[root@rh66-node1:~]# time pcs cluster stop --all
rh66-node1: Stopping Cluster...
rh66-node2: Stopping Cluster...
rh66-node3: Stopping Cluster...

real    1m27.598s
user    0m0.092s
sys     0m0.017s
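
(The ~1m28s is consistent with the sequential behavior: each Delay resource has stopdelay=10 and startdelay=10, and stopping the nodes one at a time makes the resources fail over to the still-running nodes, paying repeated 10-second stop and start delays along the way. The parallel stop below avoids the failovers entirely and finishes in ~24 seconds.)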


[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 3
Expected votes: 3
Quorum: 2  
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Offline: 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Standby: 
 Offline: 
[root@rh66-node1:~]# pcs cluster stop rh66-node3
rh66-node3: Stopping Cluster...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 
 Offline: rh66-node3 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 
 Standby: 
 Offline: rh66-node3 
[root@rh66-node1:~]# pcs cluster stop rh66-node2
rh66-node2: Stopping Cluster...
[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 1
Expected votes: 3
Quorum: 2 Activity blocked
[root@rh66-node1:~]# pcs cluster start rh66-node2
rh66-node2: Starting Cluster...
[root@rh66-node1:~]# pcs cluster stop 
Stopping Cluster...
[root@rh66-node1:~]# pcs status
Error: cluster is not currently running on this node



After Fix:
[root@rh66-node1:~]# rpm -q pcs
pcs-0.9.138-1.el6.x86_64
[root@rh66-node1:~]# pcs resource create delay1 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay2 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay3 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs status | grep delay
 delay1 (ocf::heartbeat:Delay): Started rh66-node1 
 delay2 (ocf::heartbeat:Delay): Started rh66-node2 
 delay3 (ocf::heartbeat:Delay): Started rh66-node3
[root@rh66-node1:~]# time pcs cluster stop --all
rh66-node1: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node2: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (cman)...
rh66-node2: Stopping Cluster (cman)...
rh66-node1: Stopping Cluster (cman)...

real    0m24.300s
user    0m0.205s
sys     0m0.045s


[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 3
Expected votes: 3
Quorum: 2  
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Offline: 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3 
 Standby: 
 Offline: 
[root@rh66-node1:~]# pcs cluster stop rh66-node3
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 
 Offline: rh66-node3 
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 
 Standby: 
 Offline: rh66-node3 
[root@rh66-node1:~]# pcs cluster stop rh66-node2
Error: Stopping the node(s) will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop
Error: Stopping the node will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop --force
Stopping Cluster (pacemaker)... Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status
Error: cluster is not currently running on this node

Comment 13 Tomas Jelinek 2015-03-02 15:08:23 UTC
patch in upstream: https://github.com/feist/pcs/commit/513661834a0c096ccb6490bba38a31c7273af329
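
This extends the same quorum check to "pcs cluster node remove". In the transcripts below, rh66-node2 is already offline, so removing rh66-node3 would leave only rh66-node1 online (1 member < quorum of 2); after the fix the removal is refused unless --force is given.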

Comment 14 Tomas Jelinek 2015-03-02 16:50:50 UTC
Before fix:

[root@rh66-node1:~]# rpm -q pcs
pcs-0.9.139-1.el6.x86_64
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Offline:
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Standby:
 Offline:
[root@rh66-node1:~]# pcs cluster stop rh66-node2
rh66-node2: Stopping Cluster (pacemaker)...
rh66-node2: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node3
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1 rh66-node3
 Standby:
 Offline: rh66-node2
[root@rh66-node1:~]# pcs cluster node remove rh66-node3
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Successfully destroyed cluster
rh66-node1: Corosync updated
rh66-node2: Corosync updated
[root@rh66-node1:~]# echo $?
0
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1
 Standby:
 Offline: rh66-node2


After fix:

[root@rh66-node1:~]# rpm -q pcs
pcs-0.9.139-2.el6.x86_64
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Offline:
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Standby:
 Offline:
[root@rh66-node1:~]# pcs cluster stop rh66-node2
rh66-node2: Stopping Cluster (pacemaker)...
rh66-node2: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node3
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1 rh66-node3
 Standby:
 Offline: rh66-node2
[root@rh66-node1:~]# pcs cluster node remove rh66-node3
Error: Removing the node will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node3
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1 rh66-node3
 Standby:
 Offline: rh66-node2
[root@rh66-node1:~]# pcs cluster node remove rh66-node3 --force
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Successfully destroyed cluster
rh66-node1: Corosync updated
rh66-node2: Corosync updated
[root@rh66-node1:~]# echo $?
0
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1
 Standby:
 Offline: rh66-node2

Comment 19 errata-xmlrpc 2015-07-22 06:16:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1446.html