Bug 1184763
Summary: pcs cluster stop behavior is not optimal and can lead to fencing nodes

Product: Red Hat Enterprise Linux 6
Component: pcs
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 6.6
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pcs-0.9.139-2.el6
Doc Type: Bug Fix

Reporter: Tomas Jelinek <tojeline>
Assignee: Tomas Jelinek <tojeline>
QA Contact: cluster-qe <cluster-qe>
Docs Contact:
CC: cfeist, cluster-maint, rsteiger, tojeline
Doc Text:
Previously, pcs stopped cluster nodes sequentially, one at a time, which caused cluster resources to be moved needlessly from node to node. Consequently, the stop operation took a long time to finish, and losing the quorum during the process could result in node fencing. With this update, pcs stops the nodes simultaneously, which prevents the pointless resource migration and speeds up the stop operation. In addition, pcs now refuses to stop the nodes if doing so would cause the cluster to lose the quorum; to stop the nodes in this situation, the user is required to add the "--force" option. (BZ#1174801, BZ#1184763)
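The quorum check and simultaneous stop described in the doc text can be illustrated with a minimal sketch. This is not the actual pcs implementation; the `stop_node` callback and the node lists are hypothetical stand-ins, and one quorum vote per node is assumed (matching the `Quorum: 2` a 3-node cluster reports below).

```python
# Illustrative sketch of the fixed "pcs cluster stop" behavior -- NOT the
# actual pcs source. stop_node() is a hypothetical callback; each node is
# assumed to carry one quorum vote.
from concurrent.futures import ThreadPoolExecutor


def quorum_will_be_lost(online_nodes, nodes_to_stop, expected_votes):
    """Return True if stopping the given nodes drops the cluster below quorum.

    With one vote per node, quorum is floor(expected_votes / 2) + 1,
    i.e. 2 for the 3-node test cluster shown in the transcripts.
    """
    quorum = expected_votes // 2 + 1
    remaining = len(set(online_nodes) - set(nodes_to_stop))
    return remaining < quorum


def stop_nodes(online_nodes, nodes_to_stop, expected_votes, stop_node,
               force=False):
    """Refuse to stop past the quorum threshold unless forced, then stop
    all requested nodes simultaneously (not one at a time), so resources
    are not pointlessly migrated from one stopping node to the next."""
    if not force and quorum_will_be_lost(online_nodes, nodes_to_stop,
                                         expected_votes):
        raise SystemExit(
            "Error: Stopping the node(s) will cause a loss of the quorum, "
            "use --force to override")
    with ThreadPoolExecutor(max_workers=len(nodes_to_stop)) as pool:
        list(pool.map(stop_node, nodes_to_stop))


# 3-node cluster: stopping one node keeps quorum, stopping two loses it.
online = ["rh66-node1", "rh66-node2", "rh66-node3"]
print(quorum_will_be_lost(online, ["rh66-node3"], expected_votes=3))  # False
print(quorum_will_be_lost(online, ["rh66-node2", "rh66-node3"], 3))   # True
```

This also explains the transcripts below: once rh66-node3 is down, only two of three votes remain, so stopping any further node would leave 1 < 2 votes and pcs exits with the quorum error unless `--force` is given.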
Story Points: ---
Clone Of: 1180506
Environment:
Last Closed: 2015-07-22 06:16:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1180506
Bug Blocks:
Attachments:
Comment 1
Tomas Jelinek
2015-01-22 09:13:56 UTC
Created attachment 982649 [details]
proposed fix - loss of quorum warning

related upstream patch: https://github.com/feist/pcs/commit/200559d8ca0b834f90d4f2ba70e8f7ce403b9726

Test:

```
[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 3
Expected votes: 3
Quorum: 2
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Offline:
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Standby:
 Offline:
[root@rh66-node1:~]# pcs cluster stop rh66-node3
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2
 Offline: rh66-node3
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2
 Standby:
 Offline: rh66-node3
[root@rh66-node1:~]# pcs cluster stop rh66-node2
Error: Stopping the node(s) will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop rh66-node1
Error: Stopping the node(s) will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop
Error: Stopping the node will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop --force
Stopping Cluster (pacemaker)...
Stopping Cluster (cman)...
```
```
[root@rh66-node1:~]# pcs status
Error: cluster is not currently running on this node
```

Before Fix:

```
[root@rh66-node1 ~]# rpm -q pcs
pcs-0.9.123-9.el6.x86_64
[root@rh66-node1:~]# pcs resource create delay1 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay2 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay3 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs status | grep delay
 delay1 (ocf::heartbeat:Delay): Started rh66-node1
 delay2 (ocf::heartbeat:Delay): Started rh66-node2
 delay3 (ocf::heartbeat:Delay): Started rh66-node3
[root@rh66-node1:~]# time pcs cluster stop --all
rh66-node1: Stopping Cluster...
rh66-node2: Stopping Cluster...
rh66-node3: Stopping Cluster...

real    1m27.598s
user    0m0.092s
sys     0m0.017s
[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 3
Expected votes: 3
Quorum: 2
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Offline:
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Standby:
 Offline:
[root@rh66-node1:~]# pcs cluster stop rh66-node3
rh66-node3: Stopping Cluster...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2
 Offline: rh66-node3
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2
 Standby:
 Offline: rh66-node3
[root@rh66-node1:~]# pcs cluster stop rh66-node2
rh66-node2: Stopping Cluster...
[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 1
Expected votes: 3
Quorum: 2 Activity blocked
[root@rh66-node1:~]# pcs cluster start rh66-node2
rh66-node2: Starting Cluster...
[root@rh66-node1:~]# pcs cluster stop
Stopping Cluster...
```
```
[root@rh66-node1:~]# pcs status
Error: cluster is not currently running on this node
```

After Fix:

```
[root@rh66-node1:~]# rpm -q pcs
pcs-0.9.138-1.el6.x86_64
[root@rh66-node1:~]# pcs resource create delay1 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay2 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs resource create delay3 delay startdelay=10 stopdelay=10
[root@rh66-node1:~]# pcs status | grep delay
 delay1 (ocf::heartbeat:Delay): Started rh66-node1
 delay2 (ocf::heartbeat:Delay): Started rh66-node2
 delay3 (ocf::heartbeat:Delay): Started rh66-node3
[root@rh66-node1:~]# time pcs cluster stop --all
rh66-node1: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node2: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (cman)...
rh66-node2: Stopping Cluster (cman)...
rh66-node1: Stopping Cluster (cman)...

real    0m24.300s
user    0m0.205s
sys     0m0.045s
[root@rh66-node1:~]# cman_tool status | grep 'Nodes\|Expected\|Quorum'
Nodes: 3
Expected votes: 3
Quorum: 2
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Offline:
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Standby:
 Offline:
[root@rh66-node1:~]# pcs cluster stop rh66-node3
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2
 Offline: rh66-node3
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2
 Standby:
 Offline: rh66-node3
[root@rh66-node1:~]# pcs cluster stop rh66-node2
Error: Stopping the node(s) will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop
Error: Stopping the node will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs cluster stop --force
Stopping Cluster (pacemaker)...
Stopping Cluster (cman)...
```
```
[root@rh66-node1:~]# pcs status
Error: cluster is not currently running on this node
```

patch in upstream: https://github.com/feist/pcs/commit/513661834a0c096ccb6490bba38a31c7273af329

Before fix:

```
[root@rh66-node1:~]# rpm -q pcs
pcs-0.9.139-1.el6.x86_64
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Offline:
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Standby:
 Offline:
[root@rh66-node1:~]# pcs cluster stop rh66-node2
rh66-node2: Stopping Cluster (pacemaker)...
rh66-node2: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node3
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1 rh66-node3
 Standby:
 Offline: rh66-node2
[root@rh66-node1:~]# pcs cluster node remove rh66-node3
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Successfully destroyed cluster
rh66-node1: Corosync updated
rh66-node2: Corosync updated
[root@rh66-node1:~]# echo $?
0
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1
 Standby:
 Offline: rh66-node2
```

After fix:

```
[root@rh66-node1:~]# rpm -q pcs
pcs-0.9.139-2.el6.x86_64
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Offline:
Pacemaker Nodes:
 Online: rh66-node1 rh66-node2 rh66-node3
 Standby:
 Offline:
[root@rh66-node1:~]# pcs cluster stop rh66-node2
rh66-node2: Stopping Cluster (pacemaker)...
rh66-node2: Stopping Cluster (cman)...
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node3
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1 rh66-node3
 Standby:
 Offline: rh66-node2
[root@rh66-node1:~]# pcs cluster node remove rh66-node3
Error: Removing the node will cause a loss of the quorum, use --force to override
[root@rh66-node1:~]# echo $?
1
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1 rh66-node3
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1 rh66-node3
 Standby:
 Offline: rh66-node2
[root@rh66-node1:~]# pcs cluster node remove rh66-node3 --force
rh66-node3: Stopping Cluster (pacemaker)...
rh66-node3: Successfully destroyed cluster
rh66-node1: Corosync updated
rh66-node2: Corosync updated
[root@rh66-node1:~]# echo $?
0
[root@rh66-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh66-node1
 Offline: rh66-node2
Pacemaker Nodes:
 Online: rh66-node1
 Standby:
 Offline: rh66-node2
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1446.html