Bug 1380372

Summary: when pcs cannot stop pacemaker on a node, it does not stop cman/corosync on the remaining nodes
Product: Red Hat Enterprise Linux 6
Component: pcs
Version: 6.8
Status: CLOSED ERRATA
Severity: unspecified
Priority: medium
Reporter: Tomas Jelinek <tojeline>
Assignee: Tomas Jelinek <tojeline>
QA Contact: cluster-qe <cluster-qe>
Docs Contact:
CC: cfeist, cluster-maint, idevat, omular, rsteiger, swgreenl, tojeline
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pcs-0.9.155-1.el6
Doc Type: Bug Fix
Doc Text:
Cause: The user wants to stop a cluster while some of its nodes are unreachable.
Consequence: Pcs exits gracefully with an error saying the nodes are unreachable; corosync/cman is, however, left running on the reachable nodes.
Fix: Make pcs proceed and stop corosync/cman on the reachable nodes.
Result: The cluster is fully stopped on all nodes where possible.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-21 11:04:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Attachments:
proposed fix (flags: none)

Description Tomas Jelinek 2016-09-29 12:18:39 UTC
Description of problem:
When stopping multiple nodes, pcs first stops pacemaker on all the nodes, and only after pacemaker has stopped everywhere does it proceed with stopping corosync / cman. If pcs is unable to stop pacemaker on at least one node, it exits with an error and leaves corosync running on the rest of the nodes, even though pacemaker has already been stopped there.
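
For illustration, here is a minimal Python sketch of this pre-fix flow. The function and node names are hypothetical stand-ins for what pcs does, not the actual pcs source:

# Sketch of the pre-fix "pcs cluster stop --all" flow: pacemaker is stopped
# on every node first, and any failure aborts the run before the
# corosync/cman phase is ever reached.

def send_stop_request(node, component):
    """Placeholder for the HTTPS request pcs sends to pcsd on each node."""
    print("%s: Stopping Cluster (%s)..." % (node, component))

def stop_cluster_all(nodes, unreachable):
    errors = []
    for node in nodes:
        if node in unreachable:
            errors.append("%s: Unable to connect" % node)
            continue
        send_stop_request(node, "pacemaker")
    if errors:
        # Pre-fix behavior: bail out here, so corosync/cman keeps running
        # even on the nodes where pacemaker was just stopped.
        raise SystemExit("Error: unable to stop all nodes")
    for node in nodes:
        send_stop_request(node, "corosync")  # "cman" on RHEL 6 cman clusters

stop_cluster_all(["rh68-node1", "rh68-node2", "rh68-node3"],
                 unreachable=["rh68-node1"])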


Version-Release number of selected component (if applicable):
pcs-0.9.148-7.el6_8.1.x86_64, pcs-0.9.152-8.el7.x86_64


How reproducible:
always, easily


Steps to Reproduce:
[root@rh68-node1:~]# service pcsd stop
Stopping pcsd:                                             [  OK  ]

[root@rh68-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh68-node1 rh68-node2 rh68-node3
 Offline:
Pacemaker Nodes:
 Online: rh68-node1 rh68-node2 rh68-node3
 Standby:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

[root@rh68-node1:~]# pcs cluster stop --all
rh68-node1: Unable to connect to rh68-node1 ([Errno 111] Connection refused)
rh68-node3: Stopping Cluster (pacemaker)...
rh68-node2: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
rh68-node1: Unable to connect to rh68-node1 ([Errno 111] Connection refused)

[root@rh68-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh68-node1 rh68-node2 rh68-node3
 Offline:
Pacemaker Nodes:
 Online: rh68-node1
 Standby: rh68-node2 rh68-node3
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:


Actual results:
corosync running on node2 and node3, where pacemaker has been stopped


Expected results:
corosync stopped on node2 and node3


Additional info:

Comment 1 Tomas Jelinek 2016-09-29 12:28:34 UTC
originally reported here: http://clusterlabs.org/pipermail/users/2016-September/004152.html

Comment 2 Tomas Jelinek 2016-09-29 12:33:08 UTC
Workaround: run "pcs cluster stop" again and specify only the nodes on which pacemaker has been stopped successfully.
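
Using the node names from the reproducer above, that second run would look something like this (only node2 and node3, where pacemaker has already stopped):

[root@rh68-node1:~]# pcs cluster stop rh68-node2 rh68-node3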

Comment 4 Scott Greenlese 2016-09-30 19:33:34 UTC
Tomas, 

Thank you very much for converting my pacemaker issue to this bugzilla. I had no idea that was even possible. 

I was able to cleanly reproduce this problem again today, so if you need any more information about my config, pacemaker or corosync logs, I will be happy to provide them. 

Now that I have a new RedHat bugzilla account, am I at liberty to open bugs here myself in the future?  

Thank you for your support. 

- Scott Greenlese  -  Linux on System Z - System Test - IBM Corp.

Comment 5 Scott Greenlese 2016-09-30 20:35:38 UTC
Another question: I don't see any way for me to get on the .cc list for email notifications. In fact, I see that my mail ID swgreenl.com is "excluded".

Is it possible for me to get on the Mail To list? 

Thanks again... - Scott

Comment 6 Scott Greenlese 2016-09-30 20:39:10 UTC
Please strike that last comment. I see that I am, in fact, on the .cc mailing list. (The exclusion happened only when my own comment was sent out to the .cc list: I was excluded because I was the author.)

Sorry for the confusion.

Comment 7 Tomas Jelinek 2016-10-03 07:57:29 UTC
(In reply to Scott Greenlese from comment #4)
> Tomas, 
> 
> Thank you very much for converting my pacemaker issue to this bugzilla. I
> had no idea that was even possible. 
> 
> I was able to cleanly reproduce this problem again today, so if you need any
> more information about my config, pacemaker or corosync logs, I will be
> happy to provide them.

Thanks, I am good. I have all the information I need to reproduce and fix the bug.

> 
> Now that I have a new RedHat bugzilla account, am I at liberty to open bugs
> here myself in the future?

I am not 100% sure what permissions new accounts have, but I believe you can open bugs here.

> 
> Thank you for your support. 
> 
> - Scott Greenlese  -  Linux on System Z - System Test - IBM Corp.

Comment 8 Tomas Jelinek 2016-10-14 11:38:21 UTC
Created attachment 1210505 [details]
proposed fix

Test:

[root@rh68-node1:~]# pcs status pcsd
  rh68-node1: Online
  rh68-node2: Online
  rh68-node3: Offline
[root@rh68-node1:~]# pcs cluster stop --all
rh68-node3: Unable to connect to rh68-node3 ([Errno 113] No route to host)
rh68-node1: Stopping Cluster (pacemaker)...
rh68-node2: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
rh68-node3: Unable to connect to rh68-node3 ([Errno 113] No route to host)
rh68-node3: Not stopping cluster - node is unreachable
rh68-node1: Stopping Cluster (cman)...
rh68-node2: Stopping Cluster (cman)...
Error: unable to stop all nodes
[root@rh68-node1:~]# echo $?
1
[root@rh68-node1:~]# service cman status
corosync is stopped
[root@rh68-node1:~]# service corosync status
corosync is stopped
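
For comparison with the sketch in the bug description, a sketch of the post-fix flow (again with hypothetical names, not the real pcs source): unreachable nodes are reported and skipped, the corosync/cman phase still runs on the reachable nodes, and the command still exits non-zero overall.

# Sketch of the post-fix flow: unreachable nodes are skipped with a message
# and the corosync/cman stop still happens on every reachable node.

def send_stop_request(node, component):
    """Placeholder for the HTTPS request pcs sends to pcsd on each node."""
    print("%s: Stopping Cluster (%s)..." % (node, component))

def stop_cluster_all_fixed(nodes, unreachable):
    reachable = [n for n in nodes if n not in unreachable]
    for node in unreachable:
        print("%s: Unable to connect to %s" % (node, node))
    for node in reachable:
        send_stop_request(node, "pacemaker")
    if unreachable:
        print("Error: unable to stop all nodes")
        for node in unreachable:
            print("%s: Not stopping cluster - node is unreachable" % node)
    for node in reachable:
        send_stop_request(node, "cman")  # corosync on non-cman clusters
    if unreachable:
        raise SystemExit(1)

stop_cluster_all_fixed(["rh68-node1", "rh68-node2", "rh68-node3"],
                       unreachable=["rh68-node3"])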

Comment 9 Ivan Devat 2016-11-07 15:35:01 UTC
Before Fix:

[vm-rhel67-1 ~] $ rpm -q pcs
pcs-0.9.154-1.el6.x86_64

[vm-rhel67-1 ~] $ service pcsd stop
Stopping pcsd:                                             [  OK  ]
[vm-rhel67-1 ~] $ pcs status pcsd
  vm-rhel67-1: Offline
  vm-rhel67-2: Online
  vm-rhel67-3: Online
[vm-rhel67-1 ~] $ pcs cluster stop --all
vm-rhel67-1: Unable to connect to vm-rhel67-1 ([Errno 111] Connection refused)
vm-rhel67-3: Stopping Cluster (pacemaker)...
vm-rhel67-2: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
vm-rhel67-1: Unable to connect to vm-rhel67-1 ([Errno 111] Connection refused)
[vm-rhel67-1 ~] $ echo $?
1
[vm-rhel67-1 ~] $ pcs status nodes both
Corosync Nodes:
 Online: vm-rhel67-1 vm-rhel67-2
 Offline: vm-rhel67-3
Pacemaker Nodes:
 Online: vm-rhel67-1
 Standby: vm-rhel67-2
 Maintenance:
 Offline: vm-rhel67-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:


After Fix:

[vm-rhel67-1 ~] $ rpm -q pcs
pcs-0.9.155-1.el6.x86_64

[vm-rhel67-1 ~] $ service pcsd stop
Stopping pcsd:                                             [  OK  ]
[vm-rhel67-1 ~] $ pcs status pcsd
  vm-rhel67-1: Offline
  vm-rhel67-3: Online
  vm-rhel67-2: Online
[vm-rhel67-1 ~] $ pcs cluster stop --all
vm-rhel67-1: Unable to connect to vm-rhel67-1 ([Errno 111] Connection refused)
vm-rhel67-2: Stopping Cluster (pacemaker)...
vm-rhel67-3: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
vm-rhel67-1: Unable to connect to vm-rhel67-1 ([Errno 111] Connection refused)
vm-rhel67-1: Not stopping cluster - node is unreachable
vm-rhel67-3: Stopping Cluster (cman)...
vm-rhel67-2: Stopping Cluster (cman)...
Error: unable to stop all nodes
[vm-rhel67-1 ~] $ echo $?
1
[vm-rhel67-1 ~] $ pcs status nodes both
Corosync Nodes:
 Online: vm-rhel67-1
 Offline: vm-rhel67-2 vm-rhel67-3
Pacemaker Nodes:
 Online: vm-rhel67-1
 Standby:
 Maintenance:
 Offline: vm-rhel67-2 vm-rhel67-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

Comment 13 errata-xmlrpc 2017-03-21 11:04:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0707.html