Description of problem:
When stopping multiple nodes, pcs first stops pacemaker on the nodes, and only after pacemaker has stopped on all of them does it proceed with stopping corosync / cman. If pcs is unable to stop pacemaker on at least one node, it exits with an error, leaving corosync running on the rest of the nodes.

Version-Release number of selected component (if applicable):
pcs-0.9.148-7.el6_8.1.x86_64, pcs-0.9.152-8.el7.x86_64

How reproducible:
always, easily

Steps to Reproduce:
[root@rh68-node1:~]# service pcsd stop
Stopping pcsd: [ OK ]
[root@rh68-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh68-node1 rh68-node2 rh68-node3
 Offline:
Pacemaker Nodes:
 Online: rh68-node1 rh68-node2 rh68-node3
 Standby:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:
[root@rh68-node1:~]# pcs cluster stop --all
rh68-node1: Unable to connect to rh68-node1 ([Errno 111] Connection refused)
rh68-node3: Stopping Cluster (pacemaker)...
rh68-node2: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
rh68-node1: Unable to connect to rh68-node1 ([Errno 111] Connection refused)
[root@rh68-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh68-node1 rh68-node2 rh68-node3
 Offline:
Pacemaker Nodes:
 Online: rh68-node1
 Standby: rh68-node2 rh68-node3
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

Actual results:
corosync running on node2 and node3, where pacemaker has been stopped

Expected results:
corosync stopped on node2 and node3

Additional info:
originally reported here: http://clusterlabs.org/pipermail/users/2016-September/004152.html
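To make the failure mode concrete, below is a minimal sketch, in Python, of the two-phase stop flow described above. It is an illustration only, not the actual pcs code: the node list, the UNREACHABLE set and the stop_pacemaker() / stop_corosync() helpers are hypothetical stand-ins for the per-node requests pcs sends to pcsd.

# Minimal sketch of the pre-fix control flow; NOT the actual pcs code.
# stop_pacemaker() / stop_corosync() are hypothetical stand-ins for the
# per-node requests that pcs sends to pcsd on each cluster node.
import sys

UNREACHABLE = {"rh68-node1"}   # pcsd has been stopped on this node
NODES = ["rh68-node1", "rh68-node2", "rh68-node3"]

def stop_pacemaker(node):
    if node in UNREACHABLE:
        print("%s: Unable to connect to %s ([Errno 111] Connection refused)" % (node, node))
        return False
    print("%s: Stopping Cluster (pacemaker)..." % node)
    return True

def stop_corosync(node):
    print("%s: Stopping Cluster (cman)..." % node)
    return True

# Phase 1: stop pacemaker on every node.
results = [stop_pacemaker(n) for n in NODES]

# Pre-fix behaviour: any failure aborts the whole command right here, so
# the corosync/cman phase below never runs and corosync keeps running on
# the nodes where pacemaker has already been stopped.
if not all(results):
    sys.exit("Error: unable to stop all nodes")

# Phase 2: stop corosync / cman (never reached in the failure case above).
for n in NODES:
    stop_corosync(n)

The important point is the early exit between the two phases: once any node fails in the pacemaker phase, the corosync/cman phase is skipped for every node, including those whose pacemaker is already down.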
Workaround: run "pcs cluster stop" again and specify only the nodes on which pacemaker has been stopped successfully.
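For example, with the node names from the reproducer above, that would be:

[root@rh68-node1:~]# pcs cluster stop rh68-node2 rh68-node3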
Tomas,

Thank you very much for converting my pacemaker issue to this bugzilla. I had no idea that was even possible.

I was able to cleanly reproduce this problem again today, so if you need any more information about my config, pacemaker or corosync logs, I will be happy to provide them.

Now that I have a new RedHat bugzilla account, am I at liberty to open bugs here myself in the future?

Thank you for your support.

- Scott Greenlese - Linux on System Z - System Test - IBM Corp.
Another question. I don't see any way for me to get on the .cc list for email notifications. In fact, I see that my mail ID swgreenl.com is "excluded". Is it possible for me to get on the Mail To list? Thanks again... - Scott
Please strike that last comment. I see that I am, in fact, on the .cc mailing list. (The exclusion was because, when my comment was sent out to the .cc list, I was excluded as the author.) Sorry for the confusion.
(In reply to Scott Greenlese from comment #4)
> Tomas,
> 
> Thank you very much for converting my pacemaker issue to this bugzilla. I
> had no idea that was even possible.
> 
> I was able to cleanly reproduce this problem again today, so if you need any
> more information about my config, pacemaker or corosync logs, I will be
> happy to provide them.

Thanks, I am good. I have all the information I need to reproduce and fix the bug.

> Now that I have a new RedHat bugzilla account, am I at liberty to open bugs
> here myself in the future?

I am not 100% sure what permissions new accounts have, but I believe you can open bugs here.

> Thank you for your support.
> 
> - Scott Greenlese - Linux on System Z - System Test - IBM Corp.
Created attachment 1210505 [details]
proposed fix

Test:

[root@rh68-node1:~]# pcs status pcsd
rh68-node1: Online
rh68-node2: Online
rh68-node3: Offline
[root@rh68-node1:~]# pcs cluster stop --all
rh68-node3: Unable to connect to rh68-node3 ([Errno 113] No route to host)
rh68-node1: Stopping Cluster (pacemaker)...
rh68-node2: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
rh68-node3: Unable to connect to rh68-node3 ([Errno 113] No route to host)
rh68-node3: Not stopping cluster - node is unreachable
rh68-node1: Stopping Cluster (cman)...
rh68-node2: Stopping Cluster (cman)...
Error: unable to stop all nodes
[root@rh68-node1:~]# echo $?
1
[root@rh68-node1:~]# service cman status
corosync is stopped
[root@rh68-node1:~]# service corosync status
corosync is stopped
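Judging purely from the test output above (not from the patch itself), the fixed flow appears to keep track of unreachable nodes instead of aborting, skip them in the corosync/cman phase, and still exit non-zero. A minimal sketch under that assumption, with hypothetical helper and function names:

# Sketch of the post-fix flow inferred from the test transcript above;
# NOT the actual patch. stop_on_node() is a hypothetical stand-in for
# the per-node pcsd request.

def stop_on_node(node, service, unreachable):
    if node in unreachable:
        print("%s: Unable to connect to %s" % (node, node))
        return False
    print("%s: Stopping Cluster (%s)..." % (node, service))
    return True

def cluster_stop_all(nodes, unreachable):
    # Phase 1: stop pacemaker everywhere; remember failures instead of
    # aborting (this is the behavioural change).
    failed = set(n for n in nodes if not stop_on_node(n, "pacemaker", unreachable))
    if failed:
        print("Error: unable to stop all nodes")
        for n in sorted(failed):
            print("%s: Not stopping cluster - node is unreachable" % n)

    # Phase 2: stop cman/corosync only on the nodes that are still reachable.
    for n in nodes:
        if n not in failed:
            stop_on_node(n, "cman", unreachable)

    # The command still reports overall failure with a non-zero exit code.
    return 1 if failed else 0

# Example mirroring the test above: pcsd is unreachable on rh68-node3.
exit_code = cluster_stop_all(["rh68-node1", "rh68-node2", "rh68-node3"],
                             unreachable={"rh68-node3"})

The behavioural change is that a failure in the pacemaker phase no longer prevents corosync/cman from being stopped on the nodes that are still reachable.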
Before Fix:

[vm-rhel67-1 ~] $ rpm -q pcs
pcs-0.9.154-1.el6.x86_64
[vm-rhel67-1 ~] $ service pcsd stop
Stopping pcsd: [ OK ]
[vm-rhel67-1 ~] $ pcs status pcsd
vm-rhel67-1: Offline
vm-rhel67-2: Online
vm-rhel67-3: Online
[vm-rhel67-1 ~] $ pcs cluster stop --all
vm-rhel67-1: Unable to connect to vm-rhel67-1 ([Errno 111] Connection refused)
vm-rhel67-3: Stopping Cluster (pacemaker)...
vm-rhel67-2: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
vm-rhel67-1: Unable to connect to vm-rhel67-1 ([Errno 111] Connection refused)
[vm-rhel67-1 ~] $ echo $?
1
[vm-rhel67-1 ~] $ pcs status nodes both
Corosync Nodes:
 Online: vm-rhel67-1 vm-rhel67-2
 Offline: vm-rhel67-3
Pacemaker Nodes:
 Online: vm-rhel67-1
 Standby: vm-rhel67-2
 Maintenance:
 Offline: vm-rhel67-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

After Fix:

[vm-rhel67-1 ~] $ rpm -q pcs
pcs-0.9.155-1.el6.x86_64
[vm-rhel67-1 ~] $ service pcsd stop
Stopping pcsd: [ OK ]
[vm-rhel67-1 ~] $ pcs status pcsd
vm-rhel67-1: Offline
vm-rhel67-3: Online
vm-rhel67-2: Online
[vm-rhel67-1 ~] $ pcs cluster stop --all
vm-rhel67-1: Unable to connect to vm-rhel67-1 ([Errno 111] Connection refused)
vm-rhel67-2: Stopping Cluster (pacemaker)...
vm-rhel67-3: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
vm-rhel67-1: Unable to connect to vm-rhel67-1 ([Errno 111] Connection refused)
vm-rhel67-1: Not stopping cluster - node is unreachable
vm-rhel67-3: Stopping Cluster (cman)...
vm-rhel67-2: Stopping Cluster (cman)...
Error: unable to stop all nodes
[vm-rhel67-1 ~] $ echo $?
1
[vm-rhel67-1 ~] $ pcs status nodes both
Corosync Nodes:
 Online: vm-rhel67-1
 Offline: vm-rhel67-2 vm-rhel67-3
Pacemaker Nodes:
 Online: vm-rhel67-1
 Standby:
 Maintenance:
 Offline: vm-rhel67-2 vm-rhel67-3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0707.html