Bug 1225423

Summary: pcs should allow to remove a dead node from a cluster
Product: Red Hat Enterprise Linux 7
Component: pcs
Version: 7.2
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Tomas Jelinek <tojeline>
Assignee: Tomas Jelinek <tojeline>
QA Contact: cluster-qe <cluster-qe>
Docs Contact:
CC: amaumene, cfeist, cluster-maint, idevat, j_t_williams, michele, mlisik, rsteiger, sankarshan, tojeline, vcojot
Keywords: ZStream
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pcs-0.9.152-5.el7
Doc Type: Bug Fix
Doc Text:
Cause: The user wants to remove a powered-off node from a cluster.
Consequence: Pcs does not remove the node because it cannot connect to it to remove the cluster configuration files.
Fix: Skip removing the configuration files from the node when the user specifies the --force flag.
Result: It is possible to remove a powered-off node from the cluster.
Story Points: ---
Clone Of:
: 1382633 (view as bug list)
Environment:
Last Closed: 2016-11-03 20:54:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1305654, 1382633
Attachments:
proposed fix (Flags: none)
proposed fix web UI (Flags: none)

Description Tomas Jelinek 2015-05-27 10:41:36 UTC
Description of problem:
It is not possible to remove a node from a cluster if pcsd is not running on the node or the node itself is not running.

Version-Release number of selected component (if applicable):
pcs-0.9.137-15.el7

How reproducible:
always

Steps to Reproduce:
1. create a cluster
2. shutdown a node
3. try to remove the node from the cluster using 'pcs cluster node remove <nodename>'
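
For reference, a concrete reproduction might look like the following sketch (the cluster name and node names are hypothetical; the setup syntax is the pcs 0.9.x one shipped with RHEL 7):

# authenticate and create a three-node cluster (hypothetical names)
pcs cluster auth node1 node2 node3 -u hacluster -p <password>
pcs cluster setup --name testcluster node1 node2 node3 --start
# power off node3 (e.g. from its console), then on a surviving node:
pcs cluster node remove node3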

Actual results:
Error: pcsd is not running on <nodename>

Expected results:
The node is removed from the cluster. We probably want to warn the user first and allow removal of the node only when the --force switch is used.

Additional info:
workaround:
1. run 'pcs cluster localnode remove <nodename>' on all remaining nodes
2. run 'pcs cluster reload corosync' on one node
3. run 'crm_node -R <nodename> --force' on one node
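
Put together, for a hypothetical dead node node3 in a cluster whose surviving members are node1 and node2, the workaround amounts to:

# step 1: on every remaining node (node1 and node2), drop the dead node from the local cluster configuration
pcs cluster localnode remove node3
# step 2: on any one remaining node, reload the corosync configuration
pcs cluster reload corosync
# step 3: on any one remaining node, remove the node from pacemaker's node list
crm_node -R node3 --force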

Comment 3 Alexandre Maumené 2016-01-13 10:47:19 UTC
Hi,

I also hit this bug; thanks for the workaround.

Regards,

Comment 4 Tomas Jelinek 2016-07-19 15:12:18 UTC
Created attachment 1181676 [details]
proposed fix

Test:

> Let's have a three-node cluster
[root@rh72-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh72-node1 rh72-node2 rh72-node3
 Offline:
Pacemaker Nodes:
 Online: rh72-node1 rh72-node2 rh72-node3
 Standby:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

> Power off one node ...
[root@rh72-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh72-node1 rh72-node2
 Offline: rh72-node3
Pacemaker Nodes:
 Online: rh72-node1 rh72-node2
 Standby:
 Maintenance:
 Offline: rh72-node3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

> ... and remove it from the cluster
[root@rh72-node1:~]# pcs cluster node remove rh72-node3
Error: pcsd is not running on rh72-node3, use --force to override
[root@rh72-node1:~]# pcs cluster node remove rh72-node3 --force
rh72-node3: Unable to connect to rh72-node3 ([Errno 113] No route to host)
rh72-node3: Unable to connect to rh72-node3 ([Errno 113] No route to host)
Warning: unable to destroy cluster
rh72-node3: Unable to connect to rh72-node3 ([Errno 113] No route to host)
rh72-node2: Corosync updated
rh72-node1: Corosync updated
[root@rh72-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh72-node1 rh72-node2
 Offline:
Pacemaker Nodes:
 Online: rh72-node1 rh72-node2
 Standby:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

Comment 5 Tomas Jelinek 2016-07-20 08:03:02 UTC
Created attachment 1181958 [details]
proposed fix web UI

fix for web UI

Comment 6 Ivan Devat 2016-07-28 13:39:26 UTC
Setup:
[vm-rhel72-1 ~] $ pcs status nodes both
Corosync Nodes:
 Online: vm-rhel72-1 vm-rhel72-2 vm-rhel72-3
 Offline:
Pacemaker Nodes:
 Online: vm-rhel72-1 vm-rhel72-2 vm-rhel72-3
 Standby:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

> Power off one node ...
[vm-rhel72-1 ~] $ pcs status nodes both
Corosync Nodes:
 Online: vm-rhel72-1 vm-rhel72-3
 Offline: vm-rhel72-2
Pacemaker Nodes:
 Online: vm-rhel72-1 vm-rhel72-3
 Standby:
 Maintenance:
 Offline: vm-rhel72-2
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:


Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs           
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs cluster node remove vm-rhel72-2
Error: pcsd is not running on vm-rhel72-2

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
[vm-rhel72-1 ~] $ pcs cluster node remove vm-rhel72-2
Error: pcsd is not running on vm-rhel72-2, use --force to override
[vm-rhel72-1 ~] $ pcs cluster node remove vm-rhel72-2 --force
vm-rhel72-2: Unable to connect to vm-rhel72-2 ([Errno 111] Connection refused)
vm-rhel72-2: Unable to connect to vm-rhel72-2 ([Errno 111] Connection refused)
Warning: unable to destroy cluster
vm-rhel72-2: Unable to connect to vm-rhel72-2 ([Errno 111] Connection refused)
vm-rhel72-1: Corosync updated
vm-rhel72-3: Corosync updated
[vm-rhel72-1 ~] $ pcs status nodes both 
Corosync Nodes:
 Online: vm-rhel72-1 vm-rhel72-3
 Offline:
Pacemaker Nodes:
 Online: vm-rhel72-1 vm-rhel72-3
 Standby:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

Comment 10 Tomas Jelinek 2016-10-18 14:46:09 UTC
*** Bug 1376209 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2016-11-03 20:54:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2596.html