Bug 1225423

Summary: pcs should allow to remove a dead node from a cluster
Product: Red Hat Enterprise Linux 7
Component: pcs
Version: 7.2
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Tomas Jelinek <tojeline>
Assignee: Tomas Jelinek <tojeline>
QA Contact: cluster-qe <cluster-qe>
Docs Contact:
CC: amaumene, cfeist, cluster-maint, idevat, j_t_williams, michele, mlisik, rsteiger, sankarshan, tojeline, vcojot
Keywords: ZStream
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: pcs-0.9.152-5.el7
Doc Type: Bug Fix
Doc Text:
Cause: The user wants to remove a powered-off node from a cluster.
Consequence: Pcs does not remove the node because it cannot connect to it to remove the cluster configuration files.
Fix: Skip removing the configuration files from the node when the user specifies the --force flag.
Result: It is possible to remove a powered-off node from the cluster.
Story Points: ---
Clone Of:
: 1382633 (view as bug list)
Environment:
Last Closed: 2016-11-03 20:54:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1305654, 1382633
Attachments:
proposed fix (Flags: none)
proposed fix web UI (Flags: none)

Description Tomas Jelinek 2015-05-27 10:41:36 UTC
Description of problem:
It is not possible to remove a node from a cluster if pcsd is not running on the node or the node itself is not running.

Version-Release number of selected component (if applicable):
pcs-0.9.137-15.el7

How reproducible:
always

Steps to Reproduce:
1. create a cluster
2. shutdown a node
3. try to remove the node from the cluster using 'pcs cluster node remove <nodename>'
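
For reference, a concrete reproduction might look like the following sketch (the cluster name and node names are hypothetical; the setup syntax is the pcs 0.9.x one shipped with RHEL 7):

# authenticate and create a three-node cluster (hypothetical names)
pcs cluster auth node1 node2 node3 -u hacluster -p <password>
pcs cluster setup --name testcluster node1 node2 node3 --start
# power off node3 (e.g. from its console), then on a surviving node:
pcs cluster node remove node3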

Actual results:
Error: pcsd is not running on <nodename>

Expected results:
The node is removed from the cluster. We probably want to warn the user first and allow removal of the node only when the --force switch is used.

Additional info:
workaround:
1. run 'pcs cluster localnode remove <nodename>' on all remaining nodes
2. run 'pcs cluster reload corosync' on one node
3. run 'crm_node -R <nodename> --force' on one node
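
Put together, for a hypothetical dead node node3 in a cluster whose surviving members are node1 and node2, the workaround amounts to:

# step 1: on every remaining node (node1 and node2), drop the dead node from the local cluster configuration
pcs cluster localnode remove node3
# step 2: on any one remaining node, reload the corosync configuration
pcs cluster reload corosync
# step 3: on any one remaining node, remove the node from pacemaker's node list
crm_node -R node3 --force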

Comment 3 Alexandre Maumené 2016-01-13 10:47:19 UTC
Hi,

I also hit this bug; thanks for the workaround.

Regards,

Comment 4 Tomas Jelinek 2016-07-19 15:12:18 UTC
Created attachment 1181676 [details]
proposed fix

Test:

> Let's have a three-node cluster
[root@rh72-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh72-node1 rh72-node2 rh72-node3
 Offline:
Pacemaker Nodes:
 Online: rh72-node1 rh72-node2 rh72-node3
 Standby:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

> Power off one node ...
[root@rh72-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh72-node1 rh72-node2
 Offline: rh72-node3
Pacemaker Nodes:
 Online: rh72-node1 rh72-node2
 Standby:
 Maintenance:
 Offline: rh72-node3
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

> ... and remove it from the cluster
[root@rh72-node1:~]# pcs cluster node remove rh72-node3
Error: pcsd is not running on rh72-node3, use --force to override
[root@rh72-node1:~]# pcs cluster node remove rh72-node3 --force
rh72-node3: Unable to connect to rh72-node3 ([Errno 113] No route to host)
rh72-node3: Unable to connect to rh72-node3 ([Errno 113] No route to host)
Warning: unable to destroy cluster
rh72-node3: Unable to connect to rh72-node3 ([Errno 113] No route to host)
rh72-node2: Corosync updated
rh72-node1: Corosync updated
[root@rh72-node1:~]# pcs status nodes both
Corosync Nodes:
 Online: rh72-node1 rh72-node2
 Offline:
Pacemaker Nodes:
 Online: rh72-node1 rh72-node2
 Standby:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

Comment 5 Tomas Jelinek 2016-07-20 08:03:02 UTC
Created attachment 1181958 [details]
proposed fix web UI

fix for web UI

Comment 6 Ivan Devat 2016-07-28 13:39:26 UTC
Setup:
[vm-rhel72-1 ~] $ pcs status nodes both
Corosync Nodes:
 Online: vm-rhel72-1 vm-rhel72-2 vm-rhel72-3
 Offline:
Pacemaker Nodes:
 Online: vm-rhel72-1 vm-rhel72-2 vm-rhel72-3
 Standby:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

> Power off one node ...
[vm-rhel72-1 ~] $ pcs status nodes both
Corosync Nodes:
 Online: vm-rhel72-1 vm-rhel72-3
 Offline: vm-rhel72-2
Pacemaker Nodes:
 Online: vm-rhel72-1 vm-rhel72-3
 Standby:
 Maintenance:
 Offline: vm-rhel72-2
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:


Before Fix:
[vm-rhel72-1 ~] $ rpm -q pcs           
pcs-0.9.152-4.el7.x86_64
[vm-rhel72-1 ~] $ pcs cluster node remove vm-rhel72-2
Error: pcsd is not running on vm-rhel72-2

After Fix:
[vm-rhel72-1 ~] $ rpm -q pcs
pcs-0.9.152-5.el7.x86_64
[vm-rhel72-1 ~] $ pcs cluster node remove vm-rhel72-2
Error: pcsd is not running on vm-rhel72-2, use --force to override
[vm-rhel72-1 ~] $ pcs cluster node remove vm-rhel72-2 --force
vm-rhel72-2: Unable to connect to vm-rhel72-2 ([Errno 111] Connection refused)
vm-rhel72-2: Unable to connect to vm-rhel72-2 ([Errno 111] Connection refused)
Warning: unable to destroy cluster
vm-rhel72-2: Unable to connect to vm-rhel72-2 ([Errno 111] Connection refused)
vm-rhel72-1: Corosync updated
vm-rhel72-3: Corosync updated
[vm-rhel72-1 ~] $ pcs status nodes both 
Corosync Nodes:
 Online: vm-rhel72-1 vm-rhel72-3
 Offline:
Pacemaker Nodes:
 Online: vm-rhel72-1 vm-rhel72-3
 Standby:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

Comment 10 Tomas Jelinek 2016-10-18 14:46:09 UTC
*** Bug 1376209 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2016-11-03 20:54:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2596.html