Red Hat Bugzilla – Bug 339471
Impossible to remove a dead node from cman
Last modified: 2009-04-16 19:00:05 EDT
Description of problem:
A dead node makes it impossible to make changes to the cluster config
Version-Release number of selected component (if applicable):
Have not attempted to reproduce. This is a live cluster.
Steps to Reproduce:
1. Removed a node from the cluster using conga
2. ccs_tool lsnode lists only the remaining nodes
3. cman_tool nodes lists the remaining nodes and the removed node
4. ccs_tool update ... fails because it can't contact the removed node
There should be a way to remove a dead node from cman. For example, you have a
hardware failure. It takes time to replace that node. During that time you
cannot make any updates to the cluster configuration.
A workaround was to reboot each remaining node of the cluster individually. This
cleared up cman_tool nodes. When all nodes were rebooted, changes to the
configuration were possible again.
Should I also create a bugzilla for the fact that conga/luci did not do a
cman_tool leave remove? Or is that a known bug? (luci version was 0.9.2-6.el5)
It seems to me that a failed node should not prevent ccs_tool from
Could this be an issue with Conga? I've never seen this behavior before. Added
one of the Conga developers to the BZ CC list to perhaps help answer this question.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
IIRC there was a bug in ccsd prior to 5.1 (I think the fix went out as a
z-stream update) that caused ccsd to report failure when you attempted to
propagate a new configuration and there was at least one node that was not a
member or estranged. It also happened if you were using qdisk (it'd try to send
the new conf to node 0). See bug #244867 for more info. I just checked the
sources for 2.0.64-1.0.1.el5 and the fix for that bug is not in there. Upgrading
to the latest cman package ought to fix this.
I'm unable to recreate this bug by using command-line tools. I guess I am unsure
what step #1 (aka removing a node from a running cluster) actually means. It
seems that removing the node means actually removing it from the cluster.conf
file. This seems to be the case since it is stated that 'ccs_tool lsnode' lists
only the remaining nodes, and 'ccs_tool lsnode' parses the cluster.conf file
directly. So the node must have been removed from the cluster.conf file. It's unclear
whether the node in question is still a member of the cluster or not, so I tested
Running 'ccs_tool update /etc/cluster/cluster.conf' worked for me every time.
Specifically, it worked both when the node was removed from the cluster.conf but
was still in the cluster, and when the node was removed from the cluster.conf
and had also left the running cluster.
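For anyone trying to spot the mismatch described in the original report (a node gone from cluster.conf but still listed by cman), here is a rough sketch of the comparison. It assumes the usual cluster.conf layout with `clusternode` elements carrying a `name` attribute. The function names, the sample configuration, and the node names are made up for illustration, and the runtime member list is passed in by hand rather than parsed from 'cman_tool nodes' output.

```python
import xml.etree.ElementTree as ET


def config_nodes(cluster_conf_xml):
    """Return the set of node names declared in a cluster.conf document."""
    root = ET.fromstring(cluster_conf_xml)
    return {node.get("name") for node in root.iter("clusternode")}


def stale_members(cluster_conf_xml, runtime_members):
    """Return names still reported at runtime but absent from cluster.conf."""
    return sorted(set(runtime_members) - config_nodes(cluster_conf_xml))


# Hypothetical configuration after node3 was removed from cluster.conf.
SAMPLE_CONF = """<cluster name="example" config_version="2">
  <clusternodes>
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
  </clusternodes>
</cluster>"""

if __name__ == "__main__":
    # Member names as an operator might copy them from 'cman_tool nodes'.
    print(stale_members(SAMPLE_CONF, ["node1", "node2", "node3"]))
```

A non-empty result corresponds to the situation in the report: the configuration no longer mentions the node, but the running cluster still counts it.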
Ah. Comment #4 seems to explain the problem. I was testing on a RHEL5.1 machine,
which has the fix that Ryan referenced. I think the problem reported here has
been fixed in 5.1.
Closing as dup of #244867, which is fixed in 5.1.
*** This bug has been marked as a duplicate of 244867 ***