Description of problem:
Sometimes after one of the nodes is fenced or manually rebooted it can't rejoin the cluster. All I get is:

  cman: cman_tool: Node is already active failed when 'service cman start'

Version-Release number of selected component (if applicable):
RH 4U4
cmanic-7.6.0-5.rhel4
magma-devel-1.0.6-0
cman-devel-1.0.11-0
magma-1.0.6-0
ccs-1.0.7-0
cman-1.0.11-0
magma-plugins-1.0.9-0
cman-kernheaders-2.6.9-45.8
cman-kernel-smp-2.6.9-45.8
rgmanager-1.9.54-3.228823test
ccs-devel-1.0.7-0
cman-kernel-2.6.9-45.8

How reproducible:
Hard to say - sometimes a reboot of a node is enough.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
cluster.conf:

<?xml version="1.0"?>
<cluster config_version="62" name="PROcluster">
  <fence_daemon post_fail_delay="0" post_join_delay="25"/>
  <clusternodes>
    <clusternode name="node1" votes="1">
      <fence>
        <method name="1">
          <device name="node1-ilo"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" votes="1">
      <fence>
        <method name="1">
          <device name="node2-ilo"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ilo" hostname="node1-ilo" login="fence" name="node1-ilo" passwd="PASS"/>
    <fencedevice agent="fence_ilo" hostname="node2-ilo" login="fence" name="node2-ilo" passwd="PASS"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="cluster-fail" ordered="0" restricted="1">
        <failoverdomainnode name="node2" priority="1"/>
        <failoverdomainnode name="node1" priority="1"/>
      </failoverdomain>
      <failoverdomain name="cluster1-fail" restricted="1">
        <failoverdomainnode name="node1" priority="1"/>
      </failoverdomain>
      <failoverdomain name="cluster2-fail" restricted="1">
        <failoverdomainnode name="node2" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      [...]
    </resources>
    <service [...]
    </service>
  </rm>
</cluster>
some logs:

Jun  4 13:56:52 node2 ccsd[31968]: Starting ccsd 1.0.7:
Jun  4 13:56:53 node2 ccsd[31968]: Built: Nov 30 2006 17:17:18
Jun  4 13:56:53 node2 ccsd[31968]: Copyright (C) Red Hat, Inc. 2004 All rights reserved.
Jun  4 13:56:53 node2 ccsd: succeeded
Jun  4 13:56:53 node2 kernel: CMAN 2.6.9-45.8 (built Jan 17 2007 16:47:20) installed
Jun  4 13:56:53 node2 kernel: DLM 2.6.9-44.3 (built Jan 17 2007 16:48:30) installed
Jun  4 13:56:53 node2 ccsd[31968]: cluster.conf (cluster name = PROcluster, version = 62) found.
Jun  4 13:56:53 node2 ccsd[31968]: Remote copy of cluster.conf is from quorate node.
Jun  4 13:56:53 node2 ccsd[31968]:  Local version # : 62
Jun  4 13:56:53 node2 ccsd[31968]:  Remote version #: 62
Jun  4 13:56:53 node2 ccsd[31968]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.1
Jun  4 13:56:53 node2 ccsd[31968]: Initial status:: Inquorate
Jun  4 13:58:53 node2 cman: Timed-out waiting for cluster failed
Jun  4 14:00:53 node2 fenced: startup failed
Jun  4 14:00:53 node2 rgmanager: clurgmgrd startup succeeded
Jun  4 14:00:53 node2 clurgmgrd[3026]: <notice> Waiting for quorum to form
The main question is: how do I force the second (first?) node to join the cluster? The only solution that works for sure is to reboot both nodes simultaneously so we get a clean situation - but that is unacceptable given that an HA cluster is supposed to run without interruptions.
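For reference, the usual manual recovery sequence on a node stuck joining looks roughly like the sketch below. This is a hedged outline, not a verified fix for this bug; the command names are from the RHEL4 cluster suite (`service cman`, `fence_tool`, `cman_tool`). The script only prints each step so it can be reviewed before being run by hand on the stuck node:

```shell
#!/bin/sh
# Sketch of a manual recovery sequence for a node stuck in "Joining".
# Steps are printed, not executed, so they can be reviewed first.
run() { echo "would run: $*"; }

run service rgmanager stop   # stop services layered on top of cman first
run fence_tool leave         # leave the fence domain, if it was joined
run service cman stop        # tear down the half-formed membership
run service cman start       # then attempt a clean join
run cman_tool join -w        # or join directly; -w waits for membership
```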
It's hard for me to make sense of this report so I'll make a few suggestions instead.

1) Check that iLO is doing a HARD reboot of the machine when it fences it. If the system has been properly rebooted then there is no reason why cman would still be running, which is the error you have.

2) Check the state of the system before and after the event. Is cman running or loaded somewhere else? Does the remaining node correctly spot that the node goes down?

3) Check your init scripts. "Node already active" says that cman is already running, which should not happen on a properly configured system if the startup script is run at boot time.

Just check EVERYTHING - this is very likely a configuration error somewhere, but without more (and matching) information and logs it's impossible to judge from here.
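Suggestion 3 can be checked mechanically. A minimal sketch, assuming the RHEL4 cman kernel module's /proc interface: when cman is active it exposes a /proc/cluster directory, and a second start attempt then fails with "Node is already active":

```shell
#!/bin/sh
# Pre-start sanity check (assumes the RHEL4 cman /proc/cluster interface).
# If cman is already active, starting the init script again will fail
# with "Node is already active"; otherwise a start attempt is safe.
if [ -e /proc/cluster ]; then
    msg="cman already active: 'service cman start' would fail with 'Node is already active'"
else
    msg="cman not loaded: safe to run 'service cman start'"
fi
echo "$msg"
```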
(In reply to comment #3)
> It's hard for me to make sense of this report so I'll make a few suggestions
> instead.

Sorry - if you need more specific info, please ask and I'll try to be clearer.

> 1) Check that iLO is doing a HARD reboot of the machine when it fences it. If
> the system has been properly rebooted then there is no reason why cman would
> still be running which is the error you have.

It was power cycled. cman starts but fails to join the cluster - the status is "Joining". When I then tried to run cman_tool join manually I got "Node is already active failed when 'service cman start'" - of course that was a silly thing to do, since I could see it was already running and trying to join the cluster.

> 2) Check the state of the system before and after the event. Is cman running or
> loaded somewhere else? does the remaining node spot that the node goes down
> correctly.

Yes, the first node saw that the node left the cluster (clustat shows node2 - offline).

> 3) Check your init scripts. "Node already active" says that cman is already
> running, which should not happen on a properly configured system if the startup
> script is run at boot time.

As I wrote above, I did a silly thing: cman had been started and was trying to join the cluster when I manually ran cman_tool join.

> Just check EVERYTHING, this is very likely a configuration error somewhere but
> without more (and matching) information and logs it's impossible to judge from here.

So I'll try to describe the situation more clearly: node2 had problems with high load and stuck threads. We rebooted it manually using iLO (power cycle).
After it started it couldn't rejoin the cluster:

Jun  4 13:58:53 node2 cman: Timed-out waiting for cluster failed

Node1 is working properly:

Node  Votes Exp Sts  Name
   1    1    1   M   node1
   2    1    1   X   node2

Protocol version: 5.0.1
Config version: 62
Cluster name: PROcluster
Cluster ID: 55067
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 1
Total_votes: 1
Quorum: 1
Active subsystems: 13
Node name: node1
Node ID: 1
Node addresses: x.y.z.129

On node2:

Node  Votes Exp Sts  Name

Protocol version: 5.0.1
Config version: 62
Cluster name: PROcluster
Cluster ID: 55067
Cluster Member: No
Membership state: Joining

And since yesterday nothing has changed... I had to reboot node2 a few times for other maintenance tasks, but it still can't join the cluster.
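The key field in the two status dumps is the membership state. As an illustration, the snippet below extracts it from the node2 sample output quoted in this report (a here-document stands in for the live command; on a real node you would pipe `cman_tool status` into the same awk filter):

```shell
#!/bin/sh
# Extract the membership state from 'cman_tool status'-style output.
# The here-document reproduces the node2 output from this report;
# on a live node use:  cman_tool status | awk -F': ' '/Membership state/ {print $2}'
state=$(awk -F': ' '/Membership state/ {print $2}' <<'EOF'
Protocol version: 5.0.1
Config version: 62
Cluster name: PROcluster
Cluster ID: 55067
Cluster Member: No
Membership state: Joining
EOF
)
echo "node2 membership state: $state"
if [ "$state" = "Joining" ]; then
    echo "node2 is stuck joining - matches the symptom in this report"
fi
```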
We have had this problem a few times before, and the only solution I have found is to reboot both nodes. I would like to avoid that now - so please advise how to get node2 to join successfully.
Ahh! My guess is that you've hit bug #387081. You'll need to upgrade at least to the latest cman-kernel packages.
In addition to that you might also want to look at bz# 444751.
I'll close this bug as a duplicate of that last bug. If you upgrade to the latest packages (ideally 4.7 or latest 4.6z) and it recurs then feel free to reopen it. If you can't get to the very latest versions (to be honest I'm not sure when they get released!) see the workaround program in the comments of the last BZ I mentioned. *** This bug has been marked as a duplicate of 444751 ***