Description of problem:
Sometimes after one of the nodes is fenced or manually rebooted it can't rejoin the cluster. All I get is:

  cman: cman_tool: Node is already active failed when 'service cman start'

Version-Release number of selected component (if applicable):
RH 4U4
cmanic-7.6.0-5.rhel4
magma-devel-1.0.6-0
cman-devel-1.0.11-0
magma-1.0.6-0
ccs-1.0.7-0
cman-1.0.11-0
magma-plugins-1.0.9-0
cman-kernheaders-2.6.9-45.8
cman-kernel-smp-2.6.9-45.8
rgmanager-1.9.54-3.228823test
ccs-devel-1.0.7-0
cman-kernel-2.6.9-45.8

How reproducible:
Hard to say - sometimes a reboot of a node is enough.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
cluster.conf:

<?xml version="1.0"?>
<cluster config_version="62" name="PROcluster">
  <fence_daemon post_fail_delay="0" post_join_delay="25"/>
  <clusternodes>
    <clusternode name="node1" votes="1">
      <fence>
        <method name="1">
          <device name="node1-ilo"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" votes="1">
      <fence>
        <method name="1">
          <device name="node2-ilo"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ilo" hostname="node1-ilo" login="fence" name="node1-ilo" passwd="PASS"/>
    <fencedevice agent="fence_ilo" hostname="node2-ilo" login="fence" name="node2-ilo" passwd="PASS"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="cluster-fail" ordered="0" restricted="1">
        <failoverdomainnode name="node2" priority="1"/>
        <failoverdomainnode name="node1" priority="1"/>
      </failoverdomain>
      <failoverdomain name="cluster1-fail" restricted="1">
        <failoverdomainnode name="node1" priority="1"/>
      </failoverdomain>
      <failoverdomain name="cluster2-fail" restricted="1">
        <failoverdomainnode name="node2" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      [...]
    </resources>
    <service [...]
    </service>
  </rm>
</cluster>
some logs:

Jun  4 13:56:52 node2 ccsd[31968]: Starting ccsd 1.0.7:
Jun  4 13:56:53 node2 ccsd[31968]: Built: Nov 30 2006 17:17:18
Jun  4 13:56:53 node2 ccsd[31968]: Copyright (C) Red Hat, Inc. 2004 All rights reserved.
Jun  4 13:56:53 node2 ccsd: succeeded
Jun  4 13:56:53 node2 kernel: CMAN 2.6.9-45.8 (built Jan 17 2007 16:47:20) installed
Jun  4 13:56:53 node2 kernel: DLM 2.6.9-44.3 (built Jan 17 2007 16:48:30) installed
Jun  4 13:56:53 node2 ccsd[31968]: cluster.conf (cluster name = PROcluster, version = 62) found.
Jun  4 13:56:53 node2 ccsd[31968]: Remote copy of cluster.conf is from quorate node.
Jun  4 13:56:53 node2 ccsd[31968]:  Local version # : 62
Jun  4 13:56:53 node2 ccsd[31968]:  Remote version #: 62
Jun  4 13:56:53 node2 ccsd[31968]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.1
Jun  4 13:56:53 node2 ccsd[31968]: Initial status:: Inquorate
Jun  4 13:58:53 node2 cman: Timed-out waiting for cluster failed
Jun  4 14:00:53 node2 fenced: startup failed
Jun  4 14:00:53 node2 rgmanager: clurgmgrd startup succeeded
Jun  4 14:00:53 node2 clurgmgrd[3026]: <notice> Waiting for quorum to form
The main question is: how do I force the second (first?) node to join the cluster? The only solution that works for sure is to reboot both nodes simultaneously so we get a clean situation - but that is unacceptable given that an HA cluster is supposed to run without interruptions.
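For reference, the usual manual recovery sequence on a node stuck joining looks roughly like the sketch below. This is a hedged outline, not a verified fix for this bug; the command names are from the RHEL4 cluster suite (`service cman`, `fence_tool`, `cman_tool`). The script only prints each step so it can be reviewed before being run by hand on the stuck node:

```shell
#!/bin/sh
# Sketch of a manual recovery sequence for a node stuck in "Joining".
# Steps are printed, not executed, so they can be reviewed first.
run() { echo "would run: $*"; }

run service rgmanager stop   # stop services layered on top of cman first
run fence_tool leave         # leave the fence domain, if it was joined
run service cman stop        # tear down the half-formed membership
run service cman start       # then attempt a clean join
run cman_tool join -w        # or join directly; -w waits for membership
```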
It's hard for me to make sense of this report so I'll make a few suggestions instead.

1) Check that iLO is doing a HARD reboot of the machine when it fences it. If the system has been properly rebooted then there is no reason why cman would still be running, which is the error you have.

2) Check the state of the system before and after the event. Is cman running or loaded somewhere else? Does the remaining node correctly spot that the node goes down?

3) Check your init scripts. "Node already active" says that cman is already running, which should not happen on a properly configured system if the startup script is run at boot time.

Just check EVERYTHING - this is very likely a configuration error somewhere, but without more (and matching) information and logs it's impossible to judge from here.
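Suggestion 3 can be checked mechanically. A minimal sketch, assuming the RHEL4 cman kernel module's /proc interface: when cman is active it exposes a /proc/cluster directory, and a second start attempt then fails with "Node is already active":

```shell
#!/bin/sh
# Pre-start sanity check (assumes the RHEL4 cman /proc/cluster interface).
# If cman is already active, starting the init script again will fail
# with "Node is already active"; otherwise a start attempt is safe.
if [ -e /proc/cluster ]; then
    msg="cman already active: 'service cman start' would fail with 'Node is already active'"
else
    msg="cman not loaded: safe to run 'service cman start'"
fi
echo "$msg"
```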
(In reply to comment #3)
> It's hard for me to make sense of this report so I'll make a few suggestions
> instead.

Sorry - if you need more specific info, please ask and I'll try to be clearer.

> 1) Check that iLO is doing a HARD reboot of the machine when it fences it. If
> the system has been properly rebooted then there is no reason why cman would
> still be running which is the error you have.

It was power cycled. cman starts but fails to join the cluster - the status is "Joining". When I then tried to run cman_tool join manually I got "Node is already active failed when 'service cman start'" - of course that was a silly thing to do, since I could see it was already running and trying to join the cluster.

> 2) Check the state of the system before and after the event. Is cman running or
> loaded somewhere else? does the remaining node spot that the node goes down
> correctly.

Yes, the first node saw that the node left the cluster (clustat shows node2 - offline).

> 3) Check your init scripts. "Node already active" says that cman is already
> running, which should not happen on a properly configured system if the startup
> script is run at boot time.

As I wrote above, I did a silly thing: cman had been started and was trying to join the cluster when I manually ran cman_tool join.

> Just check EVERYTHING, this is very likely a configuration error somewhere but
> without more (and matching) information and logs it's impossible to judge from here.

So I'll try to describe the situation more clearly: node2 had problems with high load and stuck threads. We rebooted it manually using iLO (power cycle).
After it started it couldn't rejoin the cluster:

Jun  4 13:58:53 node2 cman: Timed-out waiting for cluster failed

Node1 is working properly:

Node  Votes Exp Sts  Name
   1    1    1   M   node1
   2    1    1   X   node2

Protocol version: 5.0.1
Config version: 62
Cluster name: PROcluster
Cluster ID: 55067
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 1
Total_votes: 1
Quorum: 1
Active subsystems: 13
Node name: node1
Node ID: 1
Node addresses: x.y.z.129

On node2:

Node  Votes Exp Sts  Name

Protocol version: 5.0.1
Config version: 62
Cluster name: PROcluster
Cluster ID: 55067
Cluster Member: No
Membership state: Joining

And since yesterday nothing has changed... I had to reboot node2 a few times for other maintenance tasks, but it still can't join the cluster.
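The key field in the two status dumps is the membership state. As an illustration, the snippet below extracts it from the node2 sample output quoted in this report (a here-document stands in for the live command; on a real node you would pipe `cman_tool status` into the same awk filter):

```shell
#!/bin/sh
# Extract the membership state from 'cman_tool status'-style output.
# The here-document reproduces the node2 output from this report;
# on a live node use:  cman_tool status | awk -F': ' '/Membership state/ {print $2}'
state=$(awk -F': ' '/Membership state/ {print $2}' <<'EOF'
Protocol version: 5.0.1
Config version: 62
Cluster name: PROcluster
Cluster ID: 55067
Cluster Member: No
Membership state: Joining
EOF
)
echo "node2 membership state: $state"
if [ "$state" = "Joining" ]; then
    echo "node2 is stuck joining - matches the symptom in this report"
fi
```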
We have had this problem a few times before, and the only solution I have found is to reboot both nodes. I would like to avoid that now - so please advise how to get node2 to join successfully.
Ahh! My guess is that you've hit bug #387081. You'll need to upgrade at least to the latest cman-kernel packages.
In addition to that you might also want to look at bz# 444751.
I'll close this bug as a duplicate of that last bug. If you upgrade to the latest packages (ideally 4.7 or latest 4.6z) and it recurs then feel free to reopen it. If you can't get to the very latest versions (to be honest I'm not sure when they get released!) see the workaround program in the comments of the last BZ I mentioned. *** This bug has been marked as a duplicate of 444751 ***