Bug 436542

Summary: Node fails to rejoin cluster after a power reset
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.1
Hardware: All
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: low
Priority: low
Target Milestone: rc
Target Release: ---
Reporter: Afom T. Michael <tmichael>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: GFS Bugs <gfs-bugs>
CC: cluster-maint, edamato
Doc Type: Bug Fix
Last Closed: 2009-02-16 14:12:40 UTC
Attachments:
Description Flags
cluster.conf none

Comment 1 Christine Caulfield 2008-03-10 08:29:20 UTC
Can we have a LOT more information, please?

What do you mean by "fails to rejoin the cluster"? Are there messages on the
"failed" node? Any on the other nodes?

What do 'cman_tool nodes' and 'cman_tool status' say on the "failed" node and
on the other nodes?

Are the "failed" nodes really being powered down rather than just reset?

Does aisexec start?

Are there any fencing issues/messages?

What's in cluster.conf?

And anything else that might seem relevant.


Comment 2 Afom T. Michael 2008-03-11 18:53:51 UTC
By 'failed' I mean that after a power cycle/reset, the node doesn't rejoin the
cluster. After 'ipmitool power reset', sometimes the node just stays down and
does not power up. In other cases, the cman service doesn't start. In the
latter situation, here is what I see:
    [root@ora3 ~]# service cman status
    groupd is stopped
    [root@ora3 ~]# service qdiskd status
    qdiskd (pid 3842) is running...
    [root@ora3 ~]# service openais status
    aisexec is stopped
    [root@ora3 ~]# service ipmi status
    ipmi_msghandler module loaded.
    ipmi_si module loaded.
    ipmi_devintf module loaded.
    /dev/ipmi0 exists.
And in the log, there are repeated messages of "ccsd[3817]: Unable to connect
to cluster infrastructure after XXX seconds."

On the other nodes of the cluster:
[root@ora1 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   0   M      0   2008-03-11 12:32:36  /dev/sdn1
   1   M   4600   2008-03-11 12:30:28  ora1
   2   M   4648   2008-03-11 13:44:08  ora2
   3   X   4636                        ora3
   4   M   4628   2008-03-11 12:32:11  ora4
[root@ora1 ~]# cman_tool status
Version: 6.0.1
Config Version: 31
Cluster Name: ora64xzq
Cluster Id: 26725
Cluster Member: Yes
Cluster Generation: 4652
Membership state: Cluster-Member
Nodes: 3
Expected votes: 4
Total votes: 6
Quorum: 4
Active subsystems: 8
Flags:
Ports Bound: 0 11
Node name: ora1
Node ID: 1
Multicast addresses: 225.0.0.12
Node addresses: 192.168.33.87
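
For triage, the 'cman_tool nodes' table above can be checked mechanically. The following is a minimal sketch, not part of cman itself: it assumes the column layout shown above (Node, Sts, Inc, Joined, Name, with Joined blank for a node that has not joined), and the helper names parse_nodes and non_members are hypothetical. It lists every node whose Sts column is not 'M'.

```python
import re

# Sample text copied from the 'cman_tool nodes' output in this comment.
SAMPLE = """\
Node  Sts   Inc   Joined               Name
   0   M      0   2008-03-11 12:32:36  /dev/sdn1
   1   M   4600   2008-03-11 12:30:28  ora1
   2   M   4648   2008-03-11 13:44:08  ora2
   3   X   4636                        ora3
   4   M   4628   2008-03-11 12:32:11  ora4
"""

# Node id, one-character status, incarnation, optional join time, name.
# Assumption: this mirrors the layout shown above; other versions may differ.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S)\s+(\d+)\s+"
    r"(?:(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+)?(\S+)\s*$"
)


def parse_nodes(text):
    """Parse 'cman_tool nodes' output into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # skips the header line and anything unrecognised
        node_id, status, inc, joined, name = match.groups()
        rows.append({
            "id": int(node_id),
            "status": status,
            "inc": int(inc),
            "joined": joined,  # None when the Joined column is blank
            "name": name,
        })
    return rows


def non_members(rows):
    """Return the names of nodes whose Sts column is not 'M'."""
    return [row["name"] for row in rows if row["status"] != "M"]


if __name__ == "__main__":
    print(non_members(parse_nodes(SAMPLE)))
```

Run against the sample, this reports ora3 only, which matches the single 'X' row in the table.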

Comment 3 Afom T. Michael 2008-03-11 18:54:35 UTC
Created attachment 297659
cluster.conf

Comment 4 Afom T. Michael 2008-03-11 19:07:16 UTC
(In reply to comment #2)
> By 'failed' I mean that after a power cycle/reset, the node doesn't rejoin the
> cluster. After 'ipmitool power reset', sometimes the node just stays down and
> does not power up. In other cases, the cman service doesn't start. In the
> latter situation, here is what I see:
>     [root@ora3 ~]# service cman status
>     groupd is stopped
>     [root@ora3 ~]# service qdiskd status
>     qdiskd (pid 3842) is running...
>     [root@ora3 ~]# service openais status
>     aisexec is stopped
>     [root@ora3 ~]# service ipmi status
>     ipmi_msghandler module loaded.
>     ipmi_si module loaded.
>     ipmi_devintf module loaded.
>     /dev/ipmi0 exists.
> And in the log, there are repeated messages of "ccsd[3817]: Unable to connect
> to cluster infrastructure after XXX seconds."
If the node is rebooted at this point, it rejoins as expected.


Comment 5 Christine Caulfield 2008-03-12 09:58:20 UTC
Are there any relevant messages in syslog on the non-joining node? On other nodes?

Does it work if you start cman manually afterwards?
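
To help answer the syslog question, here is a minimal sketch that pulls the ccsd message quoted in comment 2 out of a set of log lines. The only thing taken from this report is the message text itself; the timestamps, the second counts in the sample lines, and the helper name ccsd_connect_failures are hypothetical and only there for illustration.

```python
import re

# Matches the ccsd message quoted in comment 2; the seconds value varies.
CCSD_MSG = re.compile(
    r"ccsd\[(\d+)\]: Unable to connect to cluster infrastructure "
    r"after (\d+) seconds"
)

# Hypothetical sample lines: only the message text comes from this report.
SAMPLE_LINES = [
    "Mar 11 13:50:01 ora3 ccsd[3817]: Unable to connect to cluster "
    "infrastructure after 30 seconds.",
    "Mar 11 13:50:02 ora3 kernel: some unrelated message",
    "Mar 11 13:50:31 ora3 ccsd[3817]: Unable to connect to cluster "
    "infrastructure after 60 seconds.",
]


def ccsd_connect_failures(lines):
    """Return a (pid, seconds) tuple for each matching log line."""
    hits = []
    for line in lines:
        match = CCSD_MSG.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


if __name__ == "__main__":
    for pid, seconds in ccsd_connect_failures(SAMPLE_LINES):
        print(f"ccsd pid {pid}: no cluster connection after {seconds}s")
```

On a real node the same function could be fed the lines of the system log file instead of SAMPLE_LINES; the path of that file depends on the syslog configuration.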