Bug 436542 - Node fails to rejoin cluster after a power reset
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.1
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assigned To: Christine Caulfield
QA Contact: GFS Bugs
Depends On:
Blocks:
Reported: 2008-03-07 14:46 EST by Afom T. Michael
Modified: 2009-04-16 18:51 EDT

Doc Type: Bug Fix
Last Closed: 2009-02-16 09:12:40 EST

Attachments
cluster.conf (1.59 KB, application/octet-stream)
2008-03-11 14:54 EDT, Afom T. Michael
Comment 1 Christine Caulfield 2008-03-10 04:29:20 EDT
Can we have a LOT more information, please?

What do you mean by "fails to rejoin the cluster"? Are there messages on the
"failed" node? Any on the other nodes?

What do 'cman_tool nodes' and 'cman_tool status' say on the "failed" node and on
the other nodes?

Are the "failed" nodes really being powered down rather than just reset?

Does aisexec start?

Are there any fencing issues or messages?

What's in cluster.conf?

And anything else that might seem relevant.
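
The requests above could be gathered in one pass on each node. The following is only a minimal sketch: it assumes a Python 3 interpreter is available on the node (not the stock interpreter on this release), and the helper name `collect` and the command list are illustrative choices built from commands mentioned in this report.

```python
import subprocess

# Commands taken from this report; adjust the list as needed.
DIAGNOSTIC_COMMANDS = [
    ["cman_tool", "nodes"],
    ["cman_tool", "status"],
    ["service", "cman", "status"],
    ["service", "openais", "status"],
]


def collect(commands, timeout=30):
    """Run each command and return {command string: output or error text}.

    'collect' is a hypothetical helper, not part of any cluster tooling.
    """
    results = {}
    for argv in commands:
        key = " ".join(argv)
        try:
            proc = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # keep error text with the output
                universal_newlines=True,
                timeout=timeout,
            )
            results[key] = proc.stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing binary or a hung command is itself useful information.
            results[key] = "could not run: %s" % exc
    return results


# Usage (run as root on each node):
#   for cmd, out in collect(DIAGNOSTIC_COMMANDS).items():
#       print("### " + cmd)
#       print(out)
```

Recording a failure to run a command, rather than raising, keeps the output complete even on a node where the cluster tools are not responding.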
Comment 2 Afom T. Michael 2008-03-11 14:53:51 EDT
By 'failed' I mean that after a power cycle/reset, the node doesn't rejoin the cluster.
After 'ipmitool power reset', sometimes the node just stays down and does not power
up. In other cases, the cman service doesn't start. In the latter situation, here is
what I see:
    [root@ora3 ~]# service cman status
    groupd is stopped
    [root@ora3 ~]# service qdiskd status
    qdiskd (pid 3842) is running...
    [root@ora3 ~]# service openais status
    aisexec is stopped
    [root@ora3 ~]# service ipmi status
    ipmi_msghandler module loaded.
    ipmi_si module loaded.
    ipmi_devintf module loaded.
    /dev/ipmi0 exists.
And in the log, there are repeated messages of "ccsd[3817]: Unable to connect to
cluster infrastructure after XXX seconds."

On the other nodes of the cluster:
    [root@ora1 ~]# cman_tool nodes
    Node  Sts   Inc   Joined               Name
       0   M      0   2008-03-11 12:32:36  /dev/sdn1
       1   M   4600   2008-03-11 12:30:28  ora1
       2   M   4648   2008-03-11 13:44:08  ora2
       3   X   4636                        ora3
       4   M   4628   2008-03-11 12:32:11  ora4
    [root@ora1 ~]# cman_tool status
    Version: 6.0.1
    Config Version: 31
    Cluster Name: ora64xzq
    Cluster Id: 26725
    Cluster Member: Yes
    Cluster Generation: 4652
    Membership state: Cluster-Member
    Nodes: 3
    Expected votes: 4
    Total votes: 6
    Quorum: 4
    Active subsystems: 8
    Flags:
    Ports Bound: 0 11
    Node name: ora1
    Node ID: 1
    Multicast addresses: 225.0.0.12
    Node addresses: 192.168.33.87
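
To spot the non-joining node without reading the table by eye, the 'cman_tool nodes' output can be parsed. This is a rough sketch in Python: the function name `non_members` is made up for illustration, and it assumes that any Sts value other than 'M' means the node is not currently a member.

```python
# Output copied from the 'cman_tool nodes' listing in this comment.
NODES_OUTPUT = """\
Node  Sts   Inc   Joined               Name
   0   M      0   2008-03-11 12:32:36  /dev/sdn1
   1   M   4600   2008-03-11 12:30:28  ora1
   2   M   4648   2008-03-11 13:44:08  ora2
   3   X   4636                        ora3
   4   M   4628   2008-03-11 12:32:11  ora4
"""


def non_members(text):
    """Return the names of nodes whose Sts column is not 'M'.

    Assumption: 'M' marks a current member and anything else does not.
    """
    names = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        status, name = fields[1], fields[-1]
        if status != "M":
            names.append(name)
    return names


print(non_members(NODES_OUTPUT))
```

The name is read from the last column because the Joined column is empty for a node that has not joined, so the number of fields per row varies.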
Comment 3 Afom T. Michael 2008-03-11 14:54:35 EDT
Created attachment 297659 [details]
cluster.conf
Comment 4 Afom T. Michael 2008-03-11 15:07:16 EDT
(In reply to comment #2)
> By 'failed' I mean that after a power cycle/reset, the node doesn't rejoin the cluster.
> After 'ipmitool power reset', sometimes the node just stays down and does not power
> up. In other cases, the cman service doesn't start. In the latter situation, here is
> what I see:
>     [root@ora3 ~]# service cman status
>     groupd is stopped
>     [root@ora3 ~]# service qdiskd status
>     qdiskd (pid 3842) is running...
>     [root@ora3 ~]# service openais status
>     aisexec is stopped
>     [root@ora3 ~]# service ipmi status
>     ipmi_msghandler module loaded.
>     ipmi_si module loaded.
>     ipmi_devintf module loaded.
>     /dev/ipmi0 exists.
> And in the log, there are repeated messages of "ccsd[3817]: Unable to connect to
> cluster infrastructure after XXX seconds."
If the node is rebooted at this point, it rejoins as expected.

Comment 5 Christine Caulfield 2008-03-12 05:58:20 EDT
Are there any relevant messages in syslog on the non-joining node? On other nodes?

Does it work if you start cman manually afterwards?
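
One way to pull the relevant syslog messages is to filter on the daemon tag. This is a small Python sketch under stated assumptions: the daemon list is a guess at which tags matter here (ccsd, aisexec, groupd and qdiskd appear in this report; openais and fenced are added as assumptions), `cluster_lines` is a hypothetical helper, and /var/log/messages is assumed to be where syslog writes on the node.

```python
import re

# Assumed list of cluster daemon tags; extend it to match the node's setup.
CLUSTER_DAEMONS = ("ccsd", "openais", "aisexec", "groupd", "fenced", "qdiskd")

# Matches a syslog tag such as "ccsd[3817]:" or "groupd:".
TAG = re.compile(r"\b(%s)(\[\d+\])?:" % "|".join(CLUSTER_DAEMONS))


def cluster_lines(lines):
    """Return only the lines tagged with one of the cluster daemons."""
    return [line.rstrip("\n") for line in lines if TAG.search(line)]


# Usage on a node (path is an assumption):
#   with open("/var/log/messages") as log:
#       for line in cluster_lines(log):
#           print(line)
```

Running the same filter on the non-joining node and on one healthy node makes it easier to compare what each side logged around the time of the reset.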
