Bug 213747 - cman has incorrect view of membership
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Platform: All Linux
Priority: medium
Severity: medium
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Whiteboard: TestBlocker
Duplicates: 214300
Reported: 2006-11-02 15:33 EST by Abhijith Das
Modified: 2009-04-16 18:29 EDT
Fixed In Version: RC
Doc Type: Bug Fix
Last Closed: 2007-02-07 21:22:32 EST


Attachments:
- whitetank/test/testclm2 query output from all nodes (607 bytes, application/x-gzip), 2006-11-02 15:44 EST, Abhijith Das
- /var/log/messages from the smoke cluster (559.54 KB, application/x-gzip), 2006-11-02 15:48 EST, Abhijith Das
- /var/log/messages on the smoke cluster nodes (544.07 KB, application/x-gzip), 2006-11-03 17:17 EST, Abhijith Das
- syslog extract from kool (5.07 KB, text/plain), 2006-11-06 08:25 EST, Christine Caulfield
Description Abhijith Das 2006-11-02 15:33:53 EST
Description of problem:
I was running revolver on the Smoke Cluster with no IO, but with 1 GFS filesystem. 

[root@smoke sts-root]# ./gfs/bin/revolver -f var/share/resource_files/smoke.xml
-l $PWD -i 0 -L  NONE -I -t 1

The test failed with:
Could not connect to salem:5008, 111: Connection refused
Could not connect to salem:5008, 111: Connection refused
group recovery (cman_tool nodes) is hung
run_iteration() method failed at /root/sts-root/lib/FI_engine.pm line 21.

Additional info:
I was chatting on IRC with sdake about this and he thinks the cman membership is
different from that of the clm service. Will add the various logs and chat-logs
below.
Comment 1 Abhijith Das 2006-11-02 15:36:24 EST
Chat logs about this BZ:

abhi> riley91, any clue why the current cluster state is funky?
abhi> the nodes don't seem to agree on membership
riley91> i was examining the core that dropped nov 1
riley91> abhi what command are you using to look at membership
abhi> cman_tool nodes
riley91> what node doesn't agree
abhi> winston and kool
abhi> they think salem is not part of the cluster
* riley91 sighs
riley91> well I am not sure what the problem is although totem has the proper
membership
riley91> on those nodes
abhi> hmm
riley91> the last event out of totem is a member joined 10.15.89.57 which is salem
* rkenna_ has quit (Ping timeout: 610 seconds)
riley91> for some reason cman has an improper membership
abhi> group_tool output looks sane
riley91> file a defect
riley91> assign to pjc
riley91> put logs of /var/log/messages
riley91> i'll talk to him tonight about it
riley91> has this happened before ?
riley91> put logs of group_tool -v output
abhi> ok.
riley91> put logs of cman_tool nodes for the offending nodes
abhi> ok
riley91> it looks to me something is wrong in the cman plugin
riley91> cman plugin has defect of some kind
abhi> okay
riley91> the way I know is the clm service that is SA Forum compliant has the
proper membership
riley91> please put this in the logs too
riley91> err in the bz
riley91>  /home/devel/whitetank/test/testclm2 query
riley91> it shows that all 5 nodes are members
riley91> if this returned the same membership as cman i'd think totem was not
delivering the callbacks properly
riley91> however the membership of testclm2 and cman_tool are different
riley91> testclm2 has proper membership
riley91> cman_tool has improper membership
riley91> post a log of this conversation too
riley91> gotta go to lunch be back in 30 mins
abhi> ok.
riley91> thanks alot abhi
abhi> bz will be ready by then :)
riley91> i think patrick just needs to add some debug printouts to cman to help
debug this issue
Comment 2 Abhijith Das 2006-11-02 15:40:05 EST
group_tool -v outputs:
[root@camel /home/devel/whitetank]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010003 none        
[1 2 3 4 5]
dlm              1     clvmd    00010002 none        
[1 2 3 4 5]
dlm              1     soot     00030002 none        
[1 2 3 4 5]
gfs              2     soot     00020002 none        
[1 2 3 4 5]

[root@merit /home/devel/whitetank]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010003 none        
[1 2 3 4 5]
dlm              1     clvmd    00010002 none        
[1 2 3 4 5]
dlm              1     soot     00030002 none        
[1 2 3 4 5]
gfs              2     soot     00020002 none        
[1 2 3 4 5]

[root@winston /home/devel/whitetank]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010003 none        
[1 2 3 4 5]
dlm              1     clvmd    00010002 none        
[1 2 3 4 5]
dlm              1     soot     00030002 none        
[1 2 3 4 5]
gfs              2     soot     00020002 none        
[1 2 3 4 5]

[root@kool /home/devel/whitetank]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010003 none        
[1 2 3 4 5]
dlm              1     clvmd    00010002 none        
[1 2 3 4 5]
dlm              1     soot     00030002 none        
[1 2 3 4 5]
gfs              2     soot     00020002 none        
[1 2 3 4 5]

[root@salem /home/devel/whitetank]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010003 none        
[1 2 3 4 5]
dlm              1     clvmd    00010002 none        
[1 2 3 4 5]
dlm              1     soot     00030002 none        
[1 2 3 4 5]
gfs              2     soot     00020002 none        
[1 2 3 4 5]

cman_tool nodes outputs:
[root@camel /home/devel/whitetank]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  50660   2006-11-02 11:44:44  camel
   2   M  50680   2006-11-02 11:44:49  merit
   3   M  50676   2006-11-02 11:44:44  winston
   4   M  50676   2006-11-02 11:44:44  kool
   5   M  50684   2006-11-02 11:46:53  salem

[root@merit /home/devel/whitetank]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  50680   2006-11-02 11:44:58  camel
   2   M  50652   2006-11-02 11:44:58  merit
   3   M  50680   2006-11-02 11:44:58  winston
   4   M  50680   2006-11-02 11:44:58  kool
   5   M  50684   2006-11-02 11:47:01  salem

[root@winston /home/devel/whitetank]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  50676   2006-11-02 11:45:41  camel
   2   M  50680   2006-11-02 11:45:46  merit
   3   M  50640   2006-11-02 11:42:43  winston
   4   M  50648   2006-11-02 11:42:43  kool
   5   X  50648                        salem

[root@kool /home/devel/whitetank]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  50676   2006-11-02 11:47:10  camel
   2   M  50680   2006-11-02 11:47:15  merit
   3   M  50648   2006-11-02 11:44:12  winston
   4   M  50260   2006-11-02 10:27:32  kool
   5   X  50636                        salem

[root@salem /home/devel/whitetank]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M  50684   2006-11-02 11:51:01  camel
   2   M  50684   2006-11-02 11:51:01  merit
   3   M  50684   2006-11-02 11:51:01  winston
   4   M  50684   2006-11-02 11:51:01  kool
   5   M  50680   2006-11-02 11:51:01  salem
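The disagreement is easy to spot by eye here: winston and kool report salem as dead ('X') while the other nodes report it as a member ('M'). It can also be flagged mechanically; below is a minimal sketch (a hypothetical helper, not part of the cman toolset) that parses `cman_tool nodes` output collected from each host and reports nodes whose status column differs across views:

```python
# Hypothetical helper: compare `cman_tool nodes` output gathered from
# several cluster members and report nodes whose status (Sts) differs.
def parse_cman_nodes(output):
    """Map node name -> status code ('M' member, 'X' dead, 'd' disallowed)."""
    status = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with a numeric node id; skip the header and notes.
        if fields and fields[0].isdigit():
            status[fields[-1]] = fields[1]  # last field is Name, second is Sts
    return status

def find_disagreements(outputs_by_host):
    """outputs_by_host: {reporting host: its `cman_tool nodes` output}."""
    views = {h: parse_cman_nodes(o) for h, o in outputs_by_host.items()}
    all_nodes = set().union(*views.values())
    return {n: {h: v.get(n, '?') for h, v in views.items()}
            for n in all_nodes
            if len({v.get(n, '?') for v in views.values()}) > 1}
```

Fed the five outputs above, this would flag salem (and only salem) as the node the cluster disagrees about.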
Comment 3 Abhijith Das 2006-11-02 15:44:58 EST
Created attachment 140182 [details]
whitetank/test/testclm2 query output from all nodes

Comment 4 Abhijith Das 2006-11-02 15:48:49 EST
Created attachment 140183 [details]
/var/log/messages from the smoke cluster
Comment 5 Christine Caulfield 2006-11-03 05:12:28 EST
cman_tool status is the most useful command to use here. It would show that
salem is a "disallowed" member because it rejoined without using cman_tool join.

These messages appear in syslog:

kool.lab.varlog:Nov  2 11:47:15 kool openais[1940]: [MAIN ] Node salem not joined to cman because it has rejoined an inquorate cluster
winston.lab.varlog:Nov  2 11:45:41 winston openais[1748]: [MAIN ] Node salem not joined to cman because it has rejoined an inquorate cluster

The other two nodes are showing salem as a member because they joined after it
did and don't know of its "rogue" status.

Leaving aside the question of whether the node should have been fenced, I think
the only safe way to deal with this is to mark any new nodes that join the
cluster while it's in this state as "AISONLY" too.
Comment 6 Christine Caulfield 2006-11-03 10:09:19 EST
The actual solution had to be a little more drastic than that. We can't allow a
node to join a cluster with "Disallowed" nodes in it, because we don't know which
portion of the cluster it ought to ally itself with.


Checking in cman_tool/main.c;
/cvs/cluster/cluster/cman/cman_tool/main.c,v  <--  main.c
new revision: 1.51; previous revision: 1.50
done
Checking in daemon/cnxman-private.h;
/cvs/cluster/cluster/cman/daemon/cnxman-private.h,v  <--  cnxman-private.h
new revision: 1.26; previous revision: 1.25
done
Checking in daemon/commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.54; previous revision: 1.53
done
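Read together, comments 5 and 6 amount to a small decision rule for a joining node. The sketch below is an illustrative model of that rule only, not the actual commands.c code (the function and argument names are invented):

```python
def join_verdict(used_cman_tool_join, cluster_quorate, have_disallowed_nodes):
    """Model of the join rule described in comments 5 and 6 (hypothetical)."""
    if have_disallowed_nodes:
        # Comment 6: with "Disallowed" nodes present we cannot tell which
        # portion of the cluster the joiner ought to ally itself with,
        # so the join is refused outright.
        return "refuse"
    if not used_cman_tool_join and not cluster_quorate:
        # Comment 5: a node that silently rejoins an inquorate cluster
        # (an automatic openais rejoin, no `cman_tool join`) is not
        # trusted and stays AIS-only / disallowed.
        return "disallowed"
    return "member"
```

In this model, salem's automatic rejoin corresponds to the "disallowed" branch, and any node arriving after that hits the "refuse" branch.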
Comment 7 Kiersten (Kerri) Anderson 2006-11-03 10:15:29 EST
Problem found during smoke testing of the cluster components.  Devel ACK, patch
already posted for inclusion in beta2 builds.
Comment 8 RHEL Product and Program Management 2006-11-03 10:26:34 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux release.  Product Management has requested further review
of this request by Red Hat Engineering.  This request is not yet committed for
inclusion in release.
Comment 9 Abhijith Das 2006-11-03 17:03:02 EST
After pulling in pjc's patch I ran revolver again and ran into a similar issue.
The revolver iteration killed 2 nodes
================================================================================
Scenario iteration 18.2 started at Fri Nov  3 14:16:48 CST 2006
Sleeping 1 minute(s) to let the I/O get its lock count up...
Scenario: DLM kill Quorum minus one

Those picked to face the revolver... merit camel
Feeling lucky merit? Well do ya? Go'head make my day...
Didn't receive heartbeat for 2 seconds
Feeling lucky camel? Well do ya? Go'head make my day...
Didn't receive heartbeat for 2 seconds

The cluster didn't come up after this attempt and revolver timed out. I logged
in and saw that camel, winston and salem had failed to start cman. The other two
nodes disagree on 'disallowed nodes'. I'll add cluster status info, logs, IRC
chats etc. to the bz.

pjc, the smoke cluster is undisturbed since the crash in case you need to login
and check things out.
Comment 10 Abhijith Das 2006-11-03 17:04:28 EST
Chat with riley and dct on IRC about this

abhi riley91, camel, winston and salem didn't come up
abhi kool and merit have a borked view of the cluster
riley91 why didn't camel winston and salem come up ?
abhi kool says Nov  3 15:27:08 kool ccsd[1923]: Error while processing connect: Connection refused
abhi Nov  3 15:27:13 kool ccsd[1923]: Cluster is not quorate.  Refusing connection.
abhi  
abhi over and over 
* stanko has quit (Quit: Leaving)
riley91 dont run any commands right now
abhi ok
riley91 but did service cman start start on camel ?
riley91 or did that fail in some way
riley91 [root@merit ~]# group_tool -v
riley91 type             level name     id       state node id local_done
riley91 fence            0     default  00010002 FAIL_ALL_STOPPED 3 300030003 -1
riley91 [2 3 4 5]
riley91 dlm              1     clvmd    00020002 FAIL_ALL_STOPPED 3 300030003 -1
riley91 [2 3 4 5]
abhi [root@camel ~]# service cman status
abhi groupd is stopped
riley91 ok this is some kind of cman problem
riley91 cman is killing aisexec
riley91 Nov  3 14:20:45 camel openais[1736]: CMAN: Joined a cluster with
disallowed nodes. must die
riley91 i think patrick's fix may need some more work
abhi riley91, pjc commented in his fix about this
abhi https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213747#c5
riley91 ok well there is an interaction between cman and groupd which is not
considered with the change
riley91 i suggest reopen bug with all log info
riley91 probably keep the smoke cluster where it is for now so he can look at it
abhi what bothers me is how the cluster lost quorum
abhi The test case only knocked out 2 nodes
riley91 there are only 2 nodes
riley91 all of this killing of aisexec is wrong
feist riley91:  ok, you should be all set with the bench's on their new apc
unit.  let me know if you have any issues.
riley91 what needs to happen is the nodes need to be fenced
riley91 imo
abhi Scenario iteration 18.2 started at Fri Nov  3 14:16:48 CST 2006
abhi Sleeping 1 minute(s) to let the I/O get its lock count up...
abhi Scenario: DLM kill Quorum minus one
abhi Those picked to face the revolver... merit camel
abhi Feeling lucky merit? Well do ya? Go'head make my day...
abhi Didn't receive heartbeat for 2 seconds
abhi Feeling lucky camel? Well do ya? Go'head make my day...
abhi Didn't receive heartbeat for 2 seconds
riley91 feist cool thanks
riley91 abhi reopen the bug 
riley91 abhi and put all the logs in which would be interesting
abhi riley91, ok
riley91 or maybe ask kevin before reopening
riley91 but definitely put logs in saying the patch seems to not get the job done
riley91 or there is some other problem
abhi riley91, ok
riley91 perhaps dct can diagnose after testing of ckpt fix
riley91 after i finish ckpt fix building i'll diagnose more
riley91 and comment in bug if i find anything else
dct__ what happened almost certainly is that revolver killed 2 nodes, then
aisexec was momentarily disconnected on one of the live nodes
dct__ the two nodes that weren't disconnected consider the disconnected node to be
failed
dct__ i.e. you now have 3 nodes down: 2 from revolver, 1 from network separation
dct__ when the separated node comes back, the remaining two don't let it back in
the cluster since it wasn't started with cman_tool
dct__ the 2 "normal" nodes don't have quorum, so they're not going to fence anyone
dct__ i.e. it sounds like everything was going about right -- it's largely a
matter of interpreting what happened correctly
riley91 dct groupd is stuck
dct__ how do you mean?
dct__ in the scenario above, groupd won't do anything because there's no quorum
riley91 [root@merit ~]# group_tool -v
riley91 type             level name     id       state node id local_done
riley91 fence            0     default  00010002 FAIL_ALL_STOPPED 3 300030003 -1
riley91 [2 3 4 5]
riley91 dlm              1     clvmd    00020002 FAIL_ALL_STOPPED 3 300030003 -1
riley91 [2 3 4 5]
dct__ yes, it's intentionally not doing anything because there's no quorum
riley91 ok
kanderso what do you mean by not started with cman_tool ?
dct__ I should note that in the output of group_tool probably
abhi dct__, what you just said sounds like what might've happened... but this
network-separation causing a node to appear dead... what can we do about it?
riley91 first we dont know that it was a network seperation
riley91 or do you know dave
dct__ kanderso: aisexec can get separated from the cluster, then it
automatically rejoins -- cman "filters" this auto-rejoined node out of the cluster
dct__ we don't know for sure about the network separation, but it's something
like that
abhi dct__, is there a log of this rejoining we can look for?
dct__ might be avoidable with some work, might not be -- don't know without some
deep analysis
dct__ abhi: yes, pjc made note of those messages to look for in the bz
dct__ and cman_tool status may show you some info about invalid nodes that have
been separated and then merged back in
abhi here's something weird :
abhi Disallowed nodes: winston kool 
abhi [root@merit ~]# 
dct__ yes
abhi Disallowed nodes: merit winston 
abhi [root@kool ~]# 
abhi merit thinks winston and kool are disallowed
abhi kool thinks merit and winston are disallowed
dct__ and winston and kool probably think merit is disallowed
abhi winston is dead
dct__ something more might have gone wrong, not sure
abhi ok. I'll just add stuff to the bz and reopen it
abhi gonna leave the cluster in the same state for pjc to look at later
dct__ but part of the explanation are these "disallowed nodes" where aisexec has
been separated momentarily
riley91 the timeout is 5 seconds
riley91 i dont see how we could have a momentary disconnection
riley91 unless cman blocks somewhere
riley91 perhaps by using gethostbyaddr
dct__ good point
riley91 or sleep
riley91 openais shouldn't have any of that
riley91 before we ship we need to find the source of this problem
riley91 my best guess if there is a token loss during normal operation now it is
caused by blocking in cman or openais executive
* mnc has quit (Remote host closed the connection)
riley91 on the smoke cluster we should set the token loss timer to some very
large value
riley91 and see if we get this kind of failure
riley91 abhi at what time did you have the crash
abhi around 2:20 pm today
abhi CST
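dct's explanation reduces to quorum arithmetic. Assuming one vote per node and expected_votes=5, the usual majority calculation (consistent with the "Quorum: 3" that `cman_tool status` reports for this cluster) gives a quorum of 3, so once revolver kills two nodes and a momentary separation removes a third, the two survivors cannot reach quorum and will neither fence nor recover:

```python
def quorum(expected_votes):
    # Simple majority of the expected votes; matches the "Quorum: 3"
    # reported by cman_tool status for this 5-vote cluster.
    return expected_votes // 2 + 1

expected = 5              # smoke cluster: camel, merit, winston, kool, salem
live = expected - 2 - 1   # revolver kills 2, network separation drops a 3rd

assert quorum(expected) == 3
assert live < quorum(expected)   # not quorate: activity blocked, no fencing
```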
Comment 11 Abhijith Das 2006-11-03 17:10:17 EST
cman_tool status on all nodes:

[root@camel ~]# cman_tool status
cman_tool: Cannot open connection to cman, is it running ?

[root@merit ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 57288
Membership state: Cluster-Member
Nodes: 1
Expected votes: 5
Total votes: 1
Quorum: 3 Activity blocked
Active subsystems: 7
Flags: DisallowedNodes 
Ports Bound: 0 11  
Node name: merit
Node ID: 2
Multicast addresses: 239.192.13.156 
Node addresses: 10.15.89.54 
Disallowed nodes: winston kool 

[root@winston ~]# cman_tool status
cman_tool: Cannot open connection to cman, is it running ?

[root@kool ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 57288
Membership state: Cluster-Member
Nodes: 1
Expected votes: 5
Total votes: 1
Quorum: 3 Activity blocked
Active subsystems: 7
Flags: DisallowedNodes 
Ports Bound: 0 11  
Node name: kool
Node ID: 4
Multicast addresses: 239.192.13.156 
Node addresses: 10.15.89.56 
Disallowed nodes: merit winston 

[root@salem ~]# cman_tool status
cman_tool: Cannot open connection to cman, is it running ?

cman_tool nodes on all nodes:

[root@camel ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?

[root@merit ~]# cman_tool nodes
NOTE: There are 2 disallowed nodes,
      members list may seem inconsistent across the cluster
Node  Sts   Inc   Joined               Name
   1   X      0                        camel
   2   M  57248   2006-11-03 14:20:40  merit
   3   d  57260   2006-11-03 14:20:40  winston
   4   d  57260   2006-11-03 14:20:40  kool
   5   X  57260                        salem

[root@winston ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?

[root@kool ~]# cman_tool nodes
NOTE: There are 2 disallowed nodes,
      members list may seem inconsistent across the cluster
Node  Sts   Inc   Joined               Name
   1   X  57224                        camel
   2   d  57260   2006-11-03 14:22:55  merit
   3   d  57244   2006-11-03 14:20:12  winston
   4   M  56140   2006-11-03 10:48:51  kool
   5   X  57232                        salem

[root@salem ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?

group_tool -v on all nodes:

[root@camel ~]# group_tool -v
Unable to connect to groupd.  Is it running?

[root@merit ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 FAIL_ALL_STOPPED 3 300030003 -1
[2 3 4 5]
dlm              1     clvmd    00020002 FAIL_ALL_STOPPED 3 300030003 -1
[2 3 4 5]

[root@winston ~]# group_tool -v
Unable to connect to groupd.  Is it running?

[root@kool ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010002 FAIL_ALL_STOPPED 2 200020003 -1
[2 3 4 5]
dlm              1     clvmd    00020002 FAIL_ALL_STOPPED 2 200020003 -1
[2 3 4 5]
dlm              1     soot     00040002 FAIL_ALL_STOPPED 3 300020003 -1
[3 4 5]
gfs              2     soot     00030002 FAIL_ALL_STOPPED 3 300020003 -1
[3 4 5]

[root@salem ~]# group_tool -v
Unable to connect to groupd.  Is it running?
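When collecting these dumps from every node, the fields that matter here are the vote counts and the disallowed list. A hypothetical sketch that reduces `cman_tool status` output to just those (the function name and return shape are invented for illustration):

```python
def summarize_status(output):
    """Pull the quorum-relevant fields out of `cman_tool status` output."""
    info = {}
    for line in output.splitlines():
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    total = int(info.get("Total votes", "0"))
    # The 'Quorum' line looks like "3 Activity blocked"; the first token
    # is the vote count needed for quorum.
    needed = int(info.get("Quorum", "0").split()[0])
    return {
        "node": info.get("Node name"),
        "quorate": total >= needed,
        "disallowed": info.get("Disallowed nodes", "").split(),
    }
```

Run over merit's output above, this reports quorate False and disallowed ['winston', 'kool']; over kool's, disallowed ['merit', 'winston'], making the split views explicit.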
Comment 12 Abhijith Das 2006-11-03 17:14:25 EST
"/home/devel/whitetank/test/testclm2 query" on all nodes:

[root@camel ~]# /home/devel/whitetank/test/testclm2 query
called: Initialize, status: service cannot be provided now, try later

[root@merit ~]# /home/devel/whitetank/test/testclm2 query
called: Initialize, status: successful

AIS version supported: B.1.1
called: ClusterTrack, status: successful

  view number: 14322
  number of items: 2

    node index within sequence: 0
    cluster node: 2
    address: 10.15.89.54
    name: 10.15.89.54
    member: true
    booted: 2006-11-03 14:20:09
    initial view number: 0
    cluster change: node has not changed

    node index within sequence: 1
    cluster node: 4
    address: 10.15.89.56
    name: 10.15.89.56
    member: true
    booted: 2006-11-03 10:47:56
    initial view number: 14035
    cluster change: node has not changed

called: Finalize, status: successful

[root@winston ~]# /home/devel/whitetank/test/testclm2 query
called: Initialize, status: service cannot be provided now, try later

[root@kool ~]# /home/devel/whitetank/test/testclm2 query
called: Initialize, status: successful

AIS version supported: B.1.1
called: ClusterTrack, status: successful

  view number: 14322
  number of items: 2

    node index within sequence: 0
    cluster node: 4
    address: 10.15.89.56
    name: 10.15.89.56
    member: true
    booted: 2006-11-03 10:47:56
    initial view number: 0
    cluster change: node has not changed

    node index within sequence: 1
    cluster node: 2
    address: 10.15.89.54
    name: 10.15.89.54
    member: true
    booted: 2006-11-03 14:20:09
    initial view number: 14312
    cluster change: node has not changed

called: Finalize, status: successful

[root@salem ~]# /home/devel/whitetank/test/testclm2 query
called: Initialize, status: service cannot be provided now, try later
Comment 13 Abhijith Das 2006-11-03 17:17:27 EST
Created attachment 140329 [details]
/var/log/messages on the smoke cluster nodes:
Comment 14 Christine Caulfield 2006-11-06 05:08:17 EST
oops, if an AISONLY node left the cluster it didn't get flagged as DEAD.

Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.55; previous revision: 1.54
done

I'm not sure about the delays. There are certainly no sleep()s in cman, and the
only calls to netdb are when reading CCS. The latter can happen in a running node
if a new node joins the cluster with a newer CCS version, but I don't think
that's what is happening here.
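The one-line fix can be pictured as a missing transition in the node state machine. A toy model (invented names, not the real commands.c structures):

```python
# Node status codes as printed by cman_tool nodes.
MEMBER, DISALLOWED, DEAD = "M", "d", "X"

def on_node_leave(status):
    # Before the fix only a full member was marked dead on leaving; an
    # AISONLY/disallowed node kept its 'd' status and lingered in the
    # members list. The fix maps both states to DEAD.
    return DEAD if status in (MEMBER, DISALLOWED) else status
```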
Comment 15 Christine Caulfield 2006-11-06 08:25:21 EST
Created attachment 140467 [details]
syslog extract from kool

Looking at the syslog on kool, I can't see why openais has split the cluster
(unless I've misunderstood something) - there is certainly not 5 seconds
between the full 4-node cluster being established and 3 nodes leaving.

merit shows something very similar.
Comment 17 Christine Caulfield 2006-11-07 03:33:03 EST
*** Bug 214300 has been marked as a duplicate of this bug. ***
Comment 18 Corey Marthaler 2006-11-07 14:59:12 EST
If 214300 is a dup of this, then this needs to be marked as a QE test blocker.
Comment 20 Kiersten (Kerri) Anderson 2006-11-09 15:33:18 EST
Moving to MODIFIED for the code changes for the output and its relationship to
the openais changes in bz 214290.
Comment 21 Corey Marthaler 2006-11-13 18:19:11 EST
I appear to have hit this again with the latest builds.
[root@taft-02 ~]# rpm -q cman
cman-2.0.35-2.el5
[root@taft-02 ~]# rpm -q openais
openais-0.80.1-15.el5


I noticed this while running clvmd regression tests. One of the commands hung
due to one of the nodes mysteriously dropping out of the cluster.

[root@taft-01 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 FAIL_ALL_STOPPED 2 200030003 -1
[1 2 3 4]
dlm              1     clvmd    00090001 FAIL_ALL_STOPPED 2 200030003 -1
[1 2 3 4]

[root@taft-02 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm              1     clvmd    00090001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]

[root@taft-03 ~]# group_tool -v
Unable to connect to groupd.  Is it running?

[root@taft-04 ~]# group_tool -v
type             level name     id       state node id local_done
fence            0     default  00010001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm              1     clvmd    00090001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]


Taft-03 is the node that went down:
Nov 13 15:01:25 taft-03 openais[6227]: [TOTEM] Storing new sequence id for ring 3c
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ] New Configuration:
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ]  r(0) ip(10.15.89.69)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ] Members Left:
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ] Members Joined:
Nov 13 15:01:25 taft-03 openais[6227]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ] New Configuration:
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ]  r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ]  r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ]  r(0) ip(10.15.89.69)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ]  r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ] Members Left:
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ] Members Joined:
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ]  r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ]  r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM  ]  r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-03 openais[6227]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:25 taft-03 openais[6227]: [TOTEM] entering OPERATIONAL state.
Nov 13 15:01:25 taft-03 gfs_controld[6255]: cluster is down, exiting
Nov 13 15:01:25 taft-03 dlm_controld[6250]: cluster is down, exiting
Nov 13 15:01:25 taft-03 fenced[6245]: cluster is down, exiting
Nov 13 15:01:25 taft-03 kernel: dlm: closing connection to node 3
Nov 13 15:01:46 taft-03 ccsd[6221]: Unable to connect to cluster infrastructure after 30 seconds.
Nov 13 15:02:16 taft-03 ccsd[6221]: Unable to connect to cluster infrastructure after 60 seconds.
Nov 13 15:02:46 taft-03 ccsd[6221]: Unable to connect to cluster infrastructure after 90 seconds.


Taft-02 is the machine with the hung process:
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq 36 rep 10.15.89.68
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru b high delivered b received
flag 0
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] position [2] member 10.15.89.70:
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq 48 rep 10.15.89.70
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru b high delivered b received
flag 0
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] Did not need to originate any
messages in recove
ry.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] Storing new sequence id for ring 38
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] New Configuration:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] Members Left:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] Members Joined:
Nov 13 15:01:25 taft-02 openais[6231]: [SYNC ] This node is within the primary
component and wi
ll provide service.
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] New Configuration:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] Members Left:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] Members Joined:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-02 openais[6231]: [SYNC ] This node is within the primary
component and will provide service.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering OPERATIONAL state.
Nov 13 15:01:25 taft-02 openais[6231]: [MAIN ] Node taft-04 not joined to cman
because it has rejoined an inquorate cluster
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering GATHER state from 11.  
               Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] Saving state aru 2
high seq received 2
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering COMMIT state.          
               Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering RECOVERY
state.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] position [0] member 10.15.89.67:
               Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq
56 rep 10.15.89.67
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru 2 high delivered 2 received
flag 0          Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] position [1]
member 10.15.89.68:
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq 56 rep 10.15.89.67
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru 2 high delivered 2 received flag 0
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] position [2] member 10.15.89.69:
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq 56 rep 10.15.89.69
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru b high delivered b received flag 0
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] position [3] member 10.15.89.70:
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq 56 rep 10.15.89.67
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru 2 high delivered 2 received flag 0
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] Did not need to originate any messages in recovery.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] Storing new sequence id for ring 3c
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] New Configuration:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] Members Left:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] Members Joined:
Nov 13 15:01:25 taft-02 openais[6231]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] New Configuration:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.69)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] Members Left:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ] Members Joined:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.69)
Nov 13 15:01:25 taft-02 openais[6231]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering OPERATIONAL state.
Nov 13 15:01:30 taft-02 openais[6231]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 13 15:01:30 taft-02 openais[6231]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
Nov 13 15:01:30 taft-02 openais[6231]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Nov 13 15:01:30 taft-02 openais[6231]: [TOTEM] entering GATHER state from 2.
Nov 13 15:01:34 taft-02 openais[6231]: [TOTEM] entering GATHER state from 0.
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] Saving state aru 1 high seq received 1
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] entering COMMIT state.
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] entering RECOVERY state.
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] position [0] member 10.15.89.67:
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] previous ring seq 60 rep 10.15.89.67
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] aru 1 high delivered 1 received flag 0
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] position [1] member 10.15.89.68:
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] previous ring seq 60 rep 10.15.89.67
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] aru 1 high delivered 1 received flag 0
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] position [2] member 10.15.89.70:
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] previous ring seq 60 rep 10.15.89.67
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] aru 1 high delivered 1 received flag 0
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] Did not need to originate any messages in recovery.
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] Storing new sequence id for ring 40
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ] New Configuration:
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.67)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.68)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.70)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ] Members Left:
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.69)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ] Members Joined:
Nov 13 15:01:35 taft-02 openais[6231]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ] New Configuration:
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.67)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.68)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ]  r(0) ip(10.15.89.70)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ] Members Left:
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ] Members Joined:
Nov 13 15:01:35 taft-02 openais[6231]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] entering OPERATIONAL state.
Nov 13 15:01:35 taft-02 openais[6231]: [MAIN ] Node taft-01 not joined to cman because it has rejoined an inquorate cluster
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ] got nodejoin message 10.15.89.67
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ] got nodejoin message 10.15.89.68
Nov 13 15:01:35 taft-02 openais[6231]: [CLM  ] got nodejoin message 10.15.89.70
Nov 13 15:01:35 taft-02 openais[6231]: [CPG  ] got joinlist message from node 2
Nov 13 15:01:35 taft-02 openais[6231]: [CPG  ] got joinlist message from node 4
Nov 13 15:01:35 taft-02 openais[6231]: [IPC  ] Disabling flow control - HW mark
[0/800].
Nov 13 15:01:35 taft-02 openais[6231]: [CPG  ] got joinlist message from node 1
Comment 22 Christine Caulfield 2006-11-14 05:48:44 EST
This time OpenAIS has hit the 5 second token timeout:

Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering OPERATIONAL state.
Nov 13 15:01:30 taft-02 openais[6231]: [TOTEM] The token was lost in the OPERATIONAL state.
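
(For context: in RHEL 5 clusters the totem token timeout can be raised from its default by setting the `token` attribute, in milliseconds, on the `<totem>` element in /etc/cluster/cluster.conf. The sketch below is illustrative only; the 10000 ms value is an example, not a recommendation from this bug:

```xml
<?xml version="1.0"?>
<cluster name="taft" config_version="2">
  <!-- Illustrative: raise token timeout to 10 s so transient
       network stalls do not trigger a membership change -->
  <totem token="10000"/>
  <clusternodes>
    <!-- node definitions elided -->
  </clusternodes>
</cluster>
```

cman passes this value through to the openais totem layer, so a longer timeout trades slower failure detection for tolerance of brief token loss.)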
Comment 24 Steven Dake 2006-11-14 18:13:09 EST
May be fixed by resolution of bz 215626.
Comment 25 Corey Marthaler 2006-11-22 13:45:23 EST
Have not been able to reproduce with the latest build, openais-0.80.1-17.el5.
Marking verified and will open a new bz if seen again.
Comment 26 RHEL Product and Program Management 2007-02-07 21:22:33 EST
A package has been built which should help the problem described in 
this bug report. This report is therefore being closed with a resolution 
of CURRENTRELEASE. You may reopen this bug report if the solution does 
not work for you.
Comment 27 Nate Straz 2007-12-13 12:21:48 EST
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.
