Bug 213747
| Summary: | cman has incorrect view of membership | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Abhijith Das <adas> |
| Component: | cman | Assignee: | Christine Caulfield <ccaulfie> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 5.0 | CC: | cfeist, cluster-maint, cmarthal, djansa, kanderso, rkenna, rpeterso, sdake, teigland |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | --- | | |
| Hardware: | All | OS: | Linux |
| Whiteboard: | | | |
| Fixed In Version: | RC | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2007-02-08 02:22:32 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Abhijith Das, 2006-11-02 20:33:53 UTC)
Chat logs about this BZ:

abhi> riley91, any clue why the current cluster state is funky?
abhi> the nodes don't seem to agree on membership
riley91> i was examining the core that dropped nov 1
riley91> abhi what command are you using to look at membership
abhi> cman_tool nodes
riley91> what node doesn't agree
abhi> winston and kool
abhi> they think salem is not part of the cluster
* riley91 sighs
riley91> well I am not sure what the problem is although totem has the proper membership
riley91> on those nodes
abhi> hmm
riley91> the last event out of totem is a member joined 10.15.89.57 which is salem
* rkenna_ has quit (Ping timeout: 610 seconds)
riley91> for some reason cman has an improper membership
abhi> group_tool output looks sane
riley91> file a defect
riley91> assign to pjc
riley91> put logs of /var/log/messages
riley91> i'll talk to him tonight about it
riley91> has this happened before ?
riley91> put logs of group_tool -v output
abhi> ok.
riley91> put logs of cman_tool nodes for the offending nodes
abhi> ok
riley91> it looks to me something is wrong in the cman plugin
riley91> cman plugin has defect of some kind
abhi> okay
riley91> the way I know is the clm service that is SA Forum compliant has the proper membership
riley91> please put this in the logs too
riley91> err in the bz
riley91> /home/devel/whitetank/test/testclm2 query
riley91> it shows that all 5 nodes are members
riley91> if this returned the same membership as cman i'd think totem was not delivering the callbacks properly
riley91> however the membership of testclm2 and cman_tool are different
riley91> testclm2 has proper membership
riley91> cman_tool has improper membership
riley91> post a log of this conversation too
riley91> gotta go to lunch be back in 30 mins
abhi> ok.
riley91> thanks a lot abhi
abhi> bz will be ready by then :)
riley91> i think patrick just needs to add some debug printouts to cman to help debug this issue

group_tool -v outputs:

[root@camel /home/devel/whitetank]# group_tool -v
type level name id state node id local_done
fence 0 default 00010003 none
[1 2 3 4 5]
dlm 1 clvmd 00010002 none
[1 2 3 4 5]
dlm 1 soot 00030002 none
[1 2 3 4 5]
gfs 2 soot 00020002 none
[1 2 3 4 5]

[root@merit /home/devel/whitetank]# group_tool -v
type level name id state node id local_done
fence 0 default 00010003 none
[1 2 3 4 5]
dlm 1 clvmd 00010002 none
[1 2 3 4 5]
dlm 1 soot 00030002 none
[1 2 3 4 5]
gfs 2 soot 00020002 none
[1 2 3 4 5]

[root@winston /home/devel/whitetank]# group_tool -v
type level name id state node id local_done
fence 0 default 00010003 none
[1 2 3 4 5]
dlm 1 clvmd 00010002 none
[1 2 3 4 5]
dlm 1 soot 00030002 none
[1 2 3 4 5]
gfs 2 soot 00020002 none
[1 2 3 4 5]

[root@kool /home/devel/whitetank]# group_tool -v
type level name id state node id local_done
fence 0 default 00010003 none
[1 2 3 4 5]
dlm 1 clvmd 00010002 none
[1 2 3 4 5]
dlm 1 soot 00030002 none
[1 2 3 4 5]
gfs 2 soot 00020002 none
[1 2 3 4 5]

[root@salem /home/devel/whitetank]# group_tool -v
type level name id state node id local_done
fence 0 default 00010003 none
[1 2 3 4 5]
dlm 1 clvmd 00010002 none
[1 2 3 4 5]
dlm 1 soot 00030002 none
[1 2 3 4 5]
gfs 2 soot 00020002 none
[1 2 3 4 5]

cman_tool nodes outputs:

[root@camel /home/devel/whitetank]# cman_tool nodes
Node Sts Inc Joined Name
1 M 50660 2006-11-02 11:44:44 camel
2 M 50680 2006-11-02 11:44:49 merit
3 M 50676 2006-11-02 11:44:44 winston
4 M 50676 2006-11-02 11:44:44 kool
5 M 50684 2006-11-02 11:46:53 salem

[root@merit /home/devel/whitetank]# cman_tool nodes
Node Sts Inc Joined Name
1 M 50680 2006-11-02 11:44:58 camel
2 M 50652 2006-11-02 11:44:58 merit
3 M 50680 2006-11-02 11:44:58 winston
4 M 50680 2006-11-02 11:44:58 kool
5 M 50684 2006-11-02 11:47:01 salem

[root@winston /home/devel/whitetank]# cman_tool nodes
Node Sts Inc Joined Name
1 M 50676 2006-11-02 11:45:41 camel
2 M 50680 2006-11-02 11:45:46 merit
3 M 50640 2006-11-02 11:42:43 winston
4 M 50648 2006-11-02 11:42:43 kool
5 X 50648 salem

[root@kool /home/devel/whitetank]# cman_tool nodes
Node Sts Inc Joined Name
1 M 50676 2006-11-02 11:47:10 camel
2 M 50680 2006-11-02 11:47:15 merit
3 M 50648 2006-11-02 11:44:12 winston
4 M 50260 2006-11-02 10:27:32 kool
5 X 50636 salem

[root@salem /home/devel/whitetank]# cman_tool nodes
Node Sts Inc Joined Name
1 M 50684 2006-11-02 11:51:01 camel
2 M 50684 2006-11-02 11:51:01 merit
3 M 50684 2006-11-02 11:51:01 winston
4 M 50684 2006-11-02 11:51:01 kool
5 M 50680 2006-11-02 11:51:01 salem

Created attachment 140182 [details]
whitetank/test/testclm2 query output from all nodes
Created attachment 140183 [details]
/var/log/messages from the smoke cluster
cman_tool status is the most useful command to use here. It would show that salem is a "disallowed" member because it rejoined without using cman_tool join. These messages appear in syslog:

kool.lab.varlog:Nov 2 11:47:15 kool openais[1940]: [MAIN ] Node salem not joined to cman because it has rejoined an inquorate cluster
winston.lab.varlog:Nov 2 11:45:41 winston openais[1748]: [MAIN ] Node salem not joined to cman because it has rejoined an inquorate cluster

The other two nodes are showing salem as a member because they joined after it did and don't know of its "rogue" status. Leaving aside the question of whether the node should have been fenced, I think the only safe way to deal with this is to mark any new nodes that join the cluster while it's in this state as "AISONLY" too.

The actual solution had to be a little more drastic than that: we can't allow a node to join a cluster with "Disallowed" nodes in it, because we don't know which portion of the cluster is the one it ought to ally itself with.

Checking in cman_tool/main.c;
/cvs/cluster/cluster/cman/cman_tool/main.c,v <-- main.c
new revision: 1.51; previous revision: 1.50
done
Checking in daemon/cnxman-private.h;
/cvs/cluster/cluster/cman/daemon/cnxman-private.h,v <-- cnxman-private.h
new revision: 1.26; previous revision: 1.25
done
Checking in daemon/commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v <-- commands.c
new revision: 1.54; previous revision: 1.53
done

Problem found during smoke testing of the cluster components. Devel ACK, patch already posted for inclusion in beta2 builds.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering. This request is not yet committed for inclusion in a release.

After pulling in pjc's patch I ran revolver again and ran into a similar issue. The revolver iteration killed 2 nodes:

================================================================================
Scenario iteration 18.2 started at Fri Nov 3 14:16:48 CST 2006
Sleeping 1 minute(s) to let the I/O get its lock count up...
Scenario: DLM kill Quorum minus one
Those picked to face the revolver... merit camel
Feeling lucky merit? Well do ya? Go'head make my day...
Didn't receive heartbeat for 2 seconds
Feeling lucky camel? Well do ya? Go'head make my day...
Didn't receive heartbeat for 2 seconds

The cluster didn't come up after this attempt and revolver timed out. I logged in and saw that camel, winston and salem had failed to start cman. The other two nodes disagree on 'disallowed nodes'. I'll add cluster status info, logs, IRC chats etc. to the bz. pjc, the smoke cluster is undisturbed since the crash in case you need to log in and check things out.

Chat with riley and dct on IRC about this:

abhi> riley91, camel, winston and salem didn't come up
abhi> kool and merit have a borked view of the cluster
riley91> why didn't camel winston and salem come up ?
abhi> kool says Nov 3 15:27:08 kool ccsd[1923]: Error while processing connect: Connection refused
abhi> Nov 3 15:27:13 kool ccsd[1923]: Cluster is not quorate. Refusing connection.
abhi> over and over
* stanko has quit (Quit: Leaving)
riley91> dont run any commands right now
abhi> ok
riley91> but did service cman start start on camel ?
riley91> or did that fail in some way
riley91> [root@merit ~]# group_tool -v
riley91> type level name id state node id local_done
riley91> fence 0 default 00010002 FAIL_ALL_STOPPED 3 300030003 -1
riley91> [2 3 4 5]
riley91> dlm 1 clvmd 00020002 FAIL_ALL_STOPPED 3 300030003 -1
riley91> [2 3 4 5]
abhi> [root@camel ~]# service cman status
abhi> groupd is stopped
riley91> ok this is some kind of cman problem
riley91> cman is killing aisexec
riley91> Nov 3 14:20:45 camel openais[1736]: CMAN: Joined a cluster with disallowed nodes. must die
riley91> i think patrick's fix may need some more work
abhi> riley91, pjc commented in his fix about this
abhi> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213747#c5
riley91> ok well there is an interaction between cman and groupd which is not considered with the change
riley91> i suggest reopen bug with all log info
riley91> probably keep the smoke cluster where it is for now so he can look at it
abhi> what bothers me is how did the cluster lose quorum?
abhi> The test case only knocked out 2 nodes
riley91> there are only 2 nodes
riley91> all of this killing of aisexec is wrong
feist> riley91: ok, you should be all set with the benches on their new apc unit. let me know if you have any issues.
riley91> what needs to happen is the nodes need to be fenced
riley91> imo
abhi> Scenario iteration 18.2 started at Fri Nov 3 14:16:48 CST 2006
abhi> Sleeping 1 minute(s) to let the I/O get its lock count up...
abhi> Scenario: DLM kill Quorum minus one
abhi> Those picked to face the revolver... merit camel
abhi> Feeling lucky merit? Well do ya? Go'head make my day...
abhi> Didn't receive heartbeat for 2 seconds
abhi> Feeling lucky camel? Well do ya? Go'head make my day...
abhi> Didn't receive heartbeat for 2 seconds
riley91> feist cool thanks
riley91> abhi reopen the bug
riley91> abhi and put all the logs in which would be interesting
abhi> riley91, ok
riley91> or maybe ask kevin before reopening
riley91> but definitely put logs in saying the patch seems to not get the job done
riley91> or there is some other problem
abhi> riley91, ok
riley91> perhaps dct can diagnose after testing of ckpt fix
riley91> after i finish ckpt fix building i'll diagnose more
riley91> and comment in bug if i find anything else
dct__> what happened almost certainly is that revolver killed 2 nodes, then aisexec was momentarily disconnected on one of the live nodes
dct__> the two nodes that weren't disconnected consider the disconnected node to be failed
dct__> i.e. you now have 3 nodes down: 2 from revolver, 1 from network separation
dct__> when the separated node comes back, the remaining two don't let it back in the cluster since it wasn't started with cman_tool
dct__> the 2 "normal" nodes don't have quorum, so they're not going to fence anyone
dct__> i.e. it sounds like everything was going about right -- it's largely a matter of interpreting what happened correctly
riley91> dct groupd is stuck
dct__> how do you mean?
dct__> in the scenario above, groupd won't do anything because there's no quorum
riley91> [root@merit ~]# group_tool -v
riley91> type level name id state node id local_done
riley91> fence 0 default 00010002 FAIL_ALL_STOPPED 3 300030003 -1
riley91> [2 3 4 5]
riley91> dlm 1 clvmd 00020002 FAIL_ALL_STOPPED 3 300030003 -1
riley91> [2 3 4 5]
dct__> yes, it's intentionally not doing anything because there's no quorum
riley91> ok
kanderso> what do you mean by not started with cman_tool ?
dct__> I should note that in the output of group_tool probably
abhi> dct__, what you just said sounds like what might've happened... but this network-separation causing a node to appear dead... what can we do about it?
riley91> first we dont know that it was a network separation
riley91> or do you know dave
dct__> kanderso: aisexec can get separated from the cluster, then it automatically rejoins -- cman "filters" this auto-rejoined node out of the cluster
dct__> we don't know for sure about the network separation, but it's something like that
abhi> dct__, is there a log of this rejoining we can look for?
dct__> might be avoidable with some work, might not be -- don't know without some deep analysis
dct__> abhi: yes, pjc made note of those messages to look for in the bz
dct__> and cman_tool status may show you some info about invalid nodes that have been separated and then merged back in
abhi> here's something weird :
abhi> Disallowed nodes: winston kool
abhi> [root@merit ~]#
dct__> yes
abhi> Disallowed nodes: merit winston
abhi> [root@kool ~]#
abhi> merit thinks winston and kool are disallowed
abhi> kool thinks merit and winston are disallowed
dct__> and winston and kool probably think merit is disallowed
abhi> winston is dead
dct__> something more might have gone wrong, not sure
abhi> ok. I'll just add stuff to the bz and reopen it
abhi> gonna leave the cluster in the same state for pjc to look at later
dct__> but part of the explanation are these "disallowed nodes" where aisexec has been separated momentarily
riley91> the timeout is 5 seconds
riley91> i dont see how we could have a momentary disconnection
riley91> unless cman blocks somewhere
riley91> perhaps by using gethostbyaddr
dct__> good point
riley91> or sleep
riley91> openais shouldn't have any of that
riley91> before we ship we need to find the source of this problem
riley91> my best guess is if there is a token loss during normal operation now it is caused by blocking in cman or the openais executive
* mnc has quit (Remote host closed the connection)
riley91> on the smoke cluster we should set the token loss timer to some very large value
riley91> and see if we get this kind of failure
riley91> abhi at what time did you have the crash
abhi> around 2:20 pm today
abhi> CST

cman_tool status on all nodes:
[root@camel ~]# cman_tool status
cman_tool: Cannot open connection to cman, is it running ?
[root@merit ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 57288
Membership state: Cluster-Member
Nodes: 1
Expected votes: 5
Total votes: 1
Quorum: 3 Activity blocked
Active subsystems: 7
Flags: DisallowedNodes
Ports Bound: 0 11
Node name: merit
Node ID: 2
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.54
Disallowed nodes: winston kool
[root@winston ~]# cman_tool status
cman_tool: Cannot open connection to cman, is it running ?
[root@kool ~]# cman_tool status
Version: 6.0.1
Config Version: 1
Cluster Name: smoke
Cluster Id: 3471
Cluster Member: Yes
Cluster Generation: 57288
Membership state: Cluster-Member
Nodes: 1
Expected votes: 5
Total votes: 1
Quorum: 3 Activity blocked
Active subsystems: 7
Flags: DisallowedNodes
Ports Bound: 0 11
Node name: kool
Node ID: 4
Multicast addresses: 239.192.13.156
Node addresses: 10.15.89.56
Disallowed nodes: merit winston
[root@salem ~]# cman_tool status
cman_tool: Cannot open connection to cman, is it running ?
cman_tool nodes on all nodes:
[root@camel ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?
[root@merit ~]# cman_tool nodes
NOTE: There are 2 disallowed nodes,
members list may seem inconsistent across the cluster
Node Sts Inc Joined Name
1 X 0 camel
2 M 57248 2006-11-03 14:20:40 merit
3 d 57260 2006-11-03 14:20:40 winston
4 d 57260 2006-11-03 14:20:40 kool
5 X 57260 salem
[root@winston ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?
[root@kool ~]# cman_tool nodes
NOTE: There are 2 disallowed nodes,
members list may seem inconsistent across the cluster
Node Sts Inc Joined Name
1 X 57224 camel
2 d 57260 2006-11-03 14:22:55 merit
3 d 57244 2006-11-03 14:20:12 winston
4 M 56140 2006-11-03 10:48:51 kool
5 X 57232 salem
[root@salem ~]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?
group_tool -v on all nodes:
[root@camel ~]# group_tool -v
Unable to connect to groupd. Is it running?
[root@merit ~]# group_tool -v
type level name id state node id local_done
fence 0 default 00010002 FAIL_ALL_STOPPED 3 300030003 -1
[2 3 4 5]
dlm 1 clvmd 00020002 FAIL_ALL_STOPPED 3 300030003 -1
[2 3 4 5]
[root@winston ~]# group_tool -v
Unable to connect to groupd. Is it running?
[root@kool ~]# group_tool -v
type level name id state node id local_done
fence 0 default 00010002 FAIL_ALL_STOPPED 2 200020003 -1
[2 3 4 5]
dlm 1 clvmd 00020002 FAIL_ALL_STOPPED 2 200020003 -1
[2 3 4 5]
dlm 1 soot 00040002 FAIL_ALL_STOPPED 3 300020003 -1
[3 4 5]
gfs 2 soot 00030002 FAIL_ALL_STOPPED 3 300020003 -1
[3 4 5]
[root@salem ~]# group_tool -v
Unable to connect to groupd. Is it running?
"/home/devel/whitetank/test/testclm2 query" on all nodes:
[root@camel ~]# /home/devel/whitetank/test/testclm2 query
called: Initialize, status: service cannot be provided now, try later
[root@merit ~]# /home/devel/whitetank/test/testclm2 query
called: Initialize, status: successful
AIS version supported: B.1.1
called: ClusterTrack, status: successful
view number: 14322
number of items: 2
node index within sequence: 0
cluster node: 2
address: 10.15.89.54
name: 10.15.89.54
member: true
booted: 2006-11-03 14:20:09
initial view number: 0
cluster change: node has not changed
node index within sequence: 1
cluster node: 4
address: 10.15.89.56
name: 10.15.89.56
member: true
booted: 2006-11-03 10:47:56
initial view number: 14035
cluster change: node has not changed
called: Finalize, status: successful
[root@winston ~]# /home/devel/whitetank/test/testclm2 query
called: Initialize, status: service cannot be provided now, try later
[root@kool ~]# /home/devel/whitetank/test/testclm2 query
called: Initialize, status: successful
AIS version supported: B.1.1
called: ClusterTrack, status: successful
view number: 14322
number of items: 2
node index within sequence: 0
cluster node: 4
address: 10.15.89.56
name: 10.15.89.56
member: true
booted: 2006-11-03 10:47:56
initial view number: 0
cluster change: node has not changed
node index within sequence: 1
cluster node: 2
address: 10.15.89.54
name: 10.15.89.54
member: true
booted: 2006-11-03 14:20:09
initial view number: 14312
cluster change: node has not changed
called: Finalize, status: successful
[root@salem ~]# /home/devel/whitetank/test/testclm2 query
called: Initialize, status: service cannot be provided now, try later
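
When the cluster is in this state it helps to capture every node's view at once. The loop below is only a sketch (it assumes passwordless root ssh to the five smoke-cluster nodes named above and that the cluster tools are on the default path), but it gathers the same cman_tool and group_tool output shown here in one pass:

#!/bin/sh
# Sketch: collect each node's view of membership so disagreements in the
# "Disallowed nodes", Sts and Flags lines can be compared side by side.
# Assumes passwordless root ssh to the smoke-cluster nodes listed below.
for node in camel merit winston kool salem; do
    echo "=== $node ==="
    ssh root@"$node" '
        cman_tool status 2>&1
        cman_tool nodes 2>&1
        group_tool -v 2>&1
    '
done

Run from any host that can reach the nodes, it prints one block per node, so the mismatched "Disallowed nodes" lines above stand out immediately.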
Created attachment 140329 [details]
/var/log/messages on the smoke cluster nodes:
Oops: if an AISONLY node left the cluster, it didn't get flagged as DEAD.

Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v <-- commands.c
new revision: 1.55; previous revision: 1.54
done

The delays I'm not sure about. There are certainly no sleep()s in cman, and the only calls to netdb are when reading CCS. The latter can happen on a running node if a new node joins the cluster with a newer CCS version, but I don't think that's what is happening here.

Created attachment 140467 [details]
syslog extract from kool
Looking at the syslog on kool, I can't see why openais has split the cluster
(unless I've misunderstood something) - there is certainly not 5 seconds
between the full 4-node cluster being established and 3 nodes leaving.
merit shows something very similar.
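
To double-check that timing, a filter along these lines (only a sketch; the /var/log/messages path and the message text are the ones visible in the attached extracts) pulls just the membership-change lines and their timestamps out of a node's syslog:

# Sketch: show only the openais membership-transition lines so the gap between
# the full cluster forming and nodes leaving is easy to measure.
egrep 'CLM CONFIGURATION CHANGE|Members Left:|Members Joined:|entering OPERATIONAL|token was lost|not joined to cman' /var/log/messages

Comparing the timestamps on consecutive CONFIGURATION CHANGE lines from kool and merit should show whether anything close to the 5 second token timeout actually elapsed.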
*** Bug 214300 has been marked as a duplicate of this bug. ***

If 214300 is a dup of this, then this needs to be marked as a QE test blocker.

Moving to MODIFIED for the code changes for the output and their relationship to the openais changes in bz 214290.

I appear to have hit this again with the latest builds:
[root@taft-02 ~]# rpm -q cman
cman-2.0.35-2.el5
[root@taft-02 ~]# rpm -q openais
openais-0.80.1-15.el5
I noticed this while running clvmd regression tests. One of the commands hung
due to one of the nodes mysteriously dropping out of the cluster.
[root@taft-01 ~]# group_tool -v
type level name id state node id local_done
fence 0 default 00010001 FAIL_ALL_STOPPED 2 200030003 -1
[1 2 3 4]
dlm 1 clvmd 00090001 FAIL_ALL_STOPPED 2 200030003 -1
[1 2 3 4]
[root@taft-02 ~]# group_tool -v
type level name id state node id local_done
fence 0 default 00010001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm 1 clvmd 00090001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
[root@taft-03 ~]# group_tool -v
Unable to connect to groupd. Is it running?
[root@taft-04 ~]# group_tool -v
type level name id state node id local_done
fence 0 default 00010001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
dlm 1 clvmd 00090001 FAIL_ALL_STOPPED 1 100030003 -1
[1 2 3 4]
Taft-03 is the node that went down:
Nov 13 15:01:25 taft-03 openais[6227]: [TOTEM] Storing new sequence id for ring 3c
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] New Configuration:
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] r(0) ip(10.15.89.69)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] Members Left:
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] Members Joined:
Nov 13 15:01:25 taft-03 openais[6227]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] New Configuration:
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] r(0) ip(10.15.89.69)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] Members Left:
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] Members Joined:
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-03 openais[6227]: [CLM ] r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-03 openais[6227]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:25 taft-03 openais[6227]: [TOTEM] entering OPERATIONAL state.
Nov 13 15:01:25 taft-03 gfs_controld[6255]: cluster is down, exiting
Nov 13 15:01:25 taft-03 dlm_controld[6250]: cluster is down, exiting
Nov 13 15:01:25 taft-03 fenced[6245]: cluster is down, exiting
Nov 13 15:01:25 taft-03 kernel: dlm: closing connection to node 3
Nov 13 15:01:46 taft-03 ccsd[6221]: Unable to connect to cluster infrastructure after 30 seconds.
Nov 13 15:02:16 taft-03 ccsd[6221]: Unable to connect to cluster infrastructure after 60 seconds.
Nov 13 15:02:46 taft-03 ccsd[6221]: Unable to connect to cluster infrastructure after 90 seconds.
Taft-02 is the machine with the hung process:
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq 36 rep 10.15.89.68
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru b high delivered b received
flag 0
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] position [2] member 10.15.89.70:
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq 48 rep 10.15.89.70
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru b high delivered b received
flag 0
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] Did not need to originate any messages in recovery.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] Storing new sequence id for ring 38
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] New Configuration:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] Members Left:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] Members Joined:
Nov 13 15:01:25 taft-02 openais[6231]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] New Configuration:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] Members Left:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] Members Joined:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-02 openais[6231]: [SYNC ] This node is within the primary
component and will provide service.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering OPERATIONAL state.
Nov 13 15:01:25 taft-02 openais[6231]: [MAIN ] Node taft-04 not joined to cman
because it has rejoined an inquorate cluster
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering GATHER state from 11.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] Saving state aru 2
high seq received 2
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering COMMIT state.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering RECOVERY
state.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] position [0] member 10.15.89.67:
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq 56 rep 10.15.89.67
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru 2 high delivered 2 received flag 0
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] position [1] member 10.15.89.68:
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq 56 rep 10.15.89.67
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru 2 high delivered 2 received flag 0
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] position [2] member 10.15.89.69:
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq 56 rep 10.15.89.69
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru b high delivered b received flag 0
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] position [3] member 10.15.89.70:
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] previous ring seq 56 rep 10.15.89.67
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] aru 2 high delivered 2 received flag 0
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] Did not need to originate any messages in recovery.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] Storing new sequence id for ring 3c
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] New Configuration:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] Members Left:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] Members Joined:
Nov 13 15:01:25 taft-02 openais[6231]: [SYNC ] This node is within the primary
component and will provide service.
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] New Configuration:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.67)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.68)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.69)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.70)
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] Members Left:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] Members Joined:
Nov 13 15:01:25 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.69)
Nov 13 15:01:25 taft-02 openais[6231]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering OPERATIONAL state.
Nov 13 15:01:30 taft-02 openais[6231]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 13 15:01:30 taft-02 openais[6231]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
Nov 13 15:01:30 taft-02 openais[6231]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Nov 13 15:01:30 taft-02 openais[6231]: [TOTEM] entering GATHER state from 2.
Nov 13 15:01:34 taft-02 openais[6231]: [TOTEM] entering GATHER state from 0.
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] Saving state aru 1 high seq
received 1
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] entering COMMIT state.
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] entering RECOVERY state.
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] position [0] member 10.15.89.67:
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] previous ring seq 60 rep 10.15.89.67
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] aru 1 high delivered 1 received
flag 0
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] position [1] member 10.15.89.68:
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] previous ring seq 60 rep 10.15.89.67
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] aru 1 high delivered 1 received
flag 0
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] position [2] member 10.15.89.70:
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] previous ring seq 60 rep 10.15.89.67
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] aru 1 high delivered 1 received
flag 0
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] Did not need to originate any messages in recovery.
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] Storing new sequence id for ring 40
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] New Configuration:
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.67)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.68)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.70)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] Members Left:
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.69)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] Members Joined:
Nov 13 15:01:35 taft-02 openais[6231]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] New Configuration:
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.67)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.68)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] r(0) ip(10.15.89.70)
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] Members Left:
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] Members Joined:
Nov 13 15:01:35 taft-02 openais[6231]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 15:01:35 taft-02 openais[6231]: [TOTEM] entering OPERATIONAL state.
Nov 13 15:01:35 taft-02 openais[6231]: [MAIN ] Node taft-01 not joined to cman because it has rejoined an inquorate cluster
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] got nodejoin message 10.15.89.67
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] got nodejoin message 10.15.89.68
Nov 13 15:01:35 taft-02 openais[6231]: [CLM ] got nodejoin message 10.15.89.70
Nov 13 15:01:35 taft-02 openais[6231]: [CPG ] got joinlist message from node 2
Nov 13 15:01:35 taft-02 openais[6231]: [CPG ] got joinlist message from node 4
Nov 13 15:01:35 taft-02 openais[6231]: [IPC ] Disabling flow control - HW mark
[0/800].
Nov 13 15:01:35 taft-02 openais[6231]: [CPG ] got joinlist message from node 1
This time OpenAIS has hit the 5 second token timeout:

Nov 13 15:01:25 taft-02 openais[6231]: [TOTEM] entering OPERATIONAL state.
Nov 13 15:01:30 taft-02 openais[6231]: [TOTEM] The token was lost in the OPERATIONAL state.

This may be fixed by the resolution of bz 215626.

Have not been able to reproduce with the latest build, openais-0.80.1-17.el5. Marking verified and will open a new bz if seen again.

A package has been built which should help the problem described in this bug report. This report is therefore being closed with a resolution of CURRENTRELEASE. You may reopen this bug report if the solution does not work for you.

Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5, which never existed.