Red Hat Bugzilla – Bug 251966
merge of openais partitions and disallowed cman nodes
Last modified: 2009-04-16 18:30:07 EDT
Description of problem:
This was discovered while working on bug #251082, and appears in
that bug's comment #9:
I had revolver crash again at 2:00 in the morning. This time it had
different symptoms, and aisexec was still running, so I asked Dave
Teigland to take a look. According to Dave:
<dct> openais/cpg seems to be the source of the problem with the roth nodes
<dct> the problem seems to be in syncing cpg state when nodes join
The scenario was this:
roth-01 and -03 were shot, and -02 was left to pick up the pieces.
Now roth-02 group_tool -v claims:
fence 0 default 00010002 JOIN_STOP_WAIT 1100020001 1
But roth-01 and -03 both claim:
fence 0 default 00010003 JOIN_START_WAIT 2 200030001 1
[1 2 3]
which seems wrong to me. I'll keep the cluster in this state so you
can look at it with gdb, since it's still running. However, I need
my cluster back soon in order to work on other problems.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run revolver long enough.
Actual results:
group_tool -v reports the nodes as being in an abnormal state.
Expected results:
group_tool -v should report no problems.
According to Dave Teigland:
"The problem on those nodes seems to be cpg syncing.
One node was left as a cpg member when two nodes were killed,
then the two nodes came back and joined the cpg, but didn't see the
old node there. So it appears that the cpg state from the one node
that wasn't killed never got synced to the two nodes when they rejoined."
Information was obtained by looking at group_tool dump, at the
"0:default confchg" lines.
"All three nodes are cpg members, 1 and 3 are killed, 2 is the only
one remaining, then 3 comes back, joins the cpg and its confchg says
it's the only member (which it isn't), then 1 comes back, joins the
cpg, and is told that it (1) and 3 are the two members, then finally
1 and 3 are told that 2 is a cpg member (which it was already)."
Here is the output:
1186642340 0:default confchg left 0 joined 1 total 2
1186642344 0:default confchg left 0 joined 1 total 3
1186642465 0:default confchg left 1 joined 0 total 2
1186642465 0:default confchg removed node 1 reason 3
1186642574 0:default confchg left 0 joined 1 total 3
1186642687 0:default confchg left 1 joined 0 total 2
1186642687 0:default confchg removed node 3 reason 3
1186642781 0:default confchg left 0 joined 1 total 3
1186642895 0:default confchg left 2 joined 0 total 1
1186642895 0:default confchg removed node 1 reason 3
1186642895 0:default confchg removed node 3 reason 3
1186643010 0:default confchg left 0 joined 1 total 2
1186643010 0:default confchg left 0 joined 1 total 3
"Now compare that with the same output from nodes 1 and 3 and you'll see the discrepancy:
from 3: 1186643006 0:default confchg left 0 joined 1 total 1
from 3: 1186643006 0:default confchg left 0 joined 1 total 2
from 3: 1186643010 0:default confchg left 0 joined 1 total 3
node 2 sees a sequence of membership as: 1,2,3 -> 2 -> 2,1 -> 2,1,3
node 3 sees a sequence of membership as: 3 -> 1,3 -> 1,2,3
node 1 sees a sequence of membership as: 1,3 -> 1,2,3
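The divergence in the three nodes' views can be made concrete with a minimal Python model of how each node applies its own confchg events (this is an illustrative sketch, not the actual groupd/cpg code; `apply_confchg` is a hypothetical helper):

```python
# Model of diverging cpg membership views: each node applies only
# the confchg events it saw, so partitioned nodes reconstruct
# different membership histories.

def apply_confchg(view, joined=(), left=()):
    """Return the new membership list after one confchg event."""
    view = [n for n in view if n not in left]
    return view + [n for n in joined if n not in view]

# Node 2 stayed up: it saw 1 and 3 leave, then rejoin one at a time.
node2 = [1, 2, 3]
node2 = apply_confchg(node2, left=(1, 3))   # 1,2,3 -> 2
node2 = apply_confchg(node2, joined=(1,))   # 2 -> 2,1
node2 = apply_confchg(node2, joined=(3,))   # 2,1 -> 2,1,3

# Node 3 restarted from scratch: it never learns that node 2 kept
# state from the old configuration.
node3 = apply_confchg([], joined=(3,))      # -> 3
node3 = apply_confchg(node3, joined=(1,))   # 3 -> 3,1
node3 = apply_confchg(node3, joined=(2,))   # 3,1 -> 3,1,2
```

Both nodes end up with the same three members, but they got there through incompatible histories, which is exactly why the state never synced.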
Created attachment 161197 [details]
Syslog from roth-01, gzipped.
Created attachment 161199 [details]
Syslog from roth-02, gzipped.
Created attachment 161201 [details]
Syslog from roth-03, gzipped.
Adding Patrick to the cc list because of cpg's involvement.
A possible scenario for membership events is:
node1: 1, 2, 3, 4, 5 -> 1, 2, 3 -> 1, 2, 3, 4, 5
node2: 1, 2, 3, 4, 5 -> 1, 2, 3 -> 1, 2, 3, 4, 5
node 3: 1, 2, 3, 4, 5 -> 1, 2, 3 -> 1, 2, 3, 4, 5
node 4: 4 -> 4, 5 -> 1, 2, 3, 4, 5
node 5: 5 -> 1, 2, 3, 4, 5
Each configuration has a different ring id. The way checkpoint handles this
scenario is: if all nodes are transitioning from the same previous configuration,
no synchronization is needed; otherwise, a full synchronization occurs.
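The checkpoint rule just described can be sketched in a few lines (a hypothetical model for illustration, not the openais checkpoint implementation; `needs_full_sync` is an invented name):

```python
# Sketch of the checkpoint synchronization rule: if every member of
# the new configuration arrived from the same previous ring, no state
# transfer is needed; otherwise a full synchronization must run.

def needs_full_sync(prev_configs):
    """prev_configs: each new member's previous ring membership,
    as a frozenset of node ids (empty for a freshly started node)."""
    return len(set(prev_configs)) > 1

# All members transition from the same old ring {1,2,3}: no sync.
same_origin = [frozenset({1, 2, 3})] * 3
uniform_ok = not needs_full_sync(same_origin)

# The comment #0 merge: node 2 arrives from ring {2}, nodes 1 and 3
# from ring {1,3}, so a full synchronization is required.
merge = [frozenset({2}), frozenset({1, 3}), frozenset({1, 3})]
merge_needs_sync = needs_full_sync(merge)
```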
Does groupd handle the scenario outlined above, and if so, how does it differ
from the scenario described in comment #0?
That looks illegal to me; I think node 5 would have to see:
5 -> 4,5 -> 1,2,3,4,5
My initial analysis was wrong; I failed to notice that there were two cluster
partitions. The following now seems to explain it.
cman membership: 1,2,3
cpg membership: 1,2,3
1 and 3 die
1 and 3 come back and form a partitioned, quorate cluster
cman membership partition A: 2
cpg membership in partition A: 2
cman membership partition B: 1,3
cpg membership in partition B: 1,3
When partition B becomes quorate, it will fence the inquorate partition A
(in this case node 2).
Before the nodes in partition B fence node 2, the two partitions merge,
and 1,3 see 2 join the cluster.
The groupd/fenced/dlm/gfs infrastructure above openais assumes that when a
node joins the cluster, it has been newly started and has no existing
cpg/groupd/fenced/dlm/gfs state associated with it.
This is what the ais-only/disallowed state in cman is intended to handle.
When the two partitions merge at the openais level, they should remain
separate at the cman level: 1,3 showing 2 as disallowed, and 2 showing
1,3 as disallowed. Then, 1,3 would fence 2 as intended and things would
continue operating fine.
So, this looks like a case where the cman disallowed-node detection should
have kicked in but didn't.
Whatever the problem is, this is likely a blocker for 5.1, since the result is
total cluster failure.
Patrick, can you look at the logs and see if cman is operating properly for
the disallowed case, as per Dave's assessment?
It looks to me as if during recovery openais has a configuration change (a new
node coming up) which may have disrupted the last round of recovery by either
CPG or cman. Hard to tell without more collaboration on this issue and perhaps
a special debug build of openais.
This is a case that the disallowed code doesn't handle. Because nodes 1 & 3 have
been down, they don't know that node 2 was up while they were down and came
back. So when node 2 tries to join the cluster, they are happy to let it do so,
because they can't tell that the cluster was previously partitioned; as far as
they can tell, node 2 is simply a new node joining the cluster.
It's clearly a flaw in the disallowed code because it can't detect two clusters
coming up partitioned like that and then merging. Steve: is there anything we
can do with ring IDs here?
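The flaw can be illustrated with a toy model of the old check (hypothetical names, not cman code): a restarted node's memory of prior membership is empty, so a stateful survivor is indistinguishable from a brand-new node.

```python
# Why the original disallowed check misses this case: it relies on
# remembering which nodes were cluster members before the partition,
# and a rebooted node has no such memory.

def looks_like_new_node(joiner, previously_seen):
    """Old heuristic: a joiner is suspicious only if we remember
    having been clustered with it before the partition."""
    return joiner not in previously_seen

# Nodes 1 and 3 rebooted, so their memory of node 2 is gone: node 2
# wrongly appears to be a fresh node and is admitted with its state.
seen_after_reboot = set()
wrongly_fresh = looks_like_new_node(2, seen_after_reboot)

# A node that had stayed up the whole time would have caught it.
caught = not looks_like_new_node(2, {1, 2, 3})
```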
add sdake as cc.
Created attachment 161864 [details]
Patch to implement 'dirty' bit
This patch implements an idea we discussed on IRC on Friday. It adds an API
call (to be used by groupd) that sets a bit inside cman to indicate that the
node has state (a "dirty" bit).
Basically, a node that has state cannot join a cluster with another node that
also has state.
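The join rule described above can be sketched as follows (an illustrative model of the rule, not the actual libcman API; `join_allowed` is an invented name):

```python
# Sketch of the "dirty bit" rule: each node advertises whether it
# carries pre-existing groupd state; a dirty node must never join a
# cluster that already contains another dirty node.

def join_allowed(joiner_dirty, members_dirty):
    """Allow the join unless both the joiner and at least one
    existing member carry state."""
    return not (joiner_dirty and any(members_dirty))

# A freshly booted (clean) node joining a running cluster: fine.
clean_join = join_allowed(False, [True, True])

# Stateful node 2 trying to merge into stateful partition {1,3}:
# refused, which forces the partitions to stay separate so the
# quorate side can fence node 2 as intended.
merge_refused = not join_allowed(True, [True, True])
```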
Looks good to me. Hitting this situation naturally is very rare, so we'll
have to force it to happen to test it. Let's commit this on HEAD and I'll
do the same with the groupd change and then we can work on testing it.
I ran into this on the tank cluster while running revolver on GFS. I left the
cluster for inspection. What data needs to be collected for this bug?
To know if you hit this bug or not we'd need to see the groupd logs from
all the nodes: group_tool dump > groupd.txt
I appear to have hit this over the weekend on the taft cluster. I'll attach the
logs from each node.
Created attachment 173681 [details]
Created attachment 173701 [details]
Created attachment 173721 [details]
Created attachment 173741 [details]
Created attachment 173901 [details]
group_tool dump output from all nodes in tank cluster.
The taft nodes in comments 16-19 do look like this bug:
1,2,3,4 are running
2,3,4 are killed
2,3,4 come back and form their own cluster, partitioned from 1
the two partitions (1 vs 2,3,4) then merge
1187993027 0:default confchg left 0 joined 1 total 1
1187993027 0:default process_node_join 3
1187993027 0:default cpg add node 3 total 1
1187993027 0:default confchg left 0 joined 1 total 2
1187993027 0:default process_node_join 2
1187993027 0:default cpg add node 2 total 1
1187993027 0:default cpg add node 3 total 2
1187993027 0:default confchg left 0 joined 1 total 3
1187993027 0:default process_node_join 4
1187993027 0:default cpg add node 4 total 1
1187993027 0:default cpg add node 3 total 2
1187993027 0:default cpg add node 2 total 3
1187992876 0:default confchg left 3 joined 0 total 1
1187992876 0:default confchg removed node 2 reason 3
1187992876 0:default confchg removed node 3 reason 3
1187992876 0:default confchg removed node 4 reason 3
1187993033 0:default confchg left 0 joined 1 total 2
1187993033 0:default process_node_join 2
1187993033 0:default cpg add node 2 total 2
1187993033 0:default confchg left 0 joined 1 total 3
1187993033 0:default process_node_join 3
1187993033 0:default cpg add node 3 total 3
1187993033 0:default confchg left 0 joined 1 total 4
1187993033 0:default process_node_join 4
1187993033 0:default cpg add node 4 total 4
Created attachment 173961 [details]
/var/log/messages from all tank nodes
comments 20 and 22 are not related to this bz; they are related to bug 258121
re: comment #19: so we're saying that the dirty flag isn't working ... and do we
have the patch to groupd in place as well as the cman one that's in this bz?
re comment 24 re comment 19: they are using the unfixed, RHEL5 code. I'm very
confident that the dirty-flag fix in cvs HEAD is the solution to this whole
problem. I think we should sync the fix to the RHEL5 branch so we don't forget
it. I spent some time with iptables trying to simulate the kind of cluster
partition needed to test the fix, but couldn't quite get it right.
Committed to RHEL5 branch.
Checking in cman_tool/main.c;
/cvs/cluster/cluster/cman/cman_tool/main.c,v <-- main.c
new revision: 184.108.40.206; previous revision: 220.127.116.11
Checking in daemon/cnxman-private.h;
/cvs/cluster/cluster/cman/daemon/cnxman-private.h,v <-- cnxman-private.h
new revision: 18.104.22.168; previous revision: 1.26
Checking in daemon/cnxman-socket.h;
/cvs/cluster/cluster/cman/daemon/cnxman-socket.h,v <-- cnxman-socket.h
new revision: 22.214.171.124; previous revision: 1.17
Checking in daemon/commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v <-- commands.c
new revision: 126.96.36.199; previous revision: 188.8.131.52
Checking in lib/libcman.c;
/cvs/cluster/cluster/cman/lib/libcman.c,v <-- libcman.c
new revision: 184.108.40.206; previous revision: 220.127.116.11
Checking in lib/libcman.h;
/cvs/cluster/cluster/cman/lib/libcman.h,v <-- libcman.h
new revision: 18.104.22.168; previous revision: 1.29
This bug has been reported on the linux-cluster mailing list, so we may
need to fix it sooner than 5.2.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.