Red Hat Bugzilla – Bug 251966
merge of openais partitions and disallowed cman nodes
Last modified: 2009-04-16 18:30:07 EDT
Description of problem:
This was discovered while working on bug #251082, and appears in
that bug's comment #9:
I had revolver crash again at 2:00 in the morning. This time it had
different symptoms, and aisexec was still running, so I asked Dave
Teigland to take a look. According to Dave:
<dct> openais/cpg seems to be the source of the problem with the roth nodes
<dct> the problem seems to be in syncing cpg state when nodes join
The scenario was this:
roth-01 and -03 were shot, and -02 was left to pick up the pieces.
Now roth-02 group_tool -v claims:
fence 0 default 00010002 JOIN_STOP_WAIT 1100020001 1
But roth-01 and -03 both claim:
fence 0 default 00010003 JOIN_START_WAIT 2 200030001 1
[1 2 3]
which seems wrong to me. I'll keep the cluster in this state so you
can look at it with gdb, since it's still running. However, I need
my cluster back soon in order to work on other problems.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run revolver long enough.
Actual results:
group_tool -v reports the nodes as being in an abnormal state.
Expected results:
group_tool -v should report no problems.
According to Dave Teigland:
"The problem on those nodes seems to be cpg syncing.
One node was left as a cpg member when two nodes were killed,
then the two nodes came back and joined the cpg, but didn't see the
old node there. So it appears that the cpg state from the one node
that wasn't killed never got synced to the two nodes when they rejoined."
Information was obtained by looking at group_tool dump, at the
"0:default confchg" lines.
"All three nodes are cpg members, 1 and 3 are killed, 2 is the only
one remaining, then 3 comes back, joins the cpg and its confchg says
it's the only member (which it isn't), then 1 comes back, joins the
cpg, and is told that it (1) and 3 are the two members, then finally
1 and 3 are told that 2 is a cpg member (which it was already)."
Here is the output:
1186642340 0:default confchg left 0 joined 1 total 2
1186642344 0:default confchg left 0 joined 1 total 3
1186642465 0:default confchg left 1 joined 0 total 2
1186642465 0:default confchg removed node 1 reason 3
1186642574 0:default confchg left 0 joined 1 total 3
1186642687 0:default confchg left 1 joined 0 total 2
1186642687 0:default confchg removed node 3 reason 3
1186642781 0:default confchg left 0 joined 1 total 3
1186642895 0:default confchg left 2 joined 0 total 1
1186642895 0:default confchg removed node 1 reason 3
1186642895 0:default confchg removed node 3 reason 3
1186643010 0:default confchg left 0 joined 1 total 2
1186643010 0:default confchg left 0 joined 1 total 3
"Now compare that with the same output from nodes 1 and 3 and you'll see the discrepancy:
from 3: 1186643006 0:default confchg left 0 joined 1 total 1
from 3: 1186643006 0:default confchg left 0 joined 1 total 2
from 3: 1186643010 0:default confchg left 0 joined 1 total 3
node 2 sees a sequence of membership as: 1,2,3 -> 2 -> 2,1 -> 2,1,3
node 3 sees a sequence of membership as: 3 -> 1,3 -> 1,2,3
node 1 sees a sequence of membership as: 1,3 -> 1,2,3
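The divergence in the three nodes' views can be made concrete with a minimal Python model of how each node applies its own confchg events (this is an illustrative sketch, not the actual groupd/cpg code; `apply_confchg` is a hypothetical helper):

```python
# Model of diverging cpg membership views: each node applies only
# the confchg events it saw, so partitioned nodes reconstruct
# different membership histories.

def apply_confchg(view, joined=(), left=()):
    """Return the new membership list after one confchg event."""
    view = [n for n in view if n not in left]
    return view + [n for n in joined if n not in view]

# Node 2 stayed up: it saw 1 and 3 leave, then rejoin one at a time.
node2 = [1, 2, 3]
node2 = apply_confchg(node2, left=(1, 3))   # 1,2,3 -> 2
node2 = apply_confchg(node2, joined=(1,))   # 2 -> 2,1
node2 = apply_confchg(node2, joined=(3,))   # 2,1 -> 2,1,3

# Node 3 restarted from scratch: it never learns that node 2 kept
# state from the old configuration.
node3 = apply_confchg([], joined=(3,))      # -> 3
node3 = apply_confchg(node3, joined=(1,))   # 3 -> 3,1
node3 = apply_confchg(node3, joined=(2,))   # 3,1 -> 3,1,2
```

Both nodes end up with the same three members, but they got there through incompatible histories, which is exactly why the state never synced.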
Created attachment 161197 [details]
Syslog from roth-01, gzipped.
Created attachment 161199 [details]
Syslog from roth-02, gzipped.
Created attachment 161201 [details]
Syslog from roth-03, gzipped.
Adding Patrick to the cc list because of cpg's involvement.
A possible scenario for membership events is:
node1: 1, 2, 3, 4, 5 -> 1, 2, 3 -> 1, 2, 3, 4, 5
node2: 1, 2, 3, 4, 5 -> 1, 2, 3 -> 1, 2, 3, 4, 5
node 3: 1, 2, 3, 4, 5 -> 1, 2, 3 -> 1, 2, 3, 4, 5
node 4: 4 -> 4, 5 -> 1, 2, 3, 4, 5
node 5: 5 -> 1, 2, 3, 4, 5
Each configuration has a different ring id. The way checkpoint handles this
scenario is: if all nodes are transitioning from the same previous configuration,
no synchronization is needed; otherwise, a full synchronization occurs.
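The checkpoint rule just described can be sketched in a few lines (a hypothetical model for illustration, not the openais checkpoint implementation; `needs_full_sync` is an invented name):

```python
# Sketch of the checkpoint synchronization rule: if every member of
# the new configuration arrived from the same previous ring, no state
# transfer is needed; otherwise a full synchronization must run.

def needs_full_sync(prev_configs):
    """prev_configs: each new member's previous ring membership,
    as a frozenset of node ids (empty for a freshly started node)."""
    return len(set(prev_configs)) > 1

# All members transition from the same old ring {1,2,3}: no sync.
same_origin = [frozenset({1, 2, 3})] * 3
uniform_ok = not needs_full_sync(same_origin)

# The comment #0 merge: node 2 arrives from ring {2}, nodes 1 and 3
# from ring {1,3}, so a full synchronization is required.
merge = [frozenset({2}), frozenset({1, 3}), frozenset({1, 3})]
merge_needs_sync = needs_full_sync(merge)
```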
Does groupd handle the scenario outlined above, and if so, how does it differ
from the scenario described in comment #0?
That looks illegal to me; I think node 5 would have to see:
5 -> 4,5 -> 1,2,3,4,5
My initial analysis was wrong; I failed to notice that there were two cluster
partitions. The following now seems to explain it.
cman membership: 1,2,3
cpg membership: 1,2,3
1 and 3 die
1 and 3 come back and form a partitioned, quorate cluster
cman membership partition A: 2
cpg membership in partition A: 2
cman membership partition B: 1,3
cpg membership in partition B: 1,3
When partition B becomes quorate, it will fence the inquorate partition A
(in this case node 2).
Before the nodes in partition B fence node 2, the two partitions merge,
and 1,3 see 2 join the cluster.
The groupd/fenced/dlm/gfs infrastructure above openais assumes that when a
node joins the cluster, it has been newly started and has no existing
cpg/groupd/fenced/dlm/gfs state associated with it.
This is what the ais-only/disallowed state in cman is intended to handle.
When the two partitions merge at the openais level, they should remain
separate at the cman level: 1,3 showing 2 as disallowed, and 2 showing
1,3 as disallowed. Then, 1,3 would fence 2 as intended and things would
continue operating fine.
So, this looks like a case where the cman disallowed-node detection should
have kicked in but didn't.
Whatever the problem is, this is likely a blocker for 5.1, since the result is
total cluster failure.
Patrick, can you look at the logs and see if cman is operating properly for
the disallowed case, as per Dave's assessment?
It looks to me as if during recovery openais has a configuration change (a new
node coming up) which may have disrupted the last round of recovery by either
CPG or cman. Hard to tell without more collaboration on this issue and perhaps
a special debug build of openais.
This is a case that the disallowed code doesn't handle. Because nodes 1 & 3 have
been down, they don't know that node 2 was up while they were down and came
back. So when node 2 tries to join the cluster, they are happy to let it do so,
because they can't tell that the cluster was previously partitioned; as far as
they can tell, node 2 is simply a new node joining the cluster.
It's clearly a flaw in the disallowed code because it can't detect two clusters
coming up partitioned like that and then merging. Steve: is there anything we
can do with ring IDs here?
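The flaw can be illustrated with a toy model of the old check (hypothetical names, not cman code): a restarted node's memory of prior membership is empty, so a stateful survivor is indistinguishable from a brand-new node.

```python
# Why the original disallowed check misses this case: it relies on
# remembering which nodes were cluster members before the partition,
# and a rebooted node has no such memory.

def looks_like_new_node(joiner, previously_seen):
    """Old heuristic: a joiner is suspicious only if we remember
    having been clustered with it before the partition."""
    return joiner not in previously_seen

# Nodes 1 and 3 rebooted, so their memory of node 2 is gone: node 2
# wrongly appears to be a fresh node and is admitted with its state.
seen_after_reboot = set()
wrongly_fresh = looks_like_new_node(2, seen_after_reboot)

# A node that had stayed up the whole time would have caught it.
caught = not looks_like_new_node(2, {1, 2, 3})
```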
add sdake as cc.
Created attachment 161864 [details]
Patch to implement 'dirty' bit
This patch implements an idea we discussed on IRC on Friday. It adds an API
call (to be used by groupd) that sets a bit inside cman to indicate that the
node has state (a "dirty" bit).
Basically, a node that has state cannot join a cluster with another node that
also has state.
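The join rule described above can be sketched as follows (an illustrative model of the rule, not the actual libcman API; `join_allowed` is an invented name):

```python
# Sketch of the "dirty bit" rule: each node advertises whether it
# carries pre-existing groupd state; a dirty node must never join a
# cluster that already contains another dirty node.

def join_allowed(joiner_dirty, members_dirty):
    """Allow the join unless both the joiner and at least one
    existing member carry state."""
    return not (joiner_dirty and any(members_dirty))

# A freshly booted (clean) node joining a running cluster: fine.
clean_join = join_allowed(False, [True, True])

# Stateful node 2 trying to merge into stateful partition {1,3}:
# refused, which forces the partitions to stay separate so the
# quorate side can fence node 2 as intended.
merge_refused = not join_allowed(True, [True, True])
```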
Looks good to me. Hitting this situation naturally is very rare, so we'll
have to force it to happen to test it. Let's commit this on HEAD and I'll
do the same with the groupd change and then we can work on testing it.
I ran into this on the tank cluster while running revolver on GFS. I left the
cluster for inspection. What data needs to be collected for this bug?
To know if you hit this bug or not we'd need to see the groupd logs from
all the nodes: group_tool dump > groupd.txt
I appear to have hit this over the weekend on the taft cluster. I'll attach the
logs from each node.
Created attachment 173681 [details]
Created attachment 173701 [details]
Created attachment 173721 [details]
Created attachment 173741 [details]
Created attachment 173901 [details]
group_tool dump output from all nodes in tank cluster.
The taft nodes in comments 16-19 do look like this bug:
1,2,3,4 are running
2,3,4 are killed
2,3,4 come back and form their own cluster, partitioned from 1
the two partitions (1 vs 2,3,4) then merge
1187993027 0:default confchg left 0 joined 1 total 1
1187993027 0:default process_node_join 3
1187993027 0:default cpg add node 3 total 1
1187993027 0:default confchg left 0 joined 1 total 2
1187993027 0:default process_node_join 2
1187993027 0:default cpg add node 2 total 1
1187993027 0:default cpg add node 3 total 2
1187993027 0:default confchg left 0 joined 1 total 3
1187993027 0:default process_node_join 4
1187993027 0:default cpg add node 4 total 1
1187993027 0:default cpg add node 3 total 2
1187993027 0:default cpg add node 2 total 3
1187992876 0:default confchg left 3 joined 0 total 1
1187992876 0:default confchg removed node 2 reason 3
1187992876 0:default confchg removed node 3 reason 3
1187992876 0:default confchg removed node 4 reason 3
1187993033 0:default confchg left 0 joined 1 total 2
1187993033 0:default process_node_join 2
1187993033 0:default cpg add node 2 total 2
1187993033 0:default confchg left 0 joined 1 total 3
1187993033 0:default process_node_join 3
1187993033 0:default cpg add node 3 total 3
1187993033 0:default confchg left 0 joined 1 total 4
1187993033 0:default process_node_join 4
1187993033 0:default cpg add node 4 total 4
Created attachment 173961 [details]
/var/log/messages from all tank nodes
comments 20 and 22 are not related to this bz; they are related to bug 258121
re: comment #19: so we're saying that the dirty flag isn't working ... and do we
have the patch to groupd in place as well as the cman one that's in this bz?
re comment 24 re comment 19: they are using the unfixed, RHEL5 code. I'm very
confident that the dirty-flag fix in cvs HEAD is the solution to this whole
problem. I think we should sync the fix to the RHEL5 branch so we don't forget
it. I spent some time with iptables trying to simulate the kind of cluster
partition needed to test the fix, but couldn't quite get it right.
Committed to RHEL5 branch.
Checking in cman_tool/main.c;
/cvs/cluster/cluster/cman/cman_tool/main.c,v <-- main.c
new revision: 184.108.40.206; previous revision: 220.127.116.11
Checking in daemon/cnxman-private.h;
/cvs/cluster/cluster/cman/daemon/cnxman-private.h,v <-- cnxman-private.h
new revision: 18.104.22.168; previous revision: 1.26
Checking in daemon/cnxman-socket.h;
/cvs/cluster/cluster/cman/daemon/cnxman-socket.h,v <-- cnxman-socket.h
new revision: 22.214.171.124; previous revision: 1.17
Checking in daemon/commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v <-- commands.c
new revision: 126.96.36.199; previous revision: 188.8.131.52
Checking in lib/libcman.c;
/cvs/cluster/cluster/cman/lib/libcman.c,v <-- libcman.c
new revision: 184.108.40.206; previous revision: 220.127.116.11
Checking in lib/libcman.h;
/cvs/cluster/cluster/cman/lib/libcman.h,v <-- libcman.h
new revision: 18.104.22.168; previous revision: 1.29
This bug has been reported on the linux-cluster mailing list, so we may
need to fix it sooner than 5.2.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.