Bug 251966 - merge of openais partitions and disallowed cman nodes
Summary: merge of openais partitions and disallowed cman nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Christine Caulfield
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 443358
 
Reported: 2007-08-13 18:01 UTC by Robert Peterson
Modified: 2009-04-16 22:30 UTC
CC: 6 users

Fixed In Version: RHBA-2008-0347
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 15:57:29 UTC
Target Upstream Version:
Embargoed:


Attachments
Syslog from roth-01, gzipped. (3.61 MB, application/octet-stream) - 2007-08-13 18:16 UTC, Robert Peterson
Syslog from roth-02, gzipped. (3.34 MB, application/octet-stream) - 2007-08-13 18:17 UTC, Robert Peterson
Syslog from roth-03, gzipped. (3.20 MB, application/octet-stream) - 2007-08-13 18:18 UTC, Robert Peterson
Patch to implement 'dirty' bit (7.67 KB, patch) - 2007-08-20 12:08 UTC, Christine Caulfield
taft-01 (1.00 MB, text/plain) - 2007-08-27 15:09 UTC, Corey Marthaler
taft-02 (1.00 MB, text/plain) - 2007-08-27 15:11 UTC, Corey Marthaler
taft-03 (1.00 MB, text/plain) - 2007-08-27 15:12 UTC, Corey Marthaler
taft-04 (1.00 MB, text/plain) - 2007-08-27 15:13 UTC, Corey Marthaler
group_tool dump output from all nodes in tank cluster. (37.57 KB, application/x-gzip) - 2007-08-27 15:29 UTC, Nate Straz
/var/log/messages from all tank nodes (1.37 MB, application/x-gzip) - 2007-08-27 16:11 UTC, Nate Straz


Links
Red Hat Product Errata RHBA-2008:0347 (SHIPPED_LIVE): cman bug fix and enhancement update - last updated 2008-05-20 12:39:41 UTC

Description Robert Peterson 2007-08-13 18:01:48 UTC
Description of problem:
This was discovered while working on bug #251082, and appears in
that bug's comment #9:

I had revolver crash again at 2:00 in the morning.  This time it had
different symptoms, and aisexec was still running, so I asked Dave
Teigland to take a look.  According to Dave:

<dct> openais/cpg seems to be the source of the problem with the roth nodes
<dct> the problem seems to be in syncing cpg state when nodes join

The scenario was this:
roth-01 and -03 were shot, and -02 was left to pick up the pieces.
Now roth-02 group_tool -v claims:

fence            0     default  00010002 JOIN_STOP_WAIT 1100020001 1
[1 2]

But roth-01 and -03 both claim:

fence            0     default  00010003 JOIN_START_WAIT 2 200030001 1
[1 2 3]

which seems wrong to me.  I'll keep the cluster in this state so you
can look at it with gdb, since it's still running.  However, I need
my cluster back soon in order to work on other problems.

Version-Release number of selected component (if applicable):
5.1 beta

How reproducible:
Unknown

Steps to Reproduce:
1. Run revolver long enough
  
Actual results:
group_tool -v reports the nodes being in an abnormal state.

Expected results:
group_tool -v should report no problems.

Additional info:

According to Dave Teigland:

"The problem on those nodes seems to be cpg syncing.
One node was left as a cpg member when two nodes were killed,
then the two nodes came back and joined the cpg, but didn't see the
old node there.  So it appears that the cpg state from the one node
that wasn't killed never got synced to the two nodes when they rejoined."

Information was obtained by looking at group_tool dump, at the
"0:default confchg" lines.

"All three nodes are cpg members, 1 and 3 are killed, 2 is the only
one remaining, then 3 comes back, joins the cpg and its confchg says
it's the only member (which it isn't), then 1 comes back, joins the
cpg, and is told that it (1) and 3 are the two members, then finally
1 and 3 are told that 2 is a cpg member (which it was already)."

Here is the output:
1186642340 0:default confchg left 0 joined 1 total 2
1186642344 0:default confchg left 0 joined 1 total 3
1186642465 0:default confchg left 1 joined 0 total 2
1186642465 0:default confchg removed node 1 reason 3
1186642574 0:default confchg left 0 joined 1 total 3
1186642687 0:default confchg left 1 joined 0 total 2
1186642687 0:default confchg removed node 3 reason 3
1186642781 0:default confchg left 0 joined 1 total 3
1186642895 0:default confchg left 2 joined 0 total 1
1186642895 0:default confchg removed node 1 reason 3
1186642895 0:default confchg removed node 3 reason 3
1186643010 0:default confchg left 0 joined 1 total 2
1186643010 0:default confchg left 0 joined 1 total 3

"Now compare that with the same from nodes 1 and 3 and you'll see the
inconsistency."

from 3: 1186643006 0:default confchg left 0 joined 1 total 1
from 3: 1186643006 0:default confchg left 0 joined 1 total 2
from 3: 1186643010 0:default confchg left 0 joined 1 total 3

node 2 sees a sequence of membership as:  1,2,3 -> 2 -> 2,1 -> 2,1,3
node 3 sees a sequence of membership as:  3 -> 1,3 -> 1,2,3
node 1 sees a sequence of membership as:  1,3 -> 1,2,3

Comment 1 Robert Peterson 2007-08-13 18:16:35 UTC
Created attachment 161197 [details]
Syslog from roth-01, gzipped.

Comment 2 Robert Peterson 2007-08-13 18:17:33 UTC
Created attachment 161199 [details]
Syslog from roth-02, gzipped.

Comment 3 Robert Peterson 2007-08-13 18:18:29 UTC
Created attachment 161201 [details]
Syslog from roth-03, gzipped.

Comment 4 Robert Peterson 2007-08-13 18:20:21 UTC
Adding Patrick to the cc list because of cpg's involvement.


Comment 5 Steven Dake 2007-08-13 18:49:05 UTC
A possible scenario for membership events is:

5 nodes
node 1: 1, 2, 3, 4, 5 -> 1, 2, 3 -> 1, 2, 3, 4, 5
node 2: 1, 2, 3, 4, 5 -> 1, 2, 3 -> 1, 2, 3, 4, 5
node 3: 1, 2, 3, 4, 5 -> 1, 2, 3 -> 1, 2, 3, 4, 5
node 4: 4 -> 4, 5 -> 1, 2, 3, 4, 5
node 5: 5 -> 1, 2, 3, 4, 5

Each configuration has a different ring ID.  The way checkpoint handles this
scenario is: if all nodes are transitioning from the same previous
configuration, no synchronization is needed; otherwise a full synchronization
occurs.
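
As an illustration, here is a minimal C sketch of that decision rule. The
names (struct member, prev_ring_id, full_sync_needed) are hypothetical, not
the openais checkpoint source:

#include <stdbool.h>
#include <stddef.h>

struct member {
    unsigned int nodeid;
    unsigned int prev_ring_id;   /* ring ID of the configuration this node left */
};

/* Return true if a full state sync is required after a config change:
 * any member arriving from a different previous configuration means
 * the partitions may hold divergent state. */
static bool full_sync_needed(const struct member *members, size_t count)
{
    for (size_t i = 1; i < count; i++) {
        if (members[i].prev_ring_id != members[0].prev_ring_id)
            return true;
    }
    return false;   /* everyone came from the same ring: state already agrees */
}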

Does groupd handle the scenario outlined above, and if so, how does it differ
from the scenario described in comment #0?

Comment 6 David Teigland 2007-08-13 19:15:19 UTC
That looks illegal to me; I think node 5 would have to see:
5 -> 4,5 -> 1,2,3,4,5



Comment 7 David Teigland 2007-08-13 19:19:33 UTC
My initial analysis was wrong; I failed to notice that there were two cluster
partitions.  The following now seems to explain it.

cman membership: 1,2,3
cpg membership: 1,2,3

1 and 3 die

1 and 3 come back and form partitioned, quorate cluster

cman membership partition A: 2
cpg membership in partition A: 2

cman membership partition B: 1,3
cpg membership in partition B: 1,3

When partition B becomes quorate, it will fence the inquorate partition A 
(in this case node 2).

Before the nodes in partition B fence node 2, the two partitions merge,
and 1,3 see 2 join the cluster.

The groupd/fenced/dlm/gfs infrastructure above openais assumes that when a
node joins the cluster, it has been newly started and has no existing
cpg/groupd/fenced/dlm/gfs state associated with it.

This is what the ais-only/disallowed state in cman is intended to handle.
When the two partitions merge at the openais level, they should remain
separate at the cman level:  1,3 showing 2 as disallowed, and 2 showing
1,3 as disallowed.  Then, 1,3 would fence 2 as intended and things would
continue operating fine.

So, this looks like a case where the cman disallowed-node detection should
have kicked in but didn't.
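
As a concrete illustration, a minimal C sketch of that rule follows; the
names (known_node, on_node_added, NS_DISALLOWED) are hypothetical, not the
cman source:

enum node_state { NS_MEMBER, NS_GONE, NS_DISALLOWED };

struct known_node {
    int nodeid;
    enum node_state state;   /* our local view of this node */
};

/* Invoked when openais delivers a configuration change adding a node. */
static void on_node_added(struct known_node *n)
{
    if (n->state == NS_MEMBER) {
        /* We never saw this node leave, yet it is "joining": the
         * partitions merged, and the joiner carries stale state.
         * Keep it out of the cman membership rather than admit it. */
        n->state = NS_DISALLOWED;
        return;
    }
    n->state = NS_MEMBER;   /* genuinely new or cleanly restarted node */
}

/* In the scenario above the check cannot fire: nodes 1 and 3 were
 * rebooted, so they hold no record of node 2 ever being a member. */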


Comment 8 Steven Dake 2007-08-14 00:30:11 UTC
Whatever the problem is, this is likely a blocker for 5.1, since the result is
total cluster failure.

Patrick can you look at the logs and see if things are operating properly for
cman for the disallowed case as per Dave's assessment?

It looks to me as if, during recovery, openais had a configuration change (a
new node coming up) which may have disrupted the last round of recovery by
either CPG or cman.  Hard to tell without more collaboration on this issue and
perhaps a special debug build of openais.

Comment 9 Christine Caulfield 2007-08-17 09:50:03 UTC
This is a case that the disallowed code doesn't handle. Because nodes 1 & 3
went down and came back, they don't know that node 2 was up while they were
gone. So when it tries to join the cluster, they are happy to let it do so:
they can't tell that the cluster was previously partitioned. As far as they
can tell, node 2 is simply a new node joining the cluster.

It's clearly a flaw in the disallowed code because it can't detect two clusters
coming up partitioned like that and then merging. Steve: is there anything we
can do with ring IDs here?

Comment 10 Steven Dake 2007-08-17 14:02:18 UTC
add sdake as cc.

Comment 11 Christine Caulfield 2007-08-20 12:08:19 UTC
Created attachment 161864 [details]
Patch to implement 'dirty' bit

This patch implements an idea we discussed on IRC on Friday. It adds an API
call (to be used by groupd) that sets a bit inside cman indicating that the
node has state (a "dirty" bit).

Basically, a node that has state cannot join a cluster with another node that
also has state.
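
For illustration, here is a minimal sketch of how a daemon such as groupd
would use the new call, assuming the libcman entry point is named
cman_set_dirty() per the committed change; the skeleton around it is
illustrative, not groupd source:

#include <stdio.h>
#include <libcman.h>

int main(void)
{
    cman_handle_t ch = cman_init(NULL);   /* connect to the cman daemon */
    if (!ch) {
        perror("cman_init");
        return 1;
    }

    /* Declare that this node now holds cpg/groupd state.  From here
     * on, cman must refuse to merge us with another node that is
     * also dirty, since that node belongs to a divergent partition. */
    if (cman_set_dirty(ch) < 0)
        perror("cman_set_dirty");

    cman_finish(ch);
    return 0;
}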

Comment 12 David Teigland 2007-08-20 14:15:58 UTC
Looks good to me.  Hitting this situation naturally is very rare, so we'll
have to force it to happen to test it.  Let's commit this on HEAD and I'll
do the same with the groupd change and then we can work on testing it.


Comment 13 Nate Straz 2007-08-25 13:48:25 UTC
I ran into this on the tank cluster while running revolver on GFS.  I left the
cluster for inspection.  What data needs to be collected for this bug?

Comment 14 David Teigland 2007-08-27 14:51:30 UTC
To know if you hit this bug or not we'd need to see the groupd logs from
all the nodes: group_tool dump > groupd.txt

Comment 15 Corey Marthaler 2007-08-27 15:06:02 UTC
I appear to have hit this over the weekend on the taft cluster. I'll attach the
groupd logs...

Comment 16 Corey Marthaler 2007-08-27 15:09:51 UTC
Created attachment 173681 [details]
taft-01

Comment 17 Corey Marthaler 2007-08-27 15:11:00 UTC
Created attachment 173701 [details]
taft-02

Comment 18 Corey Marthaler 2007-08-27 15:12:10 UTC
Created attachment 173721 [details]
taft-03

Comment 19 Corey Marthaler 2007-08-27 15:13:35 UTC
Created attachment 173741 [details]
taft-04

Comment 20 Nate Straz 2007-08-27 15:29:26 UTC
Created attachment 173901 [details]
group_tool dump output from all nodes in tank cluster.

Comment 21 David Teigland 2007-08-27 16:00:21 UTC
The taft nodes in comments 16-19 do look like this bug:
1,2,3,4 are running
2,3,4 are killed
2,3,4 come back and form their own cluster, partitioned from 1
the two partitions (1 vs 2,3,4) then merge

03

1187993027 0:default confchg left 0 joined 1 total 1
1187993027 0:default process_node_join 3
1187993027 0:default cpg add node 3 total 1

02

1187993027 0:default confchg left 0 joined 1 total 2
1187993027 0:default process_node_join 2
1187993027 0:default cpg add node 2 total 1
1187993027 0:default cpg add node 3 total 2

04

1187993027 0:default confchg left 0 joined 1 total 3
1187993027 0:default process_node_join 4
1187993027 0:default cpg add node 4 total 1
1187993027 0:default cpg add node 3 total 2
1187993027 0:default cpg add node 2 total 3

01

1187992876 0:default confchg left 3 joined 0 total 1
1187992876 0:default confchg removed node 2 reason 3
1187992876 0:default confchg removed node 3 reason 3
1187992876 0:default confchg removed node 4 reason 3

1187993033 0:default confchg left 0 joined 1 total 2
1187993033 0:default process_node_join 2
1187993033 0:default cpg add node 2 total 2

1187993033 0:default confchg left 0 joined 1 total 3
1187993033 0:default process_node_join 3
1187993033 0:default cpg add node 3 total 3

1187993033 0:default confchg left 0 joined 1 total 4
1187993033 0:default process_node_join 4
1187993033 0:default cpg add node 4 total 4


Comment 22 Nate Straz 2007-08-27 16:11:34 UTC
Created attachment 173961 [details]
/var/log/messages from all tank nodes

Comment 23 David Teigland 2007-08-27 21:27:43 UTC
comments 20 and 22 are not related to this bz; they are related to bug 258121


Comment 24 Christine Caulfield 2007-08-28 07:56:33 UTC
Re: comment #19: so we're saying that the dirty flag isn't working ... and do
we have the patch to groupd in place, as well as the cman one that's in this bz?

Comment 25 David Teigland 2007-08-28 14:15:11 UTC
Re comment 24 / comment 19: they are using the unfixed RHEL5 code.  I'm very
confident that the dirty-flag fix in cvs HEAD is the solution to this whole
problem.  I think we should sync the fix to the RHEL5 branch so we don't forget
it.  I spent some time with iptables trying to simulate the kind of cluster
partition needed to test the fix, but couldn't quite get it right.


Comment 26 Christine Caulfield 2007-09-17 13:25:50 UTC
Committed to RHEL5 branch.
Checking in cman_tool/main.c;
/cvs/cluster/cluster/cman/cman_tool/main.c,v  <--  main.c
new revision: 1.51.2.2; previous revision: 1.51.2.1
done
Checking in daemon/cnxman-private.h;
/cvs/cluster/cluster/cman/daemon/cnxman-private.h,v  <--  cnxman-private.h
new revision: 1.26.2.1; previous revision: 1.26
done
Checking in daemon/cnxman-socket.h;
/cvs/cluster/cluster/cman/daemon/cnxman-socket.h,v  <--  cnxman-socket.h
new revision: 1.17.2.1; previous revision: 1.17
done
Checking in daemon/commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.55.2.10; previous revision: 1.55.2.9
done
Checking in lib/libcman.c;
/cvs/cluster/cluster/cman/lib/libcman.c,v  <--  libcman.c
new revision: 1.30.2.4; previous revision: 1.30.2.3
done
Checking in lib/libcman.h;
/cvs/cluster/cluster/cman/lib/libcman.h,v  <--  libcman.h
new revision: 1.29.2.1; previous revision: 1.29
done


Comment 27 David Teigland 2007-09-28 16:57:45 UTC
This bug has been reported on the linux-cluster mailing list, so we may
need to fix it sooner than 5.2.

https://www.redhat.com/archives/linux-cluster/2007-September/msg00257.html


Comment 30 errata-xmlrpc 2008-05-21 15:57:29 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html


