Bug 323931

Summary:	disallowed nodes and inconsistent cluster views
Product:	Red Hat Enterprise Linux 5	Reporter:	Steven Dake <sdake>
Component:	cman	Assignee:	Christine Caulfield <ccaulfie>
Status:	CLOSED ERRATA	QA Contact:	Cluster QE <mspqa-list>
Severity:	low	Docs Contact:
Priority:	medium
Version:	5.0	CC:	cluster-maint, sdake, teigland
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	RHBA-2008-0347	Doc Type:	Bug Fix
Last Closed:	2008-05-21 15:58:03 UTC	Type:	---

Comment 1 Steven Dake 2007-10-09 12:38:09 UTC

The disallowed state can happen in a cluster if openais fails to schedule for
more than 10 seconds.  Unfortunately this can happen with the RHEL5 kernel
because it does not include full preemption.  The result is total cluster
failure.  Either documentation needs to be written to describe how to recover
from this failure or the disallowed state needs to be totally removed or reworked.

I can make a special patch for openais which will generate the disallowed node
state via openais-cfgtool.  In this way we can come up with a valid method for
properly fixing this issue.

Please note this is not an openais issue but rather a clone of a different
openais bug which can be used to demonstrate this issue.

Comment 2 David Teigland 2007-10-09 14:51:49 UTC

If openais fails to send the token within the timeout period on node A, then
openais on the other nodes removes A from the membership, and A removes the
others similarly.  This is a cluster partition.  Later, when A *does* manage
to forward the token, the membership changes back to the way it was, with A
and the other nodes all a part of one membership; the partitions have been
merged.

The problem is, when A comes back, the other nodes can't tell whether:
1) A is joining the cluster as a new member having just started up, or
2) A is rejoining the cluster as an old member that was momentarily
partitioned and is now merged back.

The cluster software above openais *must* know which of the two possibilities
is really the case, because it must take different actions for each.

One way of fixing this is to make openais *not* automatically merge the
two partitions (A with the rest) back together after they are partitioned.
We discussed this only briefly long ago, and decided it would be too unnatural
for openais to do this, so we abandoned the idea.  I'm assuming that's still
the case, which is a pity.

A second way of fixing this is to allow openais to partition and merge at
will, as it always has, but have cman, which represents the cluster membership
to the higher level software, detect the case above and treat the remerged
nodes in case 2 as non-members, i.e. it keeps the partition in place.
This is the disallowed nodes case.
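
To illustrate the case 1 / case 2 distinction, here is a minimal Python
sketch.  It is not cman's actual code: the function name and the two
"has state" flags are hypothetical, and it assumes each node can report
whether it already holds state from running cluster services.

```python
from enum import Enum


class JoinKind(Enum):
    NEW_MEMBER = "case 1: started up fresh, admit as a member"
    DISALLOWED = "case 2: remerged after a partition, keep it out"


def classify_join(joiner_has_state: bool, we_have_state: bool) -> JoinKind:
    # Hypothetical rule: a node that shows up already carrying cluster state,
    # while we carry state of our own, must have been running in another
    # partition, so its state may contradict ours.
    if joiner_has_state and we_have_state:
        return JoinKind.DISALLOWED
    return JoinKind.NEW_MEMBER


print(classify_join(joiner_has_state=False, we_have_state=True).name)
print(classify_join(joiner_has_state=True, we_have_state=True).name)
```

The first call models a freshly booted node and the second a node that
was momentarily partitioned and has merged back.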

A third way of fixing this is to change each of the higher level components
to deal with the remerged nodes themselves.  This would amount to
implementing the disallowed case N times, in N higher level components,
instead of 1 time in cman.  (Some higher level components may be able to
skip the disallowed logic and treat cases 1 and 2 above identically.)

A fourth way of working around this is to raise the timeouts in openais
to make the transient partitions extremely unlikely in the first place.
We've also taken this route to date, along with adding the disallowed
logic to cman.

The first three approaches are ways of dealing with the transient partitions;
the fourth approach is a way to make them rare.  So, what can cause these
transient partitions in the first place?

1. openais bugs.  This was the cause leading to this specific bz.
2. network problems
3. network driver resets and other driver "bugs"
4. kernel subsystems hogging all CPUs and not scheduling, e.g. dlm, gfs

All of these have likely been causes at various times for transient partitions.
None of them are very common in general.  Most of the time, the default
openais timeouts will let us ride over all of them, but occasionally, any
one of them could be big enough to surpass the timeouts.
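
The effect of raising the timeout can be sketched in a few lines of
Python.  This is a toy model only: the stall durations are invented for
illustration, and the 10 second figure is the one mentioned in comment 1,
not a value read from any configuration.

```python
def stalls_causing_partition(stalls_s, token_timeout_s):
    """Return the stalls long enough for the other nodes to drop this node."""
    return [s for s in stalls_s if s > token_timeout_s]


# Hypothetical stall durations (seconds) seen on one node.
observed = [0.2, 3.0, 12.5]

print(stalls_causing_partition(observed, 10.0))  # [12.5]
print(stalls_causing_partition(observed, 20.0))  # []
```

A longer timeout rides over the 12.5 second stall, at the cost of taking
longer to notice a node that has really failed.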

Comment 3 RHEL Program Management 2007-10-16 03:35:47 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Steven Dake 2007-10-16 06:51:46 UTC

My general problem with how disallowed works currently is that it results in
complete cluster failure, requiring at minimum a reboot of a major subset of the
nodes in the cluster.  It also does not "self-heal", requiring administrative
intervention to continue operation.  Further, "what to do" once it happens is not
documented.  In fact, I don't even know how to resolve the disallowed cluster
state without rebooting every node in the cluster.

You're missing an alternative, which is that the state is resynchronized each
time a processor joins the configuration:
1. openais joins and remerges
2. cman says "oh I got a new node let me tell all other nodes about my state"
3. All higher level software says "oh I got a new node let me reset any state
information I had about that node"

This is exactly how the event service or checkpoint service work currently.
While I realize it is difficult to resynchronize, it can be done and provides
superior reliability results.

This is not some alteration of the cluster membership on top of openais (which
is what cman is doing) but instead a change in the way new configurations are
handled.  If a processor leaves a configuration any state associated with it
could be reset and if a processor joins, its state should be synchronized (if it
has no state, then it synchronizes nothing).
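
A rough Python sketch of this reset-on-leave, synchronize-on-join model
follows.  The `Service` class and its callbacks are hypothetical and are
not the openais API; the sketch assumes each node owns exactly one piece
of state that only it may change.

```python
class Service:
    """Hypothetical upper-level service keeping one value per node."""

    def __init__(self, name):
        self.name = name
        self.own = None   # state owned by this node (None means no state)
        self.view = {}    # node name -> last synchronized state

    def on_leave(self, node):
        # A processor left the configuration: reset any state held about it.
        self.view.pop(node, None)

    def on_join(self, members):
        # Processors joined: every member re-sends its own state.
        # A member with no state synchronizes nothing.
        for m in members:
            if m.own is not None:
                self.view[m.name] = m.own


a, b = Service("A"), Service("B")
a.own, b.own = "a1", "b1"
for s in (a, b):
    s.on_join([a, b])

# Transient partition: each side drops what it knew about the other.
a.on_leave("B")
b.on_leave("A")
b.own = "b2"  # B's state changes while partitioned

# Remerge: state is resynchronized from scratch.
for s in (a, b):
    s.on_join([a, b])

print(a.view == b.view == {"A": "a1", "B": "b2"})
```

This only works because each value has a single owner, so the two sides
can never hold conflicting versions of the same item.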

imo the problem is that the current upper level code does not support
resynchronization of state.  If that is not possible to do, then it should be
stated now so we can look at other alternatives.

Comment 5 David Teigland 2007-10-16 14:59:13 UTC

> My general problem with how disallowed works currently is that it results in
> complete cluster failure, requiring at minimum a reboot of a major subset of the
> nodes in the cluster.

If you're rebooting some nodes, others are still running.  That's not a
complete cluster failure.

> It also does not "self-heal", requiring administrative
> intervention to continue operation.

It does "self-heal" unless the partitioning has left no partition with quorum.

> Further, "what to do" once it happens is
> not documented.  In fact, I don't even know how to resolve the disallowed
> cluster state without rebooting every node in the cluster.

In the case that a spurious cluster partition has left all partitions without
quorum, then yes, we should have some instructions on how to handle it.  The
solution is to reboot nodes in one of the partitions (ideally a partition
with fewer nodes).  Perhaps a new section in the cman_tool man page.
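
The quorum reasoning above can be sketched in Python.  This is a
simplified model, not cman's quorum code: it assumes one vote per node
and simple-majority quorum, and the function name is hypothetical.

```python
def plan_recovery(partitions):
    """Decide what happens after a partition.

    partitions: list of sets of node names, one set per partition.
    Assumes one vote per node and quorum = floor(total / 2) + 1.
    """
    total_votes = sum(len(p) for p in partitions)
    needed = total_votes // 2 + 1
    for p in partitions:
        if len(p) >= needed:
            # A quorate partition exists: it carries on without help.
            return ("continue", set(p))
    # No partition has quorum: reboot the nodes of the smallest partition
    # (ties go to the first one listed).
    return ("reboot", set(min(partitions, key=len)))


print(plan_recovery([{"A", "B", "C"}, {"D"}])[0])       # continue
print(plan_recovery([{"A", "B"}, {"C", "D"}])[0])       # reboot
```

In the second call a four-node cluster has split evenly, so neither half
reaches the three votes needed and an administrator has to step in.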

As an aside, this is difficult to understand because we don't have a way
to view the cluster-as-a-whole from a vantage point outside of the cluster.
We only have different perspectives of the cluster from the viewpoint of
nodes within it. A person must infer the condition of the whole cluster
by piecing together these different views.  It would be nice to design
some command-line based utility that could run outside the cluster and give
an external perspective, which could also show current cluster partitions.
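
As a rough idea of what such a utility would compute, here is a Python
sketch.  How the per-node views are collected is left out, and the input
format (a mapping from each node to the set of nodes it currently counts
as members) is an assumption made for illustration.

```python
def infer_partitions(views):
    """Group nodes into partitions from their individual membership views.

    views: dict mapping node name -> set of node names it sees as members.
    Nodes in the same partition are assumed to report identical views.
    """
    distinct = {frozenset(members) for members in views.values()}
    return sorted(sorted(p) for p in distinct)


# Hypothetical views gathered from three nodes.
views = {
    "A": {"A"},
    "B": {"B", "C"},
    "C": {"B", "C"},
}

print(infer_partitions(views))  # [['A'], ['B', 'C']]
```

A real tool would also need to cope with nodes that cannot be reached
and with views that overlap without being identical.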

> You're missing an alternative, which is that the state is resynchronized
> each time a processor joins the configuration: openais joins and
> remerges; cman says "oh I got a new node let me tell all other nodes
> about my state"; all higher level software says "oh I got a new node let
> me reset any state information I had about that node"
>
> This is exactly how the event service or checkpoint service work
> currently.  While I realize it is difficult to resynchronize, it can be
> done and provides superior reliability results.
...
> imo the problem is that the current upper level code does not support
> resynchronization of state.  If that is not possible to do, then it
> should be stated now so we can look at other alternatives.

The problem with dlm/gfs is that the state diverges as soon as the
partition occurs.  The state is spread all over (from the applications
and throughout the kernel vm, vfs, gfs, device drivers, etc), and it's
*contradictory*, so it cannot be merged if suddenly the nodes can
communicate again.  The one and only way for dlm/gfs to resynchronize
is to go through the recovery process in the partition that continues
running and to reboot in the other partition.
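
A toy Python model of why such state cannot simply be merged is shown
below.  It is far simpler than real dlm/gfs state: each partition is
reduced to a hypothetical table mapping a resource name to the node it
believes holds the exclusive lock.

```python
def merge_lock_tables(left, right):
    """Try to merge two partitions' views of exclusive lock ownership.

    Returns (merged, conflicts).  On a conflict the left holder is kept in
    merged, and the clash is recorded as (resource, left_holder, right_holder).
    """
    merged, conflicts = dict(left), []
    for resource, holder in right.items():
        if resource in merged and merged[resource] != holder:
            conflicts.append((resource, merged[resource], holder))
        else:
            merged[resource] = holder
    return merged, conflicts


# A still believes it holds the lock it had before the partition, while
# the other partition has recovered that lock and granted it to B.
partition_a = {"inode:17": "A"}
partition_rest = {"inode:17": "B", "inode:42": "C"}

merged, conflicts = merge_lock_tables(partition_a, partition_rest)
print(conflicts)  # [('inode:17', 'A', 'B')]
```

There is no correct winner for the conflicting entry, since both holders
may already have acted on the lock.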

Comment 6 Christine Caulfield 2007-10-30 14:06:42 UTC

I'm going to treat this as a documentation request.

I will take it upon myself to write a document about this situation (this bit
already exists in the cman/openais document I wrote) and how to best recover
from it.

Comment 7 Christine Caulfield 2007-11-07 09:45:57 UTC

I've added a section on disallowed nodes with ideas for recovery to the aiscman
document at http://people.redhat.com/pcaulfie/docs/aiscman.pdf

As this is really a dev white-paper it might be handy for it to be included
somewhere more user-orientated...any ideas?

Comment 8 David Teigland 2007-11-07 14:40:35 UTC

Looks good.  I think a brief summary of what disallowed nodes are and what to
do if they appear may be nice in the cman_tool man page.  I see that
there's currently no section in the man page describing the output of
cman_tool nodes; maybe something like that should come first.  It would include
a definition of the M, X, d states for nodes.  After a person sees a node in
the "d" state, and reads what that means, they could refer to the section on
how they can resolve it.

Comment 9 Christine Caulfield 2007-11-08 09:39:40 UTC

Added that to the cman_tool man page

head:
Checking in cman_tool.8;
/cvs/cluster/cluster/cman/man/cman_tool.8,v  <--  cman_tool.8
new revision: 1.14; previous revision: 1.13
done

-rRHEL5:
Checking in cman_tool.8;
/cvs/cluster/cluster/cman/man/cman_tool.8,v  <--  cman_tool.8
new revision: 1.9.2.5; previous revision: 1.9.2.4
done

Comment 13 errata-xmlrpc 2008-05-21 15:58:03 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html