Bug 323931

| Summary: | disallowed nodes and inconsistent cluster views | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Steven Dake <sdake> |
| Component: | cman | Assignee: | Christine Caulfield <ccaulfie> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 5.0 | CC: | cluster-maint, sdake, teigland |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHBA-2008-0347 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-05-21 15:58:03 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 1
Steven Dake
2007-10-09 12:38:09 UTC
If openais fails to send the token within the timeout period on node A, then openais on the other nodes removes A from the membership, and A removes the others similarly. This is a cluster partition. Later, when A *does* manage to forward the token, the membership changes back to the way it was, with A and the other nodes all part of one membership; the partitions have been merged.

The problem is, when A comes back, the other nodes can't tell whether:

1. A is joining the cluster as a new member, having just started up, or
2. A is rejoining the cluster as an old member that was momentarily partitioned and is now merged back.

The cluster software above openais *must* know which of the two possibilities is really the case, because it must take different actions for each.

One way of fixing this is to make openais *not* automatically merge the two partitions (A with the rest) back together after they are partitioned. We discussed this only briefly long ago and decided it would be too unnatural for openais to do this, so we abandoned the idea. I'm assuming that's still the case, which is a pity.

A second way of fixing this is to allow openais to partition and merge at will, as it always has, but have cman, which represents the cluster membership to the higher level software, detect the case above and treat the remerged nodes in case 2 as non-members, i.e. it keeps the partition in place. This is the disallowed nodes case.

A third way of fixing this is to change each of the higher level components to deal with the remerged nodes themselves. This would amount to implementing the disallowed case N times, in N higher level components, instead of once in cman. (Some higher level components may be able to skip the disallowed logic and treat cases 1 and 2 above identically.)

A fourth way of working around this is to raise the timeouts in openais to make the transient partitions extremely unlikely in the first place.
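For the fourth way, raising the timeouts: in RHEL 5 clusters the totem parameters are normally set in cluster.conf (cman passes them down to openais) rather than in openais.conf. A minimal sketch, assuming a longer token timeout suits the hardware; the cluster name and the 21000 ms value are illustrative, not recommendations:

```xml
<!-- Fragment of /etc/cluster/cluster.conf. A larger token timeout makes
     transient partitions less likely, at the cost of slower detection of
     genuinely failed nodes. The 21000 ms value is illustrative only. -->
<cluster name="example" config_version="2">
  <totem token="21000"/>
</cluster>
```

The trade-off is exactly the one described above: the timeout can only make spurious partitions rare, not impossible.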
We've also taken this route to date, along with adding the disallowed logic to cman. The first three approaches are ways of dealing with the transient partitions; the fourth approach is a way to make them rare.

So, what can cause these transient partitions in the first place?

1. openais bugs. This was the cause leading to this specific bz.
2. Network problems.
3. Network driver resets and other driver "bugs".
4. Kernel subsystems hogging all CPUs and not scheduling, e.g. dlm, gfs.

All of these have likely been causes of transient partitions at various times. None of them are very common in general. Most of the time, the default openais timeouts will let us ride over all of them, but occasionally any one of them could be big enough to surpass the timeouts.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

My general problem with how disallowed works currently is that it results in complete cluster failure requiring, at minimum, a reboot of a major subset of the nodes in the cluster. It also does not "self-heal", requiring administrative intervention to continue operation. Further, "what to do" once it happens is not documented. I in fact don't even know how to resolve the disallowed cluster state without rebooting every node in the cluster.

You're missing an alternative, which is that the state is resynchronized each time a processor joins the configuration: openais joins and remerges; cman says "oh, I got a new node, let me tell all other nodes about my state"; all higher level software says "oh, I got a new node, let me reset any state information I had about that node". This is exactly how the event service and checkpoint service work currently.
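The disallowed behaviour being debated here (the second way above) can be sketched as a toy membership model. Everything below is invented for illustration and is not cman's actual code; the key idea is that a node reappearing with the same incarnation it had when it left was merely partitioned, not restarted, so it must not be readmitted as a member.

```python
class ClusterView:
    """Toy model of the 'disallowed' bookkeeping (illustrative only)."""

    def __init__(self):
        self.members = {}        # nodeid -> incarnation currently in cluster
        self.seen_leaving = {}   # nodeid -> incarnation it had when it left
        self.disallowed = set()

    def node_left(self, nodeid):
        """Node disappeared (clean shutdown, crash, or partition)."""
        inc = self.members.pop(nodeid, None)
        if inc is not None:
            self.seen_leaving[nodeid] = inc

    def node_appeared(self, nodeid, incarnation):
        """Node showed up in a new openais configuration."""
        if self.seen_leaving.get(nodeid) == incarnation:
            # Case 2: same incarnation we watched leave, so this is a
            # remerged partition; keep the partition in place.
            self.disallowed.add(nodeid)
        else:
            # Case 1: a fresh incarnation, i.e. the node really restarted.
            self.disallowed.discard(nodeid)
            self.seen_leaving.pop(nodeid, None)
            self.members[nodeid] = incarnation
```

A remerge (same incarnation) lands the node in `disallowed`; a genuine restart (new incarnation) rejoins normally, which is why higher layers never see a merged-back partition as an ordinary member.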
While I realize it is difficult to resynchronize, it can be done and provides superior reliability. This is not some alteration of the cluster membership on top of openais (which is what cman is doing) but instead a change in the way new configurations are handled. If a processor leaves a configuration, any state associated with it could be reset; if a processor joins, its state should be synchronized (if it has no state, then it synchronizes nothing). IMO, the problem is that the current upper level code does not support resynchronization of state. If that is not possible to do, then it should be stated now so we can look at other alternatives.

> my general problem with how disallowed works currently is that it results in
> complete cluster failure requiring at minimum reboot of a major subset of the
> nodes in the cluster.

If you're rebooting some nodes, others are still running. That's not a complete cluster failure.

> It also does not "self-heal" requiring administrative
> intervention to continue operation.

It does "self-heal", unless the partitioning has left no partition with quorum.

> Further "what to do" once it happens is
> not documented. I in fact don't even know how to resolve the disallowed
> cluster state without rebooting every node in the cluster.

In the case that a spurious cluster partition has left all partitions without quorum, then yes, we should have some instructions on how to handle it. The solution is to reboot the nodes in one of the partitions (ideally the partition with fewer nodes). Perhaps a new section in the cman_tool man page.

As an aside, this is difficult to understand because we don't have a way to view the cluster as a whole from a vantage point outside of the cluster. We only have different perspectives of the cluster from the viewpoints of the nodes within it. A person must infer the condition of the whole cluster by piecing together these different views.
It would be nice to design some command-line utility that could run outside the cluster and give an external perspective, one which could also show the current cluster partitions.

> You're missing an alternative, which is that the state is resynchronized
> each time a processor joins the configuration: openais joins and
> remerges; cman says "oh, I got a new node, let me tell all other nodes
> about my state"; all higher level software says "oh, I got a new node,
> let me reset any state information I had about that node".
>
> This is exactly how the event service and checkpoint service work
> currently. While I realize it is difficult to resynchronize, it can be
> done and provides superior reliability.
> ...
> IMO, the problem is that the current upper level code does not support
> resynchronization of state. If that is not possible to do, then it
> should be stated now so we can look at other alternatives.

The problem with dlm/gfs is that the state diverges as soon as the partition occurs. The state is spread all over (from the applications down through the kernel: vm, vfs, gfs, device drivers, etc.), and it's *contradictory*, so it cannot be merged if the nodes suddenly can communicate again. The one and only way for dlm/gfs to resynchronize is to go through the recovery process in the partition that continues running, and to reboot in the other partition.

I'm going to treat this as a documentation request. I will take it upon myself to write a document about this situation (this bit already exists in the cman/openais document I wrote) and how best to recover from it.

I've added a section on disallowed nodes, with ideas for recovery, to the aiscman document at http://people.redhat.com/pcaulfie/docs/aiscman.pdf. As this is really a dev white-paper, it might be handy for it to be included somewhere more user-orientated... any ideas?

Looks good. I think a brief summary of what disallowed nodes are and what to do if they appear would be nice in the cman_tool man page.
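For context, the cman_tool nodes output in question looks roughly like this (node names, incarnation numbers, and timestamps here are invented for illustration; column layout is approximate):

```
Node  Sts   Inc   Joined               Name
   1   M    2372  2008-04-01 12:00:01  node-01.example.com
   2   X       0                       node-02.example.com
   3   d    2380  2008-04-01 12:00:04  node-03.example.com
```

The Sts column carries the per-node state flag: a member, a dead node, or a disallowed node.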
I see that there's currently no section in the man page describing the output of cman_tool nodes; maybe something like that should come first. It would include a definition of the M, X, and d states for nodes. After a person sees a node in the "d" state and reads what that means, they could refer to the section on how to resolve it.

Added that to the cman_tool man page:

```
head:
Checking in cman_tool.8;
/cvs/cluster/cluster/cman/man/cman_tool.8,v  <--  cman_tool.8
new revision: 1.14; previous revision: 1.13
done

-rRHEL5:
Checking in cman_tool.8;
/cvs/cluster/cluster/cman/man/cman_tool.8,v  <--  cman_tool.8
new revision: 1.9.2.5; previous revision: 1.9.2.4
done
```

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html