Bug 122364 - [PATCH] Allow loss of quorum in clean-exit cases w/o STONITH devices
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: clumanager
Version: 3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2004-05-03 18:35 UTC by Lon Hohberger
Modified: 2009-04-16 20:14 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2004-06-25 21:22:14 UTC
Embargoed:


Attachments
Allows clean transitions. (2.00 KB, patch)
2004-05-03 18:36 UTC, Lon Hohberger


Links
System: Red Hat Product Errata
ID: RHEA-2004:239
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated clumanager package adds support for GFS mounts
Last Updated: 2004-06-25 04:00:00 UTC

Description Lon Hohberger 2004-05-03 18:35:26 UTC
The "problem":  When the cluster loses enough members to lose quorum
(that is, the remaining members no longer form a majority) and no power
switches are configured, all remaining members reboot.
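
As a rough illustration of the majority rule mentioned above, the following Python sketch checks whether the members still online form a quorum. The name `has_quorum` and the strict-majority rule (more than half of the configured members) are assumptions made for illustration; clumanager's actual calculation is not shown in this report and may differ, and tie-breakers are not modeled.

```python
def has_quorum(total_members: int, online_members: int) -> bool:
    """Return True if the online members form a strict majority.

    Assumption: quorum means more than half of the configured members.
    """
    if total_members <= 0:
        raise ValueError("total_members must be positive")
    if not 0 <= online_members <= total_members:
        raise ValueError("online_members must be between 0 and total_members")
    # Integer comparison avoids floating-point division.
    return 2 * online_members > total_members
```

Under this assumption, a three-member cluster keeps quorum with two members online and loses it when only one remains.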


Why:  Dissolution of the cluster quorum (majority set) is
non-deterministic.  When a node loses communication with the cluster
quorum, it does not care _why_, only that it did.  When a node which
has no fencing devices (= power switches) loses communication with the
cluster quorum, it reboots immediately since no one can fence it.  If,
however, it has fencing devices configured, the member does not
reboot.  In this case, it stops services immediately and waits (a) to
be shot or (b) to regain communication with the cluster quorum.
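
The pre-patch reaction described above can be sketched as a small decision function. This is a hypothetical Python model, not clumanager code: the names `Action` and `react_to_lost_quorum` are invented for illustration, and the real daemon is not written in Python.

```python
from enum import Enum


class Action(Enum):
    REBOOT = "reboot immediately"
    STOP_AND_WAIT = "stop services and wait"


def react_to_lost_quorum(has_fencing_devices: bool) -> Action:
    """Model the pre-patch reaction of a member that lost the quorum."""
    if has_fencing_devices:
        # Another member can fence this node, so it stops its services and
        # waits to be shot or to regain communication with the quorum.
        return Action.STOP_AND_WAIT
    # No one can fence this node, so it reboots itself.
    return Action.REBOOT
```

Note that the reason for the loss never enters the decision, which is why a clean shutdown of other members triggers the same reboot as a failure.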


The fix:  Make quorum transitions semi-deterministic.  Don't reboot if
we were expecting to lose the set of nodes which are now reported down.
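
A minimal sketch of this idea in Python, assuming each member tracks the set of nodes that announced a clean exit. The names `Outcome`, `decide_on_quorum_loss`, and `expected_down` are hypothetical, and how the attached patch records expected departures is not shown here. The fencing branch simply restates the existing behavior for members with power switches.

```python
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    LOG_QUORUM_LOST = "log 'Quorum Lost' and keep running"
    REBOOT = "reboot immediately"
    STOP_AND_WAIT = "stop services and wait"


def decide_on_quorum_loss(
    down_nodes: Iterable[str],
    expected_down: Iterable[str],
    has_fencing_devices: bool,
) -> Outcome:
    """Decide how a member reacts once quorum has been lost."""
    if has_fencing_devices:
        # Unchanged case: wait to be fenced or to rejoin the quorum.
        return Outcome.STOP_AND_WAIT
    # Assumption: a transition is "clean" only if every node now reported
    # down had announced its exit beforehand.
    if set(down_nodes) <= set(expected_down):
        return Outcome.LOG_QUORUM_LOST
    # At least one node disappeared unexpectedly, so reboot as before.
    return Outcome.REBOOT
```

For example, if nodes `b` and `c` both stopped cleanly, the remaining member only logs the loss; if `c` vanished without notice, it still reboots.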


This fix doesn't affect cluster communications or the handling of node
failures, and could reduce support calls on larger clusters.

This fix is, at the moment, untested, but theoretically correct.

Comment 1 Lon Hohberger 2004-05-03 18:36:12 UTC
Created attachment 99921
Allows clean transitions.

Patch untested; it compiles.

Comment 2 Lon Hohberger 2004-05-03 18:47:52 UTC
Unit tested; the patch works.

Comment 3 Lon Hohberger 2004-05-03 18:56:21 UTC
Unit tests: These require a cluster with 3 or more members and no
fencing devices (STONITH drivers, power controllers) configured.

I. Old Behavior - "Clean" quorum dissolution
 1. Start cluster software on all members (service clumanager start)
 2. Stop enough members to constitute loss of quorum using clean
shutdown procedure (service clumanager stop).
 3. All remaining members should reboot.

II.  New Behavior - "Clean" quorum dissolution
 1. Ditto.
 2. Ditto.
 3. All remaining members should report "Quorum Lost" at <emerg> log
level.

III.  Old & New Behavior - Unclean quorum dissolution
 1. Ditto.
 2. Dissolve the cluster quorum forcefully using "killall -9
clumembd", "reboot -fn", or a cable pull on enough members to break
the majority requirement.
 3. All remaining members should reboot.


Comment 4 Mike McLean 2004-06-25 21:22:14 UTC
An erratum has been issued that should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2004-239.html


Comment 5 Lon Hohberger 2007-12-21 15:09:45 UTC
Fixing product name.  Clumanager on RHEL3 was part of RHCS3, not RHEL3.

