Bug 122364 - [PATCH] Allow loss of quorum in clean-exit cases w/o STONITH devices
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: clumanager
Version: 3
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Lon Hohberger
Keywords: FutureFeature
Reported: 2004-05-03 14:35 EDT by Lon Hohberger
Modified: 2009-04-16 16:14 EDT (History)
Doc Type: Enhancement
Last Closed: 2004-06-25 17:22:14 EDT


Attachments
Allows clean transitions. (2.00 KB, patch)
2004-05-03 14:36 EDT, Lon Hohberger

Description Lon Hohberger 2004-05-03 14:35:26 EDT
The "problem":  When the cluster loses enough members to lose quorum
(that is, it no longer has a majority) and no power switches are
configured, all remaining members reboot.


Why:  Dissolution of the cluster quorum (majority set) is
non-deterministic.  When a node loses communication with the cluster
quorum, it does not care _why_, only that it did.  When a node which
has no fencing devices (= power switches) loses communication with the
cluster quorum, it reboots immediately since no one can fence it.  If,
however, it has fencing devices configured, the member does not
reboot.  In this case, it stops services immediately and waits (a) to
be shot or (b) to regain communication with the cluster quorum.


The fix:  Make quorum transitions semi-deterministic.  Don't reboot if
we were expecting to lose the set of nodes which are now reported down.
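
The decision described above can be sketched roughly as follows.  This
is an illustration only, not the attached patch: the function names,
the Action values, and the idea of tracking an "expected_down" set of
nodes that announced a clean exit are all assumptions made for the
example, as is the simple strict-majority quorum rule.

```python
from enum import Enum


class Action(Enum):
    NONE = "quorum retained; nothing to do"
    STOP_AND_WAIT = "stop services; wait to be fenced or to rejoin"
    LOG_QUORUM_LOST = "log 'Quorum Lost' at emerg level; do not reboot"
    REBOOT = "reboot immediately"


def has_quorum(members_up, members_total):
    # Assumption: quorum is a strict majority of configured members.
    return members_up > members_total // 2


def on_membership_change(members_total, members_up, down_nodes,
                         expected_down, fencing_configured):
    """Decide what a surviving member does after a membership change.

    down_nodes:    set of nodes now reported down
    expected_down: set of nodes that announced a clean exit
                   (hypothetical bookkeeping for this sketch)
    """
    if has_quorum(members_up, members_total):
        return Action.NONE
    if fencing_configured:
        # Existing behavior with fencing devices: no reboot.
        return Action.STOP_AND_WAIT
    if down_nodes <= expected_down:
        # Proposed behavior: every missing node left cleanly.
        return Action.LOG_QUORUM_LOST
    # Existing behavior without fencing: nobody can fence us.
    return Action.REBOOT
```

Only the third branch is new; an unexpected loss of any node still
takes the reboot path when no fencing devices are configured.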


This fix doesn't affect cluster communications or the handling of node
failures, and could reduce support calls on larger clusters.

This fix is, at the moment, untested, but theoretically correct.
Comment 1 Lon Hohberger 2004-05-03 14:36:12 EDT
Created attachment 99921 [details]
Allows clean transitions.

Patch untested; it compiles.
Comment 2 Lon Hohberger 2004-05-03 14:47:52 EDT
Unit tested; the patch works.
Comment 3 Lon Hohberger 2004-05-03 14:56:21 EDT
Unit tests:  These require a cluster with 3 or more members and no
fencing devices (STONITH drivers, power controllers) configured.

I. Old Behavior - "Clean" quorum dissolution
 1. Start cluster software on all members (service clumanager start)
 2. Stop enough members to constitute loss of quorum using clean
shutdown procedure (service clumanager stop).
 3. All remaining members should reboot.

II.  New Behavior - "Clean" quorum dissolution
 1. Same as I.1.
 2. Same as I.2.
 3. All remaining members should report "Quorum Lost" at <emerg> log
level.

III.  Old & New Behavior - Unclean quorum dissolution
 1. Same as I.1.
 2. Dissolve the cluster quorum forcefully using "killall -9
clumembd", "reboot -fn", or a cable pull on enough members to break
the majority requirement.
 3. All remaining members should reboot.
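
For step 2 of each test, "enough members" depends on cluster size.  A
small sketch of the arithmetic, assuming quorum is a simple strict
majority of configured members and ignoring any tiebreaker mechanism
(the helper names are made up for this example):

```python
def majority(members_total):
    # Smallest member count that is strictly more than half.
    return members_total // 2 + 1


def min_members_to_stop(members_total):
    # Stop just enough members that the survivors fall below majority.
    return members_total - majority(members_total) + 1


for n in (3, 4, 5):
    print(n, majority(n), min_members_to_stop(n))
# prints:
# 3 2 2
# 4 3 2
# 5 3 3
```

So on the minimum 3-member test cluster, two members must be stopped
(or killed, for test III) to break the majority requirement.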
Comment 4 Mike McLean 2004-06-25 17:22:14 EDT
An erratum has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2004-239.html
Comment 5 Lon Hohberger 2007-12-21 10:09:45 EST
Fixing product name.  Clumanager on RHEL3 was part of RHCS3, not RHEL3.
