The "problem": When the cluster loses enough members for quorum (that is, it no longer has a majority) and no power switches are configured, all remaining members reboot.

Why: Dissolution of the cluster quorum (majority set) is non-deterministic. When a node loses communication with the cluster quorum, it does not care _why_, only that it did. A node with no fencing devices (= power switches) configured reboots immediately on losing communication with the cluster quorum, since no one can fence it. If it does have fencing devices configured, the member does not reboot; instead, it stops services immediately and waits either (a) to be shot or (b) to regain communication with the cluster quorum.

The fix: Make quorum transitions semi-deterministic. Don't reboot if we were expecting to lose the set of nodes which are now reported down. This fix doesn't impact cluster communications or node failures, and could lower support calls on larger clusters. It is, at the moment, untested, but theoretically correct.
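The decision the fix changes can be sketched roughly as follows. This is a minimal illustration only: the enum, function names, and the bitmask representation of the membership are hypothetical, not clumanager's actual code.

```c
#include <stdbool.h>

/* Illustrative sketch of the proposed semi-deterministic quorum
 * transition. All names here are hypothetical. */

typedef enum {
    ACTION_REBOOT,          /* unclean quorum loss, no fencing: reboot now */
    ACTION_STOP_AND_WAIT,   /* fencing configured: stop services, wait to
                             * be shot or to regain quorum */
    ACTION_LOG_EMERG        /* clean dissolution: just log "Quorum Lost" */
} quorum_action_t;

/* expected_down: bitmask of members that announced a clean shutdown.
 * reported_down: bitmask of members now reported down.
 * The transition counts as "expected" only if every down node
 * announced itself beforehand. */
static bool loss_was_expected(unsigned expected_down, unsigned reported_down)
{
    return (reported_down & ~expected_down) == 0;
}

quorum_action_t on_quorum_lost(bool has_fencing,
                               unsigned expected_down, unsigned reported_down)
{
    if (loss_was_expected(expected_down, reported_down))
        return ACTION_LOG_EMERG;     /* clean: do not reboot */
    if (has_fencing)
        return ACTION_STOP_AND_WAIT; /* safe to wait: someone can fence us */
    return ACTION_REBOOT;            /* non-deterministic loss, no fencing */
}
```

The three branches correspond to the three behaviors described above: clean dissolution logs and stays up, fenced members stop and wait, and unfenced members hit by an unexpected loss reboot as before.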
Created attachment 99921 [details]
Allows clean transitions

Patch untested; it compiles.
Unit tested; the patch works.
Unit tests: Require a cluster with 3 or more members, with no fencing devices (STONITH drivers, power controllers) configured.

I. Old Behavior - "Clean" quorum dissolution
1. Start cluster software on all members (service clumanager start).
2. Stop enough members to constitute loss of quorum, using the clean shutdown procedure (service clumanager stop).
3. All remaining members should reboot.

II. New Behavior - "Clean" quorum dissolution
1. Ditto.
2. Ditto.
3. All remaining members should report "Quorum Lost" at <emerg> log level.

III. Old & New Behavior - Unclean quorum dissolution
1. Ditto.
2. Dissolve the cluster quorum forcefully using "killall -9 clumembd", "reboot -fn", or a cable pull on enough members to break the majority requirement.
3. All remaining members should reboot.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2004-239.html
Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3.