The "problem": When the cluster loses enough members for quorum (that is, it no longer has a majority) and no power switches are configured, all remaining members reboot.

Why: Dissolution of the cluster quorum (majority set) is non-deterministic. When a node loses communication with the cluster quorum, it does not care _why_, only that it did. A node with no fencing devices (= power switches) configured reboots immediately on losing communication with the cluster quorum, since no one can fence it. If it does have fencing devices configured, the member does not reboot; instead, it stops services immediately and waits either (a) to be shot or (b) to regain communication with the cluster quorum.

The fix: Make quorum transitions semi-deterministic. Don't reboot if we were expecting to lose the set of nodes which are now reported down. This fix doesn't impact cluster communications or node failures, and could lower support calls on larger clusters. It is, at the moment, untested, but theoretically correct.
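The decision the fix changes can be sketched roughly as follows. This is a minimal illustration only: the enum, function names, and the bitmask representation of the membership are hypothetical, not clumanager's actual code.

```c
#include <stdbool.h>

/* Illustrative sketch of the proposed semi-deterministic quorum
 * transition. All names here are hypothetical. */

typedef enum {
    ACTION_REBOOT,          /* unclean quorum loss, no fencing: reboot now */
    ACTION_STOP_AND_WAIT,   /* fencing configured: stop services, wait to
                             * be shot or to regain quorum */
    ACTION_LOG_EMERG        /* clean dissolution: just log "Quorum Lost" */
} quorum_action_t;

/* expected_down: bitmask of members that announced a clean shutdown.
 * reported_down: bitmask of members now reported down.
 * The transition counts as "expected" only if every down node
 * announced itself beforehand. */
static bool loss_was_expected(unsigned expected_down, unsigned reported_down)
{
    return (reported_down & ~expected_down) == 0;
}

quorum_action_t on_quorum_lost(bool has_fencing,
                               unsigned expected_down, unsigned reported_down)
{
    if (loss_was_expected(expected_down, reported_down))
        return ACTION_LOG_EMERG;     /* clean: do not reboot */
    if (has_fencing)
        return ACTION_STOP_AND_WAIT; /* safe to wait: someone can fence us */
    return ACTION_REBOOT;            /* non-deterministic loss, no fencing */
}
```

The three branches correspond to the three behaviors described above: clean dissolution logs and stays up, fenced members stop and wait, and unfenced members hit by an unexpected loss reboot as before.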
Created attachment 99921 [details]
Allows clean transitions

Patch untested; it compiles.
Unit tested; the patch works.
Unit tests: Require a cluster with 3 or more members, with no fencing devices (STONITH drivers, power controllers) configured.

I. Old Behavior - "Clean" quorum dissolution
1. Start cluster software on all members (service clumanager start).
2. Stop enough members to constitute loss of quorum, using the clean shutdown procedure (service clumanager stop).
3. All remaining members should reboot.

II. New Behavior - "Clean" quorum dissolution
1. Ditto.
2. Ditto.
3. All remaining members should report "Quorum Lost" at <emerg> log level.

III. Old & New Behavior - Unclean quorum dissolution
1. Ditto.
2. Dissolve the cluster quorum forcefully using "killall -9 clumembd", "reboot -fn", or a cable pull on enough members to break the majority requirement.
3. All remaining members should reboot.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2004-239.html
Fixing product name. Clumanager on RHEL3 was part of RHCS3, not RHEL3.