Bug 1229194

Summary: auto_tie_breaker can create two quorate clusters
Product: Red Hat Enterprise Linux 7 Reporter: Christine Caulfield <ccaulfie>
Component: corosync Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.1 CC: ccaulfie, cluster-maint, fdinitto, jfriesse, jkortus, jsvarova, mjuricek
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: corosync-2.3.4-6.el7 Doc Type: Bug Fix
Doc Text:
Prior to this update, in clusters with an odd number of nodes that had the auto_tie_breaker option enabled, when one of the nodes failed, the remaining nodes were split 50:50. Consequently, auto_tie_breaker was not invoked and a random half of the cluster was fenced, rather than the half that did not contain the tie breaker node. With this update, the wait_for_all option is required for clusters with an odd number of nodes. As a result, the cluster half that does not contain the tie breaker node is now fenced in the described scenario.
Story Points: ---
Clone Of:
: 1260719 Environment:
Last Closed: 2015-11-19 11:41:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1260719    
Attachments:
Description Flags
votequorum: Fix auto_tie_breaker behaviour in odd-sized clusters none
quorum: don't allow quorum_trackstart to be called twice none

Description Christine Caulfield 2015-06-08 09:10:08 UTC
Description of problem:
If a 5-node cluster splits into two partitions of 3 and 2 nodes, both can remain quorate if the 2-node partition has the auto_tie_breaker_node in it.

Version-Release number of selected component (if applicable):
7.1+

How reproducible:
Every time, according to an email on the corosync mailing list

Steps to Reproduce:
1. Create a 5-node cluster with auto_tie_breaker enabled
2. Split the cluster so that the lowest-ID node (the tie breaker) is in the 2-node partition

Actual results:
Both partitions are quorate

Expected results:
Only the 3-node partition is quorate

Additional info:

See the GitHub thread: https://github.com/corosync/corosync/issues/74
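
The reproduction steps can be illustrated with a small Python model. This is a simplified sketch, not corosync's votequorum code: it assumes one vote per node, takes the lowest node ID as the tie breaker, and guesses that the pre-fix check treated any partition holding expected_votes // 2 votes as a tie. The function names are hypothetical.

```python
def quorum_threshold(expected_votes):
    """Majority threshold: strictly more than half of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate_pre_fix(partition, all_nodes):
    """Simplified model of the behaviour reported in this bug.

    Assumptions: one vote per node, tie breaker = lowest node ID.
    """
    if len(partition) >= quorum_threshold(len(all_nodes)):
        return True  # plain majority
    # Assumed pre-fix rule: integer division makes a 2-of-5 partition
    # look like an even split, so the tie breaker is consulted.
    looks_like_tie = len(partition) == len(all_nodes) // 2
    return looks_like_tie and min(all_nodes) in partition


nodes = {1, 2, 3, 4, 5}
for part in ({1, 2}, {3, 4, 5}):
    print(sorted(part), "quorate:", is_quorate_pre_fix(part, nodes))
```

Under these assumptions both partitions report quorate, which is the split-brain condition described in the actual results.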

Comment 2 Christine Caulfield 2015-06-18 13:48:03 UTC
commit b9f5c290b7dedd0a677cdfc25db7dd111245a745
Author: Christine Caulfield <ccaulfie>
Date:   Thu Jun 18 09:57:59 2015 +0100

    votequorum: Fix auto_tie_breaker behaviour in odd-sized clusters

Comment 3 Jan Friesse 2015-06-19 16:07:05 UTC
Created attachment 1041008
votequorum: Fix auto_tie_breaker behaviour in odd-sized clusters

auto_tie_breaker (ATB) can behave incorrectly in a cluster
with an odd number of nodes. It's possible for one partition to
have quorum while the other side has the ATB node, and both will
continue working. (Of course, in a properly configured cluster one side
will be fenced, but that becomes an indeterminate race, which is just
what ATB is supposed to avoid.)

This patch prevents ATB from running in a partition if the 'other'
partition might have quorum, and also mandates the use of wait_for_all
in clusters with an odd number of nodes so that a quorate partition
cannot start services or fence an existing partition with the tie
breaker node.

Signed-Off-By: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>
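
One way to read the rule in this patch is sketched below in Python. It is a simplified model under stated assumptions (one vote per node, lowest node ID as the tie breaker, hypothetical function names), not the actual votequorum implementation, and it does not model wait_for_all.

```python
def quorum_threshold(expected_votes):
    """Majority threshold: strictly more than half of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate_post_fix(partition, all_nodes):
    """Model of the patched rule: ATB never runs if the other side might have quorum.

    Assumptions: one vote per node, tie breaker = lowest node ID.
    """
    quorum = quorum_threshold(len(all_nodes))
    if len(partition) >= quorum:
        return True  # plain majority
    missing = len(all_nodes) - len(partition)
    if missing >= quorum:
        return False  # the nodes we cannot see could form a quorate partition
    # Genuine even split: only the side holding the tie breaker stays quorate.
    return min(all_nodes) in partition


five_nodes = {1, 2, 3, 4, 5}
four_nodes = {1, 2, 3, 4}
print("5 nodes, {1,2}:", is_quorate_post_fix({1, 2}, five_nodes))
print("5 nodes, {3,4,5}:", is_quorate_post_fix({3, 4, 5}, five_nodes))
print("4 nodes, {1,2}:", is_quorate_post_fix({1, 2}, four_nodes))
print("4 nodes, {3,4}:", is_quorate_post_fix({3, 4}, four_nodes))
```

In this model the tie breaker only decides true 50:50 splits, so the 2-node side of a 3/2 split loses quorum even when it holds the lowest node ID.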

Comment 4 Jan Friesse 2015-06-22 08:37:03 UTC
Created attachment 1041647
quorum: don't allow quorum_trackstart to be called twice

If quorum_trackstart() or votequorum_trackstart() is called twice with
CS_TRACK_CHANGES, the client gets added twice to the notifications
list, effectively corrupting it. Users have reported segfaults in
corosync when they did this (by mistake!).

As there's already a tracking_enabled flag in the private-data, we check
that before adding to the list again and return an error if
the process is already registered.

Signed-off-by: Christine Caulfield <ccaulfie>
Reviewed-by: Jan Friesse <jfriesse>
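
The guard described in this commit message can be modeled in a few lines of Python. The class and method names are hypothetical stand-ins for the C code; only the tracking_enabled flag is named in the patch description, and the boolean return value stands in for whatever error code the real API returns.

```python
class Client:
    """Hypothetical per-connection private data."""

    def __init__(self, name):
        self.name = name
        self.tracking_enabled = False


class NotificationList:
    """Hypothetical list of clients that receive quorum change notifications."""

    def __init__(self):
        self.clients = []

    def trackstart(self, client):
        # Refuse a second registration instead of adding the client twice.
        if client.tracking_enabled:
            return False
        client.tracking_enabled = True
        self.clients.append(client)
        return True


notifications = NotificationList()
client = Client("example")
print(notifications.trackstart(client))
print(notifications.trackstart(client))
print(len(notifications.clients))
```

The second call is rejected and the list still holds a single entry, so the list cannot be corrupted by a duplicate registration.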

Comment 9 errata-xmlrpc 2015-11-19 11:41:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2354.html