Bug 1816653

Summary: Inconsistent quorum when wait_for_all is set
Product: Red Hat Enterprise Linux 8 Reporter: Jan Friesse <jfriesse>
Component: corosyncAssignee: Jan Friesse <jfriesse>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 8.1CC: ccaulfie, cluster-maint, cluster-qe, mlisik, mnovacek, phagara
Target Milestone: rcFlags: pm-rhel: mirror+
Target Release: 8.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: corosync-3.0.3-4.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1679792 Environment:
Last Closed: 2020-11-04 03:25:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1679792    
Bug Blocks:    
Attachments:
Description Flags
votequorum: set wfa status only on startup none

Comment 1 Jan Friesse 2020-03-24 13:29:41 UTC
This bug affects also RHEL 8, so the clone.

Comment 2 Jan Friesse 2020-03-24 13:30:12 UTC
Created attachment 1673073 [details]
votequorum: set wfa status only on startup

votequorum: set wfa status only on startup

Previously reload of configuration with enabled wait_for_all result in
set of wait_for_all_status which set cluster_is_quorate to 0 but didn't
inform the quorum service so votequorum and quorum information may get
out of sync.

Example is 1 node cluster, which is extended to 3 nodes. Quorum service
reports cluster as a quorate (incorrect) and votequorum as not-quorate
(correct). Similar behavior happens when extending cluster in general,
but some configurations are less incorrect (3->4).

Discussed solution was to inform quorum service but that would mean
every reload would cause loss of quorum until all nodes would be seen
again.

Such behaviour is consistent but seems to be a bit too strict.

Proposed solution sets wait_for_all_status only on startup and
doesn't touch it during reload.

This solution fulfills requirement of "cluster will be quorate for
the first time only after all nodes have been visible at least
once at the same time." because node clears wait_for_all_status only
after it sees all other nodes or joins cluster which is quorate. It also
solves problem with extending cluster, because when cluster becomes
unquorate (1->3) wait_for_all_status is set.

Added assert is only for ensure that I haven't missed any case when
quorate cluster may become unquorate.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>

Comment 5 michal novacek 2020-09-17 07:58:55 UTC
I have verified that quorum stays in sync when adding node to two node cluster having WaitForAll set with corosync-3.0.3-4.el8.x86_64.


The same procedure as the one in cloned bug 1679792, comment 10

Comment 8 errata-xmlrpc 2020-11-04 03:25:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4736