Bug 1816653
Summary: Inconsistent quorum when wait_for_all is set

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Jan Friesse <jfriesse> |
| Component: | corosync | Assignee: | Jan Friesse <jfriesse> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.1 | CC: | ccaulfie, cluster-maint, cluster-qe, mlisik, mnovacek, phagara |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 8.0 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | corosync-3.0.3-4.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1679792 | Environment: | |
| Last Closed: | 2020-11-04 03:25:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1679792 | Bug Blocks: | |
| Attachments: | 1673073: "votequorum: set wfa status only on startup" (see comment 1) | | |
Comment 1
Jan Friesse
2020-03-24 13:29:41 UTC
Created attachment 1673073 [details]: votequorum: set wfa status only on startup
votequorum: set wfa status only on startup
Previously, a reload of the configuration with wait_for_all enabled resulted in wait_for_all_status being set, which set cluster_is_quorate to 0 but did not inform the quorum service, so the votequorum and quorum information could get out of sync.
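For context, wait_for_all is enabled in the quorum section of corosync.conf. A minimal illustrative fragment (values assumed, not taken from this bug report):

```
quorum {
    provider: corosync_votequorum
    # Become quorate for the first time only after all configured nodes
    # have been seen at the same time at least once.
    wait_for_all: 1
}
```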
An example is a 1-node cluster that is extended to 3 nodes. The quorum service reports the cluster as quorate (incorrect) while votequorum reports it as not quorate (correct). Similar behavior happens when extending a cluster in general, but some configurations are less incorrect (3 -> 4).
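A hedged sketch of how the 1 -> 3 scenario could be triggered with standard corosync tooling (not a verbatim procedure from this bug; the exact steps are an assumption):

```sh
# 1. Start a single-node cluster with wait_for_all: 1 in corosync.conf.
# 2. Add two more node entries to the nodelist { } section on all nodes,
#    then ask corosync to reload its configuration:
corosync-cfgtool -R

# 3. Inspect the quorum state; before the fix the quorum service could
#    report the cluster quorate while votequorum disagreed:
corosync-quorumtool -s
```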
The discussed solution was to inform the quorum service, but that would mean every reload causes a loss of quorum until all nodes have been seen again. Such behaviour is consistent but seems a bit too strict.
The proposed solution sets wait_for_all_status only on startup and doesn't touch it during reload.

This solution fulfills the requirement that "the cluster will be quorate for the first time only after all nodes have been visible at least once at the same time", because a node clears wait_for_all_status only after it sees all other nodes or joins a cluster which is quorate. It also solves the problem of extending the cluster, because when the cluster becomes unquorate (1 -> 3), wait_for_all_status is set.
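A minimal C sketch of this startup-only behavior (the names readconfig_phase, READCONFIG_STARTUP and so on are hypothetical, not the actual corosync source):

```c
#include <stdint.h>

enum readconfig_phase {
    READCONFIG_STARTUP,   /* first configuration read at daemon start */
    READCONFIG_RELOAD,    /* runtime reload, e.g. corosync-cfgtool -R */
};

static uint8_t wait_for_all;        /* quorum.wait_for_all from corosync.conf */
static uint8_t wait_for_all_status; /* blocks quorum until all nodes seen */
static uint8_t cluster_is_quorate;

static void votequorum_readconfig_sketch(enum readconfig_phase phase)
{
    /* ... parse quorum.* keys into wait_for_all and friends ... */

    if (phase == READCONFIG_STARTUP && wait_for_all) {
        /* Only at startup: stay unquorate until every configured node
         * has been visible at least once at the same time. */
        wait_for_all_status = 1;
        cluster_is_quorate = 0;
    }
    /* On READCONFIG_RELOAD wait_for_all_status is left untouched, so the
     * votequorum and quorum views stay consistent; the membership-change
     * path sets it again when a quorate cluster becomes unquorate
     * (e.g. extending 1 -> 3 nodes). */
}
```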
The added assert is only there to ensure that I haven't missed any case where a quorate cluster may become unquorate.
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
I have verified that quorum stays in sync when adding a node to a two-node cluster that has WaitForAll set, using corosync-3.0.3-4.el8.x86_64. The procedure was the same as the one in cloned bug 1679792, comment 10 (a hedged sketch of this kind of flow appears at the end of this report).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4736
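For readers without access to bug 1679792, comment 10, the following is a hedged sketch of this kind of verification flow, not the verbatim procedure from that comment; the host name is a placeholder and the pcs invocations are assumptions based on standard RHEL 8 tooling:

```sh
# Enable wait_for_all on an existing two-node cluster (pcs may require the
# cluster to be stopped while changing quorum options):
pcs quorum update wait_for_all=1

# Extend the cluster by one node (run from an existing cluster node):
pcs cluster node add node3.example.com

# After the resulting configuration reload, the votequorum and quorum views
# should agree; compare the "Quorate:" and "Flags:" fields:
corosync-quorumtool -s
```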