Bug 1679792 - Inconsistent quorum when wait_for_all is set

Product:          Red Hat Enterprise Linux 7
Component:        corosync
Version:          7.6
Hardware:         Unspecified
OS:               Unspecified
Status:           CLOSED ERRATA
Severity:         unspecified
Priority:         unspecified
Target Milestone: rc
Target Release:   ---
Fixed In Version: corosync-2.4.5-5.el7
Reporter:         Miroslav Lisik <mlisik>
Assignee:         Jan Friesse <jfriesse>
QA Contact:       cluster-qe <cluster-qe>
CC:               ccaulfie, cluster-maint, phagara
Cloned to:        1816653 (view as bug list)
Bug Blocks:       1816653
Type:             Bug
Last Closed:      2020-09-29 19:55:11 UTC
Attachments:      votequorum: set wfa status only on startup (attachment 1673072)
Description
Miroslav Lisik 2019-02-21 21:12:23 UTC

Forgot to add a comment. The issue is really easily reproducible (that's good :) ). It's not yet clear how to fix it. We've had a discussion with chrissie/fabio and the result is that there is no reason why the quorum and votequorum output should differ. Right now two possible solutions are known:

- the current cluster becomes non-quorate until a new node appears
- "standard" calculations are used

The first solution seems more natural (I like Chrissie's comment: it's "wait_for_all", not "wait_for_some"), but it may have some problems, most notably with last man standing. It must also be properly tested which configuration changes should make corosync wait_for_all again.

Moving to 7.8. The solution should be quite easy, but we have to find out all the corner cases. The bug has been there since 7.0, so it shouldn't be a big deal. Also, because this affects corosync 3.0 (RHEL 8) as well, we must clone this BZ to 8.1/8.2 when the fix is ready.

qa_ack+, reproducer in description

The problem described in the description is solved by the patch in bug 1780134, but the problem also appears in other situations, as described in the upstream comment https://github.com/corosync/corosync/pull/542#issuecomment-597207397.

For QA: https://github.com/corosync/corosync/pull/542#issuecomment-597207397 contains the tested scenarios (a condensed reproducer sketch also follows the attached patch below).

Created attachment 1673072 [details]
votequorum: set wfa status only on startup
votequorum: set wfa status only on startup
Previously, a reload of the configuration with wait_for_all enabled resulted
in wait_for_all_status being set, which set cluster_is_quorate to 0 but did
not inform the quorum service, so the votequorum and quorum information could
get out of sync.
An example is a 1-node cluster which is extended to 3 nodes. The quorum
service reports the cluster as quorate (incorrect) and votequorum reports it
as not quorate (correct). Similar behavior happens when extending a cluster
in general, but some configurations are less incorrect (3->4).
The discussed solution was to inform the quorum service, but that would mean
every reload would cause a loss of quorum until all nodes were seen again.
Such behaviour is consistent but seems a bit too strict.
The proposed solution sets wait_for_all_status only on startup and does not
touch it during reload.
This fulfills the requirement that "the cluster will become quorate for the
first time only after all nodes have been visible at least once at the same
time", because a node clears wait_for_all_status only after it sees all other
nodes or joins a cluster which is already quorate. It also solves the problem
with extending a cluster, because when the cluster becomes unquorate (1->3),
wait_for_all_status is set.
The added assert only ensures that I haven't missed any case where a quorate
cluster may become unquorate.
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
(cherry picked from commit ca320beac25f82c0c555799e647a47975a333c28)
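For context on the wait_for_all setting referenced throughout the commit message, here is a minimal sketch of how to inspect it on a running corosync 2.x node. The key names shown (quorum.wait_for_all, quorum.two_node) are the standard votequorum options; which of them appear, and their values, depend on the local configuration:

# Dump the runtime configuration database and look at the votequorum keys;
# quorum.wait_for_all enables the behaviour explicitly, and quorum.two_node
# implies it.
corosync-cmapctl | grep '^quorum\.'

# Show this node's quorum and votequorum view (essentially what
# 'pcs quorum status' wraps).
corosync-quorumtool -s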
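A condensed reproducer, distilled from the scenario in the commit message and the verification transcripts below (a sketch only: it assumes an existing two-node cluster managed by pcs, and virt-c is a placeholder for the name of the node being added):

# Starting point: a quorate 2-node cluster with WaitForAll in effect.
pcs quorum status        # expect "Quorate: Yes", Flags "2Node Quorate WaitForAll"

# Extend the cluster; this distributes a new corosync.conf and triggers a
# runtime configuration reload on the existing nodes.
pcs cluster node add virt-c

# Before the new node's corosync actually joins, compare the two views again.
pcs quorum status
# corosync-2.4.5-4.el7: the quorum service still reports "Quorate: Yes" while
#   votequorum shows "Quorum: 2 Activity blocked" and drops the Quorate flag.
# corosync-2.4.5-5.el7: both views agree (the cluster stays quorate), as shown
#   in the transcripts below.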
before (rhel-7.8, corosync-2.4.5-4.el7)
=======================================

[root@virt-173 ~]# rpm -q corosync
corosync-2.4.5-4.el7.x86_64

[root@virt-173 ~]# pcs status
Cluster name: STSRHTS28883
Stack: corosync
Current DC: virt-173 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Tue May 26 13:50:13 2020
Last change: Tue May 26 13:48:25 2020 by root via cibadmin on virt-173

2 nodes configured
7 resources configured

Online: [ virt-173 virt-175 ]

Full list of resources:

 fence-virt-173 (stonith:fence_xvm): Started virt-173
 fence-virt-175 (stonith:fence_xvm): Started virt-175
 fence-virt-178 (stonith:fence_xvm): Started virt-173
 dummy-1 (ocf::pacemaker:Dummy): Started virt-175
 dummy-2 (ocf::pacemaker:Dummy): Started virt-173
 dummy-3 (ocf::pacemaker:Dummy): Started virt-175
 dummy-4 (ocf::pacemaker:Dummy): Started virt-173

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-173 ~]# pcs quorum status
Quorum information
------------------
Date:             Tue May 26 13:50:17 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/26
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate WaitForAll

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-173 (local)
         2          1         NR virt-175

[root@virt-173 ~]# pcs cluster node add virt-178
Disabling SBD service...
virt-178: sbd disabled
Sending remote node configuration files to 'virt-178'
virt-178: successful distribution of the file 'pacemaker_remote authkey'
virt-173: Corosync updated
virt-175: Corosync updated
Setting up corosync...
virt-178: Succeeded
Synchronizing pcsd certificates on nodes virt-178...
virt-178: Success
Restarting pcsd on the nodes in order to reload the certificates...
virt-178: Success

[root@virt-173 ~]# pcs quorum status
Quorum information
------------------
Date:             Tue May 26 13:52:25 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/26
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 Activity blocked
Flags:            WaitForAll

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-173 (local)
         2          1         NR virt-175

[root@virt-173 ~]# pcs resource
 dummy-1 (ocf::pacemaker:Dummy): Started virt-175
 dummy-2 (ocf::pacemaker:Dummy): Started virt-173
 dummy-3 (ocf::pacemaker:Dummy): Started virt-175
 dummy-4 (ocf::pacemaker:Dummy): Started virt-173

result: quorum and votequorum are out of sync (quorate vs not)

after (rhel-7.9, corosync-2.4.5-5.el7)
======================================

[root@virt-053 ~]# rpm -q corosync
corosync-2.4.5-5.el7.x86_64

[root@virt-053 ~]# pcs status
Cluster name: STSRHTS12710
Stack: corosync
Current DC: virt-060 (version 1.1.22-1.el7-63d2d79005) - partition with quorum
Last updated: Tue May 26 13:58:35 2020
Last change: Tue May 26 13:58:20 2020 by root via cibadmin on virt-053

2 nodes configured
7 resource instances configured

Online: [ virt-053 virt-060 ]

Full list of resources:

 fence-virt-053 (stonith:fence_xvm): Started virt-053
 fence-virt-060 (stonith:fence_xvm): Started virt-060
 fence-virt-070 (stonith:fence_xvm): Started virt-053
 dummy-1 (ocf::pacemaker:Dummy): Started virt-060
 dummy-2 (ocf::pacemaker:Dummy): Started virt-053
 dummy-3 (ocf::pacemaker:Dummy): Started virt-060
 dummy-4 (ocf::pacemaker:Dummy): Started virt-053

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-053 ~]# pcs quorum status
Quorum information
------------------
Date:             Tue May 26 13:58:43 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/35
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate WaitForAll

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-053 (local)
         2          1         NR virt-060

[root@virt-053 ~]# pcs cluster node add virt-070
Disabling SBD service...
virt-070: sbd disabled
Sending remote node configuration files to 'virt-070'
virt-070: successful distribution of the file 'pacemaker_remote authkey'
virt-053: Corosync updated
virt-060: Corosync updated
Setting up corosync...
virt-070: Succeeded
Synchronizing pcsd certificates on nodes virt-070...
virt-070: Success
Restarting pcsd on the nodes in order to reload the certificates...
virt-070: Success

[root@virt-053 ~]# pcs quorum status
Quorum information
------------------
Date:             Tue May 26 13:59:40 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/35
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-053 (local)
         2          1         NR virt-060

[root@virt-053 ~]# pcs resource
 dummy-1 (ocf::pacemaker:Dummy): Started virt-060
 dummy-2 (ocf::pacemaker:Dummy): Started virt-053
 dummy-3 (ocf::pacemaker:Dummy): Started virt-060
 dummy-4 (ocf::pacemaker:Dummy): Started virt-053

result: both quorum and votequorum report same state (quorate)

marking verified in corosync-2.4.5-5.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3924