Bug 1679792 - Inconsistent quorum when wait_for_all is set
Summary: Inconsistent quorum when wait_for_all is set
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: corosync
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Jan Friesse
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1816653
 
Reported: 2019-02-21 21:12 UTC by Miroslav Lisik
Modified: 2020-09-29 19:55 UTC
CC: 3 users

Fixed In Version: corosync-2.4.5-5.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1816653 (view as bug list)
Environment:
Last Closed: 2020-09-29 19:55:11 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
votequorum: set wfa status only on startup (2.56 KB, patch)
2020-03-24 13:23 UTC, Jan Friesse
no flags


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:3924 0 None None None 2020-09-29 19:55:20 UTC

Description Miroslav Lisik 2019-02-21 21:12:23 UTC
Description of problem:
Quorum information is inconsistent after adding a third node to the cluster. The reported quorum state does not match the actual state of the cluster.

Version-Release number of selected component (if applicable):
# rpm -q corosync pacemaker
corosync-2.4.3-4.el7.x86_64
pacemaker-1.1.19-8.el7_6.4.x86_64

How reproducible:
always


Steps to Reproduce:
1. Set up a 2-node cluster with a few dummy resources.
[root@virt-025 ~]# pcs cluster auth -u hacluster -p password virt-025 virt-026
virt-025: Authorized
virt-026: Authorized
[root@virt-025 ~]# pcs cluster setup --name HAcluster virt-025 virt-026 --start
Destroying cluster on nodes: virt-025, virt-026...
virt-026: Stopping Cluster (pacemaker)...
virt-025: Stopping Cluster (pacemaker)...
virt-026: Successfully destroyed cluster
virt-025: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'virt-025', 'virt-026'
virt-025: successful distribution of the file 'pacemaker_remote authkey'
virt-026: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
virt-025: Succeeded
virt-026: Succeeded

Starting cluster on nodes: virt-025, virt-026...
virt-026: Starting Cluster (corosync)...
virt-025: Starting Cluster (corosync)...
virt-026: Starting Cluster (pacemaker)...
virt-025: Starting Cluster (pacemaker)...

Synchronizing pcsd certificates on nodes virt-025, virt-026...
virt-025: Success
virt-026: Success
Restarting pcsd on the nodes in order to reload the certificates...
virt-025: Success
virt-026: Success
[root@virt-025 ~]# pcs stonith create fence-virt-025 fence_xvm pcmk_host_check="static-list" pcmk_host_list="virt-025" pcmk_host_map="virt-025:virt-025.cluster-qe.lab.eng.brq.redhat.com"
[root@virt-025 ~]# pcs stonith create fence-virt-026 fence_xvm pcmk_host_check="static-list" pcmk_host_list="virt-026" pcmk_host_map="virt-026:virt-026.cluster-qe.lab.eng.brq.redhat.com"
[root@virt-025 ~]# for i in $(seq 1 4); do pcs resource create "d-$i" ocf:pacemaker:Dummy; done
[root@virt-025 ~]# pcs resource
 d-1    (ocf::pacemaker:Dummy): Started virt-025
 d-2    (ocf::pacemaker:Dummy): Started virt-026
 d-3    (ocf::pacemaker:Dummy): Started virt-025
 d-4    (ocf::pacemaker:Dummy): Started virt-026
[root@virt-025 ~]# pcs quorum status
Quorum information
------------------
Date:             Thu Feb 21 21:34:53 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/1240
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1  
Flags:            2Node Quorate WaitForAll 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-025 (local)
         2          1         NR virt-026

2. Add a third node to the cluster.
[root@virt-025 ~]# pcs cluster auth -u hacluster -p password virt-032
virt-032: Authorized
[root@virt-025 ~]# pcs cluster node add virt-032
Disabling SBD service...
virt-032: sbd disabled
Sending remote node configuration files to 'virt-032'
virt-032: successful distribution of the file 'pacemaker_remote authkey'
virt-025: Corosync updated
virt-026: Corosync updated
Setting up corosync...
virt-032: Succeeded
Synchronizing pcsd certificates on nodes virt-032...
virt-032: Success
Restarting pcsd on the nodes in order to reload the certificates...
virt-032: Success

3. Check the quorum information and state of resources:

[root@virt-025 ~]# pcs quorum status
Quorum information
------------------
Date:             Thu Feb 21 21:38:24 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/1240
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 Activity blocked
Flags:            WaitForAll 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-025 (local)
         2          1         NR virt-026
[root@virt-025 ~]# pcs resource 
 d-1    (ocf::pacemaker:Dummy): Started virt-025
 d-2    (ocf::pacemaker:Dummy): Started virt-026
 d-3    (ocf::pacemaker:Dummy): Started virt-025
 d-4    (ocf::pacemaker:Dummy): Started virt-026
[root@virt-025 ~]# pcs quorum status | grep -E "Quorate:|Quorum:|Flags:"
Quorate:          Yes
Quorum:           2 Activity blocked
Flags:            WaitForAll


Actual results:

Quorum information is not consistent with the state of the resources.


Expected results:

Quorum information should be consistent with the state of the resources.

Additional info:

Quorum is consistent when a node is added to a 2-node cluster with wait_for_all=0 (turned off).

[root@virt-025 ~]# pcs cluster stop --all
...
[root@virt-025 ~]# pcs quorum update wait_for_all=0
...
[root@virt-025 ~]# pcs cluster start --all
...
[root@virt-025 ~]# pcs quorum
Options:
  wait_for_all: 0
[root@virt-025 ~]# pcs quorum status
Quorum information
------------------
Date:             Thu Feb 21 21:44:30 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/1248
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1  
Flags:            2Node Quorate 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-025 (local)
         2          1         NR virt-026

[root@virt-025 ~]# pcs cluster node add virt-032
...

[root@virt-025 ~]# pcs quorum status
Quorum information
------------------
Date:             Thu Feb 21 21:45:46 2019
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/1248
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-025 (local)
         2          1         NR virt-026
[root@virt-025 ~]# pcs resource
 d-1    (ocf::pacemaker:Dummy): Started virt-025
 d-2    (ocf::pacemaker:Dummy): Started virt-026
 d-3    (ocf::pacemaker:Dummy): Started virt-025
 d-4    (ocf::pacemaker:Dummy): Started virt-026
[root@virt-025 ~]# pcs quorum status | grep -E "Quorate:|Quorum:|Flags:"
Quorate:          Yes
Quorum:           2  
Flags:            Quorate

Comment 2 Jan Friesse 2019-03-08 07:06:07 UTC
Forgot to add a comment. The issue is really easily reproducible (that's good :) ). It's not yet clear how to fix this. We've had a discussion with chrissie/fabio and the result is that there is no reason why the quorum and votequorum output should differ.

Right now there are two possible solutions known:
  - the current cluster becomes non-quorate until the new node appears
  - "standard" calculations are used

The first solution seems more natural (I like Chrissie's comment: it's "wait_for_all", not "wait_for_some"), but it may have some problems, most notably with last man standing.

It must also be properly tested which configuration changes make corosync wait for all nodes again.
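
For readers not familiar with the options discussed here, the relevant votequorum settings live in the quorum section of /etc/corosync/corosync.conf. A minimal sketch for reference only (values are illustrative, not the configuration from this report):

quorum {
    provider: corosync_votequorum
    # set by pcs for 2-node clusters; implies wait_for_all unless overridden
    two_node: 1
    # cluster becomes quorate for the first time only after all nodes were seen
    wait_for_all: 1
    # mentioned above as a possible interaction; not enabled in this report
    last_man_standing: 0
}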

Comment 3 Jan Friesse 2019-03-18 12:05:43 UTC
Moving to 7.8. The solution should be quite easy, but we have to find all the corner cases. The bug has been there since 7.0, so it shouldn't be a big deal. Also, because this affects corosync 3.0 (RHEL 8) as well, we must clone this BZ to 8.1/8.2 when the fix becomes ready.

Comment 5 Patrik Hagara 2020-02-17 14:25:40 UTC
qa_ack+, reproducer in description

Comment 6 Jan Friesse 2020-03-24 13:20:59 UTC
The problem described in the description is solved by the patch in bug 1780134, but the problem still appears in other situations, as described in the upstream comment https://github.com/corosync/corosync/pull/542#issuecomment-597207397.

For QA: https://github.com/corosync/corosync/pull/542#issuecomment-597207397 contains tested scenarios.
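
One quick way to see the two views diverge on a node is to compare the votequorum output with the quorum state pacemaker gets from the corosync quorum service. A rough sketch, using the node name from the description (output omitted):

# votequorum view (the same data shown by "pcs quorum status")
[root@virt-025 ~]# corosync-quorumtool -s | grep -E "Quorate:|Quorum:|Flags:"

# pacemaker's view, fed by the quorum service ("partition with quorum")
[root@virt-025 ~]# crm_mon -1 | grep -i quorum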

Comment 7 Jan Friesse 2020-03-24 13:23:50 UTC
Created attachment 1673072 [details]
votequorum: set wfa status only on startup

votequorum: set wfa status only on startup

Previously, a reload of the configuration with wait_for_all enabled
set wait_for_all_status, which set cluster_is_quorate to 0 but didn't
inform the quorum service, so the votequorum and quorum information
could get out of sync.

An example is a 1-node cluster which is extended to 3 nodes. The quorum
service reports the cluster as quorate (incorrect) and votequorum as
not quorate (correct). Similar behavior happens when extending a
cluster in general, but some configurations are less incorrect (3->4).

The discussed solution was to inform the quorum service, but that
would mean every reload would cause a loss of quorum until all nodes
were seen again.

Such behaviour is consistent but seems to be a bit too strict.

The proposed solution sets wait_for_all_status only on startup and
doesn't touch it during reload.

This solution fulfills the requirement that "the cluster will be
quorate for the first time only after all nodes have been visible at
least once at the same time", because a node clears
wait_for_all_status only after it sees all other nodes or joins a
cluster which is quorate. It also solves the problem with extending
the cluster, because when the cluster becomes unquorate (1->3),
wait_for_all_status is set.

The added assert is only to ensure that I haven't missed any case
where a quorate cluster may become unquorate.

Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
(cherry picked from commit ca320beac25f82c0c555799e647a47975a333c28)
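
For reference, the 1->3 case from the commit message can be approximated with the same pcs workflow used in the description. A rough, unverified sketch (cluster and node names are placeholders; wait_for_all is enabled explicitly because pcs only sets two_node, and with it the implicit wait_for_all, for 2-node clusters):

[root@node1 ~]# pcs cluster auth -u hacluster -p password node1 node2 node3
[root@node1 ~]# pcs cluster setup --name WFAtest node1
[root@node1 ~]# pcs quorum update wait_for_all=1
[root@node1 ~]# pcs cluster start --all
[root@node1 ~]# pcs cluster node add node2
[root@node1 ~]# pcs cluster node add node3
[root@node1 ~]# pcs quorum status | grep -E "Quorate:|Quorum:|Flags:"
[root@node1 ~]# crm_mon -1 | grep -i quorum

Before the fix the two views could disagree after such an expansion (as comment 10 shows for the 2->3 case); with corosync-2.4.5-5.el7 they should stay in sync, i.e. not quorate until the added nodes are actually seen.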

Comment 10 Patrik Hagara 2020-05-26 12:18:06 UTC
before (rhel-7.8, corosync-2.4.5-4.el7)
=======================================

[root@virt-173 ~]# rpm -q corosync
corosync-2.4.5-4.el7.x86_64
[root@virt-173 ~]# pcs status
Cluster name: STSRHTS28883
Stack: corosync
Current DC: virt-173 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Tue May 26 13:50:13 2020
Last change: Tue May 26 13:48:25 2020 by root via cibadmin on virt-173

2 nodes configured
7 resources configured

Online: [ virt-173 virt-175 ]

Full list of resources:

 fence-virt-173 (stonith:fence_xvm):    Started virt-173
 fence-virt-175 (stonith:fence_xvm):    Started virt-175
 fence-virt-178 (stonith:fence_xvm):    Started virt-173
 dummy-1        (ocf::pacemaker:Dummy): Started virt-175
 dummy-2        (ocf::pacemaker:Dummy): Started virt-173
 dummy-3        (ocf::pacemaker:Dummy): Started virt-175
 dummy-4        (ocf::pacemaker:Dummy): Started virt-173

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@virt-173 ~]# pcs quorum status                                                                                                                                                                                
Quorum information
------------------
Date:             Tue May 26 13:50:17 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/26
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1  
Flags:            2Node Quorate WaitForAll 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-173 (local)
         2          1         NR virt-175

[root@virt-173 ~]# pcs cluster node add virt-178                                                                                                                                                               
Disabling SBD service...
virt-178: sbd disabled
Sending remote node configuration files to 'virt-178'
virt-178: successful distribution of the file 'pacemaker_remote authkey'
virt-173: Corosync updated
virt-175: Corosync updated
Setting up corosync...
virt-178: Succeeded
Synchronizing pcsd certificates on nodes virt-178...
virt-178: Success
Restarting pcsd on the nodes in order to reload the certificates...
virt-178: Success
[root@virt-173 ~]# pcs quorum status
Quorum information
------------------
Date:             Tue May 26 13:52:25 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/26
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 Activity blocked
Flags:            WaitForAll 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-173 (local)
         2          1         NR virt-175

[root@virt-173 ~]# pcs resource
 dummy-1        (ocf::pacemaker:Dummy): Started virt-175
 dummy-2        (ocf::pacemaker:Dummy): Started virt-173
 dummy-3        (ocf::pacemaker:Dummy): Started virt-175
 dummy-4        (ocf::pacemaker:Dummy): Started virt-173


result: quorum and votequorum are out of sync (quorate vs not)



after (rhel-7.9, corosync-2.4.5-5.el7)
======================================

[root@virt-053 ~]# rpm -q corosync
corosync-2.4.5-5.el7.x86_64
[root@virt-053 ~]# pcs status
Cluster name: STSRHTS12710
Stack: corosync
Current DC: virt-060 (version 1.1.22-1.el7-63d2d79005) - partition with quorum
Last updated: Tue May 26 13:58:35 2020
Last change: Tue May 26 13:58:20 2020 by root via cibadmin on virt-053

2 nodes configured
7 resource instances configured

Online: [ virt-053 virt-060 ]

Full list of resources:

 fence-virt-053	(stonith:fence_xvm):	Started virt-053
 fence-virt-060	(stonith:fence_xvm):	Started virt-060
 fence-virt-070	(stonith:fence_xvm):	Started virt-053
 dummy-1	(ocf::pacemaker:Dummy):	Started virt-060
 dummy-2	(ocf::pacemaker:Dummy):	Started virt-053
 dummy-3	(ocf::pacemaker:Dummy):	Started virt-060
 dummy-4	(ocf::pacemaker:Dummy):	Started virt-053

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@virt-053 ~]# pcs quorum status
Quorum information
------------------
Date:             Tue May 26 13:58:43 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/35
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1  
Flags:            2Node Quorate WaitForAll 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-053 (local)
         2          1         NR virt-060

[root@virt-053 ~]# pcs cluster node add virt-070
Disabling SBD service...
virt-070: sbd disabled
Sending remote node configuration files to 'virt-070'
virt-070: successful distribution of the file 'pacemaker_remote authkey'
virt-053: Corosync updated
virt-060: Corosync updated
Setting up corosync...
virt-070: Succeeded
Synchronizing pcsd certificates on nodes virt-070...
virt-070: Success
Restarting pcsd on the nodes in order to reload the certificates...
virt-070: Success
[root@virt-053 ~]# pcs quorum status
Quorum information
------------------
Date:             Tue May 26 13:59:40 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/35
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-053 (local)
         2          1         NR virt-060

[root@virt-053 ~]# pcs resource
 dummy-1	(ocf::pacemaker:Dummy):	Started virt-060
 dummy-2	(ocf::pacemaker:Dummy):	Started virt-053
 dummy-3	(ocf::pacemaker:Dummy):	Started virt-060
 dummy-4	(ocf::pacemaker:Dummy):	Started virt-053


result: both quorum and votequorum report the same state (quorate)


marking verified in corosync-2.4.5-5.el7

Comment 12 errata-xmlrpc 2020-09-29 19:55:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3924

