Bug 1679792 - Inconsistent quorum when wait_for_all is set

Product:          Red Hat Enterprise Linux 7
Component:        corosync
Version:          7.6
Hardware:         Unspecified
OS:               Unspecified
Status:           CLOSED ERRATA
Severity:         unspecified
Priority:         unspecified
Target Milestone: rc
Target Release:   ---
Fixed In Version: corosync-2.4.5-5.el7
Reporter:         Miroslav Lisik <mlisik>
Assignee:         Jan Friesse <jfriesse>
QA Contact:       cluster-qe <cluster-qe>
CC:               ccaulfie, cluster-maint, phagara
Cloned to:        1816653 (view as bug list)
Bug Blocks:       1816653
Type:             Bug
Last Closed:      2020-09-29 19:55:11 UTC
Attachments:      votequorum: set wfa status only on startup (attachment 1673072)
Description
Miroslav Lisik 2019-02-21 21:12:23 UTC

Forgot to add a comment. The issue is really easily reproducible (that's good :) ). It's not yet clear how to fix it. We've had a discussion with chrissie/fabio and the result is that there is no reason why the quorum and votequorum output should differ. Right now two possible solutions are known:

- the current cluster becomes non-quorate until a new node appears
- "standard" calculations are used

The first solution seems more natural (I like Chrissie's comment: it's "wait_for_all", not "wait_for_some"), but it may have some problems, most notably with last man standing. It must also be properly tested which configuration changes should make corosync wait_for_all again.

Moving to 7.8. The solution should be quite easy, but we have to find out all the corner cases. The bug has been there since 7.0, so it shouldn't be a big deal. Also, because this affects corosync 3.0 (RHEL 8) as well, we must clone this BZ to 8.1/8.2 when the fix is ready.

qa_ack+, reproducer in description

The problem described in the description is solved by the patch in bug 1780134, but the problem also appears in other situations, as described in the upstream comment https://github.com/corosync/corosync/pull/542#issuecomment-597207397.

For QA: https://github.com/corosync/corosync/pull/542#issuecomment-597207397 contains the tested scenarios (a condensed reproducer sketch also follows the attached patch below).

Created attachment 1673072 [details]
votequorum: set wfa status only on startup
votequorum: set wfa status only on startup
Previously, a reload of the configuration with wait_for_all enabled resulted
in wait_for_all_status being set, which set cluster_is_quorate to 0 but did
not inform the quorum service, so the votequorum and quorum information could
get out of sync.
An example is a 1-node cluster which is extended to 3 nodes. The quorum
service reports the cluster as quorate (incorrect) and votequorum reports it
as not quorate (correct). Similar behavior happens when extending a cluster
in general, but some configurations are less incorrect (3->4).
The discussed solution was to inform the quorum service, but that would mean
every reload would cause a loss of quorum until all nodes were seen again.
Such behaviour is consistent but seems a bit too strict.
The proposed solution sets wait_for_all_status only on startup and does not
touch it during reload.
This fulfills the requirement that "the cluster will become quorate for the
first time only after all nodes have been visible at least once at the same
time", because a node clears wait_for_all_status only after it sees all other
nodes or joins a cluster which is already quorate. It also solves the problem
with extending a cluster, because when the cluster becomes unquorate (1->3),
wait_for_all_status is set.
The added assert only ensures that I haven't missed any case where a quorate
cluster may become unquorate.
Signed-off-by: Jan Friesse <jfriesse>
Reviewed-by: Christine Caulfield <ccaulfie>
(cherry picked from commit ca320beac25f82c0c555799e647a47975a333c28)
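For context on the wait_for_all setting referenced throughout the commit message, here is a minimal sketch of how to inspect it on a running corosync 2.x node. The key names shown (quorum.wait_for_all, quorum.two_node) are the standard votequorum options; which of them appear, and their values, depend on the local configuration:

# Dump the runtime configuration database and look at the votequorum keys;
# quorum.wait_for_all enables the behaviour explicitly, and quorum.two_node
# implies it.
corosync-cmapctl | grep '^quorum\.'

# Show this node's quorum and votequorum view (essentially what
# 'pcs quorum status' wraps).
corosync-quorumtool -s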
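A condensed reproducer, distilled from the scenario in the commit message and the verification transcripts below (a sketch only: it assumes an existing two-node cluster managed by pcs, and virt-c is a placeholder for the name of the node being added):

# Starting point: a quorate 2-node cluster with WaitForAll in effect.
pcs quorum status        # expect "Quorate: Yes", Flags "2Node Quorate WaitForAll"

# Extend the cluster; this distributes a new corosync.conf and triggers a
# runtime configuration reload on the existing nodes.
pcs cluster node add virt-c

# Before the new node's corosync actually joins, compare the two views again.
pcs quorum status
# corosync-2.4.5-4.el7: the quorum service still reports "Quorate: Yes" while
#   votequorum shows "Quorum: 2 Activity blocked" and drops the Quorate flag.
# corosync-2.4.5-5.el7: both views agree (the cluster stays quorate), as shown
#   in the transcripts below.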
before (rhel-7.8, corosync-2.4.5-4.el7)
=======================================

[root@virt-173 ~]# rpm -q corosync
corosync-2.4.5-4.el7.x86_64

[root@virt-173 ~]# pcs status
Cluster name: STSRHTS28883
Stack: corosync
Current DC: virt-173 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Tue May 26 13:50:13 2020
Last change: Tue May 26 13:48:25 2020 by root via cibadmin on virt-173

2 nodes configured
7 resources configured

Online: [ virt-173 virt-175 ]

Full list of resources:

 fence-virt-173 (stonith:fence_xvm): Started virt-173
 fence-virt-175 (stonith:fence_xvm): Started virt-175
 fence-virt-178 (stonith:fence_xvm): Started virt-173
 dummy-1 (ocf::pacemaker:Dummy): Started virt-175
 dummy-2 (ocf::pacemaker:Dummy): Started virt-173
 dummy-3 (ocf::pacemaker:Dummy): Started virt-175
 dummy-4 (ocf::pacemaker:Dummy): Started virt-173

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-173 ~]# pcs quorum status
Quorum information
------------------
Date:             Tue May 26 13:50:17 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/26
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate WaitForAll

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-173 (local)
         2          1         NR virt-175

[root@virt-173 ~]# pcs cluster node add virt-178
Disabling SBD service...
virt-178: sbd disabled
Sending remote node configuration files to 'virt-178'
virt-178: successful distribution of the file 'pacemaker_remote authkey'
virt-173: Corosync updated
virt-175: Corosync updated
Setting up corosync...
virt-178: Succeeded
Synchronizing pcsd certificates on nodes virt-178...
virt-178: Success
Restarting pcsd on the nodes in order to reload the certificates...
virt-178: Success

[root@virt-173 ~]# pcs quorum status
Quorum information
------------------
Date:             Tue May 26 13:52:25 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/26
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2 Activity blocked
Flags:            WaitForAll

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-173 (local)
         2          1         NR virt-175

[root@virt-173 ~]# pcs resource
 dummy-1 (ocf::pacemaker:Dummy): Started virt-175
 dummy-2 (ocf::pacemaker:Dummy): Started virt-173
 dummy-3 (ocf::pacemaker:Dummy): Started virt-175
 dummy-4 (ocf::pacemaker:Dummy): Started virt-173

result: quorum and votequorum are out of sync (quorate vs not)

after (rhel-7.9, corosync-2.4.5-5.el7)
======================================

[root@virt-053 ~]# rpm -q corosync
corosync-2.4.5-5.el7.x86_64

[root@virt-053 ~]# pcs status
Cluster name: STSRHTS12710
Stack: corosync
Current DC: virt-060 (version 1.1.22-1.el7-63d2d79005) - partition with quorum
Last updated: Tue May 26 13:58:35 2020
Last change: Tue May 26 13:58:20 2020 by root via cibadmin on virt-053

2 nodes configured
7 resource instances configured

Online: [ virt-053 virt-060 ]

Full list of resources:

 fence-virt-053 (stonith:fence_xvm): Started virt-053
 fence-virt-060 (stonith:fence_xvm): Started virt-060
 fence-virt-070 (stonith:fence_xvm): Started virt-053
 dummy-1 (ocf::pacemaker:Dummy): Started virt-060
 dummy-2 (ocf::pacemaker:Dummy): Started virt-053
 dummy-3 (ocf::pacemaker:Dummy): Started virt-060
 dummy-4 (ocf::pacemaker:Dummy): Started virt-053

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@virt-053 ~]# pcs quorum status
Quorum information
------------------
Date:             Tue May 26 13:58:43 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/35
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate WaitForAll

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-053 (local)
         2          1         NR virt-060

[root@virt-053 ~]# pcs cluster node add virt-070
Disabling SBD service...
virt-070: sbd disabled
Sending remote node configuration files to 'virt-070'
virt-070: successful distribution of the file 'pacemaker_remote authkey'
virt-053: Corosync updated
virt-060: Corosync updated
Setting up corosync...
virt-070: Succeeded
Synchronizing pcsd certificates on nodes virt-070...
virt-070: Success
Restarting pcsd on the nodes in order to reload the certificates...
virt-070: Success

[root@virt-053 ~]# pcs quorum status
Quorum information
------------------
Date:             Tue May 26 13:59:40 2020
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1/35
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      2
Quorum:           2
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR virt-053 (local)
         2          1         NR virt-060

[root@virt-053 ~]# pcs resource
 dummy-1 (ocf::pacemaker:Dummy): Started virt-060
 dummy-2 (ocf::pacemaker:Dummy): Started virt-053
 dummy-3 (ocf::pacemaker:Dummy): Started virt-060
 dummy-4 (ocf::pacemaker:Dummy): Started virt-053

result: both quorum and votequorum report same state (quorate)

marking verified in corosync-2.4.5-5.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (corosync bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3924