Description of problem:
-----------------------
4-node cluster (reproduced on a 2-node cluster too). The quorum requirement is that more than n/2 nodes should have pacemaker running. Stop the pacemaker and ganesha services (actually, stopping only pacemaker would do) on 3 of the 4 nodes. Once quorum is lost, pcs status should ideally say "partition _without_ quorum". That's not the case, though: it still says "partition with quorum".

**pcs status on the 4th node where I kept pacemaker running** :

[root@dhcp42-125 ~]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp42-125.lab.eng.blr.redhat.com (version 1.1.16-8.el7-94ff4df) - partition with quorum
Last updated: Wed Jun 14 20:03:51 2017
Last change: Wed Jun 14 19:52:41 2017 by root via cibadmin on dhcp42-125.lab.eng.blr.redhat.com

4 nodes configured
24 resources configured

Online: [ dhcp42-125.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ dhcp42-125.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp42-125.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp42-125.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
 Resource Group: dhcp42-125.lab.eng.blr.redhat.com-group
     dhcp42-125.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-125.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-125.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started dhcp42-125.lab.eng.blr.redhat.com
 Resource Group: dhcp42-127.lab.eng.blr.redhat.com-group
     dhcp42-127.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-127.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-127.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started dhcp42-125.lab.eng.blr.redhat.com
 Resource Group: dhcp42-129.lab.eng.blr.redhat.com-group
     dhcp42-129.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-129.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-129.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started dhcp42-125.lab.eng.blr.redhat.com
 Resource Group: dhcp42-119.lab.eng.blr.redhat.com-group
     dhcp42-119.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-119.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-119.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started dhcp42-125.lab.eng.blr.redhat.com

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@dhcp42-125 ~]#

You can see that it shows the nodes as OFFLINE, though.
**pacemaker statuses on the other 3 nodes** :

[root@dhcp42-127 ~]# service pacemaker status
Redirecting to /bin/systemctl status pacemaker.service
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2017-06-14 19:57:02 IST; 1min 23s ago
     Docs: man:pacemakerd
           http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
 Main PID: 10499 (code=exited, status=0/SUCCESS)

Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com pacemakerd[10499]: notice: Stopping lrmd
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com lrmd[10502]: notice: Caught 'Terminated' signal
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com pacemakerd[10499]: notice: Stopping stonith-ng
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com stonith-ng[10501]: notice: Caught 'Terminated' signal
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com pacemakerd[10499]: notice: Stopping cib
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com cib[10500]: notice: Caught 'Terminated' signal
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com cib[10500]: notice: Disconnected from Corosync
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com cib[10500]: notice: Disconnected from Corosync
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com pacemakerd[10499]: notice: Shutdown complete
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
[root@dhcp42-127 ~]#

[root@dhcp42-129 ~]# service pacemaker status
Redirecting to /bin/systemctl status pacemaker.service
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2017-06-14 19:57:15 IST; 1min 10s ago
     Docs: man:pacemakerd
           http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
 Main PID: 14208 (code=exited, status=0/SUCCESS)

Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com pacemakerd[14208]: notice: Stopping lrmd
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com lrmd[14211]: notice: Caught 'Terminated' signal
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com pacemakerd[14208]: notice: Stopping stonith-ng
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com stonith-ng[14210]: notice: Caught 'Terminated' signal
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com pacemakerd[14208]: notice: Stopping cib
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com cib[14209]: notice: Caught 'Terminated' signal
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com cib[14209]: notice: Disconnected from Corosync
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com cib[14209]: notice: Disconnected from Corosync
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com pacemakerd[14208]: notice: Shutdown complete
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
[root@dhcp42-129 ~]#

[root@dhcp42-119 ~]# service pacemaker status
Redirecting to /bin/systemctl status pacemaker.service
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2017-06-14 19:57:09 IST; 1min 17s ago
     Docs: man:pacemakerd
           http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
 Main PID: 7483 (code=exited, status=0/SUCCESS)

Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com pacemakerd[7483]: notice: Stopping stonith-ng
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com stonith-ng[7485]: notice: Caught 'Terminated' signal
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com cib[7484]: warning: new_event_notification (7484-7485-13): Br...32)
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com cib[7484]: warning: Notification of client stonithd/f1402075-...led
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com pacemakerd[7483]: notice: Stopping cib
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com cib[7484]: notice: Caught 'Terminated' signal
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com cib[7484]: notice: Disconnected from Corosync
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com cib[7484]: notice: Disconnected from Corosync
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com pacemakerd[7483]: notice: Shutdown complete
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
Hint: Some lines were ellipsized, use -l to show in full.
[root@dhcp42-119 ~]#

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
[root@dhcp42-125 ~]# rpm -qa|grep gane
nfs-ganesha-gluster-2.4.4-8.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-28.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.4-8.el7rhgs.x86_64
nfs-ganesha-2.4.4-8.el7rhgs.x86_64
[root@dhcp42-125 ~]# rpm -qa|grep libntir
libntirpc-1.4.3-4.el7rhgs.x86_64
[root@dhcp42-125 ~]# rpm -qa|grep pacemaker
pacemaker-libs-1.1.16-8.el7.x86_64
pacemaker-cli-1.1.16-8.el7.x86_64
pacemaker-cluster-libs-1.1.16-8.el7.x86_64
pacemaker-1.1.16-8.el7.x86_64
[root@dhcp42-125 ~]# rpm -qa|grep corosync
corosync-2.4.0-9.el7.x86_64
corosynclib-2.4.0-9.el7.x86_64
[root@dhcp42-125 ~]# rpm -qa|grep pcs
pcs-0.9.157-1.el7.x86_64
[root@dhcp42-125 ~]# rpm -qa|grep resource-ag
resource-agents-3.9.5-97.el7.x86_64
[root@dhcp42-125 ~]#

How reproducible:
-----------------
2/2 - on 2 different setups.

Steps to Reproduce:
-------------------
1. Have a Ganesha cluster up and running.
2. Stop the pacemaker service on more than n/2 nodes.
3. Check pcs status.

Actual results:
---------------
pcs status shows "partition with quorum".

Expected results:
-----------------
Since quorum is lost, pcs status should reflect the correct state of the cluster: "partition without quorum".
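For context, with corosync defaults the quorum threshold is a simple node count: floor(expected_votes / 2) + 1. A minimal sketch of that arithmetic (the quorum_threshold helper is illustrative, not a corosync or pcs command):

```shell
# Default corosync quorum threshold: strictly more than half the expected votes.
# quorum = floor(expected_votes / 2) + 1
quorum_threshold() {
    local expected_votes=$1
    echo $(( expected_votes / 2 + 1 ))
}

quorum_threshold 4   # a 4-node cluster needs 3 votes -> prints 3
quorum_threshold 2   # a 2-node cluster needs 2 votes -> prints 2
```

So on this 4-node cluster, losing 3 votes should drop the partition below the 3-vote threshold, which is why "partition with quorum" looked wrong here.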
A few data points here:

* Tried this on RHEL 7.3. Stopping the pacemaker/ganesha services on 3 of 4 nodes - pcs status still showed "partition with quorum", just like on RHEL 7.4.
* If I do a node reboot instead, pacemaker quorum is lost on RHEL 7.3 and pcs status is appropriate.
Will try node reboot scenario on 7.4 and update.
[17:04:15] <kkeithley> kgaillot: circling back to my question about quorum-- I misstated yesterday. On a 4 node cluster, if pacemaker is stopped on 3 nodes, pcs status still shows quorum. If (same) 3 nodes are downed, then it goes to no quorum.
[17:04:51] <kkeithley> behaves the same on both RHEL7.3 and RHEL7.4.
[17:05:10] <kgaillot> kkeithley: that sounds like how last-man-standing behaves
[17:05:23] <kkeithley> I'm not sure why our QE is testing this, but they are
[17:06:47] <kgaillot> or allow_downscale
[17:08:05] <kgaillot> the corosync folks are gone for the day, but they would have the best idea of what's going on
[17:08:22] <kgaillot> chrissie_away and honza
[17:09:49] <kgaillot> with all corosync defaults, the cluster should lose quorum when 2 nodes are stopped or downed

(Ambarish, is my description of what you did accurate?)
(In reply to Kaleb KEITHLEY from comment #8)
> [17:04:15] <kkeithley> kgaillot: circling back to my question about quorum--
> I misstated yesterday. On a 4 node cluster, if pacemaker is stopped on 3
> nodes, pcs status still shows quorum. If (same) 3 nodes are downed, then it
> goes to no quorum.
> [17:04:51] <kkeithley> behaves the same on both RHEL7.3 and RHEL7.4.
> [17:05:10] <kgaillot> kkeithley: that sounds like how last-man-standing
> behaves
> [17:05:23] <kkeithley> I'm not sure why our QE is testing this, but they are
> [17:06:47] <kgaillot> or allow_downscale
> [17:08:05] <kgaillot> the corosync folks are gone for the day, but they
> would have the best idea of what's going on
> [17:08:22] <kgaillot> chrissie_away and honza
> [17:09:49] <kgaillot> with all corosync defaults, the cluster should lose
> quorum when 2 nodes are stopped or downed
>
> (Ambarish, is my description of what you did accurate?)

"On a 4 node cluster, if pacemaker is stopped on 3 nodes, pcs status still shows quorum. If (same) 3 nodes are downed, then it goes to no quorum" - this is correct. But I am yet to test reboots on 7.4 (on 7.3 I tested reboots, and quorum was lost). I'll try to have the update by EOD.
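For reference, the votequorum options mentioned above live in the quorum section of /etc/corosync/corosync.conf. A sketch of the stock layout, assuming the default pcs-generated configuration (neither option is enabled by default):

```
quorum {
    provider: corosync_votequorum
    # Either of these, if enabled, would let a shrinking cluster retain
    # quorum as nodes leave -- but both are off by default:
    # last_man_standing: 1
    # allow_downscale: 1
}
```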
**On 7.3** :

*Node reboot* : partition without quorum
*pacemaker stop* : partition with quorum

**On 7.4** :

*Node reboot* : partition without quorum
*pacemaker stop* : partition with quorum

There's no behavior change between RHEL 7.3 and RHEL 7.4.

My (incorrect) understanding was that since it's pacemaker quorum, stopping the service should result in quorum loss.

Closing as NaB.
And in conclusion (maybe?):

[03:21:24] <chrissie> kkeithley stopping pacemaker has no effect on quorum. quorum is about nodes, not applications
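In other words, corosync counts one vote per node still in the corosync membership; whether pacemaker (or ganesha) is running on a node has no effect on its vote. A hypothetical sketch of that distinction (the node lists, state strings, and helper names are all illustrative, not corosync data or APIs):

```shell
# One vote per node in corosync membership; pacemaker state is ignored.
# Each entry is "name:corosync_state:pacemaker_state" (illustrative data).
nodes_pcmk_stopped="n1:up:up n2:up:down n3:up:down n4:up:down"   # pacemaker stopped on 3/4
nodes_downed="n1:up:up n2:down:down n3:down:down n4:down:down"   # same 3 nodes powered off

count_votes() {
    local votes=0 entry
    for entry in $1; do
        case "$entry" in
            *:up:*) votes=$((votes + 1)) ;;   # corosync up => the vote counts
        esac
    done
    echo "$votes"
}

needed=$(( 4 / 2 + 1 ))   # default threshold for 4 expected votes: 3
echo "pacemaker stopped: $(count_votes "$nodes_pcmk_stopped") votes, needed $needed"
echo "nodes downed:      $(count_votes "$nodes_downed") votes, needed $needed"
```

This mirrors what was observed: stopping pacemaker on 3 of 4 nodes leaves 4 corosync votes (quorum kept), while downing the same 3 nodes leaves 1 vote (quorum lost).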