Bug 1461500 - [Ganesha] : Misleading pcs status quorum description.
Summary: [Ganesha] : Misleading pcs status quorum description.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: common-ha
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Kaleb KEITHLEY
QA Contact: Ambarish
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-14 15:28 UTC by Ambarish
Modified: 2017-06-16 12:48 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-16 10:59:45 UTC
Embargoed:



Description Ambarish 2017-06-14 15:28:52 UTC
Description of problem:
-----------------------

4-node cluster (reproduced on a 2-node cluster too).

The quorum requirement is that more than n/2 nodes should have pacemaker running.
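
For concreteness, the vote arithmetic behind that requirement (a sketch, assuming corosync votequorum defaults with one vote per node):

# votequorum defaults, one vote per node:
#   expected_votes = 4
#   quorum         = floor(expected_votes / 2) + 1 = 3
# With pacemaker stopped on 3 of 4 nodes, only 1 instance is left, so a
# quorum counted over running pacemaker instances would be lost (1 < 3).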

Stop the pacemaker and nfs-ganesha services (actually stopping only pacemaker is enough) on 3 of the 4 nodes.

pcs status should ideally say "partition _without_ quorum" when quorum is lost. That's not the case, though. It still says "partition with quorum".

**pcs status on the 4th node where I kept pacemaker running**

[root@dhcp42-125 ~]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp42-125.lab.eng.blr.redhat.com (version 1.1.16-8.el7-94ff4df) - partition with quorum
Last updated: Wed Jun 14 20:03:51 2017
Last change: Wed Jun 14 19:52:41 2017 by root via cibadmin on dhcp42-125.lab.eng.blr.redhat.com

4 nodes configured
24 resources configured

Online: [ dhcp42-125.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ dhcp42-125.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp42-125.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp42-125.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
 Resource Group: dhcp42-125.lab.eng.blr.redhat.com-group
     dhcp42-125.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-125.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-125.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com
 Resource Group: dhcp42-127.lab.eng.blr.redhat.com-group
     dhcp42-127.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-127.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-127.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com
 Resource Group: dhcp42-129.lab.eng.blr.redhat.com-group
     dhcp42-129.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-129.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-129.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com
 Resource Group: dhcp42-119.lab.eng.blr.redhat.com-group
     dhcp42-119.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-119.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-119.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@dhcp42-125 ~]# 


You can see that it still shows the other three nodes as OFFLINE, though.

**pacemaker status on the other 3 nodes** :

[root@dhcp42-127 ~]# service pacemaker status
Redirecting to /bin/systemctl status pacemaker.service
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2017-06-14 19:57:02 IST; 1min 23s ago
     Docs: man:pacemakerd
           http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
 Main PID: 10499 (code=exited, status=0/SUCCESS)

Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com pacemakerd[10499]:   notice: Stopping lrmd
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com lrmd[10502]:   notice: Caught 'Terminated' signal
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com pacemakerd[10499]:   notice: Stopping stonith-ng
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com stonith-ng[10501]:   notice: Caught 'Terminated' signal
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com pacemakerd[10499]:   notice: Stopping cib
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com cib[10500]:   notice: Caught 'Terminated' signal
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com cib[10500]:   notice: Disconnected from Corosync
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com cib[10500]:   notice: Disconnected from Corosync
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com pacemakerd[10499]:   notice: Shutdown complete
Jun 14 19:57:02 dhcp42-127.lab.eng.blr.redhat.com systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
[root@dhcp42-127 ~]# 

[root@dhcp42-129 ~]# service pacemaker status
Redirecting to /bin/systemctl status pacemaker.service
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2017-06-14 19:57:15 IST; 1min 10s ago
     Docs: man:pacemakerd
           http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
 Main PID: 14208 (code=exited, status=0/SUCCESS)

Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com pacemakerd[14208]:   notice: Stopping lrmd
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com lrmd[14211]:   notice: Caught 'Terminated' signal
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com pacemakerd[14208]:   notice: Stopping stonith-ng
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com stonith-ng[14210]:   notice: Caught 'Terminated' signal
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com pacemakerd[14208]:   notice: Stopping cib
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com cib[14209]:   notice: Caught 'Terminated' signal
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com cib[14209]:   notice: Disconnected from Corosync
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com cib[14209]:   notice: Disconnected from Corosync
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com pacemakerd[14208]:   notice: Shutdown complete
Jun 14 19:57:15 dhcp42-129.lab.eng.blr.redhat.com systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
[root@dhcp42-129 ~]# 


[root@dhcp42-119 ~]# service pacemaker status
Redirecting to /bin/systemctl status pacemaker.service
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2017-06-14 19:57:09 IST; 1min 17s ago
     Docs: man:pacemakerd
           http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
 Main PID: 7483 (code=exited, status=0/SUCCESS)

Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com pacemakerd[7483]:   notice: Stopping stonith-ng
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com stonith-ng[7485]:   notice: Caught 'Terminated' signal
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com cib[7484]:  warning: new_event_notification (7484-7485-13): Br...32)
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com cib[7484]:  warning: Notification of client stonithd/f1402075-...led
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com pacemakerd[7483]:   notice: Stopping cib
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com cib[7484]:   notice: Caught 'Terminated' signal
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com cib[7484]:   notice: Disconnected from Corosync
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com cib[7484]:   notice: Disconnected from Corosync
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com pacemakerd[7483]:   notice: Shutdown complete
Jun 14 19:57:09 dhcp42-119.lab.eng.blr.redhat.com systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
Hint: Some lines were ellipsized, use -l to show in full.
[root@dhcp42-119 ~]# 

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

[root@dhcp42-125 ~]# rpm -qa|grep gane
nfs-ganesha-gluster-2.4.4-8.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-28.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.4-8.el7rhgs.x86_64
nfs-ganesha-2.4.4-8.el7rhgs.x86_64
[root@dhcp42-125 ~]# 

[root@dhcp42-125 ~]# rpm -qa|grep libntir
libntirpc-1.4.3-4.el7rhgs.x86_64

[root@dhcp42-125 ~]# rpm -qa|grep pacemaker
pacemaker-libs-1.1.16-8.el7.x86_64
pacemaker-cli-1.1.16-8.el7.x86_64
pacemaker-cluster-libs-1.1.16-8.el7.x86_64
pacemaker-1.1.16-8.el7.x86_64

[root@dhcp42-125 ~]# rpm -qa|grep corosync
corosync-2.4.0-9.el7.x86_64
corosynclib-2.4.0-9.el7.x86_64
[root@dhcp42-125 ~]# 

[root@dhcp42-125 ~]# rpm -qa|grep pcs
pcs-0.9.157-1.el7.x86_64
[root@dhcp42-125 ~]# 

[root@dhcp42-125 ~]# rpm -qa|grep resource-ag
resource-agents-3.9.5-97.el7.x86_64
[root@dhcp42-125 ~]# 



How reproducible:
-----------------

2/2 - reproduced on 2 different setups.

Steps to Reproduce:
-------------------

1. Have a Ganesha cluster up and running.
2. Stop the pacemaker service on > n/2 nodes.
3. Check pcs status (see the cross-check sketch below).
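
To separate pacemaker's report from corosync's own vote accounting, the same state can be cross-checked at the corosync layer (a sketch; hostname illustrative - corosync-quorumtool ships with corosync 2.x):

[root@node ~]# pcs status | grep -i quorum      # pacemaker/DC summary line
[root@node ~]# corosync-quorumtool -s           # votequorum view: check "Total votes" and "Quorate"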

Actual results:
---------------

pcs status shows "partition with quorum".

Expected results:
-----------------

Since quorum is lost, pcs status should reflect the correct state of the cluster - "partition without quorum".

Comment 6 Ambarish 2017-06-15 12:54:09 UTC
A few data points here:


* Tried this on RHEL 7.3

Stopped the pacemaker/ganesha services on 3 of 4 nodes - pcs status still showed "in quorum", like on RHEL 7.4.


* If I do a node reboot instead, pacemaker quorum gets lost on RHEL 7.3 and pcs status is appropriate.

Comment 7 Ambarish 2017-06-15 14:36:52 UTC
Will try node reboot scenario on 7.4 and update.

Comment 8 Kaleb KEITHLEY 2017-06-15 21:13:31 UTC
[17:04:15] <kkeithley> kgaillot: circling back to my question about quorum--  I misstated yesterday.  On a 4 node cluster, if pacemaker is stopped on 3 nodes, pcs status still shows quorum.  If (same) 3 nodes are downed, then it goes to no quorum.
[17:04:51] <kkeithley> behaves the same on both RHEL7.3 and RHEL7.4.
[17:05:10] <kgaillot> kkeithley: that sounds like how last-man-standing behaves
[17:05:23] <kkeithley> I'm not sure why our QE is testing this, but they are
[17:06:47] <kgaillot> or allow_downscale
[17:08:05] <kgaillot> the corosync folks are gone for the day, but they would have the best idea of what's going on
[17:08:22] <kgaillot> chrissie_away and honza
[17:09:49] <kgaillot> with all corosync defaults, the cluster should lose quorum when 2 nodes are stopped or downed

(Ambarish, is my description of what you did accurate?)
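
(For reference, the votequorum options mentioned above live in the quorum section of /etc/corosync/corosync.conf; this is a sketch per votequorum(5), not this cluster's actual config:)

quorum {
    provider: corosync_votequorum
    # last_man_standing: 1    # recompute expected_votes as nodes leave cleanly
    # allow_downscale: 1      # experimental: allow expected_votes to shrink
}

Neither option is on by default, consistent with the note above that an all-defaults 4-node cluster should lose quorum once 2 nodes are gone.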

Comment 9 Ambarish 2017-06-16 03:02:17 UTC
(In reply to Kaleb KEITHLEY from comment #8)
> [17:04:15] <kkeithley> kgaillot: circling back to my question about quorum--
> I misstated yesterday.  On a 4 node cluster, if pacemaker is stopped on 3
> nodes, pcs status still shows quorum.  If (same) 3 nodes are downed, then it
> goes to no quorum.
> [17:04:51] <kkeithley> behaves the same on both RHEL7.3 and RHEL7.4.
> [17:05:10] <kgaillot> kkeithley: that sounds like how last-man-standing
> behaves
> [17:05:23] <kkeithley> I'm not sure why our QE is testing this, but they are
> [17:06:47] <kgaillot> or allow_downscale
> [17:08:05] <kgaillot> the corosync folks are gone for the day, but they
> would have the best idea of what's going on
> [17:08:22] <kgaillot> chrissie_away and honza
> [17:09:49] <kgaillot> with all corosync defaults, the cluster should lose
> quorum when 2 nodes are stopped or downed
> 
> (Ambarish, is my description of what you did accurate?)


On a 4-node cluster, if pacemaker is stopped on 3 nodes, pcs status still shows quorum. If the (same) 3 nodes are downed, then it goes to no quorum - this is correct.

But I am yet to test reboots on 7.4 (on 7.3 I tested reboots, and quorum was lost); I'll try to have the update by EOD.

Comment 10 Ambarish 2017-06-16 10:59:45 UTC
**On 7.3** :


*Node Reboot* : partition without quorum

*pacemaker stop* : partition with quorum

**On 7.4** :


*Node Reboot* : partition without quorum

*pacemaker stop* : partition with quorum

There's no behavior change between RHEL 7.3 and RHEL 7.4.

My (incorrect) understanding was that since it's pacemaker quorum, stopping the service should result in quorum loss.

Closing as NOTABUG.

Comment 11 Kaleb KEITHLEY 2017-06-16 12:48:15 UTC
and in conclusion (maybe?):

[03:21:24] <chrissie> kkeithley stopping pacemaker has no effect on quorum. quorum is about nodes, not applications
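
That explains both observations: stopping only pacemaker leaves corosync running, so all four nodes remain corosync members and votequorum still counts 4 votes ("partition with quorum"); a reboot takes corosync down too, so membership and quorum drop. A quick way to confirm this on a node where only pacemaker was stopped (a sketch; hostname illustrative):

[root@node ~]# systemctl is-active pacemaker corosync   # expect: inactive, then active
[root@node ~]# corosync-quorumtool -s                   # "Total votes" should still be 4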

