1399757 – Ganesha services are not stopped when pacemaker quorum is lost

Summary: Ganesha services are not stopped when pacemaker quorum is lost

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	nfs-ganesha
Sub Component:
Version:	rhgs-3.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.2.0
Assignee:	Kaleb KEITHLEY
QA Contact:	Arthy Loganathan
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1351528 1400237 1400572 1400573

Reported:	2016-11-29 16:37 UTC by Arthy Loganathan
Modified:	2017-03-23 05:52 UTC
CC List:	9 users
Fixed In Version:	glusterfs-3.8.4-7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1400237
Environment:
Last Closed:	2017-03-23 05:52:40 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:0486	0	normal	SHIPPED_LIVE	Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update	2017-03-23 09:18:45 UTC

Description Arthy Loganathan 2016-11-29 16:37:53 UTC

Description of problem:
Ganesha services are not stopped when pacemaker quorum is lost

Version-Release number of selected component (if applicable):
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Reboot 2 nodes in the ganesha cluster.
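
Why rebooting 2 of 4 nodes loses quorum can be sketched with simple majority arithmetic. This is only an illustration: it assumes one vote per node and a strict-majority rule, with no special corosync quorum options in effect. The function name has_quorum is hypothetical and not part of any cluster tool.

```python
def has_quorum(total_votes: int, live_votes: int) -> bool:
    """Return True if live_votes is a strict majority of total_votes.

    Assumption: one vote per node and plain majority quorum.
    """
    votes_needed = total_votes // 2 + 1
    return live_votes >= votes_needed


# 4-node cluster with 2 nodes rebooting: 2 live votes, 3 needed.
print(has_quorum(4, 2))  # False
# With only 1 node down the partition would still be quorate.
print(has_quorum(4, 3))  # True
```

Under these assumptions the two surviving nodes hold exactly half of the votes, which is not a majority, so the partition is reported as WITHOUT quorum.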

Actual results:
Ganesha services are not stopped when pacemaker quorum is lost

Expected results:
If pacemaker quorum is lost then nfs ganesha services should be stopped.

Additional info:

If no-quorum-policy is set to 'stop' in the cluster, all resources in the cluster should be stopped when quorum is lost.
Reference: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/ch-clusteropts-HAAR.html

[root@dhcp46-42 ~]# pcs property list --all | grep quorum
 no-quorum-policy: stop
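
The same check could be scripted, for example in a test harness. The following is a minimal sketch that parses text in the "key: value" form shown above; the function name parse_pcs_properties is hypothetical, and the sample string is copied from the output in this report rather than read from a live cluster.

```python
def parse_pcs_properties(text: str) -> dict:
    """Parse 'key: value' lines, as printed by 'pcs property list --all'.

    Lines without a value after the colon (such as a section header)
    are skipped.
    """
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            props[key.strip()] = value.strip()
    return props


# Sample taken from the output above.
sample = " no-quorum-policy: stop\n"
policy = parse_pcs_properties(sample).get("no-quorum-policy")
print(policy)  # stop
```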

/var/log/messages log snippet:
-------------------------------

Nov 29 17:27:36 dhcp46-42 crmd[28465]:  notice: do_shutdown of peer dhcp47-155.lab.eng.blr.redhat.com is complete
Nov 29 17:27:36 dhcp46-42 attrd[28463]:  notice: Node dhcp47-155.lab.eng.blr.redhat.com state is now lost
Nov 29 17:27:36 dhcp46-42 attrd[28463]:  notice: Removing all dhcp47-155.lab.eng.blr.redhat.com attributes for peer loss
Nov 29 17:27:36 dhcp46-42 attrd[28463]:  notice: Lost attribute writer dhcp47-155.lab.eng.blr.redhat.com
Nov 29 17:27:36 dhcp46-42 attrd[28463]:  notice: Purged 1 peers with id=3 and/or uname=dhcp47-155.lab.eng.blr.redhat.com from the membership cache
Nov 29 17:27:36 dhcp46-42 stonith-ng[28461]:  notice: Node dhcp47-155.lab.eng.blr.redhat.com state is now lost
Nov 29 17:27:36 dhcp46-42 stonith-ng[28461]:  notice: Purged 1 peers with id=3 and/or uname=dhcp47-155.lab.eng.blr.redhat.com from the membership cache
Nov 29 17:27:36 dhcp46-42 cib[28460]:  notice: Node dhcp47-155.lab.eng.blr.redhat.com state is now lost
Nov 29 17:27:36 dhcp46-42 cib[28460]:  notice: Purged 1 peers with id=3 and/or uname=dhcp47-155.lab.eng.blr.redhat.com from the membership cache
Nov 29 17:27:36 dhcp46-42 crmd[28465]:  notice: Result of start operation for dhcp47-155.lab.eng.blr.redhat.com-nfs_block on dhcp46-42.lab.eng.blr.redhat.com: 0 (ok)
Nov 29 17:27:36 dhcp46-42 crmd[28465]:  notice: Initiating monitor operation dhcp47-155.lab.eng.blr.redhat.com-nfs_block_monitor_10000 locally on dhcp46-42.lab.eng.blr.redhat.com
Nov 29 17:27:36 dhcp46-42 crmd[28465]:  notice: Initiating start operation dhcp47-155.lab.eng.blr.redhat.com-cluster_ip-1_start_0 locally on dhcp46-42.lab.eng.blr.redhat.com
Nov 29 17:27:36 dhcp46-42 corosync[28442]: [TOTEM ] A new membership (10.70.46.42:1620) was formed. Members left: 3
Nov 29 17:27:36 dhcp46-42 corosync[28442]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 29 17:27:36 dhcp46-42 corosync[28442]: [QUORUM] Members[2]: 1 4
Nov 29 17:27:36 dhcp46-42 corosync[28442]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 17:27:36 dhcp46-42 crmd[28465]: warning: Quorum lost
Nov 29 17:27:36 dhcp46-42 crmd[28465]:  notice: Node dhcp47-155.lab.eng.blr.redhat.com state is now lost
Nov 29 17:27:36 dhcp46-42 pacemakerd[28458]: warning: Quorum lost
Nov 29 17:27:36 dhcp46-42 pacemakerd[28458]:  notice: Node dhcp47-155.lab.eng.blr.redhat.com state is now lost
Nov 29 17:27:36 dhcp46-42 crmd[28465]:  notice: do_shutdown of peer dhcp47-155.lab.eng.blr.redhat.com is complete.
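
To pick the quorum events out of a longer /var/log/messages file, something like the sketch below could be used. The regular expression is an assumption fitted to the syslog lines in this snippet (timestamp, host, daemon[pid], severity, message); it is not a general syslog parser, and lines in other layouts, such as the corosync ones, are simply skipped.

```python
import re

# Assumed layout: "Mon DD HH:MM:SS host daemon[pid]: severity: message"
LOG_RE = re.compile(
    r"^(\w{3} +\d+ [\d:]{8}) (\S+) ([\w-]+)\[(\d+)\]:\s+(\w+): (.*)$"
)


def quorum_events(log_text: str) -> list:
    """Return (timestamp, daemon, severity) for each 'Quorum lost' message."""
    events = []
    for line in log_text.splitlines():
        m = LOG_RE.match(line)
        if m and m.group(6).startswith("Quorum lost"):
            events.append((m.group(1), m.group(3), m.group(5)))
    return events


# Lines copied from the snippet above.
sample = (
    "Nov 29 17:27:36 dhcp46-42 corosync[28442]: [QUORUM] Members[2]: 1 4\n"
    "Nov 29 17:27:36 dhcp46-42 crmd[28465]: warning: Quorum lost\n"
    "Nov 29 17:27:36 dhcp46-42 pacemakerd[28458]: warning: Quorum lost\n"
)
for event in quorum_events(sample):
    print(event)
```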

pcs status:
------------

[root@dhcp46-42 ~]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp46-42.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition WITHOUT quorum
Last updated: Tue Nov 29 20:07:55 2016		Last change: Tue Nov 29 17:25:21 2016 by root via cibadmin on dhcp46-42.lab.eng.blr.redhat.com

4 nodes and 24 resources configured

Online: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-42.lab.eng.blr.redhat.com-group
     dhcp46-42.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Stopped
     dhcp46-42.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped
     dhcp46-42.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Stopped
 Resource Group: dhcp46-101.lab.eng.blr.redhat.com-group
     dhcp46-101.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	FAILED dhcp46-42.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp47-155.lab.eng.blr.redhat.com-group
     dhcp47-155.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Stopped
     dhcp47-155.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped
     dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Stopped
 Resource Group: dhcp47-167.lab.eng.blr.redhat.com-group
     dhcp47-167.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Stopped
     dhcp47-167.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped
     dhcp47-167.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Stopped

Failed Actions:
* dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock_stop_0 on dhcp46-42.lab.eng.blr.redhat.com 'insufficient privileges' (4): call=92, status=complete, exitreason='none',
    last-rc-change='Tue Nov 29 17:28:10 2016', queued=0ms, exec=103ms
* dhcp46-42.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000 on dhcp46-42.lab.eng.blr.redhat.com 'unknown error' (1): call=73, status=Timed Out, exitreason='none',
    last-rc-change='Tue Nov 29 17:27:52 2016', queued=0ms, exec=0ms
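
The failure condition in this report (partition without quorum, yet some resources not Stopped) can be detected from the pcs status text. The following is a rough sketch under the assumption that primitive resource lines have the "name (ocf::provider:agent): State" layout seen above; the function name check_status is hypothetical, and clone set lines ("Started: [ ... ]") are ignored here.

```python
import re

# Assumed layout of a primitive resource line in 'pcs status' output.
RESOURCE_RE = re.compile(r"\s*(\S+)\s+\((ocf::[^)]+)\):\s+(\S+)")


def check_status(status_text: str):
    """Return (quorate, active) for a block of 'pcs status' text.

    active lists (resource, state) pairs whose state is not 'Stopped'.
    """
    quorate = "partition WITHOUT quorum" not in status_text
    active = []
    for line in status_text.splitlines():
        m = RESOURCE_RE.match(line)
        if m and m.group(3) != "Stopped":
            active.append((m.group(1), m.group(3)))
    return quorate, active


# Lines copied from the pcs status output above.
sample = """\
Current DC: dhcp46-42.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition WITHOUT quorum
     dhcp46-42.lab.eng.blr.redhat.com-nfs_block\t(ocf::heartbeat:portblock):\tStopped
     dhcp46-101.lab.eng.blr.redhat.com-nfs_block\t(ocf::heartbeat:portblock):\tStarted dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock\t(ocf::heartbeat:portblock):\tFAILED dhcp46-42.lab.eng.blr.redhat.com (blocked)
"""
quorate, active = check_status(sample)
if not quorate and active:
    print("Quorum lost but resources still active:", active)
```

With no-quorum-policy set to stop, the expectation is that the list of active resources is empty whenever the partition is without quorum.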


sosreports are located at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/ha_reboot_case/

Comment 2 Arthy Loganathan 2016-11-30 10:55:57 UTC

A few more observations:

Initially, when quorum is lost, pcs status shows:

[root@dhcp46-111 ~]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp46-111.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition WITHOUT quorum
Last updated: Wed Nov 30 16:09:13 2016		Last change: Wed Nov 30 14:46:54 2016 by root via cibadmin on dhcp46-111.lab.eng.blr.redhat.com

4 nodes and 24 resources configured

Online: [ dhcp46-111.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp46-111.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-111.lab.eng.blr.redhat.com-group
     dhcp46-111.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-111.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-111.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	FAILED dhcp46-111.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp46-115.lab.eng.blr.redhat.com-group
     dhcp46-115.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-115.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-115.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	FAILED dhcp46-111.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp46-139.lab.eng.blr.redhat.com-group
     dhcp46-139.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-139.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	FAILED dhcp46-111.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp46-124.lab.eng.blr.redhat.com-group
     dhcp46-124.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Stopped
     dhcp46-124.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped
     dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Stopped


However, sometimes after about 2 hours, the services on some of the nodes go to the stopped state.


Online: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-42.lab.eng.blr.redhat.com-group
     dhcp46-42.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Stopped
     dhcp46-42.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped
     dhcp46-42.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Stopped
 Resource Group: dhcp46-101.lab.eng.blr.redhat.com-group
     dhcp46-101.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	FAILED dhcp46-42.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp47-155.lab.eng.blr.redhat.com-group
     dhcp47-155.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Stopped
     dhcp47-155.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped
     dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Stopped
 Resource Group: dhcp47-167.lab.eng.blr.redhat.com-group
     dhcp47-167.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Stopped
     dhcp47-167.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped
     dhcp47-167.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Stopped


Also, I/O continues on the mount point even after quorum is lost.

Comment 7 Atin Mukherjee 2016-12-01 04:36:34 UTC

upstream mainline patch http://review.gluster.org/#/c/15981/ posted for review.

Comment 8 Atin Mukherjee 2016-12-04 05:20:02 UTC

upstream mainline : http://review.gluster.org/#/c/15981/
upstream 3.9 : http://review.gluster.org/15991
upstream 3.8 : http://review.gluster.org/15992

downstream : https://code.engineering.redhat.com/gerrit/#/c/91896/

Comment 10 Arthy Loganathan 2017-01-16 10:15:15 UTC

I saw this issue a few times, very rarely, after the fix, but it is not seen with the latest build.

Verified the fix in the following build:

nfs-ganesha-gluster-2.4.1-4.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-11.el7rhgs.x86_64
nfs-ganesha-2.4.1-4.el7rhgs.x86_64

Comment 12 errata-xmlrpc 2017-03-23 05:52:40 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
