Description of problem:
Ganesha services are not stopped when pacemaker quorum is lost

Version-Release number of selected component (if applicable):
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 4 node ganesha cluster.
2. Reboot 2 nodes in the ganesha cluster

Actual results:
Ganesha services are not stopped when pacemaker quorum is lost

Expected results:
If pacemaker quorum is lost then nfs ganesha services should be stopped.

Additional info:
If no-quorum-policy is set to 'stop' state in the cluster, all the resources in the cluster should be stopped if quorum is lost.

Reference:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/ch-clusteropts-HAAR.html

[root@dhcp46-42 ~]# pcs property list --all | grep quorum
 no-quorum-policy: stop

/var/log/messages log snippet:
-------------------------------
Nov 29 17:27:36 dhcp46-42 crmd[28465]: notice: do_shutdown of peer dhcp47-155.lab.eng.blr.redhat.com is complete
Nov 29 17:27:36 dhcp46-42 attrd[28463]: notice: Node dhcp47-155.lab.eng.blr.redhat.com state is now lost
Nov 29 17:27:36 dhcp46-42 attrd[28463]: notice: Removing all dhcp47-155.lab.eng.blr.redhat.com attributes for peer loss
Nov 29 17:27:36 dhcp46-42 attrd[28463]: notice: Lost attribute writer dhcp47-155.lab.eng.blr.redhat.com
Nov 29 17:27:36 dhcp46-42 attrd[28463]: notice: Purged 1 peers with id=3 and/or uname=dhcp47-155.lab.eng.blr.redhat.com from the membership cache
Nov 29 17:27:36 dhcp46-42 stonith-ng[28461]: notice: Node dhcp47-155.lab.eng.blr.redhat.com state is now lost
Nov 29 17:27:36 dhcp46-42 stonith-ng[28461]: notice: Purged 1 peers with id=3 and/or uname=dhcp47-155.lab.eng.blr.redhat.com from the membership cache
Nov 29 17:27:36 dhcp46-42 cib[28460]: notice: Node dhcp47-155.lab.eng.blr.redhat.com state is now lost
Nov 29 17:27:36 dhcp46-42 cib[28460]: notice: Purged 1 peers with id=3 and/or uname=dhcp47-155.lab.eng.blr.redhat.com from the membership cache
Nov 29 17:27:36 dhcp46-42 crmd[28465]: notice: Result of start operation for dhcp47-155.lab.eng.blr.redhat.com-nfs_block on dhcp46-42.lab.eng.blr.redhat.com: 0 (ok)
Nov 29 17:27:36 dhcp46-42 crmd[28465]: notice: Initiating monitor operation dhcp47-155.lab.eng.blr.redhat.com-nfs_block_monitor_10000 locally on dhcp46-42.lab.eng.blr.redhat.com
Nov 29 17:27:36 dhcp46-42 crmd[28465]: notice: Initiating start operation dhcp47-155.lab.eng.blr.redhat.com-cluster_ip-1_start_0 locally on dhcp46-42.lab.eng.blr.redhat.com
Nov 29 17:27:36 dhcp46-42 corosync[28442]: [TOTEM ] A new membership (10.70.46.42:1620) was formed. Members left: 3
Nov 29 17:27:36 dhcp46-42 corosync[28442]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 29 17:27:36 dhcp46-42 corosync[28442]: [QUORUM] Members[2]: 1 4
Nov 29 17:27:36 dhcp46-42 corosync[28442]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 29 17:27:36 dhcp46-42 crmd[28465]: warning: Quorum lost
Nov 29 17:27:36 dhcp46-42 crmd[28465]: notice: Node dhcp47-155.lab.eng.blr.redhat.com state is now lost
Nov 29 17:27:36 dhcp46-42 pacemakerd[28458]: warning: Quorum lost
Nov 29 17:27:36 dhcp46-42 pacemakerd[28458]: notice: Node dhcp47-155.lab.eng.blr.redhat.com state is now lost
Nov 29 17:27:36 dhcp46-42 crmd[28465]: notice: do_shutdown of peer dhcp47-155.lab.eng.blr.redhat.com is complete.

pcs status:
------------
[root@dhcp46-42 ~]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp46-42.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition WITHOUT quorum
Last updated: Tue Nov 29 20:07:55 2016
Last change: Tue Nov 29 17:25:21 2016 by root via cibadmin on dhcp46-42.lab.eng.blr.redhat.com

4 nodes and 24 resources configured

Online: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-42.lab.eng.blr.redhat.com-group
     dhcp46-42.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp46-42.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp46-42.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: dhcp46-101.lab.eng.blr.redhat.com-group
     dhcp46-101.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): FAILED dhcp46-42.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp47-155.lab.eng.blr.redhat.com-group
     dhcp47-155.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp47-155.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: dhcp47-167.lab.eng.blr.redhat.com-group
     dhcp47-167.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp47-167.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp47-167.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped

Failed Actions:
* dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock_stop_0 on dhcp46-42.lab.eng.blr.redhat.com 'insufficient privileges' (4): call=92, status=complete, exitreason='none',
    last-rc-change='Tue Nov 29 17:28:10 2016', queued=0ms, exec=103ms
* dhcp46-42.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000 on dhcp46-42.lab.eng.blr.redhat.com 'unknown error' (1): call=73, status=Timed Out, exitreason='none',
    last-rc-change='Tue Nov 29 17:27:52 2016', queued=0ms, exec=0ms

sosreports are located at,
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/ha_reboot_case/
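
For reference, the corosync log above reports Members[2] in a cluster of 4 configured nodes. A minimal sketch of why that partition has no quorum, assuming the default majority rule with one vote per node and no special votequorum options (such as two_node or last_man_standing) enabled; the function names are hypothetical and not part of any cluster tool:

```python
def votes_needed(total_votes: int) -> int:
    """Smallest strict majority of the configured votes."""
    return total_votes // 2 + 1


def has_quorum(online_votes: int, total_votes: int) -> bool:
    """True if the surviving partition holds a strict majority."""
    return online_votes >= votes_needed(total_votes)


if __name__ == "__main__":
    total = 4    # nodes configured in the cluster from this report
    online = 2   # nodes left after rebooting two of them
    print("votes needed:", votes_needed(total))
    print("partition has quorum:", has_quorum(online, total))
```

With 2 of 4 votes the partition falls one vote short of the majority, which is the condition under which no-quorum-policy=stop is expected to stop all resources.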
A few more observations:

Initially, when quorum is lost, pcs status shows:

[root@dhcp46-111 ~]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp46-111.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition WITHOUT quorum
Last updated: Wed Nov 30 16:09:13 2016
Last change: Wed Nov 30 14:46:54 2016 by root via cibadmin on dhcp46-111.lab.eng.blr.redhat.com

4 nodes and 24 resources configured

Online: [ dhcp46-111.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp46-111.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-111.lab.eng.blr.redhat.com-group
     dhcp46-111.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-111.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-111.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): FAILED dhcp46-111.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp46-115.lab.eng.blr.redhat.com-group
     dhcp46-115.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-115.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-115.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): FAILED dhcp46-111.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp46-139.lab.eng.blr.redhat.com-group
     dhcp46-139.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-139.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): FAILED dhcp46-111.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp46-124.lab.eng.blr.redhat.com-group
     dhcp46-124.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp46-124.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped

But sometimes, after about 2 hours, the services of some of the nodes go to the stopped state:

Online: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-42.lab.eng.blr.redhat.com-group
     dhcp46-42.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp46-42.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp46-42.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: dhcp46-101.lab.eng.blr.redhat.com-group
     dhcp46-101.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): FAILED dhcp46-42.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp47-155.lab.eng.blr.redhat.com-group
     dhcp47-155.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp47-155.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: dhcp47-167.lab.eng.blr.redhat.com-group
     dhcp47-167.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp47-167.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp47-167.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped

Also, I/O continues on the mount point even after quorum is lost.
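
The condition observed above (resources still active in a partition without quorum) can be checked mechanically from saved status output. A rough sketch, assuming the `pcs status` text layout shown in this report; the function name and the regular expression are hypothetical, this is plain line matching rather than any pcs API, and it only covers primitive resource lines inside groups, not the "Clone Set" summaries:

```python
import re

# Matches primitive resource lines such as:
#   <name> (ocf::heartbeat:IPaddr): Started <node>
RESOURCE_RE = re.compile(
    r"^\s*(?P<name>\S+)\s+\((?P<agent>[^)]+)\):\s+"
    r"(?P<state>\w+)(?:\s+(?P<node>\S+))?"
)


def active_without_quorum(status_text: str) -> list:
    """Return (resource, state, node) for every resource that is not
    Stopped, but only when the status text reports a lost quorum."""
    if "partition WITHOUT quorum" not in status_text:
        return []
    active = []
    for line in status_text.splitlines():
        match = RESOURCE_RE.match(line)
        if match and match.group("state") != "Stopped":
            active.append(
                (match.group("name"), match.group("state"),
                 match.group("node") or "")
            )
    return active


# Excerpt copied from the status output in this report.
SAMPLE = """\
Current DC: dhcp46-42.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition WITHOUT quorum
 Resource Group: dhcp46-101.lab.eng.blr.redhat.com-group
     dhcp46-101.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): FAILED dhcp46-42.lab.eng.blr.redhat.com (blocked)
     dhcp47-155.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
"""

if __name__ == "__main__":
    for name, state, node in active_without_quorum(SAMPLE):
        print(f"{state:8} {name} on {node}")
```

An empty result on a partition without quorum would match the expected behaviour for no-quorum-policy=stop, at least for grouped resources.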
Upstream mainline patch http://review.gluster.org/#/c/15981/ has been posted for review.
upstream mainline : http://review.gluster.org/#/c/15981/
upstream 3.9 : http://review.gluster.org/15991
upstream 3.8 : http://review.gluster.org/15992
downstream : https://code.engineering.redhat.com/gerrit/#/c/91896/
I saw this issue a few times, very rarely, after the fix, but it is not seen with the latest build.

Verified the fix in build:
nfs-ganesha-gluster-2.4.1-4.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-11.el7rhgs.x86_64
nfs-ganesha-2.4.1-4.el7rhgs.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html