Description of problem:
-----------------------

**This might be a consequence of https://bugzilla.redhat.com/show_bug.cgi?id=1403587. Raising a separate BZ for Dev to take a look.**

While running multithreaded perf tests on a RHEL 6.X cluster, nodes went to Stopped state (tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1403587).

Another thing I noticed was that pcs status was not consistent across the nodes for nearly 10 minutes. I don't think that's expected behaviour.

When nodes go to Stopped state, all services appear alive and running:

[root@gqas013 ~]# service nfs-ganesha status
ganesha.nfsd (pid 21613) is running...
[root@gqas013 ~]#
[root@gqas013 ~]# service pacemaker status
pacemakerd (pid 22548) is running...
[root@gqas013 ~]#
[root@gqas013 ~]#
[root@gqas013 ~]# service corosync status
corosync (pid 22268) is running...
[root@gqas013 ~]#
[root@gqas013 ~]# service pcsd status
pcsd (pid 22604) is running...
[root@gqas013 ~]#

-----------
*gqas011*:
-----------

[root@gqas011 ~]# pcs status
Cluster name: G14623742.03
Last updated: Mon Dec 12 10:35:52 2016
Last change: Mon Dec 12 10:26:10 2016 by root via cibadmin on gqas013.sbu.lab.eng.bos.redhat.com
Stack: cman
Current DC: gqas011.sbu.lab.eng.bos.redhat.com (version 1.1.14-8.el6_8.2-70404b0) - partition WITHOUT quorum
4 nodes and 24 resources configured

Online: [ gqas011.sbu.lab.eng.bos.redhat.com ]
OFFLINE: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Stopped: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas011.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
 Resource Group: gqas013.sbu.lab.eng.bos.redhat.com-group
     gqas013.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     gqas013.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     gqas013.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: gqas005.sbu.lab.eng.bos.redhat.com-group
     gqas005.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     gqas005.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     gqas005.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: gqas006.sbu.lab.eng.bos.redhat.com-group
     gqas006.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     gqas006.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     gqas006.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: gqas011.sbu.lab.eng.bos.redhat.com-group
     gqas011.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     gqas011.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     gqas011.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped

Failed Actions:
* nfs-mon_monitor_10000 on gqas011.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=12, status=Timed Out, exitreason='none', last-rc-change='Mon Dec 12 10:28:42 2016', queued=0ms, exec=0ms

PCSD Status:
  gqas013.sbu.lab.eng.bos.redhat.com: Online
  gqas005.sbu.lab.eng.bos.redhat.com: Online
  gqas006.sbu.lab.eng.bos.redhat.com: Online
  gqas011.sbu.lab.eng.bos.redhat.com: Online

[root@gqas011 ~]#

---------
gqas013 :
---------

[root@gqas013 ~]#
[root@gqas013 ~]# pcs status
Cluster name: G14623742.03
Last updated: Mon Dec 12 10:35:52 2016
Last change: Mon Dec 12 10:26:10 2016 by root via cibadmin on gqas013.sbu.lab.eng.bos.redhat.com
Stack: cman
Current DC: gqas013.sbu.lab.eng.bos.redhat.com (version 1.1.14-8.el6_8.2-70404b0) - partition with quorum
4 nodes and 24 resources configured

Online: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
OFFLINE: [ gqas011.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gqas011.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gqas011.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gqas011.sbu.lab.eng.bos.redhat.com ]
 Resource Group: gqas013.sbu.lab.eng.bos.redhat.com-group
     gqas013.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas013.sbu.lab.eng.bos.redhat.com
     gqas013.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas013.sbu.lab.eng.bos.redhat.com
     gqas013.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas013.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas005.sbu.lab.eng.bos.redhat.com-group
     gqas005.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas005.sbu.lab.eng.bos.redhat.com
     gqas005.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas005.sbu.lab.eng.bos.redhat.com
     gqas005.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas005.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas006.sbu.lab.eng.bos.redhat.com-group
     gqas006.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas006.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas006.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas011.sbu.lab.eng.bos.redhat.com-group
     gqas011.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas011.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas011.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com

Failed Actions:
* nfs-mon_monitor_10000 on gqas006.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=12, status=Timed Out, exitreason='none', last-rc-change='Mon Dec 12 10:28:42 2016', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on gqas005.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=12, status=Timed Out, exitreason='none', last-rc-change='Mon Dec 12 10:28:42 2016', queued=0ms, exec=0ms

PCSD Status:
  gqas013.sbu.lab.eng.bos.redhat.com: Online
  gqas005.sbu.lab.eng.bos.redhat.com: Online
  gqas006.sbu.lab.eng.bos.redhat.com: Online
  gqas011.sbu.lab.eng.bos.redhat.com: Online

[root@gqas013 ~]#

---------
gqas005 :
---------

[root@gqas005 ~]# pcs status
Cluster name: G14623742.03
Last updated: Mon Dec 12 10:35:52 2016
Last change: Mon Dec 12 10:26:10 2016 by root via cibadmin on gqas013.sbu.lab.eng.bos.redhat.com
Stack: cman
Current DC: gqas013.sbu.lab.eng.bos.redhat.com (version 1.1.14-8.el6_8.2-70404b0) - partition with quorum
4 nodes and 24 resources configured

Online: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
OFFLINE: [ gqas011.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gqas011.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gqas011.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gqas011.sbu.lab.eng.bos.redhat.com ]
 Resource Group: gqas013.sbu.lab.eng.bos.redhat.com-group
     gqas013.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas013.sbu.lab.eng.bos.redhat.com
     gqas013.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas013.sbu.lab.eng.bos.redhat.com
     gqas013.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas013.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas005.sbu.lab.eng.bos.redhat.com-group
     gqas005.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas005.sbu.lab.eng.bos.redhat.com
     gqas005.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas005.sbu.lab.eng.bos.redhat.com
     gqas005.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas005.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas006.sbu.lab.eng.bos.redhat.com-group
     gqas006.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas006.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas006.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas011.sbu.lab.eng.bos.redhat.com-group
     gqas011.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas011.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas011.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com

Failed Actions:
* nfs-mon_monitor_10000 on gqas006.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=12, status=Timed Out, exitreason='none', last-rc-change='Mon Dec 12 10:28:42 2016', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on gqas005.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=12, status=Timed Out, exitreason='none', last-rc-change='Mon Dec 12 10:28:42 2016', queued=0ms, exec=0ms

PCSD Status:
  gqas013.sbu.lab.eng.bos.redhat.com: Online
  gqas005.sbu.lab.eng.bos.redhat.com: Online
  gqas006.sbu.lab.eng.bos.redhat.com: Online
  gqas011.sbu.lab.eng.bos.redhat.com: Online

[root@gqas005 ~]#

---------
gqas006 :
---------

[root@gqas006 ~]# pcs status
Cluster name: G14623742.03
Last updated: Mon Dec 12 10:35:52 2016
Last change: Mon Dec 12 10:26:10 2016 by root via cibadmin on gqas013.sbu.lab.eng.bos.redhat.com
Stack: cman
Current DC: gqas013.sbu.lab.eng.bos.redhat.com (version 1.1.14-8.el6_8.2-70404b0) - partition with quorum
4 nodes and 24 resources configured

Online: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
OFFLINE: [ gqas011.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gqas011.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gqas011.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gqas005.sbu.lab.eng.bos.redhat.com gqas006.sbu.lab.eng.bos.redhat.com gqas013.sbu.lab.eng.bos.redhat.com ]
     Stopped: [ gqas011.sbu.lab.eng.bos.redhat.com ]
 Resource Group: gqas013.sbu.lab.eng.bos.redhat.com-group
     gqas013.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas013.sbu.lab.eng.bos.redhat.com
     gqas013.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas013.sbu.lab.eng.bos.redhat.com
     gqas013.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas013.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas005.sbu.lab.eng.bos.redhat.com-group
     gqas005.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas005.sbu.lab.eng.bos.redhat.com
     gqas005.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas005.sbu.lab.eng.bos.redhat.com
     gqas005.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas005.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas006.sbu.lab.eng.bos.redhat.com-group
     gqas006.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas006.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas006.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com
 Resource Group: gqas011.sbu.lab.eng.bos.redhat.com-group
     gqas011.sbu.lab.eng.bos.redhat.com-nfs_block (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas011.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started gqas006.sbu.lab.eng.bos.redhat.com
     gqas011.sbu.lab.eng.bos.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started gqas006.sbu.lab.eng.bos.redhat.com

Failed Actions:
* nfs-mon_monitor_10000 on gqas006.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=12, status=Timed Out, exitreason='none', last-rc-change='Mon Dec 12 10:28:42 2016', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on gqas005.sbu.lab.eng.bos.redhat.com 'unknown error' (1): call=12, status=Timed Out, exitreason='none', last-rc-change='Mon Dec 12 10:28:42 2016', queued=0ms, exec=0ms

PCSD Status:
  gqas013.sbu.lab.eng.bos.redhat.com: Online
  gqas005.sbu.lab.eng.bos.redhat.com: Online
  gqas006.sbu.lab.eng.bos.redhat.com: Online
  gqas011.sbu.lab.eng.bos.redhat.com: Online

[root@gqas006 ~]#

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

glusterfs-ganesha-3.8.4-7.el6rhs.x86_64
nfs-ganesha-2.4.1-2.el6rhs.x86_64

How reproducible:
-----------------

Sporadically.

Steps to Reproduce:
-------------------

1. Create a 4-node Ganesha cluster on RHEL 6. Mount a 2x2 volume on 4 clients and run iozone.
2. Wait for nodes to go to Stopped state.
3. Check pcs status on all nodes periodically.

Actual results:
----------------

pcs status differs between nodes for some time. After ~8 minutes, it went to Stopped on all nodes.

Expected results:
-----------------

pcs status should reflect the current state of the cluster, and should be the same on all nodes.

Additional info:
----------------

Server OS : RHEL 6.8
Client OS : RHEL 7.3

*Vol Config* :

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: c43082bb-e807-46b8-8e07-c8eae54eec21
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas011.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
nfs.disable: on
performance.readdir-ahead: on
performance.stat-prefetch: off
server.allow-insecure: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas005 /]#

[root@gqas013 ~]# rpm -qa|grep pacemaker
pacemaker-cluster-libs-1.1.14-8.el6_8.2.x86_64
pacemaker-1.1.14-8.el6_8.2.x86_64
pacemaker-cli-1.1.14-8.el6_8.2.x86_64
pacemaker-libs-1.1.14-8.el6_8.2.x86_64
[root@gqas013 ~]# rpm -qa|grep pcs
pcs-0.9.148-7.el6_8.1.x86_64
[root@gqas013 ~]# rpm -qa|grep corosync
corosynclib-1.4.7-5.el6.x86_64
corosync-1.4.7-5.el6.x86_64
[root@gqas013 ~]# rpm -qa|grep cman
cman-3.0.12.1-78.el6.x86_64
[root@gqas013 ~]#
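
For anyone repeating step 3, the per-node comparison can be scripted instead of eyeballed. The following is only a rough sketch in Python, not part of the original test setup: the names `parse_view`, `group_by_view` and `SAMPLE` are made up for illustration, and the sample text is an abbreviated excerpt of the pcs status output captured above. It assumes the `pcs status` text from each node has already been collected (for example over ssh) and only looks at the "Current DC", quorum, "Online" and "OFFLINE" fields.

```python
import re
from collections import defaultdict


def parse_view(pcs_status_text):
    """Return (DC, has_quorum, online_nodes, offline_nodes) from `pcs status` text."""
    dc = re.search(r"Current DC:\s*(\S+)", pcs_status_text)
    online = re.search(r"Online:\s*\[([^\]]*)\]", pcs_status_text)
    offline = re.search(r"OFFLINE:\s*\[([^\]]*)\]", pcs_status_text)
    return (
        dc.group(1) if dc else None,
        # Case-sensitive on purpose: "partition WITHOUT quorum" must not match.
        "partition with quorum" in pcs_status_text,
        frozenset(online.group(1).split()) if online else frozenset(),
        frozenset(offline.group(1).split()) if offline else frozenset(),
    )


def group_by_view(outputs):
    """Map each distinct cluster view to the sorted list of nodes that reported it."""
    groups = defaultdict(list)
    for node, text in outputs.items():
        groups[parse_view(text)].append(node)
    return {view: sorted(nodes) for view, nodes in groups.items()}


# Abbreviated excerpts of the outputs in this report (illustrative only).
D = "sbu.lab.eng.bos.redhat.com"
MAJORITY_VIEW = (
    f"Current DC: gqas013.{D} (version 1.1.14-8.el6_8.2-70404b0) - partition with quorum\n"
    f"Online: [ gqas005.{D} gqas006.{D} gqas013.{D} ]\n"
    f"OFFLINE: [ gqas011.{D} ]\n"
)
MINORITY_VIEW = (
    f"Current DC: gqas011.{D} (version 1.1.14-8.el6_8.2-70404b0) - partition WITHOUT quorum\n"
    f"Online: [ gqas011.{D} ]\n"
    f"OFFLINE: [ gqas005.{D} gqas006.{D} gqas013.{D} ]\n"
)
SAMPLE = {
    "gqas005": MAJORITY_VIEW,
    "gqas006": MAJORITY_VIEW,
    "gqas011": MINORITY_VIEW,
    "gqas013": MAJORITY_VIEW,
}

if __name__ == "__main__":
    views = group_by_view(SAMPLE)
    for (dc, quorate, online, offline), nodes in views.items():
        print(f"{', '.join(nodes)}: DC={dc} quorate={quorate} "
              f"online={len(online)} offline={len(offline)}")
    if len(views) > 1:
        print("pcs status is NOT consistent across nodes")
```

With the excerpts above, the nodes fall into two groups: gqas005, gqas006 and gqas013 share one view, and gqas011 reports its own. More than one group is the inconsistency this bug describes.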
Upstream patch http://review.gluster.org/#/c/16122/ posted for review.
upstream mainline : http://review.gluster.org/16122
release-3.9 : http://review.gluster.org/16139
release-3.8 : http://review.gluster.org/16140
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/93080
I hit the same issue in a node reboot scenario on RHEL 6. With the local fix provided by Jiffin, I retested the node reboot scenario and no longer observe the issue.
Verified on glusterfs-3.8.4-11 and Ganesha 2.4.1-4. Could not reproduce the reported issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0484.html