Description of problem:
While doing ganesha disable/enable, one node goes into the Stopped state in nfs-grace-clone [nfs-grace], which in turn prevents that node from entering the grace period when a node reboot is done.

Version-Release number of selected component (if applicable):
# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.4.4-10.el7rhgs.x86_64
nfs-ganesha-2.4.4-10.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-31.el7rhgs.x86_64

# rpm -qa | grep pacemaker
pacemaker-1.1.16-12.el7.x86_64
pacemaker-libs-1.1.16-12.el7.x86_64
pacemaker-cluster-libs-1.1.16-12.el7.x86_64
pacemaker-cli-1.1.16-12.el7.x86_64

# rpm -qa | grep pcs
pcs-0.9.158-6.el7.x86_64

How reproducible:
Reporting 1st instance.

Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Enable and disable ganesha (gluster nfs-ganesha enable/disable).
3. Check pcs status.

Actual results:
One node is in the Stopped state in nfs-grace-clone [nfs-grace].

Expected results:
No node should be in the Stopped state in nfs-grace-clone.

Additional info:
[root@dhcp42-125 exports]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp42-125.lab.eng.blr.redhat.com (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Jun 29 16:21:17 2017
Last change: Thu Jun 29 15:59:28 2017 by root via crm_attribute on dhcp42-125.lab.eng.blr.redhat.com

4 nodes configured
24 resources configured

Online: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-125.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-125.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-125.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp42-125.lab.eng.blr.redhat.com ]
 Resource Group: dhcp42-125.lab.eng.blr.redhat.com-group
     dhcp42-125.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-125.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-125.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com
 Resource Group: dhcp42-127.lab.eng.blr.redhat.com-group
     dhcp42-127.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-127.lab.eng.blr.redhat.com
     dhcp42-127.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-127.lab.eng.blr.redhat.com
     dhcp42-127.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-127.lab.eng.blr.redhat.com
 Resource Group: dhcp42-129.lab.eng.blr.redhat.com-group
     dhcp42-129.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-129.lab.eng.blr.redhat.com
     dhcp42-129.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-129.lab.eng.blr.redhat.com
     dhcp42-129.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-129.lab.eng.blr.redhat.com
 Resource Group: dhcp42-119.lab.eng.blr.redhat.com-group
     dhcp42-119.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-119.lab.eng.blr.redhat.com
     dhcp42-119.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-119.lab.eng.blr.redhat.com
     dhcp42-119.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-119.lab.eng.blr.redhat.com

Failed Actions:
* nfs-mon_monitor_10000 on dhcp42-125.lab.eng.blr.redhat.com 'unknown error' (1): call=132, status=Timed Out, exitreason='none', last-rc-change='Thu Jun 29 16:20:49 2017', queued=0ms, exec=0ms
* nfs-grace_start_0 on dhcp42-125.lab.eng.blr.redhat.com 'not running' (7): call=19, status=complete, exitreason='none', last-rc-change='Thu Jun 29 15:59:10 2017', queued=0ms, exec=5045ms
* nfs-mon_monitor_10000 on dhcp42-127.lab.eng.blr.redhat.com 'unknown error' (1): call=129, status=Timed Out, exitreason='none', last-rc-change='Thu Jun 29 16:20:49 2017', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on dhcp42-129.lab.eng.blr.redhat.com 'unknown error' (1): call=129, status=Timed Out, exitreason='none', last-rc-change='Thu Jun 29 16:20:49 2017', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on dhcp42-119.lab.eng.blr.redhat.com 'unknown error' (1): call=133, status=Timed Out, exitreason='none', last-rc-change='Thu Jun 29 16:20:49 2017', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled

The following messages are observed in /var/log/messages:

Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op start for nfs-grace:0 on dhcp42-125.lab.eng.blr.redhat.com: not running (7)
Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op monitor for nfs-mon:0 on dhcp42-125.lab.eng.blr.redhat.com: unknown error (1)
Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op monitor for nfs-mon:1 on dhcp42-127.lab.eng.blr.redhat.com: unknown error (1)
Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op monitor for nfs-mon:2 on dhcp42-129.lab.eng.blr.redhat.com: unknown error (1)
Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op monitor for nfs-grace:2 on dhcp42-119.lab.eng.blr.redhat.com: not running (7)
Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op monitor for nfs-mon:3 on dhcp42-119.lab.eng.blr.redhat.com: unknown error (1)
Jun 29 16:34:14 localhost pengine[23577]: warning: Forcing nfs-grace-clone away from dhcp42-125.lab.eng.blr.redhat.com after 1000000 failures (max=1000000)
Jun 29 16:34:14 localhost pengine[23577]: warning: Forcing nfs-grace-clone away from dhcp42-125.lab.eng.blr.redhat.com after 1000000 failures (max=1000000)
Jun 29 16:34:14 localhost pengine[23577]: warning: Forcing nfs-grace-clone away from dhcp42-125.lab.eng.blr.redhat.com after 1000000 failures (max=1000000)
Jun 29 16:34:14 localhost pengine[23577]: warning: Forcing nfs-grace-clone away from dhcp42-125.lab.eng.blr.redhat.com after 1000000 failures (max=1000000)

Will adjust the priority according to the frequency later on.

Attaching sosreports shortly.
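For automated regression runs, step 3 above can be scripted: the snippet below extracts any "Stopped:" hosts listed under the nfs-grace-clone clone set. This is only a sketch that parses the human-readable `pcs status` text (not a stable interface; `pcs status xml` would be more robust), shown here against captured sample output with placeholder hostnames; in practice you would pipe live `pcs status` output in instead.

```shell
# Sample of the relevant pcs status fragment (placeholder node names).
pcs_output='Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ node1 node2 node3 ]
     Stopped: [ node4 ]'

# Find the nfs-grace-clone section, then print the hosts inside its
# "Stopped: [ ... ]" brackets; stop if the next resource section begins.
stopped=$(printf '%s\n' "$pcs_output" \
  | awk '/Clone Set: nfs-grace-clone/{inset=1; next}
         inset && (/Clone Set:/ || /Resource Group:/){exit}
         inset && /Stopped:/{gsub(/.*\[ | \].*/,""); print; exit}')

echo "stopped nodes: $stopped"   # -> stopped nodes: node4
```

A non-empty result means the bug reproduced and the run can be flagged before proceeding to the reboot step.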
Not setting the blocker flag as of now, since the issue has been seen only once. I tried reproducing it 4-5 times but was not able to hit it again. Going forward, if I hit the same issue during regression testing, I will mark this bug as a blocker.