Bug 1466258 - [GANESHA] After enabling ganesha, one node goes to Stopped state in nfs-grace-clone [nfs-grace]
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: common-ha
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Kaleb KEITHLEY
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-29 11:10 UTC by Manisha Saini
Modified: 2018-11-19 09:04 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-19 09:04:22 UTC
Embargoed:


Attachments

Description Manisha Saini 2017-06-29 11:10:05 UTC
Description of problem:

While disabling and re-enabling ganesha, one node goes to Stopped state in nfs-grace-clone [nfs-grace], which in turn prevents that node from entering the grace period during a node reboot.
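
To confirm which node the nfs-grace clone is stopped on, its state can be inspected directly. A minimal sketch using standard pcs 0.9 commands (the resource name nfs-grace-clone is taken from the pcs status output further below):

# pcs resource show nfs-grace-clone    # clone configuration (pcs 0.9 syntax)
# pcs status resources                 # runtime state of all resources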

Version-Release number of selected component (if applicable):

# rpm -qa | grep ganesha
nfs-ganesha-gluster-2.4.4-10.el7rhgs.x86_64
nfs-ganesha-2.4.4-10.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.4.4-10.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-31.el7rhgs.x86_64

# rpm -qa | grep pacemaker
pacemaker-1.1.16-12.el7.x86_64
pacemaker-libs-1.1.16-12.el7.x86_64
pacemaker-cluster-libs-1.1.16-12.el7.x86_64
pacemaker-cli-1.1.16-12.el7.x86_64

# rpm -qa | grep pcs
pcs-0.9.158-6.el7.x86_64


How reproducible:
Seen once so far; reporting the first instance.

Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Perform an enable and disable of ganesha (gluster nfs-ganesha enable/disable); see the sketch after this list.
3. Check pcs status.
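
For reference, step 2 amounts to the following commands, run from any cluster node. This is a sketch using the standard gluster and pcs CLIs named above:

# gluster nfs-ganesha disable
# gluster nfs-ganesha enable
# pcs status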

Actual results:
One node goes to Stopped state in nfs-grace-clone [nfs-grace].

Expected results:
No node should be in Stopped state in nfs-grace-clone.

Additional info:


[root@dhcp42-125 exports]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp42-125.lab.eng.blr.redhat.com (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Jun 29 16:21:17 2017
Last change: Thu Jun 29 15:59:28 2017 by root via crm_attribute on dhcp42-125.lab.eng.blr.redhat.com

4 nodes configured
24 resources configured

Online: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-125.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-125.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-125.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp42-119.lab.eng.blr.redhat.com dhcp42-127.lab.eng.blr.redhat.com dhcp42-129.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp42-125.lab.eng.blr.redhat.com ]
 Resource Group: dhcp42-125.lab.eng.blr.redhat.com-group
     dhcp42-125.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-125.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-125.lab.eng.blr.redhat.com
     dhcp42-125.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-125.lab.eng.blr.redhat.com
 Resource Group: dhcp42-127.lab.eng.blr.redhat.com-group
     dhcp42-127.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-127.lab.eng.blr.redhat.com
     dhcp42-127.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-127.lab.eng.blr.redhat.com
     dhcp42-127.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-127.lab.eng.blr.redhat.com
 Resource Group: dhcp42-129.lab.eng.blr.redhat.com-group
     dhcp42-129.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-129.lab.eng.blr.redhat.com
     dhcp42-129.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-129.lab.eng.blr.redhat.com
     dhcp42-129.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-129.lab.eng.blr.redhat.com
 Resource Group: dhcp42-119.lab.eng.blr.redhat.com-group
     dhcp42-119.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp42-119.lab.eng.blr.redhat.com
     dhcp42-119.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-119.lab.eng.blr.redhat.com
     dhcp42-119.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp42-119.lab.eng.blr.redhat.com

Failed Actions:
* nfs-mon_monitor_10000 on dhcp42-125.lab.eng.blr.redhat.com 'unknown error' (1): call=132, status=Timed Out, exitreason='none',
    last-rc-change='Thu Jun 29 16:20:49 2017', queued=0ms, exec=0ms
* nfs-grace_start_0 on dhcp42-125.lab.eng.blr.redhat.com 'not running' (7): call=19, status=complete, exitreason='none',
    last-rc-change='Thu Jun 29 15:59:10 2017', queued=0ms, exec=5045ms
* nfs-mon_monitor_10000 on dhcp42-127.lab.eng.blr.redhat.com 'unknown error' (1): call=129, status=Timed Out, exitreason='none',
    last-rc-change='Thu Jun 29 16:20:49 2017', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on dhcp42-129.lab.eng.blr.redhat.com 'unknown error' (1): call=129, status=Timed Out, exitreason='none',
    last-rc-change='Thu Jun 29 16:20:49 2017', queued=0ms, exec=0ms
* nfs-mon_monitor_10000 on dhcp42-119.lab.eng.blr.redhat.com 'unknown error' (1): call=133, status=Timed Out, exitreason='none',
    last-rc-change='Thu Jun 29 16:20:49 2017', queued=0ms, exec=0ms


Daemon Status:
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled


The following messages are observed in /var/log/messages:

Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op start for nfs-grace:0 on dhcp42-125.lab.eng.blr.redhat.com: not running (7)
Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op monitor for nfs-mon:0 on dhcp42-125.lab.eng.blr.redhat.com: unknown error (1)
Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op monitor for nfs-mon:1 on dhcp42-127.lab.eng.blr.redhat.com: unknown error (1)
Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op monitor for nfs-mon:2 on dhcp42-129.lab.eng.blr.redhat.com: unknown error (1)
Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op monitor for nfs-grace:2 on dhcp42-119.lab.eng.blr.redhat.com: not running (7)
Jun 29 16:34:14 localhost pengine[23577]: warning: Processing failed op monitor for nfs-mon:3 on dhcp42-119.lab.eng.blr.redhat.com: unknown error (1)
Jun 29 16:34:14 localhost pengine[23577]: warning: Forcing nfs-grace-clone away from dhcp42-125.lab.eng.blr.redhat.com after 1000000 failures (max=1000000)
Jun 29 16:34:14 localhost pengine[23577]: warning: Forcing nfs-grace-clone away from dhcp42-125.lab.eng.blr.redhat.com after 1000000 failures (max=1000000)
Jun 29 16:34:14 localhost pengine[23577]: warning: Forcing nfs-grace-clone away from dhcp42-125.lab.eng.blr.redhat.com after 1000000 failures (max=1000000)
Jun 29 16:34:14 localhost pengine[23577]: warning: Forcing nfs-grace-clone away from dhcp42-125.lab.eng.blr.redhat.com after 1000000 failures (max=1000000)
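
The "after 1000000 failures (max=1000000)" warnings indicate that the failcount for nfs-grace on dhcp42-125 has reached INFINITY, so Pacemaker bans the clone from that node until its failure history is cleared. A possible recovery sketch using standard pcs commands (not verified against this cluster):

# pcs resource failcount show nfs-grace    # per-node failcount (pcs 0.9 syntax)
# pcs resource cleanup nfs-grace-clone     # clear failure history and retry the clone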


Will adjust the priority later based on how frequently the issue recurs.

Attaching sosreports shortly

Comment 3 Manisha Saini 2017-07-05 10:55:08 UTC
Not setting the blocker flag as of now since the issue has been seen only once. I tried reproducing it 4-5 times and was not able to hit it again.

Going forward in regression testing, if I hit the same issue I will mark this bug as a blocker.

