Description of problem:
The "gluster features.ganesha enable" CLI is used to set up the pcs cluster for nfs-ganesha and bring up the nfs-ganesha process. This time I am trying things out with a 4-node glusterfs cluster. All four nodes are supposed to be part of the nfs-ganesha cluster as well, so all four nodes should have the nfs-ganesha process running once the CLI command completes. However, nfs-ganesha does not come up on all nodes every time.

Here are the logs of the issue from the latest execution:

[root@nfs1 ~]# gluster features.ganesha enable
Enabling NFS-Ganesha requires Gluster-NFS to bedisabled across the trusted pool. Do you still want to continue?
 (y/n) y
Error : Request timed out

node 1,
#####################################
[root@nfs1 ~]# ps -eaf | grep nfs
root      5338  6760  0 14:57 pts/0    00:00:00 grep nfs
[root@nfs1 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 14:58:03 2015
Last change: Mon Apr 20 12:28:04 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs2-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs2
 nfs3-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms

node 2,
##########################################
[root@nfs2 ~]# ps -eaf | grep nfs
root      5260 16826  0 14:58 pts/0    00:00:00 grep nfs
root      6216     1  0 12:27 ?        00:00:05 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid
[root@nfs2 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 14:58:49 2015
Last change: Mon Apr 20 12:28:04 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs2-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs2
 nfs3-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms

node 3,
#############################################
[root@nfs3 ~]# ps -eaf | grep nfs
root     20901 18085  0 14:59 pts/0    00:00:00 grep nfs
root     26369     1  0 12:27 ?        00:00:05 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid
[root@nfs3 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 14:59:22 2015
Last change: Mon Apr 20 12:28:04 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs2-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs2
 nfs3-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms

node 4,
######################################
[root@nfs4 ~]# ps -eaf | grep nfs
root     16073 27004  0 04:12 pts/0    00:00:00 grep nfs
[root@nfs4 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 04:13:00 2015
Last change: Mon Apr 20 01:41:11 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs2-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs2
 nfs3-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms

Version-Release number of selected component (if applicable):
glusterfs-3.7dev-0.1017.git7fb85e3.el6.x86_64
nfs-ganesha-2.2-0.rc8.el6.x86_64

How reproducible:
Most of the time.

Expected results:
nfs-ganesha is supposed to come up on all nodes.

Additional info:
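The failure pattern above (three FAILED nfs_start clone instances, one stopped) can be detected mechanically. A minimal sketch that greps a trimmed, hard-coded sample of the `pcs status` output above, so it runs without a live cluster; in real use the sample would be replaced by `$(pcs status)`:

```shell
#!/bin/sh
# Count FAILED ganesha_nfsd clone instances in pcs status output.
# PCS_OUTPUT holds a trimmed sample of the output shown above.
PCS_OUTPUT=' Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]'

# grep -c counts the matching lines, i.e. the failed clone instances
failed=$(printf '%s\n' "$PCS_OUTPUT" | grep -c 'ganesha_nfsd):  FAILED')
echo "failed nfs_start instances: $failed"
```

With the sample above this reports 3, matching the three unmanaged FAILED instances on nfs1, nfs2 and nfs3.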
Created attachment 1016314 [details] sosreport of node1
Created attachment 1016315 [details] sosreport of node2
Created attachment 1016316 [details] sosreport of node3
Created attachment 1016320 [details] sosreport of node4
Putting the response here as per the conversation we had yesterday:

"1. In general, there is the issue with rpcbind which we have seen yesterday. For that there is a 7.1 BZ which unfortunately has not been opened for 6.6/6.x. I am not sure if it is too late for that but we need to see. I sent a quick note to Sayan to see what he has to say. Will need to wait for his inputs.

But in general there is the ugly workaround of removing the state associated with rpcbind (especially as it seems to be started with rpcbind -w):
- on RHEL 7.x we need to remove the rpcbind.socket file which contains the info
- on RHEL 6.6, after digging around on Saurabh's machine, I believe the files to delete are /var/cache/rpcbind/* (there are 2 .xdr files in there)

If nothing else, this should serve as a possible workaround for us (ugly as it is).

2. ganesha repeatedly fails to start on nfs1, and I now think the reason is that someone is still listening on port 2049. I think the issue happened after the reboots yesterday, when someone listening on port 2049 prevented ganesha from binding to tcp 2049.
- So first I removed the existing /var/run/ganesha pid.
- Next I cleared the rpcbind state with: rm -rf /var/cache/rpcbind/*

What I see with the useful -tulpn option is:

[root@nfs1 ~]# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address      Foreign Address    State    PID/Program name
tcp   0      0      0.0.0.0:853        0.0.0.0:*          LISTEN   1518/glusterfs
tcp   0      0      0.0.0.0:22         0.0.0.0:*          LISTEN   1918/sshd
tcp   0      0      127.0.0.1:25       0.0.0.0:*          LISTEN   2027/master
tcp   0      0      0.0.0.0:59101      0.0.0.0:*          LISTEN   1596/rpc.statd
tcp   0      0      0.0.0.0:2049       0.0.0.0:*          LISTEN   1518/glusterfs
tcp   0      0      0.0.0.0:38465      0.0.0.0:*          LISTEN   1518/glusterfs
tcp   0      0      0.0.0.0:5666       0.0.0.0:*          LISTEN   1940/nrpe

3. I then did "nfs.disable on" on both the volumes, share-vol0 and vol0.

4. [root@nfs1 ~]# netstat -an | grep 2049
[root@nfs1 ~]#

5. ganesha then starts successfully:

[root@nfs1 ~]# service nfs-ganesha start
Starting ganesha.nfsd:                                     [  OK  ]
[root@nfs1 ~]#
[root@nfs1 ~]# ps auxw | grep ganesha
root      3478  0.5  0.1 1496316 8532 ?        Ssl  00:05   0:00 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid
root      3521  0.0  0.0 103252   820 pts/1    S+   00:05   0:00 grep ganesha

Anand"
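The cleanup sequence described above can be collected into one script. This is a dry-run sketch: it only prints each command so the steps can be reviewed before execution, uses the RHEL 6.6 rpcbind path from the comment (on RHEL 7.x the rpcbind.socket file would be removed instead), and assumes the volume name vol0 mentioned in the comment:

```shell
#!/bin/sh
# Dry-run sketch of the workaround above. Drop the `echo`
# in run() to actually execute the commands.
run() { echo "would run: $*"; }

run rm -f /var/run/ganesha.nfsd.pid          # remove stale ganesha pid file
run rm -rf "/var/cache/rpcbind/*"            # clear stale rpcbind state (.xdr files)
run gluster volume set vol0 nfs.disable on   # free tcp/2049 held by Gluster-NFS
run netstat -an                              # then confirm nothing listens on 2049
run service nfs-ganesha start                # ganesha should now bind tcp/2049
```

Each `would run:` line corresponds to one numbered step in the comment; the `nfs.disable` step is what frees port 2049 from the glusterfs process seen in the netstat output.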
So I did things as per comment 5 and executed the CLI "gluster features.ganesha enable" after the cleanup. The nfs-ganesha process came up on all nodes, but the pcs cluster still didn't come up.

logs,

[root@nfs1 ~]# gluster features.ganesha enable
Enabling NFS-Ganesha requires Gluster-NFS to bedisabled across the trusted pool. Do you still want to continue?
 (y/n) y
ganesha enable : success
[root@nfs1 ~]# pcs status
Error: cluster is not currently running on this node

The status is the same on all nodes.
Soumya and I logged into the machines and found a few issues.

There has been a change in the prerequisites with the introduction of the common meta-volume. The user has to create a volume called "gluster_shared_storage" and mount it on /var/run/gluster/shared_storage. It has to be mounted before running the command/script.

Also, pcsd hadn't started on all the machines. pcsd has to be started on all the nodes.

Also, the HA_CONFIG was wrongly populated: HA_VOL_SERVER should be set to the IP of the server, but the name of the shared volume was given instead.
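A minimal pre-flight sketch for the shared-storage prerequisite above. The check takes the contents of /proc/mounts as an argument so it can be exercised here with a fabricated mount line (the sample entry below is an illustration, not taken from the machines); in real use it would be fed `$(cat /proc/mounts)`:

```shell
#!/bin/sh
# Check that the shared meta-volume is mounted where the
# command/script expects it, per the comment above.
SHARED_MNT="/var/run/gluster/shared_storage"

check_shared_mount() {
    # $1: contents of /proc/mounts, passed in so the check is testable
    if echo "$1" | grep -q " $SHARED_MNT "; then
        echo "shared storage mounted"
    else
        echo "MISSING: create gluster_shared_storage and mount it on $SHARED_MNT"
    fi
}

# Example with a fabricated mount line:
check_shared_mount "nfs1:/gluster_shared_storage $SHARED_MNT fuse.glusterfs rw 0 0"
# Real use: check_shared_mount "$(cat /proc/mounts)"
```

A similar per-node check would confirm pcsd is running (e.g. via `service pcsd status`) before the enable command is attempted.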
We have also found a bug in the ganesha-ha.sh script. The fixes are minor; we will make them now.
(In reply to Meghana from comment #7) > Soumya and I logged into the machines and found a few issues. > > There has been a change in the pre-requisites by the introduction of common > meta-volume. The user has to create a volume called "gluster_shared_storage" > and mount it on /var/run/gluster/shared_storage. It has to be mounted before > running the command/script. > > Also, pcsd hasn't started on all the machines. > pcsd has to be started on all the nodes. > > Also, the HA_CONFIG was wrongly populated. HA_VOL_SERVER="IP of the server" > The name of the shared volume was given instead. Alright, I was not updated about this change.
REVIEW: http://review.gluster.org/10336 (NFS-Ganesha: Shared volume need not be mounted via script) posted (#1) for review on master by Meghana M (mmadhusu)
Works in 3.7.