Description of problem:
The "gluster features.ganesha enable" CLI is used to set up the pcs cluster for nfs-ganesha and bring up the nfs-ganesha process. This time I am trying things out with a 4-node glusterfs cluster. All four nodes are supposed to be part of the nfs-ganesha cluster as well, so all four nodes should have the nfs-ganesha process running once the CLI command completes. However, nfs-ganesha does not come up on all nodes every time.

Here are the logs of the issue from the latest execution:

[root@nfs1 ~]# gluster features.ganesha enable
Enabling NFS-Ganesha requires Gluster-NFS to bedisabled across the trusted pool. Do you still want to continue?
 (y/n) y
Error : Request timed out

node 1,
#####################################
[root@nfs1 ~]# ps -eaf | grep nfs
root      5338  6760  0 14:57 pts/0    00:00:00 grep nfs
[root@nfs1 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 14:58:03 2015
Last change: Mon Apr 20 12:28:04 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs2-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs2
 nfs3-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms

node 2,
##########################################
[root@nfs2 ~]# ps -eaf | grep nfs
root      5260 16826  0 14:58 pts/0    00:00:00 grep nfs
root      6216     1  0 12:27 ?        00:00:05 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid
[root@nfs2 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 14:58:49 2015
Last change: Mon Apr 20 12:28:04 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs2-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs2
 nfs3-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms

node 3,
#############################################
[root@nfs3 ~]# ps -eaf | grep nfs
root     20901 18085  0 14:59 pts/0    00:00:00 grep nfs
root     26369     1  0 12:27 ?        00:00:05 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid
[root@nfs3 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 14:59:22 2015
Last change: Mon Apr 20 12:28:04 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs2-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs2
 nfs3-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms

node 4,
######################################
[root@nfs4 ~]# ps -eaf | grep nfs
root     16073 27004  0 04:12 pts/0    00:00:00 grep nfs
[root@nfs4 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 04:13:00 2015
Last change: Mon Apr 20 01:41:11 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs2-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs2
 nfs3-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-cluster_ip-1  (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1  (ocf::heartbeat:Dummy):  Started nfs3
 nfs4-dead_ip-1     (ocf::heartbeat:Dummy):  Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out, last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms

Version-Release number of selected component (if applicable):
glusterfs-3.7dev-0.1017.git7fb85e3.el6.x86_64
nfs-ganesha-2.2-0.rc8.el6.x86_64

How reproducible:
Most of the time.

Expected results:
nfs-ganesha is supposed to come up on all nodes.

Additional info:
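The failure pattern above (three FAILED nfs_start clone instances, one stopped) can be detected mechanically. A minimal sketch that greps a trimmed, hard-coded sample of the `pcs status` output above, so it runs without a live cluster; in real use the sample would be replaced by `$(pcs status)`:

```shell
#!/bin/sh
# Count FAILED ganesha_nfsd clone instances in pcs status output.
# PCS_OUTPUT holds a trimmed sample of the output shown above.
PCS_OUTPUT=' Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]'

# grep -c counts the matching lines, i.e. the failed clone instances
failed=$(printf '%s\n' "$PCS_OUTPUT" | grep -c 'ganesha_nfsd):  FAILED')
echo "failed nfs_start instances: $failed"
```

With the sample above this reports 3, matching the three unmanaged FAILED instances on nfs1, nfs2 and nfs3.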
Created attachment 1016314 [details] sosreport of node1
Created attachment 1016315 [details] sosreport of node2
Created attachment 1016316 [details] sosreport of node3
Created attachment 1016320 [details] sosreport of node4
Putting the response here as per the conversation we had yesterday:

"1. In general, there is the issue with rpcbind which we have seen yesterday. For that there is a 7.1 BZ which unfortunately has not been opened for 6.6/6.x. I am not sure if it is too late for that but we need to see. I sent a quick note to Sayan to see what he has to say. Will need to wait for his inputs.

But in general there is the ugly workaround of removing the state associated with rpcbind (especially as it seems to be started with rpcbind -w):
- on RHEL 7.x we need to remove the rpcbind.socket file which contains the info
- on RHEL 6.6, after digging around on Saurabh's machine, I believe the files to delete are /var/cache/rpcbind/* (there are 2 .xdr files in there)

If nothing else, this should serve as a possible workaround for us (ugly as it is).

2. ganesha repeatedly fails to start on nfs1, and I now think the reason is that someone is still listening on port 2049. I think the issue happened after the reboots yesterday, when someone listening on port 2049 prevented ganesha from binding to tcp 2049.
- So first I removed the existing /var/run/ganesha pid.
- Next I cleared the rpcbind state with: rm -rf /var/cache/rpcbind/*

What I see with the useful -tulpn option is:

[root@nfs1 ~]# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address      Foreign Address    State    PID/Program name
tcp   0      0      0.0.0.0:853        0.0.0.0:*          LISTEN   1518/glusterfs
tcp   0      0      0.0.0.0:22         0.0.0.0:*          LISTEN   1918/sshd
tcp   0      0      127.0.0.1:25       0.0.0.0:*          LISTEN   2027/master
tcp   0      0      0.0.0.0:59101      0.0.0.0:*          LISTEN   1596/rpc.statd
tcp   0      0      0.0.0.0:2049       0.0.0.0:*          LISTEN   1518/glusterfs
tcp   0      0      0.0.0.0:38465      0.0.0.0:*          LISTEN   1518/glusterfs
tcp   0      0      0.0.0.0:5666       0.0.0.0:*          LISTEN   1940/nrpe

3. I then did "nfs.disable on" on both the volumes, share-vol0 and vol0.

4. [root@nfs1 ~]# netstat -an | grep 2049
[root@nfs1 ~]#

5. ganesha then starts successfully:

[root@nfs1 ~]# service nfs-ganesha start
Starting ganesha.nfsd:                                     [  OK  ]
[root@nfs1 ~]#
[root@nfs1 ~]# ps auxw | grep ganesha
root      3478  0.5  0.1 1496316 8532 ?        Ssl  00:05   0:00 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid
root      3521  0.0  0.0 103252   820 pts/1    S+   00:05   0:00 grep ganesha

Anand"
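The cleanup sequence described above can be collected into one script. This is a dry-run sketch: it only prints each command so the steps can be reviewed before execution, uses the RHEL 6.6 rpcbind path from the comment (on RHEL 7.x the rpcbind.socket file would be removed instead), and assumes the volume name vol0 mentioned in the comment:

```shell
#!/bin/sh
# Dry-run sketch of the workaround above. Drop the `echo`
# in run() to actually execute the commands.
run() { echo "would run: $*"; }

run rm -f /var/run/ganesha.nfsd.pid          # remove stale ganesha pid file
run rm -rf "/var/cache/rpcbind/*"            # clear stale rpcbind state (.xdr files)
run gluster volume set vol0 nfs.disable on   # free tcp/2049 held by Gluster-NFS
run netstat -an                              # then confirm nothing listens on 2049
run service nfs-ganesha start                # ganesha should now bind tcp/2049
```

Each `would run:` line corresponds to one numbered step in the comment; the `nfs.disable` step is what frees port 2049 from the glusterfs process seen in the netstat output.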
So I did things as per comment 5 and executed the CLI "gluster features.ganesha enable" after the cleanup. The nfs-ganesha process came up on all nodes, but the pcs cluster still didn't come up.

logs,

[root@nfs1 ~]# gluster features.ganesha enable
Enabling NFS-Ganesha requires Gluster-NFS to bedisabled across the trusted pool. Do you still want to continue?
 (y/n) y
ganesha enable : success
[root@nfs1 ~]# pcs status
Error: cluster is not currently running on this node

The status is the same on all nodes.
Soumya and I logged into the machines and found a few issues.

There has been a change in the prerequisites with the introduction of the common meta-volume. The user has to create a volume called "gluster_shared_storage" and mount it on /var/run/gluster/shared_storage. It has to be mounted before running the command/script.

Also, pcsd hadn't started on all the machines. pcsd has to be started on all the nodes.

Also, the HA_CONFIG was wrongly populated: HA_VOL_SERVER should be set to the IP of the server, but the name of the shared volume was given instead.
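A minimal pre-flight sketch for the shared-storage prerequisite above. The check takes the contents of /proc/mounts as an argument so it can be exercised here with a fabricated mount line (the sample entry below is an illustration, not taken from the machines); in real use it would be fed `$(cat /proc/mounts)`:

```shell
#!/bin/sh
# Check that the shared meta-volume is mounted where the
# command/script expects it, per the comment above.
SHARED_MNT="/var/run/gluster/shared_storage"

check_shared_mount() {
    # $1: contents of /proc/mounts, passed in so the check is testable
    if echo "$1" | grep -q " $SHARED_MNT "; then
        echo "shared storage mounted"
    else
        echo "MISSING: create gluster_shared_storage and mount it on $SHARED_MNT"
    fi
}

# Example with a fabricated mount line:
check_shared_mount "nfs1:/gluster_shared_storage $SHARED_MNT fuse.glusterfs rw 0 0"
# Real use: check_shared_mount "$(cat /proc/mounts)"
```

A similar per-node check would confirm pcsd is running (e.g. via `service pcsd status`) before the enable command is attempted.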
We have also found a bug in the ganesha-ha.sh script. The fixes are minor; we will make them now.
(In reply to Meghana from comment #7) > Soumya and I logged into the machines and found a few issues. > > There has been a change in the pre-requisites by the introduction of common > meta-volume. The user has to create a volume called "gluster_shared_storage" > and mount it on /var/run/gluster/shared_storage. It has to be mounted before > running the command/script. > > Also, pcsd hasn't started on all the machines. > pcsd has to be started on all the nodes. > > Also, the HA_CONFIG was wrongly populated. HA_VOL_SERVER="IP of the server" > The name of the shared volume was given instead. Alright, I was not updated about this change.
REVIEW: http://review.gluster.org/10336 (NFS-Ganesha: Shared volume need not be mounted via script) posted (#1) for review on master by Meghana M (mmadhusu)
Works in 3.7.