Description of problem:
As per the HA functionality, I/O should resume after the grace period. We are now providing the nfs-ganesha cluster with HA functionality, so in-flight I/O should resume even when the nfs-ganesha process is killed: once the grace period completes, the failover to another node should have taken place. On the present setup this resumption does not happen, and that is the problem.

Version-Release number of selected component (if applicable):
nfs-ganesha-2.2-0.rc8.el6.x86_64
glusterfs-3.7dev-0.1017.git7fb85e3.el6.x86_64

How reproducible:
Tried HA for the first time.

Steps to Reproduce:
1. Do the cluster setup for nfs-ganesha, as per the guidelines.
2. Once nfs-ganesha is up, mount the volume on a client.
3. Start iozone.
4. Kill the nfs-ganesha process on the server node.

Actual results:
Step 1: cluster setup done; nfs-ganesha came up on only 2 of the four nodes.
Step 2: mount done using vers=4.
Step 3: iozone started.
Step 4: killed the nfs-ganesha process with the kill command; iozone is stuck on the mount point and does not move ahead.

Expected results:
I/O should move ahead, as the HA functionality should allow failover of the nfs-ganesha process to the other node.

Additional info:
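The reproduction can be sketched as a shell session. This is a sketch, not the exact commands used: the VIP, volume name /vol0, and mount point are taken from this setup's outputs below, and the pid-file path is the one visible on the ganesha.nfsd command line.

```shell
# On the client: mount the volume over NFSv4 via one of the cluster VIPs
mount -t nfs -o vers=4 10.70.36.217:/vol0 /mnt/nfs

# Start I/O on the mount point (iozone automatic mode, as in step 3)
cd /mnt/nfs && iozone -a &

# On the server currently holding the VIP: kill the nfs-ganesha process (step 4)
kill -9 "$(cat /var/run/ganesha.nfsd.pid)"

# Expected: after the grace period the VIP fails over to another node and
# iozone resumes. Observed: iozone stays stuck on the mount point.
```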
Need the following information:
1. showmount -e VIP output
2. NFS-Ganesha logs
3. pcs status output
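The requested information could be collected on each node with something like the following. The log path is an assumption taken from the -L option on the ganesha.nfsd command line in the ps output below; the VIP is one of the addresses from this setup.

```shell
# 1. Export list as seen through a cluster VIP
showmount -e 10.70.36.217

# 2. Recent NFS-Ganesha log entries (path from the -L option of ganesha.nfsd)
tail -n 200 /var/log/ganesha.log

# 3. Pacemaker cluster and resource state
pcs status
```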
I have four nodes, nfs1 through nfs4. nfs-ganesha came up only on nfs2 and nfs3, and I have now killed the nfs-ganesha process on nfs2, so I collected the showmount output from nfs3:

[root@nfs3 ~]# showmount -e 10.70.36.217
Export list for 10.70.36.217:
/vol0 (everyone)
[root@nfs3 ~]# showmount -e 10.70.36.218
Export list for 10.70.36.218:
/vol0 (everyone)
[root@nfs3 ~]# showmount -e 10.70.36.219
Export list for 10.70.36.219:
/vol0 (everyone)
[root@nfs3 ~]# showmount -e 10.70.36.220
Export list for 10.70.36.220:
/vol0 (everyone)

node 1,
#####################################
[root@nfs1 ~]# ps -eaf | grep nfs
root      5338  6760  0 14:57 pts/0    00:00:00 grep nfs
[root@nfs1 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 14:58:03 2015
Last change: Mon Apr 20 12:28:04 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs2-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs2
 nfs3-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs4-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs4-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms

node 2,
##########################################
[root@nfs2 ~]# ps -eaf | grep nfs
root      5260 16826  0 14:58 pts/0    00:00:00 grep nfs
root      6216     1  0 12:27 ?        00:00:05 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid
[root@nfs2 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 14:58:49 2015
Last change: Mon Apr 20 12:28:04 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs2-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs2
 nfs3-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs4-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs4-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms

node 3,
#############################################
[root@nfs3 ~]# ps -eaf | grep nfs
root     20901 18085  0 14:59 pts/0    00:00:00 grep nfs
root     26369     1  0 12:27 ?        00:00:05 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid
[root@nfs3 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 14:59:22 2015
Last change: Mon Apr 20 12:28:04 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs2-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs2
 nfs3-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs4-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs4-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms

node 4,
######################################
[root@nfs4 ~]# ps -eaf | grep nfs
root     16073 27004  0 04:12 pts/0    00:00:00 grep nfs
[root@nfs4 ~]# pcs status
Cluster name: ganesha-ha-2
Last updated: Mon Apr 20 04:13:00 2015
Last change: Mon Apr 20 01:41:11 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
22 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs_start-clone [nfs_start]
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs3 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs1 (unmanaged)
     nfs_start  (ocf::heartbeat:ganesha_nfsd):  FAILED nfs2 (unmanaged)
     Stopped: [ nfs4 ]
 nfs1-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs4
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs1-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs2-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs2
 nfs2-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs2
 nfs3-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs3-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs4-cluster_ip-1 (ocf::heartbeat:IPaddr): Started nfs3
 nfs4-trigger_ip-1 (ocf::heartbeat:Dummy): Started nfs3
 nfs4-dead_ip-1 (ocf::heartbeat:Dummy): Started nfs1

Failed actions:
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs3 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs1 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40001ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
    nfs_start_stop_0 on nfs2 'unknown error' (1): call=20, status=Timed Out,
        last-rc-change='Mon Apr 20 12:27:09 2015', queued=0ms, exec=40002ms
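As a side note, the repeated nfs_start_stop_0 timeouts above leave the nfs_start clone FAILED and unmanaged on every node. Once the underlying start/stop timeout is understood, the recorded failures can be cleared so Pacemaker re-attempts the resource. A minimal sketch, using the resource name from the pcs output above:

```shell
# Clear the failure history for the ganesha clone on all nodes so
# Pacemaker retries starting it, then re-check the cluster state.
pcs resource cleanup nfs_start-clone
pcs status
```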
Created attachment 1016358 [details] nfs-ganesha logs from nfs2
Created attachment 1016359 [details] nfs-ganesha logs from nfs3
Saurabh, can you confirm that you used the server's VIP to mount the volume on the client? Without it, failover will invariably fail.
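For reference, the distinction is between mounting through a floating cluster VIP, which the IPaddr resources can move to a surviving node on failover, and a node's static IP, which cannot move. A sketch, using a VIP and export from this setup (the mount point is a placeholder):

```shell
# HA-capable: mount via a cluster VIP managed by an IPaddr resource;
# on failover the VIP migrates and client I/O can resume after grace.
mount -t nfs -o vers=4 10.70.36.217:/vol0 /mnt/nfs

# Not HA-capable: mounting via a node's fixed address pins the client
# to that node, so killing ganesha there leaves the mount hung.
```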
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days