Description of problem:

I/O does not move ahead even though, per "pcs status", the nfs-ganesha process has failed over to another node. The same volume (say vol2) was mounted twice on a client using vers=3, with both mounts done via the virtual IPs. iozone was then started on both mount points, and on one of the servers the nfs-ganesha process was stopped. After the cluster's grace period, I/O should have started moving again, but that has not happened in this case.

[root@nfs1 ~]# gluster volume status
Status of volume: gluster_shared_storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.148:/rhs/brick1/d1r1-share   49156     0          Y       3549
Brick 10.70.37.77:/rhs/brick1/d1r2-share    49155     0          Y       3329
Brick 10.70.37.76:/rhs/brick1/d2r1-share    49155     0          Y       3081
Brick 10.70.37.69:/rhs/brick1/d2r2-share    49155     0          Y       3346
Brick 10.70.37.148:/rhs/brick1/d3r1-share   49157     0          Y       3566
Brick 10.70.37.77:/rhs/brick1/d3r2-share    49156     0          Y       3346
Brick 10.70.37.76:/rhs/brick1/d4r1-share    49156     0          Y       3098
Brick 10.70.37.69:/rhs/brick1/d4r2-share    49156     0          Y       3363
Brick 10.70.37.148:/rhs/brick1/d5r1-share   49158     0          Y       3583
Brick 10.70.37.77:/rhs/brick1/d5r2-share    49157     0          Y       3363
Brick 10.70.37.76:/rhs/brick1/d6r1-share    49157     0          Y       3115
Brick 10.70.37.69:/rhs/brick1/d6r2-share    49157     0          Y       3380
Self-heal Daemon on localhost               N/A       N/A        Y       28128
Self-heal Daemon on 10.70.37.69             N/A       N/A        Y       30533
Self-heal Daemon on 10.70.37.77             N/A       N/A        Y       16037
Self-heal Daemon on 10.70.37.76             N/A       N/A        Y       6128

Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: vol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.148:/rhs/brick1/d1r1         49153     0          Y       28060
Brick 10.70.37.77:/rhs/brick1/d1r2          49152     0          Y       15975
Brick 10.70.37.76:/rhs/brick1/d2r1          49152     0          Y       6068
Brick 10.70.37.69:/rhs/brick1/d2r2          49152     0          Y       30472
Brick 10.70.37.148:/rhs/brick1/d3r1         49154     0          Y       28077
Brick 10.70.37.77:/rhs/brick1/d3r2          49153     0          Y       15992
Brick 10.70.37.76:/rhs/brick1/d4r1          49153     0          Y       6085
Brick 10.70.37.69:/rhs/brick1/d4r2          49153     0          Y       30489
Brick 10.70.37.148:/rhs/brick1/d5r1         49155     0          Y       28094
Brick 10.70.37.77:/rhs/brick1/d5r2          49154     0          Y       16009
Brick 10.70.37.76:/rhs/brick1/d6r1          49154     0          Y       6102
Brick 10.70.37.69:/rhs/brick1/d6r2          49154     0          Y       30506
Self-heal Daemon on localhost               N/A       N/A        Y       28128
Self-heal Daemon on 10.70.37.69             N/A       N/A        Y       30533
Self-heal Daemon on 10.70.37.77             N/A       N/A        Y       16037
Self-heal Daemon on 10.70.37.76             N/A       N/A        Y       6128

Task Status of Volume vol2
------------------------------------------------------------------------------
There are no active volume tasks

Status of nfs-ganesha on all four nodes:

nfs1
====
root  3790     1  0 May13 ?        00:00:09 /usr/sbin/glusterfs --volfile-server=nfs1 --volfile-id=/gluster_shared_storage /var/run/gluster/shared_storage
---
nfs2
====
root  3300     1  0 May13 ?        00:00:09 /usr/sbin/glusterfs --volfile-server=nfs1 --volfile-id=/gluster_shared_storage /var/run/gluster/shared_storage
root 11003     1  0 May15 ?        00:01:08 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid
---
nfs3
====
root  3577     1  0 May13 ?        00:00:08 /usr/sbin/glusterfs --volfile-server=nfs1 --volfile-id=/gluster_shared_storage /var/run/gluster/shared_storage
root  4195     1  0 May15 ?        00:01:08 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid
---
nfs4
====
root 14760     1  0 May15 ?        00:00:04 /usr/sbin/glusterfs --volfile-server=nfs1 --volfile-id=/gluster_shared_storage /var/run/gluster/shared_storage
root 23970     1  0 May15 ?        00:02:17 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid

pcs status clearly shows that the nfs-ganesha resources running on nfs1 failed over to nfs4:

[root@nfs1 ~]# pcs status
Cluster name: ganesha-ha-360
Last updated: Mon May 18 12:54:12 2015
Last change: Fri May 15 19:25:20 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
17 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs4
 nfs1-trigger_ip-1      (ocf::heartbeat:Dummy):         Started nfs4
 nfs2-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs2
 nfs2-trigger_ip-1      (ocf::heartbeat:Dummy):         Started nfs2
 nfs3-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs3
 nfs3-trigger_ip-1      (ocf::heartbeat:Dummy):         Started nfs3
 nfs4-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs4
 nfs4-trigger_ip-1      (ocf::heartbeat:Dummy):         Started nfs4
 nfs1-dead_ip-1         (ocf::heartbeat:Dummy):         Started nfs1

Version-Release number of selected component (if applicable):
glusterfs-3.7.0beta2-0.0.el6.x86_64
nfs-ganesha-2.2.0-0.el6.x86_64

How reproducible:
Happens on the first attempt itself

Steps to Reproduce (see the shell sketch below):
1. Create a volume of type 6x2 and start it.
2. Start nfs-ganesha on all nodes in consideration, after completing the prerequisites.
3. Mount the volume on a client on two mount points, using different virtual IPs.
4. On one server, execute "service nfs-ganesha stop".
5. Wait for the grace period to finish and let I/O resume.

Actual results:
After step 4, I/O does not resume as expected.

Expected results:
I/O should resume in this case.

Additional info:
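For concreteness, the steps above amount to something like the following minimal shell sketch. The hostnames, VIP addresses, and mount points are placeholders rather than values from this report, and the volume is abbreviated to a single replica pair instead of the 6x2 layout actually used:

# 1. Create and start a replicated volume; a single replica pair stands in
#    here for the 6x2 distribute-replicate volume used in the report.
gluster volume create vol2 replica 2 nfs1:/rhs/brick1/d1r1 nfs2:/rhs/brick1/d1r2
gluster volume start vol2

# 2. Start nfs-ganesha on every node, after the HA prerequisites are in place.
service nfs-ganesha start

# 3. On the client, mount the volume on two mount points via two virtual IPs.
VIP1=10.0.0.1; VIP2=10.0.0.2        # placeholder addresses
mount -t nfs -o vers=3 ${VIP1}:/vol2 /mnt/m1
mount -t nfs -o vers=3 ${VIP2}:/vol2 /mnt/m2

# 4. Start I/O on both mounts, then stop ganesha on the server owning VIP1.
(cd /mnt/m1 && iozone -a) &
(cd /mnt/m2 && iozone -a) &
service nfs-ganesha stop            # run on the server that owns VIP1

# 5. After the grace period, the VIP should have failed over and I/O resumed.
pcs status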
Hi Saurabh,

This is clearly a multi-head case, and the required patch has not been merged in NFS-Ganesha upstream yet:
https://review.gerrithub.io/#/c/228614/

Please try this case again as soon as the patch goes in. Nonetheless, please attach the sosreports to capture the current state for future reference.
Well, I tried to give this test a re-run with a slight modification. Earlier I had two different directories and was executing iozone in both of them separately, from different mounts, as "iozone -a". The modification is that I now give iozone different file names via its option. The result with this modification is that iozone finished on both mount points, but the nfs-ganesha process got killed by itself on two other nodes, which is also a cause for worry.

The present pcs status:

[root@nfs1 ~]# pcs status
Cluster name: ganesha-ha-360
Last updated: Tue May 19 09:39:54 2015
Last change: Mon May 18 20:29:02 2015
Stack: cman
Current DC: nfs1 - partition with quorum
Version: 1.1.11-97629de
4 Nodes configured
19 Resources configured

Online: [ nfs1 nfs2 nfs3 nfs4 ]

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ nfs1 nfs2 nfs3 nfs4 ]
 nfs1-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs2
 nfs1-trigger_ip-1      (ocf::heartbeat:Dummy):         Started nfs2
 nfs2-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs2
 nfs2-trigger_ip-1      (ocf::heartbeat:Dummy):         Started nfs2
 nfs3-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs2
 nfs3-trigger_ip-1      (ocf::heartbeat:Dummy):         Started nfs2
 nfs4-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started nfs2
 nfs4-trigger_ip-1      (ocf::heartbeat:Dummy):         Started nfs2
 nfs1-dead_ip-1         (ocf::heartbeat:Dummy):         Started nfs1
 nfs4-dead_ip-1         (ocf::heartbeat:Dummy):         Started nfs4
 nfs3-dead_ip-1         (ocf::heartbeat:Dummy):         Started nfs3

nfs-ganesha process status on the four nodes:

nfs1
====
Stopped using "kill -s TERM <pid>"
---
nfs2
====
root 20620     1  0 May18 ?        00:00:50 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -p /var/run/ganesha.nfsd.pid
---
nfs3
====
Got killed by itself
---
nfs4
====
Got killed by itself
---
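Concretely, the modification gives each iozone instance its own named test file instead of letting both use the default; the invocations would look something like this (mount points and file names here are hypothetical, since the report does not record the exact paths):

# Modified run: one explicitly named test file per mount point via -f.
iozone -a -f /mnt/m1/ioz.file1 &    # first mount point
iozone -a -f /mnt/m2/ioz.file2 &    # second mount point
wait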
As per /var/log/messages on nfs4:

May 18 20:23:55 nfs4 lrmd[1640]: notice: operation_finished: nfs-mon_monitor_10000:28779:stderr [ Error: Resource does not exist. ]
May 18 20:24:06 nfs4 lrmd[1640]: notice: operation_finished: nfs-mon_monitor_10000:28927:stderr [ Error: Resource does not exist. ]
May 18 20:24:06 nfs4 kernel: ganesha.nfsd[754]: segfault at f8 ip 00007f6606ef31f6 sp 00007f65e19745c0 error 4 in libgfapi.so.0.0.0[7f6606ee9000+1e000]
May 18 20:24:10 nfs4 abrtd: Directory 'ccpp-2015-05-18-20:24:06-737' creation detected
May 18 20:24:10 nfs4 abrt[28944]: Saved core dump of pid 737 (/usr/bin/ganesha.nfsd) to /var/spool/abrt/ccpp-2015-05-18-20:24:06-737 (1279594496 bytes)
May 18 20:24:10 nfs4 abrtd: Package 'nfs-ganesha' isn't signed with proper key
May 18 20:24:10 nfs4 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-05-18-20:24:06-737' exited with 1
May 18 20:24:10 nfs4 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2015-05-18-20:24:06-737'
May 18 20:24:16 nfs4 cibadmin[29085]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin --replace -o configuration -V --xml-pipe
May 18 20:24:16 nfs4 crmd[1643]: notice: process_lrm_event: Operation nfs4-dead_ip-1_monitor_0: not running (node=nfs4, call=194, rc=7, cib-update=128, confirmed=true)
May 18 20:24:17 nfs4 attrd[1641]: notice: attrd_trigger_update: Sending flush op to all hosts for: ganesha-active (<null>)
May 18 20:24:17 nfs4 attrd[1641]: notice: attrd_perform_update: Sent delete 62: node=nfs4, attr=ganesha-active, id=<n/a>, set=(null), section=status
May 18 20:24:17 nfs4 IPaddr(nfs1-cluster_ip-1)[29092]: INFO: IP status = ok, IP_CIP=
May 18 20:24:17 nfs4 crmd[1643]: notice: process_lrm_event: Operation nfs4-dead_ip-1_start_0: ok (node=nfs4, call=199, rc=0, cib-update=129, confirmed=true)
May 18 20:24:17 nfs4 crmd[1643]: notice: process_lrm_event: Operation nfs1-cluster_ip-1_stop_0: ok (node=nfs4, call=196, rc=0, cib-update=130, confirmed=true)
May 18 20:24:17 nfs4 crmd[1643]: notice: process_lrm_event: Operation nfs4-dead_ip-1_monitor_10000: ok (node=nfs4, call=200, rc=0, cib-update=131, confirmed=false)
May 18 20:24:17 nfs4 IPaddr(nfs4-cluster_ip-1)[29093]: INFO: IP status = ok, IP_CIP=
May 18 20:24:17 nfs4 crmd[1643]: notice: process_lrm_event: Operation nfs4-cluster_ip-1_stop_0: ok (node=nfs4, call=198, rc=0, cib-update=132, confirmed=true)
May 18 20:24:17 nfs4 crmd[1643]: notice: process_lrm_event: Operation nfs-grace_stop_0: ok (node=nfs4, call=202, rc=0, cib-update=133, confirmed=true)
May 18 20:24:17 nfs4 crmd[1643]: notice: process_lrm_event: Operation nfs1-trigger_ip-1_stop_0: ok (node=nfs4, call=204, rc=0, cib-update=134, confirmed=true)
May 18 20:24:17 nfs4 crmd[1643]: notice: process_lrm_event: Operation nfs4-trigger_ip-1_stop_0: ok (node=nfs4, call=206, rc=0, cib-update=135, confirmed=true)
May 18 20:24:17 nfs4 crmd[1643]: notice: process_lrm_event: Operation nfs-grace_start_0: ok (node=nfs4, call=207, rc=0, cib-update=136, confirmed=true)
May 18 20:24:17 nfs4 crmd[1643]: notice: process_lrm_event: Operation nfs-grace_monitor_5000: ok (node=nfs4, call=208, rc=0, cib-update=137, confirmed=false)
May 18 20:24:19 nfs4 ntpd[1707]: Deleting interface #41 eth0, 10.70.36.217#123, interface stats: received=0, sent=0, dropped=0, active_time=2551 secs
May 18 20:24:19 nfs4 ntpd[1707]: Deleting interface #40 eth0, 10.70.36.220#123, interface stats: received=0, sent=0, dropped=0, active_time=2731 secs
May 18 20:24:19 nfs4 ntpd[1707]: peers refreshed

And on nfs3:

May 18 20:28:57 nfs3 kernel: ganesha.nfsd[3490]: segfault at 5c0000c0 ip 0000003b278093a0 sp 00007f3a08e88598 error 4 in libpthread-2.12.so[3b27800000+17000]
May 18 20:29:00 nfs3 abrtd: Directory 'ccpp-2015-05-18-20:28:57-3477' creation detected
May 18 20:29:00 nfs3 abrt[27506]: Saved core dump of pid 3477 (/usr/bin/ganesha.nfsd) to /var/spool/abrt/ccpp-2015-05-18-20:28:57-3477 (837488640 bytes)
May 18 20:29:00 nfs3 abrtd: Package 'nfs-ganesha' isn't signed with proper key
May 18 20:29:00 nfs3 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-05-18-20:28:57-3477' exited with 1
May 18 20:29:00 nfs3 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2015-05-18-20:28:57-3477'
May 18 20:29:02 nfs3 cibadmin[27553]: notice: crm_log_args: Invoked: /usr/sbin/cibadmin --replace -o configuration -V --xml-pipe
May 18 20:29:02 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs3-dead_ip-1_monitor_0: not running (node=nfs3, call=186, rc=7, cib-update=121, confirmed=true)
May 18 20:29:03 nfs3 attrd[5605]: notice: attrd_trigger_update: Sending flush op to all hosts for: ganesha-active (<null>)
May 18 20:29:03 nfs3 attrd[5605]: notice: attrd_perform_update: Sent delete 77: node=nfs3, attr=ganesha-active, id=<n/a>, set=(null), section=status
May 18 20:29:03 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs3-dead_ip-1_start_0: ok (node=nfs3, call=193, rc=0, cib-update=122, confirmed=true)
May 18 20:29:03 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs3-dead_ip-1_monitor_10000: ok (node=nfs3, call=194, rc=0, cib-update=123, confirmed=false)
May 18 20:29:03 nfs3 IPaddr(nfs3-cluster_ip-1)[27561]: INFO: IP status = ok, IP_CIP=
May 18 20:29:03 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs3-cluster_ip-1_stop_0: ok (node=nfs3, call=190, rc=0, cib-update=124, confirmed=true)
May 18 20:29:03 nfs3 IPaddr(nfs1-cluster_ip-1)[27560]: INFO: IP status = ok, IP_CIP=
May 18 20:29:03 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs1-cluster_ip-1_stop_0: ok (node=nfs3, call=188, rc=0, cib-update=125, confirmed=true)
May 18 20:29:03 nfs3 IPaddr(nfs4-cluster_ip-1)[27562]: INFO: IP status = ok, IP_CIP=
May 18 20:29:03 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs4-cluster_ip-1_stop_0: ok (node=nfs3, call=192, rc=0, cib-update=126, confirmed=true)
May 18 20:29:03 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs-grace_stop_0: ok (node=nfs3, call=196, rc=0, cib-update=127, confirmed=true)
May 18 20:29:03 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs4-trigger_ip-1_stop_0: ok (node=nfs3, call=202, rc=0, cib-update=128, confirmed=true)
May 18 20:29:03 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs3-trigger_ip-1_stop_0: ok (node=nfs3, call=200, rc=0, cib-update=129, confirmed=true)
May 18 20:29:04 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs1-trigger_ip-1_stop_0: ok (node=nfs3, call=198, rc=0, cib-update=130, confirmed=true)
May 18 20:29:04 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs-grace_start_0: ok (node=nfs3, call=203, rc=0, cib-update=131, confirmed=true)
May 18 20:29:04 nfs3 crmd[5607]: notice: process_lrm_event: Operation nfs-grace_monitor_5000: ok (node=nfs3, call=204, rc=0, cib-update=132, confirmed=false)
May 18 20:29:04 nfs3 ntpd[1749]: Deleting interface #34 eth0, 10.70.36.217#123, interface stats: received=0, sent=0, dropped=0, active_time=285 secs
May 18 20:29:04 nfs3 ntpd[1749]: Deleting interface #33 eth0, 10.70.36.220#123, interface stats: received=0, sent=0, dropped=0, active_time=285 secs
May 18 20:29:04 nfs3 ntpd[1749]: Deleting interface #31 eth0, 10.70.36.219#123, interface stats: received=0, sent=0, dropped=0, active_time=3027 secs
May 18 20:29:04 nfs3 ntpd[1749]: peers refreshed
Saurabh, please paste the last few lines of gfapi.log that we saw.
As per gfapi.log from nfs3:

[2015-05-18 14:08:22.308125] I [client-handshake.c:187:client_set_lk_version_cbk] 0-vol2-client-7: Server lk version = 1
[2015-05-18 14:08:22.319205] I [afr-common.c:1673:afr_local_discovery_cbk] 0-vol2-replicate-3: selecting local read_child vol2-client-7
[2015-05-18 14:08:22.319249] I [afr-common.c:1673:afr_local_discovery_cbk] 0-vol2-replicate-1: selecting local read_child vol2-client-3
[2015-05-18 14:08:22.319280] I [afr-common.c:1673:afr_local_discovery_cbk] 0-vol2-replicate-5: selecting local read_child vol2-client-11
[2015-05-18 14:08:22.319378] I [MSGID: 104041] [glfs-resolve.c:843:__glfs_active_subvol] 0-vol2: switched to graph 6e667333-2d33-3437-372d-323031352d30 (0)
[2015-05-18 14:54:55.598244] W [client-callback.c:65:client_cbk_cache_invalidation] 0-vol2-client-11: XDR decode of cache_invalidation failed.
[2015-05-18 14:54:59.302948] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
[2015-05-18 14:54:59.303540] W [MSGID: 108008] [afr-read-txn.c:237:afr_read_txn] 0-vol2-replicate-5: Unreadable subvolume -1 found with event generation 2. (Possible split-brain)
[2015-05-18 14:55:19.427492] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
[2015-05-18 14:56:00.410174] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
[2015-05-18 14:56:17.207031] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
The message "W [MSGID: 108008] [afr-read-txn.c:237:afr_read_txn] 0-vol2-replicate-5: Unreadable subvolume -1 found with event generation 2. (Possible split-brain)" repeated 3 times between [2015-05-18 14:54:59.303540] and [2015-05-18 14:56:17.219316]
[2015-05-18 14:56:54.106281] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
[2015-05-18 14:56:54.126517] W [MSGID: 108008] [afr-read-txn.c:237:afr_read_txn] 0-vol2-replicate-5: Unreadable subvolume -1 found with event generation 2. (Possible split-brain)
[2015-05-18 14:56:54.283049] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
[2015-05-18 14:57:54.191362] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
[2015-05-18 14:58:08.095779] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
[2015-05-18 14:58:26.778498] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
The message "W [MSGID: 108008] [afr-read-txn.c:237:afr_read_txn] 0-vol2-replicate-5: Unreadable subvolume -1 found with event generation 2. (Possible split-brain)" repeated 4 times between [2015-05-18 14:56:54.126517] and [2015-05-18 14:58:26.780169]
[2015-05-18 14:58:39.474909] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
[2015-05-18 14:58:39.475858] W [MSGID: 108008] [afr-read-txn.c:237:afr_read_txn] 0-vol2-replicate-5: Unreadable subvolume -1 found with event generation 2. (Possible split-brain)

From gfapi.log on nfs4:

The message "W [MSGID: 108008] [afr-read-txn.c:237:afr_read_txn] 0-vol2-replicate-5: Unreadable subvolume -1 found with event generation 12. (Possible split-brain)" repeated 5 times between [2015-05-18 14:50:58.054198] and [2015-05-18 14:52:26.797085]
[2015-05-18 14:53:13.243341] W [client-callback.c:65:client_cbk_cache_invalidation] 0-vol2-client-10: XDR decode of cache_invalidation failed.
[2015-05-18 14:53:17.390477] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-10: remote operation failed: Stale file handle
[2015-05-18 14:53:17.395473] W [MSGID: 108008] [afr-read-txn.c:237:afr_read_txn] 0-vol2-replicate-5: Unreadable subvolume -1 found with event generation 12. (Possible split-brain)
[2015-05-18 14:53:17.481876] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
[2015-05-18 14:53:51.528836] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-11: remote operation failed: Stale file handle
[2015-05-18 14:53:56.245641] W [client-rpc-fops.c:506:client3_3_stat_cbk] 0-vol2-client-10: remote operation failed: Stale file handle
I saw cache_invalidation-related errors in gfapi.log. This is what I did on my 4-node setup:

1. Cloned gluster master (no fix has gone in except for Disable_ACL) and installed it on all the machines.
2. Cloned NFS-Ganesha-2.3-dev-3, applied Soumya's patch that is waiting to be merged in NFS-Ganesha upstream (https://review.gerrithub.io/#/c/228614/), and installed NFS-Ganesha on all the nodes.

My four-node setup has a volume testvol, which looks like this:

Volume Name: testvol
Type: Distribute
Volume ID: b021c59b-16ad-4eaf-9fc3-d0eb563a9ab0
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: rhs1:/brick/brick1
Brick2: rhs2:/brick/brick1
Brick3: rhs3:/brick/brick1
Brick4: rhs4:/brick/brick1
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
nfs.disable: on
performance.readdir-ahead: on
nfs-ganesha: enable

VIP of rhs1: 10.70.40.173
VIP of rhs2: 10.70.40.174

I mounted the volume "testvol" on two different clients:

Client 1: 10.70.40.173:/testvol on /mnt type nfs (rw,vers=4,addr=10.70.40.173,clientaddr=10.70.43.78)
Client 2: 10.70.40.174:/testvol on /mnt type nfs (rw,vers=4,addr=10.70.40.174,clientaddr=10.70.42.117)

And I created two directories, dir and dir1, on the mount point.

On client 1, I ran iozone by executing the following commands:

cd /mnt/dir
time iozone -a -f rhs1.ioz

On client 2, I ran iozone by executing the following command:

time iozone -a -f rhs3.ioz

After a few seconds, I killed ganesha on rhs1 (VIP 10.70.40.173). I saw a delay of about a minute, and I/O resumed after that. pcs status looked like this:

Online: [ rhs1 rhs2 rhs3 rhs4 ]

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ rhs1 rhs2 rhs3 rhs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ rhs1 rhs2 rhs3 rhs4 ]
 rhs1-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started rhs4
 rhs1-trigger_ip-1      (ocf::heartbeat:Dummy):         Started rhs4
 rhs2-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started rhs2
 rhs2-trigger_ip-1      (ocf::heartbeat:Dummy):         Started rhs2
 rhs3-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started rhs3
 rhs3-trigger_ip-1      (ocf::heartbeat:Dummy):         Started rhs3
 rhs4-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started rhs4
 rhs4-trigger_ip-1      (ocf::heartbeat:Dummy):         Started rhs4
 rhs1-dead_ip-1         (ocf::heartbeat:Dummy):         Started rhs1

rhs1 had failed over to rhs4. After around 9 minutes, iozone completed successfully on both clients.

Client 1:
iozone test complete.
real 9m46.894s
user 0m3.679s
sys 2m3.828s

Client 2:
iozone test complete.
real 7m9.126s
user 0m2.482s
sys 0m56.361s

After the tests finished, I checked pcs status again:

Online: [ rhs1 rhs2 rhs3 rhs4 ]

Full list of resources:

 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ rhs1 rhs2 rhs3 rhs4 ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ rhs1 rhs2 rhs3 rhs4 ]
 rhs1-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started rhs4
 rhs1-trigger_ip-1      (ocf::heartbeat:Dummy):         Started rhs4
 rhs2-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started rhs2
 rhs2-trigger_ip-1      (ocf::heartbeat:Dummy):         Started rhs2
 rhs3-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started rhs3
 rhs3-trigger_ip-1      (ocf::heartbeat:Dummy):         Started rhs3
 rhs4-cluster_ip-1      (ocf::heartbeat:IPaddr):        Started rhs4
 rhs4-trigger_ip-1      (ocf::heartbeat:Dummy):         Started rhs4
 rhs1-dead_ip-1         (ocf::heartbeat:Dummy):         Started rhs1

NFS-Ganesha was running on all the other three nodes.
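For reference, the kill-and-failover check in the run above amounts to something like the following; pkill is an assumption here, since this comment only says ganesha was killed (the earlier comment used "kill -s TERM <pid>"):

# On rhs1: terminate the ganesha head to trigger failover.
pkill -TERM -x ganesha.nfsd

# From any cluster node: confirm rhs1's VIP has moved to a surviving node.
pcs status | grep cluster_ip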
As I was able to run it successfully, I request you to try it with the new builds.
Saurabh,

Could you confirm whether you are still hitting this issue?
This bug is being closed because GlusterFS 3.7 has reached its end of life.

Note: This bug is being closed using a script. No verification has been performed to check whether it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen it against the newer release.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days