Setup is two servers in a 1x1 config, volume 'mirrorvol1' in replica 2, on 3.1 FS GA.

ucarp command, node 1:

root 3264 3235 0 13:56 pts/0 00:00:00 ucarp -P -z -b 1 -i eth0 -s 10.1.30.127 -v 1 -p mypass -a 10.1.30.252 --upscript=/etc/vip-up-252.sh --downscript=/etc/vip-down-252.sh
root 3275 3235 0 13:56 pts/0 00:00:00 ucarp -z -b 2 -i eth0 -s 10.1.30.127 -v 2 -p mypass1 -a 10.1.30.253 --upscript=/etc/vip-up-253.sh --downscript=/etc/vip-down-253.sh

ucarp command, node 2:

root 3256 3212 0 17:57 pts/0 00:00:11 ucarp -P -z -b 1 -i eth0 -s 10.1.30.126 -v 2 -p mypass1 -a 10.1.30.253 --upscript=/etc/vip-up-253.sh --downscript=/etc/vip-down-253.sh
root 3261 3212 0 17:57 pts/0 00:00:06 ucarp -z -b 2 -i eth0 -s 10.1.30.126 -v 1 -p mypass -a 10.1.30.252 --upscript=/etc/vip-up-252.sh --downscript=/etc/vip-down-252.sh

up script:

#!/bin/sh
/sbin/ip addr add 10.1.30.252/24 dev eth0
arping -U -I eth0 10.1.30.252 -c 1

down script:

#!/bin/sh
/sbin/ip addr del 10.1.30.252/24 dev eth0

The scripts are the same on both nodes. After failover the client is stuck, and the NFS log on the server shows failures:

tail -f /var/log/glusterfs/nfs.log
[2010-10-13 18:25:45.288091] W [afr-self-heal-common.c:584:afr_sh_pending_to_delta] afr_sh_pending_to_delta: Unable to get dict value.
[2010-10-13 18:25:45.288116] W [afr-self-heal-common.c:584:afr_sh_pending_to_delta] afr_sh_pending_to_delta: Unable to get dict value.
[2010-10-13 18:25:45.295452] I [afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] mirrorvol1-replicate-0: background meta-data data self-heal completed on /bin/stat
[2010-10-13 18:28:18.522523] E [client-handshake.c:116:rpc_client_ping_timer_expired] mirrorvol1-client-0: Server 10.1.30.127:24009 has not responded in the last 42 seconds, disconnecting.
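As a quick sanity check after a failover, it can help to confirm which node currently holds the VIP and that the gratuitous ARP went out. A minimal sketch, using the interface (eth0) and VIP (10.1.30.252) from the setup above; run on each node:

```shell
# Does this node currently hold the VIP?
ip addr show dev eth0 | grep 10.1.30.252

# Re-announce the VIP's MAC to the LAN (same as the up script does),
# in case switches still have a stale ARP entry for the failed master
arping -U -I eth0 10.1.30.252 -c 1
```

If neither node shows the address, the ucarp up script did not fire and the client hang is an IP-takeover problem rather than a GlusterFS one.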
[2010-10-13 18:28:18.522765] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2ba3fe516509] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2ba3fe515cae] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2ba3fe515c1e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(FINODELK(30)) called at 2010-10-13 18:27:11.779234
[2010-10-13 18:28:18.522827] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2ba3fe516509] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2ba3fe515cae] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2ba3fe515c1e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(XATTROP(33)) called at 2010-10-13 18:27:11.779366
[2010-10-13 18:28:18.522908] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2ba3fe516509] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2ba3fe515cae] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2ba3fe515c1e]))) rpc-clnt: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2010-10-13 18:27:36.459191
[2010-10-13 18:28:21.525870] E [socket.c:1657:socket_connect_finish] mirrorvol1-client-0: connection to 10.1.30.127:24009 failed (No route to host)
[2010-10-13 18:33:38.482826] I [client-handshake.c:699:select_server_supported_programs] mirrorvol1-client-0: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2010-10-13 18:33:38.873463] I [client-handshake.c:535:client_setvolume_cbk] mirrorvol1-client-0: Connected to 10.1.30.127:24009, attached to remote volume '/mnt1'.

At this point, the client hangs forever.
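The "has not responded in the last 42 seconds" line above corresponds to GlusterFS's default ping timeout. As a hedged aside (this tunes how fast the NFS server notices a dead brick, not the NFS client's own retransmission behaviour), the window can be shortened with the standard volume option; 42 seconds is the default:

```shell
# Assumption: stock network.ping-timeout volume option; value is in seconds
gluster volume set mirrorvol1 network.ping-timeout 10
```

A lower value makes the rpc-clnt disconnect (and subsequent reconnect to the surviving replica) happen sooner after an abrupt poweroff, at the risk of spurious disconnects on a congested network.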
Allen/Harsha, if you get some time, can you apply the patch http://patches.gluster.com/patch/5495/ and try again? I suspect the issue has come up because CHILD_UP is not being propagated properly up to the NFS xlator.
Changing this to blocker; many users/customers who need pure HA with load balancing via RRDNS cannot move forward without a fix for this.
More info available at bug 763607
Hi Allen, a few questions:

1. Can I get access to this system?
2. Or the complete log that is pasted below?
3. When you say "client hangs forever", how long is forever? It could be that the NFS client's retransmission timeout makes it seem like the client app is hung.
(In reply to comment #4)
> Hi Allen, a few questions:
>
> 1. Can I get access to this system?

The systems are accessible at 10.1.10.111, 10.1.10.112, 10.1.10.113.

> 2. Or the complete log that is pasted below?

I think Anush has already reproduced it himself; check with him on the scenarios he reproduced it with.

> 3. When you say "client hangs forever", how long is forever? It could be that
> the NFS client's retransmission timeout makes it seem like the client app is hung.

We don't know anything about the retransmission timeout here, but it was so long that it might as well be forever. Vijay also knows about this issue in detail, as he was speaking about changing keep_alive and idle_timeout.
> > 3. When you say "client hangs forever", how long is forever? It could be that
> > the NFS client's retransmission timeout makes it seem like the client app is hung.
> We don't know anything about the retransmission timeout here, but it was so long
> that it might as well be forever.

I was able to reproduce this issue. I think it depends on the mount type (soft or hard). In my case, I'm able to reproduce it with a soft mount and a timeout of 15 seconds on Fedora Core. I lose the mount, with Linux returning an Input/Output error initially and then error 526 for any operation I try against the mount. It recovers about 30 seconds later, but my I/O operations to anything on the mount are already dead.
(In reply to comment #6)
> I was able to reproduce this issue. I think it depends on the mount type
> (soft or hard). In my case, I'm able to reproduce it with a soft mount and a
> timeout of 15 seconds on Fedora Core. I lose the mount, with Linux returning
> an Input/Output error initially and then error 526 for any operation I try
> against the mount. It recovers about 30 seconds later, but my I/O operations to
> anything on the mount are already dead.

OK, that's the confirmation. If the timeouts are really occurring, NFS soft-mounted clients return EIO. Thanks.
Failover, failback, and a second failover all work fine with the 2-replica config for the following workload over the NFS mount point:

for i in $(seq 1 1000); do echo $i; touch /mnt/$i; sleep 1; done

There are a few seconds of hang, but file creation resumes once the IP migration and the NFS client's back-off timeout are over. Trying with an IO load next.
Failover, failback, and a second failover all work fine with the 2-replica config for the following workload over the NFS mount point:

cp -avf /usr /mnt

The NFS mount is mounted with soft,intr. So a cp of one or two files whose requests time out results in an Input/Output error, but the cp for all other files resumes and continues. To confirm that the EIO is due to a timeout, I am next going to mount without the soft,intr options so that the client retransmits indefinitely.
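For reference, the two mount variants being compared might look as follows. This is a sketch; the VIP and export path are taken from the setup above, and the timeo/retrans values are illustrative (timeo is in tenths of a second, so 150 matches the 15-second timeout mentioned in comment #6):

```shell
# soft,intr: operations that exhaust their retries return EIO to the application
mount -t nfs -o soft,intr,timeo=150,retrans=3 10.1.30.252:/mirrorvol1 /mnt

# hard: the client retransmits indefinitely; I/O blocks until the server answers
mount -t nfs -o hard,intr 10.1.30.252:/mirrorvol1 /mnt
```

With the hard mount, a failover window shows up only as a pause, never as EIO, which is what the indefinite-retransmit test is meant to demonstrate.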
For a 6 GB dd too, failover and failback work fine with the 2-replica config. The second failover is taking unusually long to recover. Logs suggest a self-heal is getting triggered. Investigating...
(In reply to comment #10)
> For a 6 GB dd too, failover and failback work fine with the 2-replica config.
> The second failover is taking unusually long to recover. Logs suggest a
> self-heal is getting triggered. Investigating...

How are you triggering the failover? A simple kill command or a clean shutdown of the node doesn't help. To reproduce this case properly, you need to power off the node abruptly: remove the power cable. You will see the hang; this was reproduced by Anush already. Did you speak to him?
(In reply to comment #11)
> How are you triggering the failover? A simple kill command or a clean shutdown
> of the node doesn't help. To reproduce this case properly, you need to power
> off the node abruptly: remove the power cable.

I am powering off the VMs, which is the same as an abrupt poweroff. I can see the slave take over the IP address as the master node goes off.

> You will see the hang; this was reproduced by Anush already. Did you speak to
> him?

I think he also saw the long delays as a hang. The trick is in knowing when the client starts retransmitting requests by looking at dmesg. This may be a case of an excessively long timeout period in NFS clients that looks like a hung client. In any case, the second failover still needs a bit of investigation. That may point out the problem you were facing.
PATCH: http://patches.gluster.com/patch/5584 in master (socket: Reduce keepalive idle and interval time)
In all my tests I am able to resume IO after the NFS client starts retransmitting. I've confirmed this while monitoring glusterfs and glusterfsd CPU usage on all bricks, as well as file size. I believe the long retransmission timeouts and back-offs make it seem like the client is hung, but it is not. The right solution is to reduce the value of the "retrans" option at mount time if the IO needs to resume faster. Of course, that has the disadvantage of generating extra network traffic, but I guess that is an acceptable price. Closing this bug.
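To make the trade-off concrete, the interplay of timeo and retrans determines how long a soft mount waits before giving up with EIO. A rough back-of-the-envelope sketch, assuming UDP-style exponential back-off as described in nfs(5) (the minor timeout doubles after each retransmission) and ignoring the 60-second per-retry cap:

```shell
#!/bin/sh
# Worst-case seconds before a soft-mounted NFS client returns EIO, assuming
# the minor timeout doubles after each retransmission (UDP behaviour) and
# ignoring the 60-second cap. timeo is in tenths of a second, as in nfs(5).
major_timeout() {
    timeo=$1    # initial minor timeout, tenths of a second
    retrans=$2  # retransmissions before the major timeout
    total=0
    i=0
    while [ "$i" -le "$retrans" ]; do
        total=$((total + timeo))   # wait out this attempt
        timeo=$((timeo * 2))       # back off for the next one
        i=$((i + 1))
    done
    echo $((total / 10))
}

major_timeout 150 3   # timeo=15s, retrans=3: 15+30+60+120 -> prints 225
```

Lowering retrans (or timeo) shrinks this window, at the cost of the extra retransmission traffic mentioned above.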
(In reply to comment #14)
> In all my tests I am able to resume IO after the NFS client starts
> retransmitting. I've confirmed this while monitoring glusterfs and glusterfsd
> CPU usage on all bricks, as well as file size.

I am not following here. There was a patch sent against this bug ID; what was it supposed to do? I am aware that I reproduced this with Anush in a screen session, and I think Vijay already knows about it.
(In reply to comment #15)
> I am not following here. There was a patch sent against this bug ID; what was
> it supposed to do? I am aware that I reproduced this with Anush in a screen
> session, and I think Vijay already knows about it.

That patch is a precautionary measure; even without it, things work. You did not reproduce a permanent hang. Even without that patch, you just need to let the client wait long enough and it unhangs itself when its timeout expires.