Bug 763688 (GLUSTER-1956) - NFS client hangs after failover
Summary: NFS client hangs after failover
Keywords:
Status: CLOSED NOTABUG
Alias: GLUSTER-1956
Product: GlusterFS
Classification: Community
Component: nfs
Version: 3.1.0
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Assignee: Shehjar Tikoo
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-10-14 21:05 UTC by Allen Lu
Modified: 2010-12-23 08:18 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: RTP
Mount Type: nfs
Documentation: DNR
CRM:
Verified Versions:
Embargoed:



Description Allen Lu 2010-10-14 21:05:49 UTC
Setup is two servers with volume 'mirrorvol1' in a replica 2 (1x2) configuration on GlusterFS 3.1 GA.

ucarp command node 1
root      3264  3235  0 13:56 pts/0    00:00:00 ucarp -P -z -b 1 -i eth0 -s 10.1.30.127 -v 1 -p mypass -a 10.1.30.252 --upscript=/etc/vip-up-252.sh --downscript=/etc/vip-down-252.sh
root      3275  3235  0 13:56 pts/0    00:00:00 ucarp -z -b 2 -i eth0 -s 10.1.30.127 -v 2 -p mypass1 -a 10.1.30.253 --upscript=/etc/vip-up-253.sh --downscript=/etc/vip-down-253.sh

ucarp command node 2
root      3256  3212  0 17:57 pts/0    00:00:11 ucarp -P -z -b 1 -i eth0 -s 10.1.30.126 -v 2 -p mypass1 -a 10.1.30.253 --upscript=/etc/vip-up-253.sh --downscript=/etc/vip-down-253.sh
root      3261  3212  0 17:57 pts/0    00:00:06 ucarp -z -b 2 -i eth0 -s 10.1.30.126 -v 1 -p mypass -a 10.1.30.252 --upscript=/etc/vip-up-252.sh --downscript=/etc/vip-down-252.sh

up script
#!/bin/sh
/sbin/ip addr add 10.1.30.252/24 dev eth0
arping -U -I eth0 10.1.30.252 -c 1

down script
#!/bin/sh
/sbin/ip addr del 10.1.30.252/24 dev eth0

scripts are the same on both nodes.
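
For context, the client mounts the volume through the floating VIP rather than a fixed server address. A minimal sketch of such a mount, assuming NFSv3 over TCP against volume mirrorvol1 and a local mount point /mnt (mount point and options are assumptions, not taken from this report):

# Hypothetical client-side mount through the ucarp-managed VIP 10.1.30.252.
# Gluster NFS serves NFSv3 over TCP, so vers=3,tcp are the usual options.
mount -t nfs -o vers=3,tcp 10.1.30.252:/mirrorvol1 /mnt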

After failover, the client is stuck.

The NFS log on the server shows the following errors:

tail -f /var/log/glusterfs/nfs.log
[2010-10-13 18:25:45.288091] W
[afr-self-heal-common.c:584:afr_sh_pending_to_delta] afr_sh_pending_to_delta:
Unable to get dict value.
[2010-10-13 18:25:45.288116] W
[afr-self-heal-common.c:584:afr_sh_pending_to_delta] afr_sh_pending_to_delta:
Unable to get dict value.
[2010-10-13 18:25:45.295452] I
[afr-self-heal-common.c:1526:afr_self_heal_completion_cbk]
mirrorvol1-replicate-0: background  meta-data data self-heal completed on
/bin/stat
[2010-10-13 18:28:18.522523] E
[client-handshake.c:116:rpc_client_ping_timer_expired] mirrorvol1-client-0:
Server 10.1.30.127:24009 has not responded in the last 42 seconds,
disconnecting.
[2010-10-13 18:28:18.522765] E [rpc-clnt.c:338:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2ba3fe516509]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2ba3fe515cae]
(-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2ba3fe515c1e])))
rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(FINODELK(30)) called at
2010-10-13 18:27:11.779234
[2010-10-13 18:28:18.522827] E [rpc-clnt.c:338:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2ba3fe516509]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2ba3fe515cae]
(-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2ba3fe515c1e])))
rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(XATTROP(33)) called at
2010-10-13 18:27:11.779366
[2010-10-13 18:28:18.522908] E [rpc-clnt.c:338:saved_frames_unwind]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2ba3fe516509]
(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2ba3fe515cae]
(-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2ba3fe515c1e])))
rpc-clnt: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called
at 2010-10-13 18:27:36.459191
[2010-10-13 18:28:21.525870] E [socket.c:1657:socket_connect_finish]
mirrorvol1-client-0: connection to 10.1.30.127:24009 failed (No route to host)
[2010-10-13 18:33:38.482826] I
[client-handshake.c:699:select_server_supported_programs] mirrorvol1-client-0:
Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2010-10-13 18:33:38.873463] I [client-handshake.c:535:client_setvolume_cbk]
mirrorvol1-client-0: Connected to 10.1.30.127:24009, attached to remote volume
'/mnt1'.

At this point, the client hangs forever.

Comment 1 Amar Tumballi 2010-10-15 03:12:30 UTC
Allen/Harsha,

If you get some time, can you apply the patch http://patches.gluster.com/patch/5495/ and try again? I suspect the issue has come up because CHILD_UP is not being propagated properly up to the NFS xlator.

Comment 2 Harshavardhana 2010-10-15 14:22:01 UTC
Changing this to a blocker; we can't move forward without a fix for this, since many users/customers need pure HA with load balancing via RRDNS.

Comment 3 Shehjar Tikoo 2010-10-19 04:20:44 UTC
More info available at bug 763607

Comment 4 Shehjar Tikoo 2010-10-19 06:54:10 UTC
Hi Allen, a few questions:

1. Can I get access to this system?

2. Or can I get the complete log, an excerpt of which is pasted above?

3. When you say "client hangs forever", how long is forever? It could be that the NFS client's retransmission timeout makes it seem like the client app is hung.

Comment 5 Harshavardhana 2010-10-19 13:39:06 UTC
(In reply to comment #4)
> Hi Allen, a few questions:
> 
> 1. Can i get access to this system?
> 
Systems are accessible at 10.1.10.111, 10.1.10.112, and 10.1.10.113.
> 2. Or the complete log that is pasted below?
> 
I think Anush has already reproduced it himself; check with him about the scenarios he reproduced it with. 

> 3. When you say "client hangs forever", how long is forever? it could be that
> the nfs clients retransmission timeout seems like the client app is hung.
We don't know anything about the retransmission timeout here, but it was so long that it might as well have been forever. 

Vijay also knows about this issue in detail, as he was talking about changing keep_alive and idle_timeout.

Comment 6 Julia Ashley Kreger 2010-10-19 16:18:32 UTC
> > 3. When you say "client hangs forever", how long is forever? it could be that
> > the nfs clients retransmission timeout seems like the client app is hung.
> We don't know anything about retransmission timeout here but it was so long
> that it was forever. 


I was able to reproduce this issue. I think it depends on the mount type (soft or hard). In my case, I'm able to reproduce it with a soft mount and a timeout of 15 seconds on Fedora Core. I lose the mount, with Linux returning an Input/Output error initially and then error 526 for any operation I try against the mount. It recovers about 30 seconds later, but my I/O operations to anything on the mount are already dead.

Comment 7 Shehjar Tikoo 2010-10-20 01:18:55 UTC
(In reply to comment #6)
> > > 3. When you say "client hangs forever", how long is forever? it could be that
> > > the nfs clients retransmission timeout seems like the client app is hung.
> > We don't know anything about retransmission timeout here but it was so long
> > that it was forever. 
> 
> 
> I was able to reproduce this issue.  I think it would depend on the mount type
> (soft or hard).  In my case, I'm able to reproduce with a soft mount, and a
> timeout of 15 seconds on Fedora Core.  I loose the mount with Linux returning
> Input/Output error initially and then the error 526 for any operation I try
> against the mount.  It recovers about 30 seconds later, but my IO operations to
> anything on the mount are already dead.

OK, that's the confirmation. If the timeouts are really occurring, NFS soft-mounted clients return an EIO. Thanks.

Comment 8 Shehjar Tikoo 2010-10-20 04:47:00 UTC
Failover, failback, and a second failover all work fine with a 2-replica config for the following workload over the NFS mount point:

for i in $(seq 1 1000); do echo $i; touch /mnt/$i;sleep 1;done

There is a hang of a few seconds, but file creation resumes once the IP migration and the NFS client's back-off timeout are over. Trying with an I/O load next.

Comment 9 Shehjar Tikoo 2010-10-20 05:28:00 UTC
Failover, failback, and a second failover all work fine with a 2-replica config for the following workload over the NFS mount point:

cp -avf /usr /mnt

The NFS mount uses the soft,intr options, so a cp of one or two files whose requests time out results in an Input/Output error, but the cp of all other files resumes and continues.

To confirm that the EIO is due to a timeout, I am next going to mount without the soft,intr options so that the client retransmits indefinitely.
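
For illustration, the two mount variants being compared would look roughly like this (a sketch with an assumed server address and mount point):

# Soft, interruptible mount: requests that exhaust their retries return EIO to the application.
mount -t nfs -o vers=3,tcp,soft,intr 10.1.30.252:/mirrorvol1 /mnt
# Hard mount (the default): the client retransmits indefinitely, so applications block instead of failing.
mount -t nfs -o vers=3,tcp 10.1.30.252:/mirrorvol1 /mnt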

Comment 10 Shehjar Tikoo 2010-10-20 07:59:07 UTC
For a 6 GB dd too, failover and failback work fine with a 2-replica config. The second failover is taking unusually long to recover; logs suggest a self-heal is getting triggered. Investigating...

Comment 11 Harshavardhana 2010-10-20 08:30:53 UTC
(In reply to comment #10)
> For a 6g dd too, failover and failback work fine with 2 replica config. The
> second failover is taking unusually long to recover. Logs suggest a self-heal
> is getting triggered. Investigating....

How are you triggering the failover? A simple kill command or a clean shutdown of the node doesn't help; you need to power off the node abruptly to reproduce this case properly, i.e. pull the power cable. 

You will see the hang. This was already reproduced by Anush; did you speak to him?

Comment 12 Shehjar Tikoo 2010-10-21 02:03:32 UTC
(In reply to comment #11)
> (In reply to comment #10)
> > For a 6g dd too, failover and failback work fine with 2 replica config. The
> > second failover is taking unusually long to recover. Logs suggest a self-heal
> > is getting triggered. Investigating....
> 
> how are you trying the failover, a simple kill command or a shutdown of the
> node doesn't help. You need to poweroff the node abruptly to understand
> perfectly this case remove the power cable. 
> 

I am powering off the VMs, which is the same as an abrupt poweroff. I can see the slave take over the IP address as the master node goes down.

> You will see the hang, this was reproduced by Anush already did you speak to
> him?

I think he also saw the long delays as a hang. The trick is knowing when the client starts retransmitting requests by looking at dmesg. This may be a case of an excessively long timeout period in NFS clients that looks like a hung client.

In any case, the second failover still needs a bit of investigation. That may point out the problem you were facing.
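
For what it's worth, the retransmission activity can usually be spotted on the client with something like the following; the exact kernel message text varies by kernel version, so treat this as a sketch:

# Check the kernel log on the NFS client for retransmission/recovery messages.
dmesg | grep -i 'nfs: server'
# Typically prints lines such as:
#   nfs: server 10.1.30.252 not responding, still trying
#   nfs: server 10.1.30.252 OK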

Comment 13 Anand Avati 2010-10-27 08:13:55 UTC
PATCH: http://patches.gluster.com/patch/5584 in master (socket: Reduce keepalive idle and interval time)

Comment 14 Shehjar Tikoo 2010-10-28 09:45:48 UTC
In all my tests I am able to resume I/O after the NFS client starts retransmitting.
I've confirmed this by monitoring glusterfs and glusterfsd CPU usage on all bricks, as well as the file size.

I believe the long retransmission timeouts and back-offs make it seem like the client is hung, but it is not.

The right solution is to reduce the value of the "retrans" option at mount time if the I/O needs to resume faster. Of course, that has the disadvantage of generating more network traffic, but I guess that is an acceptable price.
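
As a rough sketch (the specific values here are assumptions to be tuned per deployment, not recommendations from this bug), a soft mount with a smaller retransmission budget might look like:

# timeo is in tenths of a second (30 = 3 s per attempt); retrans is the number of retries
# before a soft mount returns EIO to the application.
mount -t nfs -o vers=3,tcp,soft,intr,timeo=30,retrans=2 10.1.30.252:/mirrorvol1 /mnt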

Closing this bug.

Comment 15 Harshavardhana 2010-10-28 15:46:21 UTC
(In reply to comment #14)
> In all my tests I am able to resume IO after NFS client's starts
> retransmitting. 
> I've confirmed this while monitoring glusterfs and glusterfsd CPU usages on all
> bricks as well as file size.
>  bug.

I am not following here; there was a patch sent against this bug ID. What was that supposed to do? I am aware that I reproduced this with Anush in a screen session, and I think Vijay already knows about it.

Comment 16 Shehjar Tikoo 2010-10-28 22:58:25 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > In all my tests I am able to resume IO after NFS client's starts
> > retransmitting. 
> > I've confirmed this while monitoring glusterfs and glusterfsd CPU usages on all
> > bricks as well as file size.
> >  bug.
> 
> I am not following here, there has been a patch which was sent against this bug
> id what was that supposed to do?. I am aware that i reproduced this is with
> Anush on a screen session and i think Vijay already knows about it.


That patch is a precautionary measure; it works even without it.

You did not reproduce a permanent hang. Even without that patch, you just need to let the client wait long enough; it will unhang itself when its timeout expires.

