Bug 764244 (GLUSTER-2512)

Summary: Peer's death in a 3 replica cluster stops data transfer for up to 45 sec.
Product: [Community] GlusterFS
Reporter: raf <milanraf>
Component: replicate
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED WONTFIX
Severity: low
Priority: medium
Version: 3.1.2
CC: gluster-bugs, joe
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Mount Type: fuse

Description raf 2011-03-10 22:03:09 UTC
[192.168.0.1]#gluster volume create test replica 3 transport tcp 192.168.0.1:/var/gluster 192.168.0.2:/var/gluster 192.168.0.3:/var/gluster

[192.168.0.1]#gluster volume start test

[192.168.0.1]#mount -t glusterfs localhost:/test /mnt/gluster

Share /mnt/gluster using Samba and start copying a bunch of data from a Windows client.

During the data copy, kill 192.168.0.2 (unplug it from the surge protector).
Data transfer stops for up to 45 seconds and then resumes without errors.

Raf
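
To put a number on the stall described above, a small probe can write to the mounted volume in a loop and report every write that blocks for longer than a threshold. The following is a minimal sketch and not part of the original report: the function name measure_stalls, its parameters, and the default thresholds are all hypothetical, and it assumes that a write followed by fsync on the FUSE mount blocks while the client waits for the dead peer.

```python
import os
import time


def measure_stalls(path, duration=120.0, interval=0.5, threshold=2.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Append 4 KiB to `path` repeatedly for about `duration` seconds.

    Returns a list of (offset_from_start, seconds_blocked) tuples, one for
    every write+fsync that took longer than `threshold` seconds.
    `clock` and `sleep` are injectable so the loop can be tested without
    waiting in real time.
    """
    stalls = []
    start = clock()
    with open(path, "ab") as f:
        while clock() - start < duration:
            before = clock()
            f.write(b"x" * 4096)
            f.flush()
            os.fsync(f.fileno())  # push the data out of local buffers
            elapsed = clock() - before
            if elapsed > threshold:
                stalls.append((before - start, elapsed))
            sleep(interval)
    return stalls
```

Running measure_stalls("/mnt/gluster/stall-probe.tmp") (a hypothetical probe file on the mount point used above) while unplugging one peer should return a single entry whose second value is roughly the length of the hang.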

Comment 1 Pranith Kumar K 2011-03-20 01:08:22 UTC
(In reply to comment #0)

      The default network ping timeout for GlusterFS is 42 seconds, which matches the stall of up to 45 seconds you are seeing. Could you check whether the same thing happens after setting the ping-timeout to something lower than the Samba client's timeout?
Example: gluster volume set test network.ping-timeout 10
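
For anyone who wants to script this change, the command above can be wrapped in a few lines of Python. This is only a sketch under stated assumptions: the helper names ping_timeout_command and set_ping_timeout are hypothetical, the gluster binary is assumed to be on PATH, and the script is assumed to run with sufficient privileges on one of the servers. The only gluster invocation it uses is the "volume set ... network.ping-timeout" form quoted in this comment.

```python
import subprocess


def ping_timeout_command(volume, seconds):
    """Build the argv for `gluster volume set <volume> network.ping-timeout <seconds>`."""
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("ping-timeout must be a positive whole number of seconds")
    return ["gluster", "volume", "set", volume,
            "network.ping-timeout", str(seconds)]


def set_ping_timeout(volume, seconds, run=subprocess.run):
    """Apply the setting.

    With the default `run`, a non-zero exit status from gluster raises
    subprocess.CalledProcessError. `run` is injectable for dry runs.
    """
    return run(ping_timeout_command(volume, seconds), check=True)
```

Calling set_ping_timeout("test", 10) would then issue the same command as the example in this comment.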

Comment 2 raf 2011-03-21 07:00:02 UTC
Well, I entered
gluster volume set test network.ping-timeout 5
and the hang is no longer noticeable.

Thank you

Raf

Comment 3 Joe Julian 2011-03-21 12:17:14 UTC
Is 42 seconds really reasonable? I know it's the answer to life, the universe, and everything, but I'm not sure it's the best answer to ping timeouts. This is a common issue on the IRC channel and I'm thinking that unless you're trying to replicate over a WAN, 2 - 10 seconds seems a much more reasonable timeout.

Comment 4 Pranith Kumar K 2011-03-22 03:40:38 UTC
(In reply to comment #3)

Hi Joe,
   In production, servers are expected to come back online within ~30 seconds. That is why the ping timeout is set above ~30 seconds: a network reconnection is a costly operation. It involves resource cleanup on the server side, and the client has to redo locking etc. after reconnecting.
   This is exposed as an option that users can change based on their needs, so we don't want to change the default. We will document this in the wiki.
I will be closing the bug.

Thanks
Pranith.