Bug 764244 (GLUSTER-2512) - Peer's death in a 3 replica cluster stops data transfer for up to 45 sec.
Keywords:
Status: CLOSED WONTFIX
Alias: GLUSTER-2512
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.1.2
Hardware: i386
OS: Linux
Priority: medium
Severity: low
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-10 22:03 UTC by raf
Modified: 2011-03-22 06:40 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: fuse
Documentation: DP
CRM:
Verified Versions:



Description raf 2011-03-10 22:03:09 UTC
[192.168.0.1]#gluster volume create test replica 3 transport tcp 192.168.0.1:/var/gluster 192.168.0.2:/var/gluster 192.168.0.3:/var/gluster

[192.168.0.1]#gluster volume start test

[192.168.0.1]#mount -t glusterfs localhost:/test /mnt/gluster

Share /mnt/gluster using Samba and start copying a bunch of data from a Window$ client.

During the data copy, kill 192.168.0.2 (unplug it from the surge protector).
The data transfer stops for up to 45 seconds and then resumes without errors.

Raf

Comment 1 Pranith Kumar K 2011-03-20 01:08:22 UTC
(In reply to comment #0)
> [192.168.0.1]#gluster volume create test replica 3 transport tcp
> 192.168.0.1:/var/gluster 192.168.0.2:/var/gluster 192.168.0.3:/var/gluster
> 
> [192.168.0.1]#gluster volume start test
> 
> [192.168.0.1]#mount -t glusterfs localhost:/test /mnt/gluster
> 
> Share /mnt/gluster using Samba and start copying a bunch of data from a Window$
> client.
> 
> During the data copy, kill 192.168.0.2 (unplug it from the surge protector).
> The data transfer stops for up to 45 seconds and then resumes without errors.
> 
> Raf

      The network ping timeout for glusterfs is 42 seconds by default, which matches the stall you are seeing. Could you check whether the same thing happens after setting the ping-timeout to something lower than the Samba client's timeout?
Example: gluster volume set test network.ping-timeout 10

Comment 2 raf 2011-03-21 07:00:02 UTC
Well, I entered
gluster volume set test network.ping-timeout 5
and the hang is no longer noticeable.

Thank you

Raf

Comment 3 Joe Julian 2011-03-21 12:17:14 UTC
Is 42 seconds really reasonable? I know it's the answer to life, the universe, and everything, but I'm not sure it's the best answer to ping timeouts. This is a common issue on the IRC channel and I'm thinking that unless you're trying to replicate over a WAN, 2 - 10 seconds seems a much more reasonable timeout.

Comment 4 Pranith Kumar K 2011-03-22 03:40:38 UTC
(In reply to comment #3)
> Is 42 seconds really reasonable? I know it's the answer to life, the universe,
> and everything, but I'm not sure it's the best answer to ping timeouts. This is
> a common issue on the IRC channel and I'm thinking that unless you're trying to
> replicate over a WAN, 2 - 10 seconds seems a much more reasonable timeout.

Hi Joe,
   In production, servers are expected to come back online within ~30 seconds. That is why the ping timeout is set above ~30 seconds: network reconnection is a costly operation. It involves resource cleanup on the server side, and the client has to redo locking etc. after reconnecting.
  The timeout is exposed as an option that users can change based on their needs, so we don't want to change the default. We will document this in the wiki.
I will be closing the bug.

Thanks
Pranith.
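
For anyone landing here with the same stall, the workaround from comment 1 can be sketched as a small shell helper. This is only a sketch: ping_timeout_cmd is a hypothetical name, not part of GlusterFS, and the only real command involved is the "gluster volume set ... network.ping-timeout" call quoted above. It prints the command instead of running it, so the value can be reviewed before it is applied on a server node.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with GlusterFS): builds the command that
# sets network.ping-timeout on a volume and rejects non-numeric values.
ping_timeout_cmd() {
    volume=$1
    seconds=$2
    case $seconds in
        # Empty string or anything containing a non-digit is refused.
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    echo "gluster volume set $volume network.ping-timeout $seconds"
}

# Volume name and value taken from this report.
ping_timeout_cmd test 10
# prints: gluster volume set test network.ping-timeout 10
```

A lower value shortens the stall when a peer dies, but as noted in comment 4 it also makes costly reconnections more likely when a server is only briefly unreachable, so the right number depends on the network.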

