Bug 1054694 - A replicated volume takes too long to come online when one server is down
Summary: A replicated volume takes too long to come online when one server is down
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Ravishankar N
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1218731 1286911 1330855
 
Reported: 2014-01-17 10:00 UTC by Xavi Hernandez
Modified: 2016-09-20 17:46 UTC
3 users

Fixed In Version: glusterfs-3.8.0
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1218731 1330855
Environment:
Last Closed: 2016-06-16 12:38:43 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Xavi Hernandez 2014-01-17 10:00:01 UTC
Description of problem:

When you start a replicated volume with at least one of its bricks managed by an unreachable server, it can take up to one or two minutes to become ready.

This could be acceptable for FUSE mounts, since it only happens at mount time and afterwards the volume works as expected. However, for applications based on libgfapi, where connections and disconnections can happen more often, this delay can break them.

For example, Proxmox enforces a timeout of 30 seconds to start a virtual machine (AFAIK this timeout is not configurable). On start, it uses libgfapi to connect to the virtual disk. If one server is unreachable, the connection takes too long and the start times out, making it impossible to start machines while gluster is not healthy. When gluster is running fine, a machine starts in 1 or 2 seconds.
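
For reference, a minimal libgfapi client that shows where the delay surfaces (a sketch only: the volume and host names are taken from the reproduction steps below, and the header path may differ between installations). glfs_init() does not return until the client graph is ready, so with one server unreachable it blocks for the whole TCP connection timeout:
--------------------
#include <stdio.h>
#include <glusterfs/api/glfs.h>   /* header location may vary, e.g. <api/glfs.h> */

int main(void)
{
        /* volume "test" served by node2/node3, as in the steps below */
        glfs_t *fs = glfs_new("test");
        if (!fs) {
                fprintf(stderr, "glfs_new failed\n");
                return 1;
        }
        glfs_set_volfile_server(fs, "tcp", "node2", 24007);
        glfs_set_logging(fs, "/tmp/gfapi.log", 7);

        /* glfs_init() blocks here until the graph comes up; with node3
         * powered off this is where the 1-2 minute delay is spent */
        if (glfs_init(fs) != 0) {
                fprintf(stderr, "glfs_init failed\n");
                glfs_fini(fs);
                return 1;
        }
        printf("volume is up\n");
        glfs_fini(fs);
        return 0;
}
--------------------
(Build with something like: gcc gfapi-test.c $(pkg-config --cflags --libs glusterfs-api); timing how long "volume is up" takes shows the same delay as the mount below.)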

Version-Release number of selected component (if applicable): discovered on 3.4.1, but also happens on master


How reproducible:

always

Steps to Reproduce:
# mkdir -p /gluster/test
# gluster volume create test replica 2 node2:/bricks/test node3:/bricks/test
volume create: test: success: please start the volume to access data
# gluster volume start test
volume start: test: success
# time mount -t glusterfs node2:/test /gluster/test; ls -l /gluster/test

real	0m0.220s
user	0m0.042s
sys	0m0.063s
total 0
# umount /gluster/test 
# ssh node3 poweroff
# time mount -t glusterfs node2:/test /gluster/test; ls -l /gluster/test

real	2m7.479s
user	0m0.036s
sys	0m0.075s
total 0

Actual results:

It takes too long to make the volume accessible.

Expected results:

If at least one brick is online, it should make the volume available sooner.


Additional info:

The root cause seems related to the TCP connection timeout. If gluster is down but the server is alive, the server immediately replies with "Connection refused" and the delay disappears. However, if the server is down or the IP is unreachable, the client has to wait for the TCP connection attempt to time out.

A first workaround is to lower the value of /proc/sys/net/ipv4/tcp_syn_retries. This value is typically set to 5 or 6, which means a timeout of about one or two minutes. A value of 2 or 3 reduces the timeout to 8 or 16 seconds. However, this could have other consequences depending on the environment.
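
Purely as an illustration of a narrower alternative (this is not something gluster does today, just a sketch): on Linux the same effect can be confined to a single socket with the TCP_SYNCNT option, instead of changing the system-wide sysctl:
--------------------
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_SYNCNT, Linux-specific */

/* Create a TCP socket whose connect() gives up after ~3 SYN retries
 * (roughly 8-16 seconds) instead of the system default of 5 or 6. */
int socket_with_short_syn_timeout(void)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
                perror("socket");
                return -1;
        }
        int syn_retries = 3;
        if (setsockopt(fd, IPPROTO_TCP, TCP_SYNCNT,
                       &syn_retries, sizeof(syn_retries)) < 0) {
                perror("setsockopt(TCP_SYNCNT)");
                close(fd);
                return -1;
        }
        return fd;   /* connect() on this fd times out much sooner */
}
--------------------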

As a side note, I would also change the message that AFR logs when it sees the first subvolume online, as it seems misleading to me. It says "Subvolume 'test-client-0' came back up; going online.", but the volume does not really go online until the "switched to graph 0" message appears, because AFR waits until it has received an answer from every subvolume (even if that answer is a timeout). An example from the log file:

[2014-01-17 09:35:03.648844] I [afr-common.c:3928:afr_notify] 0-test-replicate-0: Subvolume 'test-client-0' came back up; going online.
[2014-01-17 09:35:03.649014] I [client-handshake.c:450:client_set_lk_version_cbk] 0-test-client-0: Server lk version = 1
[2014-01-17 09:37:10.921156] E [socket.c:2161:socket_connect_finish] 0-test-client-1: connection to 10.0.0.3:24007 failed (Connection timed out)
[2014-01-17 09:37:10.932407] I [fuse-bridge.c:4956:fuse_graph_setup] 0-fuse: switched to graph 0
[2014-01-17 09:37:10.932935] I [fuse-bridge.c:3893:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel 7.22
[2014-01-17 09:37:10.933990] I [afr-common.c:2218:afr_set_root_inode_on_first_lookup] 0-test-replicate-0: added root inode

Comment 1 Roger Lehmann 2015-04-02 12:59:00 UTC
+1, I can confirm this bug.
This is a huge problem for automatic failover in Proxmox. Without the workaround of reducing the tcp_syn_retries count, any HA VM takes longer to start than Proxmox's 30-second timeout allows.

Please fix it, thank you.

Comment 2 Ravishankar N 2015-05-02 05:20:06 UTC
I'm able to reproduce the issue on a plain 2x1 distribute volume as well. Mounting a client on the node which is up hangs until the network.ping-timeout value expires. After changing it from the default 42 to 20 seconds, even umount seems to hang for that time:
--------------------
[2015-05-02 05:09:53.783067] I [client-handshake.c:187:client_set_lk_version_cbk] 0-testvol-client-1: Server lk version = 1
[2015-05-02 05:10:37.735298] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-testvol-client-1: server 10.70.42.188:49152 has not responded in the last 20 seconds, disconnecting.
[2015-05-02 05:10:37.736622] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x240)[0x7ff771ce0622] (--> /usr/local/lib/libgfrpc.so.0(saved_frames_unwind+0x212)[0x7ff771aa8f02] (--> /usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7ff771aa8fff] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7ff771aa9491] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x147)[0x7ff771aa9e8b] ))))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-05-02 05:10:17.019716 (xid=0xa)
[2015-05-02 05:10:37.736795] W [client-rpc-fops.c:2824:client3_3_lookup_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2015-05-02 05:10:37.737991] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x240)[0x7ff771ce0622] (--> /usr/local/lib/libgfrpc.so.0(saved_frames_unwind+0x212)[0x7ff771aa8f02] (--> /usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7ff771aa8fff] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7ff771aa9491] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x147)[0x7ff771aa9e8b] ))))) 0-testvol-client-1: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-05-02 05:10:17.019759 (xid=0xb)
[2015-05-02 05:10:37.738092] W [rpc-clnt-ping.c:204:rpc_clnt_ping_cbk] 0-testvol-client-1: socket disconnected
[2015-05-02 05:10:37.738158] I [client.c:2086:client_rpc_notify] 0-testvol-client-1: disconnected from testvol-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2015-05-02 05:10:37.741950] I [fuse-bridge.c:4922:fuse_thread_proc] 0-fuse: unmounting /mnt/fuse_mnt
[2015-05-02 05:10:37.742171] W [glusterfsd.c:1212:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2015-05-02 05:10:37.742580] I [fuse-bridge.c:5617:fini] 0-fuse: Unmounting '/mnt/fuse_mnt'. 
-----------------------------

Comment 3 Anand Avati 2015-06-07 12:36:12 UTC
REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after timeout) posted (#1) for review on master by Ravishankar N (ravishankar)

Comment 4 Anand Avati 2015-06-09 13:04:20 UTC
REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after timeout) posted (#2) for review on master by Ravishankar N (ravishankar)

Comment 5 Anand Avati 2015-06-11 11:38:02 UTC
REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after timeout) posted (#4) for review on master by Ravishankar N (ravishankar)

Comment 6 Anand Avati 2015-06-11 13:23:02 UTC
REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after timeout) posted (#5) for review on master by Ravishankar N (ravishankar)

Comment 7 Vijay Bellur 2015-12-23 09:39:58 UTC
REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after timeout) posted (#7) for review on master by Ravishankar N (ravishankar)

Comment 8 Vijay Bellur 2016-01-06 12:00:18 UTC
REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after timeout) posted (#8) for review on master by Ravishankar N (ravishankar)

Comment 9 Vijay Bellur 2016-02-08 12:43:59 UTC
REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after timeout) posted (#9) for review on master by Ravishankar N (ravishankar)

Comment 10 Vijay Bellur 2016-02-10 16:42:07 UTC
REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after timeout) posted (#10) for review on master by Ravishankar N (ravishankar)

Comment 11 Mike McCune 2016-03-28 23:23:30 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions

Comment 12 Vijay Bellur 2016-04-10 07:49:39 UTC
REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after timeout) posted (#11) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 13 Vijay Bellur 2016-04-25 13:07:34 UTC
REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after timeout) posted (#12) for review on master by Ravishankar N (ravishankar)

Comment 14 Vijay Bellur 2016-04-27 07:35:22 UTC
COMMIT: http://review.gluster.org/11113 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 3c35329feb4dd479c9e4856ee27fa4b12c708db2
Author: Ravishankar N <ravishankar>
Date:   Wed Dec 23 13:49:14 2015 +0530

    afr: propagate child up event after timeout
    
    Problem: During mount, afr waits for response from all its children before
    notifying the parent xlator. In a 1x2 replica volume , if one of the nodes is
    down, the mount will hang for more than a minute until child down is received
    from the client xlator for that node.
    
    Fix:
    When parent up is received by afr, start a 10 second timer. In the timer call
    back, if we receive a successful child up from atleast one brick, propagate the
    event to the parent xlator.
    
    Change-Id: I31e57c8802c1a03a4a5d581ee4ab82f3a9c8799d
    BUG: 1054694
    Signed-off-by: Ravishankar N <ravishankar>
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/11113
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
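
The idea of the fix, as a simplified standalone sketch (not the actual AFR code; the names and the pthread-based wait are illustrative only): instead of waiting for an answer from every child, bound the wait to 10 seconds and propagate the up event once at least one child has reported CHILD_UP:
--------------------
#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool child_up;                      /* set when any child reports up */

/* Called from the notify path when a child reports CHILD_UP. */
void on_child_up(void)
{
        pthread_mutex_lock(&lock);
        child_up = true;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
}

/* Called on PARENT_UP: wait at most 10 seconds for the first child,
 * then report whatever state we have instead of blocking further. */
void wait_and_propagate(void)
{
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 10;             /* the timeout chosen by the fix */

        pthread_mutex_lock(&lock);
        while (!child_up &&
               pthread_cond_timedwait(&cond, &lock, &deadline) == 0)
                ;                          /* non-zero return means ETIMEDOUT */
        bool up = child_up;
        pthread_mutex_unlock(&lock);

        printf("notifying parent xlator: %s\n",
               up ? "CHILD_UP" : "no child up yet");
}
--------------------
(Compile with -pthread; in the real change the timer and notification go through gluster's own event framework rather than pthread primitives.)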

Comment 15 Niels de Vos 2016-06-16 12:38:43 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

