Bug 1054694
| Summary: | A replicated volume takes too long to come online when one server is down | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Xavi Hernandez <jahernan> |
| Component: | replicate | Assignee: | Ravishankar N <ravishankar> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | mainline | CC: | bugs, jbyers, roger.lehmann |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.8.0 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1218731 1330855 | Environment: | |
| Last Closed: | 2016-06-16 12:38:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1218731, 1286911, 1330855 | | |
Description
Xavi Hernandez
2014-01-17 10:00:01 UTC
+1 I can confirm this bug. This is a huge problem for automatic failover in Proxmox. Without the workaround of reducing the tcp_syn_retries count, any HA VM takes longer to start than Proxmox's 30-second timeout allows. Please fix it, thank you.

I'm able to reproduce the issue on a plain 2x1 distribute volume as well. Mounting a client on the node which is up hangs until the network.ping-timeout value expires. After changing it from the default 42 to 20 seconds, even umount seems to hang for that time:

--------------------
[2015-05-02 05:09:53.783067] I [client-handshake.c:187:client_set_lk_version_cbk] 0-testvol-client-1: Server lk version = 1
[2015-05-02 05:10:37.735298] C [rpc-clnt-ping.c:161:rpc_clnt_ping_timer_expired] 0-testvol-client-1: server 10.70.42.188:49152 has not responded in the last 20 seconds, disconnecting.
[2015-05-02 05:10:37.736622] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x240)[0x7ff771ce0622] (--> /usr/local/lib/libgfrpc.so.0(saved_frames_unwind+0x212)[0x7ff771aa8f02] (--> /usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7ff771aa8fff] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7ff771aa9491] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x147)[0x7ff771aa9e8b] ))))) 0-testvol-client-1: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-05-02 05:10:17.019716 (xid=0xa)
[2015-05-02 05:10:37.736795] W [client-rpc-fops.c:2824:client3_3_lookup_cbk] 0-testvol-client-1: remote operation failed: Transport endpoint is not connected. Path: / (00000000-0000-0000-0000-000000000001)
[2015-05-02 05:10:37.737991] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/local/lib/libglusterfs.so.0(_gf_log_callingfn+0x240)[0x7ff771ce0622] (--> /usr/local/lib/libgfrpc.so.0(saved_frames_unwind+0x212)[0x7ff771aa8f02] (--> /usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7ff771aa8fff] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x11e)[0x7ff771aa9491] (--> /usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x147)[0x7ff771aa9e8b] ))))) 0-testvol-client-1: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2015-05-02 05:10:17.019759 (xid=0xb)
[2015-05-02 05:10:37.738092] W [rpc-clnt-ping.c:204:rpc_clnt_ping_cbk] 0-testvol-client-1: socket disconnected
[2015-05-02 05:10:37.738158] I [client.c:2086:client_rpc_notify] 0-testvol-client-1: disconnected from testvol-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2015-05-02 05:10:37.741950] I [fuse-bridge.c:4922:fuse_thread_proc] 0-fuse: unmounting /mnt/fuse_mnt
[2015-05-02 05:10:37.742171] W [glusterfsd.c:1212:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2015-05-02 05:10:37.742580] I [fuse-bridge.c:5617:fini] 0-fuse: Unmounting '/mnt/fuse_mnt'.
--------------------
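For reference, the two workarounds mentioned above can be applied as follows; this is a minimal sketch, assuming the volume is named testvol as in the log, and using a retry count of 2 only as an example value to tune to your failover window:

--------------------
# Make TCP connection attempts to the dead node fail fast by lowering
# the kernel's SYN retry count (example value, not a recommendation):
sysctl -w net.ipv4.tcp_syn_retries=2

# Lower the GlusterFS ping timeout from its default of 42 seconds:
gluster volume set testvol network.ping-timeout 20
--------------------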
REVIEW: http://review.gluster.org/11113 (afr: propagate child up event after timeout) posted (#1, #2, #4, #5, #7, #8, #9, #10, #12) for review on master by Ravishankar N (ravishankar) and (#11) by Pranith Kumar Karampuri (pkarampu)

This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune with any questions.

COMMIT: http://review.gluster.org/11113 committed in master by Pranith Kumar Karampuri (pkarampu)

------

commit 3c35329feb4dd479c9e4856ee27fa4b12c708db2
Author: Ravishankar N <ravishankar>
Date:   Wed Dec 23 13:49:14 2015 +0530

    afr: propagate child up event after timeout

    Problem: During mount, afr waits for a response from all of its
    children before notifying the parent xlator. In a 1x2 replica volume,
    if one of the nodes is down, the mount will hang for more than a
    minute until child down is received from the client xlator for that
    node.

    Fix: When parent up is received by afr, start a 10 second timer. In
    the timer callback, if we have received a successful child up from at
    least one brick, propagate the event to the parent xlator.

    Change-Id: I31e57c8802c1a03a4a5d581ee4ab82f3a9c8799d
    BUG: 1054694
    Signed-off-by: Ravishankar N <ravishankar>
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/11113
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
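For readers who want the gist of the patch without digging through the Gerrit review, here is a minimal standalone sketch of the idea. This is hypothetical illustration code, not the actual afr change: CHILD_COUNT, child_event and wait_and_propagate are made-up names, and the real patch uses glusterfs' own timer and notification framework rather than a condition variable.

--------------------
/* Hypothetical sketch of the patch's idea, not the actual afr code:
 * after PARENT_UP, wait at most 10 seconds for the children to answer;
 * when the timer expires, propagate CHILD_UP upward if at least one
 * brick has come up instead of blocking on the unreachable one. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define CHILD_COUNT      2   /* a 1x2 replica volume has two bricks  */
#define CHILD_UP_TIMEOUT 10  /* seconds, as in the committed patch   */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int replies;          /* children that answered at all        */
static int up_children;      /* children that answered CHILD_UP      */

/* Called once per brick when its client xlator reports up or down. */
static void child_event(bool up)
{
    pthread_mutex_lock(&lock);
    replies++;
    if (up)
        up_children++;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Runs after PARENT_UP is received during mount. */
static void wait_and_propagate(void)
{
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += CHILD_UP_TIMEOUT;

    pthread_mutex_lock(&lock);
    /* Old behaviour: wait for all children, however long that takes.
     * New behaviour: stop waiting once the 10-second timer expires. */
    while (replies < CHILD_COUNT &&
           pthread_cond_timedwait(&cond, &lock, &deadline) == 0)
        ;
    if (up_children > 0)
        printf("propagating CHILD_UP with %d/%d bricks up\n",
               up_children, CHILD_COUNT);
    else
        printf("no brick up yet; keep the mount pending\n");
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    child_event(true);    /* one brick answers; the other node is down */
    wait_and_propagate(); /* returns after ~10s instead of >1 minute   */
    return 0;
}
--------------------

The design point is that the timer bounds the worst case: with one node down, the mount proceeds roughly 10 seconds after at least one brick answers, instead of waiting for TCP connection attempts to the dead node to give up.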
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user