The setup is simple enough: two backends in a replicated configuration, with GlusterFS serving CTDB. IP failover is fine; the IP migrates smoothly. But when I/O is in progress from a Windows CIFS client, we see the I/O hang for 42 seconds, which seems unreasonable for a replicated volume. This in turn makes Windows believe that the connection is lost, and the file that was being copied is aborted. The relevant log messages are below:

----------------------
[2010-11-09 17:08:33.481173] I [client-handshake.c:829:client_setvolume_cbk] repl-client-0: Connected to 10.1.10.112:24010, attached to remote volume '/sdb'.
[2010-11-10 13:02:28.281725] E [socket.c:1657:socket_connect_finish] repl-client-0: connection to 10.1.10.112:24010 failed (No route to host)
[2010-11-10 13:04:51.671883] I [client-handshake.c:993:select_server_supported_programs] repl-client-0: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2010-11-10 13:04:51.675596] I [client-handshake.c:829:client_setvolume_cbk] repl-client-0: Connected to 10.1.10.112:24010, attached to remote volume '/sdb'.
[2010-11-10 13:04:51.675618] I [client-handshake.c:698:client_post_handshake] repl-client-0: 1 fds open - Delaying child_up until they are re-opened
[2010-11-12 11:42:36.160148] E [socket.c:1657:socket_connect_finish] repl-client-0: connection to 10.1.10.112:24010 failed (No route to host)
[2010-11-12 11:45:32.464700] I [client-handshake.c:993:select_server_supported_programs] repl-client-0: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2010-11-12 11:45:32.465001] I [client-handshake.c:829:client_setvolume_cbk] repl-client-0: Connected to 10.1.10.112:24010, attached to remote volume '/sdb'.
[2010-11-12 15:55:25.971214] I [afr-common.c:716:afr_lookup_done] repl-replicate-0: background meta-data self-heal triggered. path: /lost+found
[2010-11-12 15:55:26.78397] I [afr-self-heal-common.c:1526:afr_self_heal_completion_cbk] repl-replicate-0: background meta-data self-heal completed on /lost+found
[2010-11-12 15:59:07.328764] E [client-handshake.c:116:rpc_client_ping_timer_expired] repl-client-0: Server 10.1.10.112:24010 has not responded in the last 42 seconds, disconnecting.
[2010-11-12 15:59:07.423161] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(WRITE(13)) called at 2010-11-12 15:56:47.243685
[2010-11-12 15:59:07.425476] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(WRITE(13)) called at 2010-11-12 15:56:47.243801
[2010-11-12 15:59:07.425610] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(WRITE(13)) called at 2010-11-12 15:56:47.243853
[2010-11-12 15:59:07.425781] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(WRITE(13)) called at 2010-11-12 15:56:47.243869
[2010-11-12 15:59:07.425920] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(WRITE(13)) called at 2010-11-12 15:56:47.243884
[2010-11-12 15:59:07.434647] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(WRITE(13)) called at 2010-11-12 15:56:47.243899
[2010-11-12 15:59:07.434794] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(WRITE(13)) called at 2010-11-12 15:56:47.243914
[2010-11-12 15:59:07.434924] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2010-11-12 15:56:47.244556
[2010-11-12 15:59:07.435438] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(FXATTROP(34)) called at 2010-11-12 15:56:47.244569
[2010-11-12 15:59:07.435566] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(FINODELK(30)) called at 2010-11-12 15:56:47.244581
[2010-11-12 15:59:07.435603] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2010-11-12 15:56:51.668307
[2010-11-12 15:59:07.435659] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2010-11-12 15:57:01.103051
[2010-11-12 15:59:07.435693] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2010-11-12 15:57:26.739047
[2010-11-12 15:59:07.435750] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(LOOKUP(27)) called at 2010-11-12 15:58:01.676669
[2010-11-12 15:59:07.435791] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(FINODELK(30)) called at 2010-11-12 15:58:25.668064
[2010-11-12 15:59:10.376323] E [socket.c:1657:socket_connect_finish] repl-client-0: connection to 10.1.10.112:24010 failed (No route to host)
------------------------------------------------

The node was shut down to check that the IP migrates and that CIFS fails over, which didn't happen. Running a standalone dd, we could see I/O block for 42 seconds. I am using a native GlusterFS mount for CTDB.
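To see how long each request was actually outstanding before it was unwound, the "forced unwinding" entries in the log above can be post-processed: each one carries both the time it was logged and the time the call was issued. The following is only a minimal Python sketch, assuming the log format shown in this post. The regular expression, the function name pending_durations, and the two-entry SAMPLE (copied from the log above) are illustrative and are not part of GlusterFS.

```python
import re
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"

# One "forced unwinding" entry: when it was logged, which operation it was,
# and when the call was originally issued. Assumes the format seen in this post.
UNWIND_RE = re.compile(
    r"\[(?P<logged>" + TS + r")\] E \[rpc-clnt\.c:\d+:saved_frames_unwind\] "
    r".*?forced unwinding frame type\((?P<prog>[^)]*)\) "
    r"op\((?P<op>\w+)\(\d+\)\) called at (?P<called>" + TS + r")"
)


def pending_durations(log_text):
    """Return a list of (op, seconds_outstanding) for each forced-unwind entry."""
    results = []
    for match in UNWIND_RE.finditer(log_text):
        logged = datetime.strptime(match.group("logged"), TS_FORMAT)
        called = datetime.strptime(match.group("called"), TS_FORMAT)
        results.append((match.group("op"), (logged - called).total_seconds()))
    return results


# Two entries copied from the client log in this post (one WRITE, one PING).
STACK = (
    "(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x317f40f689] "
    "(-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x317f40ee2e] "
    "(-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x317f40ed9e]))) "
)
SAMPLE = (
    "[2010-11-12 15:59:07.423161] E [rpc-clnt.c:338:saved_frames_unwind] " + STACK
    + "rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(WRITE(13)) "
    "called at 2010-11-12 15:56:47.243685\n"
    "[2010-11-12 15:59:07.435659] E [rpc-clnt.c:338:saved_frames_unwind] " + STACK
    + "rpc-clnt: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) "
    "called at 2010-11-12 15:57:01.103051\n"
)

if __name__ == "__main__":
    for op, seconds in pending_durations(SAMPLE):
        print(f"{op:10s} outstanding for {seconds:7.2f} s")
```

On these two sample entries the sketch reports the WRITE as outstanding for about 140 seconds and the PING for about 126 seconds, both measured from the "called at" time to the time the unwind was logged. To run it against a full client log, pass the file contents to pending_durations instead of SAMPLE.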
CTDB failover works; it was a configuration issue.