Description of problem:

SHD cannot reconnect after the other server dies. Distributed-Replicate volume, 2x2 bricks, two servers.

[2015-10-16 17:06:02.069511] D [socket.c:280:ssl_do] 0-gv0-client-3: syscall error (probably remote disconnect)
[2015-10-16 17:06:02.069555] W [socket.c:588:__socket_rwv] 0-gv0-client-3: readv on xxxx1:49153 failed (No data available)
[2015-10-16 17:06:02.069559] D [socket.c:280:ssl_do] 0-gv0-client-0: syscall error (probably remote disconnect)
[2015-10-16 17:06:02.069582] E [socket.c:2501:socket_poller] 0-gv0-client-3: error in polling loop
[2015-10-16 17:06:02.069606] W [socket.c:588:__socket_rwv] 0-gv0-client-0: readv on xxxx1:49152 failed (No data available)
[2015-10-16 17:06:02.069656] E [socket.c:2501:socket_poller] 0-gv0-client-0: error in polling loop
[2015-10-16 17:06:02.069694] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-gv0-client-3: disconnected from gv0-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2015-10-16 17:06:02.069834] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-gv0-client-0: disconnected from gv0-client-0. Client process will keep trying to connect to glusterd until brick's port is available

And then, every 3 seconds:

[2015-10-16 17:06:15.348616] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-gv0-client-3: attempting reconnect
[2015-10-16 17:06:15.348725] E [socket.c:2863:socket_connect] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfb) [0x7f3bf4cf66bb] -->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9) [0x7f3bf4aa8c59] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.5/rpc-transport/socket.so(+0x755d) [0x7f3bf034d55d] ) 0-socket: invalid argument: this->private [Invalid argument]
[2015-10-16 17:06:15.348792] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-gv0-client-0: attempting reconnect
[2015-10-16 17:06:15.348858] E [socket.c:2863:socket_connect] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfb) [0x7f3bf4cf66bb] -->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9) [0x7f3bf4aa8c59] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.5/rpc-transport/socket.so(+0x755d) [0x7f3bf034d55d] ) 0-socket: invalid argument: this->private [Invalid argument]

- The problem does not occur with SSL disabled (server.ssl: off; client.ssl: off).
- The problem does not occur when the second brick is deleted and only one reconnect happens.
- Also affects 3.6.6.

Version-Release number of selected component (if applicable):

Ubuntu 14.04.3 LTS
glusterfs-server 3.7.5-ubuntu1~trusty1
glusterfs-client 3.7.5-ubuntu1~trusty1
glusterfs-common 3.7.5-ubuntu1~trusty1
libssl1.0.0:amd64 1.0.1f-1ubuntu2.15

How reproducible:

Setup: Two nodes, two bricks, replica 2
server.ssl: on; client.ssl: on; auth.ssl-allow *; ssl.cipher-list HIGH:!SSLv2

Steps to Reproduce:
1. pkill -f gluster on node1
2. Look at glustershd.log of node2: "error in polling loop"

After a restart everything works fine:
3. pkill -f gluster on node2
4. Restart gluster on both nodes
5. -> Reconnection works and healing starts

Actual results:

No reconnect, no healing, error message in the log every few seconds.
No outgoing SYN packets.

Expected results:

Reconnect and healing.

Speculation (almost 100%): Since it only happens with SSL, and only when 0-gv0-client-3 and 0-gv0-client-0 try to reconnect simultaneously, this looks like a race condition in the SSL handling. See also https://bugzilla.redhat.com/show_bug.cgi?id=906763

Additional info:

Already talked to JoeJulian on #gluster.
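For reference, the setup described under "How reproducible" can be sketched roughly as follows. The volume name gv0 matches the log messages above; the hostnames node1/node2 and the brick paths are placeholders, not taken from the original report:

```shell
# Create a 2x2 distributed-replicate volume across two nodes
# (brick paths here are illustrative placeholders).
gluster volume create gv0 replica 2 \
    node1:/bricks/b1 node2:/bricks/b1 \
    node1:/bricks/b2 node2:/bricks/b2

# Apply the TLS options from the report.
gluster volume set gv0 server.ssl on
gluster volume set gv0 client.ssl on
gluster volume set gv0 auth.ssl-allow '*'
gluster volume set gv0 ssl.cipher-list 'HIGH:!SSLv2'
gluster volume start gv0

# Reproduce: kill all gluster processes on node1 ...
pkill -f gluster
# ... then watch /var/log/glusterfs/glustershd.log on node2 for
# "error in polling loop" followed by the repeating
# "invalid argument: this->private" reconnect failures.
```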
This bug is being closed because GlusterFS-3.7 has reached its end of life. Note: this bug is being closed using a script; no verification has been performed to check whether it still exists on newer releases of GlusterFS. If this bug still exists in a newer GlusterFS release, please reopen it against the newer release.