Bug 1272940

Summary: Shd can't reconnect after ping-timeout (error in polling loop; invalid argument: this->private)
Product: [Community] GlusterFS Reporter: MlHamburg <mlanz-redhat-bugzilla>
Component: replicate Assignee: Ashish Pandey <aspandey>
Status: CLOSED EOL QA Contact:
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.7.5 CC: bugs, ian.conrad, joe, pierre-yves.goubet, pkarampu, ravishankar, smohan
Target Milestone: --- Keywords: Triaged
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-03-08 10:51:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description MlHamburg 2015-10-19 09:23:50 UTC
Description of problem:
The self-heal daemon (SHD) can't reconnect after the other server has died. Distributed-Replicate volume, 2x2 bricks, two servers.

[2015-10-16 17:06:02.069511] D [socket.c:280:ssl_do] 0-gv0-client-3: syscall error (probably remote disconnect)
[2015-10-16 17:06:02.069555] W [socket.c:588:__socket_rwv] 0-gv0-client-3: readv on xxxx1:49153 failed (No data available)
[2015-10-16 17:06:02.069559] D [socket.c:280:ssl_do] 0-gv0-client-0: syscall error (probably remote disconnect)
[2015-10-16 17:06:02.069582] E [socket.c:2501:socket_poller] 0-gv0-client-3: error in polling loop
[2015-10-16 17:06:02.069606] W [socket.c:588:__socket_rwv] 0-gv0-client-0: readv on xxxx1:49152 failed (No data available)
[2015-10-16 17:06:02.069656] E [socket.c:2501:socket_poller] 0-gv0-client-0: error in polling loop
[2015-10-16 17:06:02.069694] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-gv0-client-3: disconnected from gv0-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2015-10-16 17:06:02.069834] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-gv0-client-0: disconnected from gv0-client-0. Client process will keep trying to connect to glusterd until brick's port is available

and then every 3 seconds:
[2015-10-16 17:06:15.348616] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-gv0-client-3: attempting reconnect
[2015-10-16 17:06:15.348725] E [socket.c:2863:socket_connect] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfb) [0x7f3bf4cf66bb] -->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9) [0x7f3bf4aa8c59] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.5/rpc-transport/socket.so(+0x755d) [0x7f3bf034d55d] ) 0-socket: invalid argument: this->private [Invalid argument]
[2015-10-16 17:06:15.348792] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-gv0-client-0: attempting reconnect
[2015-10-16 17:06:15.348858] E [socket.c:2863:socket_connect] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfb) [0x7f3bf4cf66bb] -->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9) [0x7f3bf4aa8c59] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.5/rpc-transport/socket.so(+0x755d) [0x7f3bf034d55d] ) 0-socket: invalid argument: this->private [Invalid argument]


- The problem does not occur when SSL is disabled (server.ssl: off; client.ssl: off).
- The problem does not occur when the second brick is deleted and only one reconnect happens.
- Also affects 3.6.6.

Version-Release number of selected component (if applicable):
Ubuntu 14.04.3 LTS
glusterfs-server                    3.7.5-ubuntu1~trusty1
glusterfs-client                    3.7.5-ubuntu1~trusty1
glusterfs-common                    3.7.5-ubuntu1~trusty1
libssl1.0.0:amd64                   1.0.1f-1ubuntu2.15

How reproducible:
Setup: 
Two nodes, two bricks, replica 2
server.ssl: on; client.ssl: on; auth.ssl-allow: *; ssl.cipher-list: HIGH:!SSLv2

Steps to Reproduce:
1. Run pkill -f gluster on node1.
2. Look at glustershd.log on node2: it shows "error in polling loop".

After a restart everything works fine:
3. Run pkill -f gluster on node2.
4. Restart gluster on both nodes.
5. Reconnection works and healing starts.

Actual results:
No reconnect, no healing, error msg in log every few seconds. No outgoing SYN packets.
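
To confirm the symptom without reading the log by hand, the following is a minimal sketch that counts reconnect attempts per client and the "invalid argument: this->private" failures in glustershd.log. The function name summarize_reconnects is hypothetical and not part of GlusterFS. It assumes the log format shown in the excerpts above, and the "attempting reconnect" lines only appear when the log level is TRACE.

```python
import re
from collections import Counter

# Pattern taken from the trace lines quoted in this report, e.g.
# "... T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-gv0-client-3: attempting reconnect"
ATTEMPT_RE = re.compile(
    r"\[rpc-clnt\.c:\d+:rpc_clnt_reconnect\] (\S+): attempting reconnect"
)
# Error text logged by socket_connect in this report.
FAILURE_TEXT = "invalid argument: this->private"


def summarize_reconnects(lines):
    """Return (attempts per client, number of socket_connect failures).

    Hypothetical helper for illustration only. The failure message is
    logged under "0-socket" and does not name the client, so failures
    are counted globally rather than per client.
    """
    attempts = Counter()
    failures = 0
    for line in lines:
        match = ATTEMPT_RE.search(line)
        if match:
            attempts[match.group(1)] += 1
        if FAILURE_TEXT in line:
            failures += 1
    return attempts, failures
```

Usage, assuming the log lives at /var/log/glusterfs/glustershd.log (the path may differ per distribution): open the file and pass the file object to summarize_reconnects(). If every attempt is matched by a failure and the counts keep growing, the daemon is stuck in the state described above.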

Expected results:
Reconnect, healing

Speculation (almost 100%):
It only happens with SSL, and only when 0-gv0-client-3 and 0-gv0-client-0 try to reconnect simultaneously. Race condition in the SSL handling? See https://bugzilla.redhat.com/show_bug.cgi?id=906763

Additional info:
Already talked to JoeJulian on #gluster.

Comment 1 Kaushal 2017-03-08 10:51:41 UTC
This bug is getting closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.