Bug 1272940 - Shd can't reconnect after ping-timeout (error in polling loop; invalid argument: this->private)
Product: GlusterFS
Classification: Community
Component: replicate
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Assigned To: Ashish Pandey
Keywords: Triaged
Reported: 2015-10-19 05:23 EDT by MlHamburg
Modified: 2017-03-08 05:51 EST (History)
Doc Type: Bug Fix
Last Closed: 2017-03-08 05:51:41 EST
Type: Bug
Description MlHamburg 2015-10-19 05:23:50 EDT
Description of problem:
The self-heal daemon (SHD) cannot reconnect after the other server dies. Setup: Distributed-Replicate volume, 2x2 bricks, two servers.

[2015-10-16 17:06:02.069511] D [socket.c:280:ssl_do] 0-gv0-client-3: syscall error (probably remote disconnect)
[2015-10-16 17:06:02.069555] W [socket.c:588:__socket_rwv] 0-gv0-client-3: readv on xxxx1:49153 failed (No data available)
[2015-10-16 17:06:02.069559] D [socket.c:280:ssl_do] 0-gv0-client-0: syscall error (probably remote disconnect)
[2015-10-16 17:06:02.069582] E [socket.c:2501:socket_poller] 0-gv0-client-3: error in polling loop
[2015-10-16 17:06:02.069606] W [socket.c:588:__socket_rwv] 0-gv0-client-0: readv on xxxx1:49152 failed (No data available)
[2015-10-16 17:06:02.069656] E [socket.c:2501:socket_poller] 0-gv0-client-0: error in polling loop
[2015-10-16 17:06:02.069694] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-gv0-client-3: disconnected from gv0-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2015-10-16 17:06:02.069834] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-gv0-client-0: disconnected from gv0-client-0. Client process will keep trying to connect to glusterd until brick's port is available

After that, the following repeats every 3 seconds:
[2015-10-16 17:06:15.348616] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-gv0-client-3: attempting reconnect
[2015-10-16 17:06:15.348725] E [socket.c:2863:socket_connect] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfb) [0x7f3bf4cf66bb] -->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9) [0x7f3bf4aa8c59] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.5/rpc-transport/socket.so(+0x755d) [0x7f3bf034d55d] ) 0-socket: invalid argument: this->private [Invalid argument]
[2015-10-16 17:06:15.348792] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-gv0-client-0: attempting reconnect
[2015-10-16 17:06:15.348858] E [socket.c:2863:socket_connect] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfb) [0x7f3bf4cf66bb] -->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9) [0x7f3bf4aa8c59] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.5/rpc-transport/socket.so(+0x755d) [0x7f3bf034d55d] ) 0-socket: invalid argument: this->private [Invalid argument]
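
To confirm the pattern in a longer glustershd.log, a small script can count the reconnect attempts per client and the "this->private" failures. This is only a sketch: the function names are hypothetical, and the regular expressions assume the line layout seen in the excerpt above ("[timestamp] LEVEL [file:line:function] message").

```python
import re
import sys
from collections import defaultdict
from datetime import datetime

# Assumed layout, taken from the excerpt in this report:
# "[timestamp] LEVEL [optional MSGID] [file:line:function] message"
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"(?:\[MSGID: \d+\] )?"
    r"\[(?P<src>[^\]:]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<rest>.*)$"
)
# Client translator names look like "0-gv0-client-3:" in the excerpt.
CLIENT_RE = re.compile(r"\b0-(?P<client>[\w-]+-client-\d+):")


def reconnect_attempts(log_text):
    """Map each client name to the timestamps of its 'attempting reconnect' lines."""
    attempts = defaultdict(list)
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw)
        if not m or m.group("func") != "rpc_clnt_reconnect":
            continue
        c = CLIENT_RE.search(m.group("rest"))
        if c:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            attempts[c.group("client")].append(ts)
    return dict(attempts)


def count_private_errors(log_text):
    """Count lines that contain the 'invalid argument: this->private' error."""
    needle = "invalid argument: this->private"
    return sum(1 for raw in log_text.splitlines() if needle in raw)


if __name__ == "__main__":
    # Usage: python scan_shd_log.py /path/to/glustershd.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        text = fh.read()
    for client, stamps in sorted(reconnect_attempts(text).items()):
        print(client, len(stamps), "reconnect attempts")
    print("this->private errors:", count_private_errors(text))
```

If the number of "this->private" errors tracks the number of reconnect attempts for both clients, every reconnect is failing in socket_connect before a connection is even attempted, which matches the absence of outgoing SYN packets.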

- The problem does not occur with SSL disabled (server.ssl: off; client.ssl: off).
- The problem does not occur when the second brick is deleted, so that only one reconnect happens.
- Version 3.6.6 is also affected.

Version-Release number of selected component (if applicable):
Ubuntu 14.04.3 LTS
glusterfs-server                    3.7.5-ubuntu1~trusty1
glusterfs-client                    3.7.5-ubuntu1~trusty1
glusterfs-common                    3.7.5-ubuntu1~trusty1
libssl1.0.0:amd64                   1.0.1f-1ubuntu2.15

How reproducible:
Two nodes, two bricks, replica 2
server.ssl: on; client.ssl: on; auth.ssl-allow *; ssl.cipher-list HIGH:!SSLv2
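
For anyone reproducing this setup, the options above could be applied with `gluster volume set`. The sketch below only builds and prints the command lines; it does not run them. The volume name gv0 is an assumption taken from the "0-gv0-client-N" prefixes in the log, and the helper name is hypothetical.

```python
import shlex

# Option values copied from this report. The volume name "gv0" is an
# assumption inferred from the "0-gv0-client-N" prefixes in the log excerpt.
SSL_OPTIONS = {
    "server.ssl": "on",
    "client.ssl": "on",
    "auth.ssl-allow": "*",
    "ssl.cipher-list": "HIGH:!SSLv2",
}


def volume_set_commands(volume, options):
    """Build one 'gluster volume set' argv list per option (hypothetical helper)."""
    return [
        ["gluster", "volume", "set", volume, key, value]
        for key, value in options.items()
    ]


if __name__ == "__main__":
    for argv in volume_set_commands("gv0", SSL_OPTIONS):
        # shlex.join quotes values such as "*" so the printed line is shell-safe.
        print(shlex.join(argv))
```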

Steps to Reproduce:
1. pkill -f gluster on node1
2. Check glustershd.log on node2; it shows "error in polling loop"

After a restart, everything works fine:
3. pkill -f gluster on node2
4. restart gluster on both nodes
5. Reconnection works and healing starts

Actual results:
No reconnect, no healing, and an error message in the log every few seconds. No outgoing SYN packets are sent.

Expected results:
Reconnect, healing

Speculation (almost certain):
The problem occurs only with SSL enabled and only when 0-gv0-client-3 and 0-gv0-client-0 try to reconnect simultaneously, which suggests a race condition in the SSL handling. See https://bugzilla.redhat.com/show_bug.cgi?id=906763

Additional info:
Already talked to JoeJulian on #gluster.
Comment 1 Kaushal 2017-03-08 05:51:41 EST
This bug is getting closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.
