Bug 1272940 - Shd can't reconnect after ping-timeout (error in polling loop; invalid argument: this->private)
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.7.5
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Ashish Pandey
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-10-19 09:23 UTC by MlHamburg
Modified: 2017-03-08 10:51 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-08 10:51:41 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description MlHamburg 2015-10-19 09:23:50 UTC
Description of problem:
The self-heal daemon (SHD) cannot reconnect after the other server dies. Distributed-Replicate volume, 2x2 bricks, two servers.

[2015-10-16 17:06:02.069511] D [socket.c:280:ssl_do] 0-gv0-client-3: syscall error (probably remote disconnect)
[2015-10-16 17:06:02.069555] W [socket.c:588:__socket_rwv] 0-gv0-client-3: readv on xxxx1:49153 failed (No data available)
[2015-10-16 17:06:02.069559] D [socket.c:280:ssl_do] 0-gv0-client-0: syscall error (probably remote disconnect)
[2015-10-16 17:06:02.069582] E [socket.c:2501:socket_poller] 0-gv0-client-3: error in polling loop
[2015-10-16 17:06:02.069606] W [socket.c:588:__socket_rwv] 0-gv0-client-0: readv on xxxx1:49152 failed (No data available)
[2015-10-16 17:06:02.069656] E [socket.c:2501:socket_poller] 0-gv0-client-0: error in polling loop
[2015-10-16 17:06:02.069694] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-gv0-client-3: disconnected from gv0-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2015-10-16 17:06:02.069834] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 0-gv0-client-0: disconnected from gv0-client-0. Client process will keep trying to connect to glusterd until brick's port is available

and then every 3 seconds:
[2015-10-16 17:06:15.348616] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-gv0-client-3: attempting reconnect
[2015-10-16 17:06:15.348725] E [socket.c:2863:socket_connect] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfb) [0x7f3bf4cf66bb] -->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9) [0x7f3bf4aa8c59] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.5/rpc-transport/socket.so(+0x755d) [0x7f3bf034d55d] ) 0-socket: invalid argument: this->private [Invalid argument]
[2015-10-16 17:06:15.348792] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-gv0-client-0: attempting reconnect
[2015-10-16 17:06:15.348858] E [socket.c:2863:socket_connect] (-->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_timer_proc+0xfb) [0x7f3bf4cf66bb] -->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_reconnect+0xb9) [0x7f3bf4aa8c59] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.5/rpc-transport/socket.so(+0x755d) [0x7f3bf034d55d] ) 0-socket: invalid argument: this->private [Invalid argument]
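To check whether a glustershd.log shows the same pattern, the repeating pair of lines above can be counted per client. The following is a minimal Python sketch, not part of GlusterFS: the function name and the pairing heuristic are assumptions made for illustration. Because the error is logged under "0-socket" rather than the client name, the sketch attributes each failure to the most recent "attempting reconnect" line, which is only visible at TRACE log level as in the excerpt above.

```python
import re
from collections import Counter

# Matches the trace line emitted before each reconnect attempt, e.g.
# "[2015-10-16 17:06:15.348616] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-gv0-client-3: attempting reconnect"
RECONNECT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] T \[rpc-clnt\.c:\d+:rpc_clnt_reconnect\] "
    r"(?P<client>\S+): attempting reconnect"
)

# Text of the error reported in this bug.
FAILURE_MARKER = "invalid argument: this->private"


def count_failed_reconnects(lines):
    """Count reconnect attempts that were followed by the this->private error.

    Heuristic (an assumption of this sketch): a failure belongs to the
    client named in the closest preceding "attempting reconnect" line.
    """
    failures = Counter()
    last_client = None
    for line in lines:
        match = RECONNECT_RE.match(line)
        if match:
            last_client = match.group("client")
        elif FAILURE_MARKER in line and last_client is not None:
            failures[last_client] += 1
            last_client = None  # do not attribute one attempt twice
    return failures


if __name__ == "__main__":
    import sys

    # Usage: python count_reconnects.py /path/to/glustershd.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for client, count in sorted(count_failed_reconnects(log).items()):
            print(f"{client}: {count} failed reconnect attempts")
```

A steadily growing count for both clients, with no successful connection in between, would match the behaviour described in this report.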


- The problem does not occur with SSL disabled (server.ssl: off; client.ssl: off).
- The problem does not occur when the second brick is deleted, so that only one reconnect happens.
- Also affects 3.6.6.

Version-Release number of selected component (if applicable):
Ubuntu 14.04.3 LTS
glusterfs-server                    3.7.5-ubuntu1~trusty1
glusterfs-client                    3.7.5-ubuntu1~trusty1
glusterfs-common                    3.7.5-ubuntu1~trusty1
libssl1.0.0:amd64                   1.0.1f-1ubuntu2.15

How reproducible:
Setup: 
Two nodes, two bricks, replica 2
server.ssl: on; client.ssl: on; auth.ssl-allow *; ssl.cipher-list HIGH:!SSLv2
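
The options above would typically be applied with `gluster volume set <volume> <key> <value>`. The following Python sketch only builds the corresponding command lines; it does not run them. The volume name `gv0` is an assumption inferred from the `0-gv0-client-*` log prefixes, and the helper name is hypothetical.

```python
import shlex

# Option/value pairs exactly as listed in the setup above.
SSL_OPTIONS = [
    ("server.ssl", "on"),
    ("client.ssl", "on"),
    ("auth.ssl-allow", "*"),
    ("ssl.cipher-list", "HIGH:!SSLv2"),
]


def build_volume_set_commands(volume):
    """Return one argv list per option for `gluster volume set`."""
    return [
        ["gluster", "volume", "set", volume, key, value]
        for key, value in SSL_OPTIONS
    ]


if __name__ == "__main__":
    # "gv0" is assumed from the log prefixes; replace with the real volume name.
    for argv in build_volume_set_commands("gv0"):
        # shlex.join quotes "*" and "!" so the line can be pasted into a shell.
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run on a machine without GlusterFS; the printed lines can be reviewed before being pasted into a shell on one of the nodes.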

Steps to Reproduce:
1. pkill -f gluster on node1
2. look at glustershd.log of node2, "error in polling loop"

After restart everything works fine:
3. pkill -f gluster on node2
4. restart gluster on both nodes
5. -> reconnection works and healing starts

Actual results:
No reconnect, no healing, error msg in log every few seconds. No outgoing SYN packets.

Expected results:
Reconnect, healing

Speculation (almost 100%):
Since it happens only with SSL, and only when 0-gv0-client-3 and 0-gv0-client-0 try to reconnect simultaneously: a race condition in SSL handling? See https://bugzilla.redhat.com/show_bug.cgi?id=906763

Additional info:
Already talked to JoeJulian on #gluster.

Comment 1 Kaushal 2017-03-08 10:51:41 UTC
This bug is getting closed because GlusterFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.

