Description of problem:
On an SSL-enabled volume, the glusterfs FUSE client does not reconnect to a brick once that brick has been disconnected.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install RHGS 3.1.3.
2. Create a 1x2 replicated volume.
3. Enable SSL on the volume.
4. Mount the volume using the glusterfs FUSE client and keep the mount active.
5. Kill brick-1, then start it again.
6. Kill brick-2, then start it again.
7. Access the mount from the client: it reports "Transport endpoint is not connected" even though both bricks are online.
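The steps above can be sketched as the following command sequence. Hostnames, brick paths, and PIDs are placeholders taken from the output below, and the SSL setup assumes certificates have already been provisioned on all nodes as per the RHGS administration guide:

```shell
# On the servers (hostnames and brick paths are illustrative):
gluster volume create testssl replica 2 \
    gluserver601:/brick01/b01 gluserver602:/brick01/b01
gluster volume set testssl client.ssl on
gluster volume set testssl server.ssl on
gluster volume start testssl

# On the client:
mount -t glusterfs gluserver602:/testssl /testssl

# Kill and restart each brick in turn (brick PIDs from 'gluster v status'):
kill -9 <brick1-pid>
gluster volume start testssl force   # brings the killed brick back online
kill -9 <brick2-pid>
gluster volume start testssl force

# Access the mount; on 3.1.3 with SSL this fails with ENOTCONN:
ls /testssl
```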
# gluster v status
Status of volume: testssl
Gluster process                       TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------
Brick gluserver602:/brick01/b01       49152     0          Y       5669
Brick gluserver601:/brick01/b01       49152     0          Y       26885
NFS Server on localhost               2049      0          Y       8659
Self-heal Daemon on localhost         N/A       N/A        Y       8673
NFS Server on gluserver601            2049      0          Y       26906
Self-heal Daemon on gluserver601      N/A       N/A        Y       26922

Task Status of Volume testssl
----------------------------------------------------------------------
There are no active volume tasks
gluserver602:testssl on /testssl type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
[root@dhcp8-6 ~]# ll /testssl
ls: cannot access /testssl: Transport endpoint is not connected
The client reports "Transport endpoint is not connected"; without SSL everything works fine. The client should be able to reconnect to the server, but from the logs it appears the reconnect never happens.
[2017-02-08 12:14:13.146762] W [socket.c:701:__socket_rwv] 0-testssl-client-0: readv on 10.65.6.27:49152 failed (No data available)
[2017-02-08 12:14:13.146828] E [socket.c:2618:socket_poller] 0-testssl-client-0: error in polling loop
[2017-02-08 12:14:13.146982] I [MSGID: 114018] [client.c:2037:client_rpc_notify] 0-testssl-client-0: disconnected from testssl-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 12:16:14.905967] W [socket.c:701:__socket_rwv] 0-testssl-client-1: readv on 10.65.6.26:49152 failed (No data available)
[2017-02-08 12:16:14.906057] E [socket.c:2618:socket_poller] 0-testssl-client-1: error in polling loop
[2017-02-08 12:16:14.906233] I [MSGID: 114018] [client.c:2037:client_rpc_notify] 0-testssl-client-1: disconnected from testssl-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 12:16:14.906258] E [MSGID: 108006] [afr-common.c:4164:afr_notify] 0-testssl-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-08 12:16:26.318498] I [MSGID: 108006] [afr-common.c:4273:afr_local_init] 0-testssl-replicate-0: no subvolumes up
[2017-02-08 12:16:26.318589] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 527: LOOKUP() / => -1 (Transport endpoint is not connected)
# gluster volume info
Volume Name: testssl
Volume ID: c18f59f0-37b2-40ad-be2e-d784456dd1bf
Number of Bricks: 1 x 2 = 2
The issue is not reproducible with the latest package (glusterfs-3.8.4-14.el6rhs.x86_64): I tested with it and brick reconnection happens as expected. I still have to find out which patch resolved the issue. As discussed, the brick reconnect messages were not appearing in the older version.
[2017-02-09 06:34:33.644139] I [socket.c:348:ssl_setup_connection] 0-testssl-client-0: peer CN = COMMONNAME
[2017-02-09 06:34:33.644195] I [socket.c:351:ssl_setup_connection] 0-testssl-client-0: SSL verification succeeded (client: 10.65.6.27:24007)
[2017-02-09 06:34:33.644818] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-testssl-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-02-09 06:34:33.645849] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-testssl-client-0: Connected to testssl-client-0, attached to remote volume '/brick01/b01'.
[2017-02-09 06:34:33.645872] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-testssl-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2017-02-09 06:34:33.646333] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-testssl-client-0: Server lk version = 1
This patch seems to address the issue: https://review.gluster.org/#/c/13554/
"For an encrypted connection, socket_connect() used to launch socket_poller() in its own thread (ON by default), even if the connect failed. This would cause two unrefs to be done on the transport, once in socket_poller() and once in socket_connect(), causing the transport to be freed and cleaned up. This would cause further reconnect attempts to fail as the transport wouldn't be available."
The change to the rpc/rpc-transport/socket/src/socket.c file seems very small and safe. Could you build a hotfix?
Thanks for your analysis. I have not yet checked the code change to see why reconnection fails in 3.1.3; I was busy with another bugzilla. In the 3.2 release we made many changes to the socket_poller code, so I will share my analysis after checking the code. As for the patch you shared, it is already merged in the release you are using (3.7.9-12).
Below is the output of git log for the 3.1.3 branch:
Author: Kaushal M <email@example.com>
Date: Tue Mar 1 13:04:03 2016 +0530
socket: Launch socket_poller only if connect succeeded
Backport of 92abe07 from master
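Whether the fix is present on a given branch can be checked the same way from a glusterfs source checkout; the branch name below is an assumption:

```shell
# From a glusterfs clone; 'origin/release-3.7' is an assumed branch name.
git log --oneline origin/release-3.7 -- rpc/rpc-transport/socket/src/socket.c \
    | grep -i "Launch socket_poller only if connect succeeded"
```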
The downstream patch https://code.engineering.redhat.com/gerrit/85897 is already in rhgs-3.2.0. Moving the status to MODIFIED for now; I will move it to ON_QA once all the acks are in place.
Followed the steps to reproduce, and I am unable to reproduce the issue reported in the bug.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.