Description of problem:
On an SSL-enabled volume, the glusterfs fuse client does not reconnect to a brick once that brick has been disconnected.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-12.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install RHGS 3.1.3
2. Create a 1x2 replicated volume
3. Enable SSL
4. Mount the volume using the glusterfs fuse client and keep the mount active
5. Kill brick-1, then start it again
6. Kill brick-2, then start it again
7. Access the mount from the client; it returns "Transport endpoint is not connected" even though both bricks are online.

Eg:
# gluster v status
Status of volume: testssl
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluserver602:/brick01/b01             49152     0          Y       5669
Brick gluserver601:/brick01/b01             49152     0          Y       26885
NFS Server on localhost                     2049      0          Y       8659
Self-heal Daemon on localhost               N/A       N/A        Y       8673
NFS Server on gluserver601                  2049      0          Y       26906
Self-heal Daemon on gluserver601            N/A       N/A        Y       26922

Task Status of Volume testssl
------------------------------------------------------------------------------
There are no active volume tasks

On the client:
-----------
gluserver602:testssl on /testssl type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@dhcp8-6 ~]# ll /testssl
ls: cannot access /testssl: Transport endpoint is not connected

Actual results:
The client returns "Transport endpoint is not connected". Without SSL everything works fine.

Expected results:
The client should be able to reconnect to the server.

Additional info:
From the logs it appears the client is not reconnecting.

[2017-02-08 12:14:13.146762] W [socket.c:701:__socket_rwv] 0-testssl-client-0: readv on 10.65.6.27:49152 failed (No data available)
[2017-02-08 12:14:13.146828] E [socket.c:2618:socket_poller] 0-testssl-client-0: error in polling loop
[2017-02-08 12:14:13.146982] I [MSGID: 114018] [client.c:2037:client_rpc_notify] 0-testssl-client-0: disconnected from testssl-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 12:16:14.905967] W [socket.c:701:__socket_rwv] 0-testssl-client-1: readv on 10.65.6.26:49152 failed (No data available)
[2017-02-08 12:16:14.906057] E [socket.c:2618:socket_poller] 0-testssl-client-1: error in polling loop
[2017-02-08 12:16:14.906233] I [MSGID: 114018] [client.c:2037:client_rpc_notify] 0-testssl-client-1: disconnected from testssl-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 12:16:14.906258] E [MSGID: 108006] [afr-common.c:4164:afr_notify] 0-testssl-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-08 12:16:26.318498] I [MSGID: 108006] [afr-common.c:4273:afr_local_init] 0-testssl-replicate-0: no subvolumes up
[2017-02-08 12:16:26.318589] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 527: LOOKUP() / => -1 (Transport endpoint is not connected)

# gluster volume info

Volume Name: testssl
Type: Replicate
Volume ID: c18f59f0-37b2-40ad-be2e-d784456dd1bf
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluserver602:/brick01/b01
Brick2: gluserver601:/brick01/b01
Options Reconfigured:
auth.ssl-allow: *
server.ssl: on
client.ssl: on
performance.readdir-ahead: on
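For reference, the steps to reproduce above translate roughly to the gluster CLI sequence below. This is a minimal sketch: host names, brick paths, the volume name and the SSL options are taken from the report, while the mount point and the way the brick process is killed are assumptions, and SSL certificates are assumed to be already deployed on both servers and the client.

# Create and start a 1x2 replicated volume with SSL enabled
gluster volume create testssl replica 2 gluserver602:/brick01/b01 gluserver601:/brick01/b01
gluster volume set testssl client.ssl on
gluster volume set testssl server.ssl on
gluster volume set testssl auth.ssl-allow '*'
gluster volume start testssl

# On the client: mount with the native fuse client and keep the mount active
mount -t glusterfs gluserver602:/testssl /testssl

# On each server in turn: kill the brick process (PID from 'gluster v status')
# and bring it back with a forced volume start
kill <brick-pid>
gluster volume start testssl force

# Back on the client: on the affected build this fails with
# "Transport endpoint is not connected" even though both bricks show Online
ls /testssl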
Hi,

The issue is not reproducible with the latest package version (glusterfs-3.8.4-14.el6rhs.x86_64.rpm). I still have to find out which patch resolved it.

Regards,
Mohit Agrawal
Hi Mohit,

I have tested with the glusterfs-3.8.4-14.el6rhs.x86_64.rpm package and brick reconnection happens with the latest package. As discussed, the brick reconnect messages were not appearing in the older version.

[2017-02-09 06:34:33.644139] I [socket.c:348:ssl_setup_connection] 0-testssl-client-0: peer CN = COMMONNAME
[2017-02-09 06:34:33.644195] I [socket.c:351:ssl_setup_connection] 0-testssl-client-0: SSL verification succeeded (client: 10.65.6.27:24007)
[2017-02-09 06:34:33.644818] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-testssl-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-02-09 06:34:33.645849] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-testssl-client-0: Connected to testssl-client-0, attached to remote volume '/brick01/b01'.
[2017-02-09 06:34:33.645872] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-testssl-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2017-02-09 06:34:33.646333] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-testssl-client-0: Server lk version = 1

Regards,
Riyas
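As a quick way to re-verify this from the client side, the reconnect can be confirmed by watching the fuse mount log for the SSL handshake and setvolume messages after a brick comes back. This is a sketch only; the log file name is an assumption derived from the mount point used above.

# Client-side check after restarting a brick
grep -E 'ssl_setup_connection|client_setvolume_cbk' /var/log/glusterfs/testssl.log | tail -n 5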
Hi Mohit,

This patch seems to address the issue: https://review.gluster.org/#/c/13554/

"For an encrypted connection, socket_connect() used to launch socket_poller() in its own thread (ON by default), even if the connect failed. This would cause two unrefs to be done on the transport, once in socket_poller() and once in socket_connect(), causing the transport to be freed and cleaned up. This would cause further reconnect attempts to fail, as the transport wouldn't be available."

The change to the rpc/rpc-transport/socket/src/socket.c file seems very small and safe. Could you build a hotfix?

Regards,
Pierre-Yves
Hi Pierre,

Thanks for your analysis. I have not yet checked the code to see why it is not working in 3.1.3; I was busy with another bugzilla. In the 3.2 release we made many changes to the socket_poller code, so I will share my analysis after checking the code.

Regarding the patch you shared, it is already merged in the release (3.7.9-12) that you are using. Below is the git log output for the branch (3.1.3):

>>>>>>>>>>>>>>>>
commit f125bb78b5a2abb41dec011d2f4fd525cb57ec93
Author: Kaushal M <kaushal>
Date:   Tue Mar 1 13:04:03 2016 +0530

    socket: Launch socket_poller only if connect succeeded

    Backport of 92abe07 from master
>>>>>>>>>>>>>>>>>

Regards,
Mohit Agrawal
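As a side note, an easy way to double-check whether a backport like this is present is to ask git directly. This assumes a local clone of the glusterfs source with the relevant release branch checked out and remotes fetched.

# From a clone of the glusterfs source, on the release branch in question:
git log --oneline --grep 'Launch socket_poller only if connect succeeded'
# Or list every branch that already contains the backport commit:
git branch -a --contains f125bb78b5a2abb41dec011d2f4fd525cb57ec93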
Downstream patch https://code.engineering.redhat.com/gerrit/85897 is already in rhgs-3.2.0. Moving the status to MODIFIED for now; I will move it to ON_QA once all the acks are in place.
Followed the steps to reproduce and I am unable to reproduce the issue reported in the bug.

Version
-------
glusterfs-3.8.4-15
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html