Bug 1420324

Summary: [GSS] Once disconnected, bricks do not reconnect if SSL is enabled
Product: Red Hat Gluster Storage Reporter: Riyas Abdulrasak <rnalakka>
Component: core Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA QA Contact: Vivek Das <vdas>
Severity: medium Docs Contact:
Priority: unspecified    
Version: rhgs-3.1 CC: amukherj, bkunal, moagrawa, nbalacha, pierre-yves.goubet, rcyriac, rhs-bugs, storage-qa-internal
Target Milestone: ---   
Target Release: RHGS 3.2.0   
Hardware: All   
OS: All   
Whiteboard: ssl
Fixed In Version: glusterfs-3.8.4-2 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-03-23 06:04:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1351528    

Description Riyas Abdulrasak 2017-02-08 13:04:05 UTC
Description of problem:

On an SSL-enabled volume, the glusterfs fuse client does not reconnect to a brick once that brick has been disconnected.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1. Install RHGS 3.1.3.
2. Create a 1x2 replicated volume.
3. Enable SSL.
4. Mount the volume using the glusterfs fuse client, keeping the mount active.
5. Kill brick-1, then start it again.
6. Kill the second brick, then start it again.
7. Try to access the mount from the client; it reports "Transport endpoint is not connected" even though both bricks are online.
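The steps above can be sketched as a shell session. Host names and brick paths are taken from the volume info later in this report; the SSL certificate/CA setup is assumed to already be in place, `<brick-N-pid>` is a placeholder for the PID shown by `gluster volume status`, and `gluster volume start ... force` is used here as the usual way to restart a killed brick process:

```shell
# On one of the servers: create and start a 1x2 replicated volume
gluster volume create testssl replica 2 \
    gluserver602:/brick01/b01 gluserver601:/brick01/b01
gluster volume set testssl client.ssl on
gluster volume set testssl server.ssl on
gluster volume set testssl auth.ssl-allow '*'
gluster volume start testssl

# On the client: mount via fuse and keep the mount active
mount -t glusterfs gluserver602:/testssl /testssl

# Kill and restart the first brick, then the second
gluster volume status testssl          # note the brick PIDs
kill <brick-1-pid>
gluster volume start testssl force     # restarts the killed brick
kill <brick-2-pid>
gluster volume start testssl force

# On the affected version, this now fails with
# "Transport endpoint is not connected":
ls /testssl
```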

# gluster v status
Status of volume: testssl
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick gluserver602:/brick01/b01             49152     0          Y       5669 
Brick gluserver601:/brick01/b01             49152     0          Y       26885
NFS Server on localhost                     2049      0          Y       8659 
Self-heal Daemon on localhost               N/A       N/A        Y       8673 
NFS Server on gluserver601                  2049      0          Y       26906
Self-heal Daemon on gluserver601            N/A       N/A        Y       26922
Task Status of Volume testssl
There are no active volume tasks

client : 
gluserver602:testssl on /testssl type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@dhcp8-6 ~]# ll /testssl 
ls: cannot access /testssl: Transport endpoint is not connected

Actual results:

The client reports "Transport endpoint is not connected". Without SSL, everything works fine.

Expected results:

The client should reconnect to the bricks once they come back online.

Additional info:

The logs indicate that the client is not reconnecting:

[2017-02-08 12:14:13.146762] W [socket.c:701:__socket_rwv] 0-testssl-client-0: readv on failed (No data available)
[2017-02-08 12:14:13.146828] E [socket.c:2618:socket_poller] 0-testssl-client-0: error in polling loop
[2017-02-08 12:14:13.146982] I [MSGID: 114018] [client.c:2037:client_rpc_notify] 0-testssl-client-0: disconnected from testssl-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 12:16:14.905967] W [socket.c:701:__socket_rwv] 0-testssl-client-1: readv on failed (No data available)
[2017-02-08 12:16:14.906057] E [socket.c:2618:socket_poller] 0-testssl-client-1: error in polling loop
[2017-02-08 12:16:14.906233] I [MSGID: 114018] [client.c:2037:client_rpc_notify] 0-testssl-client-1: disconnected from testssl-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 12:16:14.906258] E [MSGID: 108006] [afr-common.c:4164:afr_notify] 0-testssl-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-08 12:16:26.318498] I [MSGID: 108006] [afr-common.c:4273:afr_local_init] 0-testssl-replicate-0: no subvolumes up
[2017-02-08 12:16:26.318589] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 527: LOOKUP() / => -1 (Transport endpoint is not connected)

# gluster volume info
Volume Name: testssl
Type: Replicate
Volume ID: c18f59f0-37b2-40ad-be2e-d784456dd1bf
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Brick1: gluserver602:/brick01/b01
Brick2: gluserver601:/brick01/b01
Options Reconfigured:
auth.ssl-allow: *
server.ssl: on
client.ssl: on
performance.readdir-ahead: on

Comment 2 Mohit Agrawal 2017-02-09 05:40:09 UTC

The issue is not reproducible with the latest package (glusterfs-3.8.4-14.el6rhs.x86_64.rpm).
I need to find out which patch resolved it.

Mohit Agrawal

Comment 3 Riyas Abdulrasak 2017-02-09 06:40:13 UTC
Hi Mohit, 

I have tested with the glusterfs-3.8.4-14.el6rhs.x86_64.rpm package, and brick reconnection happens with the latest package.

As discussed, the brick reconnect messages did not appear with the older version.

[2017-02-09 06:34:33.644139] I [socket.c:348:ssl_setup_connection] 0-testssl-client-0: peer CN = COMMONNAME
[2017-02-09 06:34:33.644195] I [socket.c:351:ssl_setup_connection] 0-testssl-client-0: SSL verification succeeded (client:
[2017-02-09 06:34:33.644818] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-testssl-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-02-09 06:34:33.645849] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-testssl-client-0: Connected to testssl-client-0, attached to remote volume '/brick01/b01'.
[2017-02-09 06:34:33.645872] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-testssl-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2017-02-09 06:34:33.646333] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-testssl-client-0: Server lk version = 1


Comment 4 Pierre-Yves G. 2017-02-09 10:52:41 UTC
Hi Mohit,

This patch seems to address the issue: https://review.gluster.org/#/c/13554/
"For an encrypted connection, socket_connect() used to launch socket_poller() in its own thread (ON by default), even if the connect failed. This would cause two unrefs to be done on the transport, once in socket_poller() and once in socket_connect(), causing the transport to be freed and cleaned up. This would cause further reconnect attempts to fail as the transport wouldn't be available."

The changes to the rpc/rpc-transport/socket/src/socket.c file seem very small and safe. Could you build a hotfix?


Comment 5 Mohit Agrawal 2017-02-09 11:21:41 UTC
Hi Pierre,

 Thanks for your analysis. I have not yet checked the code to see why it is not working in 3.1.3;
 I was busy with another bugzilla. In the 3.2 release we made many changes specific to the socket_poller code, so I will share my analysis after checking the code.
 As for the patch you shared, it is already merged in the release (3.7.9-12) that you are using.

 Below is the output from git log for the branch (3.1.3)


 commit f125bb78b5a2abb41dec011d2f4fd525cb57ec93
 Author: Kaushal M <kaushal@redhat.com>
 Date:   Tue Mar 1 13:04:03 2016 +0530

    socket: Launch socket_poller only if connect succeeded
      Backport of 92abe07 from master

Mohit Agrawal

Comment 12 Atin Mukherjee 2017-02-10 12:32:42 UTC
downstream patch: https://code.engineering.redhat.com/gerrit/85897 has already been merged into rhgs-3.2.0. Moving the status to MODIFIED for now; I will move it to ON_QA once all the acks are in place.

Comment 18 Vivek Das 2017-02-26 15:12:55 UTC
Followed the steps to reproduce, and I am unable to reproduce the issue reported in the bug.


Comment 21 errata-xmlrpc 2017-03-23 06:04:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.