Bug 1420324 - [GSS] Bricks, once disconnected, do not reconnect if SSL is enabled
Summary: [GSS] Bricks, once disconnected, do not reconnect if SSL is enabled
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: core
Version: rhgs-3.1
Hardware: All
OS: All
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Mohit Agrawal
QA Contact: Vivek Das
Whiteboard: ssl
Depends On:
Blocks: 1351528
Reported: 2017-02-08 13:04 UTC by Riyas Abdulrasak
Modified: 2020-04-15 15:14 UTC (History)

Fixed In Version: glusterfs-3.8.4-2
Doc Text:
Clone Of:
Last Closed: 2017-03-23 06:04:54 UTC
Target Upstream Version:

System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1378528 | 0 | unspecified | CLOSED | [SSL] glustershd disconnected from glusterd | 2021-02-22 00:41:40 UTC
Red Hat Product Errata | RHSA-2017:0486 | 0 | normal | SHIPPED_LIVE | Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update | 2017-03-23 09:18:45 UTC

Internal Links: 1378528

Description Riyas Abdulrasak 2017-02-08 13:04:05 UTC
Description of problem:

On an SSL-enabled volume, the gluster.fuse client does not reconnect to a brick once that brick has been disconnected.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:

1. Install RHGS 3.1.3.
2. Create a 1x2 replicated volume.
3. Enable SSL.
4. Mount the volume using the gluster.fuse client and keep the mount active.
5. Kill the first brick, then start it again.
6. Kill the second brick, then start it again.
7. Try to access the mount from the client. It returns "Transport endpoint is not connected" even though both bricks are online.

]# gluster v status
Status of volume: testssl
Gluster process                             TCP Port  RDMA Port  Online  Pid
Brick gluserver602:/brick01/b01             49152     0          Y       5669 
Brick gluserver601:/brick01/b01             49152     0          Y       26885
NFS Server on localhost                     2049      0          Y       8659 
Self-heal Daemon on localhost               N/A       N/A        Y       8673 
NFS Server on gluserver601                  2049      0          Y       26906
Self-heal Daemon on gluserver601            N/A       N/A        Y       26922
Task Status of Volume testssl
There are no active volume tasks

Client:
gluserver602:testssl on /testssl type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@dhcp8-6 ~]# ll /testssl 
ls: cannot access /testssl: Transport endpoint is not connected

Actual results:

The client returns "Transport endpoint is not connected". Without SSL, everything works fine.

Expected results:

The client should be able to reconnect to the server.

Additional info:

From the client logs, it appears that the client is not reconnecting.

[2017-02-08 12:14:13.146762] W [socket.c:701:__socket_rwv] 0-testssl-client-0: readv on failed (No data available)
[2017-02-08 12:14:13.146828] E [socket.c:2618:socket_poller] 0-testssl-client-0: error in polling loop
[2017-02-08 12:14:13.146982] I [MSGID: 114018] [client.c:2037:client_rpc_notify] 0-testssl-client-0: disconnected from testssl-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 12:16:14.905967] W [socket.c:701:__socket_rwv] 0-testssl-client-1: readv on failed (No data available)
[2017-02-08 12:16:14.906057] E [socket.c:2618:socket_poller] 0-testssl-client-1: error in polling loop
[2017-02-08 12:16:14.906233] I [MSGID: 114018] [client.c:2037:client_rpc_notify] 0-testssl-client-1: disconnected from testssl-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 12:16:14.906258] E [MSGID: 108006] [afr-common.c:4164:afr_notify] 0-testssl-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-08 12:16:26.318498] I [MSGID: 108006] [afr-common.c:4273:afr_local_init] 0-testssl-replicate-0: no subvolumes up
[2017-02-08 12:16:26.318589] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 527: LOOKUP() / => -1 (Transport endpoint is not connected)

# gluster volume info
Volume Name: testssl
Type: Replicate
Volume ID: c18f59f0-37b2-40ad-be2e-d784456dd1bf
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Brick1: gluserver602:/brick01/b01
Brick2: gluserver601:/brick01/b01
Options Reconfigured:
auth.ssl-allow: *
server.ssl: on
client.ssl: on
performance.readdir-ahead: on
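
For anyone repeating step 3, the SSL-related options listed under "Options Reconfigured" can be applied with `gluster volume set`. The sketch below only builds the command lines and does not run them. The helper name `ssl_volume_set_commands` is hypothetical, and certificate setup is not covered because this report does not describe it.

```python
import shlex

# Option names and values copied from the "Options Reconfigured" output above.
SSL_OPTIONS = {
    "auth.ssl-allow": "*",
    "server.ssl": "on",
    "client.ssl": "on",
}


def ssl_volume_set_commands(volume, options=SSL_OPTIONS):
    """Build one 'gluster volume set' command line per option (hypothetical helper)."""
    return [
        # shlex.join quotes values such as '*' so the shell does not expand them.
        shlex.join(["gluster", "volume", "set", volume, key, value])
        for key, value in options.items()
    ]


if __name__ == "__main__":
    for command in ssl_volume_set_commands("testssl"):
        print(command)
```

Printing the commands instead of executing them keeps the sketch safe to run on a machine without Gluster installed.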

Comment 2 Mohit Agrawal 2017-02-09 05:40:09 UTC

The issue is not reproducible with the latest package version (glusterfs-3.8.4-14.el6rhs.x86_64.rpm).
I still have to find out which patch resolved it.

Mohit Agrawal

Comment 3 Riyas Abdulrasak 2017-02-09 06:40:13 UTC
Hi Mohit, 

I have tested with the glusterfs-3.8.4-14.el6rhs.x86_64.rpm package, and brick reconnection works with it.

As discussed, the brick reconnect messages below did not appear with the older version.

[2017-02-09 06:34:33.644139] I [socket.c:348:ssl_setup_connection] 0-testssl-client-0: peer CN = COMMONNAME
[2017-02-09 06:34:33.644195] I [socket.c:351:ssl_setup_connection] 0-testssl-client-0: SSL verification succeeded (client:
[2017-02-09 06:34:33.644818] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-testssl-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-02-09 06:34:33.645849] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-testssl-client-0: Connected to testssl-client-0, attached to remote volume '/brick01/b01'.
[2017-02-09 06:34:33.645872] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-testssl-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2017-02-09 06:34:33.646333] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-testssl-client-0: Server lk version = 1
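
A quick way to check a client log for this symptom is to track the last connection event seen for each client translator. The sketch below is only an illustration: it assumes, based on the excerpts in this report, that MSGID 114018 marks a disconnect and MSGID 114046 marks a successful connect. The function name and the regular expression are hypothetical.

```python
import re

# Assumed from the log excerpts in this report.
DISCONNECTED = "114018"  # "disconnected from ..."
CONNECTED = "114046"     # "Connected to ..., attached to remote volume ..."

# Matches e.g. "[MSGID: 114018] [client.c:2037:client_rpc_notify] 0-testssl-client-0:"
LINE_RE = re.compile(r"\[MSGID: (\d+)\] \[[^\]]+\] (\S+):")


def clients_still_disconnected(log_lines):
    """Return sorted client names whose most recent connection event is a disconnect."""
    last_event = {}
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        msgid, name = match.groups()
        if msgid in (DISCONNECTED, CONNECTED):
            last_event[name] = msgid  # later lines overwrite earlier ones
    return sorted(name for name, event in last_event.items() if event == DISCONNECTED)


if __name__ == "__main__":
    import sys

    # Usage: python check_reconnect.py < /path/to/client.log
    for name in clients_still_disconnected(sys.stdin):
        print(name)
```

A client that is listed long after its brick came back online matches the behavior reported here.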


Comment 4 Pierre-Yves G. 2017-02-09 10:52:41 UTC
Hi Mohit,

This patch seems to address the issue: https://review.gluster.org/#/c/13554/
"For an encrypted connection, socket_connect() used to launch socket_poller() in its own thread (ON by default), even if the connect failed. This would cause two unrefs to be done on the transport, once in socket_poller() and once in socket_connect(), causing the transport to be freed and cleaned up. This would cause further reconnect attempts to fail as the transport wouldn't be available."

The changes to rpc/rpc-transport/socket/src/socket.c seem very small and safe. Could you build a hotfix?
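
To make the failure mode in the quoted commit message concrete, here is a toy reference-counting model. It is not the glusterfs code: the `Transport` class, the `failed_connect` function, and the `launch_poller_on_failure` flag are hypothetical stand-ins, and the exact reference accounting is an assumption made for illustration.

```python
class Transport:
    """Toy stand-in for a reference-counted transport object."""

    def __init__(self):
        self.refcount = 1  # reference held by the owner (the client translator)
        self.freed = False

    def ref(self):
        if self.freed:
            raise RuntimeError("ref on a freed transport")
        self.refcount += 1

    def unref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True  # transport is cleaned up and no longer usable


def failed_connect(transport, launch_poller_on_failure):
    """Model one failed connect attempt; return True if the transport survives."""
    transport.ref()  # reference taken for this connection attempt (assumed)
    if launch_poller_on_failure:
        # Modeled old behavior: the poller starts even though connect failed,
        # hits an error, and drops a reference on its way out.
        transport.unref()
    # The connect error path also drops a reference.
    transport.unref()
    return not transport.freed


if __name__ == "__main__":
    print(failed_connect(Transport(), launch_poller_on_failure=True))   # False
    print(failed_connect(Transport(), launch_poller_on_failure=False))  # True
```

In this model, the extra unref from the poller drops the count to zero, so nothing is left for a later reconnect attempt. Launching the poller only after a successful connect keeps the owner's reference intact.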


Comment 5 Mohit Agrawal 2017-02-09 11:21:41 UTC
Hi Pierre,

Thanks for your analysis. I have not yet checked the code to see why this does not work in 3.1.3, as I was busy with another Bugzilla.
In the 3.2 release we made many changes specific to the socket_poller code. I will share my analysis after checking the code.
As for the patch you shared, it is already merged in the release you are using (3.7.9-12).

Below is the output from git log for the branch (3.1.3):


 commit f125bb78b5a2abb41dec011d2f4fd525cb57ec93
 Author: Kaushal M <kaushal@redhat.com>
 Date:   Tue Mar 1 13:04:03 2016 +0530

    socket: Launch socket_poller only if connect succeeded
      Backport of 92abe07 from master

Mohit Agrawal

Comment 12 Atin Mukherjee 2017-02-10 12:32:42 UTC
The downstream patch https://code.engineering.redhat.com/gerrit/85897 is already in rhgs-3.2.0. Moving the status to MODIFIED for now; I will move it to ON_QA once all the acks are in place.

Comment 18 Vivek Das 2017-02-26 15:12:55 UTC
I followed the steps to reproduce and was unable to reproduce the issue reported in this bug.


Comment 21 errata-xmlrpc 2017-03-23 06:04:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

