Bug 1420324

Summary:	[GSS] Bricks do not reconnect after a disconnect if SSL is enabled
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Riyas Abdulrasak <rnalakka>
Component:	core	Assignee:	Mohit Agrawal <moagrawa>
Status:	CLOSED ERRATA	QA Contact:	Vivek Das <vdas>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.1	CC:	amukherj, bkunal, moagrawa, nbalacha, pierre-yves.goubet, rcyriac, rhs-bugs, storage-qa-internal
Target Milestone:	---
Target Release:	RHGS 3.2.0
Hardware:	All
OS:	All
Whiteboard:	ssl
Fixed In Version:	glusterfs-3.8.4-2	Doc Type:	If docs needed, set a value
Last Closed:	2017-03-23 06:04:54 UTC	Type:	Bug
Bug Blocks:	1351528

Description Riyas Abdulrasak 2017-02-08 13:04:05 UTC

Description of problem:

On an SSL-enabled volume, the gluster.fuse client does not reconnect to a brick once that brick has been disconnected.

Version-Release number of selected component (if applicable):

glusterfs-3.7.9-12.el6rhs.x86_64

How reproducible:

Always 

Steps to Reproduce:

1. Install RHGS 3.1.3.
2. Create a 1x2 replicated volume.
3. Enable SSL.
4. Mount the volume using the gluster.fuse client and keep the mount active.
5. Kill the first brick, then start it again.
6. Kill the second brick, then start it again.
7. Try to access the mount from the client. It fails with "Transport endpoint is not connected" even though both bricks are online.

E.g.:
# gluster v status
Status of volume: testssl
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gluserver602:/brick01/b01             49152     0          Y       5669 
Brick gluserver601:/brick01/b01             49152     0          Y       26885
NFS Server on localhost                     2049      0          Y       8659 
Self-heal Daemon on localhost               N/A       N/A        Y       8673 
NFS Server on gluserver601                  2049      0          Y       26906
Self-heal Daemon on gluserver601            N/A       N/A        Y       26922
 
Task Status of Volume testssl
------------------------------------------------------------------------------
There are no active volume tasks

client : 
-----------
gluserver602:testssl on /testssl type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)


[root@dhcp8-6 ~]# ll /testssl 
ls: cannot access /testssl: Transport endpoint is not connected




Actual results:

The client returns "Transport endpoint is not connected". Without SSL, everything works fine.

Expected results:

The client should be able to reconnect to the server.

Additional info:

From the logs, it appears that the client is not reconnecting.

[2017-02-08 12:14:13.146762] W [socket.c:701:__socket_rwv] 0-testssl-client-0: readv on 10.65.6.27:49152 failed (No data available)
[2017-02-08 12:14:13.146828] E [socket.c:2618:socket_poller] 0-testssl-client-0: error in polling loop
[2017-02-08 12:14:13.146982] I [MSGID: 114018] [client.c:2037:client_rpc_notify] 0-testssl-client-0: disconnected from testssl-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 12:16:14.905967] W [socket.c:701:__socket_rwv] 0-testssl-client-1: readv on 10.65.6.26:49152 failed (No data available)
[2017-02-08 12:16:14.906057] E [socket.c:2618:socket_poller] 0-testssl-client-1: error in polling loop
[2017-02-08 12:16:14.906233] I [MSGID: 114018] [client.c:2037:client_rpc_notify] 0-testssl-client-1: disconnected from testssl-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2017-02-08 12:16:14.906258] E [MSGID: 108006] [afr-common.c:4164:afr_notify] 0-testssl-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-08 12:16:26.318498] I [MSGID: 108006] [afr-common.c:4273:afr_local_init] 0-testssl-replicate-0: no subvolumes up
[2017-02-08 12:16:26.318589] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 527: LOOKUP() / => -1 (Transport endpoint is not connected)
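
One way to check a client log for this symptom is sketched below. It is only an illustration, not part of GlusterFS: the function name stays_disconnected and its parameters are hypothetical, and the two substrings it matches ("disconnected from" and "Connected to") are taken from the log excerpts in this report. It also matches the client name by plain substring, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if the log lines show `client`
 * (for example "0-testssl-client-0") disconnecting and never logging
 * "Connected to" afterwards.
 */
bool stays_disconnected(const char *const *lines, size_t n, const char *client)
{
    assert(lines != NULL && client != NULL);

    bool down = false;
    for (size_t i = 0; i < n; i++) {
        /* Ignore lines that belong to other translators. */
        if (strstr(lines[i], client) == NULL)
            continue;
        if (strstr(lines[i], "disconnected from") != NULL)
            down = true;    /* brick connection was lost */
        else if (strstr(lines[i], "Connected to") != NULL)
            down = false;   /* a later handshake succeeded */
    }
    return down;
}
```

Fed the lines above, this check would report 0-testssl-client-0 as still disconnected, because no "Connected to" message for it follows the 12:14:13 disconnect.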




# gluster volume info
 
Volume Name: testssl
Type: Replicate
Volume ID: c18f59f0-37b2-40ad-be2e-d784456dd1bf
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluserver602:/brick01/b01
Brick2: gluserver601:/brick01/b01
Options Reconfigured:
auth.ssl-allow: *
server.ssl: on
client.ssl: on
performance.readdir-ahead: on

Comment 2 Mohit Agrawal 2017-02-09 05:40:09 UTC

Hi,

The issue is not reproducible with the latest package version (glusterfs-3.8.4-14.el6rhs.x86_64.rpm).
I still need to find out which patch resolved it.


Regards
Mohit Agrawal

Comment 3 Riyas Abdulrasak 2017-02-09 06:40:13 UTC

Hi Mohit, 

I tested with the glusterfs-3.8.4-14.el6rhs.x86_64.rpm package, and brick reconnection works with it.

As discussed, the brick reconnect messages did not appear in the older version. They do appear with the latest package:

[2017-02-09 06:34:33.644139] I [socket.c:348:ssl_setup_connection] 0-testssl-client-0: peer CN = COMMONNAME
[2017-02-09 06:34:33.644195] I [socket.c:351:ssl_setup_connection] 0-testssl-client-0: SSL verification succeeded (client: 10.65.6.27:24007)
[2017-02-09 06:34:33.644818] I [MSGID: 114057] [client-handshake.c:1446:select_server_supported_programs] 0-testssl-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2017-02-09 06:34:33.645849] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-testssl-client-0: Connected to testssl-client-0, attached to remote volume '/brick01/b01'.
[2017-02-09 06:34:33.645872] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-testssl-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2017-02-09 06:34:33.646333] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-testssl-client-0: Server lk version = 1

Regards
Riyas

Comment 4 Pierre-Yves G. 2017-02-09 10:52:41 UTC

Hi Mohit,

This patch seems to address the issue: https://review.gluster.org/#/c/13554/
"For an encrypted connection, socket_connect() used to launch socket_poller() in its own thread (ON by default), even if the connect failed. This would cause two unrefs to be done on the transport, once in socket_poller() and once in socket_connect(), causing the transport to be freed and cleaned up. This would cause further reconnect attempts to fail as the transport wouldn't be available."
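
The double unref described in that commit message can be illustrated with a small model. This is only a sketch under stated assumptions, not the actual socket.c code: struct transport, poller_exit and failed_connect_keeps_transport are hypothetical names, the real transport is not a two-field struct, and the real poller runs in its own thread rather than inline.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a ref-counted transport (not rpc_transport_t). */
struct transport {
    int  refcount;
    bool destroyed;  /* set instead of calling free(), so the state stays inspectable */
};

static void transport_ref(struct transport *t) { t->refcount++; }

static void transport_unref(struct transport *t)
{
    assert(t->refcount > 0);
    if (--t->refcount == 0)
        t->destroyed = true;
}

/* Models the poller's exit path: it drops one reference when it sees an error. */
static void poller_exit(struct transport *t) { transport_unref(t); }

/*
 * Models one failed connect on an encrypted connection.
 *   launch_poller_on_failure = true  : the behaviour the commit message calls buggy
 *   launch_poller_on_failure = false : poller launched only if connect succeeded
 * Returns true if the transport survives for a later reconnect attempt.
 */
bool failed_connect_keeps_transport(bool launch_poller_on_failure)
{
    struct transport t = { .refcount = 1, .destroyed = false }; /* owner's reference */

    transport_ref(&t);            /* reference taken for this connect attempt */
    bool connect_ok = false;      /* the brick is down, so connect fails */

    if (connect_ok || launch_poller_on_failure)
        poller_exit(&t);          /* first unref: poller hits the error and exits */

    if (!connect_ok)
        transport_unref(&t);      /* second unref: connect's own error path */

    return !t.destroyed;
}
```

In this model, one reference is taken per attempt but two are dropped when the poller is launched after a failed connect, so the owner's reference is consumed and the transport is torn down. Launching the poller only on success leaves the counts balanced.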

The changes to the rpc/rpc-transport/socket/src/socket.c file seem very small and safe. Could you build a hotfix?

Regards,
Pierre-Yves

Comment 5 Mohit Agrawal 2017-02-09 11:21:41 UTC

Hi Pierre-Yves,


 Thanks for your analysis. I have not yet checked the code to see why this is not working in 3.1.3, as I was busy with another Bugzilla. In the 3.2 release we made many changes specific to the socket_poller code; I will share my analysis after checking the code.
 
 As for the patch you shared, it is already merged in the release you are using (3.7.9-12).

 Below is the git log output for the 3.1.3 branch:

 >>>>>>>>>>>>>>>>

 commit f125bb78b5a2abb41dec011d2f4fd525cb57ec93
 Author: Kaushal M <kaushal>
 Date:   Tue Mar 1 13:04:03 2016 +0530

    socket: Launch socket_poller only if connect succeeded
    
      Backport of 92abe07 from master
>>>>>>>>>>>>>>>>>

Regards
Mohit Agrawal

Comment 12 Atin Mukherjee 2017-02-10 12:32:42 UTC

The downstream patch https://code.engineering.redhat.com/gerrit/85897 is already in rhgs-3.2.0. Moving the status to MODIFIED for now; I will move it to ON_QA once all the acks are in place.

Comment 18 Vivek Das 2017-02-26 15:12:55 UTC

I followed the steps to reproduce and was unable to reproduce the issue reported in this bug.

Version
-------
glusterfs-3.8.4-15

Comment 21 errata-xmlrpc 2017-03-23 06:04:54 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html