Bug 1378528 - [SSL] glustershd disconnected from glusterd
Summary: [SSL] glustershd disconnected from glusterd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Mohit Agrawal
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks: 1351528
 
Reported: 2016-09-22 16:23 UTC by SATHEESARAN
Modified: 2017-03-23 05:48 UTC
CC: 7 users

Fixed In Version: glusterfs-3.8.4-2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
RHEV-RHGS-HCI RHEL 7.2
Last Closed: 2017-03-23 05:48:36 UTC
Embargoed:


Attachments
glusterd.log (357.91 KB, text/plain)
2016-09-22 16:42 UTC, SATHEESARAN
glustershd.log (76.58 KB, text/plain)
2016-09-22 16:43 UTC, SATHEESARAN


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1420324 0 unspecified CLOSED [GSS] The bricks once disconnected not connects back if SSL is enabled 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2017:0486 0 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Internal Links: 1420324

Description SATHEESARAN 2016-09-22 16:23:43 UTC
Description of problem:
-----------------------
The Grafton (HCI) setup has SSL enabled. After starting the volume, glusterd.log continuously logs an error that glustershd has disconnected from glusterd. The impact of this is that heals are not happening.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.2 interim build ( glusterfs-3.8.4-1.el7rhgs )

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Set up SSL on the management and data path (a sketch of the typical commands follows below)
2. Start glusterd
3. Start the sharded replica 3 volume
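
For reference, the typical GlusterFS procedure for step 1 is sketched below. It assumes the SSL certificate, key and CA bundle are already in place at the default paths (/etc/ssl/glusterfs.pem, /etc/ssl/glusterfs.key, /etc/ssl/glusterfs.ca) on every node, and uses the "engine" volume from this setup as the example.

On every node, management-path encryption is enabled by creating the secure-access file and restarting glusterd:

touch /var/lib/glusterd/secure-access
systemctl restart glusterd

Data-path (I/O) encryption is then enabled per volume, from any one node:

gluster volume set engine client.ssl on
gluster volume set engine server.ssl on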

Actual results:
---------------
glustershd disconnected from glusterd

Expected results:
-----------------
glustershd should be able to connect to glusterd, and heals should happen

Comment 2 SATHEESARAN 2016-09-22 16:29:54 UTC
Error messages in glusterd.log

<snip>
[2016-09-22 16:00:47.345717] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:02:23.354480] W [socket.c:590:__socket_rwv] 0-glustershd: readv on /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket failed (Invalid argument)
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd." repeated 39 times between [2016-09-22 16:00:47.345717] and [2016-
09-22 16:02:44.356399]
[2016-09-22 16:02:47.356743] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:04:13.854091] E [socket.c:353:ssl_setup_connection] 0-socket.management: SSL connect error (client: )
[2016-09-22 16:04:13.854170] E [socket.c:202:ssl_dump_error_stack] 0-socket.management:   error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number
[2016-09-22 16:04:13.854195] E [socket.c:2419:socket_poller] 0-socket.management: server setup failed
[2016-09-22 16:04:29.366244] W [socket.c:590:__socket_rwv] 0-glustershd: readv on /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket failed (Invalid argument)
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd." repeated 39 times between [2016-09-22 16:02:47.356743] and [2016-
09-22 16:04:44.367580]
[2016-09-22 16:04:47.367856] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:05:14.426256] I [MSGID: 106499] [glusterd-handler.c:4360:__glusterd_handle_status_volume] 0-management: Received status volume req for volume engine
</snip>

Comment 3 SATHEESARAN 2016-09-22 16:31:03 UTC
Error in glustershd.log

<snip>
[2016-09-22 16:09:16.412011] I [MSGID: 100030] [glusterfsd.c:2412:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket --xlator-option *replicate*.node-uuid=aec6ef55-7a72-4a14-b17e-fab690817e8c)
[2016-09-22 16:09:16.420485] I [socket.c:3974:socket_init] 0-socket.glusterfsd: SSL support for glusterd is ENABLED
[2016-09-22 16:09:16.420888] E [socket.c:4052:socket_init] 0-socket.glusterfsd: failed to open /etc/ssl/dhparam.pem, DH ciphers are disabled
[2016-09-22 16:09:16.421894] I [socket.c:3974:socket_init] 0-glusterfs: SSL support for glusterd is ENABLED
[2016-09-22 16:09:16.422010] E [socket.c:4052:socket_init] 0-glusterfs: failed to open /etc/ssl/dhparam.pem, DH ciphers are disabled
[2016-09-22 16:09:16.430991] E [socket.c:3060:socket_connect] 0-glusterfs: connection attempt on  failed, (Connection refused)
[2016-09-22 16:09:16.431141] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-09-22 16:09:16.431302] W [socket.c:590:__socket_rwv] 0-glusterfs: writev on ::1:24007 failed (Success)
[2016-09-22 16:09:16.431885] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f76666dd2a2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f76664a37fe] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f76664a390e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f76664a5064] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f76664a5940] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2016-09-22 16:09:16.431317 (xid=0x1)
[2016-09-22 16:09:16.431927] E [glusterfsd-mgmt.c:1706:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:gluster/glustershd)
[2016-09-22 16:09:16.431976] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(saved_frames_unwind+0x205) [0x7f76664a3825] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x536) [0x7f7666bd8056] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f7666bd1abb] ) 0-: received signum (0), shutting down
</snip>

Comment 4 SATHEESARAN 2016-09-22 16:41:26 UTC
Self-heals pending for a long time

<snip>

[root@ ~]# gluster volume heal engine info
Brick 10.70.36.73:/rhgs/engine/enginebrick
/file2 
/.shard 
Status: Connected
Number of entries: 2

Brick 10.70.36.76:/rhgs/engine/enginebrick
Status: Connected
Number of entries: 0

Brick 10.70.36.77:/rhgs/engine/enginebrick
/file2 
/.shard 
Status: Connected
Number of entries: 2

</snip>

Note: block size is just 512 MB

Comment 5 SATHEESARAN 2016-09-22 16:42:25 UTC
Created attachment 1203846 [details]
glusterd.log

Comment 6 SATHEESARAN 2016-09-22 16:43:02 UTC
Created attachment 1203847 [details]
glustershd.log

Comment 7 Mohit Agrawal 2016-09-26 05:48:29 UTC
Hi,
 
 After analysing the logs, it appears the connection is failing because the other end point of the socket (unix socket) is not connected, so it is throwing "Connection refused". After backporting the patch from upstream (https://bugzilla.redhat.com/show_bug.cgi?id=1333317), the issue seems to be resolved.
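
 To cross-check this, the endpoints seen in the logs can be inspected directly (a sketch, assuming the ss utility is available on the node): whether glusterd is listening on the management port that glustershd dials, and whether the glustershd unix socket from glusterd.log has a connected peer:

 ss -tlnp | grep 24007
 ss -xp | grep 6f50c5e5a1fe717d1cea6d3d9944edd4.socket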

 The patch (http://review.gluster.org/#/c/15567/) has been uploaded to merge the same.

Regards
Mohit Agrawal

Comment 12 SATHEESARAN 2016-10-13 10:45:42 UTC
Tested with glusterfs-3.8.4-2.el7rhgs.

After enabling SSL/TLS encryption on the management and data path, the self-heal daemon comes up and there are no errors or problems in its communication with glusterd.

# gluster volume status vmstore
Status of volume: vmstore
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.36.74:/rhgs/brick1/vmb1         49153     0          Y       18318
Brick 10.70.36.75:/rhgs/brick1/vmb1         49153     0          Y       17434
Brick 10.70.36.78:/rhgs/brick1/vmb1         49153     0          Y       15767
Self-heal Daemon on localhost               N/A       N/A        Y       12863
Self-heal Daemon on 10.70.36.75             N/A       N/A        Y       14456
Self-heal Daemon on 10.70.36.78             N/A       N/A        Y       15789
 
Task Status of Volume vmstore
------------------------------------------------------------------------------
There are no active volume tasks
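
For completeness, pending heals can be re-checked the same way as in comment 4; all bricks reporting zero entries would confirm that self-heal is working again (that output is not captured in this comment):

# gluster volume heal vmstore info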

Comment 14 errata-xmlrpc 2017-03-23 05:48:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

