Bug 1378528 - [SSL] glustershd disconnected from glusterd
Summary: [SSL] glustershd disconnected from glusterd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Mohit Agrawal
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks: 1351528
 
Reported: 2016-09-22 16:23 UTC by SATHEESARAN
Modified: 2017-03-23 05:48 UTC
CC: 7 users

Fixed In Version: glusterfs-3.8.4-2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
RHEV-RHGS-HCI RHEL 7.2
Last Closed: 2017-03-23 05:48:36 UTC
Embargoed:


Attachments
glusterd.log (357.91 KB, text/plain)
2016-09-22 16:42 UTC, SATHEESARAN
glustershd.log (76.58 KB, text/plain)
2016-09-22 16:43 UTC, SATHEESARAN


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1420324 0 unspecified CLOSED [GSS] The bricks once disconnected not connects back if SSL is enabled 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2017:0486 0 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Internal Links: 1420324

Description SATHEESARAN 2016-09-22 16:23:43 UTC
Description of problem:
-----------------------
The Grafton (HCI) setup has SSL enabled. After starting the volume, glusterd.log continuously logs an error that glustershd has disconnected from glusterd. The impact of this is that heals are not happening.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.2 interim build ( glusterfs-3.8.4-1.el7rhgs )

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Set up SSL on the management and data path (a sketch of the typical commands follows below)
2. Start glusterd
3. Start the sharded replica 3 volume
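
For reference, the typical GlusterFS procedure for step 1 is sketched below. It assumes the SSL certificate, key and CA bundle are already in place at the default paths (/etc/ssl/glusterfs.pem, /etc/ssl/glusterfs.key, /etc/ssl/glusterfs.ca) on every node, and uses the "engine" volume from this setup as the example.

On every node, management-path encryption is enabled by creating the secure-access file and restarting glusterd:

touch /var/lib/glusterd/secure-access
systemctl restart glusterd

Data-path (I/O) encryption is then enabled per volume, from any one node:

gluster volume set engine client.ssl on
gluster volume set engine server.ssl on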

Actual results:
---------------
glustershd disconnected from glusterd

Expected results:
-----------------
glustershd should be able to connect to glusterd, and heals should happen

Comment 2 SATHEESARAN 2016-09-22 16:29:54 UTC
Error messages in glusterd.log

<snip>
[2016-09-22 16:00:47.345717] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:02:23.354480] W [socket.c:590:__socket_rwv] 0-glustershd: readv on /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket failed (Invalid argument)
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd." repeated 39 times between [2016-09-22 16:00:47.345717] and [2016-
09-22 16:02:44.356399]
[2016-09-22 16:02:47.356743] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:04:13.854091] E [socket.c:353:ssl_setup_connection] 0-socket.management: SSL connect error (client: )
[2016-09-22 16:04:13.854170] E [socket.c:202:ssl_dump_error_stack] 0-socket.management:   error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number
[2016-09-22 16:04:13.854195] E [socket.c:2419:socket_poller] 0-socket.management: server setup failed
[2016-09-22 16:04:29.366244] W [socket.c:590:__socket_rwv] 0-glustershd: readv on /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket failed (Invalid argument)
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd." repeated 39 times between [2016-09-22 16:02:47.356743] and [2016-
09-22 16:04:44.367580]
[2016-09-22 16:04:47.367856] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:05:14.426256] I [MSGID: 106499] [glusterd-handler.c:4360:__glusterd_handle_status_volume] 0-management: Received status volume req for volume engine
</snip>

Comment 3 SATHEESARAN 2016-09-22 16:31:03 UTC
Error in glustershd.log

<snip>
[2016-09-22 16:09:16.412011] I [MSGID: 100030] [glusterfsd.c:2412:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket --xlator-option *replicate*.node-uuid=aec6ef55-7a72-4a14-b17e-fab690817e8c)
[2016-09-22 16:09:16.420485] I [socket.c:3974:socket_init] 0-socket.glusterfsd: SSL support for glusterd is ENABLED
[2016-09-22 16:09:16.420888] E [socket.c:4052:socket_init] 0-socket.glusterfsd: failed to open /etc/ssl/dhparam.pem, DH ciphers are disabled
[2016-09-22 16:09:16.421894] I [socket.c:3974:socket_init] 0-glusterfs: SSL support for glusterd is ENABLED
[2016-09-22 16:09:16.422010] E [socket.c:4052:socket_init] 0-glusterfs: failed to open /etc/ssl/dhparam.pem, DH ciphers are disabled
[2016-09-22 16:09:16.430991] E [socket.c:3060:socket_connect] 0-glusterfs: connection attempt on  failed, (Connection refused)
[2016-09-22 16:09:16.431141] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-09-22 16:09:16.431302] W [socket.c:590:__socket_rwv] 0-glusterfs: writev on ::1:24007 failed (Success)
[2016-09-22 16:09:16.431885] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f76666dd2a2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f76664a37fe] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f76664a390e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f76664a5064] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f76664a5940] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2016-09-22 16:09:16.431317 (xid=0x1)
[2016-09-22 16:09:16.431927] E [glusterfsd-mgmt.c:1706:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:gluster/glustershd)
[2016-09-22 16:09:16.431976] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(saved_frames_unwind+0x205) [0x7f76664a3825] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x536) [0x7f7666bd8056] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f7666bd1abb] ) 0-: received signum (0), shutting down
</snip>

Comment 4 SATHEESARAN 2016-09-22 16:41:26 UTC
Self-heals pending for a long time

<snip>

[root@ ~]# gluster volume heal engine info
Brick 10.70.36.73:/rhgs/engine/enginebrick
/file2 
/.shard 
Status: Connected
Number of entries: 2

Brick 10.70.36.76:/rhgs/engine/enginebrick
Status: Connected
Number of entries: 0

Brick 10.70.36.77:/rhgs/engine/enginebrick
/file2 
/.shard 
Status: Connected
Number of entries: 2

</snip>

Note: block size is just 512 MB

Comment 5 SATHEESARAN 2016-09-22 16:42:25 UTC
Created attachment 1203846 [details]
glusterd.log

Comment 6 SATHEESARAN 2016-09-22 16:43:02 UTC
Created attachment 1203847 [details]
glustershd.log

Comment 7 Mohit Agrawal 2016-09-26 05:48:29 UTC
Hi,
 
 After analysing the logs, it appears the connection is failing because the other end point of the socket (unix socket) is not connected, so it is throwing "Connection refused". After backporting the patch from upstream (https://bugzilla.redhat.com/show_bug.cgi?id=1333317), the issue seems to be resolved.
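
 To cross-check this, the endpoints seen in the logs can be inspected directly (a sketch, assuming the ss utility is available on the node): whether glusterd is listening on the management port that glustershd dials, and whether the glustershd unix socket from glusterd.log has a connected peer:

 ss -tlnp | grep 24007
 ss -xp | grep 6f50c5e5a1fe717d1cea6d3d9944edd4.socket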

 The patch (http://review.gluster.org/#/c/15567/) has been uploaded to merge the same.

Regards
Mohit Agrawal

Comment 12 SATHEESARAN 2016-10-13 10:45:42 UTC
Tested with glusterfs-3.8.4-2.el7rhgs.

After enabling SSL/TLS encryption on the management and data path, the self-heal daemon comes up and there are no errors or problems in its communication with glusterd.

# gluster volume status vmstore
Status of volume: vmstore
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.36.74:/rhgs/brick1/vmb1         49153     0          Y       18318
Brick 10.70.36.75:/rhgs/brick1/vmb1         49153     0          Y       17434
Brick 10.70.36.78:/rhgs/brick1/vmb1         49153     0          Y       15767
Self-heal Daemon on localhost               N/A       N/A        Y       12863
Self-heal Daemon on 10.70.36.75             N/A       N/A        Y       14456
Self-heal Daemon on 10.70.36.78             N/A       N/A        Y       15789
 
Task Status of Volume vmstore
------------------------------------------------------------------------------
There are no active volume tasks
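
For completeness, pending heals can be re-checked the same way as in comment 4; all bricks reporting zero entries would confirm that self-heal is working again (that output is not captured in this comment):

# gluster volume heal vmstore info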

Comment 14 errata-xmlrpc 2017-03-23 05:48:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

