Bug 1378528

Summary: [SSL] glustershd disconnected from glusterd
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: replicate
Version: rhgs-3.2
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Reporter: SATHEESARAN <sasundar>
Assignee: Mohit Agrawal <moagrawa>
QA Contact: SATHEESARAN <sasundar>
CC: amukherj, asrivast, moagrawa, rcyriac, rhinduja, rhs-bugs, storage-qa-internal
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Type: Bug
Fixed In Version: glusterfs-3.8.4-2
Environment: RHEV-RHGS-HCI, RHEL 7.2
Bug Blocks: 1351528
Last Closed: 2017-03-23 05:48:36 UTC
Attachments: glusterd.log, glustershd.log

Description SATHEESARAN 2016-09-22 16:23:43 UTC
Description of problem:
-----------------------
The Grafton (HCI) setup has SSL enabled. After starting the volume, glusterd.log continuously logs that glustershd has disconnected from glusterd. As a result, heals are not happening.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.2 interim build ( glusterfs-3.8.4-1.el7rhgs )

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Set up SSL on the management and data path (see the sketch after this list)
2. Start glusterd
3. Start the sharded replica 3 volume
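
For reference, a minimal sketch of step 1, assuming the standard GlusterFS SSL mechanism with certificates already placed at /etc/ssl/glusterfs.pem, /etc/ssl/glusterfs.key and /etc/ssl/glusterfs.ca on every node; the volume name "engine" is taken from the logs below, and the exact commands used on this setup may have differed:

<snip>
(management path, on every node)
# touch /var/lib/glusterd/secure-access
# systemctl restart glusterd

(data path, per volume)
# gluster volume set engine client.ssl on
# gluster volume set engine server.ssl on
</snip>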

Actual results:
---------------
glustershd disconnected from glusterd

Expected results:
-----------------
glustershd should be able to connect to glusterd, and heals should happen

Comment 2 SATHEESARAN 2016-09-22 16:29:54 UTC
Error messages in glusterd.log

<snip>
[2016-09-22 16:00:47.345717] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:02:23.354480] W [socket.c:590:__socket_rwv] 0-glustershd: readv on /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket failed (Invalid argument)
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd." repeated 39 times between [2016-09-22 16:00:47.345717] and [2016-
09-22 16:02:44.356399]
[2016-09-22 16:02:47.356743] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:04:13.854091] E [socket.c:353:ssl_setup_connection] 0-socket.management: SSL connect error (client: )
[2016-09-22 16:04:13.854170] E [socket.c:202:ssl_dump_error_stack] 0-socket.management:   error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number
[2016-09-22 16:04:13.854195] E [socket.c:2419:socket_poller] 0-socket.management: server setup failed
[2016-09-22 16:04:29.366244] W [socket.c:590:__socket_rwv] 0-glustershd: readv on /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket failed (Invalid argument)
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd." repeated 39 times between [2016-09-22 16:02:47.356743] and [2016-
09-22 16:04:44.367580]
[2016-09-22 16:04:47.367856] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:05:14.426256] I [MSGID: 106499] [glusterd-handler.c:4360:__glusterd_handle_status_volume] 0-management: Received status volume req for volume engine
</snip>

Comment 3 SATHEESARAN 2016-09-22 16:31:03 UTC
Error in glustershd.log

<snip>
[2016-09-22 16:09:16.412011] I [MSGID: 100030] [glusterfsd.c:2412:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket --xlator-option *replicate*.node-uuid=aec6ef55-7a72-4a14-b17e-fab690817e8c)
[2016-09-22 16:09:16.420485] I [socket.c:3974:socket_init] 0-socket.glusterfsd: SSL support for glusterd is ENABLED
[2016-09-22 16:09:16.420888] E [socket.c:4052:socket_init] 0-socket.glusterfsd: failed to open /etc/ssl/dhparam.pem, DH ciphers are disabled
[2016-09-22 16:09:16.421894] I [socket.c:3974:socket_init] 0-glusterfs: SSL support for glusterd is ENABLED
[2016-09-22 16:09:16.422010] E [socket.c:4052:socket_init] 0-glusterfs: failed to open /etc/ssl/dhparam.pem, DH ciphers are disabled
[2016-09-22 16:09:16.430991] E [socket.c:3060:socket_connect] 0-glusterfs: connection attempt on  failed, (Connection refused)
[2016-09-22 16:09:16.431141] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-09-22 16:09:16.431302] W [socket.c:590:__socket_rwv] 0-glusterfs: writev on ::1:24007 failed (Success)
[2016-09-22 16:09:16.431885] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f76666dd2a2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f76664a37fe] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f76664a390e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f76664a5064] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f76664a5940] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2016-09-22 16:09:16.431317 (xid=0x1)
[2016-09-22 16:09:16.431927] E [glusterfsd-mgmt.c:1706:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:gluster/glustershd)
[2016-09-22 16:09:16.431976] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(saved_frames_unwind+0x205) [0x7f76664a3825] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x536) [0x7f7666bd8056] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f7666bd1abb] ) 0-: received signum (0), shutting down
</snip>
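
As an aside, the "failed to open /etc/ssl/dhparam.pem, DH ciphers are disabled" lines above appear to be incidental (DH ciphers are simply skipped when the file is absent) rather than the cause of the disconnect; if DH ciphers are wanted, the file could be generated with something like:

<snip>
# openssl dhparam -out /etc/ssl/dhparam.pem 2048
</snip>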

Comment 4 SATHEESARAN 2016-09-22 16:41:26 UTC
Self-heals pending for a long time

<snip>

[root@ ~]# gluster volume heal engine info
Brick 10.70.36.73:/rhgs/engine/enginebrick
/file2 
/.shard 
Status: Connected
Number of entries: 2

Brick 10.70.36.76:/rhgs/engine/enginebrick
Status: Connected
Number of entries: 0

Brick 10.70.36.77:/rhgs/engine/enginebrick
/file2 
/.shard 
Status: Connected
Number of entries: 2

</snip>

Note: the block size is just 512 MB

Comment 5 SATHEESARAN 2016-09-22 16:42:25 UTC
Created attachment 1203846 [details]
glusterd.log

Comment 6 SATHEESARAN 2016-09-22 16:43:02 UTC
Created attachment 1203847 [details]
glustershd.log

Comment 7 Mohit Agrawal 2016-09-26 05:48:29 UTC
Hi,
 
After analysing the logs, it looks like the connection is failing because the other end point of the (unix) socket is not connected, so it throws "Connection refused". After backporting the patch from upstream (https://bugzilla.redhat.com/show_bug.cgi?id=1333317), the issue appears to be resolved.

Patch http://review.gluster.org/#/c/15567/ has been uploaded to merge the same.

Regards
Mohit Agrawal
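
A generic way to observe this state on the node (not part of the analysis above, just a diagnostic sketch) is to check whether anything is listening on glustershd's unix socket; the socket path is the one seen in the logs:

<snip>
# ls -l /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket
# ss -xl | grep /var/run/gluster
</snip>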

Comment 12 SATHEESARAN 2016-10-13 10:45:42 UTC
Tested with glusterfs-3.8.4-2.el7rhgs.

After enabling SSL/TLS encryption on the management and data path, the self-heal daemon comes up and there are no errors or problems in its communication with glusterd.

# gluster volume status vmstore
Status of volume: vmstore
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.36.74:/rhgs/brick1/vmb1         49153     0          Y       18318
Brick 10.70.36.75:/rhgs/brick1/vmb1         49153     0          Y       17434
Brick 10.70.36.78:/rhgs/brick1/vmb1         49153     0          Y       15767
Self-heal Daemon on localhost               N/A       N/A        Y       12863
Self-heal Daemon on 10.70.36.75             N/A       N/A        Y       14456
Self-heal Daemon on 10.70.36.78             N/A       N/A        Y       15789
 
Task Status of Volume vmstore
------------------------------------------------------------------------------
There are no active volume tasks
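
Heal progress can additionally be confirmed with the same command used in comment 4 (output omitted here):

<snip>
# gluster volume heal vmstore info
</snip>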

Comment 14 errata-xmlrpc 2017-03-23 05:48:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html