Description of problem:
-----------------------
The Grafton (HCI) setup has SSL enabled. After starting the volume, glusterd.log continuously reports that glustershd has disconnected from glusterd. The impact of this is that heals are not happening.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.2 interim build (glusterfs-3.8.4-1.el7rhgs)

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Set up SSL on the management and data path (a minimal sketch follows below)
2. Start glusterd
3. Start the sharded replica 3 volume

Actual results:
---------------
glustershd is disconnected from glusterd

Expected results:
-----------------
glustershd should be able to connect to glusterd, and heals should happen
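For context on step 1: in Gluster, management-path encryption is controlled by the presence of /var/lib/glusterd/secure-access on each node, and data-path encryption by the client.ssl/server.ssl volume options. A minimal sketch of the setup, assuming the standard certificate locations and a placeholder volume name (neither taken from this report):

# On every node, Gluster's SSL support expects:
#   /etc/ssl/glusterfs.pem - node certificate
#   /etc/ssl/glusterfs.key - private key
#   /etc/ssl/glusterfs.ca  - CA bundle (concatenated peer certs if self-signed)
# /etc/ssl/dhparam.pem is optional; without it DH ciphers are disabled,
# which is the warning visible in glustershd.log below.

# Enable TLS on the management path (requires a glusterd restart)
touch /var/lib/glusterd/secure-access
systemctl restart glusterd

# Enable TLS on the data path for a volume (run once, from any node)
gluster volume set <VOLNAME> client.ssl on
gluster volume set <VOLNAME> server.ssl on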
Error messages in glusterd.log

<snip>
[2016-09-22 16:00:47.345717] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:02:23.354480] W [socket.c:590:__socket_rwv] 0-glustershd: readv on /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket failed (Invalid argument)
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd." repeated 39 times between [2016-09-22 16:00:47.345717] and [2016-09-22 16:02:44.356399]
[2016-09-22 16:02:47.356743] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:04:13.854091] E [socket.c:353:ssl_setup_connection] 0-socket.management: SSL connect error (client: )
[2016-09-22 16:04:13.854170] E [socket.c:202:ssl_dump_error_stack] 0-socket.management: error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number
[2016-09-22 16:04:13.854195] E [socket.c:2419:socket_poller] 0-socket.management: server setup failed
[2016-09-22 16:04:29.366244] W [socket.c:590:__socket_rwv] 0-glustershd: readv on /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket failed (Invalid argument)
The message "I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd." repeated 39 times between [2016-09-22 16:02:47.356743] and [2016-09-22 16:04:44.367580]
[2016-09-22 16:04:47.367856] I [MSGID: 106006] [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management: glustershd has disconnected from glusterd.
[2016-09-22 16:05:14.426256] I [MSGID: 106499] [glusterd-handler.c:4360:__glusterd_handle_status_volume] 0-management: Received status volume req for volume engine
</snip>
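The "SSL3_GET_RECORD:wrong version number" error above typically means one side sent plaintext bytes on a connection the other side expected to be TLS, which can happen if management encryption is enabled on some peers but not others. One plausible check (not part of the original report; peer addresses are taken from the heal info output later in this report) is to confirm the secure-access marker exists on every node:

# Management encryption is keyed off this file; it must exist on every peer,
# or mixed plaintext/TLS connects will fail with exactly this kind of error
for h in 10.70.36.73 10.70.36.76 10.70.36.77; do
    ssh "$h" 'ls -l /var/lib/glusterd/secure-access'
done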
Error in glustershd.log

<snip>
[2016-09-22 16:09:16.412011] I [MSGID: 100030] [glusterfsd.c:2412:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket --xlator-option *replicate*.node-uuid=aec6ef55-7a72-4a14-b17e-fab690817e8c)
[2016-09-22 16:09:16.420485] I [socket.c:3974:socket_init] 0-socket.glusterfsd: SSL support for glusterd is ENABLED
[2016-09-22 16:09:16.420888] E [socket.c:4052:socket_init] 0-socket.glusterfsd: failed to open /etc/ssl/dhparam.pem, DH ciphers are disabled
[2016-09-22 16:09:16.421894] I [socket.c:3974:socket_init] 0-glusterfs: SSL support for glusterd is ENABLED
[2016-09-22 16:09:16.422010] E [socket.c:4052:socket_init] 0-glusterfs: failed to open /etc/ssl/dhparam.pem, DH ciphers are disabled
[2016-09-22 16:09:16.430991] E [socket.c:3060:socket_connect] 0-glusterfs: connection attempt on failed, (Connection refused)
[2016-09-22 16:09:16.431141] I [MSGID: 101190] [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2016-09-22 16:09:16.431302] W [socket.c:590:__socket_rwv] 0-glusterfs: writev on ::1:24007 failed (Success)
[2016-09-22 16:09:16.431885] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f76666dd2a2] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f76664a37fe] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f76664a390e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f76664a5064] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7f76664a5940] ))))) 0-glusterfs: forced unwinding frame type(GlusterFS Handshake) op(GETSPEC(2)) called at 2016-09-22 16:09:16.431317 (xid=0x1)
[2016-09-22 16:09:16.431927] E [glusterfsd-mgmt.c:1706:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file (key:gluster/glustershd)
[2016-09-22 16:09:16.431976] W [glusterfsd.c:1288:cleanup_and_exit] (-->/lib64/libgfrpc.so.0(saved_frames_unwind+0x205) [0x7f76664a3825] -->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x536) [0x7f7666bd8056] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7f7666bd1abb] ) 0-: received signum (0), shutting down
</snip>
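As the args line shows, glustershd fetches its volfile from glusterd (the GETSPEC call) at startup and exits when that fetch fails, which is why the daemon never stays up. Once the underlying connection problem is fixed, a common way to respawn the per-volume daemons (a general workaround, not one suggested in the original report) is:

# Respawn the self-heal daemon and any missing brick processes;
# harmless if everything is already running
gluster volume start engine force

# Confirm the self-heal daemon is up
gluster volume status engine shd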
Self-heals pending for a long time

<snip>
[root@ ~]# gluster volume heal engine info
Brick 10.70.36.73:/rhgs/engine/enginebrick
/file2
/.shard
Status: Connected
Number of entries: 2

Brick 10.70.36.76:/rhgs/engine/enginebrick
Status: Connected
Number of entries: 0

Brick 10.70.36.77:/rhgs/engine/enginebrick
/file2
/.shard
Status: Connected
Number of entries: 2
</snip>

Note: the block size is just 512MB
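For reference, once glustershd can reach glusterd again, the pending entries can be kicked off and watched with the standard heal commands (not part of the original report):

# Trigger index heal for the pending entries
gluster volume heal engine
# A full heal walks the whole brick instead; usually not needed here
# gluster volume heal engine full

# Watch the pending-entry count drain
watch -n 10 'gluster volume heal engine info'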
Created attachment 1203846 [details] glusterd.log
Created attachment 1203847 [details] glustershd.log
Hi,

After analysing the logs, the failure is because the other endpoint of the unix socket is not connected, so the connect attempt returns "Connection refused". After backporting the patch from upstream (https://bugzilla.redhat.com/show_bug.cgi?id=1333317), the issue appears to be resolved. The patch (http://review.gluster.org/#/c/15567/) has been posted for merge.

Regards,
Mohit Agrawal
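One plausible way to confirm that analysis on a live node (not part of the original comment; the socket path is the one from the logs above) is to check whether anything is actually listening on the shd notification socket:

# Is anything listening on the shd unix socket? If not, readv/connect on it
# fails exactly as seen in glusterd.log
ss -xl | grep 6f50c5e5a1fe717d1cea6d3d9944edd4.socket

# The socket file itself should also exist
ls -l /var/run/gluster/6f50c5e5a1fe717d1cea6d3d9944edd4.socket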
Tested with glusterfs-3.8.4-2.el7rhgs. After enabling SSL/TLS encryption on the management and data path, the self-heal daemon comes up and there are no errors or problems with its communication with glusterd.

# gluster volume status vmstore
Status of volume: vmstore
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.36.74:/rhgs/brick1/vmb1         49153     0          Y       18318
Brick 10.70.36.75:/rhgs/brick1/vmb1         49153     0          Y       17434
Brick 10.70.36.78:/rhgs/brick1/vmb1         49153     0          Y       15767
Self-heal Daemon on localhost               N/A       N/A        Y       12863
Self-heal Daemon on 10.70.36.75             N/A       N/A        Y       14456
Self-heal Daemon on 10.70.36.78             N/A       N/A        Y       15789

Task Status of Volume vmstore
------------------------------------------------------------------------------
There are no active volume tasks
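As an extra sanity check that encryption is actually in effect after the fix (not part of the original verification; the host is one of the brick nodes above and 24007 is glusterd's management port, as seen in the glustershd.log excerpt), the daemon logs and a TLS probe can be inspected:

# SSL should be reported as enabled in the daemon logs
grep -m1 'SSL support' /var/log/glusterfs/glustershd.log

# A TLS handshake against the management port should now succeed
openssl s_client -connect 10.70.36.74:24007 </dev/null 2>/dev/null | head -5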
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html