Description of problem:

When glusterd is bound to a specific IP address, shd operates correctly, but the 'vol heal info' command attempts to contact glusterd on 127.0.0.1 and fails. Admins have no visibility of self-heal activity, and automated monitoring (Nagios etc.) will not show heal information either.

Version-Release number of selected component (if applicable):
glusterfs 3.7.x (and prior releases?)

How reproducible:
Every time. I noticed this in a test environment I'm using for containers, where I bind glusterd to the host IP.

Steps to Reproduce:
1. Use transport.socket.bind-address to bind glusterd to a specific IP on each host.
2. Run the 'gluster vol heal <vol> info' command.

Actual results:

In the self-heal log file:

    [2015-08-25 03:50:55.412092] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
    [2015-08-25 03:50:55.424020] E [socket.c:2332:socket_connect_finish] 0-gfapi: connection to 127.0.0.1:24007 failed (Connection refused)
    [2015-08-25 03:50:55.424055] E [MSGID: 104024] [glfs-mgmt.c:738:mgmt_rpc_notify] 0-glfs-mgmt: failed to connect with remote-host: localhost (Transport endpoint is not connected) [Transport endpoint is not connected]
    [2015-08-25 03:50:55.424071] I [MSGID: 104025] [glfs-mgmt.c:744:mgmt_rpc_notify] 0-glfs-mgmt: Exhausted all volfile servers [Transport endpoint is not connected]

Expected results:
'vol heal info' should work whether glusterd is bound to a specific IP or not.

Additional info:

The issue has been discussed with Ravi (ravishankar) and Humble (hchiramm) and identified as localhost being hardcoded in the glfs_set_volfile_server() call within glfs-heal.c, e.g.:

    ret = glfs_set_volfile_server (fs, "tcp", "localhost", 24007);
    if (ret) {
            printf ("Setting the volfile server failed, %s\n",
                    strerror (errno));
            goto out;
    }
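For reference, binding glusterd to a specific address is done with the transport.socket.bind-address option in glusterd's volfile. A minimal sketch of such a configuration (the file path and the 192.168.1.10 address are examples; use your distribution's path and the node's own IP):

```
# /etc/glusterfs/glusterd.vol (path may vary by distribution)
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
    option transport.socket.bind-address 192.168.1.10
end-volume
```

glusterd must be restarted for the option to take effect; with this in place, shd works but 'vol heal info' hits the failure described above.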
AFAICT, the hardcoded part (-s localhost) in glusterd-spawned clients like quota, rebalance, etc. can be changed to use the bind-address. For example, in the quota crawler daemon we currently have:

    runinit (&runner);
    runner_add_args (&runner, SBIN_DIR"/glusterfs",
                     "-s", "localhost",
                     "--volfile-id", volname,
                     "--use-readdirp=no",
                     "--client-pid", QUOTA_CRAWL_PID,
                     "-l", logfile, mountdir, NULL);

IIUC, we have the bind address in the dict (THIS->options) as 'transport.socket.bind-address'. We can use this address instead of 'localhost' and pass it to the runner, something like:

    char *vol_server = "localhost";
    if (dict_get (THIS->options, "transport.socket.bind-address"))
            vol_server = data_to_str (dict_get (THIS->options,
                                      "transport.socket.bind-address"));

    runner_add_args (&runner, SBIN_DIR"/glusterfs",
-                    "-s", "localhost",
+                    "-s", vol_server,
The bz comment #2 gives a workaround for glusterd clients like quota, rebalance, etc. However, IIUC, we need a different solution for libgfapi-based clients like glfsheal.
(In reply to Humble Chirammal from comment #3) > The bz comment #2 gives a workaround for glusterd clients like > quota, rebalance. It should have been 'replace-brick' instead of 'quota, rebalance'. As mentioned earlier, 'heal', 'quota', etc. need a solution where the CLI makes an RPC call to glusterd, gets the bind IP, and uses it for the client connection.
patch: https://code.engineering.redhat.com/gerrit/62580
Tested with glusterfs-server-3.7.5-11. After adding the transport.socket.bind-address = <IP address> entry in glusterd.info, I am able to launch heal and self-heal is working fine, so marking this bug as verified.
Hi Ashiq, I have updated the doc-text info. Please sign off on it if it looks OK.
Looks good to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html