Description of problem:

When glusterd is bound to a specific IP address, shd operates correctly, but the 'vol heal info' command attempts to contact glusterd on 127.0.0.1 and fails. Admins have no visibility of self-heal activity, and automated monitoring (Nagios etc.) will not show heal information either.

Version-Release number of selected component (if applicable):
glusterfs 3.7.x (and prior releases?)

How reproducible:
Every time. I noticed this in a test environment I'm using for containers, where I bind glusterd to the host IP.

Steps to Reproduce:
1. Use transport.socket.bind-address to bind glusterd to a specific IP on each host.
2. Run the 'gluster vol heal <vol> info' command.

Actual results:

In the self-heal log file:

    [2015-08-25 03:50:55.412092] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
    [2015-08-25 03:50:55.424020] E [socket.c:2332:socket_connect_finish] 0-gfapi: connection to 127.0.0.1:24007 failed (Connection refused)
    [2015-08-25 03:50:55.424055] E [MSGID: 104024] [glfs-mgmt.c:738:mgmt_rpc_notify] 0-glfs-mgmt: failed to connect with remote-host: localhost (Transport endpoint is not connected) [Transport endpoint is not connected]
    [2015-08-25 03:50:55.424071] I [MSGID: 104025] [glfs-mgmt.c:744:mgmt_rpc_notify] 0-glfs-mgmt: Exhausted all volfile servers [Transport endpoint is not connected]

Expected results:
'vol heal info' should work whether glusterd is bound to a specific IP or not.

Additional info:

The issue has been discussed with Ravi (ravishankar) and Humble (hchiramm) and identified as localhost being hardcoded in the glfs_set_volfile_server() call within glfs-heal.c, e.g.:

    ret = glfs_set_volfile_server (fs, "tcp", "localhost", 24007);
    if (ret) {
            printf ("Setting the volfile server failed, %s\n",
                    strerror (errno));
            goto out;
    }
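For reference, binding glusterd to a specific address is done with the transport.socket.bind-address option in glusterd's volfile. A minimal sketch of such a configuration (the file path and the 192.168.1.10 address are examples; use your distribution's path and the node's own IP):

```
# /etc/glusterfs/glusterd.vol (path may vary by distribution)
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket
    option transport.socket.bind-address 192.168.1.10
end-volume
```

glusterd must be restarted for the option to take effect; with this in place, shd works but 'vol heal info' hits the failure described above.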
AFAICT, the hardcoded part (-s localhost) in glusterd-spawned clients like quota, rebalance, etc. can be changed to use the bind-address. For example, in the quota crawler daemon we currently have:

    runinit (&runner);
    runner_add_args (&runner, SBIN_DIR"/glusterfs",
                     "-s", "localhost",
                     "--volfile-id", volname,
                     "--use-readdirp=no",
                     "--client-pid", QUOTA_CRAWL_PID,
                     "-l", logfile, mountdir, NULL);

IIUC, we have the bind address in the dict (THIS->options) as 'transport.socket.bind-address'. We can use this address instead of 'localhost' and pass it to the runner, something like:

    char *vol_server = "localhost";
    if (dict_get (THIS->options, "transport.socket.bind-address"))
            vol_server = data_to_str (dict_get (THIS->options,
                                      "transport.socket.bind-address"));

    runner_add_args (&runner, SBIN_DIR"/glusterfs",
-                    "-s", "localhost",
+                    "-s", vol_server,
The bz comment #2 gives a workaround for glusterd clients like quota, rebalance, etc. However, IIUC, we need a different solution for libgfapi-based clients like glfsheal.
(In reply to Humble Chirammal from comment #3) > The bz comment #2 gives a workaround for glusterd clients like > quota, rebalance. It should have been 'replace-brick' instead of 'quota, rebalance'. As mentioned earlier, 'heal', 'quota', etc. need a solution where the CLI makes an RPC call to glusterd, gets the bind IP, and uses it for the client connection.
patch: https://code.engineering.redhat.com/gerrit/62580
Tested with glusterfs-server-3.7.5-11. After adding the transport.socket.bind-address = <IP address> entry in glusterd.info, I am able to launch heal and self-heal is working fine, so marking this bug as verified.
Hi Ashiq, I have updated the doc-text info. Please sign off on it if it looks OK.
Looks good to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0193.html