Description of problem:
=======================
Consider a 1 x 2 replicate volume (node1 and node2).

###############################################################################
Case1 : self-heal daemon process offline on node2
=================================================
Command : gluster v heal <volume_name>, run on node1 and node2.

The output of this command is not the same when it is executed on node1 and
on node2.

Output on node2 :
~~~~~~~~~~~~~~~~~
root@king [Jul-03-2013-16:01:59] >gluster v heal <volume_name>
Self-heal daemon is not running. Check self-heal daemon log file.

Output on node1 :
~~~~~~~~~~~~~~~~~
root@luigi [Jul-03-2013-16:02:18] >gluster v heal <volume_name>
Staging failed on 10.70.34.119. Please check the log file for more details.

Node2 reports "Self-heal daemon is not running" but does not say on which
machine the self-heal daemon is down. Node1 reports a completely different
message, "Staging failed on node2. Please check the log file", which is not
informative to the user:
a) What does "staging failed" mean?
b) Which log file should be checked?

For the same heal command executed on different nodes of the cluster, we get
two different outputs. The output should be more informative and identical
across all the nodes.

###############################################################################
Case2 : a) glusterd and glustershd processes offline on node2
        b) glusterd offline, glustershd online on node2
=========================================================
Command : volume heal <VOLNAME> info {healed | heal-failed | split-brain}

In either case a) or b), the heal info commands do not report anything about
node2 being offline. We are unable to fetch the self-heal information because
glusterd is not available.
Hence it would be more appropriate to report that the self-heal information
could not be fetched from the offline node, rather than just printing the
following for it:

Brick hicks:/rhs/brick1/brick1
Number of entries: 0

Output on node1:
~~~~~~~~~~~~~~~~
root@king [Jul-03-2013-15:40:09] >gluster v heal `gluster v list` info
Gathering Heal info on volume vol_rep has been successful

Brick king:/rhs/brick1/brick0
Number of entries: 11
/
/dir.1
/dir.2
/dir.3
/dir.4
/dir.5
/dir.6
/dir.7
/dir.8
/dir.9
/dir.10

Brick hicks:/rhs/brick1/brick1
Number of entries: 0

root@king [Jul-03-2013-15:40:30] >gluster v heal `gluster v list` info healed
Gathering Heal info on volume vol_rep has been successful

Brick king:/rhs/brick1/brick0
Number of entries: 0

Brick hicks:/rhs/brick1/brick1
Number of entries: 0

root@king [Jul-03-2013-15:41:07] >gluster v heal `gluster v list` info heal-failed
Gathering Heal info on volume vol_rep has been successful

Brick king:/rhs/brick1/brick0
Number of entries: 0

Brick hicks:/rhs/brick1/brick1
Number of entries: 0

root@king [Jul-03-2013-15:41:11] >gluster v heal `gluster v list` info split-brain
Gathering Heal info on volume vol_rep has been successful

Brick king:/rhs/brick1/brick0
Number of entries: 0

Brick hicks:/rhs/brick1/brick1
Number of entries: 0

###############################################################################
Case3 :
=======
When the self-heal daemon process is offline, "heal info" succeeds, but
"heal info <healed|heal-failed|split-brain>" fails with "Staging failed on
<node>. Please check the log file for more details".

How did the "heal info" command gather information when glustershd was
offline? Why did the command not fail with "staging failed" in this case too?
Also, the "Staging failed" message itself can be improved, as explained in
Case1.

Output on node1 when glusterd and glustershd were offline on node2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@luigi [Jul-04-2013-13:15:38] >gluster v heal vol_rep info
Gathering Heal info on volume vol_rep has been successful

Brick luigi:/rhs/brick1/brick0
Number of entries: 11
/
/file.11
/file.12
/file.13
/file.14
/file.15
/file.16
/file.17
/file.18
/file.19
/file.20

Brick lizzie:/rhs/brick1/brick1
Number of entries: 0

Output on node1 when glusterd was online and glustershd was offline on node2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@luigi [Jul-04-2013-13:17:43] >gluster v heal vol_rep info
Gathering Heal info on volume vol_rep has been successful

Brick luigi:/rhs/brick1/brick0
Number of entries: 0

Brick lizzie:/rhs/brick1/brick1
Number of entries: 0

root@luigi [Jul-04-2013-13:17:46] >gluster v heal vol_rep info healed
Staging failed on lizzie. Please check the log file for more details.

root@luigi [Jul-04-2013-13:18:01] >gluster v heal vol_rep info heal-failed
Staging failed on lizzie. Please check the log file for more details.

root@luigi [Jul-04-2013-13:18:05] >gluster v heal vol_rep info split-brain
Staging failed on lizzie. Please check the log file for more details.

Version-Release number of selected component (if applicable):
=============================================================
root@king [Jul-04-2013-13:58:28] >rpm -qa | grep glusterfs-server
glusterfs-server-3.4.0.12rhs.beta1-1.el6rhs.x86_64

root@king [Jul-04-2013-13:58:34] >gluster --version
glusterfs 3.4.0.12rhs.beta1 built on Jun 28 2013 06:41:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
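The Case2 complaint can be illustrated with a small parser over the heal-info
text: in the old format, a brick on an offline node still shows a plain zero,
indistinguishable from a genuinely clean brick. This is a sketch, not part of
gluster; the heal_summary helper is ours, and the embedded sample merely
mirrors the output quoted above.

```shell
# Sketch: summarise "gluster v heal <VOL> info" output per brick.
# heal_summary and sample_old are illustrative, not part of gluster.
heal_summary() {
    # Reads heal-info text on stdin; prints "<brick> <entry count>".
    awk '/^Brick / { brick = $2 }
         /^Number of entries:/ { print brick, $NF }'
}

sample_old='Brick king:/rhs/brick1/brick0
Number of entries: 11
Brick hicks:/rhs/brick1/brick1
Number of entries: 0'

# hicks shows 0 entries even though its glusterd is down -- nothing in
# the old format distinguishes "unreachable" from "nothing to heal".
printf '%s\n' "$sample_old" | heal_summary
```

Running the sketch prints one line per brick (11 for king, 0 for hicks),
which is exactly why a silent zero from an offline node is misleading.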
The current implementation addresses all concerns raised in the BZ:
---------------------------------------------------------------------
Case1: The behaviour is now consistent and tells the user which log to check:

[root@ravi2 glusterfs]# gluster v heal testvol
Self-heal daemon is not running. Check self-heal daemon log file.

[root@ravi1 ~]# gluster v heal testvol
Staging failed on 10.70.42.252. Error: Self-heal daemon is not running.
Check self-heal daemon log file.

Note: 'Staging failed' cannot be removed because of the way the glusterd
transaction works.
-----------------------------------------------------------------------
Case2: If the brick/glusterd of a node is down, heal-info now reports ENOTCONN:

[root@ravi1 ~]# gluster v heal testvol info
Brick ravi1:/brick/brick1/
Number of entries: 0

Brick 10.70.42.252:/brick/brick1
Status: Transport endpoint is not connected
--------------------------------------------------------------------------
Case3: 'healed' and 'heal-failed' have been deprecated. 'info' and
'info split-brain' have been re-implemented using the glfsheal binary and
show the correct output even if the self-heal daemon is offline.
------------------------------------------------------------------
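With the fixed output format, an unreachable brick is explicit, so a wrapper
script can flag it instead of trusting a silent zero. A minimal sketch under
that assumption; the disconnected_bricks helper is ours, and the sample text
mirrors the Case2 output quoted above.

```shell
# Sketch: flag bricks that the fixed heal-info format marks as unreachable.
# disconnected_bricks and sample_new are illustrative, not part of gluster.
disconnected_bricks() {
    # Reads heal-info text on stdin; prints each brick whose status line
    # reports a disconnected transport endpoint.
    awk '/^Brick / { brick = $2 }
         /^Status: Transport endpoint is not connected/ { print brick }'
}

sample_new='Brick ravi1:/brick/brick1/
Number of entries: 0
Brick 10.70.42.252:/brick/brick1
Status: Transport endpoint is not connected'

printf '%s\n' "$sample_new" | disconnected_bricks
```

On the sample above this prints only the 10.70.42.252 brick, so monitoring
can distinguish "nothing to heal" from "node unreachable".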