Description of problem:
=======================
Consider a 1 x 2 replicate volume (node1 and node2).

###############################################################################
Case1 : self-heal daemon process offline on node2
=================================================
Command : gluster v heal <volume_name>, run on node1 and node2.

The output of this command is not the same when it is executed on node1 and
on node2.

Output on node2 :
~~~~~~~~~~~~~~~~~
root@king [Jul-03-2013-16:01:59] >gluster v heal <volume_name>
Self-heal daemon is not running. Check self-heal daemon log file.

Output on node1 :
~~~~~~~~~~~~~~~~~
root@luigi [Jul-03-2013-16:02:18] >gluster v heal <volume_name>
Staging failed on 10.70.34.119. Please check the log file for more details.

Node2 reports "Self-heal daemon is not running" but does not say on which
machine the self-heal daemon is down. Node1 reports a completely different
message, "Staging failed on node2. Please check the log file", which is not
informative to the user:
a) What does "staging failed" mean?
b) Which log file should be checked?

For the same heal command executed on different nodes of the cluster, we get
two different outputs. The output should be more informative and identical
across all the nodes.

###############################################################################
Case2 : a) glusterd and glustershd processes offline on node2
        b) glusterd offline, glustershd online on node2
=========================================================
Command : volume heal <VOLNAME> info {healed | heal-failed | split-brain}

In either case a) or b), the heal info commands do not report anything about
node2 being offline. We are unable to fetch the self-heal information because
glusterd is not available.
Hence it would be more appropriate to report that the self-heal information
could not be fetched from the offline node, rather than just printing the
following for it:

Brick hicks:/rhs/brick1/brick1
Number of entries: 0

Output on node1:
~~~~~~~~~~~~~~~~
root@king [Jul-03-2013-15:40:09] >gluster v heal `gluster v list` info
Gathering Heal info on volume vol_rep has been successful

Brick king:/rhs/brick1/brick0
Number of entries: 11
/
/dir.1
/dir.2
/dir.3
/dir.4
/dir.5
/dir.6
/dir.7
/dir.8
/dir.9
/dir.10

Brick hicks:/rhs/brick1/brick1
Number of entries: 0

root@king [Jul-03-2013-15:40:30] >gluster v heal `gluster v list` info healed
Gathering Heal info on volume vol_rep has been successful

Brick king:/rhs/brick1/brick0
Number of entries: 0

Brick hicks:/rhs/brick1/brick1
Number of entries: 0

root@king [Jul-03-2013-15:41:07] >gluster v heal `gluster v list` info heal-failed
Gathering Heal info on volume vol_rep has been successful

Brick king:/rhs/brick1/brick0
Number of entries: 0

Brick hicks:/rhs/brick1/brick1
Number of entries: 0

root@king [Jul-03-2013-15:41:11] >gluster v heal `gluster v list` info split-brain
Gathering Heal info on volume vol_rep has been successful

Brick king:/rhs/brick1/brick0
Number of entries: 0

Brick hicks:/rhs/brick1/brick1
Number of entries: 0

###############################################################################
Case3 :
=======
When the self-heal daemon process is offline, "heal info" succeeds, but
"heal info <healed|heal-failed|split-brain>" fails with "Staging failed on
<node>. Please check the log file for more details".

How did the "heal info" command gather information when glustershd was
offline? Why did the command not fail with "staging failed" in this case too?
Also, the "Staging failed" message itself can be improved, as explained in
Case1.

Output on node1 when glusterd and glustershd were offline on node2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@luigi [Jul-04-2013-13:15:38] >gluster v heal vol_rep info
Gathering Heal info on volume vol_rep has been successful

Brick luigi:/rhs/brick1/brick0
Number of entries: 11
/
/file.11
/file.12
/file.13
/file.14
/file.15
/file.16
/file.17
/file.18
/file.19
/file.20

Brick lizzie:/rhs/brick1/brick1
Number of entries: 0

Output on node1 when glusterd was online and glustershd was offline on node2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@luigi [Jul-04-2013-13:17:43] >gluster v heal vol_rep info
Gathering Heal info on volume vol_rep has been successful

Brick luigi:/rhs/brick1/brick0
Number of entries: 0

Brick lizzie:/rhs/brick1/brick1
Number of entries: 0

root@luigi [Jul-04-2013-13:17:46] >gluster v heal vol_rep info healed
Staging failed on lizzie. Please check the log file for more details.

root@luigi [Jul-04-2013-13:18:01] >gluster v heal vol_rep info heal-failed
Staging failed on lizzie. Please check the log file for more details.

root@luigi [Jul-04-2013-13:18:05] >gluster v heal vol_rep info split-brain
Staging failed on lizzie. Please check the log file for more details.

Version-Release number of selected component (if applicable):
=============================================================
root@king [Jul-04-2013-13:58:28] >rpm -qa | grep glusterfs-server
glusterfs-server-3.4.0.12rhs.beta1-1.el6rhs.x86_64

root@king [Jul-04-2013-13:58:34] >gluster --version
glusterfs 3.4.0.12rhs.beta1 built on Jun 28 2013 06:41:38
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
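The Case2 complaint can be illustrated with a small parser over the heal-info
text: in the old format, a brick on an offline node still shows a plain zero,
indistinguishable from a genuinely clean brick. This is a sketch, not part of
gluster; the heal_summary helper is ours, and the embedded sample merely
mirrors the output quoted above.

```shell
# Sketch: summarise "gluster v heal <VOL> info" output per brick.
# heal_summary and sample_old are illustrative, not part of gluster.
heal_summary() {
    # Reads heal-info text on stdin; prints "<brick> <entry count>".
    awk '/^Brick / { brick = $2 }
         /^Number of entries:/ { print brick, $NF }'
}

sample_old='Brick king:/rhs/brick1/brick0
Number of entries: 11
Brick hicks:/rhs/brick1/brick1
Number of entries: 0'

# hicks shows 0 entries even though its glusterd is down -- nothing in
# the old format distinguishes "unreachable" from "nothing to heal".
printf '%s\n' "$sample_old" | heal_summary
```

Running the sketch prints one line per brick (11 for king, 0 for hicks),
which is exactly why a silent zero from an offline node is misleading.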
The current implementation addresses all concerns raised in the BZ:
---------------------------------------------------------------------
Case1: The behaviour is now consistent and tells the user which log to check:

[root@ravi2 glusterfs]# gluster v heal testvol
Self-heal daemon is not running. Check self-heal daemon log file.

[root@ravi1 ~]# gluster v heal testvol
Staging failed on 10.70.42.252. Error: Self-heal daemon is not running.
Check self-heal daemon log file.

Note: 'Staging failed' cannot be removed because of the way the glusterd
transaction works.
-----------------------------------------------------------------------
Case2: If the brick/glusterd of a node is down, heal-info now reports ENOTCONN:

[root@ravi1 ~]# gluster v heal testvol info
Brick ravi1:/brick/brick1/
Number of entries: 0

Brick 10.70.42.252:/brick/brick1
Status: Transport endpoint is not connected
--------------------------------------------------------------------------
Case3: 'healed' and 'heal-failed' have been deprecated. 'info' and
'info split-brain' have been re-implemented using the glfsheal binary and
show the correct output even if the self-heal daemon is offline.
------------------------------------------------------------------
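With the fixed output format, an unreachable brick is explicit, so a wrapper
script can flag it instead of trusting a silent zero. A minimal sketch under
that assumption; the disconnected_bricks helper is ours, and the sample text
mirrors the Case2 output quoted above.

```shell
# Sketch: flag bricks that the fixed heal-info format marks as unreachable.
# disconnected_bricks and sample_new are illustrative, not part of gluster.
disconnected_bricks() {
    # Reads heal-info text on stdin; prints each brick whose status line
    # reports a disconnected transport endpoint.
    awk '/^Brick / { brick = $2 }
         /^Status: Transport endpoint is not connected/ { print brick }'
}

sample_new='Brick ravi1:/brick/brick1/
Number of entries: 0
Brick 10.70.42.252:/brick/brick1
Status: Transport endpoint is not connected'

printf '%s\n' "$sample_new" | disconnected_bricks
```

On the sample above this prints only the 10.70.42.252 brick, so monitoring
can distinguish "nothing to heal" from "node unreachable".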