+++ This bug was initially created as a clone of Bug #1294612 +++
+++ This bug was initially created as a clone of Bug #1200252 +++

Description of problem:
When there are multiple sources to heal and one of the source bricks is down, triggering the self-heal daemon reports "Launching heal operation to perform index self heal on volume vol0 has been unsuccessful" even though the heal succeeds.

Version-Release number of selected component (if applicable):
[root@localhost ~]# rpm -qa | grep glusterfs
glusterfs-api-3.6.0.50-1.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.50-1.el6rhs.x86_64
glusterfs-3.6.0.50-1.el6rhs.x86_64
samba-glusterfs-3.6.509-169.4.el6rhs.x86_64
glusterfs-fuse-3.6.0.50-1.el6rhs.x86_64
glusterfs-server-3.6.0.50-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.50-1.el6rhs.x86_64
glusterfs-libs-3.6.0.50-1.el6rhs.x86_64
glusterfs-cli-3.6.0.50-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.50-1.el6rhs.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a 2x3 distributed-replicate volume and fuse-mount it.
2. Turn the self-heal daemon and data, metadata, and entry self-heal off.
3. Kill brick 3 and brick 6.
4. Create some files on the mount point.
5. Bring bricks 3 and 6 back up.
6. Kill brick 2 and brick 4, one in each replica subvolume.
7. Turn the self-heal daemon back on.
8. Trigger the self-heal daemon, e.g. gluster v heal vol0.
A sketch of these steps as shell commands follows the list.
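For reference, a rough translation of the steps above into shell commands. This is a sketch, not a verbatim transcript: the hostnames and brick paths are taken from the volume layout under Additional info, and the <...-pid> placeholders are hypothetical; the real brick PIDs must be looked up with 'gluster v status' and killed on their respective nodes.

# 1. create and start a 2x3 distributed-replicate volume, then fuse-mount it
gluster volume create vol0 replica 3 \
    10.70.47.143:/rhs/brick1/b1 10.70.47.145:/rhs/brick1/b2 10.70.47.150:/rhs/brick1/b3 \
    10.70.47.151:/rhs/brick1/b4 10.70.47.143:/rhs/brick2/b5 10.70.47.145:/rhs/brick2/b6
gluster volume start vol0
mkdir -p /mnt/vol0
mount -t glusterfs 10.70.47.143:/vol0 /mnt/vol0

# 2. disable every self-heal mechanism
gluster volume set vol0 cluster.self-heal-daemon off
gluster volume set vol0 cluster.data-self-heal off
gluster volume set vol0 cluster.metadata-self-heal off
gluster volume set vol0 cluster.entry-self-heal off

# 3. kill bricks 3 and 6 (one per replica subvolume)
kill <brick3-pid>   # on 10.70.47.150
kill <brick6-pid>   # on 10.70.47.145

# 4. create some files on the mount point
for i in $(seq 1 50); do echo data > /mnt/vol0/file$i; done

# 5. bring bricks 3 and 6 back up
gluster volume start vol0 force

# 6. kill bricks 2 and 4, again one per replica subvolume
kill <brick2-pid>   # on 10.70.47.145
kill <brick4-pid>   # on 10.70.47.151

# 7. re-enable the self-heal daemon, then 8. trigger the heal
gluster volume set vol0 cluster.self-heal-daemon on
gluster volume heal vol0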
Actual results:
"Launching heal operation to perform index self heal on volume vol0 has been unsuccessful" is reported, even though the self-heal completes.

Expected results:
The error message should not be displayed.

Additional info:
[root@localhost ~]# gluster v info

Volume Name: vol0
Type: Distributed-Replicate
Volume ID: d0e9e55c-a62d-4b2b-907d-d56f90e5d06f
Status: Started
Snap Volume: no
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.143:/rhs/brick1/b1
Brick2: 10.70.47.145:/rhs/brick1/b2
Brick3: 10.70.47.150:/rhs/brick1/b3
Brick4: 10.70.47.151:/rhs/brick1/b4
Brick5: 10.70.47.143:/rhs/brick2/b5
Brick6: 10.70.47.145:/rhs/brick2/b6
Options Reconfigured:
cluster.quorum-type: auto
performance.readdir-ahead: on
performance.write-behind: off
performance.read-ahead: off
performance.io-cache: off
performance.quick-read: off
performance.open-behind: off
cluster.self-heal-daemon: on
cluster.data-self-heal: off
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

[root@localhost ~]# gluster v status
Status of volume: vol0
Gluster process                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.143:/rhs/brick1/b1          49152     0          Y       32485
Brick 10.70.47.145:/rhs/brick1/b2          N/A       N/A        N       N/A
Brick 10.70.47.150:/rhs/brick1/b3          49152     0          Y       31465
Brick 10.70.47.151:/rhs/brick1/b4          49152     0          Y       16654
Brick 10.70.47.143:/rhs/brick2/b5          N/A       N/A        N       N/A
Brick 10.70.47.145:/rhs/brick2/b6          49153     0          Y       16126
NFS Server on localhost                    2049      0          Y       19006
Self-heal Daemon on localhost              N/A       N/A        Y       19015
NFS Server on 10.70.47.145                 2049      0          Y       15920
Self-heal Daemon on 10.70.47.145           N/A       N/A        Y       15929
NFS Server on 10.70.47.150                 2049      0          Y       31251
Self-heal Daemon on 10.70.47.150           N/A       N/A        Y       31262
NFS Server on 10.70.47.151                 2049      0          Y       31815
Self-heal Daemon on 10.70.47.151           N/A       N/A        Y       31824

Task Status of Volume vol0
------------------------------------------------------------------------------
There are no active volume tasks

--- Additional comment from Ravishankar N on 2015-12-29 05:05:16 EST ---

This is pretty easy to reproduce, like so:
1. Create a 1x3 replica using a 3-node cluster.
2. Kill one brick and run 'gluster vol heal <volname>'.

RCA: If any of the bricks is down, the glustershd of that node sends a -1 op_ret to glusterd, which eventually propagates it to the CLI. If op_ret is non-zero, the CLI prints "Launching heal...unsuccessful". For the bricks that are up and need heal, the healing happens without any issues. A reasonable fix seems to be to print a more meaningful message on the CLI, like "Launching heal operation to perform index self heal on volume vol0 has not been successful on all nodes. Please check if all brick processes are running."
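To make the RCA concrete, here is a hypothetical session (output abridged; file names and entry counts are illustrative) showing the mismatch between the launch message and the actual heal progress on the bricks that are up:

# one brick per replica set is down; trigger index heal
[root@localhost ~]# gluster volume heal vol0
Launching heal operation to perform index self heal on volume vol0 has been unsuccessful

# yet glustershd is draining the pending entries on the live bricks
[root@localhost ~]# gluster volume heal vol0 info
Brick 10.70.47.143:/rhs/brick1/b1
/file1
/file2
...
Number of entries: 12
...

# a little later, the same brick reports nothing left to heal
[root@localhost ~]# gluster volume heal vol0 info
Brick 10.70.47.143:/rhs/brick1/b1
Number of entries: 0
...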
REVIEW: http://review.gluster.org/13303 (cli/ afr: op_ret for index heal launch) posted (#1) for review on master by Ravishankar N (ravishankar)
REVIEW: http://review.gluster.org/13303 (cli/ afr: op_ret for index heal launch) posted (#2) for review on master by Ravishankar N (ravishankar)
REVIEW: http://review.gluster.org/13303 (cli/ afr: op_ret for index heal launch) posted (#3) for review on master by Ravishankar N (ravishankar)
REVIEW: http://review.gluster.org/13303 (cli/ afr: op_ret for index heal launch) posted (#4) for review on master by Ravishankar N (ravishankar)
COMMIT: http://review.gluster.org/13303 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit da33097c3d6492e3b468b4347e47c70828fb4320
Author: Ravishankar N <ravishankar>
Date:   Mon Jan 18 12:16:31 2016 +0000

    cli/ afr: op_ret for index heal launch

    Problem:
    If index heal is launched when some of the bricks are down, glustershd
    of that node sends a -1 op_ret to glusterd which eventually propagates
    it to the CLI. Also, glusterd sometimes sends an err_str and sometimes
    not (depending on the failure happening in the brick-op phase or
    commit-op phase). So the message that gets displayed varies in each case:
    "Launching heal operation to perform index self heal on volume testvol
    has been unsuccessful" (OR)
    "Commit failed on <host>. Please check log file for details."

    Fix:
    1. Modify afr_xl_op() to return -1 if index healing of at least one
       brick fails.
    2. Ignore glusterd's error string in gf_cli_heal_volume_cbk and print
       a more meaningful message.

    The patch also fixes a bug in glusterfs_handle_translator_op() where,
    if we encounter an error in the notify of one xlator, we break out of
    the loop instead of sending the notify to the other xlators.

    Change-Id: I957f6c4b4d0a45453ffd5488e425cab5a3e0acca
    BUG: 1302291
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/13303
    Reviewed-by: Anuradha Talur <atalur>
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
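With the fix in place, the CLI ignores glusterd's varying error strings and prints one consistent, actionable message whenever index heal could not be launched on every brick. A hypothetical post-fix session follows; the exact wording is inferred from the intent stated in the RCA above and should be treated as an assumption, not a verbatim quote from the code:

[root@localhost ~]# gluster volume heal vol0
Launching heal operation to perform index self heal on volume vol0 has been unsuccessful on bricks that are down. Please check if all brick processes are running.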
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user