+++ This bug was initially created as a clone of Bug #1294612 +++
+++ This bug was initially created as a clone of Bug #1200252 +++

Description of problem:
When there are multiple sources to heal and one of the source bricks is down, triggering the self-heal daemon reports "Launching heal operation to perform index self heal on volume vol0 has been unsuccessful" even though the heal succeeds.

Version-Release number of selected component (if applicable):
[root@localhost ~]# rpm -qa | grep glusterfs
glusterfs-api-3.6.0.50-1.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.50-1.el6rhs.x86_64
glusterfs-3.6.0.50-1.el6rhs.x86_64
samba-glusterfs-3.6.509-169.4.el6rhs.x86_64
glusterfs-fuse-3.6.0.50-1.el6rhs.x86_64
glusterfs-server-3.6.0.50-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.50-1.el6rhs.x86_64
glusterfs-libs-3.6.0.50-1.el6rhs.x86_64
glusterfs-cli-3.6.0.50-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.50-1.el6rhs.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a 2x3 distributed-replicate volume and fuse-mount it.
2. Turn the self-heal daemon and data, metadata, and entry self-heal off.
3. Kill brick 3 and brick 6.
4. Create some files on the mount point.
5. Bring bricks 3 and 6 back up.
6. Kill brick 2 and brick 4, one in each replica subvolume.
7. Turn the self-heal daemon back on.
8. Trigger the self-heal daemon, e.g. gluster v heal vol0.
A sketch of these steps as shell commands follows the list.
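For reference, a rough translation of the steps above into shell commands. This is a sketch, not a verbatim transcript: the hostnames and brick paths are taken from the volume layout under Additional info, and the <...-pid> placeholders are hypothetical; the real brick PIDs must be looked up with 'gluster v status' and killed on their respective nodes.

# 1. create and start a 2x3 distributed-replicate volume, then fuse-mount it
gluster volume create vol0 replica 3 \
    10.70.47.143:/rhs/brick1/b1 10.70.47.145:/rhs/brick1/b2 10.70.47.150:/rhs/brick1/b3 \
    10.70.47.151:/rhs/brick1/b4 10.70.47.143:/rhs/brick2/b5 10.70.47.145:/rhs/brick2/b6
gluster volume start vol0
mkdir -p /mnt/vol0
mount -t glusterfs 10.70.47.143:/vol0 /mnt/vol0

# 2. disable every self-heal mechanism
gluster volume set vol0 cluster.self-heal-daemon off
gluster volume set vol0 cluster.data-self-heal off
gluster volume set vol0 cluster.metadata-self-heal off
gluster volume set vol0 cluster.entry-self-heal off

# 3. kill bricks 3 and 6 (one per replica subvolume)
kill <brick3-pid>   # on 10.70.47.150
kill <brick6-pid>   # on 10.70.47.145

# 4. create some files on the mount point
for i in $(seq 1 50); do echo data > /mnt/vol0/file$i; done

# 5. bring bricks 3 and 6 back up
gluster volume start vol0 force

# 6. kill bricks 2 and 4, again one per replica subvolume
kill <brick2-pid>   # on 10.70.47.145
kill <brick4-pid>   # on 10.70.47.151

# 7. re-enable the self-heal daemon, then 8. trigger the heal
gluster volume set vol0 cluster.self-heal-daemon on
gluster volume heal vol0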
Actual results:
"Launching heal operation to perform index self heal on volume vol0 has been unsuccessful" is reported, even though the self-heal completes.

Expected results:
The error message should not be displayed.

Additional info:
[root@localhost ~]# gluster v info

Volume Name: vol0
Type: Distributed-Replicate
Volume ID: d0e9e55c-a62d-4b2b-907d-d56f90e5d06f
Status: Started
Snap Volume: no
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.143:/rhs/brick1/b1
Brick2: 10.70.47.145:/rhs/brick1/b2
Brick3: 10.70.47.150:/rhs/brick1/b3
Brick4: 10.70.47.151:/rhs/brick1/b4
Brick5: 10.70.47.143:/rhs/brick2/b5
Brick6: 10.70.47.145:/rhs/brick2/b6
Options Reconfigured:
cluster.quorum-type: auto
performance.readdir-ahead: on
performance.write-behind: off
performance.read-ahead: off
performance.io-cache: off
performance.quick-read: off
performance.open-behind: off
cluster.self-heal-daemon: on
cluster.data-self-heal: off
cluster.metadata-self-heal: off
cluster.entry-self-heal: off
snap-max-hard-limit: 256
snap-max-soft-limit: 90
auto-delete: disable

[root@localhost ~]# gluster v status
Status of volume: vol0
Gluster process                            TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.143:/rhs/brick1/b1          49152     0          Y       32485
Brick 10.70.47.145:/rhs/brick1/b2          N/A       N/A        N       N/A
Brick 10.70.47.150:/rhs/brick1/b3          49152     0          Y       31465
Brick 10.70.47.151:/rhs/brick1/b4          49152     0          Y       16654
Brick 10.70.47.143:/rhs/brick2/b5          N/A       N/A        N       N/A
Brick 10.70.47.145:/rhs/brick2/b6          49153     0          Y       16126
NFS Server on localhost                    2049      0          Y       19006
Self-heal Daemon on localhost              N/A       N/A        Y       19015
NFS Server on 10.70.47.145                 2049      0          Y       15920
Self-heal Daemon on 10.70.47.145           N/A       N/A        Y       15929
NFS Server on 10.70.47.150                 2049      0          Y       31251
Self-heal Daemon on 10.70.47.150           N/A       N/A        Y       31262
NFS Server on 10.70.47.151                 2049      0          Y       31815
Self-heal Daemon on 10.70.47.151           N/A       N/A        Y       31824

Task Status of Volume vol0
------------------------------------------------------------------------------
There are no active volume tasks

--- Additional comment from Ravishankar N on 2015-12-29 05:05:16 EST ---

This is pretty easy to reproduce, like so:
1. Create a 1x3 replica using a 3-node cluster.
2. Kill one brick and run 'gluster vol heal <volname>'.

RCA: If any of the bricks is down, the glustershd of that node sends a -1 op_ret to glusterd, which eventually propagates it to the CLI. If op_ret is non-zero, the CLI prints "Launching heal...unsuccessful". For the bricks that are up and need heal, the healing happens without any issues. A reasonable fix seems to be to print a more meaningful message on the CLI, like "Launching heal operation to perform index self heal on volume vol0 has not been successful on all nodes. Please check if all brick processes are running."
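To make the RCA concrete, here is a hypothetical session (output abridged; file names and entry counts are illustrative) showing the mismatch between the launch message and the actual heal progress on the bricks that are up:

# one brick per replica set is down; trigger index heal
[root@localhost ~]# gluster volume heal vol0
Launching heal operation to perform index self heal on volume vol0 has been unsuccessful

# yet glustershd is draining the pending entries on the live bricks
[root@localhost ~]# gluster volume heal vol0 info
Brick 10.70.47.143:/rhs/brick1/b1
/file1
/file2
...
Number of entries: 12
...

# a little later, the same brick reports nothing left to heal
[root@localhost ~]# gluster volume heal vol0 info
Brick 10.70.47.143:/rhs/brick1/b1
Number of entries: 0
...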
REVIEW: http://review.gluster.org/13303 (cli/ afr: op_ret for index heal launch) posted (#1) for review on master by Ravishankar N (ravishankar)
REVIEW: http://review.gluster.org/13303 (cli/ afr: op_ret for index heal launch) posted (#2) for review on master by Ravishankar N (ravishankar)
REVIEW: http://review.gluster.org/13303 (cli/ afr: op_ret for index heal launch) posted (#3) for review on master by Ravishankar N (ravishankar)
REVIEW: http://review.gluster.org/13303 (cli/ afr: op_ret for index heal launch) posted (#4) for review on master by Ravishankar N (ravishankar)
COMMIT: http://review.gluster.org/13303 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit da33097c3d6492e3b468b4347e47c70828fb4320
Author: Ravishankar N <ravishankar>
Date:   Mon Jan 18 12:16:31 2016 +0000

    cli/ afr: op_ret for index heal launch

    Problem:
    If index heal is launched when some of the bricks are down, glustershd
    of that node sends a -1 op_ret to glusterd which eventually propagates
    it to the CLI. Also, glusterd sometimes sends an err_str and sometimes
    not (depending on the failure happening in the brick-op phase or
    commit-op phase). So the message that gets displayed varies in each case:
    "Launching heal operation to perform index self heal on volume testvol
    has been unsuccessful" (OR)
    "Commit failed on <host>. Please check log file for details."

    Fix:
    1. Modify afr_xl_op() to return -1 if index healing of at least one
       brick fails.
    2. Ignore glusterd's error string in gf_cli_heal_volume_cbk and print
       a more meaningful message.

    The patch also fixes a bug in glusterfs_handle_translator_op() where,
    if we encounter an error in the notify of one xlator, we break out of
    the loop instead of sending the notify to the other xlators.

    Change-Id: I957f6c4b4d0a45453ffd5488e425cab5a3e0acca
    BUG: 1302291
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/13303
    Reviewed-by: Anuradha Talur <atalur>
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
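With the fix in place, the CLI ignores glusterd's varying error strings and prints one consistent, actionable message whenever index heal could not be launched on every brick. A hypothetical post-fix session follows; the exact wording is inferred from the intent stated in the RCA above and should be treated as an assumption, not a verbatim quote from the code:

[root@localhost ~]# gluster volume heal vol0
Launching heal operation to perform index self heal on volume vol0 has been unsuccessful on bricks that are down. Please check if all brick processes are running.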
This bug is getting closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user