Description of problem:
Healing fails as long as a brick is down in a 4+2 EC volume
Version-Release number of selected component (if applicable):
=============================================================
glusterfs-server-3.12.2-13.el7rhgs.x86_64
How reproducible:
Always (3/3)
Steps to Reproduce:
===================
1. Create a 4+2 EC volume (a shell sketch of all six steps follows this list).
2. Keep appending to a file on the mount.
3. Bring down brick b1.
4. After a minute or so, bring down brick b2.
5. After another minute or so, bring b1 back up.
6. Healing fails to start for b1.
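A minimal shell sketch of these steps (hostnames, brick paths, and PIDs are assumptions; the volume name "dispersed" is taken from the CLI output below):

# 1. Create and start a 4+2 EC volume (6 bricks; hosts/paths hypothetical)
gluster volume create dispersed disperse-data 4 redundancy 2 \
    server{1..6}:/bricks/brick0/dispersed force
gluster volume start dispersed
# 2. Keep appending to a file from a client mount
mount -t glusterfs server1:/dispersed /mnt/dispersed
while true; do echo data >> /mnt/dispersed/file1; done &
# 3. Bring down b1 (PID from 'gluster volume status dispersed')
kill -15 <b1-glusterfsd-pid>
# 4. After a minute or so, bring down b2 on its node
sleep 60; kill -15 <b2-glusterfsd-pid>
# 5. Bring b1 back up ('start force' restarts down bricks; note it may
#    also restart b2, so re-kill b2 on its node if needed)
sleep 60; gluster volume start dispersed force
# 6. Trigger heal and watch whether it starts for b1
gluster volume heal dispersed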
Actual results:
==============
Healing is failing for b1
Expected results:
================
Healing should start for b1
Additional info:
===============
[root@dhcp35-56 ~]# gluster v heal dispersed
Launching heal operation to perform index self heal on volume dispersed has been unsuccessful:
Commit failed on 10.70.35.3. Please check log file for details.
[root@dhcp35-56 ~]#
10.70.35.3 is the node hosting b2, which is down.
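For reference, the per-brick heal backlog can be inspected with heal info (standard gluster CLI; the exact status text is indicative):

gluster volume heal dispersed info
# lists entries pending heal per brick; a down brick typically reports
# "Status: Transport endpoint is not connected"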
Logs:
[2018-07-19 06:22:02.123328] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-dispersed-client-3: changing port to 49152 (from 0)
[2018-07-19 06:22:02.132033] E [socket.c:2369:socket_connect_finish] 0-dispersed-client-3: connection to 10.70.35.3:49152 failed (Connection refused); disconnecting socket
[2018-07-19 06:22:06.132851] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-dispersed-client-3: changing port to 49152 (from 0)
[2018-07-19 06:22:06.137905] E [socket.c:2369:socket_connect_finish] 0-dispersed-client-3: connection to 10.70.35.3:49152 failed (Connection refused); disconnecting socket
[2018-07-19 06:22:10.151806] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-dispersed-client-3: changing port to 49152 (from 0)
[2018-07-19 06:22:10.156943] E [socket.c:2369:socket_connect_finish] 0-dispersed-client-3: connection to 10.70.35.3:49152 failed (Connection refused); disconnecting socket
[2018-07-19 06:22:14.155717] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-dispersed-client-3: changing port to 49152 (from 0)
[2018-07-19 06:22:14.163562] E [socket.c:2369:socket_connect_finish] 0-dispersed-client-3: connection to 10.70.35.3:49152 failed (Connection refused); disconnecting socket
[2018-07-19 06:22:18.163595] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-dispersed-client-3: changing port to 49152 (from 0)
[2018-07-19 06:22:18.172639] E [socket.c:2369:socket_connect_finish] 0-dispersed-client-3: connection to 10.70.35.3:49152 failed (Connection refused); disconnecting socket
[2018-07-19 06:22:22.174819] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 0-dispersed-client-3: changing port to 49152 (from 0)
[2018-07-19 06:22:22.184626] E [socket.c:2369:socket_connect_finish] 0-dispersed-client-3: connection to 10.70.35.3:49152 failed (Connection refused); disconnecting socket
[2018-07-19 06:22:26
--- Additional comment from nchilaka on 2018-08-08 12:30:31 IST ---
I discussed with Upasana, and based on further analysis, below is the summary.
Healing was in fact happening when I checked on my setup.
However, the error message is misleading, and the message itself is a regression; hence I am changing the title.
If Upasana finds that the file was indeed not healing (she is unable to recollect at this point, given that this bug was raised about 20 days ago), she will raise a new bug with the reason for the heal not happening.
One very important note: the error message differs between the latest live build 3.8.4-54.15 and 3.12.2-15.
For the steps mentioned by Upasana, the error messages are below (bricks were killed with pkill).
There is also a simpler testcase that does not even need any IO running:
take an EC volume, kill a brick on one node, then kill another brick on another node, and issue a heal command (sketched just below).
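A sketch of that simplified testcase (node names are assumptions; the volume name "ecv" matches the output below):

# on node1: kill the brick process (pkill takes down all bricks on the node)
pkill glusterfsd        # or: kill -15 <glusterfsd-pid>
# on node2: kill a second brick the same way
pkill glusterfsd
# from any node: issue the heal and observe the error message
gluster volume heal ecv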
3.8.4-54.15:
-----------
Launching heal operation to perform index self heal on volume ecv has been unsuccessful on bricks that are down. Please check if all brick processes are running.
Note: I checked with kill -9/-15 and even with brick multiplexing on, and saw the same error message.
3.12.2-15:
--------
Launching heal operation to perform index self heal on volume dispersed has been unsuccessful:
Commit failed on 10.70.35.3. Please check log file for details.
pkill glusterfsd / kill -15 <glusterfsd-pid>:
Launching heal operation to perform index self heal on volume ecv has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
--- Additional comment from nchilaka on 2018-08-08 12:31:30 IST ---
To confirm: this is a regression only in the CLI error message.
Healing as such has no problem.
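A quick way to verify that healing is actually progressing (standard gluster CLI; volume name from the report above) is to watch the pending-entry count drain:

gluster volume heal dispersed statistics heal-count
# run repeatedly; a decreasing per-brick count indicates heal is working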
--- Additional comment from Atin Mukherjee on 2018-08-09 14:41:05 IST ---
The only difference I see is that instead of mentioning that some of the bricks are down, we're now highlighting "commit has failed on node x, please check log file".
The change which introduced this is as follows:
Author: Mohit Agrawal <moagrawa>
Date: Tue Oct 25 19:57:02 2016 +0530
cli/afr: gluster volume heal info "healed" command output is not appropriate
Problem: "gluster volume heal info [healed] [heal-failed]" command
output on terminal is not appropriate in case of down any volume.
Solution: To make message more appropriate change the condition
in function "gd_syncop_mgmt_brick_op".
Test : To verify the fix followed below procedure
1) Create 2*3 distribute replicate volume
2) set self-heal daemon off
3) kill two bricks (3, 6)
4) create some file on mount point
5) bring brick 3,6 up
6) kill other two brick (2 and 4)
7) make self heal daemon on
8) run "gluster v heal <vol-name>"
Note: After apply the patch options (healed | heal-failed) will deprecate
from command line.
> BUG: 1388509
> Change-Id: I229c320c9caeb2525c76b78b44a53a64b088545a
> Signed-off-by: Mohit Agrawal <moagrawa>
> (Cherry pick from commit d1f15cdeb609a1b720a04a502f7a63b2d3922f41)
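For reference, the test procedure from the commit message roughly maps to the following commands (volume name, hosts, brick paths, and PIDs are assumptions; cluster.self-heal-daemon is the standard volume option):

# 1) create a 2x3 distributed-replicate volume
gluster volume create testvol replica 3 host{1..6}:/bricks/b0/testvol force
gluster volume start testvol
# 2) turn the self-heal daemon off
gluster volume set testvol cluster.self-heal-daemon off
# 3) kill bricks 3 and 6 (on their respective nodes)
kill -15 <brick3-pid>; kill -15 <brick6-pid>
# 4) create some files on the mount point
touch /mnt/testvol/file{1..10}
# 5) bring bricks 3 and 6 back up ('start force' restarts down bricks)
gluster volume start testvol force
# 6) kill bricks 2 and 4
kill -15 <brick2-pid>; kill -15 <brick4-pid>
# 7) turn the self-heal daemon back on
gluster volume set testvol cluster.self-heal-daemon on
# 8) run the heal
gluster volume heal testvol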