Bug 1452915

Summary: healing fails with wrong error when one of the glusterd holds a lock
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: replicate
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED ERRATA
QA Contact: Vijay Avuthu <vavuthu>
Severity: medium
Priority: high
Version: rhgs-3.3
CC: amukherj, moagrawa, ravishankar, rhinduja, rhs-bugs, sheggodu, srmukher, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: rebase
Fixed In Version: glusterfs-3.12.2-1
Doc Type: Bug Fix
Doc Text:
Previously, when the self-heal daemon was disabled using the heal disable command and a heal was then triggered manually with the "gluster volume heal <volname>" command, the command returned a message that was not useful. With this fix, trying to trigger a manual heal while the daemon is disabled returns a message telling you to start the daemon in order to trigger a heal.
Story Points: ---
Last Closed: 2018-09-04 06:32:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Depends On: 1333705, 1500658, 1500660, 1500662    
Bug Blocks: 1503134    

Description Nag Pavan Chilakam 2017-05-20 07:20:30 UTC
Description of problem:
=========================
In my brick-mux setup, in a situation where one of the cluster nodes' glusterd holds a lock and we try to trigger a heal, the heal fails with the message below:
Launching heal operation to perform index self heal on volume cross3-23 has been unsuccessful on bricks that are down. Please check if all brick processes are running.

However, no bricks are actually down.

I understand the node does not know whether bricks are down, but since it is unable to communicate with the other node, it should report a more generic error.



I checked the glusterd log, which shows the following:


[2017-05-20 07:04:34.371968] W [glusterd-locks.c:572:glusterd_mgmt_v3_lock] (-->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd4000) [0x7f058c6b5000] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd3f2e) [0x7f058c6b4f2e] -->/usr/lib64/glusterfs/3.8.4/xlator/mgmt/glusterd.so(+0xd930f) [0x7f058c6ba30f] ) 0-management: Lock for cross3-23 held by c4f9ba86-a666-4c72-a3cf-0d1339b36820
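
The UUID in that log line identifies the peer whose glusterd is holding the volume lock. A minimal way to map it back to a host (the log file name is an assumption and may be etc-glusterfs-glusterd.vol.log or glusterd.log depending on the version):

# grep "held by" /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | tail -1
# gluster pool list      <- lists each peer's UUID, hostname and state; match the UUID from the log line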


[root@dhcp35-45 ~]# gluster v heal cross3-23
Launching heal operation to perform index self heal on volume cross3-23 has been unsuccessful on bricks that are down. Please check if all brick processes are running.
[root@dhcp35-45 ~]# 
[root@dhcp35-45 ~]# gluster v status cross3-23
Another transaction is in progress for cross3-23. Please try again after sometime.
[root@dhcp35-45 ~]# 
[root@dhcp35-45 ~]# gluster v status cross3-23
Status of volume: cross3-23
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick23/cross3-23    49152     0          Y       6094 
Brick 10.70.35.130:/rhs/brick23/cross3-23   49152     0          Y       22705
Brick 10.70.35.122:/rhs/brick23/cross3-23   49152     0          Y       21893
Self-heal Daemon on localhost               N/A       N/A        Y       7811 
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       23028
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       23835
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       7709 
 
Task Status of Volume cross3-23
------------------------------------------------------------------------------
There are no active volume tasks



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Have a cluster of 3 nodes, create a 1x3 volume, and make sure all bricks are up and no heal is pending.
2. Simulate a situation where, say, n3's glusterd holds a lock on the volume (one way to approximate this is sketched after this list).
3. Issue a vol status; it will say "Another transaction is in progress for cross3-23. Please try again after sometime."
4. Trigger a manual heal by issuing gluster v heal <vname>.
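
One way to approximate step 2 is to keep the per-volume management lock busy from n3 with a loop of volume operations and, while it runs, issue the commands from another node. This is only an illustrative sketch, not the exact procedure behind this report; lock collisions are probabilistic, so a few attempts may be needed:

[root@n3 ~]# while true; do gluster volume status cross3-23 >/dev/null 2>&1; done &
[root@n1 ~]# gluster volume status cross3-23      <- may report "Another transaction is in progress ..."
[root@n1 ~]# gluster volume heal cross3-23        <- reported the misleading "bricks that are down" error before the fix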

Actual results:

It throws a wrong error saying that bricks are down.


Expected results:

It should throw a more generic error instead of a misleading statement.

Comment 4 Ravishankar N 2017-09-27 09:48:23 UTC
Moving to POST, patch is https://review.gluster.org/#/c/15724/

Comment 7 Vijay Avuthu 2018-04-12 10:24:57 UTC
Update:
=========

Build Used : glusterfs-fuse-3.12.2-7.el7rhgs.x86_64

1. Create a 1 x 3 replicate volume and start it
2. Bring one brick down
3. Issue a heal (a command-level sketch of these steps follows)
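
A hedged sketch of those steps (hostnames, brick paths and the volume name 13 are taken from the status output below; bringing the brick down by killing its process is an assumption about how it was done):

# gluster volume create 13 replica 3 10.70.35.61:/bricks/brick2/b0 10.70.35.174:/bricks/brick2/b1 10.70.35.17:/bricks/brick2/b1
# gluster volume start 13
# gluster volume status 13      <- note the PID of the brick to bring down
# kill <brick-pid>              <- run on the node hosting that brick, e.g. 10.70.35.174
# gluster volume heal 13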

# gluster vol status 13
Status of volume: 13
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.61:/bricks/brick2/b0         49154     0          Y       27632
Brick 10.70.35.174:/bricks/brick2/b1        N/A       N/A        N       N/A  
Brick 10.70.35.17:/bricks/brick2/b1         49152     0          Y       17443
Self-heal Daemon on localhost               N/A       N/A        Y       27654
Self-heal Daemon on dhcp35-136.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       8012 
Self-heal Daemon on dhcp35-17.lab.eng.blr.r
edhat.com                                   N/A       N/A        Y       17465
Self-heal Daemon on dhcp35-163.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       18956
Self-heal Daemon on dhcp35-214.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17538
Self-heal Daemon on dhcp35-174.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       11243
 
Task Status of Volume 13
------------------------------------------------------------------------------
There are no active volume tasks
 
# gluster vol heal 13 
Launching heal operation to perform index self heal on volume 13 has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
# 

Observation: The message "Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details." is not appropriate for the brick-down scenario. It should be user-friendly; in this case it should be something like "Bricks are down".

> Also, the patch mentioned in comment #4 looks different from the description of the problem. Could you please confirm?

Comment 13 Vijay Avuthu 2018-04-13 14:31:15 UTC
Update:
==========

Build used : glusterfs-server-3.12.2-7.el7rhgs.x86_64

Scenarios verified:

1. Self-heal-daemon Disabled

# gluster vol heal 23
Launching heal operation to perform index self heal on volume 23 has been unsuccessful:
Self-heal-daemon is disabled. Heal will not be triggered on volume 23
#

2. Self-heal-daemon Not running

# gluster vol heal 23
Launching heal operation to perform index self heal on volume 23 has been unsuccessful:
Self-heal daemon is not running. Check self-heal daemon log file.
#

3. volume stop

# gluster vol heal 23
Launching heal operation to perform index self heal on volume 23 has been unsuccessful:
Volume 23 is not started.
# 

4. brick down 

# gluster vol heal 23
Launching heal operation to perform index self heal on volume 23 has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
# 

5. Locking

# gluster vol heal 23
Launching heal operation to perform index self heal on volume 23 has been unsuccessful:
Another transaction is in progress for 23. Please try again after sometime.
#
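
For reference, a minimal sketch of how scenarios 1 and 3 above can be set up before re-running the heal command (standard gluster CLI; the exact commands used during verification are not recorded in this comment):

# gluster volume heal 23 disable      <- scenario 1: disable the self-heal daemon, then run "gluster vol heal 23"
# gluster volume stop 23              <- scenario 3: stop the volume, then run "gluster vol heal 23"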

Comment 15 errata-xmlrpc 2018-09-04 06:32:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

Comment 16 Red Hat Bugzilla 2023-09-14 03:57:53 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days