Bug 1396010

Summary: [Disperse] healing should not start if only data bricks are UP
Product: Red Hat Gluster Storage Reporter: Prasad Desala <tdesala>
Component: disperseAssignee: Ashish Pandey <aspandey>
Status: CLOSED ERRATA QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.2CC: amukherj, aspandey, asrivast, pkarampu, rhinduja, rhs-bugs, storage-qa-internal, ubansal
Target Milestone: ---   
Target Release: RHGS 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.8.4-19 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1399072 (view as bug list) Environment:
Last Closed: 2017-09-21 04:28:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1399072, 1399989, 1417147    

Description Prasad Desala 2016-11-17 09:04:30 UTC
Description of problem:
=======================
On a 2 x (4 + 2) distributed-disperse volume, I killed 4 bricks (up to the redundancy count, i.e. 2 per subvolume) and can see that healing has started, which is not expected since all the data bricks are up and running.

Version-Release number of selected component (if applicable):
3.8.4-5.el7rhgs.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
1) Create a distributed disperse volume and start it.
2) Fuse mount the volume on a client.
3) Kill the bricks based on the redundancy count.
4) From mount point, untar linux kernel package and wait till it completes.

Check gluster vol heal <volname> info: heal is being triggered even though all the data bricks are up. I am also seeing high CPU utilization on the nodes, and we suspect this issue is what is driving the CPU utilization up. (A rough command sequence for these steps is sketched below.)
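A rough command sequence for the steps above (a sketch only; the volume name, server names, brick paths, mount point and tarball location are placeholders, not taken from this report):

    # create and start a 2 x (4 + 2) distributed-disperse volume (12 bricks)
    gluster volume create testvol disperse-data 4 redundancy 2 \
        server{1..6}:/bricks/b1/brick server{1..6}:/bricks/b2/brick force
    gluster volume start testvol

    # fuse mount on the client
    mount -t glusterfs server1:/testvol /mnt/testvol

    # note the brick PIDs, then kill bricks up to the redundancy count
    # (2 per disperse subvolume, 4 in total for this layout)
    gluster volume status testvol
    kill -9 <brick-pid-1> <brick-pid-2> <brick-pid-3> <brick-pid-4>

    # run the workload from the mount point
    cd /mnt/testvol && tar xf /path/to/linux-kernel.tar.xz

    # heal should stay idle, since all data bricks are still up
    gluster vol heal testvol info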

d x (k + n) --> where d is the number of disperse subvolumes, k is the data brick count and n is the redundancy count
2 x (4 + 2)
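For this layout, gluster volume info reports the arrangement explicitly (assuming the volume is named testvol as in the sketch above):

    gluster volume info testvol | grep 'Number of Bricks'
    # Number of Bricks: 2 x (4 + 2) = 12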

Actual results:
===============
Even though all the data bricks are up, healing is getting started.

Expected results:
=================
Healing should not happen as all the data bricks are up and running.

Comment 7 Atin Mukherjee 2017-03-24 08:45:19 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/101286

Comment 9 Nag Pavan Chilakam 2017-06-23 06:57:31 UTC
QATP:
======
tc#1) Test the scenario described in the Description, i.e. bring down the redundant number of bricks first and then do IO; with the fix, the shd CPU consumption should come down. ====> PASS: CPU is mostly below 2% and occasionally peaks at 10% (but hardly for a second), which is acceptable.
tc#2) Keep doing IO and then bring down one redundant brick after another; CPU utilization should come down, but it does not as long as IO is going on (tried with a linux untar). ====> FAIL; raising a new bz.

The steps in this bz (tc#1) are passing; however, the fix has not covered all the cases, so I am moving this to verified while raising a new bz for tc#2.
Raised BZ#1464336 - self-heal daemon CPU consumption not reducing when IOs are going on and all redundant bricks are brought down one after another (for tc#2).
Test version: 3.8.4-28
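For reference, one way to spot-check the shd CPU consumption used as the pass/fail signal above (a sketch; the volume name and PID are placeholders):

    # the self-heal daemon PID is listed in the volume status output
    gluster volume status testvol | grep -i 'self-heal'

    # watch that process's CPU usage while the redundant bricks are down and IO runs
    top -p <shd-pid>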

Comment 11 errata-xmlrpc 2017-09-21 04:28:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
