Bug 1396010

Summary: [Disperse] healing should not start if only data bricks are UP
Product: Red Hat Gluster Storage Reporter: Prasad Desala <tdesala>
Component: disperseAssignee: Ashish Pandey <aspandey>
Status: CLOSED ERRATA QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.2CC: amukherj, aspandey, asrivast, pkarampu, rhinduja, rhs-bugs, storage-qa-internal, ubansal
Target Milestone: ---   
Target Release: RHGS 3.3.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.8.4-19 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1399072 (view as bug list) Environment:
Last Closed: 2017-09-21 04:28:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1399072, 1399989, 1417147    

Description Prasad Desala 2016-11-17 09:04:30 UTC
Description of problem:
=======================
On a 2 x (4 + 2) distributed-disperse volume, I killed 4 bricks (up to the redundancy count, i.e. 2 per subvolume) and can see that healing has started, which is not expected since all the data bricks are up and running.

Version-Release number of selected component (if applicable):
3.8.4-5.el7rhgs.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
1) Create a distributed disperse volume and start it.
2) Fuse mount the volume on a client.
3) Kill the bricks based on the redundancy count.
4) From mount point, untar linux kernel package and wait till it completes.

Check gluster vol heal <volname> info: heal is being triggered even though all the data bricks are up. I am also seeing high CPU utilization on the nodes, and we suspect this issue is what is driving the CPU utilization up. (A rough command sequence for these steps is sketched below.)
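A rough command sequence for the steps above (a sketch only; the volume name, server names, brick paths, mount point and tarball location are placeholders, not taken from this report):

    # create and start a 2 x (4 + 2) distributed-disperse volume (12 bricks)
    gluster volume create testvol disperse-data 4 redundancy 2 \
        server{1..6}:/bricks/b1/brick server{1..6}:/bricks/b2/brick force
    gluster volume start testvol

    # fuse mount on the client
    mount -t glusterfs server1:/testvol /mnt/testvol

    # note the brick PIDs, then kill bricks up to the redundancy count
    # (2 per disperse subvolume, 4 in total for this layout)
    gluster volume status testvol
    kill -9 <brick-pid-1> <brick-pid-2> <brick-pid-3> <brick-pid-4>

    # run the workload from the mount point
    cd /mnt/testvol && tar xf /path/to/linux-kernel.tar.xz

    # heal should stay idle, since all data bricks are still up
    gluster vol heal testvol info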

d x (k + n) --> where d is the number of disperse subvolumes, k is the data brick count and n is the redundancy count
2 x (4 + 2)
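For this layout, gluster volume info reports the arrangement explicitly (assuming the volume is named testvol as in the sketch above):

    gluster volume info testvol | grep 'Number of Bricks'
    # Number of Bricks: 2 x (4 + 2) = 12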

Actual results:
===============
Even though all the data bricks are up, healing is getting started.

Expected results:
=================
Healing should not happen as all the data bricks are up and running.

Comment 7 Atin Mukherjee 2017-03-24 08:45:19 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/101286

Comment 9 Nag Pavan Chilakam 2017-06-23 06:57:31 UTC
QATP:
======
tc#1) Test the scenario described in the Description, i.e. bring down the redundant number of bricks first and then do IO; with the fix, the shd CPU consumption should come down. ====> PASS: CPU is mostly below 2% and occasionally peaks at 10% (but hardly for a second), which is acceptable.
tc#2) Keep doing IO and then bring down one redundant brick after another; CPU utilization should come down, but it does not as long as IO is going on (tried with a linux untar). ====> FAIL; raising a new bz.

The steps in this bz (tc#1) are passing; however, the fix has not covered all the cases, so I am moving this to verified while raising a new bz for tc#2.
Raised BZ#1464336 - self-heal daemon CPU consumption not reducing when IOs are going on and all redundant bricks are brought down one after another (for tc#2).
Test version: 3.8.4-28
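For reference, one way to spot-check the shd CPU consumption used as the pass/fail signal above (a sketch; the volume name and PID are placeholders):

    # the self-heal daemon PID is listed in the volume status output
    gluster volume status testvol | grep -i 'self-heal'

    # watch that process's CPU usage while the redundant bricks are down and IO runs
    top -p <shd-pid>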

Comment 11 errata-xmlrpc 2017-09-21 04:28:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774
