Bug 1396010 - [Disperse] healing should not start if only data bricks are UP
Summary: [Disperse] healing should not start if only data bricks are UP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Ashish Pandey
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks: 1399072 1399989 1417147
 
Reported: 2016-11-17 09:04 UTC by Prasad Desala
Modified: 2018-08-16 06:43 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.8.4-19
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1399072
Environment:
Last Closed: 2017-09-21 04:28:23 UTC
Embargoed:




Links
System: Red Hat Product Errata
ID: RHBA-2017:2774
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: glusterfs bug fix and enhancement update
Last Updated: 2017-09-21 08:16:29 UTC

Description Prasad Desala 2016-11-17 09:04:30 UTC
Description of problem:
=======================
On a 2 x (4 + 2) distributed-disperse volume, I killed bricks equal to the redundancy count (2 per subvolume, 4 in total) and observed that healing started, which is not expected because all the data bricks are still up and running.

Version-Release number of selected component (if applicable):
3.8.4-5.el7rhgs.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
1) Create a distributed disperse volume and start it.
2) Fuse mount the volume on a client.
3) Kill the bricks based on the redundancy count.
4) From mount point, untar linux kernel package and wait till it completes.

5) Check gluster vol heal <volname> info; heal entries show up, i.e. heal is getting triggered (see the command sketch below).
I am also seeing high CPU utilization on the nodes, and we suspect it is growing because of this issue.
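
A minimal reproduction sketch of the above steps, assuming hypothetical node names (server1-server6), brick paths (/bricks/brick1, /bricks/brick2), volume name (ecvol), and kernel tarball path; the actual test setup may differ:

    # create and start a 2 x (4+2) distributed-disperse volume (12 bricks, 2 subvolumes)
    gluster volume create ecvol disperse-data 4 redundancy 2 \
        server{1..6}:/bricks/brick1 server{1..6}:/bricks/brick2
    gluster volume start ecvol

    # fuse mount the volume on the client
    mount -t glusterfs server1:/ecvol /mnt/ecvol

    # kill redundancy-count bricks (2 per subvolume, 4 in total);
    # pick the brick PIDs from the volume status output
    gluster volume status ecvol
    kill -9 <brick-pid-1> <brick-pid-2> <brick-pid-3> <brick-pid-4>

    # generate I/O from the mount point
    cd /mnt/ecvol && tar xf /root/linux-4.x.tar.xz

    # all data bricks are still up, yet heal entries show up here
    gluster volume heal ecvol info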

d x (k + n) --> where d is the distribute count, k is the data brick count per subvolume, and n is the redundancy count.
Here: 2 x (4 + 2)

Actual results:
===============
Even though all the data bricks are up, healing is getting started.

Expected results:
=================
Healing should not happen as all the data bricks are up and running.
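
A quick check of the expected behaviour once fixed, with only the redundant bricks down (same hypothetical volume name ecvol as above); heal info should list zero entries:

    gluster volume heal ecvol info
    # expected output for each brick that is still up (sketch):
    #   Brick serverN:/bricks/brickM
    #   Status: Connected
    #   Number of entries: 0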

Comment 7 Atin Mukherjee 2017-03-24 08:45:19 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/101286

Comment 9 Nag Pavan Chilakam 2017-06-23 06:57:31 UTC
QATP:
======
tc#1) Test the scenario described in the Description, i.e. bring down the redundant number of bricks first and then do I/O; with the fix, shd CPU consumption should come down ====> PASS: CPU is mostly at <2% and sometimes peaks to 10% (but hardly for a second), so acceptable.
tc#2) Keep I/O going and then bring down one redundant brick after another; CPU utilization should come down, but it does not as long as I/O is in progress (tried with linux untar) ====> FAIL; raising a new bz.

The steps in this bz, i.e. tc#1, are passing. However, the fix has not covered all the cases, hence moving it to VERIFIED while raising a new bz for tc#2.
Raised BZ#1464336 - self-heal daemon CPU consumption not reducing when I/O is going on and all redundant bricks are brought down one after another (for tc#2).
Test version: 3.8.4-28
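
For reference, a rough way to sample the self-heal daemon CPU consumption during tc#1/tc#2 on a storage node, assuming pgrep and the sysstat pidstat utility are available (the numbers quoted above came from observation, not from this snippet):

    # find the self-heal daemon PID
    pgrep -f glustershd

    # sample its CPU usage every 5 seconds, 12 samples;
    # with the fix, tc#1 should stay mostly below 2% with only brief peaks
    pidstat -u -p "$(pgrep -f glustershd | head -n 1)" 5 12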

Comment 11 errata-xmlrpc 2017-09-21 04:28:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774


