Bug 1396010 - [Disperse] healing should not start if only data bricks are UP
Summary: [Disperse] healing should not start if only data bricks are UP
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Ashish Pandey
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks: 1399072 1399989 1417147
 
Reported: 2016-11-17 09:04 UTC by Prasad Desala
Modified: 2018-08-16 06:43 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.8.4-19
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1399072
Environment:
Last Closed: 2017-09-21 04:28:23 UTC
Embargoed:




Links
System: Red Hat Product Errata
ID: RHBA-2017:2774
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: glusterfs bug fix and enhancement update
Last Updated: 2017-09-21 08:16:29 UTC

Description Prasad Desala 2016-11-17 09:04:30 UTC
Description of problem:
=======================
On a 2 x (4 + 2) distributed-disperse volume, I killed bricks equal to the redundancy count (2 per subvolume, 4 in total) and observed that healing started, which is not expected because all the data bricks are still up and running.

Version-Release number of selected component (if applicable):
3.8.4-5.el7rhgs.x86_64

How reproducible:
=================
Always

Steps to Reproduce:
===================
1) Create a distributed disperse volume and start it.
2) Fuse mount the volume on a client.
3) Kill the bricks based on the redundancy count.
4) From mount point, untar linux kernel package and wait till it completes.

5) Check gluster vol heal <volname> info; heal entries show up, i.e. heal is getting triggered (see the command sketch below).
I am also seeing high CPU utilization on the nodes, and we suspect it is growing because of this issue.
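
A minimal reproduction sketch of the above steps, assuming hypothetical node names (server1-server6), brick paths (/bricks/brick1, /bricks/brick2), volume name (ecvol), and kernel tarball path; the actual test setup may differ:

    # create and start a 2 x (4+2) distributed-disperse volume (12 bricks, 2 subvolumes)
    gluster volume create ecvol disperse-data 4 redundancy 2 \
        server{1..6}:/bricks/brick1 server{1..6}:/bricks/brick2
    gluster volume start ecvol

    # fuse mount the volume on the client
    mount -t glusterfs server1:/ecvol /mnt/ecvol

    # kill redundancy-count bricks (2 per subvolume, 4 in total);
    # pick the brick PIDs from the volume status output
    gluster volume status ecvol
    kill -9 <brick-pid-1> <brick-pid-2> <brick-pid-3> <brick-pid-4>

    # generate I/O from the mount point
    cd /mnt/ecvol && tar xf /root/linux-4.x.tar.xz

    # all data bricks are still up, yet heal entries show up here
    gluster volume heal ecvol info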

d x (k + n) --> where d is the distribute count, k is the data brick count per subvolume, and n is the redundancy count.
Here: 2 x (4 + 2)

Actual results:
===============
Even though all the data bricks are up, healing is getting started.

Expected results:
=================
Healing should not happen as all the data bricks are up and running.
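
A quick check of the expected behaviour once fixed, with only the redundant bricks down (same hypothetical volume name ecvol as above); heal info should list zero entries:

    gluster volume heal ecvol info
    # expected output for each brick that is still up (sketch):
    #   Brick serverN:/bricks/brickM
    #   Status: Connected
    #   Number of entries: 0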

Comment 7 Atin Mukherjee 2017-03-24 08:45:19 UTC
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/101286

Comment 9 Nag Pavan Chilakam 2017-06-23 06:57:31 UTC
QATP:
======
tc#1) Test the scenario described in the Description, i.e. bring down the redundant number of bricks first and then do I/O; with the fix, shd CPU consumption should come down ====> PASS: CPU is mostly at <2% and sometimes peaks to 10% (but hardly for a second), so acceptable.
tc#2) Keep I/O going and then bring down one redundant brick after another; CPU utilization should come down, but it does not as long as I/O is in progress (tried with linux untar) ====> FAIL; raising a new bz.

The steps in this bz, i.e. tc#1, are passing. However, the fix has not covered all the cases, hence moving it to VERIFIED while raising a new bz for tc#2.
Raised BZ#1464336 - self-heal daemon CPU consumption not reducing when I/O is going on and all redundant bricks are brought down one after another (for tc#2).
Test version: 3.8.4-28
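
For reference, a rough way to sample the self-heal daemon CPU consumption during tc#1/tc#2 on a storage node, assuming pgrep and the sysstat pidstat utility are available (the numbers quoted above came from observation, not from this snippet):

    # find the self-heal daemon PID
    pgrep -f glustershd

    # sample its CPU usage every 5 seconds, 12 samples;
    # with the fix, tc#1 should stay mostly below 2% with only brief peaks
    pidstat -u -p "$(pgrep -f glustershd | head -n 1)" 5 12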

Comment 11 errata-xmlrpc 2017-09-21 04:28:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774


