Bug 1464336

Summary: self-heal daemon CPU consumption not reducing when I/O is going on and all redundant bricks are brought down one after another
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: disperse
Version: rhgs-3.3
Target Release: RHGS 3.3.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Nag Pavan Chilakam <nchilaka>
Assignee: Ashish Pandey <aspandey>
QA Contact: Nag Pavan Chilakam <nchilaka>
CC: amukherj, pkarampu, rcyriac, rhinduja, rhs-bugs, sheggodu, storage-qa-internal, ubansal
Whiteboard: 3.3.0-devel-freeze-exception
Fixed In Version: glusterfs-3.8.4-33
Doc Type: If docs needed, set a value
Clone Of:
: 1464359 (view as bug list)
Last Closed: 2017-09-21 04:59:42 UTC
Type: Bug
Bug Blocks: 1417151, 1464359, 1468457

Description Nag Pavan Chilakam 2017-06-23 06:56:59 UTC
Description of problem:
=========================
Hit this while verifying BZ#1396010 - [Disperse] healing should not start if only data bricks are UP.
The fix for BZ#1396010 takes care of reducing CPU usage when the self-heal daemon notices right at the start that all the redundant bricks are down. However, if the redundant bricks are brought down one after another while I/O is running in parallel, the CPU consumption does not reduce.
Hence raising this BZ.

Version-Release number of selected component (if applicable):
===
3.8.4-28

How reproducible:
========
always

Steps to Reproduce:
1. Create a 1x(4+2) EC volume (take all other volumes on this cluster offline, leaving only this one); see the command sketch after these steps.
2. Trigger I/O, e.g. a Linux kernel untar.
3. Keep capturing the CPU usage of the shd process on all nodes.
4. Kill brick b1.
5. Wait for about 2 minutes and kill brick b2.
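
For reference, a rough shell sketch of these steps (node names n1-n3, volume name ecvol, brick paths, and the tarball path are illustrative assumptions, not taken from this report):

# 1. Create and start a 1x(4+2) disperse volume across 3 nodes, 2 bricks per node
gluster volume create ecvol disperse 6 redundancy 2 \
    n1:/bricks/b1 n1:/bricks/b2 n2:/bricks/b3 n2:/bricks/b4 n3:/bricks/b5 n3:/bricks/b6 force
gluster volume start ecvol

# 2. Mount the volume on a client and trigger I/O, e.g. a Linux kernel untar
mount -t glusterfs n1:/ecvol /mnt/ecvol
tar -xf /root/linux-4.12.tar.xz -C /mnt/ecvol &

# 3. On each server node, keep capturing shd CPU usage
#    (a capture-loop sketch appears after the top output in comment 9)

# 4 and 5. Kill one redundant brick, wait ~2 minutes, then kill the second one
kill -9 $(pgrep -f '/bricks/b1')   # the glusterfsd for a brick carries its path on the command line
sleep 120
kill -9 $(pgrep -f '/bricks/b2')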

Actual results:
=====
It can be seen that the shd CPU usage stays above 100% for as long as the I/O continues, even though only the data bricks are up.

Expected results:
============
CPU usage of shd should drop, since with only the data bricks up there is nothing it can heal.
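
One way to observe this on a node could be something like the following (a sketch; the volume name and the PID extraction are assumptions):

# Find the shd PID on this node and check its CPU usage
SHD_PID=$(gluster volume status ecvol | awk '/Self-heal Daemon on localhost/ {print $NF}')
top -b -n 1 -p "$SHD_PID"        # %CPU of the glustershd (glusterfs) process should stay near 0
gluster volume heal ecvol info   # pending entries may still be listed, but shd cannot heal them
                                 # while only the data bricks are up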

Comment 9 Nag Pavan Chilakam 2017-07-26 12:28:47 UTC
on_qa validation on 3.8.4-35

Moving to verified, as I don't see the issue anymore

Noticed that when running the above case, the CPU utilization of shd is mostly nil, peaking at 0-6%, which brings the utilization down significantly.

Problems/observations:
1) However, I also issued an ls -lRt from another client, and the command hung while both bricks were down (both bricks are hosted on the same node; 2 bricks per node in a 3-node cluster). Raised BZ#1475310 for this.

Checked for about 10 minutes; below is a snippet (refer to the glusterfs lines for the shd process):
################## LOOP 198 ###############
Mon Jul 24 19:15:41 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:22.61 glusterfsd
28143 root      20   0 1483380  75680   4680 S  18.8  0.9   3:10.42 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.32 glusterfs
################### LOOP 199 ###############
Mon Jul 24 19:15:44 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:23.42 glusterfsd
28143 root      20   0 1483380  75680   4680 S  25.0  0.9   3:11.19 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.33 glusterfs
################### LOOP 200 ###############
Mon Jul 24 19:15:47 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  31.2  0.9   3:24.24 glusterfsd
28143 root      20   0 1483380  75680   4680 S  18.8  0.9   3:11.93 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.34 glusterfs
################### LOOP 201 ###############
Mon Jul 24 19:15:50 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:25.01 glusterfsd
28143 root      20   0 1483380  75680   4680 S  25.0  0.9   3:12.65 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.35 glusterfs
################### LOOP 202 ###############
Mon Jul 24 19:15:54 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28143 root      20   0 1483380  75680   4680 S  31.2  0.9   3:13.44 glusterfsd
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:25.87 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.36 glusterfs



################### LOOP 203 ###############
Mon Jul 24 19:15:57 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:26.71 glusterfsd
28143 root      20   0 1483380  75680   4680 S  25.0  0.9   3:14.19 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.36 glusterfs
################### LOOP 204 ###############
Mon Jul 24 19:16:00 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:27.52 glusterfsd
28143 root      20   0 1483380  75680   4680 S  25.0  0.9   3:14.93 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.39 glusterfs
################### LOOP 205 ###############
Mon Jul 24 19:16:03 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  29.4  0.9   3:28.36 glusterfsd
28143 root      20   0 1483380  75680   4680 S  17.6  0.9   3:15.68 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.39 glusterfs
################### LOOP 206 ###############
Mon Jul 24 19:16:06 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  29.4  0.9   3:29.22 glusterfsd
28143 root      20   0 1483380  75680   4680 S  23.5  0.9   3:16.48 glusterfsd
28163 root      20   0 1465612  63520   3248 S   5.9  0.8   1:52.41 glusterfs
################### LOOP 207 ###############
Mon Jul 24 19:16:10 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  31.2  0.9   3:30.07 glusterfsd
28143 root      20   0 1483380  75680   4680 S  25.0  0.9   3:17.26 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.41 glusterfs
################### LOOP 208 ###############
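
The snapshots above look like output from a periodic top capture; a loop along these lines (a sketch under that assumption, not the exact script used for this report) would produce that format:

i=0
while true; do
    i=$((i + 1))
    echo "################### LOOP $i ###############"
    date
    top -b -n 1 | grep -E 'PID USER|gluster'   # header row plus glusterfsd (bricks) and glusterfs (shd) rows
    sleep 3
done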

Comment 11 errata-xmlrpc 2017-09-21 04:59:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774