Bug 1464336 - self-heal daemon CPU consumption not reducing when IOs are going on and all redundant bricks are brought down one after another
Summary: self-heal daemon CPU consumption not reducing when IOs are going on and all redundant bricks are brought down one after another
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Ashish Pandey
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard: 3.3.0-devel-freeze-exception
Depends On:
Blocks: 1417151 1464359 1468457
 
Reported: 2017-06-23 06:56 UTC by Nag Pavan Chilakam
Modified: 2018-08-16 06:51 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.8.4-33
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1464359 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:59:42 UTC
Embargoed:




Links
System: Red Hat Product Errata
ID: RHBA-2017:2774
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: glusterfs bug fix and enhancement update
Last Updated: 2017-09-21 08:16:29 UTC

Description Nag Pavan Chilakam 2017-06-23 06:56:59 UTC
Description of problem:
=========================
Hit this while verifying BZ#1396010 - [Disperse] healing should not start if only data bricks are UP.
The fix for BZ#1396010 reduces CPU usage when the heal daemon notices, right at the beginning, that all the redundant bricks are down. However, if the redundant bricks are brought down one after another while IOs are happening in parallel, the CPU consumption does not reduce.
Hence raising this BZ.

Version-Release number of selected component (if applicable):
===
3.8.4-28

How reproducible:
========
always

Steps to Reproduce:
1. Create a 1x(4+2) EC volume (take all other volumes on this cluster offline, leaving only this volume).
2. Trigger IOs, e.g. a Linux kernel untar.
3. Keep capturing the CPU usage of the shd process on all nodes (see the sketch after this list).
4. Kill b1.
5. Wait for about 2 minutes and kill b2.
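
A minimal sketch of steps 1-5; the host names (server1-3), brick paths and volume name (ecvol) are placeholders assumed for illustration, not taken from this report:

# Create a 1x(4+2) dispersed volume, 2 bricks per node on a 3-node cluster.
gluster volume create ecvol disperse 6 redundancy 2 \
    server{1..3}:/bricks/brick{1,2}/ecvol force
gluster volume start ecvol

# On a client: mount the volume and start IO, e.g. a Linux kernel untar.
mount -t glusterfs server1:/ecvol /mnt/ecvol
tar -xf linux-4.12.tar.xz -C /mnt/ecvol &

# Kill one redundant brick, wait about 2 minutes, then kill a second one.
# Brick PIDs can be read from `gluster volume status ecvol`.
kill -9 <pid-of-b1>
sleep 120
kill -9 <pid-of-b2>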

Actual results:
=====
The CPU usage of shd stays above 100% for as long as IOs are going on, even though only the data bricks are up.

Expected results:
============
CPU usage of shd should reduce, as there is nothing to heal.

Comment 9 Nag Pavan Chilakam 2017-07-26 12:28:47 UTC
on_qa validation on 3.8.4-35

Moving to verified, as I don't see the issue anymore

Noticed that when running the above case, the CPU utilization of shd is mostly nil, peaking at 0-6%, which brings the utilization down significantly.

Problems/observations:
1) However, I also issued an ls -lRt from another client, and the command hung when both bricks were down (both bricks hosted on the same node; 2 bricks per node in a 3-node cluster) ---> raised BZ#1475310

Checked for about 10 minutes; below is a snippet (refer to the glusterfs line for the shd process).
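
A capture loop along these lines (a hypothetical reconstruction, not the exact script used; the PIDs come from the output below and a roughly 3-second interval is assumed) produces output of this shape:

PIDS=28124,28143,28163   # brick (glusterfsd) and shd (glusterfs) PIDs on this node
i=0
while true; do
    i=$((i + 1))
    echo "################### LOOP $i ###############"
    date
    top -b -n 1 -p "$PIDS" | tail -n +7   # drop the summary area, keep the PID header and process lines
    sleep 3
done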
################## LOOP 198 ###############
Mon Jul 24 19:15:41 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:22.61 glusterfsd
28143 root      20   0 1483380  75680   4680 S  18.8  0.9   3:10.42 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.32 glusterfs
################### LOOP 199 ###############
Mon Jul 24 19:15:44 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:23.42 glusterfsd
28143 root      20   0 1483380  75680   4680 S  25.0  0.9   3:11.19 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.33 glusterfs
################### LOOP 200 ###############
Mon Jul 24 19:15:47 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  31.2  0.9   3:24.24 glusterfsd
28143 root      20   0 1483380  75680   4680 S  18.8  0.9   3:11.93 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.34 glusterfs
################### LOOP 201 ###############
Mon Jul 24 19:15:50 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:25.01 glusterfsd
28143 root      20   0 1483380  75680   4680 S  25.0  0.9   3:12.65 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.35 glusterfs
################### LOOP 202 ###############
Mon Jul 24 19:15:54 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28143 root      20   0 1483380  75680   4680 S  31.2  0.9   3:13.44 glusterfsd
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:25.87 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.36 glusterfs

################### LOOP 203 ###############
Mon Jul 24 19:15:57 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:26.71 glusterfsd
28143 root      20   0 1483380  75680   4680 S  25.0  0.9   3:14.19 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.36 glusterfs
################### LOOP 204 ###############
Mon Jul 24 19:16:00 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  25.0  0.9   3:27.52 glusterfsd
28143 root      20   0 1483380  75680   4680 S  25.0  0.9   3:14.93 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.39 glusterfs
################### LOOP 205 ###############
Mon Jul 24 19:16:03 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  29.4  0.9   3:28.36 glusterfsd
28143 root      20   0 1483380  75680   4680 S  17.6  0.9   3:15.68 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.39 glusterfs
################### LOOP 206 ###############
Mon Jul 24 19:16:06 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  29.4  0.9   3:29.22 glusterfsd
28143 root      20   0 1483380  75680   4680 S  23.5  0.9   3:16.48 glusterfsd
28163 root      20   0 1465612  63520   3248 S   5.9  0.8   1:52.41 glusterfs
################### LOOP 207 ###############
Mon Jul 24 19:16:10 IST 2017
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28124 root      20   0 1549176  75608   4700 S  31.2  0.9   3:30.07 glusterfsd
28143 root      20   0 1483380  75680   4680 S  25.0  0.9   3:17.26 glusterfsd
28163 root      20   0 1465612  63520   3248 S   0.0  0.8   1:52.41 glusterfs
################### LOOP 208 ###############

Comment 11 errata-xmlrpc 2017-09-21 04:59:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

