1468457 – selfheal daemon CPU consumption not reducing when IOs are going on and all redundant bricks are brought down one after another


Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	disperse
Sub Component:
Version:	3.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Ashish Pandey
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1464336 1464359
Blocks:

Reported:	2017-07-07 07:31 UTC by Ashish Pandey
Modified:	2017-08-12 13:07 UTC
CC List:	6 users
Fixed In Version:	glusterfs-3.11.2
Clone Of:	1464359
Environment:
Last Closed:	2017-08-12 13:07:33 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:


Comment 1 Worker Ant 2017-07-07 07:33:57 UTC

REVIEW: https://review.gluster.org/17724 (cluster/ec : Don't try to heal when no sink is UP) posted (#1) for review on release-3.11 by Ashish Pandey (aspandey)

Comment 2 Ashish Pandey 2017-07-07 07:36:51 UTC

Description of problem:
=========================
Hit this while verifying BZ#1396010 - [Disperse] healing should not start if only data bricks are UP
The fix in bz#1396010 reduces CPU usage when the heal daemon notices at the very beginning that all the redundant bricks are down. However, if the redundant bricks are brought down one after another while IOs are running in parallel, the CPU consumption does not reduce.
Hence raising this bz

Version-Release number of selected component (if applicable):
===
3.8.4-28

How reproducible:
========
always

Steps to Reproduce:
1. Create a 1x(4+2) EC volume (take every other volume on this cluster offline).
2. Trigger IOs, say a Linux kernel untar.
3. Keep capturing the CPU usage of the shd process on all nodes.
4. Kill b1.
5. Wait for, say, 2 minutes and kill b2.

Actual results:
=====
The CPU usage of shd stays above 100% as long as IOs go on, even though only as many bricks as the data count are up.

Expected results:
============
CPU usage of shd should drop, as there is nothing to heal.

Comment 3 Worker Ant 2017-07-10 13:58:36 UTC

COMMIT: https://review.gluster.org/17724 committed in release-3.11 by Shyamsundar Ranganathan (srangana) 
------
commit af569e4a418a65b452cd8842d6999734677ad5f3
Author: Ashish Pandey <aspandey>
Date:   Tue Jul 4 16:18:20 2017 +0530

    cluster/ec : Don't try to heal when no sink is UP
    
    Problem:
    4 + 2 EC volume configuration.
    If untar of linux is going on and we kill a brick,
    indices will be created for the files/dir which need
    to be healed. ec_shd_index_sweep spawns threads to
    scan these entries and start heal. If in the middle
    of this we kill one more brick, we end up in a
    situation where we can not heal an entry as only
    "ec->fragment" number of bricks are UP.
    However, the scan will be continued and it will
    trigger the heal for those entries.
    
    Solution:
    When a heal is triggered for an entry, check if it
    *CAN* be healed or not. If not, come out with ENOTCONN.
    
    >Change-Id: I305be7701c289f36bd7bde22491b71074771424f
    >BUG: 1464359
    >Signed-off-by: Ashish Pandey <aspandey>
    >Reviewed-on: https://review.gluster.org/17692
    >Smoke: Gluster Build System <jenkins.org>
    >CentOS-regression: Gluster Build System <jenkins.org>
    >NetBSD-regression: NetBSD Build System <jenkins.org>
    >Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    >Reviewed-by: Sunil Kumar Acharya <sheggodu>
    >Reviewed-by: Xavier Hernandez <xhernandez>
    >Signed-off-by: Ashish Pandey <aspandey>
    
    Change-Id: I305be7701c289f36bd7bde22491b71074771424f
    BUG: 1468457
    Signed-off-by: Ashish Pandey <aspandey>
    Reviewed-on: https://review.gluster.org/17724
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Xavier Hernandez <xhernandez>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
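
The check described in the commit message can be illustrated with a minimal sketch. This is not the actual patch: `struct ec_sketch`, `count_up`, `heal_possible` and `sweep` are hypothetical names made up for illustration, and the bitmask layout is an assumption. The only rule taken from the commit message is that an entry cannot be healed when just the data-fragment number of bricks are UP, in which case the heal comes out with ENOTCONN.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-volume state; not the real GlusterFS type. */
struct ec_sketch {
    uint32_t up_mask;   /* assumption: bit i set means brick i is UP */
    int      fragments; /* number of data bricks, 4 in a 4+2 volume */
};

/* Count the bricks that are currently UP. */
static int count_up(uint32_t mask)
{
    int n = 0;
    while (mask != 0) {
        mask &= mask - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Return 0 if a heal could make progress, -ENOTCONN otherwise.
 * With only `fragments` bricks UP, every UP brick is needed as a
 * source, so no brick is left to act as a sink.
 */
static int heal_possible(const struct ec_sketch *ec)
{
    assert(ec->fragments > 0);
    if (count_up(ec->up_mask) <= ec->fragments)
        return -ENOTCONN;
    return 0;
}

/*
 * Walk `entries` index entries and return how many heals were
 * actually attempted. Entries that cannot be healed are skipped
 * right after the cheap check instead of going through a full heal.
 */
static int sweep(const struct ec_sketch *ec, int entries)
{
    int attempted = 0;
    for (int i = 0; i < entries; i++) {
        if (heal_possible(ec) != 0)
            continue; /* bail out early for this entry */
        attempted++;
    }
    return attempted;
}
```

For the 4+2 volume in this report, the sketch reports that a heal is possible with b1 down (5 bricks UP) and returns -ENOTCONN once b2 is killed as well (4 bricks UP), so a sweep that is already running skips the remaining entries cheaply.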

Comment 4 Shyamsundar 2017-08-12 13:07:33 UTC

This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.11.2, please open a new bug report.

glusterfs-3.11.2 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-July/031908.html
[2] https://www.gluster.org/pipermail/gluster-users/
