Bug 1464359 - self-heal daemon CPU consumption not reducing when IOs are going on and all redundant bricks are brought down one after another
Summary: self-heal daemon CPU consumption not reducing when IOs are going on and all r...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Ashish Pandey
QA Contact:
URL:
Whiteboard:
Depends On: 1464336
Blocks: 1468457
 
Reported: 2017-06-23 08:53 UTC by Ashish Pandey
Modified: 2017-09-05 17:35 UTC (History)
CC List: 6 users

Fixed In Version: glusterfs-3.12.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1464336
: 1468457 (view as bug list)
Environment:
Last Closed: 2017-08-21 08:05:13 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Ashish Pandey 2017-06-23 09:27:28 UTC
Description of problem:
=========================
Hit this while verifying BZ#1396010 - [Disperse] healing should not start if only data bricks are UP
The fix in bz#1396010 takes care of reducing CPU usage when the heal daemon notices at the very beginning that all the redundant bricks are down. However, if the redundant bricks are brought down one after another while IOs are happening in parallel, the CPU consumption does not reduce.
Hence raising this bz.

Version-Release number of selected component (if applicable):
===
3.8.4-28

How reproducible:
========
always

Steps to Reproduce:
1. Create a 1x(4+2) EC volume (take all other volumes on this cluster offline, except this one).
2. Trigger IOs, say a Linux kernel untar.
3. Keep capturing the CPU usage of the shd process on all nodes.
4. Kill b1.
5. Wait for about 2 minutes and kill b2.

Actual results:
=====
It can be seen that the CPU usage stays above 100% as long as IOs go on, even though only the data bricks are up.

Expected results:
============
CPU usage of shd should reduce, as there is nothing it can heal.

Comment 2 Ashish Pandey 2017-06-23 09:28:42 UTC
RCA:

SHD already takes care of the scenario where a redundant number of bricks are down when it starts: it does not trigger heal in that case.

Now consider a 4+2 volume.
When continuous IO is going on and 2 bricks are down, an update fop finds that it did not succeed on those 2 bricks and immediately triggers a heal. This is what client-side heal does. Here too it is of NO use to trigger a heal, since a file cannot actually be healed while 2 bricks are down.

This is causing unnecessary CPU hogging.

Solution:

1 - While triggering a client-side heal, check whether more than 4 bricks are UP, and trigger the heal accordingly.

OR

2 - Disable background heal as soon as 2 bricks go down and we cannot heal.
Enable it again as soon as we see more than 4 bricks UP.

I think option [1] is the better solution, and it can be improved further:
what if the brick that requires heal is itself down? Even in that case we should not trigger a heal.
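
For illustration only, here is a minimal sketch of the kind of check option [1] implies, assuming a 4+2 layout; this is NOT the actual GlusterFS code. The type and helper names (ec_sketch_t, ec_sketch_should_heal, count_bits) are hypothetical; the real ec translator keeps equivalent state (an up-brick mask and the fragment count) in its own structures.

/* Sketch: skip the client-side heal trigger unless strictly more than
 * `fragments` bricks are UP, i.e. at least one possible sink exists. */

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t up_mask;   /* bit i set => brick i is UP */
    int      fragments; /* data bricks: 4 in a 4+2 volume */
} ec_sketch_t;

static int count_bits(uint32_t mask)
{
    int n = 0;
    while (mask) {
        n += mask & 1;
        mask >>= 1;
    }
    return n;
}

/* A heal can only make progress when at least one possible sink exists,
 * i.e. strictly more than `fragments` bricks are UP. */
static bool ec_sketch_should_heal(const ec_sketch_t *ec)
{
    return count_bits(ec->up_mask) > ec->fragments;
}

int main(void)
{
    /* Bricks 0-3 UP, bricks 4-5 down: exactly the data bricks remain. */
    ec_sketch_t ec = { .up_mask = 0x0F, .fragments = 4 };

    /* Client-side heal should be skipped instead of hogging the CPU. */
    printf("trigger heal: %s\n", ec_sketch_should_heal(&ec) ? "yes" : "no");
    return 0;
}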

Comment 3 Worker Ant 2017-07-04 11:27:36 UTC
REVIEW: https://review.gluster.org/17692 (cluster/ec : Don't try to heal when no sink is UP) posted (#1) for review on master by Ashish Pandey (aspandey)

Comment 4 Worker Ant 2017-07-07 06:13:16 UTC
COMMIT: https://review.gluster.org/17692 committed in master by Xavier Hernandez (xhernandez) 
------
commit 0ae38df6403942a2438404d46a6e05b503db3485
Author: Ashish Pandey <aspandey>
Date:   Tue Jul 4 16:18:20 2017 +0530

    cluster/ec : Don't try to heal when no sink is UP
    
    Problem:
    4 + 2 EC volume configuration.
    If untar of linux is going on and we kill a brick,
    indices will be created for the files/dir which need
    to be healed. ec_shd_index_sweep spawns threads to
    scan these entries and start heal. If in the middle
    of this we kill one more brick, we end up in a
    situation where we can not heal an entry as there
    are only "ec->fragment" number of bricks are UP.
    However, the scan will be continued and it will
    trigger the heal for those entries.
    
    Solution:
    When a heal is triggered for an entry, check if it
    *CAN* be healed or not. If not come out with ENOTCONN.
    
    Change-Id: I305be7701c289f36bd7bde22491b71074771424f
    BUG: 1464359
    Signed-off-by: Ashish Pandey <aspandey>
    Reviewed-on: https://review.gluster.org/17692
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Reviewed-by: Sunil Kumar Acharya <sheggodu>
    Reviewed-by: Xavier Hernandez <xhernandez>
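
For illustration, here is a hedged sketch of the check the commit message describes (refuse to heal an entry when it cannot be healed, returning ENOTCONN). This is not the literal change merged at https://review.gluster.org/17692; the function names and the plain-bitmask interface are hypothetical.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

static int popcount32(uint32_t m)
{
    int n = 0;
    for (; m; m >>= 1)
        n += m & 1;
    return n;
}

/* sources: bricks with good copies; need_heal: bricks missing data;
 * up: bricks currently connected; fragments: data brick count. */
static int ec_sketch_check_heal(uint32_t sources, uint32_t need_heal,
                                uint32_t up, int fragments)
{
    /* Not enough readable fragments to reconstruct the data. */
    if (popcount32(sources & up) < fragments)
        return -ENOTCONN;

    /* No sink is UP: every brick that needs healing is down, so starting
     * a heal would only burn CPU in the index-sweep threads. */
    if ((need_heal & up) == 0)
        return -ENOTCONN;

    return 0;
}

int main(void)
{
    /* The scenario from the description: 4+2 volume, bricks 0-3 UP and
     * healthy, bricks 4-5 down and dirty.  The check fails with -ENOTCONN
     * and the sweep can move on instead of spinning. */
    int ret = ec_sketch_check_heal(0x0F, 0x30, 0x0F, 4);
    printf("heal check: %d\n", ret);
    return 0;
}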

Comment 5 Shyamsundar 2017-09-05 17:35:13 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

