1342954 – self heal daemon killed due to oom kills on a dist-disperse volume using nfs ganesha

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	disperse
Sub Component:
Version:	3.8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Ashish Pandey
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1342426 1342796
Blocks:	1342964

Reported:	2016-06-06 08:27 UTC by Ashish Pandey
Modified:	2016-06-16 12:33 UTC
CC List:	7 users
Fixed In Version:	glusterfs-3.8.0
Clone Of:	1342796
Clones:	1342964
Environment:
Last Closed:	2016-06-16 12:33:36 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Comment 1 Vijay Bellur 2016-06-06 08:40:04 UTC

REVIEW: http://review.gluster.org/14651 (cluster/ec: Restrict the launch of replace brick heal) posted (#1) for review on release-3.8 by Ashish Pandey (aspandey)

Comment 2 Ashish Pandey 2016-06-07 03:39:01 UTC

--- Additional comment from Pranith Kumar K on 2016-06-04 07:52:19 EDT ---

Steps to re-create the issue without nfs-ganesha (the issue seems to lie in the combination of cache-invalidation and ec; cache invalidation is enabled whenever nfs-ganesha is enabled):

On a single machine, the issue can be re-created with the following steps:
1) Create and start the volume, then mount it twice:
glusterd && \
  gluster v create ec2 redundancy 2 localhost.localdomain:/home/gfs/ec_{0..5} force && \
  gluster v start ec2 && \
  mount -t glusterfs localhost.localdomain:/ec2 /mnt/fuse1 && \
  mount -t glusterfs localhost.localdomain:/ec2 /mnt/ec2

2) gluster volume set ec2 features.cache-invalidation on

3) In two separate terminals, one with /mnt/ec2 and the other with /mnt/fuse1 as its working directory, run:
while true; do echo abc > a; done

4) Run gluster volume heal ec2 10 times in a loop; the command may hang partway through.

5) Watch the memory usage of the self-heal daemon (shd) climb with:
top -p <pid-of-shd>
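
Steps 4 and 5 can be scripted roughly as follows. This is only a sketch, not part of the original report: the function names, the 5-second sampling interval, and the pgrep pattern "glustershd" (assumed to match the self-heal daemon's command line) are illustrative assumptions, and reading VmRSS from /proc works on Linux only.

```shell
#!/bin/sh
# Sketch of steps 4 and 5 above. Nothing runs until a function is called.
# VOLUME defaults to the volume name used in the steps.
VOLUME=${VOLUME:-ec2}

# Print the resident set size (in kB) of the given PID, read from /proc.
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# Step 4: trigger a heal N times (N is the first argument).
# A failed run is reported on stderr; the loop continues.
heal_loop() {
    i=1
    while [ "$i" -le "$1" ]; do
        gluster volume heal "$VOLUME" || echo "heal run $i failed" >&2
        i=$((i + 1))
    done
}

# Step 5: print the shd memory usage every 5 seconds while it is alive.
# ASSUMPTION: the shd process can be found with "pgrep -f glustershd".
watch_shd() {
    pid=$(pgrep -f glustershd | head -n 1)
    while [ -n "$pid" ] && [ -d "/proc/$pid" ]; do
        echo "$(date +%T) shd pid=$pid rss_kb=$(rss_kb "$pid")"
        sleep 5
    done
}
```

After sourcing the file, run watch_shd in one terminal and heal_loop 10 in another; a steadily growing rss_kb value corresponds to the growth seen in top.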

Comment 3 Vijay Bellur 2016-06-08 07:00:01 UTC

REVIEW: http://review.gluster.org/14651 (cluster/ec: Restrict the launch of replace brick heal) posted (#2) for review on release-3.8 by Ashish Pandey (aspandey)

Comment 4 Vijay Bellur 2016-06-13 10:12:16 UTC

COMMIT: http://review.gluster.org/14651 committed in release-3.8 by Niels de Vos (ndevos) 
------
commit c8d78fa265b8b938bbaee5bc8a59b60a58ae0440
Author: Ashish Pandey <aspandey>
Date:   Mon Jun 6 10:17:54 2016 +0530

    cluster/ec: Restrict the launch of replace brick heal
    
    Problem: When features.cache-invalidation is ON, the
    ec_notify function gets called very often, which leads to
    the launch of too many heals. No heal completes, so
    heals accumulate.
    
    Solution: ec_launch_replace_heal should not be launched
    for every event. Replace brick will trigger a child up
    event, and only then should this heal function be called.
    
    master -
    http://review.gluster.org/#/c/14649/
    
    Change-Id: I57b44c6a279d57230daea1d93229be6069245b7d
    BUG: 1342954
    Signed-off-by: Ashish Pandey <aspandey>
    Reviewed-on: http://review.gluster.org/14651
    Reviewed-by: Xavier Hernandez <xhernandez>
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>

Comment 5 Niels de Vos 2016-06-16 12:33:36 UTC

This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
