1806844 – [EC] shd crashed while heal failed due to out of memory error.

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	disperse
Sub Component:
Version:	7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1729085
Blocks:	1805057 1806836

Reported:	2020-02-25 06:52 UTC by Pranith Kumar K
Modified:	2020-03-12 14:47 UTC
CC List:	3 users
Fixed In Version:
Clone Of:	1729085
Environment:
Last Closed:	2020-03-12 14:47:26 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Gluster.org Gerrit	24171	0	None	Open	cluster/ec: Change handling of heal failure to avoid crash	2020-02-25 06:58:59 UTC

Description Pranith Kumar K 2020-02-25 06:52:08 UTC

+++ This bug was initially created as a clone of Bug #1729085 +++

Description of problem:
The main trigger for this crash is that no memory was available for synctasks:
[2019-07-03 15:13:13.801297] A [MSGID: 0] [mem-pool.c:145:__gf_calloc] : no memory available for size (2097224) current memory usage in kilobytes 5515680 [call stack follows]

As the backtrace suggests, ec_heal_throttle tried to launch a heal and failed because it could not create a new synctask.

ec_launch_heal then calls ec_heal_fail, which passes NULL as an argument that is later dereferenced.

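void
ec_launch_heal(ec_t *ec, ec_fop_data_t *fop)
{
    int ret = 0;

    /* Schedule the heal as a synctask; this needs memory for the task. */
    ret = synctask_new(ec->xl->ctx->env, ec_synctask_heal_wrap, ec_heal_done,
                       NULL, fop);
    if (ret < 0) {
        /* Task creation failed (out of memory here): report ENOMEM. */
        ec_fop_set_error(fop, ENOMEM);
        ec_heal_fail(ec, fop);
    }
}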

ec_heal_fail calls ec_getxattr_heal_cbk with op_errno=12, which is ENOMEM:
#0  ec_getxattr_heal_cbk (frame=0x7f796de7dd38, cookie=0x0, xl=0x7f6f215e5800, op_ret=-1, op_errno=12, mask=0, good=0, bad=0, xdata=0x0) at ec-inode-read.c:399
The second argument (cookie=0x0) is NULL, and it is dereferenced at line 399:

399	    fop_getxattr_cbk_t func = fop->data;

So, although the out-of-memory condition may be related to the way shd-mux works, this code in EC needs to be fixed so that it never dereferences a NULL pointer here.
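
The failure pattern described above can be illustrated with a small standalone sketch. This is not the patch under review and does not use the GlusterFS headers: demo_fop_t, demo_heal_cbk, demo_heal_fail and record_result are hypothetical stand-ins, and the signatures are simplified. It only assumes that the callback treats its cookie as the fop and that the failure path should therefore hand over a valid pointer, with a defensive NULL check in the callback.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified callback type (the real one takes more arguments). */
typedef int (*demo_cbk_fn)(int op_ret, int op_errno);

/* Hypothetical stand-in for the fop; "data" plays the role of fop->data. */
typedef struct {
    demo_cbk_fn data; /* completion callback supplied by the caller */
    int error;        /* error recorded before the failure path runs */
} demo_fop_t;

/* State recorded by the sample completion callback below. */
int calls;
int last_op_ret;
int last_op_errno;

/* Sample completion callback: just records what it was told. */
int
record_result(int op_ret, int op_errno)
{
    calls++;
    last_op_ret = op_ret;
    last_op_errno = op_errno;
    return 0;
}

/* Callback that receives the fop through an opaque cookie.
 * Unlike the crashing code path, it checks the cookie before use. */
int
demo_heal_cbk(void *cookie, int op_ret, int op_errno)
{
    demo_fop_t *fop = cookie;

    if (fop == NULL || fop->data == NULL) {
        return -1; /* nobody to notify; do not dereference */
    }
    return fop->data(op_ret, op_errno);
}

/* Failure path: passes the fop itself as the cookie instead of NULL. */
int
demo_heal_fail(demo_fop_t *fop)
{
    int err = (fop != NULL) ? fop->error : ENOMEM;

    return demo_heal_cbk(fop, -1, err);
}
```

With this shape, a caller that sets fop.error to ENOMEM and then calls demo_heal_fail(&fop) gets the error delivered to its completion callback, while a NULL cookie is rejected with -1 instead of crashing.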

--- Additional comment from Worker Ant on 2019-07-15 13:06:58 UTC ---

REVIEW: https://review.gluster.org/23050 (cluster/ec: Change handling of heal failure to avoid crash) posted (#1) for review on master by Ashish Pandey

--- Additional comment from Worker Ant on 2019-11-04 11:01:35 UTC ---

REVIEW: https://review.gluster.org/23050 (cluster/ec: Change handling of heal failure to avoid crash) merged (#10) on master by Xavi Hernandez

Comment 1 Worker Ant 2020-02-25 06:59:00 UTC

REVIEW: https://review.gluster.org/24171 (cluster/ec: Change handling of heal failure to avoid crash) posted (#1) for review on release-7 by Pranith Kumar Karampuri

Comment 2 Worker Ant 2020-03-12 14:47:26 UTC

This bug has been moved to https://github.com/gluster/glusterfs/issues/1061 and will be tracked there from now on. Visit the GitHub issue URL for further details.