Bug 1806844 - [EC] shd crashed while heal failed due to out of memory error.
Summary: [EC] shd crashed while heal failed due to out of memory error.
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: 7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1729085
Blocks: 1805057 1806836
Reported: 2020-02-25 06:52 UTC by Pranith Kumar K
Modified: 2020-03-12 14:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1729085
Environment:
Last Closed: 2020-03-12 14:47:26 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Links
System ID | Private | Priority | Status | Summary | Last Updated
Gluster.org Gerrit 24171 | 0 | None | Open | cluster/ec: Change handling of heal failure to avoid crash | 2020-02-25 06:58:59 UTC

Description Pranith Kumar K 2020-02-25 06:52:08 UTC
+++ This bug was initially created as a clone of Bug #1729085 +++

Description of problem:
The crash is triggered when no memory is available for synctasks:
[2019-07-03 15:13:13.801297] A [MSGID: 0] [mem-pool.c:145:__gf_calloc] : no memory available for size (2097224) current memory usage in kilobytes 5515680 [call stack follows]

As the backtrace suggests, ec_heal_throttle tried to launch a heal and failed because it could not create a new synctask.

ec_launch_heal then calls ec_heal_fail, which passes NULL as an argument that is later dereferenced.

ec_launch_heal(ec_t *ec, ec_fop_data_t *fop)
{
    int ret = 0;

    ret = synctask_new(ec->xl->ctx->env, ec_synctask_heal_wrap, ec_heal_done,
                       NULL, fop);
    if (ret < 0) {
        ec_fop_set_error(fop, ENOMEM);
        ec_heal_fail(ec, fop);
    }
}

ec_heal_fail calls ec_getxattr_heal_cbk with op_errno=12, which is ENOMEM:
#0  ec_getxattr_heal_cbk (frame=0x7f796de7dd38, cookie=0x0, xl=0x7f6f215e5800, op_ret=-1, op_errno=12, mask=0, good=0, bad=0, xdata=0x0) at ec-inode-read.c:399
The second argument (cookie) is NULL, and it is dereferenced here:

399	    fop_getxattr_cbk_t func = fop->data;

So, while the out-of-memory condition itself may be related to the way shd-mux works, we need to fix this code in EC so that it never dereferences a NULL pointer here.
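
To illustrate the kind of handling meant here, below is a minimal, self-contained sketch of a failure path that reports ENOMEM through the operation itself instead of handing a NULL pointer to a callback. It is not the GlusterFS code and not the actual patch: heal_op_t, task_new, heal_fail, launch_heal and record_cbk are hypothetical stand-ins invented for this example, and task_new only simulates a task allocation failure.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real GlusterFS types. */
typedef struct heal_op heal_op_t;
typedef void (*heal_cbk_t)(heal_op_t *op, int op_ret, int op_errno);

struct heal_op {
    heal_cbk_t cbk; /* completion callback (loosely analogous to fop->data) */
    int error;      /* last error recorded on the operation */
    int completed;  /* number of times the callback has run */
};

/* Simulates a task scheduler whose allocation may fail (assumption). */
static int task_new(int simulate_oom)
{
    return simulate_oom ? -1 : 0;
}

/* Failure path: validate every pointer before use and pass the operation
 * itself to the callback, so the callback never receives NULL. */
static void heal_fail(heal_op_t *op)
{
    if (op == NULL || op->cbk == NULL)
        return;
    op->cbk(op, -1, op->error);
}

/* Tries to start a heal; on allocation failure records ENOMEM and unwinds
 * through heal_fail() instead of crashing. */
static int launch_heal(heal_op_t *op, int simulate_oom)
{
    if (task_new(simulate_oom) < 0) {
        op->error = ENOMEM;
        heal_fail(op);
        return -1;
    }
    return 0;
}

/* Example callback: stores the reported error and counts the invocation. */
static void record_cbk(heal_op_t *op, int op_ret, int op_errno)
{
    (void)op_ret;
    op->error = op_errno;
    op->completed++;
}
```

In this sketch, calling launch_heal(&op, 1) on an operation whose cbk is record_cbk returns -1 and leaves op.error set to ENOMEM, with the callback having run exactly once. How the real change restructures ec_heal_fail is covered by the review linked in the comments.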

--- Additional comment from Worker Ant on 2019-07-15 13:06:58 UTC ---

REVIEW: https://review.gluster.org/23050 (cluster/ec: Change handling of heal failure to avoide crash) posted (#1) for review on master by Ashish Pandey

--- Additional comment from Worker Ant on 2019-11-04 11:01:35 UTC ---

REVIEW: https://review.gluster.org/23050 (cluster/ec: Change handling of heal failure to avoid crash) merged (#10) on master by Xavi Hernandez

Comment 1 Worker Ant 2020-02-25 06:59:00 UTC
REVIEW: https://review.gluster.org/24171 (cluster/ec: Change handling of heal failure to avoid crash) posted (#1) for review on release-7 by Pranith Kumar Karampuri

Comment 2 Worker Ant 2020-03-12 14:47:26 UTC
This bug has been moved to https://github.com/gluster/glusterfs/issues/1061 and will be tracked there from now on. Visit the GitHub issue URL for further details.

