Bug 1805057 - [EC] shd crashed while heal failed due to out of memory error.
Summary: [EC] shd crashed while heal failed due to out of memory error.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: disperse
Version: 5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1729085 1806836 1806844
Blocks: 1806848
Reported: 2020-02-20 07:44 UTC by Pranith Kumar K
Modified: 2020-03-03 14:09 UTC
CC: 3 users

Fixed In Version: glusterfs-5.12
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1729085
Environment:
Last Closed: 2020-02-25 10:17:41 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Gluster.org Gerrit 24153 0 None Merged cluster/ec: Change handling of heal failure to avoid crash 2020-02-25 10:17:40 UTC

Description Pranith Kumar K 2020-02-20 07:44:33 UTC
+++ This bug was initially created as a clone of Bug #1729085 +++

Description of problem:
The main trigger point of this crash is NO memory available for synctasks -
[2019-07-03 15:13:13.801297] A [MSGID: 0] [mem-pool.c:145:__gf_calloc] : no memory available for size (2097224) current memory usage in kilobytes 5515680 [call stack follows]

As the backtrace suggests, ec_heal_throttle tries to launch a heal and fails because it could not create a new synctask.

ec_launch_heal then calls ec_heal_fail, which passes NULL as an argument that is later dereferenced.

ec_launch_heal(ec_t *ec, ec_fop_data_t *fop)
{
    int ret = 0;

    ret = synctask_new(ec->xl->ctx->env, ec_synctask_heal_wrap, ec_heal_done,
                       NULL, fop);
    if (ret < 0) {
        ec_fop_set_error(fop, ENOMEM);
        ec_heal_fail(ec, fop);
    }
}

ec_heal_fail calls ec_getxattr_heal_cbk with op_errno=12, which is ENOMEM:
#0  ec_getxattr_heal_cbk (frame=0x7f796de7dd38, cookie=0x0, xl=0x7f6f215e5800, op_ret=-1, op_errno=12, mask=0, good=0, bad=0, xdata=0x0) at ec-inode-read.c:399
The second argument (cookie) is NULL and is dereferenced at ec-inode-read.c:399:

399	    fop_getxattr_cbk_t func = fop->data;

So, while the root cause of the out-of-memory condition may be related to the way shd-mux works, the EC code needs to be fixed so that it never dereferences a NULL pointer here.

--- Additional comment from Worker Ant on 2019-07-15 13:06:58 UTC ---

REVIEW: https://review.gluster.org/23050 (cluster/ec: Change handling of heal failure to avoide crash) posted (#1) for review on master by Ashish Pandey

--- Additional comment from Worker Ant on 2019-11-04 11:01:35 UTC ---

REVIEW: https://review.gluster.org/23050 (cluster/ec: Change handling of heal failure to avoid crash) merged (#10) on master by Xavi Hernandez

Comment 1 Worker Ant 2020-02-20 08:38:03 UTC
REVIEW: https://review.gluster.org/24153 (cluster/ec: Change handling of heal failure to avoid crash) posted (#1) for review on release-5 by Pranith Kumar Karampuri

Comment 2 Worker Ant 2020-02-25 10:17:41 UTC
REVIEW: https://review.gluster.org/24153 (cluster/ec: Change handling of heal failure to avoid crash) merged (#3) on release-5 by Pranith Kumar Karampuri

Comment 3 hari gowtham 2020-03-02 08:36:50 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-5.12, please open a new bug report.

glusterfs-5.12 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/gluster-users/2020-March/037797.html
[2] https://www.gluster.org/pipermail/gluster-users/

