Bug 1729085
| Summary: | [EC] shd crashed while heal failed due to out of memory error. | |||
|---|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Ashish Pandey <aspandey> | |
| Component: | disperse | Assignee: | Ashish Pandey <aspandey> | |
| Status: | CLOSED NEXTRELEASE | QA Contact: | ||
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | mainline | CC: | bugs, pasik | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1805057 1806836 1806844 | Environment: | ||
| Last Closed: | 2019-11-04 11:01:35 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1805057, 1806836, 1806844 | |||
REVIEW: https://review.gluster.org/23050 (cluster/ec: Change handling of heal failure to avoide crash) posted (#1) for review on master by Ashish Pandey

REVIEW: https://review.gluster.org/23050 (cluster/ec: Change handling of heal failure to avoid crash) merged (#10) on master by Xavi Hernandez

Description of problem:

The main trigger for this crash is that no memory was available for synctasks:

    [2019-07-03 15:13:13.801297] A [MSGID: 0] [mem-pool.c:145:__gf_calloc] : no memory available for size (2097224) current memory usage in kilobytes 5515680 [call stack follows]

As the backtrace suggests, ec_heal_throttle tried to launch a heal and failed because it could not create a new synctask. ec_launch_heal then calls ec_heal_fail, which passes NULL as an argument that is later dereferenced.

    ec_launch_heal(ec_t *ec, ec_fop_data_t *fop)
    {
        int ret = 0;

        ret = synctask_new(ec->xl->ctx->env, ec_synctask_heal_wrap,
                           ec_heal_done, NULL, fop);
        if (ret < 0) {
            ec_fop_set_error(fop, ENOMEM);
            ec_heal_fail(ec, fop);
        }
    }

ec_heal_fail calls ec_getxattr_heal_cbk with op_errno=12, which is ENOMEM:

    #0 ec_getxattr_heal_cbk (frame=0x7f796de7dd38, cookie=0x0, xl=0x7f6f215e5800, op_ret=-1, op_errno=12, mask=0, good=0, bad=0, xdata=0x0) at ec-inode-read.c:399

The second argument (cookie=0x0) is NULL, and it is dereferenced at line 399:

    399 fop_getxattr_cbk_t func = fop->data;

So, while the out-of-memory condition itself could be related to the way shd-mux works, this code in EC needs to be fixed so that it never dereferences a NULL pointer here.
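
For illustration only, the failure mode can be reproduced in a standalone sketch. Everything below is hypothetical (mock_fop_t, mock_heal_cbk, mock_heal_fail, record_result and mock_demo are made-up names, not GlusterFS code) and it is not the patch merged in review 23050. It only shows one way a callback could validate its cookie before reading through it, assuming the failure path can hand over the operation itself instead of NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real GlusterFS types. */
typedef void (*mock_cbk_t)(int op_ret, int op_errno);

typedef struct {
    mock_cbk_t data; /* caller's completion callback; may be NULL */
    int error;       /* errno recorded on the operation */
} mock_fop_t;

/* Last values delivered to record_result(), kept for inspection. */
static int last_op_ret;
static int last_op_errno;

static void record_result(int op_ret, int op_errno)
{
    last_op_ret = op_ret;
    last_op_errno = op_errno;
}

/*
 * Callback that receives the operation through an opaque cookie.
 * The cookie is checked before it is dereferenced, so a NULL cookie
 * yields -EINVAL instead of a crash.
 */
static int mock_heal_cbk(void *cookie, int op_ret, int op_errno)
{
    mock_fop_t *fop = cookie;

    if (fop == NULL || fop->data == NULL) {
        return -EINVAL;
    }
    fop->data(op_ret, op_errno);
    return 0;
}

/* Failure path: pass the operation itself as the cookie, never NULL. */
static int mock_heal_fail(mock_fop_t *fop)
{
    if (fop == NULL) {
        return -EINVAL;
    }
    return mock_heal_cbk(fop, -1, fop->error);
}

/* Small usage example exercising both the NULL and the valid case. */
static int mock_demo(void)
{
    mock_fop_t fop = { record_result, ENOMEM };

    assert(mock_heal_cbk(NULL, -1, ENOMEM) == -EINVAL);
    assert(mock_heal_fail(&fop) == 0);
    assert(last_op_ret == -1 && last_op_errno == ENOMEM);
    return 0;
}
```

In this sketch the NULL check keeps the process alive when memory is exhausted, and the caller still receives op_ret=-1 with ENOMEM through its own callback. Whether the real fix guards the cookie, changes what ec_heal_fail passes, or takes another route is not stated in this report.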