Bug 1401817
Summary: | glusterfsd crashed while taking snapshot using scheduler | |||
---|---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Anil Shah <ashah> | |
Component: | core | Assignee: | Atin Mukherjee <amukherj> | |
Status: | CLOSED ERRATA | QA Contact: | Anil Shah <ashah> | |
Severity: | urgent | Docs Contact: | ||
Priority: | unspecified | |||
Version: | rhgs-3.2 | CC: | amukherj, rcyriac, rhs-bugs, storage-qa-internal | |
Target Milestone: | --- | |||
Target Release: | RHGS 3.2.0 | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | glusterfs-3.8.4-8 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1401921 (view as bug list) | Environment: | ||
Last Closed: | 2017-03-23 05:54:37 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1351528, 1401921, 1402694, 1402697 |
Description
Anil Shah
2016-12-06 08:07:05 UTC
The function from where this core was generated is glusterfs_handle_barrier (). From the core it looks like glusterfsd_ctx (global context) in the brick process didn't have ctx->active initialized which happens during graph initialization. We also saw that when barrier brick op was sent by GlusterD brick process just came up. The hypothesis we have here is as follows: T1. Brick process was in its init. However it still didn't finish doing the graph generation. T2. GlusterD sent a barrier brick op (as a trigger to snapshot initiated by snapshot scheduler) as it understood the brick to be connected (received the rpc connect notify from brick process) The time gap between T1 & T2 is very minimum and currently GlusterD doesn't know whether the brick process has finished all its initialization including the graph generation. One mitigation approach to avoid this crash is to avoid null pointer dereferencing which can be addressed by a simple patch and then even if we hit this race, barrier would fail. But to fix this race entirely we need to come up with a concrete solution which may not be feasible in 3.2.0 time lines. Many thanks to Rajesh Jospeh for helping in the RCA. The patch to address null pointer dereferencing is put up for review at upstream master : http://review.gluster.org/#/c/16043 upstream mainline : http://review.gluster.org/#/c/16043 upstream 3.8 : http://review.gluster.org/#/c/16066/ upstream 3.9 : http://review.gluster.org/#/c/16067/ downstream patch : https://code.engineering.redhat.com/gerrit/92447 Created 256 snapshots on volume using scheduler. Not seeing any brick crash. Bug verified on build glusterfs-3.8.4-8.el7rhgs.x86_64 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html |