Bug 1401817 - glusterfsd crashed while taking snapshot using scheduler
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: core
Version: 3.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.2.0
Assigned To: Atin Mukherjee
QA Contact: Anil Shah
Docs Contact:
Depends On:
Blocks: 1351528 1401921 1402694 1402697
 
Reported: 2016-12-06 03:07 EST by Anil Shah
Modified: 2017-03-23 01:54 EDT (History)
4 users (show)

See Also:
Fixed In Version: glusterfs-3.8.4-8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1401921
Environment:
Last Closed: 2017-03-23 01:54:37 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 05:18:45 EDT

Description Anil Shah 2016-12-06 03:07:05 EST
Description of problem:

While taking a snapshot using the scheduler, one of the brick processes crashed.


Version-Release number of selected component (if applicable):

glusterfs-3.8.4-6.el7rhgs.x86_64

How reproducible:

1/1


Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume
2. Enable the snapshot scheduler
3. Schedule a snapshot every minute
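The steps above correspond roughly to the following CLI sequence. This is a sketch, not the reporter's exact commands: the hostnames, brick paths, and the scheduler job name are placeholders, and snap_scheduler.py argument syntax may vary slightly by release.

```shell
# Create and start a 2x2 distributed-replicate volume
# (host1..host4 and the brick paths are placeholders)
gluster volume create repvol replica 2 \
    host1:/bricks/b1 host2:/bricks/b2 host3:/bricks/b3 host4:/bricks/b4
gluster volume start repvol

# The snapshot scheduler requires the shared storage volume
gluster volume set all cluster.enable-shared-storage enable

# Initialize and enable the scheduler, then schedule a snapshot
# every minute ("* * * * *" is standard cron syntax)
snap_scheduler.py init
snap_scheduler.py enable
snap_scheduler.py add "every_minute_snap" "* * * * *" "repvol"
```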

Actual results:

One of the brick processes crashed.

Expected results:



Additional info:

bt
=======================

#0  0x00007f19a2a12394 in glusterfs_handle_barrier (req=0x7f19a30cffcc) at glusterfsd-mgmt.c:1348
        ret = <optimized out>
        brick_req = {name = 0x7f198c0008e0 "repvol", op = 10, input = {input_len = 1783, 
            input_val = 0x7f198c000900 ""}}
        brick_rsp = {op_ret = 0, op_errno = 0, output = {output_len = 0, output_val = 0x0}, op_errstr = 0x0}
        ctx = 0x7f19a3085010
        active = 0x0
        any = 0x0
        xlator = 0x0
        old_THIS = 0x0
        dict = 0x0
        name = '\000' <repeats 1023 times>
        barrier = _gf_true
        barrier_err = _gf_false
        __FUNCTION__ = "glusterfs_handle_barrier"
#1  0x00007f19a2550a92 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
        task = 0x7f1990002510
#2  0x00007f19a0c0fcf0 in ?? () from /lib64/libc.so.6
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.
Comment 5 Atin Mukherjee 2016-12-06 07:03:04 EST
The function from which this core was generated is glusterfs_handle_barrier(). From the core it looks like glusterfsd_ctx (the global context) in the brick process did not have ctx->active initialized; that initialization happens during graph generation. We also saw that the brick process had only just come up when GlusterD sent the barrier brick op. Our hypothesis is as follows:

T1. The brick process was in its init, but had not yet finished graph generation.
T2. GlusterD sent a barrier brick op (as a trigger for the snapshot initiated by the snapshot scheduler) because it considered the brick connected (it had received the RPC connect notification from the brick process).

The time gap between T1 and T2 is very small, and currently GlusterD does not know whether the brick process has finished all of its initialization, including graph generation.

One mitigation for this crash is to avoid the null-pointer dereference, which a simple patch can address; even if we then hit this race, the barrier op would fail gracefully rather than crash the brick. Fixing the race entirely requires a more involved solution, which may not be feasible in the 3.2.0 timeline.
Comment 6 Atin Mukherjee 2016-12-06 07:15:53 EST
Many thanks to Rajesh Joseph for helping with the RCA.
Comment 7 Atin Mukherjee 2016-12-06 07:20:41 EST
The patch to address the null-pointer dereference has been posted for review on upstream master: http://review.gluster.org/#/c/16043
Comment 10 Atin Mukherjee 2016-12-08 02:54:20 EST
upstream mainline : http://review.gluster.org/#/c/16043
upstream 3.8 : http://review.gluster.org/#/c/16066/
upstream 3.9 : http://review.gluster.org/#/c/16067/
downstream patch : https://code.engineering.redhat.com/gerrit/92447
Comment 13 Anil Shah 2016-12-19 02:50:54 EST
Created 256 snapshots on the volume using the scheduler; no brick crashes observed.
Bug verified on build glusterfs-3.8.4-8.el7rhgs.x86_64
Comment 15 errata-xmlrpc 2017-03-23 01:54:37 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
