Bug 1341942 - glusterd coredump due to assert failed with GF_ASSERT (GD_OP_HEAL_VOLUME == op)
Summary: glusterd coredump due to assert failed with GF_ASSERT (GD_OP_HEAL_VOLUME ...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.6.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-06-02 06:00 UTC by George
Modified: 2016-08-01 04:43 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-01 04:42:12 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
coredump file (608.17 KB, application/octet-stream)
2016-06-07 13:50 UTC, George

Description George 2016-06-02 06:00:55 UTC
Description of problem:
glusterd aborts and dumps core when an assertion fails.

Version-Release number of selected component (if applicable):


How reproducible:
Run the CLI command gluster volume heal ... in a loop.

Steps to Reproduce:
1.
2.
3.

Actual results: glusterd dumps core with the following backtrace:
(gdb) bt
#0  0x00007f3abd94b177 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f3abd94c5fa in __GI_abort () at abort.c:89
#2  0x00007f3abd94415d in __assert_fail_base (fmt=0x7f3abda7b768 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7f3aba3345c0 "GD_OP_HEAL_VOLUME == op", file=file@entry=0x7f3aba32fc02 "glusterd-utils.c", line=line@entry=10921,
    function=function@entry=0x7f3aba338120 "glusterd_volume_heal_use_rsp_dict") at assert.c:92
#3  0x00007f3abd944212 in __GI___assert_fail (assertion=0x7f3aba3345c0 "GD_OP_HEAL_VOLUME == op", file=0x7f3aba32fc02 "glusterd-utils.c", line=10921,
    function=0x7f3aba338120 "glusterd_volume_heal_use_rsp_dict") at assert.c:101
#4  0x00007f3aba295e30 in glusterd_volume_heal_use_rsp_dict () from /usr/lib64/glusterfs/3.6.9/xlator/mgmt/glusterd.so
#5  0x00007f3aba2f651f in glusterd_syncop_aggr_rsp_dict () from /usr/lib64/glusterfs/3.6.9/xlator/mgmt/glusterd.so
#6  0x00007f3aba2f7e9c in _gd_syncop_commit_op_cbk () from /usr/lib64/glusterfs/3.6.9/xlator/mgmt/glusterd.so
#7  0x00007f3aba29f3a8 in glusterd_big_locked_cbk () from /usr/lib64/glusterfs/3.6.9/xlator/mgmt/glusterd.so
#8  0x00007f3aba2f7fac in gd_syncop_commit_op_cbk () from /usr/lib64/glusterfs/3.6.9/xlator/mgmt/glusterd.so
#9  0x00007f3abe75d6a0 in rpc_clnt_handle_reply () from /usr/lib64/libgfrpc.so.0
#10 0x00007f3abe75d914 in rpc_clnt_notify () from /usr/lib64/libgfrpc.so.0
#11 0x00007f3abe75a073 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#12 0x00007f3ab952b89e in ?? () from /usr/lib64/glusterfs/3.6.9/rpc-transport/socket.so
#13 0x00007f3ab952dc58 in ?? () from /usr/lib64/glusterfs/3.6.9/rpc-transport/socket.so
#14 0x00007f3abe9da9f9 in ?? () from /usr/lib64/libglusterfs.so.0
#15 0x0000000000405288 in main ()


Expected results:


Additional info:
Preliminary root-cause analysis:
Is the global variable opinfo.op being cleared by another thread? There appears to be no lock around accesses to opinfo. Alternatively, should opinfo be read from the transaction (through the API glusterd_get_txn_opinfo) rather than from the global variable?
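
The hypothesis above can be illustrated with a minimal C sketch. It is not glusterd code: every type and function name below (sk_op_t, sk_txn_set, sk_heal_cbk_global, and so on) is hypothetical, and the interleaving of two transactions is simulated sequentially rather than with real threads. It only shows why a callback that reads a single shared slot can observe another transaction's op, while a lookup keyed by transaction ID cannot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; these are NOT the glusterd definitions. */
typedef enum { SK_OP_NONE = 0, SK_OP_HEAL_VOLUME, SK_OP_STATUS_VOLUME } sk_op_t;

typedef struct {
    char    txn_id[16]; /* stand-in for a transaction UUID */
    sk_op_t op;
    int     in_use;
} sk_txn_t;

/* Pattern 1: a single process-wide slot shared by every transaction. */
static sk_op_t sk_global_op = SK_OP_NONE;

/* Pattern 2: one slot per transaction, looked up by transaction ID. */
#define SK_MAX_TXN 8
static sk_txn_t sk_txn_table[SK_MAX_TXN];

/* Store the op for a transaction in the first free slot. */
static int sk_txn_set(const char *txn_id, sk_op_t op)
{
    for (size_t i = 0; i < SK_MAX_TXN; i++) {
        if (!sk_txn_table[i].in_use) {
            strncpy(sk_txn_table[i].txn_id, txn_id,
                    sizeof(sk_txn_table[i].txn_id) - 1);
            sk_txn_table[i].txn_id[sizeof(sk_txn_table[i].txn_id) - 1] = '\0';
            sk_txn_table[i].op = op;
            sk_txn_table[i].in_use = 1;
            return 0;
        }
    }
    return -1; /* table full */
}

/* Fetch the op recorded for a given transaction ID. */
static int sk_txn_get(const char *txn_id, sk_op_t *op)
{
    for (size_t i = 0; i < SK_MAX_TXN; i++) {
        if (sk_txn_table[i].in_use &&
            strcmp(sk_txn_table[i].txn_id, txn_id) == 0) {
            *op = sk_txn_table[i].op;
            return 0;
        }
    }
    return -1; /* unknown transaction */
}

/* Heal callback reading the shared slot after a second transaction started. */
sk_op_t sk_heal_cbk_global(void)
{
    sk_global_op = SK_OP_HEAL_VOLUME;   /* heal transaction begins */
    sk_global_op = SK_OP_STATUS_VOLUME; /* another transaction begins before
                                           the heal reply arrives */
    return sk_global_op;                /* heal callback sees the wrong op */
}

/* Heal callback reading its own per-transaction record instead. */
sk_op_t sk_heal_cbk_txn(void)
{
    sk_op_t op = SK_OP_NONE;

    if (sk_txn_set("txn-heal", SK_OP_HEAL_VOLUME) != 0 ||
        sk_txn_set("txn-status", SK_OP_STATUS_VOLUME) != 0)
        return SK_OP_NONE; /* table full */

    int ret = sk_txn_get("txn-heal", &op);
    assert(ret == 0); /* a registered transaction must be found */
    (void)ret;
    return op;
}
```

In the first pattern an assertion of the form "op == heal" fails as soon as any other operation overwrites the shared slot; in the second it holds regardless of what other transactions do. Whether this is what actually happened in the reported crash is not confirmed by the backtrace alone.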

Comment 1 Mohammed Rafi KC 2016-06-07 12:36:17 UTC
If possible, can you upload the generated core file? Were any CLI commands running in parallel from another server?

Comment 2 George 2016-06-07 13:50:35 UTC
Created attachment 1165657 [details]
coredump file

Comment 3 George 2016-06-07 14:02:17 UTC
(In reply to Mohammed Rafi KC from comment #1)
> if possible, can you upload generated core file? was there any parallel cli
> commands running from any other server ?

Core file uploaded as an attachment.
CLI commands may have been running in parallel; I am not sure.

In 3.6.9, the function glusterd_volume_heal_use_rsp_dict contains:
        GF_ASSERT (rsp_dict);

        op = glusterd_op_get_op ();
        GF_ASSERT (GD_OP_HEAL_VOLUME == op);



In the latest code in the git repository, the function "glusterd_volume_heal_use_rsp_dict" has been changed as follows:


        GF_ASSERT (rsp_dict);

        ret = dict_get_bin (aggr, "transaction_id", (void **)&txn_id);
        if (ret)
                goto out;
        gf_msg_debug (THIS->name, 0, "transaction ID = %s",
                uuid_utoa (*txn_id));

        ret = glusterd_get_txn_opinfo (txn_id, &txn_op_info);
        if (ret) {
                gf_msg_callingfn (THIS->name, GF_LOG_ERROR, 0,
                        GD_MSG_TRANS_OPINFO_GET_FAIL,
                        "Unable to get transaction opinfo "
                        "for transaction ID : %s",
                        uuid_utoa (*txn_id));
                goto out;
        }

        op = txn_op_info.op;
        GF_ASSERT (GD_OP_HEAL_VOLUME == op); 

This should resolve the GF_ASSERT failure I hit, but I am still confused about three points:
1) In the 3.6.9 code, I assume that if the first parameter is NULL, the function falls back to the global variable "opinfo.op" to get the dict.
The latest code has no such logic. Is that acceptable?

2) In the latest code, the added lines seem only to check whether txn_op_info.op is valid and serve no other purpose. This confuses me.

3) The else branch in the following code from the function appears to be unreachable:

         if (aggr) {
                ctx_dict = aggr;

        } else {
                ctx_dict = txn_op_info.op_ctx;
        }

Reason: if aggr is NULL, the earlier code takes the "goto out" path; if aggr is not NULL, the if branch is taken. Either way, the else branch is never entered.
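
The reasoning in point 3 can be checked with a small C sketch of the control flow. This is not the glusterd function: sk_dict_t, sk_get_txn_id and sk_pick_ctx are hypothetical names, and the sketch assumes that the transaction-ID lookup fails whenever it is given a NULL dict, which is what the argument relies on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dict; NOT the glusterd dict_t. */
typedef struct sk_dict { int unused; } sk_dict_t;

/* Where the context dict would come from. */
typedef enum { SK_CTX_NONE, SK_CTX_FROM_AGGR, SK_CTX_FROM_TXN } sk_ctx_src_t;

/*
 * Stand-in for the transaction-ID lookup on aggr.
 * Assumption: the lookup fails (non-zero) when the dict is NULL.
 */
static int sk_get_txn_id(const sk_dict_t *aggr)
{
    return aggr ? 0 : -1;
}

/* Mirrors the order of checks described in the comment above. */
sk_ctx_src_t sk_pick_ctx(const sk_dict_t *aggr)
{
    if (sk_get_txn_id(aggr) != 0)
        return SK_CTX_NONE; /* corresponds to "goto out" */

    /* Under the assumption above, aggr cannot be NULL past this point. */
    assert(aggr != NULL);

    if (aggr)
        return SK_CTX_FROM_AGGR;
    else
        return SK_CTX_FROM_TXN; /* unreachable in this sketch */
}
```

With a NULL argument the function returns through the early-exit path, and with a non-NULL argument it takes the if branch, so no input reaches the else branch as long as the stated assumption about the lookup holds.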

Comment 4 Atin Mukherjee 2016-08-01 04:42:12 UTC
This is not a security bug, so it will not be fixed in 3.6.x; see
http://www.gluster.org/pipermail/gluster-users/2016-July/027682.html

Comment 5 Atin Mukherjee 2016-08-01 04:43:43 UTC
If the issue persists in the latest releases, please feel free to clone this bug.

