Bug 1341942

Summary: glusterd coredump due to assert failed with GF_ASSERT (GD_OP_HEAL_VOLUME == op)
Product: [Community] GlusterFS Reporter: George <george.lian>
Component: glusterd Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED WONTFIX QA Contact:
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 3.6.9 CC: bugs, george.lian, rkavunga
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-08-01 04:42:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
coredump file none

Description George 2016-06-02 06:00:55 UTC
Description of problem:
glusterd aborts with a core dump when the assertion GF_ASSERT (GD_OP_HEAL_VOLUME == op) fails.

Version-Release number of selected component (if applicable):


How reproducible:
Run the CLI command gluster volume heal ... in a loop.

Steps to Reproduce:
1.
2.
3.

Actual results: glusterd dumps core with the following backtrace:
(gdb) bt
#0  0x00007f3abd94b177 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f3abd94c5fa in __GI_abort () at abort.c:89
#2  0x00007f3abd94415d in __assert_fail_base (fmt=0x7f3abda7b768 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7f3aba3345c0 "GD_OP_HEAL_VOLUME == op", file=file@entry=0x7f3aba32fc02 "glusterd-utils.c", line=line@entry=10921,
    function=function@entry=0x7f3aba338120 "glusterd_volume_heal_use_rsp_dict") at assert.c:92
#3  0x00007f3abd944212 in __GI___assert_fail (assertion=0x7f3aba3345c0 "GD_OP_HEAL_VOLUME == op", file=0x7f3aba32fc02 "glusterd-utils.c", line=10921,
    function=0x7f3aba338120 "glusterd_volume_heal_use_rsp_dict") at assert.c:101
#4  0x00007f3aba295e30 in glusterd_volume_heal_use_rsp_dict () from /usr/lib64/glusterfs/3.6.9/xlator/mgmt/glusterd.so
#5  0x00007f3aba2f651f in glusterd_syncop_aggr_rsp_dict () from /usr/lib64/glusterfs/3.6.9/xlator/mgmt/glusterd.so
#6  0x00007f3aba2f7e9c in _gd_syncop_commit_op_cbk () from /usr/lib64/glusterfs/3.6.9/xlator/mgmt/glusterd.so
#7  0x00007f3aba29f3a8 in glusterd_big_locked_cbk () from /usr/lib64/glusterfs/3.6.9/xlator/mgmt/glusterd.so
#8  0x00007f3aba2f7fac in gd_syncop_commit_op_cbk () from /usr/lib64/glusterfs/3.6.9/xlator/mgmt/glusterd.so
#9  0x00007f3abe75d6a0 in rpc_clnt_handle_reply () from /usr/lib64/libgfrpc.so.0
#10 0x00007f3abe75d914 in rpc_clnt_notify () from /usr/lib64/libgfrpc.so.0
#11 0x00007f3abe75a073 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#12 0x00007f3ab952b89e in ?? () from /usr/lib64/glusterfs/3.6.9/rpc-transport/socket.so
#13 0x00007f3ab952dc58 in ?? () from /usr/lib64/glusterfs/3.6.9/rpc-transport/socket.so
#14 0x00007f3abe9da9f9 in ?? () from /usr/lib64/libglusterfs.so.0
#15 0x0000000000405288 in main ()


Expected results:


Additional info:
Draft root-cause analysis:
Is the global variable opinfo.op being cleared by another thread? There seems to be no lock around accesses to opinfo. Alternatively, should opinfo be fetched from the transaction (through the API glusterd_get_txn_opinfo) rather than from the global variable?
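
The suspicion above can be illustrated with a minimal sketch. It is not glusterd code: every name in it (sketch_op_t, global_set_op, txn_set_op, txn_get_op, and so on) is hypothetical, and it replays the suspected interleaving in a single thread instead of using real threads. It only contrasts one shared slot, which the last writer overwrites, with a per-transaction lookup keyed by ID.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; these are not glusterd's types. */
typedef enum { OP_NONE = 0, OP_HEAL_VOLUME, OP_OTHER } sketch_op_t;

/* Variant A: one process-wide slot, analogous to a global opinfo.op. */
static sketch_op_t global_op = OP_NONE;

void global_set_op(sketch_op_t op) { global_op = op; }
sketch_op_t global_get_op(void) { return global_op; }

/* Variant B: one slot per transaction, looked up by transaction ID. */
#define MAX_TXN 8
#define TXN_ID_LEN 32

typedef struct {
    char id[TXN_ID_LEN];
    sketch_op_t op;
} txn_entry_t;

static txn_entry_t txn_table[MAX_TXN];
static int txn_count = 0;

/* Store or update the op of a transaction; 0 on success, -1 if the table is full. */
int txn_set_op(const char *id, sketch_op_t op)
{
    for (int i = 0; i < txn_count; i++) {
        if (strcmp(txn_table[i].id, id) == 0) {
            txn_table[i].op = op;
            return 0;
        }
    }
    if (txn_count == MAX_TXN)
        return -1;
    snprintf(txn_table[txn_count].id, TXN_ID_LEN, "%s", id);
    txn_table[txn_count].op = op;
    txn_count++;
    return 0;
}

/* Fetch the op of a transaction; 0 on success, -1 if the ID is unknown. */
int txn_get_op(const char *id, sketch_op_t *op)
{
    for (int i = 0; i < txn_count; i++) {
        if (strcmp(txn_table[i].id, id) == 0) {
            *op = txn_table[i].op;
            return 0;
        }
    }
    return -1;
}
```

If a heal transaction records OP_HEAL_VOLUME in both variants and a second transaction then resets the shared slot, global_get_op() no longer reports the heal op, while txn_get_op() with the heal transaction's ID still does. Whether this interleaving is what actually happened in the reported crash is not confirmed here.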

Comment 1 Mohammed Rafi KC 2016-06-07 12:36:17 UTC
If possible, can you upload the generated core file? Were any CLI commands running in parallel from another server?

Comment 2 George 2016-06-07 13:50:35 UTC
Created attachment 1165657 [details]
coredump file

Comment 3 George 2016-06-07 14:02:17 UTC
(In reply to Mohammed Rafi KC from comment #1)
> If possible, can you upload the generated core file? Were any CLI
> commands running in parallel from another server?

The core file is uploaded as an attachment.
CLI commands may have been running in parallel; I am not sure.

In 3.6.9, the function glusterd_volume_heal_use_rsp_dict contains:
        GF_ASSERT (rsp_dict);

        op = glusterd_op_get_op ();
        GF_ASSERT (GD_OP_HEAL_VOLUME == op);



In the newest code in the git repository, the function "glusterd_volume_heal_use_rsp_dict" has been changed as follows:


        GF_ASSERT (rsp_dict);

        ret = dict_get_bin (aggr, "transaction_id", (void **)&txn_id);
        if (ret)
                goto out;
        gf_msg_debug (THIS->name, 0, "transaction ID = %s",
                uuid_utoa (*txn_id));

        ret = glusterd_get_txn_opinfo (txn_id, &txn_op_info);
        if (ret) {
                gf_msg_callingfn (THIS->name, GF_LOG_ERROR, 0,
                        GD_MSG_TRANS_OPINFO_GET_FAIL,
                        "Unable to get transaction opinfo "
                        "for transaction ID : %s",
                        uuid_utoa (*txn_id));
                goto out;
        }

        op = txn_op_info.op;
        GF_ASSERT (GD_OP_HEAL_VOLUME == op); 

This should resolve the GF_ASSERT failure I hit, but I am still confused about a few points:
1) In the 3.6.9 code, I suppose that if the first parameter is NULL, the function falls back to the global variable "opinfo.op" to get the dict.
The latest code no longer has this logic. Is that acceptable?

2) In the latest code, the added code seems only to check whether txn_op_info.op is valid, and seems to have no other use. This really confuses me.

3) The else branch in the following part of the function seems to be unreachable:

         if (aggr) {
                ctx_dict = aggr;

        } else {
                ctx_dict = txn_op_info.op_ctx;
        }

REASON: if aggr is NULL, the earlier code jumps to out; if aggr is not NULL, the if branch is taken. Either way, the else branch is never entered.
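
The reasoning in point 3 can be checked with a small model of the control flow. This is a sketch only: dict_t here is a hypothetical placeholder struct, lookup_txn_id is a hypothetical stand-in for the transaction-ID lookup, and the sketch assumes (as the reasoning above does) that the lookup fails when aggr is NULL.

```c
#include <stddef.h>

/* Hypothetical placeholder; not the real glusterfs dict_t. */
typedef struct { int unused; } dict_t;

typedef enum { BRANCH_EARLY_OUT, BRANCH_IF, BRANCH_ELSE } branch_t;

/* Assumption: the transaction-ID lookup fails (non-zero) for a NULL dict. */
int lookup_txn_id(const dict_t *aggr)
{
    return aggr != NULL ? 0 : -1;
}

/* Mirrors the order of checks in the quoted snippet and reports
 * which path a given aggr value would take. */
branch_t which_branch(const dict_t *aggr)
{
    if (lookup_txn_id(aggr) != 0)
        return BRANCH_EARLY_OUT;   /* corresponds to "goto out" */

    if (aggr)
        return BRANCH_IF;          /* ctx_dict = aggr */

    return BRANCH_ELSE;            /* ctx_dict = txn_op_info.op_ctx */
}
```

Under that assumption, which_branch(NULL) yields BRANCH_EARLY_OUT and any non-NULL pointer yields BRANCH_IF, so no input reaches BRANCH_ELSE. If the real lookup could succeed with a NULL aggr, the conclusion would not hold.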

Comment 4 Atin Mukherjee 2016-08-01 04:42:12 UTC
This is not a security bug, so it will not be fixed in 3.6.x. See
http://www.gluster.org/pipermail/gluster-users/2016-July/027682.html

Comment 5 Atin Mukherjee 2016-08-01 04:43:43 UTC
If the issue persists in the latest releases, please feel free to clone this bug.