Bug 799265

Summary: [glusterfs-3.3.0qa25]: glustershd process asserted since the dictionary for sending the reply was NULL
Product: [Community] GlusterFS Reporter: Raghavendra Bhat <rabhat>
Component: replicate Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: mainline CC: gluster-bugs, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.4.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-07-24 17:09:04 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 817967    

Description Raghavendra Bhat 2012-03-02 10:18:09 UTC
Description of problem:

Same setup as bug 799262 (i.e. a 2-replica volume, with 2 more bricks added to make it a 2x2 distributed-replicate volume), with 1 fuse client and 1 nfs client. On the fuse client, ran fs-perf-test with 4444 fds as the argument. While the test was running, brought a brick down and brought it back up after some time. Ran the gluster volume heal <volname> command to trigger self-heal. glustershd crashed with the following backtrace.

Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /etc/'.
Program terminated with signal 6, Aborted.
#0  0x000000390f432905 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64 libgcc-4.4.5-6.el6.x86_64
(gdb) bt
#0  0x000000390f432905 in raise () from /lib64/libc.so.6
#1  0x000000390f4340e5 in abort () from /lib64/libc.so.6
#2  0x000000390f42b9be in __assert_fail_base () from /lib64/libc.so.6
#3  0x000000390f42ba80 in __assert_fail () from /lib64/libc.so.6
#4  0x0000000000408d5b in glusterfs_xlator_op_response_send (req=0x1623d9c, op_ret=-1, msg=0x40e2d8 "", output=0x0)
    at ../../../glusterfsd/src/glusterfsd-mgmt.c:328
#5  0x000000000040a26d in glusterfs_handle_translator_op (data=0x1623d9c) at ../../../glusterfsd/src/glusterfsd-mgmt.c:731
#6  0x00007f87d3b18753 in synctask_wrap (old_task=0x17302a0) at ../../../libglusterfs/src/syncop.c:144
#7  0x000000390f443690 in ?? () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()
(gdb) f 4
#4  0x0000000000408d5b in glusterfs_xlator_op_response_send (req=0x1623d9c, op_ret=-1, msg=0x40e2d8 "", output=0x0)
    at ../../../glusterfsd/src/glusterfsd-mgmt.c:328
328             GF_ASSERT (output);
(gdb) p output
$1 = (dict_t *) 0x0
(gdb) f 5
#5  0x000000000040a26d in glusterfs_handle_translator_op (data=0x1623d9c) at ../../../glusterfsd/src/glusterfsd-mgmt.c:731
731             glusterfs_xlator_op_response_send (req, ret, "", output);
(gdb) p output
$2 = (dict_t *) 0x0
(gdb) l glusterfs_handle_translator_op
651             return NULL;
652     }
653
654     int
655     glusterfs_handle_translator_op (void *data)
656     {
657             int32_t                  ret     = -1;
658             gd1_mgmt_brick_op_req    xlator_req = {0,};
659             dict_t                   *input    = NULL;
660             xlator_t                 *xlator = NULL;
(gdb) 
661             xlator_t                 *any = NULL;
662             dict_t                   *output = NULL;
663             char                     key[2048] = {0};
664             char                    *xname = NULL;
665             glusterfs_ctx_t          *ctx = NULL;
666             glusterfs_graph_t        *active = NULL;
667             xlator_t                 *this = NULL;
668             int                      i = 0;
669             int                      count = 0;
670             rpcsvc_request_t         *req = data;
(gdb) 
671
672             GF_ASSERT (req);
673             this = THIS;
674             GF_ASSERT (this);
675
676             if (!xdr_to_generic (req->msg[0], &xlator_req,
677                                  (xdrproc_t)xdr_gd1_mgmt_brick_op_req)) {
678                     //failed to decode msg;
679                     req->rpc_err = GARBAGE_ARGS;
680                     goto out;
(gdb) 
681             }
682
683             ctx = glusterfs_ctx_get ();
684             active = ctx->active;
685             any = active->first;
686             input = dict_new ();
687             ret = dict_unserialize (xlator_req.input.input_val,
688                                     xlator_req.input.input_len,
689                                     &input);
690             if (ret < 0) {
(gdb) 
691                     gf_log (this->name, GF_LOG_ERROR,
692                             "failed to "
693                             "unserialize req-buffer to dictionary");
694                     goto out;
695             } else {
696                     input->extra_stdfree = xlator_req.input.input_val;
697             }
698
699             ret = dict_get_int32 (input, "count", &count);
700
(gdb) 

 


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a replicate volume, start it and mount it.
2. Run fs-perf-test with the number of fds as the argument (4444 in this case).
3. Bring a brick down, bring it back up after some time, and trigger self-heal via the gluster cli command.
  
Actual results:

gluster self-heal daemon crashed.

Expected results:

gluster self-heal-daemon should not crash.

Additional info:

In glusterfs_handle_translator_op, if the dictionary sent by glusterd cannot be unserialized, control jumps to out and calls glusterfs_xlator_op_response_send, which expects the output dictionary to be present. However, glusterfs_handle_translator_op creates the output dictionary only after it has unserialized the input dictionary received from glusterd, so output is still NULL on this path and GF_ASSERT (output) fails.
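
The failure path described above can be illustrated with a minimal, standalone sketch. This is not the glusterfs code and not the patch that was merged for this bug: sketch_dict_t, sketch_response_send and sketch_handle_op are hypothetical stand-ins, and the unserialize step is reduced to a simple NULL/empty check on the buffer. It only shows one way a handler could reach its common exit with a NULL output dictionary, and a response sender that tolerates that instead of asserting.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for dict_t; not the real libglusterfs type. */
typedef struct {
        int count;
} sketch_dict_t;

/*
 * Hypothetical stand-in for the response sender. Instead of asserting
 * that output is non-NULL, it treats a NULL dictionary as an empty
 * reply. Returns the number of entries that would go into the reply.
 */
static int
sketch_response_send (int op_ret, const char *msg,
                      const sketch_dict_t *output)
{
        int entries = 0;

        assert (msg != NULL);
        if (output != NULL)
                entries = output->count;

        printf ("reply: op_ret=%d msg=\"%s\" entries=%d\n",
                op_ret, msg, entries);
        return entries;
}

/*
 * Hypothetical handler with the same control flow as the report:
 * the output dictionary is created only after the request buffer has
 * been accepted, so the early "goto out" reaches the sender with
 * output still NULL.
 */
static int
sketch_handle_op (const char *buf, size_t len)
{
        int            ret    = -1;
        sketch_dict_t *output = NULL;

        if (buf == NULL || len == 0) {
                /* Stands in for the failed unserialize ("buf is null!"). */
                goto out;
        }

        output = calloc (1, sizeof (*output));
        if (output == NULL)
                goto out;

        output->count = 1;
        ret = 0;
out:
        sketch_response_send (ret, ret ? "request failed" : "", output);
        free (output);
        return ret;
}
```

Calling sketch_handle_op (NULL, 0) models the crashing request: it returns -1 and still sends an error reply, where a sender that asserted on output would abort the process.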

[2012-03-02 02:43:27.951724] I [afr-self-heal-common.c:2028:afr_self_heal_completion_cbk] 0-mirror-replicate-1: background  meta-data data self-heal completed on
[2012-03-02 02:43:27.953170] I [afr-common.c:1290:afr_launch_self_heal] 0-mirror-replicate-1: background  meta-data data self-heal triggered. path: , reason: lookup detected pending operations
[2012-03-02 02:43:33.736881] I [afr-self-heal-algorithm.c:131:sh_loop_driver_done] 0-mirror-replicate-1: diff self-heal on : completed. (134 blocks of 278 were different (48.20%))
[2012-03-02 02:43:33.739047] I [afr-self-heal-common.c:2028:afr_self_heal_completion_cbk] 0-mirror-replicate-1: background  meta-data data self-heal completed on
[2012-03-02 02:43:33.742508] I [afr-common.c:1290:afr_launch_self_heal] 0-mirror-replicate-1: background  meta-data data self-heal triggered. path: , reason: lookup detected pending operations
[2012-03-02 02:43:39.836988] I [afr-self-heal-algorithm.c:131:sh_loop_driver_done] 0-mirror-replicate-1: diff self-heal on : completed. (134 blocks of 278 were different (48.20%))
[2012-03-02 02:43:39.839189] I [afr-self-heal-common.c:2028:afr_self_heal_completion_cbk] 0-mirror-replicate-1: background  meta-data data self-heal completed on
[2012-03-02 02:43:39.842079] I [afr-common.c:1290:afr_launch_self_heal] 0-mirror-replicate-1: background  meta-data data self-heal triggered. path: , reason: lookup detected pending operations
[2012-03-02 02:43:42.506148] I [afr-self-heald.c:890:afr_find_child_position] 0-mirror-replicate-0: child mirror-client-1 is remote
[2012-03-02 02:43:42.522395] W [dict.c:2578:dict_unserialize] (-->/lib64/libc.so.6() [0x390f443690] (-->/usr/local/lib/libglusterfs.so.0(synctask_wrap+0x38) [0x7f87d3b18753] (-->/usr/local/sbin/glusterfs(glusterfs_handle_translator_op+0x1a8) [0x409f55]))) 0-dict: buf is null!
[2012-03-02 02:43:42.522427] E [glusterfsd-mgmt.c:693:glusterfs_handle_translator_op] 0-glusterfs: failed to unserialize req-buffer to dictionary
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2012-03-02 02:43:42
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0qa25
/lib64/libc.so.6[0x390f432980]
/lib64/libc.so.6(gsignal+0x35)[0x390f432905]
/lib64/libc.so.6(abort+0x175)[0x390f4340e5]
/lib64/libc.so.6[0x390f42b9be]
/lib64/libc.so.6(__assert_perror_fail+0x0)[0x390f42ba80]

Comment 1 Amar Tumballi 2012-03-12 09:46:14 UTC
Please update these bugs with respect to 3.3.0qa27; we need to work on them as per the target milestone set.

Comment 2 Anand Avati 2012-03-18 06:24:27 UTC
CHANGE: http://review.gluster.com/2961 (glusterfsd: Handle errors in response send) merged in master by Anand Avati (avati)

Comment 3 Raghavendra Bhat 2012-04-05 10:33:25 UTC
Not seen with glusterfs-3.3.0qa33.