Description of problem:

Same setup as bug 799262 (i.e. a 2-replica volume, with 2 more bricks added to make it a 2x2 distributed-replicate volume), with 1 fuse client and 1 nfs client. On the fuse client, ran fs-perf-test with 4444 fds as the argument. While the test was running, brought a brick down and brought it back up after some time. Ran the gluster volume heal <volname> command to trigger self-heal. glustershd crashed with the following backtrace.

Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /etc/'.
Program terminated with signal 6, Aborted.
#0  0x000000390f432905 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64 libgcc-4.4.5-6.el6.x86_64
(gdb) bt
#0  0x000000390f432905 in raise () from /lib64/libc.so.6
#1  0x000000390f4340e5 in abort () from /lib64/libc.so.6
#2  0x000000390f42b9be in __assert_fail_base () from /lib64/libc.so.6
#3  0x000000390f42ba80 in __assert_fail () from /lib64/libc.so.6
#4  0x0000000000408d5b in glusterfs_xlator_op_response_send (req=0x1623d9c, op_ret=-1, msg=0x40e2d8 "", output=0x0) at ../../../glusterfsd/src/glusterfsd-mgmt.c:328
#5  0x000000000040a26d in glusterfs_handle_translator_op (data=0x1623d9c) at ../../../glusterfsd/src/glusterfsd-mgmt.c:731
#6  0x00007f87d3b18753 in synctask_wrap (old_task=0x17302a0) at ../../../libglusterfs/src/syncop.c:144
#7  0x000000390f443690 in ?? () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()
(gdb) f 4
#4  0x0000000000408d5b in glusterfs_xlator_op_response_send (req=0x1623d9c, op_ret=-1, msg=0x40e2d8 "", output=0x0) at ../../../glusterfsd/src/glusterfsd-mgmt.c:328
328             GF_ASSERT (output);
(gdb) p output
$1 = (dict_t *) 0x0
(gdb) f 5
#5  0x000000000040a26d in glusterfs_handle_translator_op (data=0x1623d9c) at ../../../glusterfsd/src/glusterfsd-mgmt.c:731
731             glusterfs_xlator_op_response_send (req, ret, "", output);
(gdb) p output
$2 = (dict_t *) 0x0
(gdb)
(gdb) l glusterfs_handle_translator_op
651             return NULL;
652     }
653
654     int
655     glusterfs_handle_translator_op (void *data)
656     {
657             int32_t                  ret     = -1;
658             gd1_mgmt_brick_op_req    xlator_req = {0,};
659             dict_t                   *input  = NULL;
660             xlator_t                 *xlator = NULL;
(gdb)
661             xlator_t                 *any    = NULL;
662             dict_t                   *output = NULL;
663             char                     key[2048] = {0};
664             char                     *xname  = NULL;
665             glusterfs_ctx_t          *ctx    = NULL;
666             glusterfs_graph_t        *active = NULL;
667             xlator_t                 *this   = NULL;
668             int                      i       = 0;
669             int                      count   = 0;
670             rpcsvc_request_t         *req    = data;
(gdb)
671
672             GF_ASSERT (req);
673             this = THIS;
674             GF_ASSERT (this);
675
676             if (!xdr_to_generic (req->msg[0], &xlator_req,
677                                  (xdrproc_t)xdr_gd1_mgmt_brick_op_req)) {
678                     //failed to decode msg;
679                     req->rpc_err = GARBAGE_ARGS;
680                     goto out;
(gdb)
681             }
682
683             ctx = glusterfs_ctx_get ();
684             active = ctx->active;
685             any = active->first;
686             input = dict_new ();
687             ret = dict_unserialize (xlator_req.input.input_val,
688                                     xlator_req.input.input_len,
689                                     &input);
690             if (ret < 0) {
(gdb)
691                     gf_log (this->name, GF_LOG_ERROR,
692                             "failed to "
693                             "unserialize req-buffer to dictionary");
694                     goto out;
695             } else {
696                     input->extra_stdfree = xlator_req.input.input_val;
697             }
698
699             ret = dict_get_int32 (input, "count", &count);
700
(gdb)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a replicate volume, start it and mount it.
2. Run fs-perf-test with the number of fds as argument (4444 in this case).
3. Bring a brick down, bring it up after some time and trigger self-heal via the gluster cli command.

Actual results:
The gluster self-heal daemon crashed.

Expected results:
The gluster self-heal daemon should not crash.

Additional info:

In the function glusterfs_handle_translator_op, if we cannot unserialize the dictionary that glusterd has sent, we jump to out and call glusterfs_xlator_op_response_send, which expects the output dictionary to be present. (glusterfs_handle_translator_op creates the new output dictionary only after unserializing the input dictionary it has received from glusterd, so on this error path output is still NULL and the GF_ASSERT at glusterfsd-mgmt.c:328 aborts the process.)

[2012-03-02 02:43:27.951724] I [afr-self-heal-common.c:2028:afr_self_heal_completion_cbk] 0-mirror-replicate-1: background meta-data data self-heal completed on
[2012-03-02 02:43:27.953170] I [afr-common.c:1290:afr_launch_self_heal] 0-mirror-replicate-1: background meta-data data self-heal triggered. path: , reason: lookup detected pending operations
[2012-03-02 02:43:33.736881] I [afr-self-heal-algorithm.c:131:sh_loop_driver_done] 0-mirror-replicate-1: diff self-heal on : completed. (134 blocks of 278 were different (48.20%))
[2012-03-02 02:43:33.739047] I [afr-self-heal-common.c:2028:afr_self_heal_completion_cbk] 0-mirror-replicate-1: background meta-data data self-heal completed on
[2012-03-02 02:43:33.742508] I [afr-common.c:1290:afr_launch_self_heal] 0-mirror-replicate-1: background meta-data data self-heal triggered. path: , reason: lookup detected pending operations
[2012-03-02 02:43:39.836988] I [afr-self-heal-algorithm.c:131:sh_loop_driver_done] 0-mirror-replicate-1: diff self-heal on : completed. (134 blocks of 278 were different (48.20%))
[2012-03-02 02:43:39.839189] I [afr-self-heal-common.c:2028:afr_self_heal_completion_cbk] 0-mirror-replicate-1: background meta-data data self-heal completed on
[2012-03-02 02:43:39.842079] I [afr-common.c:1290:afr_launch_self_heal] 0-mirror-replicate-1: background meta-data data self-heal triggered. path: , reason: lookup detected pending operations
[2012-03-02 02:43:42.506148] I [afr-self-heald.c:890:afr_find_child_position] 0-mirror-replicate-0: child mirror-client-1 is remote
[2012-03-02 02:43:42.522395] W [dict.c:2578:dict_unserialize] (-->/lib64/libc.so.6() [0x390f443690] (-->/usr/local/lib/libglusterfs.so.0(synctask_wrap+0x38) [0x7f87d3b18753] (-->/usr/local/sbin/glusterfs(glusterfs_handle_translator_op+0x1a8) [0x409f55]))) 0-dict: buf is null!
[2012-03-02 02:43:42.522427] E [glusterfsd-mgmt.c:693:glusterfs_handle_translator_op] 0-glusterfs: failed to unserialize req-buffer to dictionary
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2012-03-02 02:43:42
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0qa25
/lib64/libc.so.6[0x390f432980]
/lib64/libc.so.6(gsignal+0x35)[0x390f432905]
/lib64/libc.so.6(abort+0x175)[0x390f4340e5]
/lib64/libc.so.6[0x390f42b9be]
/lib64/libc.so.6(__assert_perror_fail+0x0)[0x390f42ba80]
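
The error path described under Additional info can be sketched in isolation. The following is a minimal, self-contained illustration and not GlusterFS code: every type and function name (sketch_dict_t, sketch_req_t, sketch_unserialize, sketch_response_send, sketch_handle_translator_op) is hypothetical, and the NULL check in the response sender is only one plausible way to avoid the abort, assumed here for illustration. It models a handler that jumps to out before the output dictionary exists, and a response sender that tolerates a missing output instead of asserting on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for dict_t; not the GlusterFS type. */
typedef struct {
        int count;
} sketch_dict_t;

/* Hypothetical stand-in for the RPC request. */
typedef struct {
        const char *buf;     /* serialized input; NULL models "buf is null!" */
        int         replied; /* set to 1 once a response has been sent */
        int         op_ret;  /* op_ret carried by that response */
} sketch_req_t;

/* Models the unserialize step: fails when the buffer is NULL. */
static int
sketch_unserialize (const char *buf, sketch_dict_t *out)
{
        if (buf == NULL)
                return -1;
        out->count = 1;
        return 0;
}

/*
 * Response sender that accepts a NULL output (assumption: an error reply
 * needs no payload). Returns the number of payload entries it would send.
 */
static int
sketch_response_send (sketch_req_t *req, int op_ret,
                      const sketch_dict_t *output)
{
        assert (req != NULL);
        req->op_ret  = op_ret;
        req->replied = 1;
        return (output != NULL) ? output->count : 0;
}

/* Handler with the same shape as the reported one: early "goto out". */
static int
sketch_handle_translator_op (sketch_req_t *req)
{
        int            ret    = -1;
        sketch_dict_t  input  = {0};
        sketch_dict_t *output = NULL;

        ret = sketch_unserialize (req->buf, &input);
        if (ret < 0)
                goto out; /* output has not been created yet */

        output = calloc (1, sizeof (*output));
        if (output == NULL) {
                ret = -1;
                goto out;
        }
        output->count = input.count;
        ret = 0;
out:
        /* Reached with output == NULL on every early-exit path. */
        sketch_response_send (req, ret, output);
        free (output);
        return ret;
}
```

Calling sketch_handle_translator_op on a request whose buf is NULL returns -1 and still marks the request as replied, whereas a sender that asserted on output would abort at that point, as frame 4 of the backtrace shows.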
Please update these bugs with respect to 3.3.0qa27; they need to be worked on as per the target milestone set.
CHANGE: http://review.gluster.com/2961 (glusterfsd: Handle errors in response send) merged in master by Anand Avati (avati)
Not seen with glusterfs-3.3.0qa33.