Description of problem:
-------------------------
glusterd is seen to crash with the following backtrace -

pending frames:
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-12-16 07:13:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.49rhs
/lib64/libc.so.6[0x309fc32960]
/usr/lib64/libglusterfs.so.0(synctask_yield+0x10)[0x30a104ae00]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(gd_stop_rebalance_process+0x2c5)[0x7fd2cae55f45]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(gd_check_and_update_rebalance_info+0xd8)[0x7fd2cae5d368]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_import_friend_volume+0x19f)[0x7fd2cae6b40f]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_import_friend_volumes+0x66)[0x7fd2cae6b536]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_compare_friend_data+0x142)[0x7fd2cae6b752]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(+0x378ac)[0x7fd2cae478ac]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_friend_sm+0x19e)[0x7fd2cae47f2e]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(__glusterd_handle_incoming_friend_req+0x2fe)[0x7fd2cae4672e]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fd2cae3619f]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x295)[0x30a1809585]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103)[0x30a18097c3]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x30a180adf8]
/usr/lib64/glusterfs/3.4.0.49rhs/rpc-transport/socket.so(+0x8d86)[0x7fd2c94bcd86]
/usr/lib64/glusterfs/3.4.0.49rhs/rpc-transport/socket.so(+0xa69d)[0x7fd2c94be69d]
/usr/lib64/libglusterfs.so.0[0x30a1062387]
/usr/sbin/glusterd(main+0x6c7)[0x4069d7]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x309fc1ecdd]
/usr/sbin/glusterd[0x404619]
---------

I had performed rebalance and remove-brick a couple of times, then restarted glusterd.

Version-Release number of selected component (if applicable):
glusterfs 3.4.0.49rhs

How reproducible:
Seen it once.

Steps to Reproduce:
1. Perform rebalance and remove-brick a few times.
2. Restart glusterd a few times.

Actual results:
glusterd crashed.

Expected results:
glusterd should not crash.

Additional info:
Created attachment 837263 [details] core
Shruti, this issue is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=1024316, which has the BLOCKER flag set.
I am not working on the glusterd component; I just came across this and loaded the core.

Quick core analysis:

(gdb) where
#0  synctask_yield (task=0x0) at syncop.c:247
#1  0x00007fd2cae55f45 in gd_stop_rebalance_process (volinfo=0x15a28c0) at glusterd-utils.c:9102
#2  0x00007fd2cae5d368 in gd_check_and_update_rebalance_info (old_volinfo=0x15a28c0, new_volinfo=0x15b1c40) at glusterd-utils.c:3241
#3  0x00007fd2cae6b40f in glusterd_import_friend_volume (vols=0x7fd2cd0cbef0, count=2) at glusterd-utils.c:3287
#4  0x00007fd2cae6b536 in glusterd_import_friend_volumes (vols=0x7fd2cd0cbef0) at glusterd-utils.c:3327
#5  0x00007fd2cae6b752 in glusterd_compare_friend_data (vols=0x7fd2cd0cbef0, status=0x7fffea9c40ec, hostname=0x15a3ac0 "10.70.37.169") at glusterd-utils.c:3471
#6  0x00007fd2cae478ac in glusterd_ac_handle_friend_add_req (event=<value optimized out>, ctx=0x160f7a0) at glusterd-sm.c:654
#7  0x00007fd2cae47f2e in glusterd_friend_sm () at glusterd-sm.c:1026
#8  0x00007fd2cae4672e in __glusterd_handle_incoming_friend_req (req=0x7fd2c96c702c) at glusterd-handler.c:2043
#9  0x00007fd2cae3619f in glusterd_big_locked_handler (req=0x7fd2c96c702c, actor_fn=0x7fd2cae46430 <__glusterd_handle_incoming_friend_req>) at glusterd-handler.c:77
#10 0x00000030a1809585 in rpcsvc_handle_rpc_call (svc=<value optimized out>, trans=<value optimized out>, msg=0x15cdf70) at rpcsvc.c:629
#11 0x00000030a18097c3 in rpcsvc_notify (trans=0x15d07f0, mydata=<value optimized out>, event=<value optimized out>, data=0x15cdf70) at rpcsvc.c:723
#12 0x00000030a180adf8 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#13 0x00007fd2c94bcd86 in socket_event_poll_in (this=0x15d07f0) at socket.c:2119
#14 0x00007fd2c94be69d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x15d07f0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2229
#15 0x00000030a1062387 in event_dispatch_epoll_handler (event_pool=0x1584ee0) at event-epoll.c:384
#16 event_dispatch_epoll (event_pool=0x1584ee0) at event-epoll.c:445
#17 0x00000000004069d7 in main (argc=2, argv=0x7fffea9c5ed8) at glusterfsd.c:2050

synctask_yield() segfaults on a NULL pointer dereference: task is 0x0. It looks like synctask_get() in the GD_SYNCOP() macro returned NULL. A glusterd expert can say what causes the synctask to be NULL; validating the inputs and logging would be much better than assuming a valid task and crashing.

-Santosh
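For illustration only, here is a minimal sketch of the defensive style suggested above, not the actual fix posted for this bug. synctask_get() returns the synctask bound to the calling thread, or NULL when the caller is not running inside a syncenv; per frames #10-#16 above, the friend handler ran on the epoll dispatch thread, which would explain the NULL. The helper name guarded_yield is hypothetical.

    /* Sketch: guard the yield instead of dereferencing a NULL synctask. */

    #include <stdio.h>

    struct synctask;

    /* Declarations as in libglusterfs (simplified). */
    extern struct synctask *synctask_get (void);
    extern void synctask_yield (struct synctask *task);

    static int
    guarded_yield (void)
    {
            struct synctask *task = synctask_get ();

            if (!task) {
                    /* Validate and log instead of assuming and crashing. */
                    fprintf (stderr, "no synctask bound to this thread; "
                             "refusing to yield\n");
                    return -1;
            }

            synctask_yield (task);
            return 0;
    }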
Saw another crash later on the same server -

pending frames:
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-12-16 22:13:49
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.49rhs
/lib64/libc.so.6[0x309fc32960]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(__glusterd_defrag_notify+0x1d0)[0x7fd916e095d0]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_notify+0x60)[0x7fd916db93c0]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x109)[0x30a180f539]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x30a180adf8]
/usr/lib64/glusterfs/3.4.0.49rhs/rpc-transport/socket.so(+0x557c)[0x7fd91543c57c]
/usr/lib64/glusterfs/3.4.0.49rhs/rpc-transport/socket.so(+0xa5b8)[0x7fd9154415b8]
/usr/lib64/libglusterfs.so.0[0x30a1062387]
/usr/sbin/glusterd(main+0x6c7)[0x4069d7]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x309fc1ecdd]
/usr/sbin/glusterd[0x404619]
---------
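This second crash is in the rebalance (defrag) RPC notify path. Without analyzing the new core it is hard to say more, but it looks like the same class of bug: a notify callback dereferencing state that may already have been torn down. Below is a hedged sketch of the defensive pattern, with purely illustrative type and field names (my_defrag_t, my_volinfo_t are not glusterd's actual structures): re-check that the defrag context still exists before touching it, since the rebalance process may have been stopped and its context freed.

    /* Hypothetical notify callback: bail out if the context is gone,
     * and serialize state changes under the context's own lock. */

    #include <pthread.h>
    #include <stddef.h>

    typedef struct {
            pthread_mutex_t lock;
            int             connected;
    } my_defrag_t;

    typedef struct {
            my_defrag_t *defrag;    /* may be freed/NULLed on cleanup */
    } my_volinfo_t;

    static int
    my_defrag_notify (my_volinfo_t *volinfo, int event)
    {
            my_defrag_t *defrag = NULL;

            if (!volinfo || !volinfo->defrag)
                    return -1;      /* context already gone: nothing to do */

            defrag = volinfo->defrag;

            pthread_mutex_lock (&defrag->lock);
            {
                    if (event == 1 /* e.g. a DISCONNECT event */)
                            defrag->connected = 0;
            }
            pthread_mutex_unlock (&defrag->lock);

            return 0;
    }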
Saw another crash, find core attached.
Created attachment 837656 [details] core - new
Patch posted for review at https://code.engineering.redhat.com/gerrit/17693
As per discussion with Krishnan, the simplified steps to reproduce the problem are:

Scenario 1
----------
1. Create a distributed-replicate volume.
2. Run rebalance and remove-brick on the volume.
3. Stop the volume and delete the volume.
4. Run some gluster commands.

Result:
-------
No crash in glusterd.

Scenario 2
----------
1. Create a distributed volume using a 2-node cluster.
2. Add a brick and run rebalance on the volume.
3. Bring down one of the nodes.
4. While the node is down, run a volume set command from the other node.
5. After the node comes back up, run some gluster commands.

Result:
-------
No crash in glusterd.

Verified on 3.4.0.54rhs-2.el6rhs.x86_64.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0208.html