Description of problem:

Core was generated by `/root/330/inst/sbin/glusterfs -f /etc/glusterd/nfs/nfs-server.vol -p /etc/glust'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f5d051453c7 in __list_splice (list=0x7f5ce0014908, head=0x7f5ce00148a8) at ../../../libglusterfs/src/list.h:106
106         (head->next)->prev = (list->prev);
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64 libgcc-4.4.6-3.el6.x86_64
(gdb) bt
#0  0x00007f5d051453c7 in __list_splice (list=0x7f5ce0014908, head=0x7f5ce00148a8) at ../../../libglusterfs/src/list.h:106
#1  0x00007f5d0514541d in list_splice_init (list=0x7f5ce0014908, head=0x7f5ce00148a8) at ../../../libglusterfs/src/list.h:130
#2  0x00007f5d051467ae in saved_frames_unwind (saved_frames=0x7f5ce00148a0) at rpc-clnt.c:360
#3  0x00007f5d05146aa3 in saved_frames_destroy (frames=0x7f5ce00148a0) at rpc-clnt.c:405
#4  0x00007f5d051495d5 in rpc_clnt_destroy (rpc=0x7f5cdc000fd0) at rpc-clnt.c:1578
#5  0x00007f5d05149698 in rpc_clnt_unref (rpc=0x7f5cdc000fd0) at rpc-clnt.c:1604
#6  0x00007f5d00caebee in nlm_set_rpc_clnt (rpc_clnt=0x7f5ce0000fd0, caller_name=0x1cf4930 "RHSSA1") at nlm4.c:319
#7  0x00007f5d00cb0e6d in nlm4_establish_callback (csarg=0x7f5cff1085d4) at nlm4.c:945
#8  0x0000003c5be077f1 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003c5bae592d in clone () from /lib64/libc.so.6
(gdb) quit

Version-Release number of selected component (if applicable):
3.3.0qa27

How reproducible:
happened once

Steps to Reproduce:
1. create a distribute-replicate volume
2. nfs mount
3. create 100 files
4. start taking locks (shared locks) on them
5. in the meantime, start add-brick and rebalance
6. also try to take an exclusive lock on one of the files

Actual results:
crash is seen during add-brick/rebalance

Expected results:
1. the crash should not have happened
2. the lock in step 6 should not be granted until the lock already held on that file is released

Additional info:

nfs.log information

[2012-03-12 09:26:56.641761] I [afr-common.c:1850:afr_set_root_inode_on_first_lookup] 0-dist-rep-replicate-0: added root inode
[2012-03-12 09:26:56.642783] I [afr-common.c:1850:afr_set_root_inode_on_first_lookup] 0-dist-rep-replicate-2: added root inode
[2012-03-12 09:26:56.642861] I [afr-common.c:1850:afr_set_root_inode_on_first_lookup] 0-dist-rep-replicate-1: added root inode
[2012-03-12 09:31:32.800115] E [rpc-clnt.c:382:saved_frames_unwind] (-->/root/330/inst/lib/libgfrpc.so.0(+0x135b2) [0x7f8e5b4bc5b2] (-->/root/330/inst/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x155) [0x7f8e5b4ba05d] (-->/root/330/inst/lib/libgfrpc.so.0(saved_frames_destroy+0x1f) [0x7f8e5b4b9aa3]))) 0-NLM-client: forced unwinding frame type(NLMv4) op(GRANTED(5)) called at 2012-03-12 09:31:32.799004 (xid=0x4x)
[2012-03-12 09:31:32.800206] I [mem-pool.c:578:mem_pool_destroy] 0-nfs-server: size=2236 max=2 total=4
[2012-03-12 09:31:32.800227] I [mem-pool.c:578:mem_pool_destroy] 0-nfs-server: size=124 max=2 total=4
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-03-12 09:31:32
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0qa27
/lib64/libc.so.6[0x3c5ba32900]
/root/330/inst/lib/libgfrpc.so.0(+0xf3c7)[0x7f8e5b4b83c7]
/root/330/inst/lib/libgfrpc.so.0(+0xf41d)[0x7f8e5b4b841d]
/root/330/inst/lib/libgfrpc.so.0(saved_frames_unwind+0x88)[0x7f8e5b4b97ae]
/root/330/inst/lib/libgfrpc.so.0(saved_frames_destroy+0x1f)[0x7f8e5b4b9aa3]
/root/330/inst/lib/libgfrpc.so.0(+0x135d5)[0x7f8e5b4bc5d5]
/root/330/inst/lib/libgfrpc.so.0(rpc_clnt_unref+0x6f)[0x7f8e5b4bc698]
/root/330/inst/lib/glusterfs/3.3.0qa27/xlator/nfs/server.so(nlm_set_rpc_clnt+0x221)[0x7f8e57021bee]
/root/330/inst/lib/glusterfs/3.3.0qa27/xlator/nfs/server.so(nlm4_establish_callback+0x5a0)[0x7f8e57023e6d]
/lib64/libpthread.so.0[0x3c5be077f1]
/lib64/libc.so.6(clone+0x6d)[0x3c5bae592d]
---------
rpc_clnt_connection_cleanup(), which is called during unref(), calls saved_frames_destroy(), which in turn does a ref() and an unref() on the rpc_clnt. Because of this, the rpc_clnt gets destroyed a second time, causing memory corruption. I will send a patch that calls rpc_clnt_connection_cleanup() before unref()ing the rpc_clnt. This prevents the memory corruption/crash, but it is still a kludgy approach.
*** Bug 804489 has been marked as a duplicate of this bug. ***
CHANGE: http://review.gluster.com/2979 (rpc-clnt: separate out connection_cleanup() from destroy()) merged in master by Vijay Bellur (vijay)