Bug 802403

Summary: nfs-nlm: (crash)add-brick while locks are getting held
Product: [Community] GlusterFS
Reporter: Saurabh <saujain>
Component: nfs
Assignee: Amar Tumballi <amarts>
Severity: high
Docs Contact:
Priority: unspecified
Version: pre-release
CC: gluster-bugs, mzywusko, shwetha.h.panduranga, vbellur, vraman
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-07-24 14:01:10 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: 3.3.0qa33
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:
Bug Blocks: 817967

Description Saurabh 2012-03-12 09:36:41 EDT
Description of problem:

Core was generated by `/root/330/inst/sbin/glusterfs -f /etc/glusterd/nfs/nfs-server.vol -p /etc/glust'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f5d051453c7 in __list_splice (list=0x7f5ce0014908, head=0x7f5ce00148a8) at ../../../libglusterfs/src/list.h:106
106		(head->next)->prev = (list->prev);
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64 libgcc-4.4.6-3.el6.x86_64
(gdb) bt
#0  0x00007f5d051453c7 in __list_splice (list=0x7f5ce0014908, head=0x7f5ce00148a8) at ../../../libglusterfs/src/list.h:106
#1  0x00007f5d0514541d in list_splice_init (list=0x7f5ce0014908, head=0x7f5ce00148a8) at ../../../libglusterfs/src/list.h:130
#2  0x00007f5d051467ae in saved_frames_unwind (saved_frames=0x7f5ce00148a0) at rpc-clnt.c:360
#3  0x00007f5d05146aa3 in saved_frames_destroy (frames=0x7f5ce00148a0) at rpc-clnt.c:405
#4  0x00007f5d051495d5 in rpc_clnt_destroy (rpc=0x7f5cdc000fd0) at rpc-clnt.c:1578
#5  0x00007f5d05149698 in rpc_clnt_unref (rpc=0x7f5cdc000fd0) at rpc-clnt.c:1604
#6  0x00007f5d00caebee in nlm_set_rpc_clnt (rpc_clnt=0x7f5ce0000fd0, caller_name=0x1cf4930 "RHSSA1") at nlm4.c:319
#7  0x00007f5d00cb0e6d in nlm4_establish_callback (csarg=0x7f5cff1085d4) at nlm4.c:945
#8  0x0000003c5be077f1 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003c5bae592d in clone () from /lib64/libc.so.6
(gdb) quit

Version-Release number of selected component (if applicable):


How reproducible:
happened once

Steps to Reproduce:
1. Create a distribute-replicate volume.
2. Mount it over NFS.
3. Create 100 files.
4. Start taking shared locks on them.
5. In the meantime, start add-brick and rebalance.
6. Also try to take an exclusive lock on one of the files that already has a shared lock held.
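
For anyone trying to reproduce steps 4 and 6, the locks can be requested with POSIX fcntl() record locks, which an NFSv3 client forwards to the server over NLM. The following is only a minimal sketch: the helper name try_lock is hypothetical, and the tool actually used for this report is not stated here.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not part of GlusterFS): request a whole-file POSIX
 * lock without blocking. Pass F_RDLCK for a shared lock (step 4), F_WRLCK
 * for an exclusive lock (step 6), or F_UNLCK to release.
 * Returns 0 on success and -1 on failure, with errno set by fcntl(). */
int try_lock(int fd, short type)
{
    struct flock fl = {0};

    assert(fd >= 0);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0; /* a length of 0 covers the file up to its end */

    return fcntl(fd, F_SETLK, &fl);
}
```

When another process already holds a shared lock on the file, a non-blocking F_WRLCK request like this is expected to fail rather than be granted, which is the behaviour expected result 2 asks for. Locks taken by the same process do not conflict with each other, so the exclusive attempt has to come from a second process or client.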
Actual results:
The NFS server process crashes during add-brick/rebalance.

Expected results:
1. The crash should not happen.
2. The exclusive lock in step 6 should not be granted until the lock already held on that file is released.

Additional info:

nfs.log information

[2012-03-12 09:26:56.641761] I [afr-common.c:1850:afr_set_root_inode_on_first_lookup] 0-dist-rep-replicate-0: added root inode
[2012-03-12 09:26:56.642783] I [afr-common.c:1850:afr_set_root_inode_on_first_lookup] 0-dist-rep-replicate-2: added root inode
[2012-03-12 09:26:56.642861] I [afr-common.c:1850:afr_set_root_inode_on_first_lookup] 0-dist-rep-replicate-1: added root inode
[2012-03-12 09:31:32.800115] E [rpc-clnt.c:382:saved_frames_unwind] (-->/root/330/inst/lib/libgfrpc.so.0(+0x135b2) [0x7f8e5b4bc5b2] (-->/root/330/inst/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x155) [0x7f8e5b4ba05d] (-->/root/330/inst/lib/libgfrpc.so.0(saved_frames_destroy+0x1f) [0x7f8e5b4b9aa3]))) 0-NLM-client: forced unwinding frame type(NLMv4) op(GRANTED(5)) called at 2012-03-12 09:31:32.799004 (xid=0x4x)
[2012-03-12 09:31:32.800206] I [mem-pool.c:578:mem_pool_destroy] 0-nfs-server: size=2236 max=2 total=4
[2012-03-12 09:31:32.800227] I [mem-pool.c:578:mem_pool_destroy] 0-nfs-server: size=124 max=2 total=4
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-03-12 09:31:32
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0qa27
Comment 1 Krishna Srinivas 2012-03-14 06:44:06 EDT
rpc_clnt_connection_cleanup(), called during unref(), calls saved_frames_destroy(), which in turn does a ref() and unref() on the rpc_clnt. Because of this, the rpc_clnt gets destroyed a second time, causing memory corruption.

I will send a patch that calls rpc_clnt_connection_cleanup() before unref()ing the rpc_clnt. This prevents the memory corruption and crash, but it is still a kludgy approach.
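
The re-entrancy described above can be illustrated with a small stand-alone model. This is a sketch under assumptions, not the actual rpc-clnt code: struct client and every function name below are hypothetical stand-ins, and the real destroy path frees memory instead of incrementing a counter.

```c
#include <assert.h>

/* Hypothetical stand-in for rpc_clnt; not the real GlusterFS structure. */
struct client {
    int refcount;
    int pending_frames; /* saved frames still waiting to be unwound */
    int destroy_calls;  /* how often destroy ran; anything above 1 is a bug */
};

void client_ref(struct client *c)
{
    assert(c->refcount >= 0);
    c->refcount++;
}

/* Unwinding a saved frame takes and then drops a reference on the client. */
void unwind_frames(struct client *c, void (*unref)(struct client *))
{
    while (c->pending_frames > 0) {
        c->pending_frames--;
        client_ref(c);
        unref(c);
    }
}

void client_unref_buggy(struct client *c);

/* Buggy shape: destroy unwinds frames after the refcount has reached 0. */
void client_destroy_buggy(struct client *c)
{
    unwind_frames(c, client_unref_buggy); /* ref 0->1, unref 1->0: re-enters destroy */
    c->destroy_calls++;
}

void client_unref_buggy(struct client *c)
{
    if (--c->refcount == 0)
        client_destroy_buggy(c);
}

/* Proposed shape: destroy no longer unwinds frames itself. */
void client_unref_fixed(struct client *c)
{
    if (--c->refcount == 0)
        c->destroy_calls++;
}

/* Cleanup runs while the caller still holds its reference. */
void connection_cleanup(struct client *c)
{
    unwind_frames(c, client_unref_fixed); /* ref 1->2, unref 2->1: no destroy */
}

int run_buggy(void)
{
    struct client c = {1, 1, 0};
    client_unref_buggy(&c);
    return c.destroy_calls;
}

int run_fixed(void)
{
    struct client c = {1, 1, 0};
    connection_cleanup(&c);
    client_unref_fixed(&c);
    return c.destroy_calls;
}
```

In this model run_buggy() reports two destroy calls for a single object, which corresponds to the double destroy, while run_fixed() reports one because the frames are unwound before the last reference is dropped.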
Comment 2 Shwetha Panduranga 2012-03-19 08:16:16 EDT
*** Bug 804489 has been marked as a duplicate of this bug. ***
Comment 3 Anand Avati 2012-03-19 12:14:46 EDT
CHANGE: http://review.gluster.com/2979 (rpc-clnt: separate out connection_cleanup() from destroy()) merged in master by Vijay Bellur (vijay@gluster.com)