Bug 765407 (GLUSTER-3675)

Summary: [glusterfs-3.3.0qa13]: glustershd crashed due to race in stack_destroy, unlock
Product: [Community] GlusterFS Reporter: Raghavendra Bhat <rabhat>
Component: replicateAssignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: medium    
Version: pre-releaseCC: gluster-bugs, vijay
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Raghavendra Bhat 2011-10-01 13:04:45 UTC
glustershd crashed because of a race present in stack_destroy and unlock in which case by the time the reply has come for the unlock op, the frame would have been destroyed.

This is the backtrace.


Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /etc/'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002aaaacf1ff53 in afr_unlock_inodelk_cbk (frame=0x2aadaa8ad854, cookie=0x1, this=0xa608a50, op_ret=0, op_errno=0)
    at ../../../../../xlators/cluster/afr/src/afr-lk-common.c:571
571             int_lock->inode_locked_nodes[child_index] &= LOCKED_NO;
(gdb) bt
#0  0x00002aaaacf1ff53 in afr_unlock_inodelk_cbk (frame=0x2aadaa8ad854, cookie=0x1, this=0xa608a50, op_ret=0, op_errno=0)
    at ../../../../../xlators/cluster/afr/src/afr-lk-common.c:571
#1  0x00002aaaaccb54e0 in client3_1_inodelk_cbk (req=0x2aaaad36b520, iov=0x2aaaad36b560, count=1, myframe=0x2aadaa616070)
    at ../../../../../xlators/protocol/client/src/client3_1-fops.c:1237
#2  0x00002aada996825e in rpc_clnt_handle_reply (clnt=0xa60ec80, pollin=0x2aaab804f6e0) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:789
#3  0x00002aada9968586 in rpc_clnt_notify (trans=0xa60efb0, mydata=0xa60ecb0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x2aaab804f6e0)
    at ../../../../rpc/rpc-lib/src/rpc-clnt.c:902
#4  0x00002aada99649f3 in rpc_transport_notify (this=0xa60efb0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x2aaab804f6e0)
    at ../../../../rpc/rpc-lib/src/rpc-transport.c:498
#5  0x00002aaaaab5cee3 in socket_event_poll_in (this=0xa60efb0) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1675
#6  0x00002aaaaab5d425 in socket_event_handler (fd=31, idx=23, data=0xa60efb0, poll_in=1, poll_out=0, poll_err=0)
    at ../../../../../rpc/rpc-transport/socket/src/socket.c:1790
#7  0x00002aada970fb40 in event_dispatch_epoll_handler (event_pool=0xa5fb960, events=0xa6003f0, i=0) at ../../../libglusterfs/src/event.c:794
#8  0x00002aada970fd45 in event_dispatch_epoll (event_pool=0xa5fb960) at ../../../libglusterfs/src/event.c:856
#9  0x00002aada971009f in event_dispatch (event_pool=0xa5fb960) at ../../../libglusterfs/src/event.c:956
#10 0x000000000040784d in main (argc=11, argv=0x7fffe64c4ef8) at ../../../glusterfsd/src/glusterfsd.c:1592
(gdb) info thr
  4 Thread 25842  0x00000030b960e838 in do_sigwait () from /lib64/libpthread.so.0
  3 Thread 25843  0x00000030b960aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 Thread 25844  0x00000030b8e9a541 in nanosleep () from /lib64/libc.so.6
* 1 Thread 25841  0x00002aaaacf1ff53 in afr_unlock_inodelk_cbk (frame=0x2aadaa8ad854, cookie=0x1, this=0xa608a50, op_ret=0, op_errno=0)
    at ../../../../../xlators/cluster/afr/src/afr-lk-common.c:571
(gdb) f 0
#0  0x00002aaaacf1ff53 in afr_unlock_inodelk_cbk (frame=0x2aadaa8ad854, cookie=0x1, this=0xa608a50, op_ret=0, op_errno=0)
    at ../../../../../xlators/cluster/afr/src/afr-lk-common.c:571
571             int_lock->inode_locked_nodes[child_index] &= LOCKED_NO;
(gdb) p int_lock
$1 = (afr_internal_lock_t *) 0x108
(gdb) p local
$2 = (afr_local_t *) 0x0
(gdb) 


These are the logs:

[2011-10-01 07:05:16.831656] I [afr-common.c:1083:afr_detect_self_heal_by_iatt] 0-mirror-replicate-0: size differs for /run24646/fileop_L1_2/f
ileop_L1_2_L2_5/fileop_file_2_5_11 
[2011-10-01 07:05:16.831684] I [afr-common.c:1234:afr_launch_self_heal] 0-mirror-replicate-0: background  meta-data data self-heal triggered. 
path: /run24646/fileop_L1_2/fileop_L1_2_L2_5/fileop_file_2_5_11, reason: lookup detected pending operations
[2011-10-01 07:05:16.836310] I [afr-self-heal-algorithm.c:126:sh_loop_driver_done] 0-mirror-replicate-0: full self-heal completed on /run24646
/fileop_L1_2/fileop_L1_2_L2_5/fileop_file_2_5_8
[2011-10-01 07:05:17.331482] I [afr-self-heal-common.c:2055:afr_self_heal_completion_cbk] 0-mirror-replicate-0: background  meta-data data sel
f-heal completed on /run24646/fileop_L1_2/fileop_L1_2_L2_5/fileop_file_2_5_8
[2011-10-01 07:05:17.717995] W [socket.c:1510:__socket_proto_state_machine] 0-mirror-client-0: reading from socket failed. Error (Transport en
dpoint is not connected), peer (10.1.11.73:24009)
[2011-10-01 07:05:17.718481] E [rpc-clnt.c:380:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x158) [0x2aada9968455] (
-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x101) [0x2aada9967a03] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1c
) [0x2aada996752f]))) 0-mirror-client-0: forced unwinding frame type(GlusterFS 3.1) op(INODELK(29)) called at 2011-10-01 07:05:17.332284
[2011-10-01 07:05:17.718515] W [client3_1-fops.c:1235:client3_1_inodelk_cbk] 0-mirror-client-0: remote operation failed: Transport endpoint is
 not connected
[2011-10-01 07:05:17.718603] I [client.c:1885:client_rpc_notify] 0-mirror-client-0: disconnected
[2011-10-01 07:05:17.724161] W [client3_1-fops.c:4649:client3_1_inodelk] 0-mirror-client-0: failed to send the fop: Transport endpoint is not 
connected
[2011-10-01 07:05:17.744728] W [client3_1-fops.c:4369:client3_1_fxattrop] 0-mirror-client-0: (00000000-0000-0000-0000-000000000000): failed to
 get fd ctx. EBADFD
[2011-10-01 07:05:17.744751] W [client3_1-fops.c:4435:client3_1_fxattrop] 0-mirror-client-0: failed to send the fop: File descriptor in bad st
ate
[2011-10-01 07:05:17.752388] W [client3_1-fops.c:3702:client3_1_fstat] 0-mirror-client-0: (00000000-0000-0000-0000-000000000000): failed to ge
t fd ctx. EBADFD
[2011-10-01 07:05:17.752410] W [client3_1-fops.c:3716:client3_1_fstat] 0-mirror-client-0: failed to send the fop: File descriptor in bad state
[2011-10-01 07:05:17.754059] I [syncop.c:89:synctask_wake] 0-: SYNCOP: task: 0xa617a40, task->env: 0xa5fc530
[2011-10-01 07:05:17.754098] W [client3_1-fops.c:2761:client3_1_ftruncate] 0-mirror-client-0: (00000000-0000-0000-0000-000000000000): failed t
o get fd ctx. EBADFD
[2011-10-01 07:05:17.754115] W [client3_1-fops.c:2777:client3_1_ftruncate] 0-mirror-client-0: failed to send the fop: File descriptor in bad s
tate
[2011-10-01 07:05:17.754130] I [afr-self-heal-data.c:571:afr_sh_data_trim_cbk] 0-mirror-replicate-0: ftruncate of /run24646/fileop_L1_2/fileop
_L1_2_L2_5/fileop_file_2_5_11 on subvolume mirror-client-0 failed (File descriptor in bad state)
[2011-10-01 07:05:17.754159] W [client3_1-fops.c:4649:client3_1_inodelk] 0-mirror-client-0: failed to send the fop: Transport endpoint is not 
connected
[2011-10-01 07:05:17.754210] I [syncop.c:243:syncenv_processor] 0-: SYNCOP: task: 0xa617a40, env: 0xa5fc530, task->env: 0xa5fc530
[2011-10-01 07:05:17.782468] I [syncop.c:89:synctask_wake] 0-: SYNCOP: task: 0xa617a40, task->env: 0xa5fc530
[2011-10-01 07:05:17.782505] I [syncop.c:243:syncenv_processor] 0-: SYNCOP: task: 0xa617a40, env: 0xa5fc530, task->env: 0xa5fc530
[2011-10-01 07:05:17.782928] W [client3_1-fops.c:4649:client3_1_inodelk] 0-mirror-client-0: failed to send the fop: Transport endpoint is not 
connected
[2011-10-01 07:05:18.304095] I [syncop.c:89:synctask_wake] 0-: SYNCOP: task: 0xa617a40, task->env: 0xa5fc530
[2011-10-01 07:05:18.304137] I [syncop.c:243:syncenv_processor] 0-: SYNCOP: task: 0xa617a40, env: 0xa5fc530, task->env: 0xa5fc530
[2011-10-01 07:05:18.305680] I [syncop.c:89:synctask_wake] 0-: SYNCOP: task: 0xa617a40, task->env: 0xa5fc530
[2011-10-01 07:05:18.305717] I [syncop.c:243:syncenv_processor] 0-: SYNCOP: task: 0xa617a40, env: 0xa5fc530, task->env: 0xa5fc530
[2011-10-01 07:05:18.306006] W [client3_1-fops.c:3572:client3_1_writev] 0-mirror-client-0: (00000000-0000-0000-0000-000000000000): failed to get fd ctx. EBADFD
[2011-10-01 07:05:18.306030] W [client3_1-fops.c:3588:client3_1_writev] 0-mirror-client-0: failed to send the fop: File descriptor in bad state
[2011-10-01 07:05:18.306045] E [afr-self-heal-algorithm.c:428:sh_loop_write_cbk] 0-mirror-replicate-0: write to /run24646/fileop_L1_2/fileop_L1_2_L2_5/fileop_file_2_5_11 failed on subvolume mirror-client-0 (File descriptor in bad state)
[2011-10-01 07:05:18.306085] I [afr-self-heal-algorithm.c:119:sh_loop_driver_done] 0-mirror-replicate-0: self-heal aborting on /run24646/fileop_L1_2/fileop_L1_2_L2_5/fileop_file_2_5_11
[2011-10-01 07:05:18.306122] W [client3_1-fops.c:3612:client3_1_flush] 0-mirror-client-0: (00000000-0000-0000-0000-000000000000): failed to get fd ctx. EBADFD
[2011-10-01 07:05:18.306140] W [client3_1-fops.c:3639:client3_1_flush] 0-mirror-client-0: failed to send the fop: File descriptor in bad state
[2011-10-01 07:05:18.306154] I [afr-self-heal-data.c:107:afr_sh_data_flush_cbk] 0-mirror-replicate-0: flush failed on /run24646/fileop_L1_2/fileop_L1_2_L2_5/fileop_file_2_5_11 on subvolume mirror-client-0: File descriptor in bad state
[2011-10-01 07:05:18.306243] I [syncop.c:89:synctask_wake] 0-: SYNCOP: task: 0xa617a40, task->env: 0xa5fc530
[2011-10-01 07:05:18.306279] I [syncop.c:243:syncenv_processor] 0-: SYNCOP: task: 0xa617a40, env: 0xa5fc530, task->env: 0xa5fc530
[2011-10-01 07:05:18.307465] E [afr-self-heal-common.c:2052:afr_self_heal_completion_cbk] 0-mirror-replicate-0: background  meta-data data self-heal failed on /run24646/fileop_L1_2/fileop_L1_2_L2_5/fileop_file_2_5_11
[2011-10-01 07:05:18.307668] I [syncop.c:89:synctask_wake] 0-: SYNCOP: task: 0xa617a40, task->env: 0xa5fc530
pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2011-10-01 07:05:18
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.3.0qa13
/lib64/libc.so.6[0x30b8e302d0]
/usr/local/lib/glusterfs/3.3.0qa13/xlator/cluster/replicate.so[0x2aaaacf1ff53]
/usr/local/lib/glusterfs/3.3.0qa13/xlator/protocol/client.so(client3_1_inodelk_cbk+0x27d)[0x2aaaaccb54e0]

Comment 1 Anand Avati 2011-11-23 14:56:34 UTC
CHANGE: http://review.gluster.com/735 (Change-Id: I0d87f06f989b2d4b971967c52d4898331693a801) merged in master by Vijay Bellur (vijay)