Bug 763235 (GLUSTER-1503)

Summary: segfault in distribute during failover testing
Product: [Community] GlusterFS
Reporter: Harshavardhana <fharshav>
Component: protocol
Assignee: Shehjar Tikoo <shehjart>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: urgent
Version: nfs-alpha
CC: cww, gluster-bugs, shehjart
Hardware: All
OS: Linux
Doc Type: Bug Fix
Regression: RTP
Mount Type: nfs

Description Harshavardhana 2010-09-01 23:38:46 UTC
The NFS volume config is the standard config generated by volgen, with "namelookup" turned off in NFS/server.

Backtrace

===================

Program terminated with signal 11, Segmentation fault.
#0  dht_iatt_merge (this=0x1376d30, to=0x7fc3a01a9f50, from=0x13b90, subvol=0x136c5f0) at dht-helper.c:349
349             to->ia_dev      = from->ia_dev;
Missing separate debuginfos, use: debuginfo-install glibc-2.10.2-1.x86_64 libgcc-4.4.1-2.fc11.x86_64
(gdb) bt
#0  dht_iatt_merge (this=0x1376d30, to=0x7fc3a01a9f50, from=0x13b90, subvol=0x136c5f0) at dht-helper.c:349
#1  0x00007fc3a6457634 in dht_selfheal_dir_mkdir_cbk (frame=0x7fc3a01abc50, cookie=0x14b3480, this=0x1376d30, op_ret=<value optimized out>, 
    op_errno=<value optimized out>, inode=<value optimized out>, stbuf=0x0, preparent=0x3df2b69e80, postparent=0x13b90) at dht-selfheal.c:220
#2  0x00007fc3a668c881 in client_mkdir (frame=0x14b3480, this=<value optimized out>, loc=0x7fc3a01a9ce8, mode=<value optimized out>)
    at client-protocol.c:1062
#3  0x00007fc3a6457283 in dht_selfheal_dir_mkdir (frame=0x7fc3a01abc50, loc=<value optimized out>, layout=0x7fc3a01a8870, 
    force=<value optimized out>) at dht-selfheal.c:271
#4  0x00007fc3a64573b2 in dht_selfheal_restore (frame=0x7fc3a01abc50, dir_cbk=<value optimized out>, loc=0x7fc3a01a9ce8, 
    layout=0x7fc3a01a8870) at dht-selfheal.c:533
#5  0x00007fc3a6464b56 in dht_rmdir_cbk (frame=0x7fc3a01abc50, cookie=0x14b1ce0, this=0x1376d30, op_ret=0, op_errno=<value optimized out>, 
    preparent=<value optimized out>, postparent=0x7fff0169e390) at dht-common.c:3380
#6  0x00007fc3a6693d42 in client_rmdir_cbk (frame=0x14b1ce0, hdr=<value optimized out>, hdrlen=<value optimized out>, 
    iobuf=<value optimized out>) at client-protocol.c:4625
#7  0x00007fc3a667e70a in protocol_client_pollin (this=0x136b800, trans=0x1384550) at client-protocol.c:6435
#8  0x00007fc3a6684d28 in notify (this=0x1376d30, event=<value optimized out>, data=0x1384550) at client-protocol.c:6554
#9  0x0000003df2c14863 in xlator_notify (xl=0x136b800, event=2, data=0x1384550) at xlator.c:919
#10 0x00007fc3a52071d8 in socket_event_handler (fd=<value optimized out>, idx=20, data=0x1384550, poll_in=1, poll_out=0, 
    poll_err=<value optimized out>) at socket.c:831
#11 0x0000003df2c303cd in event_dispatch_epoll_handler (i=<value optimized out>, events=<value optimized out>, 
    event_pool=<value optimized out>) at event.c:804
#12 event_dispatch_epoll (i=<value optimized out>, events=<value optimized out>, event_pool=<value optimized out>) at event.c:867
#13 0x0000000000404272 in main (argc=<value optimized out>, argv=<value optimized out>) at glusterfsd.c:1494

=============================

Comment 1 Harshavardhana 2010-09-02 02:37:32 UTC
[2010-09-02 02:24:47] E [socket.c:762:socket_connect_finish] 10.1.100.31-3: connection to 10.1.100.31:9697 failed (Connection refused)
[2010-09-02 02:24:47] E [socket.c:762:socket_connect_finish] 10.1.100.31-2: connection to 10.1.100.31:9697 failed (Connection refused)
[2010-09-02 02:24:47] E [socket.c:762:socket_connect_finish] 10.1.100.31-2: connection to 10.1.100.31:9697 failed (Connection refused)
[2010-09-02 02:24:47] E [socket.c:762:socket_connect_finish] 10.1.100.31-4: connection to 10.1.100.31:9697 failed (Connection refused)
[2010-09-02 02:24:47] E [socket.c:762:socket_connect_finish] 10.1.100.31-4: connection to 10.1.100.31:9697 failed (Connection refused)
[2010-09-02 02:24:47] E [socket.c:762:socket_connect_finish] 10.1.100.31-5: connection to 10.1.100.31:9697 failed (Connection refused)
[2010-09-02 02:24:47] E [socket.c:762:socket_connect_finish] 10.1.100.31-6: connection to 10.1.100.31:9697 failed (Connection refused)
[2010-09-02 02:24:47] E [socket.c:762:socket_connect_finish] 10.1.100.31-5: connection to 10.1.100.31:9697 failed (Connection refused)
[2010-09-02 02:24:47] E [socket.c:762:socket_connect_finish] 10.1.100.31-6: connection to 10.1.100.31:9697 failed (Connection refused)
pending frames:

patchset: v3.0.0-252-g17efe56
signal received: 11
time of crash: 2010-09-02 02:24:47
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs nfs_beta_rc11
/lib64/libc.so.6[0x3df28332f0]
/usr/lib64/glusterfs/nfs_beta_rc11/xlator/cluster/distribute.so(dht_iatt_merge+0x1d)[0x7fc3a64554ad]
/usr/lib64/glusterfs/nfs_beta_rc11/xlator/cluster/distribute.so(dht_selfheal_dir_mkdir_cbk+0x94)[0x7fc3a6457634]
/usr/lib64/glusterfs/nfs_beta_rc11/xlator/protocol/client.so(client_mkdir+0x261)[0x7fc3a668c881]
/usr/lib64/glusterfs/nfs_beta_rc11/xlator/cluster/distribute.so(dht_selfheal_dir_mkdir+0x293)[0x7fc3a6457283]
/usr/lib64/glusterfs/nfs_beta_rc11/xlator/cluster/distribute.so(dht_selfheal_restore+0x52)[0x7fc3a64573b2]
/usr/lib64/glusterfs/nfs_beta_rc11/xlator/cluster/distribute.so(dht_rmdir_cbk+0x3d6)[0x7fc3a6464b56]
/usr/lib64/glusterfs/nfs_beta_rc11/xlator/protocol/client.so(client_rmdir_cbk+0xe2)[0x7fc3a6693d42]
/usr/lib64/glusterfs/nfs_beta_rc11/xlator/protocol/client.so(protocol_client_pollin+0xca)[0x7fc3a667e70a]
/usr/lib64/glusterfs/nfs_beta_rc11/xlator/protocol/client.so(notify+0xe8)[0x7fc3a6684d28]
/usr/lib64/libglusterfs.so.0(xlator_notify+0x43)[0x3df2c14863]
/usr/lib64/glusterfs/nfs_beta_rc11/transport/socket.so(socket_event_handler+0xc8)[0x7fc3a52071d8]
/usr/lib64/libglusterfs.so.0[0x3df2c303cd]
glusterfsd(main+0x892)[0x404272]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3df281ea4d]
glusterfsd[0x4027e9]

Comment 2 Shehjar Tikoo 2010-09-02 03:11:04 UTC
Set the mount type to nfs so I can differentiate bugs found on NFS mounts from those found on FUSE mounts.

Set the version to nfs-alpha. We haven't released the NFS beta yet, only release candidates.

Comment 3 Shehjar Tikoo 2010-09-02 05:18:38 UTC
(In reply to comment #1)
....
....
....
> Program terminated with signal 11, Segmentation fault.
> #0  dht_iatt_merge (this=0x1376d30, to=0x7fc3a01a9f50, from=0x13b90,
> subvol=0x136c5f0) at dht-helper.c:349
> 349             to->ia_dev      = from->ia_dev;
> Missing separate debuginfos, use: debuginfo-install glibc-2.10.2-1.x86_64
> libgcc-4.4.1-2.fc11.x86_64
> (gdb) bt
> #0  dht_iatt_merge (this=0x1376d30, to=0x7fc3a01a9f50, from=0x13b90,
> subvol=0x136c5f0) at dht-helper.c:349
> #1  0x00007fc3a6457634 in dht_selfheal_dir_mkdir_cbk (frame=0x7fc3a01abc50,
> cookie=0x14b3480, this=0x1376d30, op_ret=<value optimized out>, 
>     op_errno=<value optimized out>, inode=<value optimized out>, stbuf=0x0,
> preparent=0x3df2b69e80, postparent=0x13b90) at dht-selfheal.c:220
> #2  0x00007fc3a668c881 in client_mkdir (frame=0x14b3480, this=<value optimized
> out>, loc=0x7fc3a01a9ce8, mode=<value optimized out>)
>     at client-protocol.c:1062

The STACK_UNWIND on a failed client_mkdir does not pass NULL for all of the mkdir_cbk arguments, so dht ends up with a garbage postparent pointer (frame #1 shows stbuf=0x0 but postparent=0x13b90).
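
To illustrate the point about the failure path, here is a minimal sketch in plain C of an unwind that passes NULL explicitly for every result pointer. It does not use the real GlusterFS macros or structures: `sketch_iatt`, `sketch_mkdir_cbk_t`, `sketch_record_cbk` and `sketch_client_mkdir` are all hypothetical stand-ins invented for this example, and the `connected` flag is an assumed simplification of the disconnected-subvolume case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a stat-like buffer; not the GlusterFS struct. */
struct sketch_iatt {
    unsigned long ia_dev;
    unsigned long ia_ino;
};

/* Hypothetical callback type, loosely modelled on a mkdir callback. */
typedef void (*sketch_mkdir_cbk_t)(void *cookie, int op_ret, int op_errno,
                                   void *inode,
                                   struct sketch_iatt *stbuf,
                                   struct sketch_iatt *preparent,
                                   struct sketch_iatt *postparent);

/* Records what the callback received so the caller can inspect it. */
struct sketch_result {
    int op_ret;
    int op_errno;
    void *inode;
    struct sketch_iatt *stbuf;
    struct sketch_iatt *preparent;
    struct sketch_iatt *postparent;
};

static void sketch_record_cbk(void *cookie, int op_ret, int op_errno,
                              void *inode,
                              struct sketch_iatt *stbuf,
                              struct sketch_iatt *preparent,
                              struct sketch_iatt *postparent)
{
    struct sketch_result *res = cookie;

    assert(res != NULL);
    res->op_ret = op_ret;
    res->op_errno = op_errno;
    res->inode = inode;
    res->stbuf = stbuf;
    res->preparent = preparent;
    res->postparent = postparent;
}

/* Hypothetical mkdir entry point: "connected" stands in for link state. */
static void sketch_client_mkdir(int connected, void *cookie,
                                sketch_mkdir_cbk_t cbk)
{
    static int inode_token;
    static struct sketch_iatt stbuf = { 7, 100 };
    static struct sketch_iatt pre = { 7, 1 };
    static struct sketch_iatt post = { 7, 1 };

    if (!connected) {
        /* Failure path: every result pointer is passed explicitly as NULL,
         * so the callback never sees an uninitialised value. */
        cbk(cookie, -1, ENOTCONN, NULL, NULL, NULL, NULL);
        return;
    }

    /* Success path: all buffers are valid for the duration of the call. */
    cbk(cookie, 0, 0, &inode_token, &stbuf, &pre, &post);
}
```

With this shape, a callback that receives op_ret == -1 sees only NULL pointers, which fails predictably if dereferenced instead of reading from an arbitrary address.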

Comment 4 Shehjar Tikoo 2010-09-02 09:30:55 UTC
Harsha,

There are two parts to the fix, one in dht and the other in proto/client. Both will be available in rc13 later today.

Only the dht patch is relevant for mainline. The bug in proto/client does not exist anymore in mainline code.

Comment 5 Vijay Bellur 2010-09-02 09:57:01 UTC
PATCH: http://patches.gluster.com/patch/4472 in master (cluster/dht: check for op_ret in dht_selfheal_dir_mkdir_cbk ())
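
The patch itself is not reproduced in this report. As a rough sketch of what "check for op_ret" in a callback can look like, the plain C below skips the merge whenever the operation failed, so the reply buffers are never read on the error path. The names `sketch_iatt`, `sketch_iatt_merge` and `sketch_selfheal_mkdir_cbk` are hypothetical stand-ins, not the GlusterFS code, and the extra NULL check is an assumption added for defensiveness.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a stat-like buffer; not the GlusterFS struct. */
struct sketch_iatt {
    unsigned long ia_dev;
    unsigned long ia_ino;
};

/* Copies fields from "from" into "to"; both must be valid pointers. */
static void sketch_iatt_merge(struct sketch_iatt *to,
                              const struct sketch_iatt *from)
{
    assert(to != NULL);
    assert(from != NULL);
    to->ia_dev = from->ia_dev;
    to->ia_ino = from->ia_ino;
}

/* Hypothetical callback: returns 1 if the reply was merged, 0 if skipped. */
static int sketch_selfheal_mkdir_cbk(struct sketch_iatt *local_postparent,
                                     int op_ret,
                                     const struct sketch_iatt *postparent)
{
    /* On failure the reply buffers carry no meaning, so op_ret is checked
     * first and the buffers are not touched at all. */
    if (op_ret == -1)
        return 0;

    /* Assumed extra guard: a successful reply without a buffer is skipped. */
    if (postparent == NULL)
        return 0;

    sketch_iatt_merge(local_postparent, postparent);
    return 1;
}
```

Checking op_ret before any dereference means the callback stays safe even if a lower layer unwinds a failure with stale or uninitialised pointers.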