Bug 762601 (GLUSTER-869)

Summary: replicate crash in selfheal
Product: [Community] GlusterFS Reporter: Harshavardhana <fharshav>
Component: replicateAssignee: Vikas Gorur <vikas>
Status: CLOSED DUPLICATE QA Contact:
Severity: high Docs Contact:
Priority: urgent    
Version: mainlineCC: amarts, chida, cww, gluster-bugs, rabhat, shehjart
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Windows   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Harshavardhana 2010-04-28 22:26:36 UTC
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
Core was generated by `/usr/local/sbin/glusterfs --log-level=NORMAL --volume-name=iocache --volfile=/u'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fedee0a6005 in afr_sh_data_fix (frame=0x59316e8, this=0x616d00) at afr-self-heal-data.c:750
750             orig_local->cont.lookup.xattr = dict_ref (orig_local->cont.lookup.xattrs[sh->source]);
(gdb) p orig_local->cont.lookup.xattrs
$1 = (dict_t **) 0x0
(gdb) bt
#0  0x00007fedee0a6005 in afr_sh_data_fix (frame=0x59316e8, this=0x616d00) at afr-self-heal-data.c:750
#1  0x00007fedee0a6292 in afr_sh_data_fstat_cbk (frame=0x59316e8, cookie=0x0, this=0x616d00, op_ret=<value optimized out>, op_errno=<value optimized out>,
    buf=0x7fffc983cc70) at afr-self-heal-data.c:828
#2  0x00007fedee2de93a in client_fstat_cbk (frame=0x57cf530, hdr=<value optimized out>, hdrlen=<value optimized out>, iobuf=<value optimized out>)
    at client-protocol.c:4344
#3  0x00007fedee2ca4da in protocol_client_pollin (this=0x613f40, trans=0x61f500) at client-protocol.c:6827
#4  0x00007fedee2d9c90 in notify (this=0x0, event=<value optimized out>, data=0x61f500) at client-protocol.c:6946
#5  0x00007fedef471493 in xlator_notify (xl=0x613f40, event=2, data=0x61f500) at xlator.c:924
#6  0x00007fedec801690 in socket_event_handler (fd=<value optimized out>, idx=32, data=0x61f500, poll_in=1, poll_out=0, poll_err=0) at socket.c:831
#7  0x00007fedef48bdbf in event_dispatch_epoll_handler (event_pool=0x6094f0) at event.c:804
#8  event_dispatch_epoll (event_pool=0x6094f0) at event.c:867
#9  0x000000000040451d in main (argc=5, argv=0x7fffc983dbe8) at glusterfsd.c:1425
(gdb) p orig_local->cont.lookup
$2 = {inode = 0x0, buf = {st_dev = 0, st_ino = 0, st_nlink = 0, st_mode = 0, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 121047,
    st_blksize = 0, st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {tv_sec = 0, tv_nsec = 0},
    __unused = {0, 0, 0}}, read_child_buf = {st_dev = 0, st_ino = 0, st_nlink = 0, st_mode = 0, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0,
    st_size = 0, st_blksize = 0, st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {tv_sec = 0,
      tv_nsec = 0}, __unused = {0, 0, 0}}, postparent = {st_dev = 0, st_ino = 0, st_nlink = 0, st_mode = 0, st_uid = 0, st_gid = 0, __pad0 = 0,
    st_rdev = 0, st_size = 0, st_blksize = 0, st_blocks = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {
      tv_sec = 0, tv_nsec = 0}, __unused = {0, 0, 0}}, ino = 0, gen = 0, parent_ino = 0, xattr = 0x0, xattrs = 0x0, is_revalidate = _gf_false}
(gdb)

Comment 1 Harshavardhana 2010-04-29 01:22:29 UTC
gluster006 was rebooted after which we see this crash


[2010-04-29 01:15:47] E [saved-frames.c:165:saved_frames_unwind] gluster006-3: forced unwinding frame type(1) op(LOOKUP)
[2010-04-29 01:15:47] E [saved-frames.c:165:saved_frames_unwind] gluster006-3: forced unwinding frame type(1) op(LOOKUP)
[2010-04-29 01:15:47] E [saved-frames.c:165:saved_frames_unwind] gluster006-3: forced unwinding frame type(2) op(PING)
[2010-04-29 01:15:47] N [client-protocol.c:6994:notify] gluster006-3: disconnected
[2010-04-29 01:15:47] E [saved-frames.c:165:saved_frames_unwind] gluster006-1: forced unwinding frame type(1) op(LOOKUP)
[2010-04-29 01:15:47] E [saved-frames.c:165:saved_frames_unwind] gluster006-1: forced unwinding frame type(1) op(LOOKUP)
[2010-04-29 01:15:47] E [saved-frames.c:165:saved_frames_unwind] gluster006-1: forced unwinding frame type(1) op(LOOKUP)
[2010-04-29 01:15:47] E [saved-frames.c:165:saved_frames_unwind] gluster006-1: forced unwinding frame type(2) op(PING)
[2010-04-29 01:15:47] N [client-protocol.c:6994:notify] gluster006-1: disconnected
[2010-04-29 01:15:54] E [client-protocol.c:415:client_ping_timer_expired] gluster006-3: Server 192.168.101.56:10002 has not responded in the last 42 seconds, disconnecting.
[2010-04-29 01:15:59] E [socket.c:762:socket_connect_finish] gluster006-3: connection to 192.168.101.56:10002 failed (No route to host)
[2010-04-29 01:16:08] E [socket.c:762:socket_connect_finish] gluster006-1: connection to 192.168.101.56:10002 failed (Connection timed out)
[2010-04-29 01:16:08] E [socket.c:762:socket_connect_finish] gluster006-1: connection to 192.168.101.56:10002 failed (Connection timed out)
[2010-04-29 01:16:18] E [socket.c:762:socket_connect_finish] gluster006-3: connection to 192.168.101.56:10002 failed (Connection timed out)
[2010-04-29 01:16:36] E [client-protocol.c:415:client_ping_timer_expired] gluster006-2: Server 192.168.101.56:10002 has not responded in the last 42 seconds, disconnecting.
[2010-04-29 01:16:36] E [saved-frames.c:165:saved_frames_unwind] gluster006-2: forced unwinding frame type(1) op(LOOKUP)
[2010-04-29 01:16:36] E [saved-frames.c:165:saved_frames_unwind] gluster006-2: forced unwinding frame type(1) op(LOOKUP)
[2010-04-29 01:16:36] E [saved-frames.c:165:saved_frames_unwind] gluster006-2: forced unwinding frame type(1) op(STAT)
[2010-04-29 01:16:36] E [saved-frames.c:165:saved_frames_unwind] gluster006-2: forced unwinding frame type(2) op(PING)
[2010-04-29 01:16:36] N [client-protocol.c:6994:notify] gluster006-2: disconnected
[2010-04-29 01:16:40] N [client-protocol.c:6246:client_setvolume_cbk] gluster006-2: Connected to 192.168.101.56:10002, attached to remote volume 'brick2'.
[2010-04-29 01:16:41] N [client-protocol.c:6246:client_setvolume_cbk] gluster006-2: Connected to 192.168.101.56:10002, attached to remote volume 'brick2'.
[2010-04-29 01:16:41] N [client-protocol.c:6246:client_setvolume_cbk] gluster006-3: Connected to 192.168.101.56:10002, attached to remote volume 'brick3'.
[2010-04-29 01:16:41] E [afr-self-heal-common.c:1214:sh_missing_entries_create] mirror-8: no missing files - /working/archive2.log. proceeding to metadata check
pending frames:
frame : type(1) op(LOOKUP)

patchset: v3.0.4avail1
signal received: 11
time of crash: 2010-04-29 01:16:41
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.0.4avail1
/lib/libc.so.6[0x7fedeed19290]
/usr/local/lib/glusterfs/3.0.4avail1/xlator/cluster/replicate.so(afr_sh_data_fix+0x1d5)[0x7fedee0a6005]
/usr/local/lib/glusterfs/3.0.4avail1/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0xa2)[0x7fedee0a6292]
/usr/local/lib/glusterfs/3.0.4avail1/xlator/protocol/client.so(client_fstat_cbk+0x1aa)[0x7fedee2de93a]
/usr/local/lib/glusterfs/3.0.4avail1/xlator/protocol/client.so(protocol_client_pollin+0xca)[0x7fedee2ca4da]
/usr/local/lib/glusterfs/3.0.4avail1/xlator/protocol/client.so(notify+0xe0)[0x7fedee2d9c90]
/usr/local/lib/libglusterfs.so.0(xlator_notify+0x43)[0x7fedef471493]
/usr/local/lib/glusterfs/3.0.4avail1/transport/socket.so(socket_event_handler+0xe0)[0x7fedec801690]
/usr/local/lib/libglusterfs.so.0[0x7fedef48bdbf]
/usr/local/sbin/glusterfs(main+0x8cd)[0x40451d]
/lib/libc.so.6(__libc_start_main+0xe6)[0x7fedeed055c6]
/usr/local/sbin/glusterfs[0x402a59]

Comment 2 Amar Tumballi 2010-05-04 08:07:00 UTC
This issue came up with the patch for bug 762547. Currently this is fixed by reverting the patch.

*** This bug has been marked as a duplicate of bug 815 ***

Comment 3 Shehjar Tikoo 2010-05-31 04:12:14 UTC
*** Bug 953 has been marked as a duplicate of this bug. ***

Comment 4 Raghavendra Bhat 2010-06-15 06:25:37 UTC
Bug 815 which is the duplicate if this bug has been checked and moved to verified status.