Bug 761894 (GLUSTER-162)

Summary: Replication segfaults with many nodes
Product: [Community] GlusterFS
Reporter: Ville Tuulos <tuulos>
Component: protocol
Assignee: Vijay Bellur <vbellur>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: low
Version: mainline
CC: gluster-bugs, gowda, tuulos, vijay
Hardware: All
OS: Linux
Attachments:
  Volume file with 4 nodes
  Volume file with 8 nodes (segfaults after chdir)

Description Ville Tuulos 2009-07-24 01:54:19 UTC
Created attachment 42
Volume file with 4 nodes

Comment 1 Ville Tuulos 2009-07-24 04:53:35 UTC
My distribute+replicate volfile (attached) seems to work correctly with a small number of nodes (the attached volfile has 4 nodes and 8 volumes). However, when I increase the number of nodes to more than 6, I can mount GlusterFS fine, but as soon as I chdir into the GlusterFS mountpoint, the glusterfs process segfaults on the node where I access the directory, and also on the nodes in the same replication group.
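
For reference, this is the general shape of a distribute-over-replicate client volfile in GlusterFS 2.x syntax. It is only an illustrative sketch, not the attached file: the hostnames (node1, node2), the volume names (brick1, posix1, repl1, dist) and the pairing are assumptions made for the example.

-- illustrative volfile sketch (not the attached file) --

# One protocol/client volume per exported brick; hostnames are assumed.
volume brick1
  type protocol/client
  option transport-type tcp
  option remote-host node1
  option remote-subvolume posix1
end-volume

volume brick2
  type protocol/client
  option transport-type tcp
  option remote-host node2
  option remote-subvolume posix1
end-volume

# Each replication group mirrors one pair of bricks.
volume repl1
  type cluster/replicate
  subvolumes brick1 brick2
end-volume

# ... repl2, repl3, ... are defined the same way over the remaining bricks ...

# Distribute hashes files across the replication groups.
volume dist
  type cluster/distribute
  subvolumes repl1 repl2
end-volume
---------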

I get two kinds of stack traces:

-- stack trace 1 --

(no errors before the trace)

pending frames:
frame : type(1) op(INODELK)
>> message repeats many times
frame : type(1) op(STAT)
>> message repeats many times
patchset: git://git.sv.gnu.org/gluster.git
signal received: 11
time of crash: 2009-07-23 19:44:46
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 2.1.0git
/lib/libc.so.6[0x7f45de6d6f60]
/usr/local/lib/libglusterfs.so.0(inode_ref+0xe)[0x7f45dee3872e]
/usr/local/lib/glusterfs/2.1.0git/xlator/protocol/server.so(server_inodelk_resume+0x1b1)[0x7f45dd456261]
/usr/local/lib/libglusterfs.so.0(call_resume+0x2c0)[0x7f45dee39f50]
/usr/local/lib/glusterfs/2.1.0git/xlator/protocol/server.so(server_inodelk+0x15a)[0x7f45dd458a6a]
/usr/local/lib/glusterfs/2.1.0git/xlator/protocol/server.so(protocol_server_pollin+0x9a)[0x7f45dd453e0a]
/usr/local/lib/glusterfs/2.1.0git/xlator/protocol/server.so(notify+0x8b)[0x7f45dd453e9b]
/usr/local/lib/libglusterfs.so.0(transport_peerproc+0x8a)[0x7f45dee35f9a]
/lib/libpthread.so.0[0x7f45de9fefc7]
/lib/libc.so.6(clone+0x6d)[0x7f45de7745ad]
---------

-- stack trace 2 --

[2009-07-23 21:33:00] E [afr.c:2246:notify] repl1-vol2: All subvolumes are down. Going offline until atleast one of them comes back up.
pending frames:
frame : type(1) op(INODELK)
>> message repeats many times
patchset: git://git.sv.gnu.org/gluster.git
signal received: 11
time of crash: 2009-07-23 21:33:00
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 2.1.0git
/lib/libc.so.6[0x7f636e25cf60]
/usr/local/lib/libglusterfs.so.0(inode_ref+0xe)[0x7f636e9be72e]
/usr/local/lib/glusterfs/2.1.0git/xlator/protocol/server.so(server_inodelk_resume+0x1b1)[0x7f636cfdc261]
/usr/local/lib/libglusterfs.so.0(call_resume+0x2c0)[0x7f636e9bff50]
/usr/local/lib/glusterfs/2.1.0git/xlator/protocol/server.so(server_inodelk+0x15a)[0x7f636cfdea6a]
/usr/local/lib/glusterfs/2.1.0git/xlator/protocol/server.so(protocol_server_pollin+0x9a)[0x7f636cfd9e0a]
/usr/local/lib/glusterfs/2.1.0git/xlator/protocol/server.so(notify+0x8b)[0x7f636cfd9e9b]
/usr/local/lib/libglusterfs.so.0(xlator_notify+0x43)[0x7f636e9b18e3]
/usr/local/lib/glusterfs/2.1.0git/transport/socket.so(socket_event_handler+0xd0)[0x7f636c11bfa0]
/usr/local/lib/libglusterfs.so.0[0x7f636e9cac77]
glusterfs(main+0x8ad)[0x403ffd]
/lib/libc.so.6(__libc_start_main+0xe6)[0x7f636e2491a6]
glusterfs[0x402859]
---------

Comment 2 Basavanagowda Kanur 2009-07-24 08:44:11 UTC
The problem is in protocol/server and has nothing to do with the increased number of nodes.

Thanks for reporting the bug. A fix will be available soon in the git repository.

Comment 3 Anand Avati 2009-07-27 15:33:59 UTC
PATCH: http://patches.gluster.com/patch/818 in release-2.0 (protocol/server: add checks for updatation of loc->parent in entrylk() or inodelk().)
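
For context, both traces above die inside inode_ref() called from server_inodelk_resume(), which is consistent with an inodelk request arriving with a loc whose inode/parent was never resolved. The snippet below is a minimal, self-contained sketch of that failure mode and of the kind of NULL guard the patch subject describes; the struct definitions are simplified stand-ins for the real libglusterfs types, and the actual change is the one at the URL above.

-- illustrative C sketch (simplified stand-ins, not the actual patch) --

#include <stdio.h>

/* Simplified stand-ins for GlusterFS's inode_t and loc_t. */
typedef struct inode { int ref; } inode_t;
typedef struct loc   { inode_t *inode; inode_t *parent; } loc_t;

/* Stand-in for inode_ref(): it dereferences its argument, so a NULL inode
   from an unresolved loc crashes here -- the inode_ref frame in the traces. */
static inode_t *inode_ref_sketch (inode_t *inode)
{
        inode->ref++;
        return inode;
}

/* Guard in the spirit of the fix: fail the inodelk/entrylk call instead of
   crashing when resolution left loc->inode or loc->parent unset. */
static int inodelk_resume_sketch (loc_t *loc)
{
        if (!loc || !loc->inode || !loc->parent) {
                fprintf (stderr, "inodelk: loc not fully resolved, failing\n");
                return -1;
        }
        inode_ref_sketch (loc->inode);
        /* ... proceed with the lock operation ... */
        return 0;
}

int main (void)
{
        loc_t unresolved = { NULL, NULL };   /* what the server saw */
        return (inodelk_resume_sketch (&unresolved) == -1) ? 0 : 1;
}
---------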