Description of problem:
=======================
On a DHT volume, the client crashed while parallel mkdir and file creation
operations were running inside it.

Version-Release number of selected component (if applicable):
=============================================================
3.6.0.45-1.el6rhs.x86_64

How reproducible:
=================
Haven't tried

Steps to Reproduce:
===================
1. Create a DHT volume with 10 bricks.
2. Mount it on 2 clients: 4 FUSE and 4 NFS mounts.
3. Start creating directories from multiple mounts, and create files inside them.
4. Keep changing the epoll thread value.

Actual results:
===============
After some time, the client process crashed on both client machines.

Expected results:
=================
The client process should not crash.

Additional info:
================
log:
[2015-02-19 15:15:01.756215] I [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Exited thread with index 22
pending frames:
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2015-02-19 15:15:59
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.0.45
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3dcac207b6]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3dcac3b3cf]
/lib64/libc.so.6[0x3dc94326a0]
/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_discover_complete+0x3a)[0x7fe9f8f7d98a]
/usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so(dht_discover_cbk+0x273)[0x7fe9f8f85733]
/usr/lib64/glusterfs/3.6.0.45/xlator/protocol/client.so(client3_3_lookup_cbk+0x647)[0x7fe9f91c9687]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x3dca80e785]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x142)[0x3dca80fc12]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x3dca80b3e8]
/usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so(+0x9301)[0x7fe9fa42d301]
/usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so(+0xad1d)[0x7fe9fa42ed1d]
/usr/lib64/libglusterfs.so.0[0x3dcac77d1c]
/lib64/libpthread.so.0[0x3dc98079d1]
/lib64/libc.so.6(clone+0x6d)[0x3dc94e89dd]

core:
(gdb) bt
#0  0x00007fe9f8f7d98a in dht_discover_complete () from /usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so
#1  0x00007fe9f8f85733 in dht_discover_cbk () from /usr/lib64/glusterfs/3.6.0.45/xlator/cluster/distribute.so
#2  0x00007fe9f91c9687 in client3_3_lookup_cbk () from /usr/lib64/glusterfs/3.6.0.45/xlator/protocol/client.so
#3  0x0000003dca80e785 in rpc_clnt_handle_reply () from /usr/lib64/libgfrpc.so.0
#4  0x0000003dca80fc12 in rpc_clnt_notify () from /usr/lib64/libgfrpc.so.0
#5  0x0000003dca80b3e8 in rpc_transport_notify () from /usr/lib64/libgfrpc.so.0
#6  0x00007fe9fa42d301 in ?? () from /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so
#7  0x00007fe9fa42ed1d in ?? () from /usr/lib64/glusterfs/3.6.0.45/rpc-transport/socket.so
#8  0x0000003dcac77d1c in ?? () from /usr/lib64/libglusterfs.so.0
#9  0x0000003dc98079d1 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003dc94e89dd in clone () from /lib64/libc.so.6
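The reproduction steps above can be sketched as a shell session. All volume, host, brick, and mount-point names below are placeholders, and the sketch assumes the `client.event-threads`/`server.event-threads` volume options are what "epoll thread value" refers to; it is an illustration of the procedure, not a verified reproducer:

```shell
# Create and start a 10-brick distribute (DHT) volume -- server and brick
# paths are hypothetical.
gluster volume create distvol server{1..10}:/bricks/brick1 force
gluster volume start distvol

# On each of the 2 client machines, set up FUSE and NFS mounts
# (4 of each kind across the clients).
mount -t glusterfs server1:/distvol /mnt/fuse1
mount -t nfs -o vers=3 server1:/distvol /mnt/nfs1

# Drive parallel directory creation with file creation inside, one
# background loop per mount.
for m in /mnt/fuse1 /mnt/nfs1; do
    ( for i in $(seq 1 10000); do
          mkdir -p "$m/dir.$i" && touch "$m/dir.$i/file.$i"
      done ) &
done

# Meanwhile, keep changing the epoll thread count at random.
while true; do
    n=$(( (RANDOM % 8) + 1 ))
    gluster volume set distvol client.event-threads "$n"
    gluster volume set distvol server.event-threads "$n"
    sleep 5
done
```

The option changes take effect on live connections, which is what exercises the epoll worker threads (note the "Exited thread with index 22" message just before the crash).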
Changing the epoll thread value continuously and at random does not seem to be a real-world scenario; the value is expected to stay mostly static. While the crash should be investigated, this is not a blocker.
https://code.engineering.redhat.com/gerrit/42685
Tested this on the following build:

[root@dht-rhs-19 nfs_mnt1]# rpm -qa |grep -i gluster
gluster-nagios-common-0.1.4-1.el6rhs.noarch
vdsm-gluster-4.14.7.3-1.el6rhs.noarch
glusterfs-api-3.6.0.48-1.el6rhs.x86_64
glusterfs-geo-replication-3.6.0.48-1.el6rhs.x86_64
glusterfs-3.6.0.48-1.el6rhs.x86_64
gluster-nagios-addons-0.1.14-1.el6rhs.x86_64
samba-glusterfs-3.6.509-169.4.el6rhs.x86_64
glusterfs-fuse-3.6.0.48-1.el6rhs.x86_64
glusterfs-server-3.6.0.48-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.48-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.48-1.el6rhs.x86_64
glusterfs-libs-3.6.0.48-1.el6rhs.x86_64
glusterfs-cli-3.6.0.48-1.el6rhs.x86_64
[root@dht-rhs-19 nfs_mnt1]#

Started a recursive directory creation and rename operation and, while it was in progress, kept changing the epoll server and client thread values. No crash seen. Marking the bug verified.
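The verification workload can be sketched as follows. Volume and mount names are placeholders, and the rename pass is what distinguishes it from the original reproducer; as above, the epoll thread values are assumed to be the `server.event-threads`/`client.event-threads` volume options:

```shell
# Recursive directory creation followed by renames, run from a mount.
mkdir -p /mnt/fuse1/load
( for i in $(seq 1 5000); do
      mkdir -p "/mnt/fuse1/load/d.$i/sub"
      mv "/mnt/fuse1/load/d.$i" "/mnt/fuse1/load/renamed.$i"
  done ) &

# While the workload runs, flip both event-thread values back and forth.
for n in 2 8 3 7 4 6; do
    gluster volume set distvol server.event-threads "$n"
    gluster volume set distvol client.event-threads "$n"
    sleep 30
done
wait
```

With the fixed build (3.6.0.48), no client crash is expected under this sequence.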
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0682.html