Description of problem:
=======================
Encountered a crash with the following backtrace:

#0  0x0000562580dca0b1 in glusterfs_handle_translator_op (req=0x7efeb8001a70) at glusterfsd-mgmt.c:674
674             any = active->first;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libselinux-2.5-11.el7.x86_64 libuuid-2.23.2-43.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.15.2-50.el7_4.6.x86_64 zlib-1.2.7-17.el7.x86_64

(gdb) bt
#0  0x0000562580dca0b1 in glusterfs_handle_translator_op (req=0x7efeb8001a70) at glusterfsd-mgmt.c:674
#1  0x00007efecc2691e2 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#2  0x00007efeca8acd40 in ?? () from /lib64/libc.so.6
#3  0x0000000000000000 in ?? ()

(gdb) bt full
#0  0x0000562580dca0b1 in glusterfs_handle_translator_op (req=0x7efeb8001a70) at glusterfsd-mgmt.c:674
        ret = 592
        op_ret = 0
        xlator_req = {name = 0x7efeb00008e0 "", op = 3, input = {input_len = 577, input_val = 0x7efeb0000900 ""}}
        input = 0x0
        xlator = 0x0
        any = 0x0
        output = 0x0
        key = '\000' <repeats 2047 times>
        xname = 0x0
        ctx = <optimized out>
        active = 0x0
        this = 0x7efecc4fc700 <global_xlator>
        i = 0
        count = 0
        __FUNCTION__ = "glusterfs_handle_translator_op"
#1  0x00007efecc2691e2 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
        task = 0x7efeb8003330
#2  0x00007efeca8acd40 in ?? () from /lib64/libc.so.6
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-52.el7rhgs.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Created a dist-rep volume on physical machines.

2. Did a replace-brick:

[root@gqas001 ~]# gluster volume replace-brick distrep gqas001.sbu.lab.eng.bos.redhat.com:/bricks5/b1 gqas001.sbu.lab.eng.bos.redhat.com:/bricks7/b1 commit force

Got the following error:
volume replace-brick: failed: Commit failed on localhost. Please check log file for details.

3. Tried replace-brick with a new node as the destination:

[root@gqas001 ~]# gluster volume replace-brick distrep gqas001.sbu.lab.eng.bos.redhat.com:/bricks5/b1 gqas004.sbu.lab.eng.bos.redhat.com:/bricks7/b1 commit force

The error said that bricks5/b1 does not exist. Looking at the volume info, the brick had already been replaced (by the first attempt):

Volume Name: distrep
Type: Distributed-Replicate
Volume ID: 19c5e552-1e03-414e-afad-0f515edb6a68
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: gqas001.sbu.lab.eng.bos.redhat.com:/bricks7/b1
Brick2: gqas004.sbu.lab.eng.bos.redhat.com:/bricks5/b2
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks5/b3
Brick4: gqas001.sbu.lab.eng.bos.redhat.com:/bricks6/b4
Brick5: gqas004.sbu.lab.eng.bos.redhat.com:/bricks6/b5
Brick6: gqas010.sbu.lab.eng.bos.redhat.com:/bricks6/b6
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: on

4. gluster volume status showed that particular brick down. Noticed the crash on gqas004.sbu.lab.eng.bos.redhat.com.
Additional info:
================
From the glustershd log:
------------------------
[2017-11-30 05:52:09.479007] I [MSGID: 100030] [glusterfsd.c:2441:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/582443010e17bb4fb5c4cdfc983262e5.socket --xlator-option *replicate*.node-uuid=232c5069-75ee-4c72-a8f8-f623583e7c6b)
[2017-11-30 05:52:09.498525] I [MSGID: 101190] [event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-11-30 05:52:09.498598] E [socket.c:2360:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket
[2017-11-30 05:52:09.498635] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-11-30 05:52:12
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7efecc232842]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7efecc23c374]
/lib64/libc.so.6(+0x35270)[0x7efeca89b270]
/usr/sbin/glusterfs(glusterfs_handle_translator_op+0xd1)[0x562580dca0b1]
/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7efecc2691e2]
/lib64/libc.so.6(+0x46d40)[0x7efeca8acd40]
---------
Rochelle, the shd crash is a known issue (see https://bugzilla.redhat.com/show_bug.cgi?id=1460245#c5), but it is independent of the replace-brick operation. Can you check whether the replace-brick failure ("Commit failed on localhost") is reproducible? If it is, that is a more serious issue we need to look into.
The shd crash has been fixed in rhgs-3.4.0 via BZ 1593865. Rochelle, should we close this one as a duplicate of 1593865?
(In reply to Ravishankar N from comment #4)
> The shd crash has been fixed in rhgs-3.4.0 via BZ 1593865.
> Rochelle, should we close this one as a duplicate of 1593865?

I'm closing this as a duplicate of 1593865, which has the fix. Please feel free to re-open/raise a new bug as appropriate if you see any more glustershd crashes.

*** This bug has been marked as a duplicate of bug 1593865 ***