Description of problem:
=======================
Encountered a crash with the following backtrace:

#0  0x0000562580dca0b1 in glusterfs_handle_translator_op (req=0x7efeb8001a70) at glusterfsd-mgmt.c:674
674             any = active->first;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libselinux-2.5-11.el7.x86_64 libuuid-2.23.2-43.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.15.2-50.el7_4.6.x86_64 zlib-1.2.7-17.el7.x86_64

(gdb) bt
#0  0x0000562580dca0b1 in glusterfs_handle_translator_op (req=0x7efeb8001a70) at glusterfsd-mgmt.c:674
#1  0x00007efecc2691e2 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#2  0x00007efeca8acd40 in ?? () from /lib64/libc.so.6
#3  0x0000000000000000 in ?? ()

(gdb) bt full
#0  0x0000562580dca0b1 in glusterfs_handle_translator_op (req=0x7efeb8001a70) at glusterfsd-mgmt.c:674
        ret = 592
        op_ret = 0
        xlator_req = {name = 0x7efeb00008e0 "", op = 3, input = {input_len = 577, input_val = 0x7efeb0000900 ""}}
        input = 0x0
        xlator = 0x0
        any = 0x0
        output = 0x0
        key = '\000' <repeats 2047 times>
        xname = 0x0
        ctx = <optimized out>
        active = 0x0
        this = 0x7efecc4fc700 <global_xlator>
        i = 0
        count = 0
        __FUNCTION__ = "glusterfs_handle_translator_op"
#1  0x00007efecc2691e2 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
        task = 0x7efeb8003330
#2  0x00007efeca8acd40 in ?? () from /lib64/libc.so.6
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.8.4-52.el7rhgs.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
===================
1. Created a dist-rep volume on physical machines.

2. Did a replace-brick:

[root@gqas001 ~]# gluster volume replace-brick distrep gqas001.sbu.lab.eng.bos.redhat.com:/bricks5/b1 gqas001.sbu.lab.eng.bos.redhat.com:/bricks7/b1 commit force

Got the following error:
volume replace-brick: failed: Commit failed on localhost. Please check log file for details.

3. Tried replace-brick with a new node as the destination:

[root@gqas001 ~]# gluster volume replace-brick distrep gqas001.sbu.lab.eng.bos.redhat.com:/bricks5/b1 gqas004.sbu.lab.eng.bos.redhat.com:/bricks7/b1 commit force

The error said that bricks5/b1 does not exist. Looking at the volume info, the brick had already been replaced (by the first attempt):

Volume Name: distrep
Type: Distributed-Replicate
Volume ID: 19c5e552-1e03-414e-afad-0f515edb6a68
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: gqas001.sbu.lab.eng.bos.redhat.com:/bricks7/b1
Brick2: gqas004.sbu.lab.eng.bos.redhat.com:/bricks5/b2
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks5/b3
Brick4: gqas001.sbu.lab.eng.bos.redhat.com:/bricks6/b4
Brick5: gqas004.sbu.lab.eng.bos.redhat.com:/bricks6/b5
Brick6: gqas010.sbu.lab.eng.bos.redhat.com:/bricks6/b6
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
transport.address-family: inet
nfs.disable: on

4. gluster volume status showed that particular brick down. Noticed the crash on gqas004.sbu.lab.eng.bos.redhat.com.
Additional info:
================
From the glustershd log:
------------------------
[2017-11-30 05:52:09.479007] I [MSGID: 100030] [glusterfsd.c:2441:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.8.4 (args: /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/582443010e17bb4fb5c4cdfc983262e5.socket --xlator-option *replicate*.node-uuid=232c5069-75ee-4c72-a8f8-f623583e7c6b)
[2017-11-30 05:52:09.498525] I [MSGID: 101190] [event-epoll.c:602:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-11-30 05:52:09.498598] E [socket.c:2360:socket_connect_finish] 0-glusterfs: connection to ::1:24007 failed (Connection refused); disconnecting socket
[2017-11-30 05:52:09.498635] I [glusterfsd-mgmt.c:2214:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-11-30 05:52:12
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7efecc232842]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7efecc23c374]
/lib64/libc.so.6(+0x35270)[0x7efeca89b270]
/usr/sbin/glusterfs(glusterfs_handle_translator_op+0xd1)[0x562580dca0b1]
/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7efecc2691e2]
/lib64/libc.so.6(+0x46d40)[0x7efeca8acd40]
---------
Rochelle, the shd crash is a known issue (see https://bugzilla.redhat.com/show_bug.cgi?id=1460245#c5), but it is independent of the replace-brick operation. Can you check whether the replace-brick failure ("Commit failed on localhost") is reproducible? If it is, that is a more serious issue we need to look into.
The shd crash has been fixed in rhgs-3.4.0 via BZ 1593865. Rochelle, should we close this one as a duplicate of 1593865?
(In reply to Ravishankar N from comment #4)
> The shd crash has been fixed in rhgs-3.4.0 via BZ 1593865.
> Rochelle, should we close this one as a duplicate of 1593865?

I'm closing this as a duplicate of 1593865, which has the fix. Please feel free to re-open/raise a new bug as appropriate if you see any more glustershd crashes.

*** This bug has been marked as a duplicate of bug 1593865 ***