Description of problem:
Glusterd crashed when a volume was force started after bringing down a brick of one of the volumes.

Version-Release number of selected component (if applicable):
3.8.4.22

Steps to Reproduce:
1. Created two distributed-replicate (3 x 2) volumes and started them
2. Enabled brick multiplexing
3. Created another distributed-replicate volume
4. Checked the volume status:

[root@dhcp37-135 ~]# gluster vol status
Status of volume: first
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t41                             49154     0          Y       17199
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t42                             49152     0          Y       16861
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t43                             49152     0          Y       16839
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t44                             49155     0          Y       17218
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t45                             49153     0          Y       16880
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t46                             49153     0          Y       16858
Self-heal Daemon on localhost               N/A       N/A        Y       17508
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17073
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17053

Task Status of Volume first
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: second
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t1                              49156     0          Y       17321
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t2                              49154     0          Y       16935
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t3                              49154     0          Y       16914
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t4                              49157     0          Y       17340
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t5                              49155     0          Y       16954
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t6                              49155     0          Y       16933
Self-heal Daemon on localhost               N/A       N/A        Y       17508
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17073
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17053

Task Status of Volume second
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: third
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t21                             49154     0          Y       17199
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t22                             49152     0          Y       16861
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t23                             49152     0          Y       16839
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t24                             49154     0          Y       17199
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t25                             49152     0          Y       16861
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t26                             49152     0          Y       16839
Self-heal Daemon on localhost               N/A       N/A        Y       17508
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17073
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17053

Task Status of Volume third
------------------------------------------------------------------------------
There are no active volume tasks

5. Killed the brick process of volume "second":
[root@dhcp37-135 ~]# kill -9 17321
[root@dhcp37-135 ~]# gluster vol status second
Status of volume: second
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t1                              N/A       N/A        N       N/A
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t2                              49154     0          Y       16935
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t3                              49154     0          Y       16914
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t4                              49157     0          Y       17340
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t5                              49155     0          Y       16954
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t6                              49155     0          Y       16933
Self-heal Daemon on localhost               N/A       N/A        Y       17508
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17053
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17073

Task Status of Volume second
------------------------------------------------------------------------------
There are no active volume tasks

6. Force started volumes "second" and "third"
7. Checked the status of volume "second":

Status of volume: second
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t1                              49157     0          Y       17340
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t2                              49154     0          Y       16935
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t3                              49154     0          Y       16914
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t4                              49157     0          Y       17340
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t5                              49155     0          Y       16954
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t6                              49155     0          Y       16933
Self-heal Daemon on localhost               N/A       N/A        Y       18154
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17598
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17618

Task Status of Volume second
------------------------------------------------------------------------------
There are no active volume tasks

Note that brick t1 now reports the same PID (17340) as brick t4: with multiplexing enabled, the restarted brick was attached to the existing brick process.

8. Killed the brick process again (kill -9 17340); volume status:

Status of volume: second
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t1                              N/A       N/A        N       N/A
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t2                              49154     0          Y       16935
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t3                              49154     0          Y       16914
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t4                              N/A       N/A        N       N/A
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t5                              49155     0          Y       16954
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t6                              49155     0          Y       16933
Self-heal Daemon on localhost               N/A       N/A        Y       18154
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17618
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17598

Task Status of Volume second
------------------------------------------------------------------------------
There are no active volume tasks

9. Disabled brick multiplexing and force started volume "second":

[root@dhcp37-135 ~]# gluster vol set all cluster.brick-multiplex disable
volume set: success
[root@dhcp37-135 ~]# gluster vol start second force
Connection failed. Please check if gluster daemon is operational.

Actual results:
glusterd crashed and generated a core.

Expected results:
The volume should start without any issues.

Additional info:
Logs are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/bmekala/bug.1443961/
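For convenience, the steps above condensed into a script. This is a sketch, not from the original report: the volume layouts and brick paths follow the status output, the pidfile naming follows the backtrace below, and it assumes the three nodes are already in a trusted pool with /bricks/brick1 available, run from dhcp37-135.

#!/bin/sh
# Condensed reproducer (sketch). "--mode=script" suppresses CLI prompts.
H1=dhcp37-135.lab.eng.blr.redhat.com
H2=dhcp37-138.lab.eng.blr.redhat.com
H3=dhcp37-140.lab.eng.blr.redhat.com

# Steps 1-3: two 3x2 distributed-replicate volumes, enable brick
# multiplexing, then create and start a third volume
gluster --mode=script volume create first replica 2 \
    $H1:/bricks/brick1/t41 $H2:/bricks/brick1/t42 \
    $H3:/bricks/brick1/t43 $H1:/bricks/brick1/t44 \
    $H2:/bricks/brick1/t45 $H3:/bricks/brick1/t46
gluster --mode=script volume create second replica 2 \
    $H1:/bricks/brick1/t1 $H2:/bricks/brick1/t2 \
    $H3:/bricks/brick1/t3 $H1:/bricks/brick1/t4 \
    $H2:/bricks/brick1/t5 $H3:/bricks/brick1/t6
gluster volume start first
gluster volume start second
gluster volume set all cluster.brick-multiplex enable
gluster --mode=script volume create third replica 2 \
    $H1:/bricks/brick1/t21 $H2:/bricks/brick1/t22 \
    $H3:/bricks/brick1/t23 $H1:/bricks/brick1/t24 \
    $H2:/bricks/brick1/t25 $H3:/bricks/brick1/t26
gluster volume start third

# Steps 5-6: kill the first local brick of "second", then force start
# (pidfile path is assumed from the naming seen in the backtrace below)
kill -9 "$(cat /var/lib/glusterd/vols/second/run/$H1-bricks-brick1-t1.pid)"
gluster volume start second force
gluster volume start third force

# Steps 8-9: kill the brick again (now multiplexed into the t4 process),
# disable multiplexing, force start -> glusterd crashes
kill -9 "$(cat /var/lib/glusterd/vols/second/run/$H1-bricks-brick1-t4.pid)"
gluster volume set all cluster.brick-multiplex disable
gluster volume start second force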
Created attachment 1272917: core
(gdb) bt full
#0  0x00007f51105e64e5 in __gf_free (free_ptr=0x7f50ec000cc0) at mem-pool.c:314
        ptr = 0x7f50ec000c80
        mem_acct = 0x7f5112bc9370
        header = 0x7f50ec000c80
        __FUNCTION__ = "__gf_free"
#1  0x00007f51105cd7d2 in gf_timer_call_cancel (ctx=<optimized out>, event=0x7f50ec000cc0) at timer.c:116
        reg = 0x7f5112bb4770
        fired = _gf_false
        __FUNCTION__ = "gf_timer_call_cancel"
#2  0x00007f510510fac3 in glusterd_volume_start_glusterfs (volinfo=volinfo@entry=0x7f50f0022860,
    brickinfo=brickinfo@entry=0x7f50f008b590, wait=wait@entry=_gf_true) at glusterd-utils.c:1867
        ret = -1
        this = 0x7f5112bbd710
        priv = 0x7f51108f0050
        pidfile = "/var/lib/glusterd/vols/second/run/dhcp37-135.lab.eng.blr.redhat.com-bricks-brick1-t4.pid", '\000' <repeats 4008 times>
        volfile = '\000' <repeats 4095 times>
        runner = {argv = 0x0, argvlen = 0, runerr = 0, chpid = 0, chfd = {0, 0, 0}, chio = {0x0, 0x0, 0x0}}
        exp_path = '\000' <repeats 4095 times>
        logfile = '\000' <repeats 4095 times>
        port = 0
        rdma_port = 0
        bind_address = 0x0
        socketpath = "/var/run/gluster/d4e3b659bd9c6b16470fdb7898557e96.socket", '\000' <repeats 4039 times>
        glusterd_uuid = '\000' <repeats 1023 times>
        valgrind_logfile = '\000' <repeats 4095 times>
        rdma_brick_path = '\000' <repeats 4095 times>
        rpc = 0x7f50fc1364f0
        conn = 0x7f50fc136520
        __FUNCTION__ = "glusterd_volume_start_glusterfs"
#3  0x00007f51051124cf in glusterd_brick_start (volinfo=volinfo@entry=0x7f50f0022860,
    brickinfo=brickinfo@entry=0x7f50f008b590, wait=wait@entry=_gf_true) at glusterd-utils.c:5391
        ret = <optimized out>
        this = 0x7f5112bbd710
        conf = <optimized out>
        pid = 17340
        pidfile = "/var/lib/glusterd/vols/second/run/dhcp37-135.lab.eng.blr.redhat.com-bricks-brick1-t4.pid", '\000' <repeats 4007 times>
        fp = 0x7f50f00fe3e0
        socketpath = '\000' <repeats 4095 times>
        other_vol = <optimized out>
        __FUNCTION__ = "glusterd_brick_start"
#4  0x00007f5105186136 in glusterd_start_volume (volinfo=0x7f50f0022860, flags=<optimized out>,
    wait=wait@entry=_gf_true) at glusterd-volume-ops.c:2508
        ret = <optimized out>
        brickinfo = 0x7f50f008b590
        this = 0x7f5112bbd710
        verincrement = GLUSTERD_VOLINFO_VER_AC_NONE
        __FUNCTION__ = "glusterd_start_volume"
#5  0x00007f510518662e in glusterd_op_start_volume (dict=dict@entry=0x7f50f00fef20,
    op_errstr=op_errstr@entry=0x7f50f859d730) at glusterd-volume-ops.c:2636
        ret = <optimized out>
        brick_count = <optimized out>
        brick_mount_dir = 0x0
        key = '\000' <repeats 4095 times>
        volname = 0x7f50f00d6840 "second"
        str = 0x0
        option = _gf_false
        flags = 1
        volinfo = 0x7f50f0022860
        brickinfo = <optimized out>
        this = 0x7f5112bbd710
        conf = 0x7f51108f0050
        svc = 0x0
        __FUNCTION__ = "glusterd_op_start_volume"
#6  0x00007f51051c2ee3 in gd_mgmt_v3_commit_fn (op=op@entry=GD_OP_START_VOLUME, dict=dict@entry=0x7f50f00fef20,
    op_errstr=op_errstr@entry=0x7f50f859d730, op_errno=op_errno@entry=0x7f50f859d724,
    rsp_dict=rsp_dict@entry=0x7f50f000c2c0) at glusterd-mgmt.c:313
        ret = -1
        this = 0x7f5112bbd710
        __FUNCTION__ = "gd_mgmt_v3_commit_fn"
#7  0x00007f51051c5f45 in glusterd_mgmt_v3_commit (op=op@entry=GD_OP_START_VOLUME, op_ctx=op_ctx@entry=0x7f50f0072310,
    req_dict=0x7f50f00fef20, op_errstr=op_errstr@entry=0x7f50f859d730, op_errno=op_errno@entry=0x7f50f859d724,
    txn_generation=txn_generation@entry=4) at glusterd-mgmt.c:1448
        ret = -1
        peer_cnt = 0
        rsp_dict = 0x7f50f000c2c0
        peerinfo = 0x0
        args = {op_ret = 0, op_errno = 0,
          iatt1 = {ia_ino = 0, ia_gfid = '\000' <repeats 15 times>, ia_dev = 0, ia_type = IA_INVAL,
            ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000',
              owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
              group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
              other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}},
            ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 0,
            ia_blocks = 0, ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 0, ia_mtime_nsec = 0,
            ia_ctime = 0, ia_ctime_nsec = 0},
          iatt2 = {ia_ino = 0, ia_gfid = '\000' <repeats 15 times>, ia_dev = 0, ia_type = IA_INVAL,
            ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000',
              owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
              group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
              other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}},
            ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 0,
            ia_blocks = 0, ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 0, ia_mtime_nsec = 0,
            ia_ctime = 0, ia_ctime_nsec = 0},
          xattr = 0x0,
          statvfs_buf = {f_bsize = 0, f_frsize = 0, f_blocks = 0, f_bfree = 0, f_bavail = 0,
            f_files = 0, f_ffree = 0, f_favail = 0, f_fsid = 0, f_flag = 0, f_namemax = 0,
            __f_spare = {0, 0, 0, 0, 0, 0}},
          vector = 0x0, count = 0, iobref = 0x0, buffer = 0x0, xdata = 0x0,
          flock = {l_type = 0, l_whence = 0, l_start = 0, l_len = 0, l_pid = 0,
            l_owner = {len = 0, data = '\000' <repeats 1023 times>}},
          lease = {cmd = 0, lease_type = NONE, lease_id = '\000' <repeats 15 times>, lease_flags = 0},
          uuid = '\000' <repeats 15 times>, errstr = 0x0, dict = 0x0,
          lock_dict = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0,
              __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
            __size = '\000' <repeats 39 times>, __align = 0},
          barrier = {guard = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
                __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
              __size = '\000' <repeats 39 times>, __align = 0},
            cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0,
                __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0},
              __size = '\000' <repeats 47 times>, __align = 0},
            waitq = {next = 0x0, prev = 0x0}, count = 0},
          task = 0x0,
          mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0,
              __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
            __size = '\000' <repeats 39 times>, __align = 0},
          cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0,
              __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0},
            __size = '\000' <repeats 47 times>, __align = 0},
          done = 0,
          entries = {{list = {next = 0x0, prev = 0x0}, {next = 0x0, prev = 0x0}},
            d_ino = 0, d_off = 0, d_len = 0, d_type = 0,
            d_stat = {ia_ino = 0, ia_gfid = '\000' <repeats 15 times>, ia_dev = 0,
              ia_type = IA_INVAL, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000',
                owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
                group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
                other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}},
              ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 0,
              ia_blocks = 0, ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 0, ia_mtime_nsec = 0,
              ia_ctime = 0, ia_ctime_nsec = 0},
            dict = 0x0, inode = 0x0, d_name = 0x7f50f859d230 ""},
          offset = 0,
          locklist = {list = {next = 0x0, prev = 0x0},
            flock = {l_type = 0, l_whence = 0, l_start = 0, l_len = 0, l_pid = 0,
              l_owner = {len = 0, data = '\000' <repeats 1023 times>}},
            client_uid = 0x0, lk_flags = 0}}
        peer_uuid = '\000' <repeats 15 times>
        this = 0x7f5112bbd710
        conf = 0x7f51108f0050
        __FUNCTION__ = "glusterd_mgmt_v3_commit"
#8  0x00007f51051c7b9e in glusterd_mgmt_v3_initiate_all_phases (req=req@entry=0x7f50f80018b0,
    op=op@entry=GD_OP_START_VOLUME, dict=0x7f50f0072310) at glusterd-mgmt.c:2043
        ret = 0
        op_ret = -1
        req_dict = 0x7f50f00fef20
        tmp_dict = 0x7f50f00f9f10
        conf = <optimized out>
        op_errstr = 0x0
        this = 0x7f5112bbd710
        is_acquired = _gf_true
        originator_uuid = <optimized out>
        txn_generation = 4
        op_errno = 0
        __FUNCTION__ = "glusterd_mgmt_v3_initiate_all_phases"
#9  0x00007f510517ec52 in __glusterd_handle_cli_start_volume (req=req@entry=0x7f50f80018b0)
    at glusterd-volume-ops.c:549
        ret = 0
        cli_req = {dict = {dict_len = 82, dict_val = 0x7f50f00fe980 ""}}
        volname = 0x7f50f00d6840 "second"
        dict = 0x7f50f0072310
        errstr = '\000' <repeats 2047 times>
        this = 0x7f5112bbd710
        conf = 0x7f51108f0050
        __FUNCTION__ = "__glusterd_handle_cli_start_volume"
#10 0x00007f51050e4ca0 in glusterd_big_locked_handler (req=0x7f50f80018b0,
    actor_fn=0x7f510517e990 <__glusterd_handle_cli_start_volume>) at glusterd-handler.c:81
        priv = 0x7f51108f0050
        ret = -1
#11 0x00007f51105f6362 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
        task = 0x7f50f818a1f0
#12 0x00007f510ecb3cf0 in ?? () from /lib64/libc.so.6
No symbol table info available.
#13 0x0000000000000000 in ?? ()
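For reference, a trace like the one above can be regenerated from the attached core along these lines (a sketch: the core path is a placeholder, and matching glusterfs debuginfo packages are assumed to be available in the enabled repos):

# Install matching debug symbols, then dump the trace non-interactively
# (avoids the interactive gdb pager prompts)
debuginfo-install -y glusterfs
gdb -batch -ex 'set pagination off' -ex 'bt full' \
    /usr/sbin/glusterd /path/to/core    # core path is a placeholder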
Symptoms look similar to those in BZ 1443123.
Upstream patch: https://review.gluster.org/#/c/17088/
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/104115/
Verified in 3.8.4.24: glusterd did not generate any cores, and volume force start worked fine after disabling brick multiplexing. Hence, marking the bug as verified.
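A minimal sketch of the verification sequence on 3.8.4.24, assuming the reproduction steps above were rerun first:

# Disable multiplexing and force start, which previously killed glusterd
gluster volume set all cluster.brick-multiplex disable
gluster volume start second force    # succeeds instead of crashing glusterd
gluster volume status second         # all bricks report Online "Y"
pidof glusterd                       # daemon still running
# No new glusterd core should appear (location depends on core_pattern)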
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774