Description of problem:
Glusterd crashed when a volume was force started after bringing down a brick of one of the volumes.

Version-Release number of selected component (if applicable):
3.8.4.22

Steps to Reproduce:
1. Created two distributed-replicate (3 x 2) volumes and started them
2. Enabled brick multiplexing
3. Created another distributed-replicate volume
4. Checked the volume status:

[root@dhcp37-135 ~]# gluster vol status
Status of volume: first
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t41                             49154     0          Y       17199
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t42                             49152     0          Y       16861
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t43                             49152     0          Y       16839
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t44                             49155     0          Y       17218
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t45                             49153     0          Y       16880
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t46                             49153     0          Y       16858
Self-heal Daemon on localhost               N/A       N/A        Y       17508
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17073
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17053

Task Status of Volume first
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: second
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t1                              49156     0          Y       17321
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t2                              49154     0          Y       16935
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t3                              49154     0          Y       16914
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t4                              49157     0          Y       17340
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t5                              49155     0          Y       16954
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t6                              49155     0          Y       16933
Self-heal Daemon on localhost               N/A       N/A        Y       17508
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17073
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17053

Task Status of Volume second
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: third
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t21                             49154     0          Y       17199
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t22                             49152     0          Y       16861
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t23                             49152     0          Y       16839
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t24                             49154     0          Y       17199
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t25                             49152     0          Y       16861
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t26                             49152     0          Y       16839
Self-heal Daemon on localhost               N/A       N/A        Y       17508
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17073
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17053

Task Status of Volume third
------------------------------------------------------------------------------
There are no active volume tasks

5. Killed the brick process of volume "second":
[root@dhcp37-135 ~]# kill -9 17321
[root@dhcp37-135 ~]# gluster vol status second
Status of volume: second
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t1                              N/A       N/A        N       N/A
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t2                              49154     0          Y       16935
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t3                              49154     0          Y       16914
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t4                              49157     0          Y       17340
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t5                              49155     0          Y       16954
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t6                              49155     0          Y       16933
Self-heal Daemon on localhost               N/A       N/A        Y       17508
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17053
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17073

Task Status of Volume second
------------------------------------------------------------------------------
There are no active volume tasks

6. Force started volumes "second" and "third"
7. Checked the status of volume "second":

Status of volume: second
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t1                              49157     0          Y       17340
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t2                              49154     0          Y       16935
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t3                              49154     0          Y       16914
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t4                              49157     0          Y       17340
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t5                              49155     0          Y       16954
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t6                              49155     0          Y       16933
Self-heal Daemon on localhost               N/A       N/A        Y       18154
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17598
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17618

Task Status of Volume second
------------------------------------------------------------------------------
There are no active volume tasks

Note that brick t1 now reports the same PID (17340) as brick t4: with multiplexing enabled, the restarted brick was attached to the existing brick process.

8. Killed the brick process again (kill -9 17340); volume status:

Status of volume: second
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t1                              N/A       N/A        N       N/A
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t2                              49154     0          Y       16935
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t3                              49154     0          Y       16914
Brick dhcp37-135.lab.eng.blr.redhat.com:/br
icks/brick1/t4                              N/A       N/A        N       N/A
Brick dhcp37-138.lab.eng.blr.redhat.com:/br
icks/brick1/t5                              49155     0          Y       16954
Brick dhcp37-140.lab.eng.blr.redhat.com:/br
icks/brick1/t6                              49155     0          Y       16933
Self-heal Daemon on localhost               N/A       N/A        Y       18154
Self-heal Daemon on dhcp37-138.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17618
Self-heal Daemon on dhcp37-140.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       17598

Task Status of Volume second
------------------------------------------------------------------------------
There are no active volume tasks

9. Disabled brick multiplexing and force started volume "second":

[root@dhcp37-135 ~]# gluster vol set all cluster.brick-multiplex disable
volume set: success
[root@dhcp37-135 ~]# gluster vol start second force
Connection failed. Please check if gluster daemon is operational.

Actual results:
glusterd crashed and generated a core.

Expected results:
The volume should start without any issues.

Additional info:
Logs are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/bmekala/bug.1443961/
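For convenience, the steps above condensed into a script. This is a sketch, not from the original report: the volume layouts and brick paths follow the status output, the pidfile naming follows the backtrace below, and it assumes the three nodes are already in a trusted pool with /bricks/brick1 available, run from dhcp37-135.

#!/bin/sh
# Condensed reproducer (sketch). "--mode=script" suppresses CLI prompts.
H1=dhcp37-135.lab.eng.blr.redhat.com
H2=dhcp37-138.lab.eng.blr.redhat.com
H3=dhcp37-140.lab.eng.blr.redhat.com

# Steps 1-3: two 3x2 distributed-replicate volumes, enable brick
# multiplexing, then create and start a third volume
gluster --mode=script volume create first replica 2 \
    $H1:/bricks/brick1/t41 $H2:/bricks/brick1/t42 \
    $H3:/bricks/brick1/t43 $H1:/bricks/brick1/t44 \
    $H2:/bricks/brick1/t45 $H3:/bricks/brick1/t46
gluster --mode=script volume create second replica 2 \
    $H1:/bricks/brick1/t1 $H2:/bricks/brick1/t2 \
    $H3:/bricks/brick1/t3 $H1:/bricks/brick1/t4 \
    $H2:/bricks/brick1/t5 $H3:/bricks/brick1/t6
gluster volume start first
gluster volume start second
gluster volume set all cluster.brick-multiplex enable
gluster --mode=script volume create third replica 2 \
    $H1:/bricks/brick1/t21 $H2:/bricks/brick1/t22 \
    $H3:/bricks/brick1/t23 $H1:/bricks/brick1/t24 \
    $H2:/bricks/brick1/t25 $H3:/bricks/brick1/t26
gluster volume start third

# Steps 5-6: kill the first local brick of "second", then force start
# (pidfile path is assumed from the naming seen in the backtrace below)
kill -9 "$(cat /var/lib/glusterd/vols/second/run/$H1-bricks-brick1-t1.pid)"
gluster volume start second force
gluster volume start third force

# Steps 8-9: kill the brick again (now multiplexed into the t4 process),
# disable multiplexing, force start -> glusterd crashes
kill -9 "$(cat /var/lib/glusterd/vols/second/run/$H1-bricks-brick1-t4.pid)"
gluster volume set all cluster.brick-multiplex disable
gluster volume start second force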
Created attachment 1272917: core
(gdb) bt full
#0  0x00007f51105e64e5 in __gf_free (free_ptr=0x7f50ec000cc0) at mem-pool.c:314
        ptr = 0x7f50ec000c80
        mem_acct = 0x7f5112bc9370
        header = 0x7f50ec000c80
        __FUNCTION__ = "__gf_free"
#1  0x00007f51105cd7d2 in gf_timer_call_cancel (ctx=<optimized out>, event=0x7f50ec000cc0) at timer.c:116
        reg = 0x7f5112bb4770
        fired = _gf_false
        __FUNCTION__ = "gf_timer_call_cancel"
#2  0x00007f510510fac3 in glusterd_volume_start_glusterfs (volinfo=volinfo@entry=0x7f50f0022860,
    brickinfo=brickinfo@entry=0x7f50f008b590, wait=wait@entry=_gf_true) at glusterd-utils.c:1867
        ret = -1
        this = 0x7f5112bbd710
        priv = 0x7f51108f0050
        pidfile = "/var/lib/glusterd/vols/second/run/dhcp37-135.lab.eng.blr.redhat.com-bricks-brick1-t4.pid", '\000' <repeats 4008 times>
        volfile = '\000' <repeats 4095 times>
        runner = {argv = 0x0, argvlen = 0, runerr = 0, chpid = 0, chfd = {0, 0, 0}, chio = {0x0, 0x0, 0x0}}
        exp_path = '\000' <repeats 4095 times>
        logfile = '\000' <repeats 4095 times>
        port = 0
        rdma_port = 0
        bind_address = 0x0
        socketpath = "/var/run/gluster/d4e3b659bd9c6b16470fdb7898557e96.socket", '\000' <repeats 4039 times>
        glusterd_uuid = '\000' <repeats 1023 times>
        valgrind_logfile = '\000' <repeats 4095 times>
        rdma_brick_path = '\000' <repeats 4095 times>
        rpc = 0x7f50fc1364f0
        conn = 0x7f50fc136520
        __FUNCTION__ = "glusterd_volume_start_glusterfs"
#3  0x00007f51051124cf in glusterd_brick_start (volinfo=volinfo@entry=0x7f50f0022860,
    brickinfo=brickinfo@entry=0x7f50f008b590, wait=wait@entry=_gf_true) at glusterd-utils.c:5391
        ret = <optimized out>
        this = 0x7f5112bbd710
        conf = <optimized out>
        pid = 17340
        pidfile = "/var/lib/glusterd/vols/second/run/dhcp37-135.lab.eng.blr.redhat.com-bricks-brick1-t4.pid", '\000' <repeats 4007 times>
        fp = 0x7f50f00fe3e0
        socketpath = '\000' <repeats 4095 times>
        other_vol = <optimized out>
        __FUNCTION__ = "glusterd_brick_start"
#4  0x00007f5105186136 in glusterd_start_volume (volinfo=0x7f50f0022860, flags=<optimized out>,
    wait=wait@entry=_gf_true) at glusterd-volume-ops.c:2508
        ret = <optimized out>
        brickinfo = 0x7f50f008b590
        this = 0x7f5112bbd710
        verincrement = GLUSTERD_VOLINFO_VER_AC_NONE
        __FUNCTION__ = "glusterd_start_volume"
#5  0x00007f510518662e in glusterd_op_start_volume (dict=dict@entry=0x7f50f00fef20,
    op_errstr=op_errstr@entry=0x7f50f859d730) at glusterd-volume-ops.c:2636
        ret = <optimized out>
        brick_count = <optimized out>
        brick_mount_dir = 0x0
        key = '\000' <repeats 4095 times>
        volname = 0x7f50f00d6840 "second"
        str = 0x0
        option = _gf_false
        flags = 1
        volinfo = 0x7f50f0022860
        brickinfo = <optimized out>
        this = 0x7f5112bbd710
        conf = 0x7f51108f0050
        svc = 0x0
        __FUNCTION__ = "glusterd_op_start_volume"
#6  0x00007f51051c2ee3 in gd_mgmt_v3_commit_fn (op=op@entry=GD_OP_START_VOLUME, dict=dict@entry=0x7f50f00fef20,
    op_errstr=op_errstr@entry=0x7f50f859d730, op_errno=op_errno@entry=0x7f50f859d724,
    rsp_dict=rsp_dict@entry=0x7f50f000c2c0) at glusterd-mgmt.c:313
        ret = -1
        this = 0x7f5112bbd710
        __FUNCTION__ = "gd_mgmt_v3_commit_fn"
#7  0x00007f51051c5f45 in glusterd_mgmt_v3_commit (op=op@entry=GD_OP_START_VOLUME, op_ctx=op_ctx@entry=0x7f50f0072310,
    req_dict=0x7f50f00fef20, op_errstr=op_errstr@entry=0x7f50f859d730, op_errno=op_errno@entry=0x7f50f859d724,
    txn_generation=txn_generation@entry=4) at glusterd-mgmt.c:1448
        ret = -1
        peer_cnt = 0
        rsp_dict = 0x7f50f000c2c0
        peerinfo = 0x0
        args = {op_ret = 0, op_errno = 0,
          iatt1 = {ia_ino = 0, ia_gfid = '\000' <repeats 15 times>, ia_dev = 0, ia_type = IA_INVAL,
            ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000',
              owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
              group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
              other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}},
            ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 0,
            ia_blocks = 0, ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 0, ia_mtime_nsec = 0,
            ia_ctime = 0, ia_ctime_nsec = 0},
          iatt2 = {ia_ino = 0, ia_gfid = '\000' <repeats 15 times>, ia_dev = 0, ia_type = IA_INVAL,
            ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000',
              owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
              group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
              other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}},
            ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 0,
            ia_blocks = 0, ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 0, ia_mtime_nsec = 0,
            ia_ctime = 0, ia_ctime_nsec = 0},
          xattr = 0x0,
          statvfs_buf = {f_bsize = 0, f_frsize = 0, f_blocks = 0, f_bfree = 0, f_bavail = 0,
            f_files = 0, f_ffree = 0, f_favail = 0, f_fsid = 0, f_flag = 0, f_namemax = 0,
            __f_spare = {0, 0, 0, 0, 0, 0}},
          vector = 0x0, count = 0, iobref = 0x0, buffer = 0x0, xdata = 0x0,
          flock = {l_type = 0, l_whence = 0, l_start = 0, l_len = 0, l_pid = 0,
            l_owner = {len = 0, data = '\000' <repeats 1023 times>}},
          lease = {cmd = 0, lease_type = NONE, lease_id = '\000' <repeats 15 times>, lease_flags = 0},
          uuid = '\000' <repeats 15 times>, errstr = 0x0, dict = 0x0,
          lock_dict = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0,
              __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
            __size = '\000' <repeats 39 times>, __align = 0},
          barrier = {guard = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
                __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
              __size = '\000' <repeats 39 times>, __align = 0},
            cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0,
                __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0},
              __size = '\000' <repeats 47 times>, __align = 0},
            waitq = {next = 0x0, prev = 0x0}, count = 0},
          task = 0x0,
          mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0,
              __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
            __size = '\000' <repeats 39 times>, __align = 0},
          cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0,
              __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0},
            __size = '\000' <repeats 47 times>, __align = 0},
          done = 0,
          entries = {{list = {next = 0x0, prev = 0x0}, {next = 0x0, prev = 0x0}},
            d_ino = 0, d_off = 0, d_len = 0, d_type = 0,
            d_stat = {ia_ino = 0, ia_gfid = '\000' <repeats 15 times>, ia_dev = 0,
              ia_type = IA_INVAL, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000',
                owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
                group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'},
                other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}},
              ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 0,
              ia_blocks = 0, ia_atime = 0, ia_atime_nsec = 0, ia_mtime = 0, ia_mtime_nsec = 0,
              ia_ctime = 0, ia_ctime_nsec = 0},
            dict = 0x0, inode = 0x0, d_name = 0x7f50f859d230 ""},
          offset = 0,
          locklist = {list = {next = 0x0, prev = 0x0},
            flock = {l_type = 0, l_whence = 0, l_start = 0, l_len = 0, l_pid = 0,
              l_owner = {len = 0, data = '\000' <repeats 1023 times>}},
            client_uid = 0x0, lk_flags = 0}}
        peer_uuid = '\000' <repeats 15 times>
        this = 0x7f5112bbd710
        conf = 0x7f51108f0050
        __FUNCTION__ = "glusterd_mgmt_v3_commit"
#8  0x00007f51051c7b9e in glusterd_mgmt_v3_initiate_all_phases (req=req@entry=0x7f50f80018b0,
    op=op@entry=GD_OP_START_VOLUME, dict=0x7f50f0072310) at glusterd-mgmt.c:2043
        ret = 0
        op_ret = -1
        req_dict = 0x7f50f00fef20
        tmp_dict = 0x7f50f00f9f10
        conf = <optimized out>
        op_errstr = 0x0
        this = 0x7f5112bbd710
        is_acquired = _gf_true
        originator_uuid = <optimized out>
        txn_generation = 4
        op_errno = 0
        __FUNCTION__ = "glusterd_mgmt_v3_initiate_all_phases"
#9  0x00007f510517ec52 in __glusterd_handle_cli_start_volume (req=req@entry=0x7f50f80018b0)
    at glusterd-volume-ops.c:549
        ret = 0
        cli_req = {dict = {dict_len = 82, dict_val = 0x7f50f00fe980 ""}}
        volname = 0x7f50f00d6840 "second"
        dict = 0x7f50f0072310
        errstr = '\000' <repeats 2047 times>
        this = 0x7f5112bbd710
        conf = 0x7f51108f0050
        __FUNCTION__ = "__glusterd_handle_cli_start_volume"
#10 0x00007f51050e4ca0 in glusterd_big_locked_handler (req=0x7f50f80018b0,
    actor_fn=0x7f510517e990 <__glusterd_handle_cli_start_volume>) at glusterd-handler.c:81
        priv = 0x7f51108f0050
        ret = -1
#11 0x00007f51105f6362 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
        task = 0x7f50f818a1f0
#12 0x00007f510ecb3cf0 in ?? () from /lib64/libc.so.6
No symbol table info available.
#13 0x0000000000000000 in ?? ()
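For reference, a trace like the one above can be regenerated from the attached core along these lines (a sketch: the core path is a placeholder, and matching glusterfs debuginfo packages are assumed to be available in the enabled repos):

# Install matching debug symbols, then dump the trace non-interactively
# (avoids the interactive gdb pager prompts)
debuginfo-install -y glusterfs
gdb -batch -ex 'set pagination off' -ex 'bt full' \
    /usr/sbin/glusterd /path/to/core    # core path is a placeholder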
Symptoms look similar to those in BZ 1443123.
Upstream patch: https://review.gluster.org/#/c/17088/
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/104115/
Verified in 3.8.4.24: glusterd did not generate any cores, and volume force start worked fine after disabling brick multiplexing. Hence, marking the bug as verified.
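A minimal sketch of the verification sequence on 3.8.4.24, assuming the reproduction steps above were rerun first:

# Disable multiplexing and force start, which previously killed glusterd
gluster volume set all cluster.brick-multiplex disable
gluster volume start second force    # succeeds instead of crashing glusterd
gluster volume status second         # all bricks report Online "Y"
pidof glusterd                       # daemon still running
# No new glusterd core should appear (location depends on core_pattern)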
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774