Description of problem:
====================
I had a 6-node setup with one EC volume and one distributed-replicate volume. I mounted both volumes and was running I/O on them. While the I/O was in progress, I enabled brick multiplexing and created a new 1x3 volume. After that, I saw that glusterd had dumped core on all the nodes.

(gdb) bt
#0  0x00007fd3704041d7 in raise () from /lib64/libc.so.6
#1  0x00007fd3704058c8 in abort () from /lib64/libc.so.6
#2  0x00007fd370443f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fd3704de047 in __fortify_fail () from /lib64/libc.so.6
#4  0x00007fd3704dc200 in __chk_fail () from /lib64/libc.so.6
#5  0x00007fd3704db91b in __vsnprintf_chk () from /lib64/libc.so.6
#6  0x00007fd3704db838 in __snprintf_chk () from /lib64/libc.so.6
#7  0x00007fd3668559b4 in snprintf (__fmt=0x7fd366949ad8 "%s/run/%s-%s.pid", __n=4096, __s=0x7fd35858f930 "") at /usr/include/bits/stdio2.h:64
#8  glusterd_bricks_select_stop_volume (dict=dict@entry=0x7fd3500dcad0, op_errstr=op_errstr@entry=0x7fd358591e68, selected=selected@entry=0x7fd366b9f458 <opinfo+88>) at glusterd-op-sm.c:6115
#9  0x00007fd366862f76 in glusterd_op_bricks_select (op=<optimized out>, dict=0x7fd3500dcad0, op_errstr=op_errstr@entry=0x7fd358591e68, selected=selected@entry=0x7fd366b9f458 <opinfo+88>, rsp_dict=rsp_dict@entry=0x0) at glusterd-op-sm.c:7503
#10 0x00007fd366890890 in glusterd_brick_op (frame=<optimized out>, this=0x7fd373f93710, data=0x7fd350101630) at glusterd-rpc-ops.c:2289
#11 0x00007fd366866253 in glusterd_op_ac_send_brick_op (event=0x7fd3500c30b0, ctx=<optimized out>) at glusterd-op-sm.c:7406
#12 0x00007fd366864f3f in glusterd_op_sm () at glusterd-op-sm.c:7990
#13 0x00007fd366841862 in __glusterd_handle_commit_op (req=req@entry=0x7fd3580018b0) at glusterd-handler.c:1165
#14 0x00007fd366847ca0 in glusterd_big_locked_handler (req=0x7fd3580018b0, actor_fn=0x7fd366841740 <__glusterd_handle_commit_op>) at glusterd-handler.c:81
#15 0x00007fd371d58362 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#16 0x00007fd370415cf0 in ?? () from /lib64/libc.so.6
#17 0x0000000000000000 in ?? ()

Version-Release number of selected component (if applicable):
========
3.8.4-20

Note: I also did an ifdown later, but the core had already been dumped before that.
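For context on the abort signature above: frames #4-#7 are glibc's _FORTIFY_SOURCE machinery (__snprintf_chk -> __chk_fail) firing because the length passed to snprintf() (__n=4096) is larger than the compiler-known size of the destination buffer. Below is a minimal standalone sketch of the same failure mode; the buffer size and argument values are illustrative assumptions, not taken from the glusterd source.

/* fortify_demo.c - sketch of the abort seen in the backtrace.
 * Build: gcc -O2 -D_FORTIFY_SOURCE=2 fortify_demo.c -o fortify_demo
 * Assumption (illustrative): the destination buffer is smaller than
 * the 4096-byte length claimed to snprintf(). */
#include <stdio.h>

int main(void)
{
        char pidfile[256];  /* real object size known to the compiler */

        /* Claiming 4096 bytes for a 256-byte buffer makes glibc's
         * __snprintf_chk() call __chk_fail(), which aborts with
         * SIGABRT - the same raise/abort/__chk_fail chain seen in
         * frames #0-#6 of the core. */
        snprintf(pidfile, 4096, "%s/run/%s-%s.pid",
                 "/var/lib/glusterd/vols/cross3", "10.70.35.45",
                 "rhs-brick3-distrep");
        return 0;
}

Running this prints "*** buffer overflow detected ***" and dies with signal 6 (Aborted), matching "Program terminated with signal 6" in the gdb session below.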
Also note that after enabling brick multiplexing on the above test bed, the setting does not show up in glusterd.info:

[root@dhcp35-122 glusterd]# cat glusterd.info
UUID=425e7d60-f0e5-4a45-8266-cce0443584b1
operating-version=31001
[New LWP 14960]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 6, Aborted.
#0  0x00007fd3704041d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 device-mapper-event-libs-1.02.135-1.el7_3.3.x86_64 device-mapper-libs-1.02.135-1.el7_3.3.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-33.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libsepol-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 lvm2-libs-2.02.166-1.el7_3.3.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.7.x86_64 userspace-rcu-0.7.9-2.el7rhgs.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt t a a
No symbol "t" in current context.
(gdb) t a a bt

Thread 8 (Thread 0x7fd368ba4700 (LWP 14960)):
#0  0x00007fd370b89101 in sigwait () from /lib64/libpthread.so.0
#1  0x00007fd372218ebb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2055
#2  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fd3622a6700 (LWP 15165)):
#0  0x00007fd3704c6d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fd371d7bd30 in event_dispatch_epoll_worker (data=0x7fd373fd8490) at event-epoll.c:665
#2  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fd3693a5700 (LWP 14959)):
#0  0x00007fd370b88bdd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fd371d2f306 in gf_timer_proc (data=0x7fd373f8a770) at timer.c:176
#2  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fd3683a3700 (LWP 14961)):
#0  0x00007fd37048d66d in nanosleep () from /lib64/libc.so.6
#1  0x00007fd37048d504 in sleep () from /lib64/libc.so.6
#2  0x00007fd371d4882d in pool_sweeper (arg=<optimized out>) at mem-pool.c:464
#3  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fd362aa7700 (LWP 15164)):
#0  0x00007fd370b856d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd3668fb783 in hooks_worker (args=<optimized out>) at glusterd-hooks.c:531
#2  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fd367ba2700 (LWP 14962)):
#0  0x00007fd370b85a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd371d5a898 in syncenv_task (proc=proc@entry=0x7fd373f8afc0) at syncop.c:603
#2  0x00007fd371d5b6e0 in syncenv_processor (thdata=0x7fd373f8afc0) at syncop.c:695
#3  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fd3721fb780 (LWP 14958)):
#0  0x00007fd370b82ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fd371d7c2e0 in event_dispatch_epoll (event_pool=0x7fd373f82f00) at event-epoll.c:759
#2  0x00007fd372215d95 in main (argc=5, argv=<optimized out>) at glusterfsd.c:2464

Thread 1 (Thread 0x7fd3673a1700 (LWP 14963)):
#0  0x00007fd3704041d7 in raise () from /lib64/libc.so.6
#1  0x00007fd3704058c8 in abort () from /lib64/libc.so.6
#2  0x00007fd370443f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fd3704de047 in __fortify_fail () from /lib64/libc.so.6
#4  0x00007fd3704dc200 in __chk_fail () from /lib64/libc.so.6
#5  0x00007fd3704db91b in __vsnprintf_chk () from /lib64/libc.so.6
#6  0x00007fd3704db838 in __snprintf_chk () from /lib64/libc.so.6
#7  0x00007fd3668559b4 in snprintf (__fmt=0x7fd366949ad8 "%s/run/%s-%s.pid", __n=4096, __s=0x7fd35858f930 "") at /usr/include/bits/stdio2.h:64
#8  glusterd_bricks_select_stop_volume (dict=dict@entry=0x7fd3500dcad0, op_errstr=op_errstr@entry=0x7fd358591e68, selected=selected@entry=0x7fd366b9f458 <opinfo+88>) at glusterd-op-sm.c:6115
#9  0x00007fd366862f76 in glusterd_op_bricks_select (op=<optimized out>, dict=0x7fd3500dcad0, op_errstr=op_errstr@entry=0x7fd358591e68, selected=selected@entry=0x7fd366b9f458 <opinfo+88>, rsp_dict=rsp_dict@entry=0x0) at glusterd-op-sm.c:7503
#10 0x00007fd366890890 in glusterd_brick_op (frame=<optimized out>, this=0x7fd373f93710, data=0x7fd350101630) at glusterd-rpc-ops.c:2289
#11 0x00007fd366866253 in glusterd_op_ac_send_brick_op (event=0x7fd3500c30b0, ctx=<optimized out>) at glusterd-op-sm.c:7406
#12 0x00007fd366864f3f in glusterd_op_sm () at glusterd-op-sm.c:7990
#13 0x00007fd366841862 in __glusterd_handle_commit_op (req=req@entry=0x7fd3580018b0) at glusterd-handler.c:1165
#14 0x00007fd366847ca0 in glusterd_big_locked_handler (req=0x7fd3580018b0, actor_fn=0x7fd366841740 <__glusterd_handle_commit_op>) at glusterd-handler.c:81
#15 0x00007fd371d58362 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#16 0x00007fd370415cf0 in ?? () from /lib64/libc.so.6
#17 0x0000000000000000 in ?? ()
Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1437940/servers/
(Note: sosreport was taking a long time to dump, so I aborted it.)
The logs contain the core, /var/log, and the /var/lib/glusterd info.
[root@dhcp35-130 ~]# gluster v status
Status of volume: cross3
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick3/distrep                    49154     0          Y       10968
Brick 10.70.35.130:/rhs/brick3/distrep                   49154     0          Y       15923
Brick 10.70.35.112:/rhs/brick3/distrep                   49154     0          Y       9396
Self-heal Daemon on localhost                            N/A       N/A        Y       16088
Self-heal Daemon on 10.70.35.23                          N/A       N/A        Y       17839
Self-heal Daemon on dhcp35-45.lab.eng.blr.redhat.com     N/A       N/A        Y       11190
Self-heal Daemon on 10.70.35.112                         N/A       N/A        Y       9545
Self-heal Daemon on 10.70.35.122                         N/A       N/A        Y       6508
Self-heal Daemon on 10.70.35.138                         N/A       N/A        Y       14368

Task Status of Volume cross3
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: distrep
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick1/distrep                    49154     0          Y       10968
Brick 10.70.35.130:/rhs/brick1/distrep                   49154     0          Y       15923
Brick 10.70.35.112:/rhs/brick1/distrep                   49154     0          Y       9396
Brick 10.70.35.138:/rhs/brick1/distrep                   49154     0          Y       14193
Self-heal Daemon on localhost                            N/A       N/A        Y       16088
Self-heal Daemon on 10.70.35.122                         N/A       N/A        Y       6508
Self-heal Daemon on dhcp35-45.lab.eng.blr.redhat.com     N/A       N/A        Y       11190
Self-heal Daemon on 10.70.35.23                          N/A       N/A        Y       17839
Self-heal Daemon on 10.70.35.112                         N/A       N/A        Y       9545
Self-heal Daemon on 10.70.35.138                         N/A       N/A        Y       14368

Task Status of Volume distrep
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: zen
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp35-45.lab.eng.blr.redhat.com:/rhs/brick1/zen   49152     0          Y       10881
Brick dhcp35-130.lab.eng.blr.redhat.com:/rhs/brick1/zen  49152     0          Y       15848
Brick dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick1/zen  49152     0          Y       6312
Brick dhcp35-23.lab.eng.blr.redhat.com:/rhs/brick1/zen   49152     0          Y       17642
Brick dhcp35-112.lab.eng.blr.redhat.com:/rhs/brick1/zen  49152     0          Y       9321
Brick dhcp35-138.lab.eng.blr.redhat.com:/rhs/brick1/zen  49152     0          Y       14116
Brick dhcp35-45.lab.eng.blr.redhat.com:/rhs/brick2/zen   49153     0          Y       10900
Brick dhcp35-130.lab.eng.blr.redhat.com:/rhs/brick2/zen  49153     0          Y       15867
Brick dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick2/zen  49153     0          Y       6331
Brick dhcp35-23.lab.eng.blr.redhat.com:/rhs/brick2/zen   49153     0          Y       17661
Brick dhcp35-112.lab.eng.blr.redhat.com:/rhs/brick2/zen  49153     0          Y       9340
Brick dhcp35-138.lab.eng.blr.redhat.com:/rhs/brick2/zen  49153     0          Y       14136
Self-heal Daemon on localhost                            N/A       N/A        Y       16088
Self-heal Daemon on 10.70.35.112                         N/A       N/A        Y       9545
Self-heal Daemon on dhcp35-45.lab.eng.blr.redhat.com     N/A       N/A        Y       11190
Self-heal Daemon on 10.70.35.23                          N/A       N/A        Y       17839
Self-heal Daemon on 10.70.35.122                         N/A       N/A        Y       6508
Self-heal Daemon on 10.70.35.138                         N/A       N/A        Y       14368

Task Status of Volume zen
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-130 ~]# gluster v info

Volume Name: cross3
Type: Replicate
Volume ID: 848123a0-6f33-4046-a48e-db2e5f3b84a6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick3/distrep
Brick2: 10.70.35.130:/rhs/brick3/distrep
Brick3: 10.70.35.112:/rhs/brick3/distrep
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: on

Volume Name: distrep
Type: Distributed-Replicate
Volume ID: f9ebab34-d007-4ae7-a8a9-1fc6c4d6f61f
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick1/distrep
Brick2: 10.70.35.130:/rhs/brick1/distrep
Brick3: 10.70.35.112:/rhs/brick1/distrep
Brick4: 10.70.35.138:/rhs/brick1/distrep
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: on

Volume Name: zen
Type: Distributed-Disperse
Volume ID: 5098bb2d-8292-4b3d-b3c7-bf690709d1af
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: dhcp35-45.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick2: dhcp35-130.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick3: dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick4: dhcp35-23.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick5: dhcp35-112.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick6: dhcp35-138.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick7: dhcp35-45.lab.eng.blr.redhat.com:/rhs/brick2/zen
Brick8: dhcp35-130.lab.eng.blr.redhat.com:/rhs/brick2/zen
Brick9: dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick2/zen
Brick10: dhcp35-23.lab.eng.blr.redhat.com:/rhs/brick2/zen
Brick11: dhcp35-112.lab.eng.blr.redhat.com:/rhs/brick2/zen
Brick12: dhcp35-138.lab.eng.blr.redhat.com:/rhs/brick2/zen
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: on
[root@dhcp35-130 ~]#
This was already fixed upstream through BZ 1420606, but we missed backporting it downstream.

Upstream patch: https://review.gluster.org/16560
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/102294
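For reference, the usual fix pattern for this class of fortify abort is to keep the snprintf() length argument tied to the destination buffer's actual size (e.g. sizeof at the call site) rather than a hard-coded constant. Below is a hedged sketch of that pattern; build_pidfile_path() is a hypothetical helper for illustration, not the glusterd API or the content of the linked patches.

#include <stdio.h>
#include <limits.h>

/* Hypothetical helper: the caller passes the destination's true size,
 * so the fortified snprintf() never sees a claimed length larger than
 * the object it writes into. */
static void build_pidfile_path(char *buf, size_t len, const char *rundir,
                               const char *host, const char *brick)
{
        snprintf(buf, len, "%s/run/%s-%s.pid", rundir, host, brick);
}

int main(void)
{
        char pidfile[PATH_MAX];

        /* sizeof(pidfile) cannot disagree with the compiler-known
         * object size, so over-long paths are truncated instead of
         * aborting the daemon via __chk_fail(). */
        build_pidfile_path(pidfile, sizeof(pidfile),
                           "/var/lib/glusterd/vols/cross3",
                           "10.70.35.45", "rhs-brick3-distrep");
        return 0;
}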
Build Version: 3.8.4-21

Created an EC volume and a distributed-replicate volume. While the I/O was running, enabled the brick-multiplexing option and created a new 1x3 volume. No cores were generated on any of the nodes after enabling brick multiplexing. Hence marking the bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774