Description of problem:
====================
I had a 6-node setup with one EC volume and one distributed-replicate volume. I mounted both volumes and was running I/O on them. While the I/O was in progress, I enabled brick multiplexing and created a new 1x3 volume. After that, I saw that glusterd had dumped core on all the nodes.

(gdb) bt
#0  0x00007fd3704041d7 in raise () from /lib64/libc.so.6
#1  0x00007fd3704058c8 in abort () from /lib64/libc.so.6
#2  0x00007fd370443f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fd3704de047 in __fortify_fail () from /lib64/libc.so.6
#4  0x00007fd3704dc200 in __chk_fail () from /lib64/libc.so.6
#5  0x00007fd3704db91b in __vsnprintf_chk () from /lib64/libc.so.6
#6  0x00007fd3704db838 in __snprintf_chk () from /lib64/libc.so.6
#7  0x00007fd3668559b4 in snprintf (__fmt=0x7fd366949ad8 "%s/run/%s-%s.pid", __n=4096, __s=0x7fd35858f930 "") at /usr/include/bits/stdio2.h:64
#8  glusterd_bricks_select_stop_volume (dict=dict@entry=0x7fd3500dcad0, op_errstr=op_errstr@entry=0x7fd358591e68, selected=selected@entry=0x7fd366b9f458 <opinfo+88>) at glusterd-op-sm.c:6115
#9  0x00007fd366862f76 in glusterd_op_bricks_select (op=<optimized out>, dict=0x7fd3500dcad0, op_errstr=op_errstr@entry=0x7fd358591e68, selected=selected@entry=0x7fd366b9f458 <opinfo+88>, rsp_dict=rsp_dict@entry=0x0) at glusterd-op-sm.c:7503
#10 0x00007fd366890890 in glusterd_brick_op (frame=<optimized out>, this=0x7fd373f93710, data=0x7fd350101630) at glusterd-rpc-ops.c:2289
#11 0x00007fd366866253 in glusterd_op_ac_send_brick_op (event=0x7fd3500c30b0, ctx=<optimized out>) at glusterd-op-sm.c:7406
#12 0x00007fd366864f3f in glusterd_op_sm () at glusterd-op-sm.c:7990
#13 0x00007fd366841862 in __glusterd_handle_commit_op (req=req@entry=0x7fd3580018b0) at glusterd-handler.c:1165
#14 0x00007fd366847ca0 in glusterd_big_locked_handler (req=0x7fd3580018b0, actor_fn=0x7fd366841740 <__glusterd_handle_commit_op>) at glusterd-handler.c:81
#15 0x00007fd371d58362 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#16 0x00007fd370415cf0 in ?? () from /lib64/libc.so.6
#17 0x0000000000000000 in ?? ()

Version-Release number of selected component (if applicable):
========
3.8.4-20

Note: I also did an ifdown later, but the core had already been dumped before that.
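For context on the abort signature above: frames #4-#7 are glibc's _FORTIFY_SOURCE machinery (__snprintf_chk -> __chk_fail) firing because the length passed to snprintf() (__n=4096) is larger than the compiler-known size of the destination buffer. Below is a minimal standalone sketch of the same failure mode; the buffer size and argument values are illustrative assumptions, not taken from the glusterd source.

/* fortify_demo.c - sketch of the abort seen in the backtrace.
 * Build: gcc -O2 -D_FORTIFY_SOURCE=2 fortify_demo.c -o fortify_demo
 * Assumption (illustrative): the destination buffer is smaller than
 * the 4096-byte length claimed to snprintf(). */
#include <stdio.h>

int main(void)
{
        char pidfile[256];  /* real object size known to the compiler */

        /* Claiming 4096 bytes for a 256-byte buffer makes glibc's
         * __snprintf_chk() call __chk_fail(), which aborts with
         * SIGABRT - the same raise/abort/__chk_fail chain seen in
         * frames #0-#6 of the core. */
        snprintf(pidfile, 4096, "%s/run/%s-%s.pid",
                 "/var/lib/glusterd/vols/cross3", "10.70.35.45",
                 "rhs-brick3-distrep");
        return 0;
}

Running this prints "*** buffer overflow detected ***" and dies with signal 6 (Aborted), matching "Program terminated with signal 6" in the gdb session below.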
Also note that after enabling brick multiplexing on the above test bed, the setting does not show up in glusterd.info:

[root@dhcp35-122 glusterd]# cat glusterd.info
UUID=425e7d60-f0e5-4a45-8266-cce0443584b1
operating-version=31001
[New LWP 14960]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 6, Aborted.
#0  0x00007fd3704041d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 device-mapper-event-libs-1.02.135-1.el7_3.3.x86_64 device-mapper-libs-1.02.135-1.el7_3.3.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-33.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libsepol-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 lvm2-libs-2.02.166-1.el7_3.3.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.7.x86_64 userspace-rcu-0.7.9-2.el7rhgs.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt t a a
No symbol "t" in current context.
(gdb) t a a bt

Thread 8 (Thread 0x7fd368ba4700 (LWP 14960)):
#0  0x00007fd370b89101 in sigwait () from /lib64/libpthread.so.0
#1  0x00007fd372218ebb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2055
#2  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fd3622a6700 (LWP 15165)):
#0  0x00007fd3704c6d13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fd371d7bd30 in event_dispatch_epoll_worker (data=0x7fd373fd8490) at event-epoll.c:665
#2  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fd3693a5700 (LWP 14959)):
#0  0x00007fd370b88bdd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fd371d2f306 in gf_timer_proc (data=0x7fd373f8a770) at timer.c:176
#2  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fd3683a3700 (LWP 14961)):
#0  0x00007fd37048d66d in nanosleep () from /lib64/libc.so.6
#1  0x00007fd37048d504 in sleep () from /lib64/libc.so.6
#2  0x00007fd371d4882d in pool_sweeper (arg=<optimized out>) at mem-pool.c:464
#3  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fd362aa7700 (LWP 15164)):
#0  0x00007fd370b856d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd3668fb783 in hooks_worker (args=<optimized out>) at glusterd-hooks.c:531
#2  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fd367ba2700 (LWP 14962)):
#0  0x00007fd370b85a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd371d5a898 in syncenv_task (proc=proc@entry=0x7fd373f8afc0) at syncop.c:603
#2  0x00007fd371d5b6e0 in syncenv_processor (thdata=0x7fd373f8afc0) at syncop.c:695
#3  0x00007fd370b81dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fd3704c673d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fd3721fb780 (LWP 14958)):
#0  0x00007fd370b82ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fd371d7c2e0 in event_dispatch_epoll (event_pool=0x7fd373f82f00) at event-epoll.c:759
#2  0x00007fd372215d95 in main (argc=5, argv=<optimized out>) at glusterfsd.c:2464

Thread 1 (Thread 0x7fd3673a1700 (LWP 14963)):
#0  0x00007fd3704041d7 in raise () from /lib64/libc.so.6
#1  0x00007fd3704058c8 in abort () from /lib64/libc.so.6
#2  0x00007fd370443f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fd3704de047 in __fortify_fail () from /lib64/libc.so.6
#4  0x00007fd3704dc200 in __chk_fail () from /lib64/libc.so.6
#5  0x00007fd3704db91b in __vsnprintf_chk () from /lib64/libc.so.6
#6  0x00007fd3704db838 in __snprintf_chk () from /lib64/libc.so.6
#7  0x00007fd3668559b4 in snprintf (__fmt=0x7fd366949ad8 "%s/run/%s-%s.pid", __n=4096, __s=0x7fd35858f930 "") at /usr/include/bits/stdio2.h:64
#8  glusterd_bricks_select_stop_volume (dict=dict@entry=0x7fd3500dcad0, op_errstr=op_errstr@entry=0x7fd358591e68, selected=selected@entry=0x7fd366b9f458 <opinfo+88>) at glusterd-op-sm.c:6115
#9  0x00007fd366862f76 in glusterd_op_bricks_select (op=<optimized out>, dict=0x7fd3500dcad0, op_errstr=op_errstr@entry=0x7fd358591e68, selected=selected@entry=0x7fd366b9f458 <opinfo+88>, rsp_dict=rsp_dict@entry=0x0) at glusterd-op-sm.c:7503
#10 0x00007fd366890890 in glusterd_brick_op (frame=<optimized out>, this=0x7fd373f93710, data=0x7fd350101630) at glusterd-rpc-ops.c:2289
#11 0x00007fd366866253 in glusterd_op_ac_send_brick_op (event=0x7fd3500c30b0, ctx=<optimized out>) at glusterd-op-sm.c:7406
#12 0x00007fd366864f3f in glusterd_op_sm () at glusterd-op-sm.c:7990
#13 0x00007fd366841862 in __glusterd_handle_commit_op (req=req@entry=0x7fd3580018b0) at glusterd-handler.c:1165
#14 0x00007fd366847ca0 in glusterd_big_locked_handler (req=0x7fd3580018b0, actor_fn=0x7fd366841740 <__glusterd_handle_commit_op>) at glusterd-handler.c:81
#15 0x00007fd371d58362 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#16 0x00007fd370415cf0 in ?? () from /lib64/libc.so.6
#17 0x0000000000000000 in ?? ()
Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1437940/servers/
(Note: sosreport was taking a long time to dump, so I aborted it.)
The logs contain the core, /var/log, and the /var/lib/glusterd info.
[root@dhcp35-130 ~]# gluster v status
Status of volume: cross3
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick3/distrep                    49154     0          Y       10968
Brick 10.70.35.130:/rhs/brick3/distrep                   49154     0          Y       15923
Brick 10.70.35.112:/rhs/brick3/distrep                   49154     0          Y       9396
Self-heal Daemon on localhost                            N/A       N/A        Y       16088
Self-heal Daemon on 10.70.35.23                          N/A       N/A        Y       17839
Self-heal Daemon on dhcp35-45.lab.eng.blr.redhat.com     N/A       N/A        Y       11190
Self-heal Daemon on 10.70.35.112                         N/A       N/A        Y       9545
Self-heal Daemon on 10.70.35.122                         N/A       N/A        Y       6508
Self-heal Daemon on 10.70.35.138                         N/A       N/A        Y       14368

Task Status of Volume cross3
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: distrep
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick1/distrep                    49154     0          Y       10968
Brick 10.70.35.130:/rhs/brick1/distrep                   49154     0          Y       15923
Brick 10.70.35.112:/rhs/brick1/distrep                   49154     0          Y       9396
Brick 10.70.35.138:/rhs/brick1/distrep                   49154     0          Y       14193
Self-heal Daemon on localhost                            N/A       N/A        Y       16088
Self-heal Daemon on 10.70.35.122                         N/A       N/A        Y       6508
Self-heal Daemon on dhcp35-45.lab.eng.blr.redhat.com     N/A       N/A        Y       11190
Self-heal Daemon on 10.70.35.23                          N/A       N/A        Y       17839
Self-heal Daemon on 10.70.35.112                         N/A       N/A        Y       9545
Self-heal Daemon on 10.70.35.138                         N/A       N/A        Y       14368

Task Status of Volume distrep
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: zen
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp35-45.lab.eng.blr.redhat.com:/rhs/brick1/zen   49152     0          Y       10881
Brick dhcp35-130.lab.eng.blr.redhat.com:/rhs/brick1/zen  49152     0          Y       15848
Brick dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick1/zen  49152     0          Y       6312
Brick dhcp35-23.lab.eng.blr.redhat.com:/rhs/brick1/zen   49152     0          Y       17642
Brick dhcp35-112.lab.eng.blr.redhat.com:/rhs/brick1/zen  49152     0          Y       9321
Brick dhcp35-138.lab.eng.blr.redhat.com:/rhs/brick1/zen  49152     0          Y       14116
Brick dhcp35-45.lab.eng.blr.redhat.com:/rhs/brick2/zen   49153     0          Y       10900
Brick dhcp35-130.lab.eng.blr.redhat.com:/rhs/brick2/zen  49153     0          Y       15867
Brick dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick2/zen  49153     0          Y       6331
Brick dhcp35-23.lab.eng.blr.redhat.com:/rhs/brick2/zen   49153     0          Y       17661
Brick dhcp35-112.lab.eng.blr.redhat.com:/rhs/brick2/zen  49153     0          Y       9340
Brick dhcp35-138.lab.eng.blr.redhat.com:/rhs/brick2/zen  49153     0          Y       14136
Self-heal Daemon on localhost                            N/A       N/A        Y       16088
Self-heal Daemon on 10.70.35.112                         N/A       N/A        Y       9545
Self-heal Daemon on dhcp35-45.lab.eng.blr.redhat.com     N/A       N/A        Y       11190
Self-heal Daemon on 10.70.35.23                          N/A       N/A        Y       17839
Self-heal Daemon on 10.70.35.122                         N/A       N/A        Y       6508
Self-heal Daemon on 10.70.35.138                         N/A       N/A        Y       14368

Task Status of Volume zen
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-130 ~]# gluster v info

Volume Name: cross3
Type: Replicate
Volume ID: 848123a0-6f33-4046-a48e-db2e5f3b84a6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick3/distrep
Brick2: 10.70.35.130:/rhs/brick3/distrep
Brick3: 10.70.35.112:/rhs/brick3/distrep
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: on

Volume Name: distrep
Type: Distributed-Replicate
Volume ID: f9ebab34-d007-4ae7-a8a9-1fc6c4d6f61f
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick1/distrep
Brick2: 10.70.35.130:/rhs/brick1/distrep
Brick3: 10.70.35.112:/rhs/brick1/distrep
Brick4: 10.70.35.138:/rhs/brick1/distrep
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: on

Volume Name: zen
Type: Distributed-Disperse
Volume ID: 5098bb2d-8292-4b3d-b3c7-bf690709d1af
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: dhcp35-45.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick2: dhcp35-130.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick3: dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick4: dhcp35-23.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick5: dhcp35-112.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick6: dhcp35-138.lab.eng.blr.redhat.com:/rhs/brick1/zen
Brick7: dhcp35-45.lab.eng.blr.redhat.com:/rhs/brick2/zen
Brick8: dhcp35-130.lab.eng.blr.redhat.com:/rhs/brick2/zen
Brick9: dhcp35-122.lab.eng.blr.redhat.com:/rhs/brick2/zen
Brick10: dhcp35-23.lab.eng.blr.redhat.com:/rhs/brick2/zen
Brick11: dhcp35-112.lab.eng.blr.redhat.com:/rhs/brick2/zen
Brick12: dhcp35-138.lab.eng.blr.redhat.com:/rhs/brick2/zen
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: on
[root@dhcp35-130 ~]#
This was already fixed upstream through BZ 1420606, but we missed backporting it downstream.

Upstream patch: https://review.gluster.org/16560
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/102294
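For reference, the usual fix pattern for this class of fortify abort is to keep the snprintf() length argument tied to the destination buffer's actual size (e.g. sizeof at the call site) rather than a hard-coded constant. Below is a hedged sketch of that pattern; build_pidfile_path() is a hypothetical helper for illustration, not the glusterd API or the content of the linked patches.

#include <stdio.h>
#include <limits.h>

/* Hypothetical helper: the caller passes the destination's true size,
 * so the fortified snprintf() never sees a claimed length larger than
 * the object it writes into. */
static void build_pidfile_path(char *buf, size_t len, const char *rundir,
                               const char *host, const char *brick)
{
        snprintf(buf, len, "%s/run/%s-%s.pid", rundir, host, brick);
}

int main(void)
{
        char pidfile[PATH_MAX];

        /* sizeof(pidfile) cannot disagree with the compiler-known
         * object size, so over-long paths are truncated instead of
         * aborting the daemon via __chk_fail(). */
        build_pidfile_path(pidfile, sizeof(pidfile),
                           "/var/lib/glusterd/vols/cross3",
                           "10.70.35.45", "rhs-brick3-distrep");
        return 0;
}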
Build Version: 3.8.4-21

Created an EC volume and a distributed-replicate volume. While the I/O was running, enabled the brick-multiplexing option and created a new 1x3 volume. No cores were generated on any of the nodes after enabling brick multiplexing. Hence marking the bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774