Bug 1437957 - Brick Multiplexing: Glusterd crashed when stopping volumes
Summary: Brick Multiplexing: Glusterd crashed when stopping volumes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Atin Mukherjee
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard: brick-multiplexing
Duplicates: 1438247
Depends On: 1420606
Blocks: 1417151
 
Reported: 2017-03-31 14:17 UTC by Nag Pavan Chilakam
Modified: 2017-09-21 04:35 UTC (History)

Fixed In Version: glusterfs-3.8.4-21
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-21 04:35:56 UTC
Embargoed:


Attachments
core (635.00 KB, application/x-gzip)
2017-03-31 14:21 UTC, Nag Pavan Chilakam


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Description Nag Pavan Chilakam 2017-03-31 14:17:07 UTC
Description of problem:
======================
Hit this issue after raising https://bugzilla.redhat.com/show_bug.cgi?id=1437940

I wanted to clean up the testbed, so I started deleting the volumes, first stopping them as below:
[root@dhcp35-45 ~]# for i in $(gluster v list);do gluster v stop $i;done
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Connection failed. Please check if gluster daemon is operational.
Connection failed. Please check if gluster daemon is operational.
Connection failed. Please check if gluster daemon is operational.


glusterd crashed on this node.

The problem and core seem to be the same as the one I raised upstream:
1436543 - Brick Multiplexing: Glusterd crashed when stopping volumes
Nevertheless, I am raising a separate BZ with the logs.


version:
3.8.4-19

3 volumes as below:
1x3 ==> created after enabling brick multiplexing
2x(4+2) EC vol ==> created before enabling brick multiplexing
2x2 ==> created before enabling brick multiplexing


bt:
warning: core file may not match specified executable file.
[New LWP 10564]
[New LWP 10766]
[New LWP 10562]
[New LWP 10561]
[New LWP 10559]
[New LWP 10563]
[New LWP 10767]
[New LWP 10560]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 6, Aborted.
#0  0x00007fc4ac0981d7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 device-mapper-event-libs-1.02.135-1.el7_3.3.x86_64 device-mapper-libs-1.02.135-1.el7_3.3.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-33.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libsepol-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 lvm2-libs-2.02.166-1.el7_3.3.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.7.x86_64 userspace-rcu-0.7.9-2.el7rhgs.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007fc4ac0981d7 in raise () from /lib64/libc.so.6
#1  0x00007fc4ac0998c8 in abort () from /lib64/libc.so.6
#2  0x00007fc4ac0d7f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fc4ac172047 in __fortify_fail () from /lib64/libc.so.6
#4  0x00007fc4ac170200 in __chk_fail () from /lib64/libc.so.6
#5  0x00007fc4ac16f91b in __vsnprintf_chk () from /lib64/libc.so.6
#6  0x00007fc4ac16f838 in __snprintf_chk () from /lib64/libc.so.6
#7  0x00007fc4a24e99b4 in snprintf (__fmt=0x7fc4a25ddad8 "%s/run/%s-%s.pid", __n=4096, 
    __s=0x7fc49459e3f0 "@\345Y\224\304\177") at /usr/include/bits/stdio2.h:64
#8  glusterd_bricks_select_stop_volume (dict=dict@entry=0x7fc498007ec0, 
    op_errstr=op_errstr@entry=0x7fc4945a09f0, selected=selected@entry=0x7fc4945a0930)
    at glusterd-op-sm.c:6115
#9  0x00007fc4a24f6f76 in glusterd_op_bricks_select (op=op@entry=GD_OP_STOP_VOLUME, 
    dict=dict@entry=0x7fc498007ec0, op_errstr=op_errstr@entry=0x7fc4945a09f0, 
    selected=selected@entry=0x7fc4945a0930, rsp_dict=rsp_dict@entry=0x7fc48814d440)
    at glusterd-op-sm.c:7503
#10 0x00007fc4a258d7ef in gd_brick_op_phase (op=GD_OP_STOP_VOLUME, op_ctx=op_ctx@entry=0x7fc488154990, 
    req_dict=0x7fc498007ec0, op_errstr=op_errstr@entry=0x7fc4945a09f0) at glusterd-syncop.c:1681
#11 0x00007fc4a258e213 in gd_sync_task_begin (op_ctx=op_ctx@entry=0x7fc488154990, 
    req=req@entry=0x7fc494186f40) at glusterd-syncop.c:1922
#12 0x00007fc4a258e510 in glusterd_op_begin_synctask (req=req@entry=0x7fc494186f40, 
    op=op@entry=GD_OP_STOP_VOLUME, dict=0x7fc488154990) at glusterd-syncop.c:1991
#13 0x00007fc4a2575b7f in __glusterd_handle_cli_stop_volume (req=req@entry=0x7fc494186f40)
    at glusterd-volume-ops.c:628
#14 0x00007fc4a24dbca0 in glusterd_big_locked_handler (req=0x7fc494186f40, 
    actor_fn=0x7fc4a2575980 <__glusterd_handle_cli_stop_volume>) at glusterd-handler.c:81
#15 0x00007fc4ad9ec362 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#16 0x00007fc4ac0a9cf0 in ?? () from /lib64/libc.so.6
#17 0x0000000000000000 in ?? ()
(gdb) 
(gdb) t a a bt

Thread 8 (Thread 0x7fc4a5039700 (LWP 10560)):
#0  0x00007fc4ac81cbdd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fc4ad9c3306 in gf_timer_proc (data=0x7fc4aeb66770) at timer.c:176
#2  0x00007fc4ac815dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc4ac15a73d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7fc49df3a700 (LWP 10767)):
#0  0x00007fc4ac15ad13 in epoll_wait () from /lib64/libc.so.6
#1  0x00007fc4ada0fd30 in event_dispatch_epoll_worker (data=0x7fc4aebb4490) at event-epoll.c:665
#2  0x00007fc4ac815dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc4ac15a73d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fc4a3836700 (LWP 10563)):
#0  0x00007fc4ac819a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc4ad9ee898 in syncenv_task (proc=proc@entry=0x7fc4aeb66fc0) at syncop.c:603
#2  0x00007fc4ad9ef6e0 in syncenv_processor (thdata=0x7fc4aeb66fc0) at syncop.c:695
#3  0x00007fc4ac815dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc4ac15a73d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fc4ade8f780 (LWP 10559)):
#0  0x00007fc4ac816ef7 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fc4ada102e0 in event_dispatch_epoll (event_pool=0x7fc4aeb5ef00) at event-epoll.c:759
#2  0x00007fc4adea9d95 in main (argc=5, argv=<optimized out>) at glusterfsd.c:2464

Thread 4 (Thread 0x7fc4a4838700 (LWP 10561)):
#0  0x00007fc4ac81d101 in sigwait () from /lib64/libpthread.so.0
#1  0x00007fc4adeacebb in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2055
#2  0x00007fc4ac815dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc4ac15a73d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fc4a4037700 (LWP 10562)):
#0  0x00007fc4ac12166d in nanosleep () from /lib64/libc.so.6
#1  0x00007fc4ac121504 in sleep () from /lib64/libc.so.6
#2  0x00007fc4ad9dc82d in pool_sweeper (arg=<optimized out>) at mem-pool.c:464
#3  0x00007fc4ac815dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fc4ac15a73d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fc49e73b700 (LWP 10766)):
#0  0x00007fc4ac8196d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fc4a258f783 in hooks_worker (args=<optimized out>) at glusterd-hooks.c:531
#2  0x00007fc4ac815dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fc4ac15a73d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fc4a3035700 (LWP 10564)):
#0  0x00007fc4ac0981d7 in raise () from /lib64/libc.so.6
#1  0x00007fc4ac0998c8 in abort () from /lib64/libc.so.6
#2  0x00007fc4ac0d7f07 in __libc_message () from /lib64/libc.so.6
#3  0x00007fc4ac172047 in __fortify_fail () from /lib64/libc.so.6
#4  0x00007fc4ac170200 in __chk_fail () from /lib64/libc.so.6
#5  0x00007fc4ac16f91b in __vsnprintf_chk () from /lib64/libc.so.6
#6  0x00007fc4ac16f838 in __snprintf_chk () from /lib64/libc.so.6
#7  0x00007fc4a24e99b4 in snprintf (__fmt=0x7fc4a25ddad8 "%s/run/%s-%s.pid", __n=4096, 
    __s=0x7fc49459e3f0 "@\345Y\224\304\177") at /usr/include/bits/stdio2.h:64
#8  glusterd_bricks_select_stop_volume (dict=dict@entry=0x7fc498007ec0, 
    op_errstr=op_errstr@entry=0x7fc4945a09f0, selected=selected@entry=0x7fc4945a0930)
    at glusterd-op-sm.c:6115
#9  0x00007fc4a24f6f76 in glusterd_op_bricks_select (op=op@entry=GD_OP_STOP_VOLUME, 
    dict=dict@entry=0x7fc498007ec0, op_errstr=op_errstr@entry=0x7fc4945a09f0, 
    selected=selected@entry=0x7fc4945a0930, rsp_dict=rsp_dict@entry=0x7fc48814d440)
    at glusterd-op-sm.c:7503
#10 0x00007fc4a258d7ef in gd_brick_op_phase (op=GD_OP_STOP_VOLUME, op_ctx=op_ctx@entry=0x7fc488154990, 
    req_dict=0x7fc498007ec0, op_errstr=op_errstr@entry=0x7fc4945a09f0) at glusterd-syncop.c:1681
#11 0x00007fc4a258e213 in gd_sync_task_begin (op_ctx=op_ctx@entry=0x7fc488154990, 
    req=req@entry=0x7fc494186f40) at glusterd-syncop.c:1922
#12 0x00007fc4a258e510 in glusterd_op_begin_synctask (req=req@entry=0x7fc494186f40, 
    op=op@entry=GD_OP_STOP_VOLUME, dict=0x7fc488154990) at glusterd-syncop.c:1991
#13 0x00007fc4a2575b7f in __glusterd_handle_cli_stop_volume (req=req@entry=0x7fc494186f40)
    at glusterd-volume-ops.c:628
#14 0x00007fc4a24dbca0 in glusterd_big_locked_handler (req=0x7fc494186f40, 
    actor_fn=0x7fc4a2575980 <__glusterd_handle_cli_stop_volume>) at glusterd-handler.c:81
#15 0x00007fc4ad9ec362 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#16 0x00007fc4ac0a9cf0 in ?? () from /lib64/libc.so.6
#17 0x0000000000000000 in ?? ()
(gdb) 
(gdb)

Comment 2 Nag Pavan Chilakam 2017-03-31 14:21:51 UTC
Created attachment 1267879 [details]
core

Comment 3 Nag Pavan Chilakam 2017-03-31 14:24:06 UTC
sosreports and core are at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1437957/

Comment 4 Atin Mukherjee 2017-03-31 18:09:00 UTC
This is already fixed upstream through BZ 1420606, but we missed backporting it downstream.

Upstream patch : https://review.gluster.org/16560
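
For readers of the backtrace: frames #3 to #7 show glibc's fortified snprintf aborting via __chk_fail, which it does when the size argument (__n=4096 here) is larger than the destination buffer the compiler knows about. The sketch below only illustrates the safe pattern for building a path with the format string seen in frame #7. It is not the glusterd code and not necessarily what the upstream patch does; the helper name format_pidfile and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper, not taken from glusterd: builds
 * "<workdir>/run/<host>-<brick>.pid" into dst.
 * Returns 0 on success, -1 if the path did not fit or formatting failed.
 */
static int
format_pidfile (char *dst, size_t dst_len, const char *workdir,
                const char *host, const char *brick)
{
        int len;

        assert (dst != NULL && dst_len > 0);

        /* Pass the real capacity of dst, never a larger constant. */
        len = snprintf (dst, dst_len, "%s/run/%s-%s.pid", workdir, host, brick);

        /* snprintf returns the length it needed; >= dst_len means truncation. */
        if (len < 0 || (size_t) len >= dst_len)
                return -1;

        return 0;
}
```

A caller would declare the buffer and pass sizeof of that same buffer, for example `char pidfile[PATH_MAX]; format_pidfile (pidfile, sizeof (pidfile), ...)`, so the size argument can never exceed the actual allocation.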

Comment 6 Atin Mukherjee 2017-04-03 10:37:45 UTC
Downstream patch : https://code.engineering.redhat.com/gerrit/#/c/102294

Comment 8 Atin Mukherjee 2017-04-03 10:44:48 UTC
*** Bug 1438247 has been marked as a duplicate of this bug. ***

Comment 10 Nag Pavan Chilakam 2017-04-17 15:40:55 UTC
I am seeing a glusterd crash while stopping the volumes
Kindly refer to https://bugzilla.redhat.com/show_bug.cgi?id=1436543#c5 and https://bugzilla.redhat.com/show_bug.cgi?id=1436543#c6

I was testing on 3.8.4-22

If they are different, we can track them separately; otherwise we may have to fail_qa this BZ.

Comment 12 Nag Pavan Chilakam 2017-04-18 11:26:47 UTC
I have hit another crash, but its backtrace is different (https://bugzilla.redhat.com/show_bug.cgi?id=1437957#c11). I did not hit this BZ again after trying a couple of times on multiple volumes.

Hence moving to verified; tested on 3.8.4-22.

Comment 14 errata-xmlrpc 2017-09-21 04:35:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

