Description of problem:
=======================
Had about 42 volumes, as below, on a 3-node setup:
 * 10 volumes of 2x2 type
 * 10 volumes of 2x(4+2) type
 * 10 1x3 volumes
 * 10 1x2 volumes
 * 1 1x2 and 1 1x3 volume ===> created before brick multiplexing was enabled

I started to stop all volumes one after another. From another node, I was deleting the volumes which had been stopped. I found that after about 20 volumes, glusterd crashed on the node where I was stopping the volumes.

[root@dhcp35-192 ~]# file /core.9140
/core.9140: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterd', platform: 'x86_64'

[root@dhcp35-192 ~]# gdb /usr/sbin/glusterd /core.9140
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
warning: core file may not match specified executable file.
[New LWP 9148]
[New LWP 9143]
[New LWP 9147]
[New LWP 9142]
[New LWP 9141]
[New LWP 9144]
[New LWP 9140]
[New LWP 9145]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  list_del_init (old=0x7fb6d4962cf0) at list.h:87
87        old->prev->next = old->next;
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 device-mapper-event-libs-1.02.135-1.el7_3.3.x86_64 device-mapper-libs-1.02.135-1.el7_3.3.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-33.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libsepol-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 lvm2-libs-2.02.166-1.el7_3.3.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.7.x86_64 userspace-rcu-0.7.16-3.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  list_del_init (old=0x7fb6d4962cf0) at list.h:87
#1  __run (task=task@entry=0x7fb6d4962cf0) at syncop.c:255
#2  0x00007fb70a0538d1 in synctask_wake (task=0x7fb6d4962cf0) at syncop.c:359
#3  0x00007fb6febede66 in _gd_syncop_brick_op_cbk (req=req@entry=0x7fb6e4a87990, iov=iov@entry=0x7fb6fa595860, count=count@entry=1, myframe=myframe@entry=0x7fb6e7c232d0) at glusterd-syncop.c:937
#4  0x00007fb6feb8862a in glusterd_big_locked_cbk (req=0x7fb6e4a87990, iov=0x7fb6fa595860, count=1, myframe=0x7fb6e7c232d0, fn=0x7fb6febedbc0 <_gd_syncop_brick_op_cbk>) at glusterd-rpc-ops.c:222
#5  0x00007fb709de48d5 in saved_frames_unwind (saved_frames=saved_frames@entry=0x7fb6f001bfb0) at rpc-clnt.c:369
#6  0x00007fb709de49be in saved_frames_destroy (frames=0x7fb6f001bfb0) at rpc-clnt.c:386
#7  0x00007fb709de6124 in rpc_clnt_connection_cleanup (conn=conn@entry=0x7fb6f4201ff8) at rpc-clnt.c:555
#8  0x00007fb709de69ac in rpc_clnt_handle_disconnect (conn=0x7fb6f4201ff8, clnt=0x7fb6f4201fa0) at rpc-clnt.c:880
#9  rpc_clnt_notify (trans=<optimized out>, mydata=0x7fb6f4201ff8, event=RPC_TRANSPORT_DISCONNECT, data=0x7fb6f42925f0) at rpc-clnt.c:936
#10 0x00007fb709de29e3 in rpc_transport_notify (this=this@entry=0x7fb6f42925f0, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7fb6f42925f0) at rpc-transport.c:538
#11 0x00007fb6fbfa77b2 in socket_event_poll_err (this=0x7fb6f42925f0) at socket.c:1180
#12 socket_event_handler (fd=<optimized out>, idx=20, data=0x7fb6f42925f0, poll_in=0, poll_out=4, poll_err=<optimized out>) at socket.c:2405
#13 0x00007fb70a076fa0 in event_dispatch_epoll_handler (event=0x7fb6fa595e80, event_pool=0x7fb70b1e5fe0) at event-epoll.c:572
#14 event_dispatch_epoll_worker (data=0x7fb70b207c10) at event-epoll.c:675
#15 0x00007fb708e7ddc5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007fb7087c273d in clone () from /lib64/libc.so.6
(gdb) quit

[root@dhcp35-192 ~]# service glusterd status
Redirecting to /bin/systemctl status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; disabled; vendor preset: disabled)
   Active: failed (Result: signal) since Tue 2017-03-28 12:12:25 IST; 23min ago
  Process: 9139 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 9140 (code=killed, signal=SEGV)
   CGroup: /system.slice/glusterd.service

Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: setfsid 1
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: spinlock 1
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: epoll.h 1
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: xattr.h 1
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: st_atim.tv_nsec 1
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: package-string: glusterfs 3.10.0
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: ---------
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com systemd[1]: glusterd.service: main process exited...GV
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com systemd[1]: Unit glusterd.service entered failed ...e.
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com systemd[1]: glusterd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

Version-Release number of selected component (if applicable):
=============================================================
[root@dhcp35-192 ~]# rpm -qa | grep gluster
glusterfs-libs-3.10.0-1.el7.x86_64
glusterfs-api-3.10.0-1.el7.x86_64
glusterfs-debuginfo-3.10.0-1.el7.x86_64
glusterfs-3.10.0-1.el7.x86_64
glusterfs-fuse-3.10.0-1.el7.x86_64
glusterfs-cli-3.10.0-1.el7.x86_64
glusterfs-rdma-3.10.0-1.el7.x86_64
glusterfs-client-xlators-3.10.0-1.el7.x86_64
glusterfs-server-3.10.0-1.el7.x86_64
Created attachment 1266847 [details] core
Looks like the same issue that the patch https://review.gluster.org/16927 tries to solve. Jeff?
(In reply to Atin Mukherjee from comment #3)
> Looks like the same issue what the patch https://review.gluster.org/16927
> tries to solve. Jeff?

Pretty likely, but not certain. Memory-corruption bugs can have all sorts of unexpected effects. However, since we crashed deleting an item on one of the very same lists that was likely to be corrupted by the other bug, the probability of a relationship is high. We need to get that patch un-stuck and re-test.
Seen this crash again on 3.8.4-22, when I was stopping all 20 volumes in sequence, after raising BZ#1442787 (Brick Multiplexing: During remove-brick, when glusterd of a node is stopped, the brick process gets disconnected from glusterd's purview and hence loses the multiplexing feature). Core is attached.
Created attachment 1272104 [details] core while doing vol stop post raising bz#1442787 - Brick Multiplexing
(In reply to nchilaka from comment #5)
> Seen this crash again on 3.8.4-22, when i was stopping all the 20 volumes
> in a sequence post raising bZ# 1442787 - Brick Multiplexing: During Remove
> brick when glusterd of a node is stopped, the brick process gets
> disconnected from glusterd purview and hence losing multiplexing feature.
> Core is attached

3.8.4-22 is not an upstream bit. If you are updating this bug, results should be based on upstream testing.
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained. As a result, this bug is being closed. If the bug persists on a maintained version of Gluster or against the mainline Gluster repository, please request that it be reopened and mark the Version field appropriately.