Description of problem:
=======================
Had about 42 volumes, as below, on a 3-node setup:
 * 10 volumes of 2x2 type
 * 10 volumes of 2x(4+2) type
 * 10 1x3 volumes
 * 10 1x2 volumes
 * 1 1x2 and 1 1x3 volume ===> created before brick multiplexing was enabled

I started to stop all volumes one after another. From another node, I was deleting the volumes which had been stopped. I found that after about 20 volumes, glusterd crashed on the node where I was stopping the volumes.

[root@dhcp35-192 ~]# file /core.9140
/core.9140: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterd', platform: 'x86_64'

[root@dhcp35-192 ~]# gdb /usr/sbin/glusterd /core.9140
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
warning: core file may not match specified executable file.
[New LWP 9148]
[New LWP 9143]
[New LWP 9147]
[New LWP 9142]
[New LWP 9141]
[New LWP 9144]
[New LWP 9140]
[New LWP 9145]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO'.
Program terminated with signal 11, Segmentation fault.
#0  list_del_init (old=0x7fb6d4962cf0) at list.h:87
87        old->prev->next = old->next;
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 device-mapper-event-libs-1.02.135-1.el7_3.3.x86_64 device-mapper-libs-1.02.135-1.el7_3.3.x86_64 elfutils-libelf-0.166-2.el7.x86_64 elfutils-libs-0.166-2.el7.x86_64 glibc-2.17-157.el7_3.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.14.1-27.el7_3.x86_64 libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-33.el7.x86_64 libcap-2.22-8.el7.x86_64 libcom_err-1.42.9-9.el7.x86_64 libgcc-4.8.5-11.el7.x86_64 libselinux-2.5-6.el7.x86_64 libsepol-2.5-6.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 lvm2-libs-2.02.166-1.el7_3.3.x86_64 openssl-libs-1.0.1e-60.el7_3.1.x86_64 pcre-8.32-15.el7_2.1.x86_64 systemd-libs-219-30.el7_3.7.x86_64 userspace-rcu-0.7.16-3.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  list_del_init (old=0x7fb6d4962cf0) at list.h:87
#1  __run (task=task@entry=0x7fb6d4962cf0) at syncop.c:255
#2  0x00007fb70a0538d1 in synctask_wake (task=0x7fb6d4962cf0) at syncop.c:359
#3  0x00007fb6febede66 in _gd_syncop_brick_op_cbk (req=req@entry=0x7fb6e4a87990, iov=iov@entry=0x7fb6fa595860, count=count@entry=1, myframe=myframe@entry=0x7fb6e7c232d0) at glusterd-syncop.c:937
#4  0x00007fb6feb8862a in glusterd_big_locked_cbk (req=0x7fb6e4a87990, iov=0x7fb6fa595860, count=1, myframe=0x7fb6e7c232d0, fn=0x7fb6febedbc0 <_gd_syncop_brick_op_cbk>) at glusterd-rpc-ops.c:222
#5  0x00007fb709de48d5 in saved_frames_unwind (saved_frames=saved_frames@entry=0x7fb6f001bfb0) at rpc-clnt.c:369
#6  0x00007fb709de49be in saved_frames_destroy (frames=0x7fb6f001bfb0) at rpc-clnt.c:386
#7  0x00007fb709de6124 in rpc_clnt_connection_cleanup (conn=conn@entry=0x7fb6f4201ff8) at rpc-clnt.c:555
#8  0x00007fb709de69ac in rpc_clnt_handle_disconnect (conn=0x7fb6f4201ff8, clnt=0x7fb6f4201fa0) at rpc-clnt.c:880
#9  rpc_clnt_notify (trans=<optimized out>, mydata=0x7fb6f4201ff8, event=RPC_TRANSPORT_DISCONNECT, data=0x7fb6f42925f0) at rpc-clnt.c:936
#10 0x00007fb709de29e3 in rpc_transport_notify (this=this@entry=0x7fb6f42925f0, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7fb6f42925f0) at rpc-transport.c:538
#11 0x00007fb6fbfa77b2 in socket_event_poll_err (this=0x7fb6f42925f0) at socket.c:1180
#12 socket_event_handler (fd=<optimized out>, idx=20, data=0x7fb6f42925f0, poll_in=0, poll_out=4, poll_err=<optimized out>) at socket.c:2405
#13 0x00007fb70a076fa0 in event_dispatch_epoll_handler (event=0x7fb6fa595e80, event_pool=0x7fb70b1e5fe0) at event-epoll.c:572
#14 event_dispatch_epoll_worker (data=0x7fb70b207c10) at event-epoll.c:675
#15 0x00007fb708e7ddc5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007fb7087c273d in clone () from /lib64/libc.so.6
(gdb) quit

[root@dhcp35-192 ~]# service glusterd status
Redirecting to /bin/systemctl status glusterd.service
● glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; disabled; vendor preset: disabled)
   Active: failed (Result: signal) since Tue 2017-03-28 12:12:25 IST; 23min ago
  Process: 9139 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 9140 (code=killed, signal=SEGV)
   CGroup: /system.slice/glusterd.service

Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: setfsid 1
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: spinlock 1
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: epoll.h 1
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: xattr.h 1
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: st_atim.tv_nsec 1
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: package-string: glusterfs 3.10.0
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com glusterd[9140]: ---------
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com systemd[1]: glusterd.service: main process exited...GV
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com systemd[1]: Unit glusterd.service entered failed ...e.
Mar 28 12:12:25 dhcp35-192.lab.eng.blr.redhat.com systemd[1]: glusterd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

Version-Release number of selected component (if applicable):
=============================================================
[root@dhcp35-192 ~]# rpm -qa | grep gluster
glusterfs-libs-3.10.0-1.el7.x86_64
glusterfs-api-3.10.0-1.el7.x86_64
glusterfs-debuginfo-3.10.0-1.el7.x86_64
glusterfs-3.10.0-1.el7.x86_64
glusterfs-fuse-3.10.0-1.el7.x86_64
glusterfs-cli-3.10.0-1.el7.x86_64
glusterfs-rdma-3.10.0-1.el7.x86_64
glusterfs-client-xlators-3.10.0-1.el7.x86_64
glusterfs-server-3.10.0-1.el7.x86_64
Created attachment 1266847 [details] core
Looks like the same issue that the patch https://review.gluster.org/16927 tries to solve. Jeff?
(In reply to Atin Mukherjee from comment #3)
> Looks like the same issue what the patch https://review.gluster.org/16927
> tries to solve. Jeff?

Pretty likely, but not certain. Memory-corruption bugs can have all sorts of unexpected effects. However, since we crashed deleting an item on one of the very same lists that was likely to be corrupted by the other bug, the probability of a relationship is high. We need to get that patch un-stuck and re-test.
Seen this crash again on 3.8.4-22, when I was stopping all 20 volumes in sequence, after raising BZ#1442787 (Brick Multiplexing: During remove-brick, when glusterd of a node is stopped, the brick process gets disconnected from glusterd's purview and hence loses the multiplexing feature). Core is attached.
Created attachment 1272104 [details] core while doing vol stop post raising bz#1442787 - Brick Multiplexing
(In reply to nchilaka from comment #5)
> Seen this crash again on 3.8.4-22, when i was stopping all the 20 volumes
> in a sequence post raising bZ# 1442787 - Brick Multiplexing: During Remove
> brick when glusterd of a node is stopped, the brick process gets
> disconnected from glusterd purview and hence losing multiplexing feature.
> Core is attached

3.8.4-22 is not an upstream bit. If you are updating this bug, results should be based on upstream testing.
This bug is reported against a version of Gluster that is no longer maintained (or has been EOL'd). See https://www.gluster.org/release-schedule/ for the versions currently maintained. As a result, this bug is being closed. If the bug persists on a maintained version of Gluster or against the mainline Gluster repository, please request that it be reopened and mark the Version field appropriately.