1725022 – glustershd dumped core with seg fault at __get_heard_from_all_status

Summary: glustershd dumped core with seg fault at __get_heard_from_all_status

Keywords:
Status:	CLOSED DUPLICATE of bug 1725024
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	replicate
Sub Component:
Version:	rhgs-3.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Karthik U S
QA Contact:	Nag Pavan Chilakam
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-06-28 09:27 UTC by Nag Pavan Chilakam
Modified:	2019-10-31 09:12 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-28 14:10:56 UTC
Embargoed:
Dependent Products:

Description Nag Pavan Chilakam 2019-06-28 09:27:31 UTC

Description of problem:
=====================
I was running repeated volume creates and deletes on my brickmux setup, creating volumes of different types. After about 10 hours, shd crashed and dumped core with the backtrace below.

Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/c8/fbb951579a5ccf45f786661b585545f43e4870
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id shd/dist-arb_ex8z2g5bax33k -p /va'.
Program terminated with signal 11, Segmentation fault.
#0  __get_heard_from_all_status (this=this@entry=0x7f7fce0f2250) at afr-common.c:5024
5024	    for (i = 0; i < priv->child_count; i++) {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libcom_err-1.42.9-15.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-61.el7.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64


(gdb) bt
#0  __get_heard_from_all_status (this=this@entry=0x7f7fce0f2250) at afr-common.c:5024
#1  0x00007f7fef97be27 in afr_notify (this=0x7f7fce0f2250, event=6, data=0x7f7fce0de3d0, data2=<optimized out>) at afr-common.c:5519
#2  0x00007f7fef97c6c9 in notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at afr.c:42
#3  0x00007f7ffe32c2a2 in xlator_notify (xl=0x7f7fce0f2250, event=event@entry=6, data=data@entry=0x7f7fce0de3d0) at xlator.c:692
#4  0x00007f7ffe3e3d15 in default_notify (this=this@entry=0x7f7fce0de3d0, event=event@entry=6, data=data@entry=0x0) at defaults.c:3388
#5  0x00007f7fefbb6469 in client_notify_dispatch (this=this@entry=0x7f7fce0de3d0, event=event@entry=6, data=data@entry=0x0) at client.c:97
#6  0x00007f7fefbb64ca in client_notify_dispatch_uniq (this=this@entry=0x7f7fce0de3d0, event=event@entry=6, data=data@entry=0x0) at client.c:71
#7  0x00007f7fefbb748d in client_rpc_notify (rpc=0x7f7fce838270, mydata=0x7f7fce0de3d0, event=<optimized out>, data=<optimized out>) at client.c:2365
#8  0x00007f7ffe0d8203 in rpc_clnt_handle_disconnect (conn=0x7f7fce8382a0, clnt=0x7f7fce838270) at rpc-clnt.c:826
#9  rpc_clnt_notify (trans=0x7f7fce8385b0, mydata=0x7f7fce8382a0, event=RPC_TRANSPORT_DISCONNECT, data=<optimized out>) at rpc-clnt.c:887
#10 0x00007f7ffe0d4a53 in rpc_transport_notify (this=this@entry=0x7f7fce8385b0, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7f7fce8385b0) at rpc-transport.c:547
#11 0x00007f7ff26ee2df in socket_event_poll_err (this=this@entry=0x7f7fce8385b0, gen=gen@entry=1, idx=idx@entry=183) at socket.c:1385
#12 0x00007f7ff26f06ea in socket_event_handler (fd=<optimized out>, idx=<optimized out>, gen=<optimized out>, data=0x7f7fce8385b0, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=16, event_thread_died=0 '\000')
    at socket.c:3008
#13 0x00007f7ffe395416 in event_dispatch_epoll_handler (event=0x7f7ff0e3ce70, event_pool=0x55b40b2bf5b0) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x55b40b311c80) at event-epoll.c:761
#15 0x00007f7ffd16dea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f7ffca338cd in clone () from /lib64/libc.so.6
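
Note on frame #0: line 5024 is the loop condition, which reads priv->child_count. The fault is therefore consistent with priv being NULL or pointing at already-freed memory when the disconnect notification arrived, although nothing in this report confirms that. The following is a minimal C sketch of that failure mode and of a defensive check, under stated assumptions: the toy_* types, the private_data and last_event fields, and the -1 return value are hypothetical stand-ins for illustration, not the actual GlusterFS structures or the actual fix.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the AFR private data and the
 * xlator object. Only the fields needed for the illustration exist. */
typedef struct {
    unsigned int child_count;
    int *last_event; /* assumption: 0 means "no event seen from this child" */
} toy_afr_private_t;

typedef struct {
    void *private_data; /* assumption: set to NULL when the xlator is torn down */
} toy_xlator_t;

/* Returns 1 if every child has reported an event, 0 if at least one has
 * not, and -1 if the private data is missing. Without the NULL check,
 * evaluating priv->child_count in the loop condition would dereference
 * a NULL pointer, which is the kind of fault seen in frame #0. */
static int toy_heard_from_all(const toy_xlator_t *xl)
{
    const toy_afr_private_t *priv = NULL;
    unsigned int i;

    if (xl != NULL)
        priv = (const toy_afr_private_t *)xl->private_data;

    if (priv == NULL || priv->last_event == NULL)
        return -1;

    for (i = 0; i < priv->child_count; i++) {
        if (priv->last_event[i] == 0)
            return 0;
    }
    return 1;
}
```

A NULL check of this kind only helps if teardown clears the pointer before a late notification can run. If the memory is freed while the pointer is left dangling, the notification path would need to be serialized against cleanup instead.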



Version-Release number of selected component (if applicable):
===================
6.0.6 


How reproducible:
===============
Hit three times on this cluster.

Steps to Reproduce:
1. Create a 3-node brickmux setup.
2. Run a script that creates 1000 volumes of randomly chosen types (singlebrick, rep3, distrep3, arb, dist-arb, ecv, dist-ecv).
3. Delete all the volumes.
4. Repeat steps 2 and 3.

for j in {1..100}; do
    echo "########## loop $j #### " |& tee -a volc.log
    date |& tee -a volc.log
    for i in {1..1000}; do
        python randvol-create.py |& tee -a volc.log
    done
    for v in $(gluster v list); do
        gluster v stop $v --mode=script |& tee -a volc.log
        date |& tee -a volc.log
        gluster v del $v --mode=script |& tee -a volc.log
    done
done

Actual results:
================
glustershd crashes with the backtrace shown above.

Comment 2 Nag Pavan Chilakam 2019-06-28 09:40:28 UTC

Logs and sosreports: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1725022/
Cores: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1725022/rhs-gp-srv1.lab.eng.blr.redhat.com/
