Description of problem:
=======================
Hit this problem along with BZ#1725022, but on a different node and probably a different iteration.

I was running volume creates and deletes on my brick-mux setup, creating volumes of different types. After about 10 hrs, shd crashed and dumped core with the backtrace below:

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/c8/fbb951579a5ccf45f786661b585545f43e4870
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id shd/rep_vnpuvcukvzkhb -p /var/run'.
Program terminated with signal 11, Segmentation fault.
#0  afr_has_quorum (subvols=0x7f94eb2093d0 "", this=this@entry=0x7f94eb0bc070, frame=frame@entry=0x0) at afr-transaction.c:871
871         up_children_count = AFR_COUNT(subvols, priv->child_count);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libcom_err-1.42.9-15.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-61.el7.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  afr_has_quorum (subvols=0x7f94eb2093d0 "", this=this@entry=0x7f94eb0bc070, frame=frame@entry=0x0) at afr-transaction.c:871
#1  0x00007fa3bf5b82cd in afr_notify (this=0x7f94eb0bc070, event=6, data=0x7f94eb0a3f30, data2=<optimized out>) at afr-common.c:5586
#2  0x00007fa3bf5b86c9 in notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at afr.c:42
#3  0x00007fa3cdf682a2 in xlator_notify (xl=0x7f94eb0bc070, event=event@entry=6, data=data@entry=0x7f94eb0a3f30) at xlator.c:692
#4  0x00007fa3ce01fd15 in default_notify (this=this@entry=0x7f94eb0a3f30, event=event@entry=6, data=data@entry=0x0) at defaults.c:3388
#5  0x00007fa3bf7f2469 in client_notify_dispatch (this=this@entry=0x7f94eb0a3f30, event=event@entry=6, data=data@entry=0x0) at client.c:97
#6  0x00007fa3bf7f24ca in client_notify_dispatch_uniq (this=this@entry=0x7f94eb0a3f30, event=event@entry=6, data=data@entry=0x0) at client.c:71
#7  0x00007fa3bf7f348d in client_rpc_notify (rpc=0x7f94eb5d08b0, mydata=0x7f94eb0a3f30, event=<optimized out>, data=<optimized out>) at client.c:2365
#8  0x00007fa3cdd14203 in rpc_clnt_handle_disconnect (conn=0x7f94eb5d08e0, clnt=0x7f94eb5d08b0) at rpc-clnt.c:826
#9  rpc_clnt_notify (trans=0x7f94eb5d0b70, mydata=0x7f94eb5d08e0, event=RPC_TRANSPORT_DISCONNECT, data=<optimized out>) at rpc-clnt.c:887
#10 0x00007fa3cdd10a53 in rpc_transport_notify (this=this@entry=0x7f94eb5d0b70, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7f94eb5d0b70) at rpc-transport.c:547
#11 0x00007fa3c232a2df in socket_event_poll_err (this=this@entry=0x7f94eb5d0b70, gen=gen@entry=1, idx=idx@entry=4638) at socket.c:1385
#12 0x00007fa3c232c6ea in socket_event_handler (fd=<optimized out>, idx=<optimized out>, gen=<optimized out>, data=0x7f94eb5d0b70, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=16, event_thread_died=0 '\000') at socket.c:3008
#13 0x00007fa3cdfd1416 in event_dispatch_epoll_handler (event=0x7fa3c0277e70, event_pool=0x5587f336f5b0) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x5587f33c1fd0) at event-epoll.c:761
#15 0x00007fa3ccda9ea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007fa3cc66f8cd in clone () from /lib64/libc.so.6

Version-Release number of selected component (if applicable):
=============================================================
6.0.6

How reproducible:
=================
Hit thrice on this cluster.

Steps to Reproduce:
===================
1. Created a 3-node brick-mux setup.
2. Triggered a script which creates 1000 volumes of different types at random (single-brick, rep3, dist-rep3, arb, dist-arb, ecv, dist-ecv).
3. Post that, deleted all the volumes.
4. Re-iterated steps 2-3:

for j in {1..100};do echo "########## loop $j #### " |& tee -a volc.log; date |& tee -a volc.log;for i in {1..1000};do python randvol-create.py |& tee -a volc.log;done;for v in $(gluster v list);do gluster v stop $v --mode=script|& tee -a volc.log;date |& tee -a volc.log; gluster v del $v --mode=script|& tee -a volc.log;done;done

Actual results:
===============
glustershd crash as above.
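For readability, the repro one-liner is equivalent to the multi-line script below (a sketch only; `randvol-create.py` is the reporter's helper that creates one volume of a random type and is not included here, and the script needs a live gluster cluster to run):

```shell
#!/bin/bash
# Repro loop: create 1000 volumes of random types, then stop and delete them all.
LOG=volc.log
for j in {1..100}; do
    echo "########## loop $j ####" |& tee -a "$LOG"
    date |& tee -a "$LOG"
    for i in {1..1000}; do
        # creates one volume of a random type (reporter's helper script)
        python randvol-create.py |& tee -a "$LOG"
    done
    for v in $(gluster v list); do
        gluster v stop "$v" --mode=script |& tee -a "$LOG"
        date |& tee -a "$LOG"
        gluster v del "$v" --mode=script |& tee -a "$LOG"
    done
done
```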
core @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1725022/rhs-gp-srv4.lab.eng.blr.redhat.com/core.27031
sosreports and logs same as http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1725022
*** Bug 1725022 has been marked as a duplicate of this bug. ***
Another crash was hit with a similar backtrace while trying to add bricks at random to any one of 3 different volumes (1 each of disp, rep3, arb); tried this 20 times.

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/c8/fbb951579a5ccf45f786661b585545f43e4870
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id shd/arb -p /var/run/gluster/shd/a'.
Program terminated with signal 11, Segmentation fault.
#0  __get_heard_from_all_status (this=this@entry=0x7fb534263680) at afr-common.c:5024
5024        for (i = 0; i < priv->child_count; i++) {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libcom_err-1.42.9-15.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-61.el7.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  __get_heard_from_all_status (this=this@entry=0x7fb534263680) at afr-common.c:5024
#1  0x00007fb5447b6e27 in afr_notify (this=0x7fb534263680, event=6, data=0x7fb53425c080, data2=<optimized out>) at afr-common.c:5519
#2  0x00007fb5447b76c9 in notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at afr.c:42
#3  0x00007fb553167272 in xlator_notify (xl=0x7fb534263680, event=event@entry=6, data=data@entry=0x7fb53425c080) at xlator.c:699
#4  0x00007fb55321e1d5 in default_notify (this=this@entry=0x7fb53425c080, event=event@entry=6, data=data@entry=0x0) at defaults.c:3388
#5  0x00007fb5449f1469 in client_notify_dispatch (this=this@entry=0x7fb53425c080, event=event@entry=6, data=data@entry=0x0) at client.c:97
#6  0x00007fb5449f14ca in client_notify_dispatch_uniq (this=this@entry=0x7fb53425c080, event=event@entry=6, data=data@entry=0x0) at client.c:71
#7  0x00007fb5449f2465 in client_rpc_notify (rpc=0x7fb5343ecb30, mydata=0x7fb53425c080, event=<optimized out>, data=<optimized out>) at client.c:2373
#8  0x00007fb552f13203 in rpc_clnt_handle_disconnect (conn=0x7fb5343ecb60, clnt=0x7fb5343ecb30) at rpc-clnt.c:826
#9  rpc_clnt_notify (trans=0x7fb5343ecdf0, mydata=0x7fb5343ecb60, event=RPC_TRANSPORT_DISCONNECT, data=<optimized out>) at rpc-clnt.c:887
#10 0x00007fb552f0fa53 in rpc_transport_notify (this=this@entry=0x7fb5343ecdf0, event=event@entry=RPC_TRANSPORT_DISCONNECT, data=data@entry=0x7fb5343ecdf0) at rpc-transport.c:547
#11 0x00007fb5475292df in socket_event_poll_err (this=this@entry=0x7fb5343ecdf0, gen=gen@entry=1, idx=idx@entry=18) at socket.c:1385
#12 0x00007fb54752b6ea in socket_event_handler (fd=<optimized out>, idx=<optimized out>, gen=<optimized out>, data=0x7fb5343ecdf0, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=16, event_thread_died=0 '\000') at socket.c:3008
#13 0x00007fb5531cf8d6 in event_dispatch_epoll_handler (event=0x7fb545476130, event_pool=0x56031562e560) at event-epoll.c:648
#14 event_dispatch_epoll_worker (data=0x56031567eaa0) at event-epoll.c:761
#15 0x00007fb551fa8ea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007fb55186e8cd in clone () from /lib64/libc.so.6
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-07-03 11:00:11
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.0
/lib64/libglusterfs.so.0(+0x27210)[0x7fb55316a210]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fb553174c34]
/lib64/libc.so.6(+0x363f0)[0x7fb5517a63f0]
/usr/lib64/glusterfs/6.0/xlator/cluster/replicate.so(+0x5e447)[0x7fb5447a3447]
/usr/lib64/glusterfs/6.0/xlator/cluster/replicate.so(+0x71e27)[0x7fb5447b6e27]
/usr/lib64/glusterfs/6.0/xlator/cluster/replicate.so(+0x726c9)[0x7fb5447b76c9]
/lib64/libglusterfs.so.0(xlator_notify+0x32)[0x7fb553167272]
/lib64/libglusterfs.so.0(default_notify+0x195)[0x7fb55321e1d5]
/usr/lib64/glusterfs/6.0/xlator/protocol/client.so(+0x14469)[0x7fb5449f1469]
/usr/lib64/glusterfs/6.0/xlator/protocol/client.so(+0x144ca)[0x7fb5449f14ca]
/usr/lib64/glusterfs/6.0/xlator/protocol/client.so(+0x15465)[0x7fb5449f2465]
/lib64/libgfrpc.so.0(+0xf203)[0x7fb552f13203]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fb552f0fa53]
/usr/lib64/glusterfs/6.0/rpc-transport/socket.so(+0x82df)[0x7fb5475292df]
/usr/lib64/glusterfs/6.0/rpc-transport/socket.so(+0xa6ea)[0x7fb54752b6ea]
/lib64/libglusterfs.so.0(+0x8c8d6)[0x7fb5531cf8d6]
/lib64/libpthread.so.0(+0x7ea5)[0x7fb551fa8ea5]
/lib64/libc.so.6(clone+0x6d)[0x7fb55186e8cd]
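Note that both backtraces fault inside AFR while a disconnect notification (event=6) is being delivered: frame #0 dereferences the translator's private data (`priv->child_count`, via `AFR_COUNT` at afr-transaction.c:871 and in the loop at afr-common.c:5024), which is consistent with the notify path reaching an xlator whose private data is already gone while shd-mux tears down or reattaches the graph. The sketch below is only an illustration of that pattern with simplified stand-in types; `has_quorum` and `count_up_children` are hypothetical names, not the actual glusterfs code:

```c
#include <stddef.h>

/* Simplified stand-in for afr_private_t */
typedef struct {
    int child_count;
    unsigned char *child_up; /* 1 = child up, 0 = down */
} afr_private_t;

/* Simplified stand-in for xlator_t; private becomes NULL/stale
 * once the graph containing this translator is cleaned up. */
typedef struct {
    afr_private_t *private;
} xlator_t;

/* AFR_COUNT-style helper: number of nonzero entries in the array. */
static int
count_up_children(const unsigned char *subvols, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (subvols[i])
            count++;
    return count;
}

/* Quorum check that guards against a torn-down translator instead of
 * blindly dereferencing priv, as the crashing frame #0 does. */
static int
has_quorum(xlator_t *this)
{
    afr_private_t *priv = this ? this->private : NULL;
    if (!priv || !priv->child_up)
        return 0; /* graph being cleaned up: report no quorum */
    int up = count_up_children(priv->child_up, priv->child_count);
    return up > priv->child_count / 2;
}
```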
upstream patch: https://review.gluster.org/#/c/glusterfs/+/23006/1