Bug 1809438 - [brickmux]: glustershd crashed when rebooting 1/3 nodes at regular intervals
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Xavi Hernandez
QA Contact:
URL:
Whiteboard:
Depends On: 1808875
Blocks:
Reported: 2020-03-03 06:42 UTC by Xavi Hernandez
Modified: 2020-04-22 11:17 UTC (History)
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1808875
Environment:
Last Closed: 2020-03-12 13:27:41 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:




Links
System ID Private Priority Status Summary Last Updated
Gluster.org Gerrit 24196 0 None Merged cluster/afr: fix race when bricks come up 2020-03-16 08:19:22 UTC

Description Xavi Hernandez 2020-03-03 06:42:59 UTC
+++ This bug was initially created as a clone of Bug #1808875 +++

+++ This bug was initially created as a clone of Bug #1794663 +++

Description of problem:
=====================
glustershd crashed on 2 nodes when testing the fix for BZ#1793035
More details are available at https://bugzilla.redhat.com/show_bug.cgi?id=1793035#c21

Also, another issue:
I see "Transport endpoint is not connected" errors for some of the volumes intermittently when n2 is rebooted.
This is consistently seen for a set of volumes (all arbiter), and the issue goes away as soon as n2 is back up.


eg:
[root@rhsqa7 glusterfs]# pwd
/var/log/glusterfs
[root@rhsqa7 glusterfs]# less mnt-arb_4yqv4qry9ogxx.log 
[root@rhsqa7 glusterfs]# cd /mnt/arb_4yqv4qry9ogxx 
-bash: cd: /mnt/arb_4yqv4qry9ogxx: Transport endpoint is not connected
[root@rhsqa7 glusterfs]# 
[root@rhsqa7 glusterfs]# vim mnt-arb_4yqv4qry9ogxx.log 
[root@rhsqa7 glusterfs]# cd /mnt/arb_4yqv4qry9ogxx 
[root@rhsqa7 arb_4yqv4qry9ogxx]# ls
kernel
[root@rhsqa7 arb_4yqv4qry9ogxx]# ls
ls: cannot open directory .: Transport endpoint is not connected
[root@rhsqa7 arb_4yqv4qry9ogxx]# 
[root@rhsqa7 arb_4yqv4qry9ogxx]# 
[root@rhsqa7 arb_4yqv4qry9ogxx]# ls
ls: cannot open directory .: Transport endpoint is not connected
[root@rhsqa7 arb_4yqv4qry9ogxx]# 
[root@rhsqa7 arb_4yqv4qry9ogxx]# 
[root@rhsqa7 arb_4yqv4qry9ogxx]# ls
ls: cannot open directory .: Transport endpoint is not connected
[root@rhsqa7 arb_4yqv4qry9ogxx]# 
[root@rhsqa7 arb_4yqv4qry9ogxx]# pwd
/mnt/arb_4yqv4qry9ogxx
[root@rhsqa7 arb_4yqv4qry9ogxx]# 
[root@rhsqa7 arb_4yqv4qry9ogxx]# 
[root@rhsqa7 arb_4yqv4qry9ogxx]# ls
kernel
[root@rhsqa7 arb_4yqv4qry9ogxx]# 

[root@rhsqa7 arb_4yqv4qry9ogxx]# cd .meta
[root@rhsqa7 .meta]# ls
cmdline  frames  graphs  logging  mallinfo  master  measure_latency  process_uuid  version
[root@rhsqa7 .meta]# cd graphs/active/
[root@rhsqa7 active]# ls
arb_4yqv4qry9ogxx           arb_4yqv4qry9ogxx-dht          arb_4yqv4qry9ogxx-replicate-0   volfile
arb_4yqv4qry9ogxx-client-0  arb_4yqv4qry9ogxx-md-cache     arb_4yqv4qry9ogxx-write-behind
arb_4yqv4qry9ogxx-client-1  arb_4yqv4qry9ogxx-open-behind  meta-autoload
arb_4yqv4qry9ogxx-client-2  arb_4yqv4qry9ogxx-quick-read   top
[root@rhsqa7 active]# cat arb_4yqv4qry9ogxx-client-*/private |grep connect
connected = 1
connected = 0
connected = 1
[root@rhsqa7 active]# 


However, we should not see this at all, because the same node is brought down every time.

n1:
====
[root@rhsqa10 ~]# #gdb /core.10158 
[root@rhsqa10 ~]# 

warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/b5/a5458535a1397fa6baaf5e8c13a6395426a1b2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gl'.
Program terminated with signal 11, Segmentation fault.
#0  client4_0_lookup_cbk (req=0x7f5a38185988, iov=0x7f5a381859c0, count=<optimized out>, myframe=0x7f5a38002ca8)
    at client-rpc-fops_v2.c:2576
2576	    inode = local->loc.inode;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_7.2.x86_64 libcom_err-1.42.9-16.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 libuuid-2.23.2-61.el7_7.1.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) 
(gdb) bt
#0  client4_0_lookup_cbk (req=0x7f5a38185988, iov=0x7f5a381859c0, count=<optimized out>, myframe=0x7f5a38002ca8)
    at client-rpc-fops_v2.c:2576
#1  0x00007f660b06d0f1 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f65f3511830, pollin=pollin@entry=0x7f65f8218360)
    at rpc-clnt.c:764
#2  0x00007f660b06d457 in rpc_clnt_notify (trans=0x7f65f3511b60, mydata=0x7f65f3511860, event=<optimized out>, 
    data=0x7f65f8218360) at rpc-clnt.c:931
#3  0x00007f660b069af3 in rpc_transport_notify (this=this@entry=0x7f65f3511b60, 
    event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f65f8218360) at rpc-transport.c:547
#4  0x00007f65ff685af5 in socket_event_poll_in (notify_handled=true, this=0x7f65f3511b60) at socket.c:2582
#5  socket_event_handler (fd=<optimized out>, idx=<optimized out>, gen=<optimized out>, data=0x7f65f3511b60, 
    poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000') at socket.c:2991
#6  0x00007f660b328806 in event_dispatch_epoll_handler (event=0x7f65fddd1130, event_pool=0x55998e37e570)
    at event-epoll.c:656
#7  event_dispatch_epoll_worker (data=0x55998e3ce150) at event-epoll.c:769
#8  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f66099c88cd in clone () from /lib64/libc.so.6
(gdb) t a a bt

Thread 523 (Thread 0x7f65a67fc700 (LWP 10203)):
#0  0x00007f660a106de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f65fc6a20ce in __afr_shd_healer_wait (healer=healer@entry=0x7f65f154e3c0) at afr-self-heald.c:100
#2  0x00007f65fc6a214d in afr_shd_healer_wait (healer=healer@entry=0x7f65f154e3c0) at afr-self-heald.c:121
#3  0x00007f65fc6a3c10 in afr_shd_index_healer (data=0x7f65f154e3c0) at afr-self-heald.c:818
#4  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f66099c88cd in clone () from /lib64/libc.so.6

Thread 522 (Thread 0x7f62e2ffd700 (LWP 10581)):
#0  0x00007f660a106de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f65fc6a20ce in __afr_shd_healer_wait (healer=healer@entry=0x7f65f253d690) at afr-self-heald.c:100
#2  0x00007f65fc6a214d in afr_shd_healer_wait (healer=healer@entry=0x7f65f253d690) at afr-self-heald.c:121
#3  0x00007f65fc6a3c10 in afr_shd_index_healer (data=0x7f65f253d690) at afr-self-heald.c:818
#4  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f66099c88cd in clone () from /lib64/libc.so.6

Thread 521 (Thread 0x7f633bfff700 (LWP 10530)):
#0  0x00007f660a106de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f65fc6a20ce in __afr_shd_healer_wait (healer=healer@entry=0x7f65f230e280) at afr-self-heald.c:100
#2  0x00007f65fc6a214d in afr_shd_healer_wait (healer=healer@entry=0x7f65f230e280) at afr-self-heald.c:121
#3  0x00007f65fc6a3c10 in afr_shd_index_healer (data=0x7f65f230e280) at afr-self-heald.c:818
#4  0x00007f660a102ea5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f66099c88cd in clone () from /lib64/libc.so.6

---Type <return> to continue, or q <return> to quit---q
Quit



n3:
===
warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Cannot access memory at address 0x7fd91c78a128
Cannot access memory at address 0x7fd91c78a120
Failed to read a valid object file image from memory.
Core was generated by `/usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gl'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fd91c02226c in ?? ()
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd90e586db8: 
(gdb) 
#0  0x00007fd91c02226c in ?? ()
Cannot access memory at address 0x7fd90e586db8
(gdb) 
#0  0x00007fd91c02226c in ?? ()
Cannot access memory at address 0x7fd90e586db8
(gdb) t a a bt

Thread 523 (LWP 10404):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd80effc020: 

Thread 522 (LWP 10742):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd472ffbce0: 

Thread 521 (LWP 10668):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd538ff8020: 

Thread 520 (LWP 10751):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd461ffa020: 

Thread 519 (LWP 10758):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd44dffa020: 

Thread 518 (LWP 10653):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd55d7f9020: 

Thread 517 (LWP 10489):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd70effc020: 

Thread 516 (LWP 10703):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd4dd7f9020: 

Thread 515 (LWP 10734):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd4877fcce0: 

Thread 514 (LWP 10399):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd815ffa020: 

Thread 513 (LWP 10396):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd816ffc020: 

Thread 512 (LWP 10772):
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7fd429ffa020: 


Version-Release number of selected component (if applicable):
=============
6.0.29

How reproducible:
==================
Hit 2 different crashes on 2 different nodes at different intervals.

Steps to Reproduce:
================
1) 3-node cluster with brick multiplexing (brick-mux) enabled
2) Created 500 volumes, each either arbiter or replica-3 (x3) type
3) Mounted all 500 volumes on 3 different clients
4) Started I/O, i.e., copied a Linux kernel image to the mount point and untarred it.
This is one mount at a time, but with 3 screen sessions on each node, each selecting a different set of volumes.
Hence 9 parallel kernel untars on different volumes at any point in time.
5) Rebooted n2 (rhsqa11) every 15 minutes (via ssh from n1, i.e., rhsqa10)
6) Left it running overnight



Actual results:
================
1) Found the crash on n1 and n2.
2) I see "Transport endpoint is not connected" errors for some of the volumes intermittently when n2 is rebooted.
This is consistently seen for a set of volumes (all arbiter), and the issue goes away as soon as n2 is back up.

However, we should not see this at all, because the same node is brought down every time.

Comment 1 Worker Ant 2020-03-03 06:49:45 UTC
REVIEW: https://review.gluster.org/24196 (cluster/afr: fix race when bricks come up) posted (#1) for review on release-7 by Xavi Hernandez

Comment 2 Worker Ant 2020-03-12 13:27:41 UTC
This bug is moved to https://github.com/gluster/glusterfs/issues/1021, and will be tracked there from now on. Visit GitHub issues URL for further details

Comment 3 Worker Ant 2020-03-16 08:19:23 UTC
REVIEW: https://review.gluster.org/24196 (cluster/afr: fix race when bricks come up) merged (#2) on release-7 by Rinku Kothiya

