Bug 1238067

Summary: Glusterd crashed while glusterd service was shutting down
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Bhaskarakiran <byarlaga>
Component: glusterd Assignee: Atin Mukherjee <amukherj>
Status: CLOSED WONTFIX QA Contact: Bala Konda Reddy M <bmekala>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rhgs-3.1 CC: abhaumik, amukherj, byarlaga, mlawrenc, mzywusko, nbalacha, nchilaka, nlevinki, nsathyan, rmekala, sanandpa, sasundar, tdesala, vbellur, vdas
Target Milestone: --- Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: GlusterD
Fixed In Version: Doc Type: Known Issue
Doc Text:
In rare instances, glusterd may crash when it is stopped. The crash is caused by a race between the clean up thread and a running thread and does not impact functionality. The clean up thread releases URCU resources while a running thread is still trying to access them, which results in a crash.
Story Points: ---
Clone Of:
: 1239156 (view as bug list) Environment:
Last Closed: 2016-01-08 08:58:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1216951, 1223636, 1239156, 1277939    
Attachments:
Description Flags
core file none

Description Bhaskarakiran 2015-07-01 06:43:52 UTC
Description of problem:
=======================

Saw a glusterd crash. No restarts were done and there was IO running from the client. Did a peer probe to another server, which failed with error 107.

Backtrace:
=========
(gdb) bt
#0  _rcu_read_lock_bp () at urcu/static/urcu-bp.h:199
#1  rcu_read_lock_bp () at urcu-bp.c:271
#2  0x00007fb1d33cd256 in __glusterd_peer_rpc_notify (rpc=0x7fb1df49c8d0, 
    mydata=<value optimized out>, event=RPC_CLNT_DISCONNECT, 
    data=<value optimized out>) at glusterd-handler.c:4996
#3  0x00007fb1d33b0c50 in glusterd_big_locked_notify (rpc=0x7fb1df49c8d0, 
    mydata=0x7fb1df49c250, event=RPC_CLNT_DISCONNECT, data=0x0, 
    notify_fn=0x7fb1d33cd1f0 <__glusterd_peer_rpc_notify>)
    at glusterd-handler.c:71
#4  0x00007fb1de793953 in rpc_clnt_notify (trans=<value optimized out>, 
    mydata=0x7fb1df49c900, event=<value optimized out>, 
    data=<value optimized out>) at rpc-clnt.c:861
#5  0x00007fb1de78ead8 in rpc_transport_notify (this=<value optimized out>, 
    event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:543
#6  0x00007fb1d1a53df1 in socket_event_poll_err (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1205
#7  socket_event_handler (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:2410
#8  0x00007fb1dea27970 in event_dispatch_epoll_handler (data=0x7fb1df4edda0)
    at event-epoll.c:575
#9  event_dispatch_epoll_worker (data=0x7fb1df4edda0) at event-epoll.c:678
#10 0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fb1dd41896d in clone () from /lib64/libc.so.6
(gdb)

(gdb) t a a bt

Thread 7 (Thread 0x7fb1cd6a7700 (LWP 10080)):
#0  0x00007fb1ddab2a0e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1dea0acab in syncenv_task (proc=0x7fb1df34bb00) at syncop.c:595
#2  0x00007fb1dea0fba0 in syncenv_processor (thdata=0x7fb1df34bb00)
    at syncop.c:687
#3  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fb1d5f03700 (LWP 2914)):
#0  0x00007fb1ddab5fbd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fb1de9e55ca in gf_timer_proc (ctx=0x7fb1df31d010) at timer.c:205
#2  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fb1d4100700 (LWP 2917)):
#0  0x00007fb1ddab2a0e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1dea0acab in syncenv_task (proc=0x7fb1df34afc0) at syncop.c:595
#2  0x00007fb1dea0fba0 in syncenv_processor (thdata=0x7fb1df34afc0)
    at syncop.c:687
#3  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fb1d02c3700 (LWP 3091)):
#0  0x00007fb1ddab263c in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1d3465973 in hooks_worker (args=<value optimized out>)
    at glusterd-hooks.c:534
#2  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fb1dee72740 (LWP 2913)):
#0  0x00007fb1ddaaf2ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fb1dea2741d in event_dispatch_epoll (event_pool=0x7fb1df33bc90)
    at event-epoll.c:762
#2  0x00007fb1dee8eef1 in main (argc=2, argv=0x7ffdfed58a08)
    at glusterfsd.c:2333

Thread 2 (Thread 0x7fb1d5502700 (LWP 2915)):
#0  0x00007fb1d2c1cf18 in _fini () from /usr/lib64/liburcu-cds.so.1.0.0
#1  0x00007fb1dec72c7c in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2  0x00007fb1dd365b22 in exit () from /lib64/libc.so.6
#3  0x00007fb1dee8cc03 in cleanup_and_exit (signum=<value optimized out>)
    at glusterfsd.c:1276
#4  0x00007fb1dee8d075 in glusterfs_sigwaiter (arg=<value optimized out>)
    at glusterfsd.c:1997
#5  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fb1cf8c2700 (LWP 3092)):
#0  _rcu_read_lock_bp () at urcu/static/urcu-bp.h:199
#1  rcu_read_lock_bp () at urcu-bp.c:271
#2  0x00007fb1d33cd256 in __glusterd_peer_rpc_notify (rpc=0x7fb1df49c8d0, 
    mydata=<value optimized out>, event=RPC_CLNT_DISCONNECT, 
    data=<value optimized out>) at glusterd-handler.c:4996
#3  0x00007fb1d33b0c50 in glusterd_big_locked_notify (rpc=0x7fb1df49c8d0, 
    mydata=0x7fb1df49c250, event=RPC_CLNT_DISCONNECT, data=0x0, 
    notify_fn=0x7fb1d33cd1f0 <__glusterd_peer_rpc_notify>)
    at glusterd-handler.c:71
#4  0x00007fb1de793953 in rpc_clnt_notify (trans=<value optimized out>, 
    mydata=0x7fb1df49c900, event=<value optimized out>, 
    data=<value optimized out>) at rpc-clnt.c:861
#5  0x00007fb1de78ead8 in rpc_transport_notify (this=<value optimized out>, 
    event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:543
#6  0x00007fb1d1a53df1 in socket_event_poll_err (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1205
#7  socket_event_handler (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:2410
#8  0x00007fb1dea27970 in event_dispatch_epoll_handler (data=0x7fb1df4edda0)
    at event-epoll.c:575
#9  event_dispatch_epoll_worker (data=0x7fb1df4edda0) at event-epoll.c:678
#10 0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fb1dd41896d in clone () from /lib64/libc.so.6
(gdb) 
(gdb) 


Version-Release number of selected component (if applicable):
=============================================================

[root@ninja core]# gluster --version
glusterfs 3.7.1 built on Jun 28 2015 11:01:17
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
[root@ninja core]# 

How reproducible:
================
seen once

Actual results:


Expected results:


Additional info:

Comment 2 Bhaskarakiran 2015-07-01 06:45:58 UTC
Created attachment 1044920 [details]
core file

Comment 4 Bhaskarakiran 2015-07-01 09:22:35 UTC
time of crash :

-rw-------. 1 root root 232M Jun 30 16:14 core.2913.1435661084.dump

Comment 5 Bhaskarakiran 2015-07-01 09:27:20 UTC
sosreport:

rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1238067/sosreport-sysreg-prod-20150701140725.tar.xz

Comment 8 Anand Nekkunti 2015-07-15 07:02:20 UTC
Upstream patch: http://review.gluster.org/#/c/11532/

Comment 11 Anand Nekkunti 2015-11-19 10:44:01 UTC
*** Bug 1283139 has been marked as a duplicate of this bug. ***

Comment 13 Atin Mukherjee 2016-01-08 08:58:25 UTC
This is a rare race in the clean up path: the URCU resources had already been released by the clean up thread while another thread was still accessing them. It is observed because the current implementation does not synchronize the threads with respect to clean up. Fixing it would need changes in the sync-op framework, which is non-trivial. As this doesn't impact functionality and is one of the rarest races to hit, we are not planning to chase this bug down and are closing it. Feel free to reopen with proper justification if you think otherwise.
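
The kind of synchronization described above can be sketched as follows. This is only an illustration of the general pattern (refuse new callbacks, then wait for running ones before releasing shared resources); it is written in Python for brevity, it is not the upstream patch, and it does not use liburcu or the sync-op framework. The names ShutdownGate, try_enter, leave, close_and_drain and demo are hypothetical.

```python
import threading


class ShutdownGate:
    """Tracks in-flight callbacks so clean up can wait for them to finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0
        self._closing = False

    def try_enter(self):
        """Return True if the caller may touch the shared resources."""
        with self._cond:
            if self._closing:
                return False
            self._active += 1
            return True

    def leave(self):
        """Mark one callback as finished and wake a waiting clean up thread."""
        with self._cond:
            self._active -= 1
            if self._active == 0:
                self._cond.notify_all()

    def close_and_drain(self, timeout=None):
        """Refuse new entries, then wait until running callbacks have left."""
        with self._cond:
            self._closing = True
            return self._cond.wait_for(lambda: self._active == 0, timeout)


def demo():
    """Run one callback concurrently with clean up and record the order."""
    gate = ShutdownGate()
    resource = {"alive": True}  # stands in for the shared resources
    events = []
    started = threading.Event()
    release = threading.Event()

    def notify_handler():
        if not gate.try_enter():
            events.append("skipped")  # shutdown already in progress
            return
        try:
            started.set()
            release.wait(timeout=5)
            events.append("used" if resource["alive"] else "use-after-free")
        finally:
            gate.leave()

    def cleanup():
        gate.close_and_drain()        # blocks while the handler is running
        resource["alive"] = False     # only now release the resources
        events.append("released")

    worker = threading.Thread(target=notify_handler)
    worker.start()
    started.wait(timeout=5)
    cleaner = threading.Thread(target=cleanup)
    cleaner.start()
    release.set()
    worker.join()
    cleaner.join()
    notify_handler()                  # a late callback after shutdown
    return events
```

Calling demo() shows the intended ordering: the running callback finishes before the resources are released, and a callback arriving after shutdown is turned away instead of touching freed state. The cost is that every callback has to pass through the gate, which is part of why such a change is invasive in an existing code base.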

Comment 14 Atin Mukherjee 2016-11-23 07:55:36 UTC
*** Bug 1397669 has been marked as a duplicate of this bug. ***

Comment 15 Atin Mukherjee 2017-03-28 03:30:04 UTC
*** Bug 1434047 has been marked as a duplicate of this bug. ***

Comment 16 Atin Mukherjee 2017-04-18 07:15:21 UTC
*** Bug 1442928 has been marked as a duplicate of this bug. ***

Comment 17 Atin Mukherjee 2018-01-04 13:59:48 UTC
*** Bug 1530936 has been marked as a duplicate of this bug. ***

Comment 18 Atin Mukherjee 2018-02-14 12:10:29 UTC
*** Bug 1545045 has been marked as a duplicate of this bug. ***

Comment 19 Nag Pavan Chilakam 2018-05-16 11:17:00 UTC
I am seeing the crash below consistently on all nodes when upgrading from 3.8.4-54.8 to 3.12.2-9.
The backtrace is below.
Atin, kindly confirm whether this is the same issue.
(I see this crash on yum update glusterfs-server.)


[root@dhcp37-41 ~]# file /core.17780 
/core.17780: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'glusterd --xlator-option *.upgrade=on -N', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/sbin/glusterd', platform: 'x86_64'


warning: core file may not match specified executable file.
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/lib/debug/usr/sbin/glusterfsd.debug...done.
done.
Missing separate debuginfo for 
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/3b/b87246fcddff47293950c06e763e44f866502e
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `glusterd --xlator-option *.upgrade=on -N'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f4f63944d8b in rcu_bp_register () from /lib64/liburcu-bp.so.1
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 device-mapper-event-libs-1.02.146-4.el7.x86_64 device-mapper-libs-1.02.146-4.el7.x86_64 elfutils-libelf-0.170-4.el7.x86_64 elfutils-libs-0.170-4.el7.x86_64 glibc-2.17-222.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libblkid-2.23.2-52.el7.x86_64 libcap-2.22-9.el7.x86_64 libcom_err-1.42.9-12.el7_5.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 libselinux-2.5-12.el7.x86_64 libsepol-2.5-8.1.el7.x86_64 libuuid-2.23.2-52.el7.x86_64 libxml2-2.9.1-6.el7_2.3.x86_64 lvm2-libs-2.02.177-4.el7.x86_64 openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64 systemd-libs-219-57.el7.x86_64 userspace-rcu-0.7.9-2.el7rhgs.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00007f4f63944d8b in rcu_bp_register () from /lib64/liburcu-bp.so.1
#1  0x00007f4f639450ce in rcu_read_lock_bp () from /lib64/liburcu-bp.so.1
#2  0x00007f4f63ee087c in __glusterd_peer_rpc_notify (rpc=rpc@entry=0x55d7715674a0, 
    mydata=mydata@entry=0x55d771565e10, event=event@entry=RPC_CLNT_CONNECT, data=data@entry=0x0)
    at glusterd-handler.c:6372
#3  0x00007f4f63ed6a5a in glusterd_big_locked_notify (rpc=0x55d7715674a0, mydata=0x55d771565e10, 
    event=RPC_CLNT_CONNECT, data=0x0, notify_fn=0x7f4f63ee0830 <__glusterd_peer_rpc_notify>)
    at glusterd-handler.c:70
#4  0x00007f4f6f215594 in rpc_clnt_notify (trans=<optimized out>, mydata=0x55d7715674d0, event=<optimized out>, 
    data=0x55d7715676d0) at rpc-clnt.c:1004
#5  0x00007f4f6f211393 in rpc_transport_notify (this=this@entry=0x55d7715676d0, 
    event=event@entry=RPC_TRANSPORT_CONNECT, data=data@entry=0x55d7715676d0) at rpc-transport.c:538
#6  0x00007f4f6111e367 in socket_connect_finish (this=this@entry=0x55d7715676d0) at socket.c:2404
#7  0x00007f4f61122aa8 in socket_event_handler (fd=11, idx=2, gen=1, data=0x55d7715676d0, poll_in=0, poll_out=4, 
    poll_err=0) at socket.c:2456
#8  0x00007f4f6f4aae34 in event_dispatch_epoll_handler (event=0x7f4f5f173e80, event_pool=0x55d7714a7210)
    at event-epoll.c:583
#9  event_dispatch_epoll_worker (data=0x55d771571f10) at event-epoll.c:659
#10 0x00007f4f6e2abdd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f4f6db74b3d in clone () from /lib64/libc.so.6
(gdb) t a a bt

Thread 8 (Thread 0x7f4f6624b700 (LWP 17782)):
#0  0x00007f4f613795b0 in _fini () from /lib64/libpcre.so.1
#1  0x00007f4f6f7321a8 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2  0x00007f4f6daafb69 in __run_exit_handlers () from /lib64/libc.so.6
#3  0x00007f4f6daafbb7 in exit () from /lib64/libc.so.6
#4  0x000055d76fe4c4df in cleanup_and_exit (signum=15) at glusterfsd.c:1423
#5  0x000055d76fe4c5d5 in glusterfs_sigwaiter (arg=<optimized out>) at glusterfsd.c:2145
#6  0x00007f4f6e2abdd5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f4f6db74b3d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f4f6f935780 (LWP 17780)):
#0  0x00007f4f6e2acf47 in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f4f6f4ab468 in event_dispatch_epoll (event_pool=0x55d7714a7210) at event-epoll.c:746
#2  0x000055d76fe492a7 in main (argc=4, argv=<optimized out>) at glusterfsd.c:2550

Thread 6 (Thread 0x7f4f65249700 (LWP 17784)):
#0  0x00007f4f6e2afcf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4f6f489008 in syncenv_task (proc=proc@entry=0x55d7714af0e0) at syncop.c:603
#2  0x00007f4f6f489ed0 in syncenv_processor (thdata=0x55d7714af0e0) at syncop.c:695
#3  0x00007f4f6e2abdd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f4f6db74b3d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f4f66a4c700 (LWP 17781)):
#0  0x00007f4f6e2b2eed in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f4f6f45b986 in gf_timer_proc (data=0x55d7714ae8c0) at timer.c:174
#2  0x00007f4f6e2abdd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f4f6db74b3d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f4f64a48700 (LWP 17785)):
#0  0x00007f4f6e2afcf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4f6f489008 in syncenv_task (proc=proc@entry=0x55d7714af4a0) at syncop.c:603
#2  0x00007f4f6f489ed0 in syncenv_processor (thdata=0x55d7714af4a0) at syncop.c:695
#3  0x00007f4f6e2abdd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f4f6db74b3d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f4f65a4a700 (LWP 17783)):
#0  0x00007f4f6db3b4fd in nanosleep () from /lib64/libc.so.6
#1  0x00007f4f6db3b394 in sleep () from /lib64/libc.so.6
#2  0x00007f4f6f4761bd in pool_sweeper (arg=<optimized out>) at mem-pool.c:481
#3  0x00007f4f6e2abdd5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f4f6db74b3d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f4f5f975700 (LWP 17786)):
#0  0x00007f4f6e2af945 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f4f63f9602b in hooks_worker (args=<optimized out>) at glusterd-hooks.c:529
#2  0x00007f4f6e2abdd5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f4f6db74b3d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f4f5f174700 (LWP 17787)):
#0  0x00007f4f63944d8b in rcu_bp_register () from /lib64/liburcu-bp.so.1
#1  0x00007f4f639450ce in rcu_read_lock_bp () from /lib64/liburcu-bp.so.1
#2  0x00007f4f63ee087c in __glusterd_peer_rpc_notify (rpc=rpc@entry=0x55d7715674a0, 
    mydata=mydata@entry=0x55d771565e10, event=event@entry=RPC_CLNT_CONNECT, data=data@entry=0x0)
    at glusterd-handler.c:6372
#3  0x00007f4f63ed6a5a in glusterd_big_locked_notify (rpc=0x55d7715674a0, mydata=0x55d771565e10, 
    event=RPC_CLNT_CONNECT, data=0x0, notify_fn=0x7f4f63ee0830 <__glusterd_peer_rpc_notify>)
    at glusterd-handler.c:70
#4  0x00007f4f6f215594 in rpc_clnt_notify (trans=<optimized out>, mydata=0x55d7715674d0, event=<optimized out>, 
    data=0x55d7715676d0) at rpc-clnt.c:1004
#5  0x00007f4f6f211393 in rpc_transport_notify (this=this@entry=0x55d7715676d0, 
    event=event@entry=RPC_TRANSPORT_CONNECT, data=data@entry=0x55d7715676d0) at rpc-transport.c:538
#6  0x00007f4f6111e367 in socket_connect_finish (this=this@entry=0x55d7715676d0) at socket.c:2404
#7  0x00007f4f61122aa8 in socket_event_handler (fd=11, idx=2, gen=1, data=0x55d7715676d0, poll_in=0, poll_out=4, 
    poll_err=0) at socket.c:2456
#8  0x00007f4f6f4aae34 in event_dispatch_epoll_handler (event=0x7f4f5f173e80, event_pool=0x55d7714a7210)
    at event-epoll.c:583
#9  event_dispatch_epoll_worker (data=0x55d771571f10) at event-epoll.c:659
#10 0x00007f4f6e2abdd5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f4f6db74b3d in clone () from /lib64/libc.so.6
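
One rough way to compare a new core against the signature in this bug is to check whether one thread is inside exit() via cleanup_and_exit while a different thread is inside rcu_read_lock_bp. The Python sketch below does that on the text of a "t a a bt" dump. It is a heuristic, not a Gluster or gdb tool; the function names it looks for are taken from the backtraces in this report, and frames_by_thread and looks_like_shutdown_rcu_race are hypothetical helper names. A match only suggests the same race and does not replace a review of the core.

```python
import re

# "#3  0x... in exit () from ..." or "#9  event_dispatch_epoll_worker (data=...)"
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+ in )?([A-Za-z_][\w@.]*) \(")
# "Thread 8 (Thread 0x7f4f6624b700 (LWP 17782)):"
THREAD_RE = re.compile(r"^Thread (\d+) \(")


def frames_by_thread(text):
    """Map gdb thread number -> list of function names, innermost first."""
    threads = {}
    current = None
    for line in text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            threads[current] = []
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            threads[current].append(frame.group(1))
    return threads


def looks_like_shutdown_rcu_race(text):
    """True if one thread is exiting while another is taking the RCU read lock."""
    threads = frames_by_thread(text)
    exiting = {t for t, fs in threads.items()
               if "exit" in fs and "cleanup_and_exit" in fs}
    in_rcu = {t for t, fs in threads.items() if "rcu_read_lock_bp" in fs}
    return bool(exiting) and bool(in_rcu - exiting)


# Abbreviated excerpt of the dump above, used only as sample input.
SAMPLE = """\
Thread 8 (Thread 0x7f4f6624b700 (LWP 17782)):
#0  0x00007f4f613795b0 in _fini () from /lib64/libpcre.so.1
#1  0x00007f4f6f7321a8 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#3  0x00007f4f6daafbb7 in exit () from /lib64/libc.so.6
#4  0x000055d76fe4c4df in cleanup_and_exit (signum=15) at glusterfsd.c:1423

Thread 1 (Thread 0x7f4f5f174700 (LWP 17787)):
#0  0x00007f4f63944d8b in rcu_bp_register () from /lib64/liburcu-bp.so.1
#1  0x00007f4f639450ce in rcu_read_lock_bp () from /lib64/liburcu-bp.so.1
#2  0x00007f4f63ee087c in __glusterd_peer_rpc_notify (rpc=rpc@entry=0x55d7715674a0,
"""

if __name__ == "__main__":
    print(looks_like_shutdown_rcu_race(SAMPLE))
```

Lines that are not thread headers or frames, such as gdb pager prompts and continuation lines of a frame, are ignored by the parser.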

Comment 20 Nag Pavan Chilakam 2018-05-16 11:30:02 UTC
While doing a yum update, the CLI shows the message below:

/var/tmp/rpm-tmp.O031c7: line 26: 14148 Segmentation fault      (core dumped) glusterd --xlator-option *.upgrade=on -N
  Verifying  : glusterfs-client-xlators-3.12.2-9.el7rhgs.x86_64                                                 1/14 
  Verifying  : glusterfs-3.12.2-9.el7rhgs.x86_64                                                                2/14 
  Verifying  : glusterfs-api-3.12.2-9.el7rhgs.x86_64

Comment 22 Sanju 2018-08-28 12:16:16 UTC
*** Bug 1622554 has been marked as a duplicate of this bug. ***