Bug 1239156 - Glusterd crashed while glusterd service was shutting down
Summary: Glusterd crashed while glusterd service was shutting down
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Kaushal
QA Contact:
URL:
Whiteboard: GlusterD
Depends On: 1238067
Blocks: 1223636
 
Reported: 2015-07-03 19:24 UTC by Anand Nekkunti
Modified: 2019-05-11 11:38 UTC
CC: 5 users

Fixed In Version: glusterfs-6.x
Doc Type: Bug Fix
Doc Text:
Clone Of: 1238067
Environment:
Last Closed: 2019-05-11 11:38:06 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Anand Nekkunti 2015-07-03 19:24:03 UTC
+++ This bug was initially created as a clone of Bug #1238067 +++

Description of problem:
=======================

A glusterd crash was observed. No restarts were done, and there was I/O running from the client. A peer probe to another server failed with error 107.

Backtrace:
=========
(gdb) bt
#0  _rcu_read_lock_bp () at urcu/static/urcu-bp.h:199
#1  rcu_read_lock_bp () at urcu-bp.c:271
#2  0x00007fb1d33cd256 in __glusterd_peer_rpc_notify (rpc=0x7fb1df49c8d0, 
    mydata=<value optimized out>, event=RPC_CLNT_DISCONNECT, 
    data=<value optimized out>) at glusterd-handler.c:4996
#3  0x00007fb1d33b0c50 in glusterd_big_locked_notify (rpc=0x7fb1df49c8d0, 
    mydata=0x7fb1df49c250, event=RPC_CLNT_DISCONNECT, data=0x0, 
    notify_fn=0x7fb1d33cd1f0 <__glusterd_peer_rpc_notify>)
    at glusterd-handler.c:71
#4  0x00007fb1de793953 in rpc_clnt_notify (trans=<value optimized out>, 
    mydata=0x7fb1df49c900, event=<value optimized out>, 
    data=<value optimized out>) at rpc-clnt.c:861
#5  0x00007fb1de78ead8 in rpc_transport_notify (this=<value optimized out>, 
    event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:543
#6  0x00007fb1d1a53df1 in socket_event_poll_err (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1205
#7  socket_event_handler (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:2410
#8  0x00007fb1dea27970 in event_dispatch_epoll_handler (data=0x7fb1df4edda0)
    at event-epoll.c:575
#9  event_dispatch_epoll_worker (data=0x7fb1df4edda0) at event-epoll.c:678
#10 0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fb1dd41896d in clone () from /lib64/libc.so.6
(gdb)

(gdb) t a a bt

Thread 7 (Thread 0x7fb1cd6a7700 (LWP 10080)):
#0  0x00007fb1ddab2a0e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1dea0acab in syncenv_task (proc=0x7fb1df34bb00) at syncop.c:595
#2  0x00007fb1dea0fba0 in syncenv_processor (thdata=0x7fb1df34bb00)
    at syncop.c:687
#3  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fb1d5f03700 (LWP 2914)):
#0  0x00007fb1ddab5fbd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fb1de9e55ca in gf_timer_proc (ctx=0x7fb1df31d010) at timer.c:205
#2  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fb1d4100700 (LWP 2917)):
#0  0x00007fb1ddab2a0e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1dea0acab in syncenv_task (proc=0x7fb1df34afc0) at syncop.c:595
#2  0x00007fb1dea0fba0 in syncenv_processor (thdata=0x7fb1df34afc0)
    at syncop.c:687
#3  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fb1d02c3700 (LWP 3091)):
#0  0x00007fb1ddab263c in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1d3465973 in hooks_worker (args=<value optimized out>)
    at glusterd-hooks.c:534
#2  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fb1dee72740 (LWP 2913)):
#0  0x00007fb1ddaaf2ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fb1dea2741d in event_dispatch_epoll (event_pool=0x7fb1df33bc90)
    at event-epoll.c:762
#2  0x00007fb1dee8eef1 in main (argc=2, argv=0x7ffdfed58a08)
    at glusterfsd.c:2333

Thread 2 (Thread 0x7fb1d5502700 (LWP 2915)):
#0  0x00007fb1d2c1cf18 in _fini () from /usr/lib64/liburcu-cds.so.1.0.0
#1  0x00007fb1dec72c7c in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2  0x00007fb1dd365b22 in exit () from /lib64/libc.so.6
#3  0x00007fb1dee8cc03 in cleanup_and_exit (signum=<value optimized out>)
    at glusterfsd.c:1276
#4  0x00007fb1dee8d075 in glusterfs_sigwaiter (arg=<value optimized out>)
    at glusterfsd.c:1997
#5  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fb1cf8c2700 (LWP 3092)):
#0  _rcu_read_lock_bp () at urcu/static/urcu-bp.h:199
#1  rcu_read_lock_bp () at urcu-bp.c:271
#2  0x00007fb1d33cd256 in __glusterd_peer_rpc_notify (rpc=0x7fb1df49c8d0, 
    mydata=<value optimized out>, event=RPC_CLNT_DISCONNECT, 
    data=<value optimized out>) at glusterd-handler.c:4996
#3  0x00007fb1d33b0c50 in glusterd_big_locked_notify (rpc=0x7fb1df49c8d0, 
    mydata=0x7fb1df49c250, event=RPC_CLNT_DISCONNECT, data=0x0, 
    notify_fn=0x7fb1d33cd1f0 <__glusterd_peer_rpc_notify>)
    at glusterd-handler.c:71
#4  0x00007fb1de793953 in rpc_clnt_notify (trans=<value optimized out>, 
    mydata=0x7fb1df49c900, event=<value optimized out>, 
    data=<value optimized out>) at rpc-clnt.c:861
#5  0x00007fb1de78ead8 in rpc_transport_notify (this=<value optimized out>, 
    event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:543
#6  0x00007fb1d1a53df1 in socket_event_poll_err (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1205
#7  socket_event_handler (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:2410
#8  0x00007fb1dea27970 in event_dispatch_epoll_handler (data=0x7fb1df4edda0)
    at event-epoll.c:575
#9  event_dispatch_epoll_worker (data=0x7fb1df4edda0) at event-epoll.c:678
#10 0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fb1dd41896d in clone () from /lib64/libc.so.6
(gdb) 


Version-Release number of selected component (if applicable):
=============================================================

[root@ninja core]# gluster --version
glusterfs 3.7.1 built on Jun 28 2015 11:01:17
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
[root@ninja core]# 

How reproducible:
================
seen once

Actual results:


Expected results:


Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-07-01 02:43:54 EDT ---

This bug is automatically being proposed for Red Hat Gluster Storage 3.1.0 by setting the release flag 'rhgs-3.1.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Bhaskarakiran on 2015-07-01 02:45:58 EDT ---



--- Additional comment from Bhaskarakiran on 2015-07-01 05:17:37 EDT ---

Copied the sosreports to the rhsqe-repo/sosreports/1238067 folder.

--- Additional comment from Bhaskarakiran on 2015-07-01 05:22:35 EDT ---

Time of crash:

-rw-------. 1 root root 232M Jun 30 16:14 core.2913.1435661084.dump

--- Additional comment from Bhaskarakiran on 2015-07-01 05:27:20 EDT ---

sosreport:

rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1238067/sosreport-sysreg-prod-20150701140725.tar.xz

--- Additional comment from Atin Mukherjee on 2015-07-01 23:48:18 EDT ---

The crash happened while the glusterd service was going down. It does not impact functionality; the crash is caused by a race between the cleanup thread and a running thread. The cleanup thread releases URCU resources while one of the running threads still tries to access them, resulting in a crash. Hence this can be deferred to 3.1.z.

--- Additional comment from Rejy M Cyriac on 2015-07-03 01:54:02 EDT ---

Since this BZ is not a blocker for the RHGS 3.1 release, and the phase for fixing non-blocker bugs is over for the release, re-proposing this BZ for the RHGS 3.1 Z-stream release.

Comment 1 Anand Avati 2015-07-03 19:36:24 UTC
REVIEW: http://review.gluster.org/11532 (glusterd/synctask: destroy all synctask and epoll threads in fini) posted (#1) for review on master by Anand Nekkunti (anekkunt@redhat.com)

Comment 2 Anand Avati 2015-07-03 19:48:02 UTC
REVIEW: http://review.gluster.org/11532 (glusterd/synctask: destroy all synctask and epoll threads in fini) posted (#2) for review on master by Anand Nekkunti (anekkunt@redhat.com)

Comment 3 Anand Avati 2015-07-04 08:19:37 UTC
REVIEW: http://review.gluster.org/11532 (glusterd/synctask: destroy all synctask and epoll threads in fini) posted (#3) for review on master by Anand Nekkunti (anekkunt@redhat.com)

Comment 6 Mike McCune 2016-03-28 22:51:39 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation, please see mmccune@redhat.com with any questions

Comment 7 Atin Mukherjee 2018-10-05 03:41:14 UTC
Fixed through commit 6b58e84

commit 6b58e8426a36bc544c06a599311999bf89ad04f2
Author: Atin Mukherjee <amukherj@redhat.com>
Date:   Wed Oct 3 16:34:54 2018 +0530

    glusterd: ignore RPC events when glusterd is shutting down
    
    When glusterd receives a SIGTERM while it receives RPC
    connect/disconnect/destroy events, the thread might lead to a crash
    while accessing rcu_read_lock () as the clean up thread might have
    already freed up the resources. This is more observable when glusterd
    comes up with upgrade mode = on during upgrade process.
    
    The solution is to ignore these events if glusterd is already in the
    middle of cleanup_and_exit ().
    
    Fixes: bz#1635593
    Change-Id: I12831d31c2f689d4deb038b83b9421bd5cce26d9
    Signed-off-by: Atin Mukherjee <amukherj@redhat.com>

