Bug 1635593 - glusterd crashed in cleanup_and_exit when glusterd comes up with upgrade mode.
Summary: glusterd crashed in cleanup_and_exit when glusterd comes up with upgrade mode.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1635071
 
Reported: 2018-10-03 11:09 UTC by Atin Mukherjee
Modified: 2019-03-25 16:31 UTC (History)
9 users

Fixed In Version: glusterfs-6.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1635071
Environment:
Last Closed: 2019-03-25 16:31:10 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Comment 1 Atin Mukherjee 2018-10-03 11:15:21 UTC
When glusterd is brought up in upgrade mode, it crashes with the following backtrace:

Backtrace:
=========
(gdb) bt
#0  _rcu_read_lock_bp () at urcu/static/urcu-bp.h:199
#1  rcu_read_lock_bp () at urcu-bp.c:271
#2  0x00007fb1d33cd256 in __glusterd_peer_rpc_notify (rpc=0x7fb1df49c8d0, 
    mydata=<value optimized out>, event=RPC_CLNT_DISCONNECT, 
    data=<value optimized out>) at glusterd-handler.c:4996
#3  0x00007fb1d33b0c50 in glusterd_big_locked_notify (rpc=0x7fb1df49c8d0, 
    mydata=0x7fb1df49c250, event=RPC_CLNT_DISCONNECT, data=0x0, 
    notify_fn=0x7fb1d33cd1f0 <__glusterd_peer_rpc_notify>)
    at glusterd-handler.c:71
#4  0x00007fb1de793953 in rpc_clnt_notify (trans=<value optimized out>, 
    mydata=0x7fb1df49c900, event=<value optimized out>, 
    data=<value optimized out>) at rpc-clnt.c:861
#5  0x00007fb1de78ead8 in rpc_transport_notify (this=<value optimized out>, 
    event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:543
#6  0x00007fb1d1a53df1 in socket_event_poll_err (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1205
#7  socket_event_handler (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:2410
#8  0x00007fb1dea27970 in event_dispatch_epoll_handler (data=0x7fb1df4edda0)
    at event-epoll.c:575
#9  event_dispatch_epoll_worker (data=0x7fb1df4edda0) at event-epoll.c:678
#10 0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fb1dd41896d in clone () from /lib64/libc.so.6
(gdb)

(gdb) t a a bt

Thread 7 (Thread 0x7fb1cd6a7700 (LWP 10080)):
#0  0x00007fb1ddab2a0e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1dea0acab in syncenv_task (proc=0x7fb1df34bb00) at syncop.c:595
#2  0x00007fb1dea0fba0 in syncenv_processor (thdata=0x7fb1df34bb00)
    at syncop.c:687
#3  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fb1d5f03700 (LWP 2914)):
#0  0x00007fb1ddab5fbd in nanosleep () from /lib64/libpthread.so.0
#1  0x00007fb1de9e55ca in gf_timer_proc (ctx=0x7fb1df31d010) at timer.c:205
#2  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7fb1d4100700 (LWP 2917)):
#0  0x00007fb1ddab2a0e in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1dea0acab in syncenv_task (proc=0x7fb1df34afc0) at syncop.c:595
#2  0x00007fb1dea0fba0 in syncenv_processor (thdata=0x7fb1df34afc0)
    at syncop.c:687
#3  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fb1d02c3700 (LWP 3091)):
#0  0x00007fb1ddab263c in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007fb1d3465973 in hooks_worker (args=<value optimized out>)
    at glusterd-hooks.c:534
#2  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fb1dee72740 (LWP 2913)):
#0  0x00007fb1ddaaf2ad in pthread_join () from /lib64/libpthread.so.0
#1  0x00007fb1dea2741d in event_dispatch_epoll (event_pool=0x7fb1df33bc90)
    at event-epoll.c:762
#2  0x00007fb1dee8eef1 in main (argc=2, argv=0x7ffdfed58a08)
    at glusterfsd.c:2333

Thread 2 (Thread 0x7fb1d5502700 (LWP 2915)):
#0  0x00007fb1d2c1cf18 in _fini () from /usr/lib64/liburcu-cds.so.1.0.0
#1  0x00007fb1dec72c7c in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#2  0x00007fb1dd365b22 in exit () from /lib64/libc.so.6
#3  0x00007fb1dee8cc03 in cleanup_and_exit (signum=<value optimized out>)
    at glusterfsd.c:1276
#4  0x00007fb1dee8d075 in glusterfs_sigwaiter (arg=<value optimized out>)
    at glusterfsd.c:1997
#5  0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fb1dd41896d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fb1cf8c2700 (LWP 3092)):
#0  _rcu_read_lock_bp () at urcu/static/urcu-bp.h:199
#1  rcu_read_lock_bp () at urcu-bp.c:271
#2  0x00007fb1d33cd256 in __glusterd_peer_rpc_notify (rpc=0x7fb1df49c8d0, 
    mydata=<value optimized out>, event=RPC_CLNT_DISCONNECT, 
    data=<value optimized out>) at glusterd-handler.c:4996
#3  0x00007fb1d33b0c50 in glusterd_big_locked_notify (rpc=0x7fb1df49c8d0, 
    mydata=0x7fb1df49c250, event=RPC_CLNT_DISCONNECT, data=0x0, 
    notify_fn=0x7fb1d33cd1f0 <__glusterd_peer_rpc_notify>)
    at glusterd-handler.c:71
#4  0x00007fb1de793953 in rpc_clnt_notify (trans=<value optimized out>, 
    mydata=0x7fb1df49c900, event=<value optimized out>, 
    data=<value optimized out>) at rpc-clnt.c:861
#5  0x00007fb1de78ead8 in rpc_transport_notify (this=<value optimized out>, 
    event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:543
#6  0x00007fb1d1a53df1 in socket_event_poll_err (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1205
#7  socket_event_handler (fd=<value optimized out>, 
    idx=<value optimized out>, data=0x7fb1df49fa60, 
    poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:2410
#8  0x00007fb1dea27970 in event_dispatch_epoll_handler (data=0x7fb1df4edda0)
    at event-epoll.c:575
#9  event_dispatch_epoll_worker (data=0x7fb1df4edda0) at event-epoll.c:678
#10 0x00007fb1ddaaea51 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fb1dd41896d in clone () from /lib64/libc.so.6
(gdb) 
(gdb) 

The root cause of the issue is as follows:

This is a race in the cleanup path: the URCU resources had already been freed by the cleanup thread while another thread was still accessing them. The current implementation does not synchronize the threads with respect to cleanup, which is why the crash is observed. It is seen more often during upgrade because, in this mode, glusterd just regenerates the volfiles and shuts down; on a multi-node setup, however, glusterd also receives RPC connect events from other peers while processing cleanup_and_exit (), which hits this race more often.

Comment 2 Worker Ant 2018-10-04 04:03:38 UTC
REVIEW: https://review.gluster.org/21330 (glusterd: ignore RPC events when glusterd is shutting down) posted (#3) for review on master by Atin Mukherjee

Comment 3 Atin Mukherjee 2018-10-05 02:16:25 UTC
Interestingly, the smoke job didn't move this BZ to MODIFIED even though the patch is merged.

Comment 4 Shyamsundar 2019-03-25 16:31:10 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/

