Bug 1597230 - glustershd crashes when index heal is launched before graph is initialized.
Summary: glustershd crashes when index heal is launched before graph is initialized.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 3.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Ravishankar N
QA Contact:
URL:
Whiteboard:
Depends On: 1596513
Blocks: 1460245 1593865 1595752 1597229
 
Reported: 2018-07-02 10:13 UTC by Ravishankar N
Modified: 2018-08-20 07:01 UTC
CC List: 2 users

Fixed In Version: glusterfs-3.12.12
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1596513
Environment:
Last Closed: 2018-08-20 07:01:24 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Ravishankar N 2018-07-02 10:13:33 UTC
+++ This bug was initially created as a clone of Bug #1596513 +++

Description of problem:
glustershd crashes when index heal is launched via CLI before graph is initialized.

Version-Release number of selected component (if applicable) / How reproducible:
I'm able to reproduce this easily on glusterfs-3.8.4 and very infrequently on glusterfs-3.12.2 (only once on 3.12.2).

Steps to Reproduce:
1. Create a replica 2 volume and start it.
2. Run `while true; do gluster volume heal <volname>; sleep 0.5; done` in one terminal.
3. In another terminal, keep running `service glusterd restart`.

Actual results:
Once in a while shd crashes and never comes up until manually restarted:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000040cdfa in glusterfs_handle_translator_op (req=0x7f0fa4003490) at glusterfsd-mgmt.c:793
793             any = active->first;
[Current thread is 1 (Thread 0x7f0face2c700 (LWP 3716))]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.25-12.fc26.x86_64 libgcc-7.2.1-2.fc26.x86_64 libuuid-2.30.2-1.fc26.x86_64 openssl-libs-1.1.0f-7.fc26.x86_64 sssd-client-1.16.0-1.fc26.x86_64 zlib-1.2.11-2.fc26.x86_64
(gdb) t a a bt

Thread 7 (Thread 0x7f0fabd86700 (LWP 3717)):
#0  0x00007f0fb65adce6 in fnmatch@@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007f0fb7f43f42 in gf_add_cmdline_options (graph=0x7f0fa4000c40, cmd_args=0x15c2010) at graph.c:299
#2  0x00007f0fb7f449c0 in glusterfs_graph_prepare (graph=0x7f0fa4000c40, ctx=0x15c2010, volume_name=0x0) at graph.c:588
#3  0x000000000040a74b in glusterfs_process_volfp (ctx=0x15c2010, fp=0x7f0fa4006920) at glusterfsd.c:2368
#4  0x000000000040fc81 in mgmt_getspec_cbk (req=0x7f0fa4001d10, iov=0x7f0fa4001d50, count=1, myframe=0x7f0fa4001560) at glusterfsd-mgmt.c:1989
#5  0x00007f0fb7cc26b5 in rpc_clnt_handle_reply (clnt=0x163fef0, pollin=0x7f0fa40061b0) at rpc-clnt.c:778
#6  0x00007f0fb7cc2c53 in rpc_clnt_notify (trans=0x1640120, mydata=0x163ff20, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f0fa40061b0) at rpc-clnt.c:971
#7  0x00007f0fb7cbecb8 in rpc_transport_notify (this=0x1640120, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f0fa40061b0) at rpc-transport.c:538
#8  0x00007f0fac41919e in socket_event_poll_in (this=0x1640120, notify_handled=_gf_true) at socket.c:2315
#9  0x00007f0fac4197c3 in socket_event_handler (fd=10, idx=1, gen=1, data=0x1640120, poll_in=1, poll_out=0, poll_err=0) at socket.c:2467
#10 0x00007f0fb7f6d367 in event_dispatch_epoll_handler (event_pool=0x15f9240, event=0x7f0fabd85e94) at event-epoll.c:583
#11 0x00007f0fb7f6d63e in event_dispatch_epoll_worker (data=0x1642f90) at event-epoll.c:659
#12 0x00007f0fb6d3736d in start_thread () from /lib64/libpthread.so.0
#13 0x00007f0fb65e0e1f in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f0fad62d700 (LWP 3715)):
#0  0x00007f0fb6d3deb6 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f0fb7f48274 in syncenv_task (proc=0x16033c0) at syncop.c:603
#2  0x00007f0fb7f4850f in syncenv_processor (thdata=0x16033c0) at syncop.c:695
#3  0x00007f0fb6d3736d in start_thread () from /lib64/libpthread.so.0
#4  0x00007f0fb65e0e1f in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f0fade2e700 (LWP 3714)):
#0  0x00007f0fb65a4c0d in nanosleep () from /lib64/libc.so.6
#1  0x00007f0fb65a4b4a in sleep () from /lib64/libc.so.6
#2  0x00007f0fb7f32762 in pool_sweeper (arg=0x0) at mem-pool.c:481
#3  0x00007f0fb6d3736d in start_thread () from /lib64/libpthread.so.0
#4  0x00007f0fb65e0e1f in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f0fae62f700 (LWP 3713)):
#0  0x00007f0fb6d41f56 in sigwait () from /lib64/libpthread.so.0
#1  0x000000000040a001 in glusterfs_sigwaiter (arg=0x7fff4608dcd0) at glusterfsd.c:2137
#2  0x00007f0fb6d3736d in start_thread () from /lib64/libpthread.so.0
#3  0x00007f0fb65e0e1f in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0faee30700 (LWP 3712)):
#0  0x00007f0fb6d4192d in nanosleep () from /lib64/libpthread.so.0
#1  0x00007f0fb7f0ee1c in gf_timer_proc (data=0x1600ed0) at timer.c:174
#2  0x00007f0fb6d3736d in start_thread () from /lib64/libpthread.so.0
#3  0x00007f0fb65e0e1f in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f0fb83fe780 (LWP 3711)):
#0  0x00007f0fb6d3883d in pthread_join () from /lib64/libpthread.so.0
#1  0x00007f0fb7f6d89c in event_dispatch_epoll (event_pool=0x15f9240) at event-epoll.c:746
#2  0x00007f0fb7f30f3a in event_dispatch (event_pool=0x15f9240) at event.c:124
#3  0x000000000040acce in main (argc=13, argv=0x7fff4608eec8) at glusterfsd.c:2550

Thread 1 (Thread 0x7f0face2c700 (LWP 3716)):
#0  0x000000000040cdfa in glusterfs_handle_translator_op (req=0x7f0fa4003490) at glusterfsd-mgmt.c:793
#1  0x00007f0fb7f47a44 in synctask_wrap () at syncop.c:375
#2  0x00007f0fb651c950 in ?? () from /lib64/libc.so.6
#3  0x0000000000000000 in ?? ()
(gdb) l
788                     goto out;
789             }
790
791             ctx = glusterfsd_ctx;
792             active = ctx->active;
793             any = active->first;
794             input = dict_new ();
795             ret = dict_unserialize (xlator_req.input.input_val,
796                                     xlator_req.input.input_len,
797                                     &input);
(gdb) p ctx->active
$1 = (glusterfs_graph_t *) 0x0
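
The gdb session above shows the root cause directly: glusterfsd_ctx->active is still NULL because glusterfs_graph_activate() has not run yet, while glusterfs_handle_translator_op() dereferences it unconditionally at line 793. The following stand-alone sketch models that failure mode; the struct and function names are simplified stand-ins for glusterfs_ctx_t, glusterfs_graph_t and the real handler, not actual GlusterFS code.

/* Stand-alone model of the crash (simplified stand-ins, not GlusterFS code). */
#include <stdio.h>

typedef struct graph {
        const char *first;      /* stands in for the graph's first xlator */
} graph_t;

typedef struct ctx {
        graph_t *active;        /* NULL until graph activation has run */
} ctx_t;

static ctx_t model_ctx;         /* zero-initialized, so active == NULL */

/* Models glusterfs_handle_translator_op(): no check on ctx->active. */
static void handle_translator_op(ctx_t *ctx)
{
        graph_t *active = ctx->active;
        /* SIGSEGV when active == NULL, matching glusterfsd-mgmt.c:793. */
        printf("first xlator: %s\n", active->first);
}

int main(void)
{
        /* The heal IPC is processed before the graph is activated,
         * so active is still NULL and the handler segfaults. */
        handle_translator_op(&model_ctx);
        return 0;
}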


Expected results:
shd must not crash

--- Additional comment from Worker Ant on 2018-06-29 03:28:22 EDT ---

REVIEW: https://review.gluster.org/20422 (glusterfsd: Do not process GLUSTERD_BRICK_XLATOR_OP if graph is not ready) posted (#1) for review on master by Ravishankar N

--- Additional comment from Worker Ant on 2018-07-02 06:10:49 EDT ---

COMMIT: https://review.gluster.org/20422 committed in master by "Atin Mukherjee" <amukherj> with a commit message- glusterfsd: Do not process GLUSTERD_BRICK_XLATOR_OP if graph is not ready

Problem:
If glustershd gets restarted by glusterd due to a node reboot, `volume start force`,
or anything else that changes the shd graph (add/remove brick), and index heal
is launched via the CLI, there is a chance that shd receives this IPC
before the graph is fully active. When it then accesses
glusterfsd_ctx->active, it crashes.

Fix:
Since glusterd does not wait for the daemons it spawns to be fully
initialized and can send the request as soon as RPC initialization has
succeeded, we handle this on the shd side: if glusterfs_graph_activate()
has not yet completed in shd when glusterd sends GD_OP_HEAL_VOLUME,
we fail the request.

Change-Id: If6cc07bc5455c4ba03458a36c28b63664496b17d
fixes: bz#1596513
Signed-off-by: Ravishankar N <ravishankar>
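
As a rough illustration of the guard described in this fix (a sketch only, not the actual patch; the types and names below are simplified stand-ins for glusterfs_ctx_t, glusterfs_graph_t and the real handler), the idea is to fail the brick op while ctx->active is still NULL and serve it normally once graph activation has populated the pointer:

/* Sketch of the guard: refuse the op while the graph is not ready. */
#include <stdio.h>

typedef struct graph {
        const char *first;      /* stands in for the graph's first xlator */
} graph_t;

typedef struct ctx {
        graph_t *active;        /* set only once the graph is activated */
} ctx_t;

static int handle_translator_op_guarded(ctx_t *ctx)
{
        graph_t *active = ctx->active;

        if (active == NULL) {
                /* Graph not yet activated: reject the request (e.g.
                 * GD_OP_HEAL_VOLUME) instead of crashing; the CLI can
                 * simply retry once shd is fully up. */
                fprintf(stderr, "graph not ready, failing request\n");
                return -1;
        }
        printf("dispatching op to %s\n", active->first);
        return 0;
}

int main(void)
{
        ctx_t ctx = { .active = NULL };
        graph_t g  = { .first = "glustershd-graph" };

        handle_translator_op_guarded(&ctx);     /* rejected: graph not ready */
        ctx.active = &g;                        /* graph activation completes */
        handle_translator_op_guarded(&ctx);     /* now dispatched normally */
        return 0;
}

With a check like this, a premature `gluster volume heal <volname>` just returns an error instead of leaving glustershd dead until it is manually restarted.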

Comment 1 Worker Ant 2018-07-02 10:39:15 UTC
REVIEW: https://review.gluster.org/20436 (glusterfsd: Do not process GLUSTERD_BRICK_XLATOR_OP if graph is not ready) posted (#1) for review on release-3.12 by Ravishankar N

Comment 2 Worker Ant 2018-07-04 04:05:23 UTC
COMMIT: https://review.gluster.org/20436 committed in release-3.12 by "jiffin tony Thottan" <jthottan> with a commit message- glusterfsd: Do not process GLUSTERD_BRICK_XLATOR_OP if graph is not ready

Backport of: https://review.gluster.org/#/c/20435/

Problem:
If glustershd gets restarted by glusterd due to a node reboot, `volume start force`,
or anything else that changes the shd graph (add/remove brick), and index heal
is launched via the CLI, there is a chance that shd receives this IPC
before the graph is fully active. When it then accesses
glusterfsd_ctx->active, it crashes.

Fix:
Since glusterd does not wait for the daemons it spawns to be fully
initialized and can send the request as soon as RPC initialization has
succeeded, we handle this on the shd side: if glusterfs_graph_activate()
has not yet completed in shd when glusterd sends GD_OP_HEAL_VOLUME,
we fail the request.

Change-Id: If6cc07bc5455c4ba03458a36c28b63664496b17d
BUG: 1597230
fixes: bz#1597230
Signed-off-by: Ravishankar N <ravishankar>

Comment 3 Jiffin 2018-08-20 07:01:24 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.12, please open a new bug report.

glusterfs-3.12.12 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2018-July/000105.html
[2] https://www.gluster.org/pipermail/gluster-users/

