Bug 1422769 - brick process crashes when glusterd is restarted
Summary: brick process crashes when glusterd is restarted
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: core
Version: 3.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jeff Darcy
QA Contact:
URL:
Whiteboard: brick-multiplexing-testing
Depends On:
Blocks: 1430860
 
Reported: 2017-02-16 07:24 UTC by krishnaram Karthick
Modified: 2019-01-24 11:48 UTC
CC List: 5 users

Fixed In Version: glusterfs-3.10.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1430860
Environment:
Last Closed: 2017-04-05 00:01:13 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:
kramdoss: needinfo+



Description krishnaram Karthick 2017-02-16 07:24:51 UTC
Description of problem:
When glusterd was restarted on a cluster with around 50 volumes, the glusterfsd process crashed.

Brick multiplexing was enabled on the cluster.

(gdb) bt
#0  0x00007fc1d59c97b0 in glusterfs_graph_attach (orig_graph=0x0, path=<optimized out>) at graph.c:1085
#1  0x00007fc1d5e905da in glusterfs_handle_attach (req=0x7fc1c80034a0) at glusterfsd-mgmt.c:842
#2  0x00007fc1d59ca6d0 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#3  0x00007fc1d4088cf0 in ?? () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()
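
For illustration, here is a minimal stand-alone C sketch of the pattern the backtrace suggests. This is not GlusterFS source: the demo_* types and functions are invented and only loosely mirror glusterfs_handle_attach (frame #1) and glusterfs_graph_attach (frame #0). The point is simply that an attach handler forwards a process-wide "active graph" pointer that has not been set yet, and the attach routine dereferences it, which matches orig_graph=0x0 in frame #0 (the program below deliberately faults to show this).

#include <stdio.h>

typedef struct demo_graph {
        const char *first_xlator;   /* stand-in for the real graph contents */
} demo_graph_t;

typedef struct demo_ctx {
        demo_graph_t *active;       /* NULL until the first brick graph is ready */
} demo_ctx_t;

static demo_ctx_t demo_ctx;         /* zero-initialised: active == NULL */

/* Corresponds loosely to frame #0: the attach routine assumes a valid graph. */
static int demo_graph_attach(demo_graph_t *orig_graph, const char *path)
{
        /* Dereferencing orig_graph crashes when it is NULL (orig_graph=0x0). */
        printf("attaching %s under %s\n", path, orig_graph->first_xlator);
        return 0;
}

/* Corresponds loosely to frame #1: the handler passes the active graph through. */
static int demo_handle_attach(const char *brick_path)
{
        return demo_graph_attach(demo_ctx.active, brick_path);
}

int main(void)
{
        /* An attach request arrives before any graph has been activated,
         * so demo_ctx.active is still NULL and the call above faults. */
        return demo_handle_attach("/mnt/container_brick1/v1-b2");
}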

gluster v status vol1
Status of volume: vol1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.29:/mnt/container_brick1/v1-b1  49153     0          Y       15405
Brick 10.70.47.31:/mnt/container_brick1/v1-b1  N/A       N/A        N       N/A
Brick 10.70.46.128:/mnt/container_brick1/v1-b1 49152     0          Y       336
Brick 10.70.47.29:/mnt/container_brick1/v1-b2  49153     0          Y       15405
Brick 10.70.47.31:/mnt/container_brick1/v1-b2  N/A       N/A        N       11695
Brick 10.70.46.128:/mnt/container_brick1/v1-b2 49152     0          Y       336
Self-heal Daemon on localhost               N/A       N/A        Y       11704
Self-heal Daemon on 10.70.47.29             N/A       N/A        Y       18586
Self-heal Daemon on 10.70.46.128            N/A       N/A        Y       11382
 
Task Status of Volume vol1
------------------------------------------------------------------------------
There are no active volume tasks

 - glusterd was restarted on 10.70.47.31

Version-Release number of selected component (if applicable):
rpm -qa | grep 'gluster'
glusterfs-resource-agents-3.10.0rc-0.0.el7.centos.noarch
glusterfs-events-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-debuginfo-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-api-3.10.0rc-0.0.el7.centos.x86_64
python2-gluster-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-fuse-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-server-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-devel-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-api-devel-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-geo-replication-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-libs-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-client-xlators-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-extra-xlators-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-cli-3.10.0rc-0.0.el7.centos.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Set cluster.brick-multiplex to on
2. Create 50 volumes (2x3 distributed-replicate volumes were created)
3. Mount all the volumes and run I/O
4. Restart glusterd on one of the nodes

Actual results:
glusterfsd process crashed

Expected results:
No crashes and no other failures should be seen

Additional info:
Logs will be attached shortly.

Comment 2 Atin Mukherjee 2017-02-16 10:04:08 UTC
Changed the component to core as it's not related to GlusterD.

Jeff - Can you please take a look at it?

Comment 3 Jeff Darcy 2017-02-16 20:35:03 UTC
Setting priority to medium because GlusterD restarting while bricks continue to run is not a common case, either as an administrative action or as a failure event.

Comment 4 Worker Ant 2017-02-17 00:25:09 UTC
REVIEW: https://review.gluster.org/16651 (tests: add test for brick-daemon crash when glusterd restarted) posted (#1) for review on master by Jeff Darcy (jdarcy)

Comment 5 Jeff Darcy 2017-02-17 00:26:16 UTC
Another test missing in upstream.  This one took a couple of hours to develop.  You can see it here.

https://review.gluster.org/#/c/16651/

So far no failures.  I'll run it the traditional 100 times and report back.

Comment 6 Jeff Darcy 2017-02-17 12:16:24 UTC
This takes a while to run.  By the time I shut off my laptop last night it had run 44 times without error.

Comment 7 Atin Mukherjee 2017-03-06 13:42:09 UTC
Karthick - Are you hitting this issue in your latest tests?

Comment 8 Jeff Darcy 2017-03-09 13:51:48 UTC
I still have no way to reproduce this, nor do I have access to the RPMs associated with the core in the referenced sosreports, so my ability to debug this is quite hampered.  However, I do see in the logs that all bricks terminated with the same "Exhausted all volfile servers" message that was seen (on clients) in bug 1422781.  This means that the brick daemons terminated with glusterd, and had to be restarted when glusterd was.  The crash seems to be a result of getting an attach request for a second brick before the first was ready (setting ctx->active).  This is highly reminiscent of bug 1430138, which perhaps shouldn't be surprising since that was found while testing the fix for 1422781.

The fix for 1422781 also affects servers, and should prevent the terminate/restart that leads to this bug.  On the other hand, it also wouldn't hurt to add a null check in glusterfs_handle_attach and/or glusterfs_graph_attach, to reduce the "blast area" in other cases where an attach request might be received before we're ready.
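
As a hedged sketch of the guard described above (again using the invented demo_* names from the snippet in the bug description, not the actual patch, which is what the review posted in the following comments contains): if the active graph has not been set yet, reject the attach and return an error indicator instead of dereferencing NULL.

#include <stddef.h>
#include <stdio.h>

typedef struct demo_graph {
        const char *first_xlator;
} demo_graph_t;

typedef struct demo_ctx {
        demo_graph_t *active;       /* still NULL until the first brick is ready */
} demo_ctx_t;

static demo_ctx_t demo_ctx;

static int demo_graph_attach(demo_graph_t *orig_graph, const char *path)
{
        if (orig_graph == NULL || path == NULL) {
                /* Not ready yet (active graph unset) or a bad request:
                 * return an error indicator instead of crashing. */
                return -1;
        }
        printf("attaching %s under %s\n", path, orig_graph->first_xlator);
        return 0;
}

static int demo_handle_attach(const char *brick_path)
{
        if (demo_ctx.active == NULL) {
                fprintf(stderr, "attach for %s received before any graph "
                        "is active; rejecting\n", brick_path);
                return -1;      /* the real handler would send an error reply */
        }
        return demo_graph_attach(demo_ctx.active, brick_path);
}

int main(void)
{
        /* The same premature attach as in the earlier snippet is now
         * rejected cleanly instead of crashing the brick process. */
        int ret = demo_handle_attach("/mnt/container_brick1/v1-b2");
        fprintf(stderr, "attach returned %d\n", ret);
        return 0;
}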

Comment 9 Worker Ant 2017-03-10 14:39:38 UTC
REVIEW: https://review.gluster.org/16888 (glusterfsd+libglusterfs: add null checks during attach) posted (#1) for review on release-3.10 by Jeff Darcy (jdarcy)

Comment 10 Worker Ant 2017-03-10 19:51:17 UTC
COMMIT: https://review.gluster.org/16888 committed in release-3.10 by Shyamsundar Ranganathan (srangana) 
------
commit 41eba3545c46c4cd0b9fcf6fc87284adc64ebcf5
Author: Jeff Darcy <jdarcy>
Date:   Thu Mar 9 12:49:27 2017 -0500

    glusterfsd+libglusterfs: add null checks during attach
    
    It's possible (though unlikely) that we could get a brick-attach
    request while we're not ready to process it (ctx->active not set yet).
    Add code to guard against this possibility, and return appropriate
    error indicators.
    
    Backport of:
    > 90b2b9b29f552fe9ab53de5c4123003522399e6d
    > BUG: 1430860
    > Reviewed-on: https://review.gluster.org/16883
    
    Change-Id: Icb3bc52ce749258a3f03cbbbdf4c2320c5c541a0
    BUG: 1422769
    Signed-off-by: Jeff Darcy <jdarcy>
    Reviewed-on: https://review.gluster.org/16888
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>

Comment 11 Shyamsundar 2017-04-05 00:01:13 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.1, please open a new bug report.

glusterfs-3.10.1 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-April/030494.html
[2] https://www.gluster.org/pipermail/gluster-users/

