Description of problem:

When glusterd was restarted with around 50 volumes in the cluster, the glusterfsd process crashed. Brick multiplexing was enabled on the cluster.

(gdb) bt
#0  0x00007fc1d59c97b0 in glusterfs_graph_attach (orig_graph=0x0, path=<optimized out>) at graph.c:1085
#1  0x00007fc1d5e905da in glusterfs_handle_attach (req=0x7fc1c80034a0) at glusterfsd-mgmt.c:842
#2  0x00007fc1d59ca6d0 in synctask_wrap (old_task=<optimized out>) at syncop.c:375
#3  0x00007fc1d4088cf0 in ?? () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()

gluster v status vol1
Status of volume: vol1
Gluster process                                 TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.29:/mnt/container_brick1/v1-b1   49153     0          Y       15405
Brick 10.70.47.31:/mnt/container_brick1/v1-b1   N/A       N/A        N       N/A
Brick 10.70.46.128:/mnt/container_brick1/v1-b1  49152     0          Y       336
Brick 10.70.47.29:/mnt/container_brick1/v1-b2   49153     0          Y       15405
Brick 10.70.47.31:/mnt/container_brick1/v1-b2   N/A       N/A        N       11695
Brick 10.70.46.128:/mnt/container_brick1/v1-b2  49152     0          Y       336
Self-heal Daemon on localhost                   N/A       N/A        Y       11704
Self-heal Daemon on 10.70.47.29                 N/A       N/A        Y       18586
Self-heal Daemon on 10.70.46.128                N/A       N/A        Y       11382

Task Status of Volume vol1
------------------------------------------------------------------------------
There are no active volume tasks

- glusterd was restarted on 10.70.47.31

Version-Release number of selected component (if applicable):

rpm -qa | grep 'gluster'
glusterfs-resource-agents-3.10.0rc-0.0.el7.centos.noarch
glusterfs-events-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-debuginfo-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-api-3.10.0rc-0.0.el7.centos.x86_64
python2-gluster-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-fuse-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-server-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-devel-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-api-devel-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-geo-replication-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-libs-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-client-xlators-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-extra-xlators-3.10.0rc-0.0.el7.centos.x86_64
glusterfs-cli-3.10.0rc-0.0.el7.centos.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Set cluster.brick-multiplex to on.
2. Create 50 volumes (2x3 distributed-replicate volumes were created).
3. Mount all the volumes and run I/O.
4. Restart glusterd on one of the nodes.

Actual results:
The glusterfsd process crashed.

Expected results:
No crashes or other failures should be seen.

Additional info:
Logs shall be attached shortly.
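For context, the backtrace above shows glusterfs_handle_attach (glusterfsd-mgmt.c:842) calling glusterfs_graph_attach with orig_graph=0x0, i.e. the multiplexed brick process received an attach request before its active graph existed. The following is a minimal, simplified sketch of how a NULL ctx->active can propagate into the attach path and cause the segfault; it is not the actual GlusterFS source, and the struct layouts and helper names (graph_attach_sketch, handle_attach_sketch) are illustrative assumptions.

/* Simplified sketch of the crash path implied by the backtrace above.
 * NOT the real GlusterFS code; fields and names are illustrative only. */
#include <stddef.h>

typedef struct glusterfs_graph { struct xlator *first; /* ... */ } glusterfs_graph_t;
typedef struct glusterfs_ctx   { glusterfs_graph_t *active; /* ... */ } glusterfs_ctx_t;

/* Corresponds to the frame at graph.c:1085, where orig_graph is used. */
int
graph_attach_sketch (glusterfs_graph_t *orig_graph, const char *path)
{
        /* With orig_graph == NULL, any access such as orig_graph->first
         * is a NULL-pointer dereference -> SIGSEGV. */
        struct xlator *top = orig_graph->first;
        (void)top;
        (void)path;
        return 0;
}

/* Corresponds to the frame at glusterfsd-mgmt.c:842: the attach handler
 * passes ctx->active straight through.  If glusterd restarts and sends an
 * attach for a second brick before the first brick has set ctx->active,
 * that pointer is still NULL when it reaches the graph code. */
int
handle_attach_sketch (glusterfs_ctx_t *ctx, const char *volfile_path)
{
        return graph_attach_sketch (ctx->active, volfile_path);
}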
Changed the component to core as it's not related to GlusterD. Jeff - Can you please take a look at it?
Setting priority to medium: restarting glusterd while bricks continue to run is not a common case, either as an administrative action or as a failure event.
REVIEW: https://review.gluster.org/16651 (tests: add test for brick-daemon crash when glusterd restarted) posted (#1) for review on master by Jeff Darcy (jdarcy)
Another test missing in upstream. This one took a couple of hours to develop; you can see it here: https://review.gluster.org/#/c/16651/ So far, no failures. I'll run it the traditional 100 times and report back.
This takes a while to run. By the time I shut off my laptop last night it had run 44 times without error.
Karthick - Are you hitting this issue in your latest tests?
I still have no way to reproduce this, nor do I have access to the RPMs associated with the core in the referenced sosreports, so my ability to debug this is quite hampered.

However, I do see in the logs that all bricks terminated with the same "Exhausted all volfile servers" message that was seen (on clients) in bug 1422781. This means that the brick daemons terminated along with glusterd, and had to be restarted when glusterd was. The crash seems to be the result of getting an attach request for a second brick before the first was ready (i.e. before it had set ctx->active). This is highly reminiscent of bug 1430138, which perhaps shouldn't be surprising since that was found while testing the fix for 1422781.

The fix for 1422781 also affects servers, and should prevent the terminate/restart cycle that leads to this bug. On the other hand, it also wouldn't hurt to add a null check in glusterfs_handle_attach and/or glusterfs_graph_attach, to reduce the "blast area" in other cases where an attach request might be received before we're ready.
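To illustrate the kind of guard being proposed, here is a minimal sketch of defensive null checks in both places. This is not the actual patch that was merged (see the commit referenced below); the signatures, struct layouts, and error values are assumptions chosen only to show the idea of rejecting an attach request received before the process is ready.

/* Sketch of the proposed defensive checks; signatures and error handling
 * are assumptions, not the merged change. */
#include <errno.h>
#include <stddef.h>

typedef struct glusterfs_graph { int unused; /* ... */ } glusterfs_graph_t;
typedef struct glusterfs_ctx   { glusterfs_graph_t *active; /* ... */ } glusterfs_ctx_t;

/* In glusterfs_graph_attach(): refuse to operate on a NULL graph instead
 * of dereferencing it. */
int
graph_attach_guarded (glusterfs_graph_t *orig_graph, const char *path)
{
        if (!orig_graph || !path)
                return -EINVAL;      /* caller turns this into an RPC error */
        /* ... normal attach logic ... */
        return 0;
}

/* In glusterfs_handle_attach(): check ctx->active before calling down, so
 * an attach request that arrives before the first brick graph is ready
 * gets a clean error reply instead of crashing the multiplexed brick
 * process; glusterd can then retry the attach later. */
int
handle_attach_guarded (glusterfs_ctx_t *ctx, const char *volfile_path)
{
        int ret = -1;

        if (ctx && ctx->active)
                ret = graph_attach_guarded (ctx->active, volfile_path);
        /* else: not ready yet -- report an error back to the requester */

        return ret;
}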
REVIEW: https://review.gluster.org/16888 (glusterfsd+libglusterfs: add null checks during attach) posted (#1) for review on release-3.10 by Jeff Darcy (jdarcy)
COMMIT: https://review.gluster.org/16888 committed in release-3.10 by Shyamsundar Ranganathan (srangana)

------

commit 41eba3545c46c4cd0b9fcf6fc87284adc64ebcf5
Author: Jeff Darcy <jdarcy>
Date:   Thu Mar 9 12:49:27 2017 -0500

    glusterfsd+libglusterfs: add null checks during attach

    It's possible (though unlikely) that we could get a brick-attach
    request while we're not ready to process it (ctx->active not set yet).
    Add code to guard against this possibility, and return appropriate
    error indicators.

    Backport of:
    > 90b2b9b29f552fe9ab53de5c4123003522399e6d
    > BUG: 1430860
    > Reviewed-on: https://review.gluster.org/16883

    Change-Id: Icb3bc52ce749258a3f03cbbbdf4c2320c5c541a0
    BUG: 1422769
    Signed-off-by: Jeff Darcy <jdarcy>
    Reviewed-on: https://review.gluster.org/16888
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.1, please open a new bug report.

glusterfs-3.10.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-April/030494.html
[2] https://www.gluster.org/pipermail/gluster-users/