Bug 1044327 - glusterd crashing when trying to stop stale rebalance process
Summary: glusterd crashing when trying to stop stale rebalance process
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kaushal
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1043535
Reported: 2013-12-18 06:15 UTC by Kaushal
Modified: 2014-11-11 08:25 UTC
CC: 11 users

Fixed In Version: glusterfs-3.6.0beta1
Doc Type: Bug Fix
Doc Text:
Clone Of: 1043535
Environment:
Last Closed: 2014-11-11 08:25:48 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Kaushal 2013-12-18 06:15:52 UTC
Description of problem:
-------------------------

glusterd is seen to crash with the following backtrace - 

pending frames:
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-12-16 07:13:10
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.49rhs
/lib64/libc.so.6[0x309fc32960]
/usr/lib64/libglusterfs.so.0(synctask_yield+0x10)[0x30a104ae00]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(gd_stop_rebalance_process+0x2c5)[0x7fd2cae55f45]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(gd_check_and_update_rebalance_info+0xd8)[0x7fd2cae5d368]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_import_friend_volume+0x19f)[0x7fd2cae6b40f]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_import_friend_volumes+0x66)[0x7fd2cae6b536]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_compare_friend_data+0x142)[0x7fd2cae6b752]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(+0x378ac)[0x7fd2cae478ac]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_friend_sm+0x19e)[0x7fd2cae47f2e]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(__glusterd_handle_incoming_friend_req+0x2fe)[0x7fd2cae4672e]
/usr/lib64/glusterfs/3.4.0.49rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fd2cae3619f]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x295)[0x30a1809585]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103)[0x30a18097c3]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x30a180adf8]
/usr/lib64/glusterfs/3.4.0.49rhs/rpc-transport/socket.so(+0x8d86)[0x7fd2c94bcd86]
/usr/lib64/glusterfs/3.4.0.49rhs/rpc-transport/socket.so(+0xa69d)[0x7fd2c94be69d]
/usr/lib64/libglusterfs.so.0[0x30a1062387]
/usr/sbin/glusterd(main+0x6c7)[0x4069d7]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x309fc1ecdd]
/usr/sbin/glusterd[0x404619]
---------

Steps to Reproduce:
1. Perform rebalance and remove-brick operations a few times, restarting glusterd a few times in between.

Actual results:
glusterd crashed.

Expected results:
glusterd should not crash.
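The reproduction recipe above can be sketched as a shell sequence. This is an illustrative sketch only: the volume name and brick path are placeholders, and the crash is timing-dependent, so several iterations may be needed.

```shell
#!/bin/sh
# Illustrative reproduction sketch -- "testvol" and the brick path are
# placeholders; adjust to an existing volume in your test cluster.
VOLNAME=testvol

for i in 1 2 3; do
    # Leave rebalance/remove-brick state behind ...
    gluster volume rebalance "$VOLNAME" start
    gluster volume remove-brick "$VOLNAME" server2:/bricks/brick1 start
    # ... then restart glusterd so a stale rebalance task is recorded,
    # which glusterd later tries to stop during a volume update.
    service glusterd restart
done
```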

Comment 1 Anand Avati 2013-12-18 06:18:22 UTC
REVIEW: http://review.gluster.org/6531 (glusterd: Improve stopping of stale rebalance processes) posted (#1) for review on master by Kaushal M (kaushal)

Comment 2 Anand Avati 2013-12-18 09:43:01 UTC
REVIEW: http://review.gluster.org/6531 (glusterd: Fix stopping of stale rebalance processes) posted (#2) for review on master by Kaushal M (kaushal)

Comment 3 Anand Avati 2013-12-19 06:18:58 UTC
COMMIT: http://review.gluster.org/6531 committed in master by Vijay Bellur (vbellur) 
------
commit 30bdde315e01d4d71cca121f0cba55b7ae82dd1b
Author: Kaushal M <kaushal>
Date:   Tue Dec 17 16:09:02 2013 +0530

    glusterd: Fix stopping of stale rebalance processes
    
    Trying to stop the rebalance process via RPC using the GD_SYNCOP macro
    could lead to glusterd crashing. In case of an implicit volume update,
    which happens when a peer comes back up, the stop function would be
    called in the epoll thread. This would lead to glusterd crashing as the
    epoll thread doesn't have synctasks for the GD_SYNCOP macro to make use
    of.
    
    Instead of using the RPC method, we now terminate the rebalance process
    by kill(). The rebalance process has been designed to be resistant to
    interruption, so this will not lead to any data corruption.
    
    Also, when checking for stale rebalance task, make sure that the old
    task-id is not null.
    
    Change-Id: I54dd93803954ee55316cc58b5877f38d1ebc40b9
    BUG: 1044327
    Signed-off-by: Kaushal M <kaushal>
    Reviewed-on: http://review.gluster.org/6531
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: Gluster Build System <jenkins.com>
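The kill()-based approach described in the commit message can be sketched as a small standalone helper. The function name and pidfile handling here are hypothetical, not glusterd's actual internals; the sketch only illustrates the pattern of terminating the rebalance daemon directly instead of going through an RPC synctask:

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical sketch of the kill()-based stop the fix switches to:
 * read the rebalance daemon's pid from its pidfile and send SIGTERM.
 * The real glusterd helpers differ; this is illustrative only. */
static int
stop_rebalance_by_pidfile (const char *pidfile)
{
        FILE *fp  = fopen (pidfile, "r");
        int   pid = -1;

        if (!fp)
                return -1;
        if (fscanf (fp, "%d", &pid) != 1 || pid <= 0) {
                fclose (fp);
                return -1;
        }
        fclose (fp);

        /* SIGTERM lets the process exit cleanly; as the commit notes,
         * the rebalance process tolerates interruption, so stopping it
         * this way cannot corrupt data. ESRCH just means the process
         * is already gone, which also counts as stopped. */
        if (kill ((pid_t) pid, SIGTERM) != 0 && errno != ESRCH)
                return -1;
        return 0;
}
```

Because kill() is a plain syscall, this path works from any thread, including the epoll thread where the crashing GD_SYNCOP-based stop used to run.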

Comment 4 Anand Avati 2013-12-23 09:00:05 UTC
REVIEW: http://review.gluster.org/6567 (glusterd: Fix stopping of stale rebalance processes) posted (#1) for review on release-3.5 by Krishnan Parthasarathi (kparthas)

Comment 5 Anand Avati 2013-12-23 14:58:41 UTC
COMMIT: http://review.gluster.org/6567 committed in release-3.5 by Vijay Bellur (vbellur) 
------
commit b546d4fb59230eb0bba102496bca53bb0a5c86f9
Author: Krishnan Parthasarathi <kparthas>
Date:   Mon Dec 23 14:08:00 2013 +0530

    glusterd: Fix stopping of stale rebalance processes
    
            Backport of http://review.gluster.org/6531
    
    Trying to stop the rebalance process via RPC using the GD_SYNCOP macro
    could lead to glusterd crashing. In case of an implicit volume update,
    which happens when a peer comes back up, the stop function would be
    called in the epoll thread. This would lead to glusterd crashing as the
    epoll thread doesn't have synctasks for the GD_SYNCOP macro to make use
    of.
    
    Instead of using the RPC method, we now terminate the rebalance process
    by kill(). The rebalance process has been designed to be resistant to
    interruption, so this will not lead to any data corruption.
    
    Also, when checking for stale rebalance task, make sure that the old
    task-id is not null.
    
    Change-Id: I54dd93803954ee55316cc58b5877f38d1ebc40b9
    BUG: 1044327
    Signed-off-by: Kaushal M <kaushal>
    Signed-off-by: Krishnan Parthasarathi <kparthas>
    Reviewed-on: http://review.gluster.org/6567
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>

Comment 6 Niels de Vos 2014-09-22 12:33:52 UTC
A beta release for GlusterFS 3.6.0 has been released [1]. Please verify whether this release resolves this bug report for you. In case the glusterfs-3.6.0beta1 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure (possibly an "updates-testing" repository) for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/

Comment 7 Niels de Vos 2014-11-11 08:25:48 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users
