1230101 – [glusterd] glusterd crashed while trying to remove bricks - one selected from each replica set - after shrinking nX3 to nX2 to nX1

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterd
Sub Component:
Version:	rhgs-3.1
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.1.0
Assignee:	Gaurav Kumar Garg
QA Contact:	SATHEESARAN
Docs Contact:
URL:
Whiteboard:	glusterd
Depends On:
Blocks:	1202842 1230121 1231646

Reported:	2015-06-10 09:05 UTC by Rahul Hinduja
Modified:	2016-06-05 23:38 UTC
CC List:	10 users
Fixed In Version:	glusterfs-3.7.1-4
Doc Type:	Bug Fix
Doc Text:	Previously, glusterd crashed when performing a remove-brick operation on a replicate volume after shrinking the volume from replica nx3 to nx2 and then from nx2 to nx1. This was due to an issue with the subvol count (replica set) calculation. With this fix, glusterd does not crash after shrinking the replicate volume from replica nx3 to nx2 and from nx2 to nx1.
Clone Of:
Clones:	1230121
Environment:
Last Closed:	2015-07-29 05:00:32 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2015:1495	0	normal	SHIPPED_LIVE	Important: Red Hat Gluster Storage 3.1 update	2015-07-29 08:26:26 UTC

Description Rahul Hinduja 2015-06-10 09:05:32 UTC

Description of problem:
=======================

While running remove-brick with replica count 2 on an existing replica 2 volume, glusterd crashes with the following backtrace:

#0  0x00007fcdd03e681c in subvol_matcher_update (req=0x25989cc) at glusterd-brick-ops.c:662
#1  __glusterd_handle_remove_brick (req=0x25989cc) at glusterd-brick-ops.c:985
#2  0x00007fcdd03542bf in glusterd_big_locked_handler (req=0x25989cc, actor_fn=0x7fcdd03e5f90 <__glusterd_handle_remove_brick>) at glusterd-handler.c:83
#3  0x0000003b0d8655b2 in synctask_wrap (old_task=<value optimized out>) at syncop.c:375
#4  0x0000003b028438f0 in ?? () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) 

Logs suggest:
=============

[2015-06-10 14:18:01.134630] I [glusterd-handler.c:1404:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-06-10 14:18:01.137158] I [glusterd-handler.c:1404:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-06-10 14:18:28.239515] I [glusterd-brick-ops.c:779:__glusterd_handle_remove_brick] 0-management: Received rem brick req
[2015-06-10 14:18:28.239593] I [glusterd-brick-ops.c:849:__glusterd_handle_remove_brick] 0-management: request to change replica-count to 2
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2015-06-10 14:18:28
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3b0d824b66]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3b0d84359f]
/lib64/libc.so.6[0x3b028326a0]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_remove_brick+0x88c)[0x7fcdd03e681c]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fcdd03542bf]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x3b0d8655b2]
/lib64/libc.so.6[0x3b028438f0]
---------

Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.1-1.el6rhs.x86_64


How reproducible:
==================

Always


Steps to Reproduce:
===================
1. Create a 2x2 volume.
2. Remove 2 bricks, one from each subvolume, specifying a replica count of 2.


Actual results:
===============

glusterd crashes.


Expected results:
=================

Removing a brick with replica count 2 from a replica 2 volume is a failure case; it should print the usage message or fail gracefully.
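
The expected behaviour above amounts to validating the request before any per-subvolume bookkeeping runs. A minimal sketch in C of such a check follows. The names (`replica_check`, `check_replica_reduction`) are hypothetical and are not glusterd's actual functions or error codes. The brick-count rule (one brick removed from every replica set for each step of reduction) is an assumption of this sketch.

```c
/* Hypothetical result codes for this sketch; not glusterd's own. */
enum replica_check {
    REPLICA_OK = 0,
    REPLICA_ERR_NOT_REDUCED,  /* requested count is not below the current one */
    REPLICA_ERR_BRICK_COUNT   /* wrong number of bricks for the reduction */
};

/*
 * Validate a remove-brick request that names a replica count.
 *   current_replica:  replica count of the volume as it is now
 *   new_replica:      replica count named in the request
 *   subvol_count:     number of replica sets in the volume
 *   bricks_to_remove: number of bricks listed in the request
 */
enum replica_check
check_replica_reduction(int current_replica, int new_replica,
                        int subvol_count, int bricks_to_remove)
{
    /* "replica 2" on a replica 2 volume is rejected here, not later. */
    if (new_replica < 1 || new_replica >= current_replica)
        return REPLICA_ERR_NOT_REDUCED;

    /* Assumed rule: each replica set gives up (current - new) bricks. */
    if (subvol_count < 1 ||
        bricks_to_remove != (current_replica - new_replica) * subvol_count)
        return REPLICA_ERR_BRICK_COUNT;

    return REPLICA_OK;
}
```

With this check, the request from the steps to reproduce (replica count 2 on a replica 2 volume) returns REPLICA_ERR_NOT_REDUCED, which a caller could turn into a usage message instead of a crash.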

Comment 4 SATHEESARAN 2015-06-10 09:44:58 UTC

I have tried to reproduce the issue.

It's reproducible only in the following case:

1. Create a 2X3 distributed-replicate volume
2. Shrink it to a 2X2 distributed-replicate volume
3. Shrink it from 2X2 to a 2X1 distribute volume

Here are a few more observations:
1. There is no crash observed when creating a 2X2 volume and shrinking it to 2X1
2. There is no crash observed when creating a 2X3 volume and shrinking it to 2X2
3. There is no crash observed when trying to remove each brick from all replica sets, and a proper error message is thrown
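
The crash appearing only after two successive shrinks fits the doc text's note about the subvol count (replica set) calculation. A minimal sketch in C follows of recomputing that count from the current layout after each shrink, and of bounds-checking the replica-set index before a per-set counter is touched. This is not the upstream patch. `vol_model`, `subvol_count_of`, `shrink_replica` and `tally_brick` are hypothetical names, and the sketch assumes that bricks of one replica set are adjacent in the brick list.

```c
#include <stddef.h>

/* Hypothetical, simplified volume model; not glusterd_volinfo_t. */
struct vol_model {
    int brick_count;    /* total bricks currently in the volume */
    int replica_count;  /* bricks per replica set */
};

/* Number of replica sets, or -1 if the layout is inconsistent. */
int subvol_count_of(const struct vol_model *v)
{
    if (v == NULL || v->replica_count < 1 || v->brick_count < 1)
        return -1;
    if (v->brick_count % v->replica_count != 0)
        return -1;
    return v->brick_count / v->replica_count;
}

/* Apply a replica reduction; return the recomputed set count or -1. */
int shrink_replica(struct vol_model *v, int new_replica)
{
    int sets = subvol_count_of(v);

    if (sets < 0 || new_replica < 1 || new_replica >= v->replica_count)
        return -1;
    v->brick_count = sets * new_replica;
    v->replica_count = new_replica;
    /* Derive the count again from the updated fields, never from a cache. */
    return subvol_count_of(v);
}

/*
 * Count one brick (zero-based position in the brick list) against its
 * replica set. Returns 0 on success, -1 if the index would be out of range.
 */
int tally_brick(const struct vol_model *v, int *counts, int counts_len,
                int brick_pos)
{
    int sets = subvol_count_of(v);
    int idx;

    if (counts == NULL || sets < 0 ||
        brick_pos < 0 || brick_pos >= v->brick_count)
        return -1;
    idx = brick_pos / v->replica_count;
    if (idx >= counts_len || idx >= sets)
        return -1;
    counts[idx]++;
    return 0;
}
```

Walking the model through 2X3 to 2X2 to 2X1 keeps the replica-set count at 2 throughout, and an out-of-range brick position is rejected instead of written past the end of the counter array.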

Comment 5 Atin Mukherjee 2015-06-11 07:00:34 UTC

Upstream patch http://review.gluster.org/#/c/11165 is in review

Comment 7 SATHEESARAN 2015-06-12 06:35:47 UTC

Marking this bug as BLOCKER, as this is required for RHGS 3.1 (Everglades)

Comment 13 SATHEESARAN 2015-06-29 17:37:45 UTC

Verified with RHGS 3.1 nightly build glusterfs-3.7.1-6.el6rhs, using the steps mentioned in comment 4.

There were no issues; marking this bug as VERIFIED.

Comment 14 Bhavana 2015-07-15 06:16:51 UTC

Hi Gaurav,

The doc text is updated. Please review it and share your technical review comments. If it looks OK, please sign off on it.

Regards,
Bhavana

Comment 15 errata-xmlrpc 2015-07-29 05:00:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html
