Bug 1230101

Summary: [glusterd] glusterd crashed while trying to remove bricks - one selected from each replica set - after shrinking nX3 to nX2 to nX1
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: glusterd
Assignee: Gaurav Kumar Garg <ggarg>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.1
CC: amukherj, asrivast, bmohanra, ggarg, nlevinki, nsathyan, rcyriac, sasundar, smohan, vbellur
Target Milestone: ---
Keywords: Patch, Triaged
Target Release: RHGS 3.1.0
Hardware: x86_64
OS: Linux
Whiteboard: glusterd
Fixed In Version: glusterfs-3.7.1-4
Doc Type: Bug Fix
Doc Text:
Previously, glusterd crashed when performing a remove-brick operation on a replicate volume after shrinking the volume from replica nx3 to nx2 and then from nx2 to nx1. This was due to an issue with the subvol count (replica set) calculation. With this fix, glusterd no longer crashes after shrinking the replicate volume from replica nx3 to nx2 and from nx2 to nx1.
Story Points: ---
Clone Of:
: 1230121
Environment:
Last Closed: 2015-07-29 05:00:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1202842, 1230121, 1231646

Description Rahul Hinduja 2015-06-10 09:05:32 UTC
Description of problem:
=======================

While running remove-brick with replica count 2 on an existing replica 2 volume, glusterd crashes with the following backtrace:

#0  0x00007fcdd03e681c in subvol_matcher_update (req=0x25989cc) at glusterd-brick-ops.c:662
#1  __glusterd_handle_remove_brick (req=0x25989cc) at glusterd-brick-ops.c:985
#2  0x00007fcdd03542bf in glusterd_big_locked_handler (req=0x25989cc, actor_fn=0x7fcdd03e5f90 <__glusterd_handle_remove_brick>) at glusterd-handler.c:83
#3  0x0000003b0d8655b2 in synctask_wrap (old_task=<value optimized out>) at syncop.c:375
#4  0x0000003b028438f0 in ?? () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()

Logs suggest:
=============

[2015-06-10 14:18:01.134630] I [glusterd-handler.c:1404:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-06-10 14:18:01.137158] I [glusterd-handler.c:1404:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2015-06-10 14:18:28.239515] I [glusterd-brick-ops.c:779:__glusterd_handle_remove_brick] 0-management: Received rem brick req
[2015-06-10 14:18:28.239593] I [glusterd-brick-ops.c:849:__glusterd_handle_remove_brick] 0-management: request to change replica-count to 2
pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 
2015-06-10 14:18:28
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x3b0d824b66]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x3b0d84359f]
/lib64/libc.so.6[0x3b028326a0]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_remove_brick+0x88c)[0x7fcdd03e681c]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fcdd03542bf]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x3b0d8655b2]
/lib64/libc.so.6[0x3b028438f0]
---------



Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.1-1.el6rhs.x86_64


How reproducible:
==================

Always


Steps to Reproduce:
===================
1. Create 2x2 volume
2. Remove 2 bricks, one from each subvolume, specifying a replica count of 2


Actual results:
===============

glusterd crashes.


Expected results:
=================

Removing a brick with replica count 2 from a replica 2 volume is a failure case; the command should print usage or fail gracefully.
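
One way to picture the "fail gracefully" behaviour is a check that rejects an inconsistent request before any per-subvolume bookkeeping is touched. The sketch below is a hypothetical Python model, not glusterd code: the function name `check_remove_brick`, its parameters, and the assumption that brick `i` belongs to replica set `i // replica_count` are all made up for illustration.

```python
from collections import Counter


def check_remove_brick(brick_count, replica_count, new_replica_count, remove_indices):
    """Return an error string for an invalid request, or None if it is acceptable.

    Hypothetical model: brick i belongs to replica set i // replica_count.
    """
    if replica_count < 1 or brick_count % replica_count != 0:
        return "inconsistent volume layout"
    if not 1 <= new_replica_count <= replica_count:
        return "replica count must be between 1 and %d" % replica_count
    if len(set(remove_indices)) != len(remove_indices):
        return "duplicate brick in request"
    if any(i < 0 or i >= brick_count for i in remove_indices):
        return "brick is not part of the volume"

    subvol_count = brick_count // replica_count
    removed_per_set = Counter(i // replica_count for i in remove_indices)

    if new_replica_count == replica_count:
        # Replica count unchanged: only whole replica sets may be removed.
        if not remove_indices or any(n != replica_count for n in removed_per_set.values()):
            return "with an unchanged replica count, remove whole replica sets only"
        return None

    # Replica count reduced: every replica set must lose the same number of bricks.
    drop = replica_count - new_replica_count
    if any(removed_per_set.get(s, 0) != drop for s in range(subvol_count)):
        return "remove exactly %d brick(s) from each replica set" % drop
    return None
```

In this model, the request from the steps above (2x2 volume, one brick from each subvolume, replica count 2) returns an error string, while the same bricks with replica count 1 are accepted.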

Comment 4 SATHEESARAN 2015-06-10 09:44:58 UTC
I have tried to reproduce the issue.

It is reproducible only in the following case:

1. Create a 2X3 distributed-replicate volume
2. Shrink it to a 2X2 distributed-replicate volume
3. Shrink it from 2X2 to a 2X1 distribute volume

Here are a few more observations:
1. There is no crash observed when creating a 2X2 volume and shrinking it to 2X1
2. There is no crash observed when creating a 2X3 volume and shrinking it to 2X2
3. There is no crash observed when trying to remove each brick from all replica sets; a proper error message is thrown
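
The pattern above (a single shrink is fine, a second consecutive shrink crashes) would be consistent with a subvolume count that goes stale after the first shrink. The following Python sketch is only an illustration under that assumption; the `Volume` class, the `recompute_after_update` flag, and the idea that a matcher array is sized by the cached count are hypothetical and are not taken from the glusterd source, so the actual defect may differ.

```python
class Volume:
    """Toy model of a distributed-replicate volume (hypothetical, not glusterd)."""

    def __init__(self, brick_count, replica_count):
        self.brick_count = brick_count
        self.replica_count = replica_count
        self.subvol_count = brick_count // replica_count

    def shrink(self, new_replica_count, recompute_after_update=True):
        """Reduce the replica count, keeping the number of replica sets."""
        old_replica_count = self.replica_count
        self.brick_count = self.subvol_count * new_replica_count
        self.replica_count = new_replica_count
        # Assumed bug shape: dividing by the old replica count leaves a stale value.
        divisor = new_replica_count if recompute_after_update else old_replica_count
        self.subvol_count = self.brick_count // divisor


def matcher_indices_in_range(vol):
    """True if every brick maps to a valid slot of a matcher sized by subvol_count."""
    matcher = [0] * vol.subvol_count
    return all(i // vol.replica_count < len(matcher) for i in range(vol.brick_count))
```

With a 2X3 volume shrunk to 2X2, the correct recomputation keeps `subvol_count` at 2 and every index stays in range. With the assumed stale calculation the count becomes 1, so the next remove-brick request would index past the end of a one-element array, which in C would be an out-of-bounds access rather than a Python `False`.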

Comment 5 Atin Mukherjee 2015-06-11 07:00:34 UTC
Upstream patch http://review.gluster.org/#/c/11165 is in review

Comment 7 SATHEESARAN 2015-06-12 06:35:47 UTC
Marking this bug as BLOCKER, as it is required for RHGS 3.1 (Everglades)

Comment 13 SATHEESARAN 2015-06-29 17:37:45 UTC
Verified with RHGS 3.1 Nightly build - glusterfs-3.7.1-6.el6rhs with the steps mentioned in comment 4.

There were no issues; marking this bug as VERIFIED

Comment 14 Bhavana 2015-07-15 06:16:51 UTC
Hi Gaurav,

The doc text is updated. Please review it and share your technical review comments. If it looks OK, please sign off on it.

Regards,
Bhavana

Comment 15 errata-xmlrpc 2015-07-29 05:00:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html