Bug 803711

Summary: Remove-brick with a wrong replica value gives wrong info, and a further remove-brick operation causes glusterd to crash
Product: [Community] GlusterFS
Reporter: Vijaykumar Koppad <vkoppad>
Component: glusterd
Assignee: Amar Tumballi <amarts>
Status: CLOSED CURRENTRELEASE
QA Contact: Vijaykumar Koppad <vkoppad>
Severity: medium
Priority: urgent
Version: mainline
CC: bbandari, gluster-bugs, vbellur, vraman
Hardware: x86_64   
OS: Linux   
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Last Closed: 2013-07-24 17:19:48 UTC
Verified Versions: 3.3.0qa42
Bug Blocks: 817967    

Description Vijaykumar Koppad 2012-03-15 13:41:58 UTC
Description of problem: Remove-brick with a wrong replica count on a 1x3 replicate volume is accepted, leaves the volume info inconsistent, and a subsequent remove-brick then crashes glusterd. Initial volume state:
Volume Name: doa
Type: Replicate
Volume ID: 1f0ef1ab-4f35-4dd3-ada9-f1b5d37a2876
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: vostro:/root/bricks/doa/d1
Brick2: vostro:/root/bricks/doa/d2
Brick3: vostro:/root/bricks/doa/d3
root@vostro:~# gluster volume remove-brick doa replica 3 vostro:/root/bricks/doa/d3
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Remove Brick successful
root@vostro:~# gluster volume info
 
Volume Name: doa
Type: Replicate
Volume ID: 1f0ef1ab-4f35-4dd3-ada9-f1b5d37a2876
Status: Started
Number of Bricks: 0 x 3 = 2
Transport-type: tcp
Bricks:
Brick1: vostro:/root/bricks/doa/d1
Brick2: vostro:/root/bricks/doa/d2


root@vostro:~# gluster volume remove-brick doa replica 1 vostro:/root/bricks/doa/d3
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Connection failed. Please check if gluster daemon is operational.
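
For context: on a healthy 1x3 replicate volume, both commands above should have been rejected up front. Keeping "replica 3" while removing one brick would leave only two bricks in a three-way replica set, and going down to "replica 1" would require removing two bricks per set, not one. The check below is a minimal, self-contained sketch of that consistency rule; the function and parameter names (validate_remove_brick, old_replica, and so on) are illustrative and do not come from the glusterd sources.

#include <stdio.h>

/*
 * Illustrative consistency check for a remove-brick request on a
 * replicated volume (not the glusterd code; names are made up).
 * Returns 0 if the request keeps complete replica sets, -1 otherwise.
 */
static int
validate_remove_brick (int total_bricks, int old_replica,
                       int new_replica, int bricks_to_remove)
{
        int sets      = 0;
        int remaining = total_bricks - bricks_to_remove;

        if (old_replica < 1 || total_bricks < old_replica)
                return -1;  /* malformed volume description */

        if (new_replica < 1 || new_replica > old_replica)
                return -1;  /* remove-brick can only keep or shrink replica */

        sets = total_bricks / old_replica;

        if (new_replica < old_replica) {
                /* shrinking the replica count: exactly (old - new) bricks
                 * must be removed from every replica set */
                if (bricks_to_remove != sets * (old_replica - new_replica))
                        return -1;
        } else {
                /* same replica count: only whole replica sets may go */
                if (bricks_to_remove % old_replica != 0)
                        return -1;
        }

        return (remaining > 0 && remaining % new_replica == 0) ? 0 : -1;
}

int
main (void)
{
        /* "replica 3" while removing 1 brick from a 1x3 volume: invalid */
        printf ("replica 3, remove 1: %d\n", validate_remove_brick (3, 3, 3, 1));
        /* "replica 1" while removing only 1 brick: invalid, needs 2 */
        printf ("replica 1, remove 1: %d\n", validate_remove_brick (3, 3, 1, 1));
        /* "replica 2" while removing 1 brick: the valid variant */
        printf ("replica 2, remove 1: %d\n", validate_remove_brick (3, 3, 2, 1));
        return 0;
}

Under these rules, the only single-brick remove-brick that is consistent for a 1x3 volume is "replica 2" with exactly one brick listed.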


Version-Release number of selected component (if applicable):
release-3.3 (git commit d05708d7976a8340ae7647fd26f38f22f1863b6a)

How reproducible: always


Additional info: The generated volfile has only a single replicate sub-volume:
volume doa-client-0
    type protocol/client
    option remote-host vostro
    option remote-subvolume /root/bricks/doa/d1
    option transport-type tcp
end-volume

volume doa-client-1
    type protocol/client
    option remote-host vostro
    option remote-subvolume /root/bricks/doa/d2
    option transport-type tcp
end-volume

volume doa-replicate-0
    type cluster/replicate
    subvolumes doa-client-0 doa-client-1
end-volume

This is the backtrace:
###############################################
#0  0x00007f0bdc961ee2 in gd_rmbr_validate_replica_count (volinfo=0x231ef60, replica_count=1, brick_count=1, err_str=0x7fff97d7bc40 "") at glusterd-brick-ops.c:294
#1  0x00007f0bdc963273 in glusterd_handle_remove_brick (req=0x7f0bdc86f04c) at glusterd-brick-ops.c:609
#2  0x00007f0bdfdf1279 in rpcsvc_handle_rpc_call (svc=0x23115d0, trans=0x231bdc0, msg=0x23c0a70) at rpcsvc.c:520
#3  0x00007f0bdfdf15f6 in rpcsvc_notify (trans=0x231bdc0, mydata=0x23115d0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x23c0a70) at rpcsvc.c:616
#4  0x00007f0bdfdf72ac in rpc_transport_notify (this=0x231bdc0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x23c0a70) at rpc-transport.c:498
#5  0x00007f0bdc663317 in socket_event_poll_in (this=0x231bdc0) at socket.c:1686
#6  0x00007f0bdc663880 in socket_event_handler (fd=33, idx=25, data=0x231bdc0, poll_in=1, poll_out=0, poll_err=0) at socket.c:1801
#7  0x00007f0be005179c in event_dispatch_epoll_handler (event_pool=0x22f73a0, events=0x231b180, i=0) at event.c:794
#8  0x00007f0be00519af in event_dispatch_epoll (event_pool=0x22f73a0) at event.c:856
#9  0x00007f0be0051d22 in event_dispatch (event_pool=0x22f73a0) at event.c:956
#10 0x0000000000408247 in main (argc=3, argv=0x7fff97d7cb48) at glusterfsd.c:1624
(gdb) f 1 
#1  0x00007f0bdc963273 in glusterd_handle_remove_brick (req=0x7f0bdc86f04c) at glusterd-brick-ops.c:609
609	                ret = gd_rmbr_validate_replica_count (volinfo, replica_count,
(gdb) f 2 
#2  0x00007f0bdfdf1279 in rpcsvc_handle_rpc_call (svc=0x23115d0, trans=0x231bdc0, msg=0x23c0a70) at rpcsvc.c:520
520	                        ret = actor->actor (req);
(gdb) f 23
#0  0x0000000000000000 in ?? ()
(gdb) f 3
#3  0x00007f0bdfdf15f6 in rpcsvc_notify (trans=0x231bdc0, mydata=0x23115d0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x23c0a70) at rpcsvc.c:616
616	                ret = rpcsvc_handle_rpc_call (svc, trans, msg);
(gdb) f 4
#4  0x00007f0bdfdf72ac in rpc_transport_notify (this=0x231bdc0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x23c0a70) at rpc-transport.c:498
498	                ret = this->notify (this, this->mydata, event, data);
(gdb)
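
The crash is inside gd_rmbr_validate_replica_count(), called for the second remove-brick after the first (wrongly accepted) one had already left the volume's counters inconsistent (see the "0 x 3 = 2" output above). The exact faulting statement is not visible in this trace, so the snippet below is only a sketch of the defensive shape such a validator needs: verify the stored counters before using them and report failures through err_str instead of crashing. The volinfo_t struct and rmbr_validate_replica_count() body here are simplified stand-ins, not the code in glusterd-brick-ops.c.

#include <stdio.h>
#include <string.h>

/* Simplified stand-in for glusterd_volinfo_t; the real structure
 * has many more fields. */
typedef struct {
        int brick_count;
        int replica_count;
} volinfo_t;

/*
 * Hypothetical defensive validator: every exit path returns an error
 * code and a message instead of trusting possibly-corrupted counters
 * (a replica_count of 0 would otherwise make any modulo/division fault).
 */
static int
rmbr_validate_replica_count (volinfo_t *volinfo, int replica_count,
                             int brick_count, char *err_str, size_t len)
{
        if (!volinfo || volinfo->replica_count <= 0 ||
            volinfo->brick_count <= 0) {
                snprintf (err_str, len, "volume state is inconsistent; "
                          "refusing remove-brick");
                return -1;
        }

        if (replica_count < 1 || replica_count > volinfo->replica_count) {
                snprintf (err_str, len, "replica count %d is not valid for "
                          "a replica %d volume", replica_count,
                          volinfo->replica_count);
                return -1;
        }

        if ((volinfo->brick_count - brick_count) % replica_count != 0) {
                snprintf (err_str, len, "removing %d brick(s) would leave "
                          "an incomplete replica set", brick_count);
                return -1;
        }

        return 0;
}

int
main (void)
{
        char      err[256]  = "";
        volinfo_t corrupted = { .brick_count = 2, .replica_count = 0 };

        /* The second remove-brick from the report: replica 1, one brick,
         * issued against an already-inconsistent volume. */
        if (rmbr_validate_replica_count (&corrupted, 1, 1, err, sizeof (err)))
                printf ("rejected: %s\n", err);
        return 0;
}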

Comment 1 Anand Avati 2012-03-31 14:39:06 UTC
CHANGE: http://review.gluster.com/3050 (glusterd: remove-brick validation behavior fix) merged in master by Vijay Bellur (vijay)

Comment 2 Amar Tumballi 2012-05-05 04:22:34 UTC
Noticed that the fix above covers only some of the cases. There is still an issue with the remove-brick pattern on a plain replicate type of volume, hence re-opening this bug. Thanks to Shwetha and Shylesh for trying to verify the bug and finding the other cases.

Comment 3 Anand Avati 2012-05-08 09:41:06 UTC
CHANGE: http://review.gluster.com/3278 (glusterd: remove-brick: add more error handling) merged in master by Vijay Bellur (vijay)