Bug 982184

Summary: "remove-brick start" command is not working
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: shylesh <shmohan>
Component: glusterfs
Assignee: Kaushal <kaushal>
Status: CLOSED ERRATA
QA Contact: Sudhir D <sdharane>
Severity: urgent
Docs Contact:
Priority: high
Version: 2.1
CC: amarts, rhs-bugs, sgowda, shaines, surs, vbellur
Target Milestone: ---
Keywords: TestBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.4.0.12rhs.beta6-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-09-23 22:29:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description shylesh 2013-07-08 11:01:55 UTC
Description of problem:
"remove-brick start" fails on every volume it is run against.



Version-Release number of selected component (if applicable):

[root@beta2 glusterd]# rpm -qa | grep gluster
gluster-swift-container-1.4.8-4.el6.noarch
glusterfs-fuse-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.12rhs.beta3-1.el6rhs.x86_64
vdsm-gluster-4.10.2-22.5.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-proxy-1.4.8-4.el6.noarch
gluster-swift-account-1.4.8-4.el6.noarch
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-server-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.12rhs.beta3-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.12rhs.beta3-1.el6rhs.x86_64
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-object-1.4.8-4.el6.noarch
glusterfs-devel-3.4.0.12rhs.beta3-1.el6rhs.x86_64


How reproducible:
always

Steps to Reproduce:
1. Create a 3-brick distribute volume and start it.
2. Run "gluster volume remove-brick <vol> <brick> start".
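The reproduction steps can be scripted. The sketch below is a hypothetical Python helper (repro_commands and run are illustrative names, not part of any Gluster tooling) that builds the gluster CLI calls from the volume and brick names used in this report. It assumes the two nodes are already peers and that the brick directories exist. Run as a script it only prints the commands; calling run() executes them against a live cluster.

```python
import shlex
import subprocess

# Volume and bricks taken from the "gluster volume info" output in this report.
VOLUME = "a1"
BRICKS = [
    "10.70.35.62:/brick1/a11",
    "10.70.35.64:/brick2/a221",
    "10.70.35.62:/brick1/a31",
]


def repro_commands(volume, bricks, brick_to_remove):
    """Return the gluster CLI invocations for the reproduction steps, in order."""
    return [
        ["gluster", "volume", "create", volume, *bricks],  # step 1: distribute volume
        ["gluster", "volume", "start", volume],             # step 1: start it
        ["gluster", "volume", "remove-brick", volume, brick_to_remove, "start"],  # step 2
    ]


def run(commands):
    """Execute each command in order and stop at the first failure.

    Requires the gluster CLI and a running cluster; returns the failing exit code or 0.
    """
    for argv in commands:
        print("#", shlex.join(argv))
        result = subprocess.run(argv, capture_output=True, text=True)
        print(result.stdout + result.stderr, end="")
        if result.returncode != 0:
            return result.returncode
    return 0


if __name__ == "__main__":
    # Dry run: print the commands without executing them.
    for argv in repro_commands(VOLUME, BRICKS, BRICKS[2]):
        print(shlex.join(argv))
```

On an affected build the third command is expected to fail with the "not yet committed" message shown under "Actual results".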


Actual results:

It fails with the following message:
[root@beta1 ~]# gluster v remove-brick a1 10.70.35.62:/brick1/a31 start
volume remove-brick start: failed: A remove-brick task on volume a1 is not yet committed. Either commit or stop the remove-brick task.
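The error message implies a staging precondition: "remove-brick start" is refused while an earlier remove-brick task on the same volume is still uncommitted, and either commit or stop clears that state. Comment 5 states that nothing was pending here, and the glusterd log below cites a volume named "another" rather than a1, which suggests the check fired spuriously. The toy model below only illustrates the intended per-volume rule as stated in the message; VolumeTasks, RemoveBrickPending and their methods are hypothetical and do not reflect glusterd's actual code.

```python
class RemoveBrickPending(Exception):
    """Raised when a volume already has an uncommitted remove-brick task."""


class VolumeTasks:
    """Hypothetical per-volume record of remove-brick tasks (not glusterd's data model)."""

    def __init__(self):
        self.pending = {}  # volume name -> id of its uncommitted remove-brick task

    def start_remove_brick(self, volume, task_id):
        # Refuse only if *this* volume already has an uncommitted remove-brick task.
        if volume in self.pending:
            raise RemoveBrickPending(
                f"A remove-brick task on volume {volume} is not yet committed. "
                "Either commit or stop the remove-brick task."
            )
        self.pending[volume] = task_id

    def commit_or_stop(self, volume):
        # Per the error message, either commit or stop resolves the pending task.
        self.pending.pop(volume, None)
```

Under this rule a pending task on one volume never blocks remove-brick on a different volume.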




RHS nodes
==========
10.70.35.62
10.70.35.64

command issued from 10.70.35.62

Volume Name: a1
Type: Distribute
Volume ID: c5389939-7db4-4248-9236-7abfc8d720d2
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.62:/brick1/a11
Brick2: 10.70.35.64:/brick2/a221
Brick3: 10.70.35.62:/brick1/a31
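To confirm that the brick passed to remove-brick (10.70.35.62:/brick1/a31) actually belongs to volume a1, the volume info output can be parsed. A minimal sketch, assuming the "Key: value" layout shown above; parse_volume_info is a hypothetical helper, not a Gluster API.

```python
def parse_volume_info(text):
    """Parse one volume's `gluster volume info` text into a dict.

    Scalar fields map to their string values; "BrickN:" lines are collected,
    in order, into a list under the "Bricks" key.
    """
    info, bricks = {}, []
    for line in text.splitlines():
        # Split on the first colon only; brick values themselves contain "host:/path".
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key.startswith("Brick") and key[5:].isdigit():
            bricks.append(value)
        elif key != "Bricks":  # the bare "Bricks:" header carries no value
            info[key] = value
    info["Bricks"] = bricks
    return info
```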



glusterd.log
=============
3698] I [socket.c:3487:socket_init] 0-management: SSL support is NOT enabled
[2013-07-08 10:38:55.643712] I [socket.c:3502:socket_init] 0-management: using system polling thread
[2013-07-08 10:38:55.644112] I [socket.c:2237:socket_event_handler] 0-transport: disconnecting now
[2013-07-08 10:38:59.019730] I [glusterd-handler.c:1021:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-07-08 10:38:59.020280] I [glusterd-handler.c:1021:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-07-08 10:38:59.020637] I [glusterd-handler.c:1021:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-07-08 10:38:59.021241] I [glusterd-handler.c:1021:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-07-08 10:39:13.836498] I [glusterd-brick-ops.c:593:__glusterd_handle_remove_brick] 0-management: Received rem brick req
[2013-07-08 10:39:13.840039] I [glusterd-utils.c:7724:glusterd_generate_and_set_task_id] 0-management: Generated task-id 53b196a0-a2ce-41d0-b940-a79139119421 for key remove-brick-id
[2013-07-08 10:39:13.840577] I [glusterd-op-sm.c:4032:glusterd_bricks_select_remove_brick] 0-management: force flag is not set
[2013-07-08 10:39:13.866741] E [glusterd-brick-ops.c:1712:glusterd_op_remove_brick] 0-management: failed to start the rebalance
[2013-07-08 10:39:13.866759] E [glusterd-syncop.c:927:gd_commit_op_phase] 0-management: Commit of operation 'Volume Remove brick' failed on localhost : A remove-brick task on volume another is not yet committed. Either commit or stop the remove-brick task.
[2013-07-08 10:52:26.849777] I [glusterd-utils.c:7724:glusterd_generate_and_set_task_id] 0-management: Generated task-id a868e373-298b-4262-a833-7292b7b0af89 for key rebalance-id
[2013-07-08 10:52:26.849814] E [glusterd-op-sm.c:2790:glusterd_op_ac_send_stage_op] 0-management: Staging of operation 'Volume Rebalance' failed on localhost : A remove-brick task on volume another is not yet committed. Either commit or stop the remove-brick task.
[2013-07-08 10:53:05.613725] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /brick1/a11 on port 49159
[2013-07-08 10:53:05.614575] I [rpc-clnt.c:961:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-07-08 10:53:05.614619] I [socket.c:3487:socket_init] 0-management: SSL support is NOT enabled
[2013-07-08 10:53:05.614630] I [socket.c:3502:socket_init] 0-management: using system polling thread
[2013-07-08 10:53:05.628241] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /brick1/a31 on port 49160
[2013-07-08 10:53:05.628984] I [rpc-clnt.c:961:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2013-07-08 10:53:05.629062] I [socket.c:3487:socket_init] 0-management: SSL support is NOT enabled
[2013-07-08 10:53:05.629073] I [socket.c:3502:socket_init] 0-management: using system polling thread
[2013-07-08 10:53:05.658005] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
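When triaging a log like the one above, it helps to pull out only the error (E) entries. The sketch below assumes the line layout seen here, "[timestamp] LEVEL [file:line:function] translator: message", and treats any line that does not start a new entry as a hard-wrapped continuation of the previous message, joined without a space (as in the copied log, where words were split mid-word). parse_log and errors are hypothetical helpers.

```python
import re

# Assumed glusterd log layout: [timestamp] LEVEL [file:line:function] translator: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>[^:]+): (?P<msg>.*)$"
)


def parse_log(text):
    """Yield one dict per log entry, re-joining hard-wrapped continuation lines."""
    entry = None
    for raw in text.splitlines():
        m = LOG_LINE.match(raw)
        if m:
            if entry:
                yield entry
            entry = m.groupdict()
        elif entry:
            # Assumes the wrap split a word, so no separator is inserted.
            entry["msg"] += raw
        # Lines before the first parsable entry (e.g. a truncated one) are skipped.
    if entry:
        yield entry


def errors(text):
    """Return (function, message) pairs for every E-level entry."""
    return [(e["func"], e["msg"]) for e in parse_log(text) if e["level"] == "E"]
```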





Attached the sosreports.

Comment 4 Amar Tumballi 2013-07-19 10:22:29 UTC
Shylesh, was another rebalance running at this time?

[2013-07-08 10:52:26.849814] E [glusterd-op-sm.c:2790:glusterd_op_ac_send_stage_op] 0-management: Staging of operation 'Volume Rebalance' failed on localhost : A remove-brick task on volume another is not yet committed. Either commit or stop the remove-brick task.

Comment 5 shylesh 2013-07-19 10:27:59 UTC
(In reply to Amar Tumballi from comment #4)
> Shylesh, was another rebalance running at this time?
> 
> [2013-07-08 10:52:26.849814] E [glusterd-op-sm.c:2790:glusterd_op_ac_send_stage_op] 0-management: Staging of operation 'Volume Rebalance' failed on localhost : A remove-brick task on volume another is not yet committed. Either commit or stop the remove-brick task.

Amar,
No rebalance or other remove-brick operation was running.

Comment 6 Kaushal 2013-07-20 05:47:42 UTC
Going through this now. Will update by EOD.

Comment 7 Kaushal 2013-07-20 12:46:20 UTC
This was caused by an incomplete backport of the upstream patch. I have sent out a patch to add the missing code.
Patch under review at https://code.engineering.redhat.com/gerrit/10517

Comment 8 Kaushal 2013-07-22 03:37:22 UTC
Patch merged as commit b1c79ed82ea3ce2853588e77f5bbcdffdd65dfcd

Comment 10 shylesh 2013-08-01 09:27:18 UTC
Verified on 3.4.0.14rhs-1.el6rhs.x86_64

Comment 11 Scott Haines 2013-09-23 22:29:50 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html