Bug 1022648 - "remove-brick" commit failed on peers which are not part of the volume and leaving the volume in inconsistent state
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1286196
 
Reported: 2013-10-23 17:28 UTC by spandura
Modified: 2015-11-27 12:28 UTC
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1286196
Environment:
Last Closed: 2015-11-27 12:24:41 UTC
Embargoed:



Description spandura 2013-10-23 17:28:00 UTC
Description of problem:
=========================
A cluster of 11 storage nodes (RHS nodes on AWS) hosts a 3 x 3 distribute-replicate volume with one brick per node; two of the nodes are part of the cluster but hold no brick of the volume. Removing bricks to reduce the replica count from 3 to 2 fails to commit on the peers that are not part of the volume, leaving the volume definition inconsistent across the cluster.
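
For reference, one way to see which peers do not host any brick of the volume (standard gluster CLI; the actual hostnames appear in the outputs further below) is to compare the peer list against the brick list:

# Peers known to the cluster (run on any storage node)
gluster peer status

# Hosts that actually carry a brick of 'exporter'; peers missing from this
# list are the "not part of the volume" nodes referred to above
gluster volume info exporter | grep '^Brick[0-9]*:' | cut -d: -f2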

Version-Release number of selected component (if applicable):
==============================================================
glusterfs 3.4.0.35rhs built on Oct 15 2013 14:06:04

How reproducible:
=================
Tried once on an AWS setup.

Steps to Reproduce:
======================
1. Create a 2 x 3 distribute-replicate volume on AWS.

2. Create FUSE mounts and create files/directories on them.

3. When the disk limit was exceeded, added 3 more bricks to the volume, making it a 3 x 3 distribute-replicate volume. Started rebalance.

4. node2 and node5 got terminated. Detached node2 and node5 from the cluster (peer detach force).

5. Added 2 more nodes to the cluster to perform replace-brick for the terminated nodes.

6. Stopped rebalance and tried replace-brick. replace-brick failed (cannot perform replace-brick on a detached peer; see bug https://bugzilla.redhat.com/show_bug.cgi?id=976902).

7. Performed remove-brick of the node3, node6 and node9 bricks to reduce the replica count from 3 to 2.

8. The remove-brick commit operation failed on the newly added peers (a condensed command sketch of steps 1-7 follows).
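
A condensed command sequence for steps 1-7 (node1..node11 are placeholders for the actual AWS hostnames; brick paths follow the /rhs/bricks/<volname> layout shown in the outputs below):

# 1. 2 x 3 distribute-replicate volume, one brick per node
gluster volume create exporter replica 3 \
    node1:/rhs/bricks/exporter node2:/rhs/bricks/exporter node3:/rhs/bricks/exporter \
    node4:/rhs/bricks/exporter node5:/rhs/bricks/exporter node6:/rhs/bricks/exporter
gluster volume start exporter

# 2. fuse mount on the clients and create data
mkdir -p /mnt/exporter && mount -t glusterfs node1:/exporter /mnt/exporter

# 3. grow to 3 x 3 and start rebalance
gluster volume add-brick exporter \
    node7:/rhs/bricks/exporter node8:/rhs/bricks/exporter node9:/rhs/bricks/exporter
gluster volume rebalance exporter start

# 4. drop the terminated nodes
gluster peer detach node2 force
gluster peer detach node5 force

# 5. add the replacement peers (these never receive a brick of 'exporter')
gluster peer probe node10
gluster peer probe node11

# 6-7. stop rebalance, then reduce replica 3 -> 2 by removing one brick per replica set
gluster volume rebalance exporter stop
gluster volume remove-brick exporter replica 2 \
    node3:/rhs/bricks/exporter node6:/rhs/bricks/exporter node9:/rhs/bricks/exporter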

Actual results:
==============
root@ip-10-80-14-219 [Oct-23-2013-12:00:47] >gluster v remove-brick exporter replica 2 ec2-54-217-61-122.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter ec2-54-216-100-218.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter ec2-54-220-252-186.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit force: failed: Commit failed on ec2-54-220-254-178.eu-west-1.compute.amazonaws.com. Please check log file for details.
Commit failed on ec2-54-220-229-94.eu-west-1.compute.amazonaws.com. Please check log file for details.
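
The inconsistency mentioned in the summary can be seen by comparing what each peer now reports for the volume. An illustrative loop (assumes password-less ssh between the nodes; substitute the real peer hostnames):

for host in <peer-hostnames>; do
    echo "== $host"
    # peers where the commit succeeded report 3 x 2 = 6; peers where it failed
    # are expected to still report the old 3 x 3 = 9 layout
    ssh "$host" 'gluster volume info exporter | grep "Number of Bricks"'
done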

Expected results:
=================
remove-brick commit should succeed on all peers, including peers that do not host any brick of the volume, so that every peer ends up with the same (3 x 2) volume definition.

Additional info:
=====================
Volume information from a peer on which remove-brick succeeded:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

root@ip-10-237-21-234 [Oct-23-2013-17:09:14] >gluster v info exporter
 
Volume Name: exporter
Type: Distributed-Replicate
Volume ID: 6a969bfc-2d84-49af-a343-13fc96a9c296
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: ec2-54-247-42-51.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Brick2: ec2-46-51-162-66.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Brick3: ec2-54-246-10-1.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Brick4: ec2-54-217-166-37.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Brick5: ec2-54-220-195-28.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Brick6: ec2-54-228-94-130.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
root@ip-10-237-21-234 [Oct-23-2013-17:09:20] >gluster peer status
Number of Peers: 9

Hostname: 10.36.193.171
Uuid: ea31bdbd-df60-4185-a0db-0f946929bd36
State: Peer in Cluster (Disconnected)

Hostname: ec2-54-246-10-1.eu-west-1.compute.amazonaws.com
Uuid: bef42bdf-540e-4846-a0b2-5665ffdea49f
State: Peer in Cluster (Disconnected)

Hostname: ec2-54-216-100-218.eu-west-1.compute.amazonaws.com
Uuid: 3329b0cf-57a1-48ed-9bec-dc51789378b1
State: Peer in Cluster (Disconnected)

Hostname: ec2-54-220-195-28.eu-west-1.compute.amazonaws.com
Uuid: 9178e0ff-4ccb-4984-8e88-716e791b7f10
State: Peer in Cluster (Connected)

Hostname: ec2-54-228-94-130.eu-west-1.compute.amazonaws.com
Uuid: 7ade3ee6-8c62-46a8-8277-d67a5ecfad05
State: Peer in Cluster (Connected)

Hostname: ec2-54-220-252-186.eu-west-1.compute.amazonaws.com
Uuid: 78f54b8e-3709-4e45-8dfa-1ff44eeef3f3
State: Peer in Cluster (Connected)

Hostname: ec2-54-220-254-178.eu-west-1.compute.amazonaws.com
Uuid: eb0e559a-c3da-4fe0-8d16-2921b5d95880
State: Peer in Cluster (Connected)

Hostname: ec2-54-220-229-94.eu-west-1.compute.amazonaws.com
Uuid: 98b4a63b-d637-4cab-ac60-6cd7d58ab883
State: Peer in Cluster (Connected)

Hostname: ec2-54-247-42-51.eu-west-1.compute.amazonaws.com
Uuid: 1962a65d-56e4-43c3-87c5-2a1cb62b642a
State: Peer in Cluster (Connected)
root@ip-10-237-21-234 [Oct-23-2013-17:09:23] >gluster v status exporter
Status of volume: exporter
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick ec2-54-247-42-51.eu-west-1.compute.amazonaws.com:
/rhs/bricks/exporter					49152	Y	5865
Brick ec2-54-220-195-28.eu-west-1.compute.amazonaws.com
:/rhs/bricks/exporter					49152	Y	6078
Brick ec2-54-228-94-130.eu-west-1.compute.amazonaws.com
:/rhs/bricks/exporter					49152	Y	6044
NFS Server on localhost					2049	Y	19210
Self-heal Daemon on localhost				N/A	Y	19217
NFS Server on ec2-54-220-252-186.eu-west-1.compute.amaz
onaws.com						2049	Y	7498
Self-heal Daemon on ec2-54-220-252-186.eu-west-1.comput
e.amazonaws.com						N/A	Y	7503
NFS Server on ec2-54-247-42-51.eu-west-1.compute.amazon
aws.com							2049	Y	5874
Self-heal Daemon on ec2-54-247-42-51.eu-west-1.compute.
amazonaws.com						N/A	Y	5879
NFS Server on ec2-54-220-254-178.eu-west-1.compute.amaz
onaws.com						2049	Y	7286
Self-heal Daemon on ec2-54-220-254-178.eu-west-1.comput
e.amazonaws.com						N/A	Y	7293
NFS Server on ec2-54-228-94-130.eu-west-1.compute.amazo
naws.com						2049	Y	7479
Self-heal Daemon on ec2-54-228-94-130.eu-west-1.compute
.amazonaws.com						N/A	Y	7480
NFS Server on ec2-54-220-195-28.eu-west-1.compute.amazo
naws.com						2049	Y	7511
Self-heal Daemon on ec2-54-220-195-28.eu-west-1.compute
.amazonaws.com						N/A	Y	7516
NFS Server on ec2-54-220-229-94.eu-west-1.compute.amazo
naws.com						2049	Y	7283
Self-heal Daemon on ec2-54-220-229-94.eu-west-1.compute
.amazonaws.com						N/A	Y	7290
 
There are no active volume tasks
root@ip-10-237-21-234 [Oct-23-2013-17:09:25] >
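
glusterd keeps its copy of the volume definition under /var/lib/glusterd/vols/<volname>/ on every peer, so comparing those files across peers is another (illustrative) way to confirm that the peers which failed the commit have diverged:

for host in <peer-hostnames>; do
    echo "== $host"
    ssh "$host" 'md5sum /var/lib/glusterd/vols/exporter/info /var/lib/glusterd/vols/exporter/cksum'
done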

glusterd log of the peer on which remove-brick failed:
=======================================================
[2013-10-23 10:50:14.440406] E [glusterd-handshake.c:1074:__glusterd_peer_dump_version_cbk] 0-: Error through RPC layer, retry again later
[2013-10-23 10:50:15.581053] E [socket.c:2158:socket_connect_finish] 0-management: connection to 10.36.193.171:24007 failed (Connection refused)
[2013-10-23 12:01:11.026734] I [glusterd-op-sm.c:4065:glusterd_bricks_select_remove_brick] 0-management: force flag is not set
[2013-10-23 12:01:11.030012] E [glusterd-op-sm.c:3683:glusterd_op_ac_commit_op] 0-management: Commit of operation 'Volume Remove brick' failed: -1
[2013-10-23 12:03:12.727439] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:03:12.728692] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:03:12.729997] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:07:06.333935] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:07:06.335331] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:07:06.336505] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:11:46.182667] W [socket.c:522:__socket_rwv] 0-management: readv on 10.36.193.171:24007 failed (Connection reset by peer)
[2013-10-23 12:11:46.182814] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x7f1cd76a40f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f1cd76a3c33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1cd76a3b4e]))) 0-management: forced unwinding frame type(GLUSTERD-DUMP) op(DUMP(1)) called at 2013-10-23 12:11:41.952405 (xid=0x27x)
[2013-10-23 12:11:46.182831] E [glusterd-handshake.c:1074:__glusterd_peer_dump_version_cbk] 0-: Error through RPC layer, retry again later
[2013-10-23 12:11:47.956841] E [socket.c:2158:socket_connect_finish] 0-management: connection to 10.36.193.171:24007 failed (Connection refused)
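
Possible workaround (not verified on this setup): since the commit failed only on peers that hold no brick of the volume, restarting glusterd on those peers should make them re-handshake with the cluster and import the newer volume definition from the other peers. Roughly:

# on each peer where the commit failed (RHEL 6 / RHS 2.1 init script)
service glusterd restart
# then check whether the peer now reports Number of Bricks: 3 x 2 = 6
gluster volume info exporter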

