Description of problem:
=======================
A cluster of 11 storage nodes (RHS nodes on AWS) contains a 3 x 3 distributed-replicate volume with 1 brick on each of 9 nodes; the remaining 2 nodes are not part of the volume. Removing bricks to reduce the replica count from 3 to 2 fails to commit on the peers that are not part of the volume.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.4.0.35rhs built on Oct 15 2013 14:06:04

How reproducible:
=================
Tried once on an AWS setup.

Steps to Reproduce:
===================
1. Create a distributed-replicate volume (2 x 3) on AWS.
2. Create fuse mounts. Create files/directories.
3. Disk limit exceeded; added 3 more bricks to the volume, making it a 3 x 3 distributed-replicate volume. Started rebalance.
4. node2 and node5 got terminated. Detached node2 and node5 from the cluster (peer detach force).
5. Added 2 more nodes to the cluster to perform replace-brick for the terminated nodes.
6. Stopped rebalance and tried to do replace-brick. replace-brick failed (cannot perform replace-brick on a detached peer; please refer to bug https://bugzilla.redhat.com/show_bug.cgi?id=976902).
7. Performed remove-brick of node3, node6 and node9 to reduce the replica count from 3 to 2.
8. The remove-brick commit op failed on the newly added peers.

Actual results:
===============
root@ip-10-80-14-219 [Oct-23-2013-12:00:47] >gluster v remove-brick exporter replica 2 ec2-54-217-61-122.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter ec2-54-216-100-218.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter ec2-54-220-252-186.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit force: failed: Commit failed on ec2-54-220-254-178.eu-west-1.compute.amazonaws.com. Please check log file for details. Commit failed on ec2-54-220-229-94.eu-west-1.compute.amazonaws.com.
Please check log file for details.

Expected results:
=================
The remove-brick commit should succeed on all peers in the cluster, including the peers that do not host bricks of the volume.

Additional info:
================
Volume information from a node on which remove-brick succeeded:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
root@ip-10-237-21-234 [Oct-23-2013-17:09:14] >gluster v info exporter

Volume Name: exporter
Type: Distributed-Replicate
Volume ID: 6a969bfc-2d84-49af-a343-13fc96a9c296
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: ec2-54-247-42-51.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Brick2: ec2-46-51-162-66.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Brick3: ec2-54-246-10-1.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Brick4: ec2-54-217-166-37.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Brick5: ec2-54-220-195-28.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter
Brick6: ec2-54-228-94-130.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter

root@ip-10-237-21-234 [Oct-23-2013-17:09:20] >gluster peer status
Number of Peers: 9

Hostname: 10.36.193.171
Uuid: ea31bdbd-df60-4185-a0db-0f946929bd36
State: Peer in Cluster (Disconnected)

Hostname: ec2-54-246-10-1.eu-west-1.compute.amazonaws.com
Uuid: bef42bdf-540e-4846-a0b2-5665ffdea49f
State: Peer in Cluster (Disconnected)

Hostname: ec2-54-216-100-218.eu-west-1.compute.amazonaws.com
Uuid: 3329b0cf-57a1-48ed-9bec-dc51789378b1
State: Peer in Cluster (Disconnected)

Hostname: ec2-54-220-195-28.eu-west-1.compute.amazonaws.com
Uuid: 9178e0ff-4ccb-4984-8e88-716e791b7f10
State: Peer in Cluster (Connected)

Hostname: ec2-54-228-94-130.eu-west-1.compute.amazonaws.com
Uuid: 7ade3ee6-8c62-46a8-8277-d67a5ecfad05
State: Peer in Cluster (Connected)

Hostname: ec2-54-220-252-186.eu-west-1.compute.amazonaws.com
Uuid: 78f54b8e-3709-4e45-8dfa-1ff44eeef3f3
State: Peer in Cluster (Connected)

Hostname: ec2-54-220-254-178.eu-west-1.compute.amazonaws.com
Uuid: eb0e559a-c3da-4fe0-8d16-2921b5d95880
State: Peer in Cluster (Connected)

Hostname: ec2-54-220-229-94.eu-west-1.compute.amazonaws.com
Uuid: 98b4a63b-d637-4cab-ac60-6cd7d58ab883
State: Peer in Cluster (Connected)

Hostname: ec2-54-247-42-51.eu-west-1.compute.amazonaws.com
Uuid: 1962a65d-56e4-43c3-87c5-2a1cb62b642a
State: Peer in Cluster (Connected)

root@ip-10-237-21-234 [Oct-23-2013-17:09:23] >gluster v status exporter
Status of volume: exporter
Gluster process                                                                Port   Online  Pid
------------------------------------------------------------------------------
Brick ec2-54-247-42-51.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter    49152  Y       5865
Brick ec2-54-220-195-28.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter   49152  Y       6078
Brick ec2-54-228-94-130.eu-west-1.compute.amazonaws.com:/rhs/bricks/exporter   49152  Y       6044
NFS Server on localhost                                                        2049   Y       19210
Self-heal Daemon on localhost                                                  N/A    Y       19217
NFS Server on ec2-54-220-252-186.eu-west-1.compute.amazonaws.com               2049   Y       7498
Self-heal Daemon on ec2-54-220-252-186.eu-west-1.compute.amazonaws.com         N/A    Y       7503
NFS Server on ec2-54-247-42-51.eu-west-1.compute.amazonaws.com                 2049   Y       5874
Self-heal Daemon on ec2-54-247-42-51.eu-west-1.compute.amazonaws.com           N/A    Y       5879
NFS Server on ec2-54-220-254-178.eu-west-1.compute.amazonaws.com               2049   Y       7286
Self-heal Daemon on ec2-54-220-254-178.eu-west-1.compute.amazonaws.com         N/A    Y       7293
NFS Server on ec2-54-228-94-130.eu-west-1.compute.amazonaws.com                2049   Y       7479
Self-heal Daemon on ec2-54-228-94-130.eu-west-1.compute.amazonaws.com          N/A    Y       7480
NFS Server on ec2-54-220-195-28.eu-west-1.compute.amazonaws.com                2049   Y       7511
Self-heal Daemon on ec2-54-220-195-28.eu-west-1.compute.amazonaws.com          N/A    Y       7516
NFS Server on ec2-54-220-229-94.eu-west-1.compute.amazonaws.com                2049   Y       7283
Self-heal Daemon on ec2-54-220-229-94.eu-west-1.compute.amazonaws.com          N/A    Y       7290

There are no active volume tasks
root@ip-10-237-21-234 [Oct-23-2013-17:09:25] >

glusterd log of the peer on which remove-brick failed:
======================================================
[2013-10-23 10:50:14.440406] E [glusterd-handshake.c:1074:__glusterd_peer_dump_version_cbk] 0-: Error through RPC layer, retry again later
[2013-10-23 10:50:15.581053] E [socket.c:2158:socket_connect_finish] 0-management: connection to 10.36.193.171:24007 failed (Connection refused)
[2013-10-23 12:01:11.026734] I [glusterd-op-sm.c:4065:glusterd_bricks_select_remove_brick] 0-management: force flag is not set
[2013-10-23 12:01:11.030012] E [glusterd-op-sm.c:3683:glusterd_op_ac_commit_op] 0-management: Commit of operation 'Volume Remove brick' failed: -1
[2013-10-23 12:03:12.727439] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:03:12.728692] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:03:12.729997] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:07:06.333935] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:07:06.335331] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:07:06.336505] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-10-23 12:11:46.182667] W [socket.c:522:__socket_rwv] 0-management: readv on 10.36.193.171:24007 failed (Connection reset by peer)
[2013-10-23 12:11:46.182814] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x164) [0x7f1cd76a40f4] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f1cd76a3c33] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f1cd76a3b4e]))) 0-management: forced unwinding frame type(GLUSTERD-DUMP) op(DUMP(1)) called at 2013-10-23 12:11:41.952405 (xid=0x27x)
[2013-10-23 12:11:46.182831] E [glusterd-handshake.c:1074:__glusterd_peer_dump_version_cbk] 0-: Error through RPC layer, retry again later
[2013-10-23 12:11:47.956841] E [socket.c:2158:socket_connect_finish] 0-management: connection to 10.36.193.171:24007 failed (Connection refused)