Bug 1006317 - add-brick operation fails when one of the bricks in the replica pair is down
Status: CLOSED DEFERRED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: Bug Updates Notification Mailing List
QA Contact: storage-qa-internal@redhat.com
Depends On:
Blocks: 1286181
Reported: 2013-09-10 08:35 EDT by Anush Shetty
Modified: 2015-11-27 07:20 EST (History)
CC List: 4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1286181
Environment:
Last Closed: 2015-11-27 07:14:49 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Anush Shetty 2013-09-10 08:35:01 EDT
Description of problem: In a 1x2 replicate volume, we brought down one of the replica bricks by powering off the brick machine. We then tried to add 2 bricks to the volume using the add-brick command; the operation did not succeed. After the powered-off brick was brought back up, we tried adding the same bricks again, and this time add-brick failed saying that the bricks are already part of a volume. Even though the first add-brick failed, the volume-id extended attribute had already been set on the new brick paths.


Version-Release number of selected component (if applicable): glusterfs-3.4.0.33rhs-1.el6rhs.x86_64


How reproducible: Always


Steps to Reproduce:
1. Create a 1x2 Replicate volume
   
# gluster volume info
 
Volume Name: cinder-vol
Type: Replicate
Volume ID: 843455d2-3665-4cf5-8390-923f8edac27f
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: rhshdp01.lab.eng.blr.redhat.com:/rhs/brick2/s1
Brick2: rhshdp02.lab.eng.blr.redhat.com:/rhs/brick2/s2
Options Reconfigured:
storage.owner-gid: 165
storage.owner-uid: 165
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off

2. Bring one of the bricks in the replica pair down by powering off the machine

3. Try adding 2 more bricks to the volume. add-brick fails.

# gluster volume add-brick cinder-vol rhshdp03.lab.eng.blr.redhat.com:/rhs/brick2/s3 rhshdp04.lab.eng.blr.redhat.com:/rhs/brick2/s4

4. Bring the brick that was powered off back up

5. Try the add-brick operation again; it fails:

# gluster volume add-brick cinder-vol rhshdp03.lab.eng.blr.redhat.com:/rhs/brick2/s3 rhshdp04.lab.eng.blr.redhat.com:/rhs/brick2/s4 
volume add-brick: failed: Staging failed on rhshdp03.lab.eng.blr.redhat.com. Error: /rhs/brick2/s3 is already part of a volume
Staging failed on rhshdp04.lab.eng.blr.redhat.com. Error: /rhs/brick2/s4 is already part of a volume

Actual results:

The retried add-brick fails, saying that the bricks are already part of a volume, because the volume-id extended attribute was set on them during the first, failed attempt.

Expected results:

The trusted.glusterfs.volume-id extended attribute should not be left set on the new bricks when the add-brick operation fails, so that the operation can be retried cleanly.

Additional info:

# getfattr -d -m . -e hex /rhs/brick2/s3/
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/s3/
trusted.glusterfs.volume-id=0x843455d236654cf58390923f8edac27f
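
Until glusterd cleans up after a failed add-brick, a possible manual workaround (not verified against this build, offered only as a sketch) is to remove the stale xattr from the new brick paths and then retry add-brick. On rhshdp03, and likewise for /rhs/brick2/s4 on rhshdp04:

# setfattr -x trusted.glusterfs.volume-id /rhs/brick2/s3
# getfattr -d -m . -e hex /rhs/brick2/s3/

The second command should no longer list trusted.glusterfs.volume-id before the add-brick is retried.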

From glusterd logs:

[2013-09-10 11:58:51.018570] I [socket.c:2237:socket_event_handler] 0-transport: disconnecting now
[2013-09-10 12:02:56.591092] I [glusterd-brick-ops.c:370:__glusterd_handle_add_brick] 0-management: Received add brick req
[2013-09-10 12:05:27.427239] E [glusterd-utils.c:149:glusterd_lock] 0-management: Unable to get lock for uuid: b5550547-9704-490f-a891-b102766b7da5, lock held by: b5550547-9704-490f-a891-b102766b7da5
[2013-09-10 12:05:27.427296] E [glusterd-handler.c:513:glusterd_op_txn_begin] 0-management: Unable to acquire lock on localhost, ret: -1
[2013-09-10 12:05:35.204205] E [glusterd-utils.c:149:glusterd_lock] 0-management: Unable to get lock for uuid: b5550547-9704-490f-a891-b102766b7da5, lock held by: b5550547-9704-490f-a891-b102766b7da5
[2013-09-10 12:05:35.204284] E [glusterd-handler.c:513:glusterd_op_txn_begin] 0-management: Unable to acquire lock on localhost, ret: -1
[2013-09-10 12:05:43.586895] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
[2013-09-10 12:09:01.981120] I [glusterd-handler.c:2235:__glusterd_handle_friend_update] 0-: Received uuid: b5550547-9704-490f-a891-b102766b7da5, hostname:10.70.36.116
[2013-09-10 12:09:01.981135] I [glusterd-handler.c:2244:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2013-09-10 12:09:01.982494] E [glusterd-syncop.c:102:gd_collate_errors] 0-: Unlocking failed on 9178fbf0-11a0-451c-90c5-853f0cca3e02. Please check log file for details.
[2013-09-10 12:09:01.982945] E [glusterd-syncop.c:1068:gd_unlock_op_phase] 0-management: Failed to unlock on some peer(s)
[2013-09-10 12:09:01.983043] I [socket.c:3108:socket_submit_reply] 0-socket.management: not connected (priv->connected = -1)
[2013-09-10 12:09:01.983062] E [rpcsvc.c:1111:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1x, Program: GlusterD svc cli, ProgVers: 2, Proc: 13) to rpc-transport (socket.management)
[2013-09-10 12:09:01.983086] E [glusterd-utils.c:380:glusterd_submit_reply] 0-: Reply submission failed
[2013-09-10 12:10:29.965354] I [glusterd-brick-ops.c:370:__glusterd_handle_add_brick] 0-management: Received add brick req
[2013-09-10 12:10:29.972646] E [glusterd-syncop.c:102:gd_collate_errors] 0-: Staging failed on rhshdp03.lab.eng.blr.redhat.com. Error: /rhs/brick2/s3 is already part of a volume
[2013-09-10 12:10:29.973099] E [glusterd-syncop.c:102:gd_collate_errors] 0-: Staging failed on rhshdp04.lab.eng.blr.redhat.com. Error: /rhs/brick2/s4 is already part of a volume
[2013-09-10 12:10:41.027802] I [glusterd-handler.c:1073:__glusterd_handle_cli_get_volume] 0-glusterd: Received get vol req
Comment 2 Vijaikumar Mallikarjuna 2013-12-12 05:14:51 EST
With add-brick, the framework works as below:

lock
validate and set volume-id on bricks on node1
unlock

lock
validate and set volume-id on bricks on node2
unlock

and so on...

If there is a failure on one of the nodes, the framework does not support rolling back the configuration already applied on the previous nodes.

I will see how a rollback operation can be performed here.
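
A rough shell sketch of the sequence above (stage_bricks is a hypothetical stand-in for glusterd's per-node staging, not actual glusterd code) shows why a mid-sequence failure leaves the volume-id xattr behind on nodes that were already staged:

stage_bricks() {                      # stand-in: validate brick path, set volume-id xattr
    node=$1
    echo "lock $node"
    if [ "$node" = "node2" ]; then    # simulate the powered-off peer
        echo "staging failed on $node"
        return 1
    fi
    echo "set trusted.glusterfs.volume-id on new bricks via $node"
    echo "unlock $node"
}

for node in node1 node2 node3; do
    stage_bricks "$node" || break     # stop at the first failure; nothing rolls back
done
# node1's bricks keep the volume-id xattr, so retrying add-brick later
# fails with "is already part of a volume"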
