Bug 812214

Summary: [b337b755325f75a6fcf65616eaf4467b70b8b245]: add-brick should not be allowed for a directory which already has a volume-id
Product: [Community] GlusterFS Reporter: Raghavendra Bhat <rabhat>
Component: glusterd    Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED CURRENTRELEASE QA Contact: shylesh <shmohan>
Severity: medium Docs Contact:
Priority: high    
Version: mainline    CC: amarts, bloch, bturner, cyrille.duverne, gluster-bugs, haruo.tomita, hateya, jdarcy, m, ndevos, nsathyan, stanislav.polasek, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.4.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-07-24 13:41:02 EDT Type: Bug
Regression: --- Mount Type: ---
Documentation: DP CRM:
Verified Versions: glusterfs-3.3.0qa43 Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Bug Depends On:    
Bug Blocks: 817967    

Description Raghavendra Bhat 2012-04-13 02:25:01 EDT
Description of problem:
Suppose there is a 3-replica volume and a Linux kernel untar is being performed on the mount point. Immediately bring the first 2 bricks down. The untar continues because the 3rd brick is still up. After some time (many files have been created), bring the downed bricks back up and issue the volume heal command, then immediately bring the 3rd brick (which is the source brick for self-heal) down.

Now there will be either stale data or some data loss on the remaining 2 bricks.
On the mount point, untar the Linux kernel again, which recreates the files and directories (but with different gfids). Now add the 3rd brick back to the volume.

The volume can end up in a complicated state where the same file/directory has different gfids on different bricks.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a 3-replica volume, start it, and mount it.
2. Start untarring the Linux kernel on the mount point and immediately bring the first 2 bricks down.
3. After many files and directories have been created, stop the untar and bring the downed bricks back up.
4. Issue the volume heal command and immediately bring the 3rd brick down.
5. Start untarring the Linux kernel on the mount point again.
6. After some time, add the 3rd brick back to the volume.
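The steps above can be sketched as a shell transcript. Hostname, volume name, brick paths, and kernel tarball are hypothetical (taken from the volume info below), and the brick-down steps are shown as comments since they were done by killing the glusterfsd processes:

```shell
# Hypothetical hostname, volume name, and brick paths throughout.
gluster volume create mirror replica 3 \
    hyperspace:/mnt/sda7/export3 hyperspace:/mnt/sda8/export3 hyperspace:/mnt/sda10/export3
gluster volume start mirror
mount -t glusterfs hyperspace:/mirror /mnt/gluster

tar -xf linux-2.6.31.1.tar.bz2 -C /mnt/gluster &   # step 2: untar in the background
# ...kill the glusterfsd processes serving bricks 1 and 2...

# Steps 3/4: restart the downed bricks, trigger heal, then take brick 3 down.
gluster volume heal mirror
# ...kill the glusterfsd process serving brick 3...

tar -xf linux-2.6.31.1.tar.bz2 -C /mnt/gluster     # step 5: recreates files with new gfids
gluster volume add-brick mirror replica 3 hyperspace:/mnt/sda10/export3   # step 6
```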
  
Actual results:

The 3rd brick, which already has a volume-id set, is allowed to be added back to the volume.
Expected results:

If a directory contains a volume-id, it should not be allowed to be added back to the volume, even when the directory's volume-id matches the volume's volume-id.
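A minimal sketch of the kind of pre-flight check being asked for: refuse a brick directory that already carries the trusted.glusterfs.volume-id xattr. This is an illustrative script, not the actual fix (which went into glusterd's brick-path validation, glusterd_brick_create_path); it requires getfattr from the attr package, and here it demonstrates the "clean" case on a freshly created temporary directory:

```shell
# Demo of the check on a brand-new directory (which has no volume-id xattr).
BRICK=$(mktemp -d)
if getfattr --absolute-names -n trusted.glusterfs.volume-id "$BRICK" >/dev/null 2>&1; then
    # xattr present: this directory was already used as a brick.
    echo "$BRICK or a prefix of it is already part of a volume"
else
    # xattr absent (or getfattr unavailable): nothing marks it as a brick.
    echo "brick looks unused; safe to add"
fi
rmdir "$BRICK"
```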

Additional info:


[2012-04-13 11:37:46.802264] W [fuse-bridge.c:291:fuse_entry_cbk] 0-glusterfs-fuse: 14529: LOOKUP() /linux-2.6.31.1/.gitignore => -1 (Input/output error)
[2012-04-13 11:37:46.912155] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/.mailmap: gfid differs on subvolume 2
[2012-04-13 11:37:46.912196] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/.mailmap: gfid differs on subvolume 2
[2012-04-13 11:37:46.912211] W [afr-common.c:1190:afr_detect_self_heal_by_iatt] 1-mirror-replicate-0: /linux-2.6.31.1/.mailmap: gfid different on subvolume
[2012-04-13 11:37:46.912237] I [afr-common.c:1335:afr_launch_self_heal] 1-mirror-replicate-0: background  meta-data data missing-entry self-heal triggered. path: /linux-2.6.31.1/.mailmap, reason: lookup detected pending operations
[2012-04-13 11:37:46.926673] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/.mailmap: gfid differs on subvolume 2
[2012-04-13 11:37:46.931818] I [afr-self-heal-common.c:918:afr_sh_missing_entries_done] 1-mirror-replicate-0: split brain found, aborting selfheal of /linux-2.6.31.1/.mailmap
[2012-04-13 11:37:46.931871] E [afr-self-heal-common.c:2042:afr_self_heal_completion_cbk] 1-mirror-replicate-0: background  meta-data data missing-entry self-heal failed on /linux-2.6.31.1/.mailmap
[2012-04-13 11:37:46.931940] W [fuse-bridge.c:291:fuse_entry_cbk] 0-glusterfs-fuse: 14532: LOOKUP() /linux-2.6.31.1/.mailmap => -1 (Input/output error)
[2012-04-13 11:37:46.977815] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/COPYING: gfid differs on subvolume 2
[2012-04-13 11:37:46.977865] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/COPYING: gfid differs on subvolume 2
[2012-04-13 11:37:46.977881] W [afr-common.c:1190:afr_detect_self_heal_by_iatt] 1-mirror-replicate-0: /linux-2.6.31.1/COPYING: gfid different on subvolume
[2012-04-13 11:37:46.977906] I [afr-common.c:1335:afr_launch_self_heal] 1-mirror-replicate-0: background  meta-data data missing-entry self-heal triggered. path: /linux-2.6.31.1/COPYING, reason: lookup detected pending operations
[2012-04-13 11:37:46.996465] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/COPYING: gfid differs on subvolume 1
[2012-04-13 11:37:47.002091] I [afr-self-heal-common.c:918:afr_sh_missing_entries_done] 1-mirror-replicate-0: split brain found, aborting selfheal of /linux-2.6.31.1/COPYING
[2012-04-13 11:37:47.002135] E [afr-self-heal-common.c:2042:afr_self_heal_completion_cbk] 1-mirror-replicate-0: background  meta-data data missing-entry self-heal failed on /linux-2.6.31.1/COPYING
[2012-04-13 11:37:47.002167] W [fuse-bridge.c:291:fuse_entry_cbk] 0-glusterfs-fuse: 14533: LOOKUP() /linux-2.6.31.1/COPYING => -1 (Input/output error)


gluster volume info
 
Volume Name: mirror
Type: Replicate
Volume ID: e68ec23f-140e-46fd-9d21-e2662dc175f9
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: hyperspace:/mnt/sda7/export3
Brick2: hyperspace:/mnt/sda8/export3
Brick3: hyperspace:/mnt/sda10/export3
Comment 1 Raghavendra Bhat 2012-04-13 02:54:10 EDT
(In reply to comment #0)
> Description of problem:
> Suppose there is a 3 replica volume. Now linux-kernel untarring is being
> performed on the mount point. Immedietly 1st 2 bricks are brought down.
> Untarring still continues because 3rd brick is still up. After some time (many
> files are created) bring the down bricks up and give volume heal command. And
> immedietly bring the 3rd brick (which is the source brick for self-heal) down. 
> 

Here the 3rd brick was removed using remove-brick (i.e. gluster volume remove-brick <volname> replica 2 <brick>) instead of killing the brick process (i.e. the graph itself was changed).

Comment 2 Amar Tumballi 2012-04-27 05:11:59 EDT
patch available @ http://review.gluster.com/3147
Comment 3 Anand Avati 2012-05-03 15:59:49 EDT
CHANGE: http://review.gluster.com/3147 (glusterd: Disallow (re)-using bricks that were part of any volume) merged in master by Anand Avati (avati@redhat.com)
Comment 4 Anand Avati 2012-05-05 09:15:06 EDT
CHANGE: http://review.gluster.com/3279 (mgmt/glusterd: allow volume start force) merged in master by Vijay Bellur (vijay@gluster.com)
Comment 5 Amar Tumballi 2012-05-11 03:00:04 EDT
patches sent for both upstream (http://review.gluster.com/3280) and 3.3 branch (http://review.gluster.com/3313)
Comment 6 Anand Avati 2012-05-16 06:53:07 EDT
CHANGE: http://review.gluster.com/3280 (glusterd: Fixed glusterd_brick_create_path algo.) merged in master by Vijay Bellur (vijay@gluster.com)
Comment 7 Anand Avati 2012-05-19 06:27:27 EDT
CHANGE: http://review.gluster.com/3313 (glusterd: Fixed glusterd_brick_create_path algo.) merged in release-3.3 by Vijay Bellur (vijay@gluster.com)
Comment 8 Raghavendra Bhat 2012-05-25 08:19:47 EDT
gluster volume remove-brick vol replica 2 hyperspace:/mnt/sda10/export3
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Remove Brick commit force successful
root@hyperspace:~# gluster volume add-brick vol replica 3 hyperspace:/mnt/sda10/export3
/mnt/sda10/export3 or a prefix of it is already part of a volume

Now remove-brick followed by adding the same brick back fails, as expected. Verified with glusterfs-3.3.0qa43.
Comment 9 b.candler 2012-06-07 12:11:23 EDT
With glusterfs 3.3.0 I now have a specific problem: I cannot destroy a volume and then recreate it.

On a two-node cluster:

gluster volume create foo server1:/data/brickdir server2:/data/brickdir
gluster volume start foo
...
gluster volume stop foo
gluster volume delete foo
# gluster volume info says there are no volumes
# the two peers are in agreement over this
gluster volume create foo server1:/data/brickdir server2:/data/brickdir
# Error: "/data/brickdir or a prefix of it is already part of a volume"
rm -rf /data/brickdir/.glusterfs ## doesn't help!
rm -rf /data/brickdir ## fixes the problem, but this is drastic

Deleting and recreating a volume was allowed in glusterfs 3.2.5. Of course it's possible that the two bricks are out of sync at the point when the volume is recreated; but this is no different to any other split-brain scenario AFAICS.

IMO there needs to be (at least) a clearly-documented recovery path for this situation, which doesn't involve deleting all your data.

(Aside: the glusterfs 3.3 internals, and semantics of "gfid" and "volume-id", are not known to me; are they documented anywhere?)
Comment 10 krishnan parthasarathi 2012-06-08 01:46:42 EDT
To be able to 'reuse' a brick, one needs to remove the "volume-id" and "gfid" present on the brick path. These are stored as extended attributes and mark whether the brick has been used (or is in use) as part of a volume and has not been 'cleaned' appropriately. The cleaning may be required because a reused brick contains not only data but also GlusterFS metadata/state from the previous volume it was part of. This can have undesirable effects when the brick is 'added' to a 'new' volume, especially one with the same name.
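These attributes can be inspected on a brick root with getfattr (from the attr package); the brick path below is hypothetical:

```shell
# Dump all trusted.* xattrs on a brick root, hex-encoded.
# On a used brick this typically includes trusted.gfid and
# trusted.glusterfs.volume-id.
getfattr --absolute-names -m trusted -d -e hex /mnt/sda10/export3
```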

I will see to it that this is documented, and that a recovery path which does not require deleting data is also provided in the documentation.
Comment 11 krishnan parthasarathi 2012-06-08 06:39:03 EDT
*** Bug 829808 has been marked as a duplicate of this bug. ***
Comment 12 Niels de Vos 2012-06-11 03:17:10 EDT
*** Bug 829675 has been marked as a duplicate of this bug. ***
Comment 13 Marnus van Niekerk 2012-06-12 05:22:20 EDT
This "feature" has SERIOUS implications for newcomers testing different combinations of bricks.  It is IMPOSSIBLE to create a test volume, delete it, and then create a different test volume with the same bricks.

This is just STUPID.  The error message should be clearer and the recovery path well documented.

It is simple enough to remove test directories if they are under existing mount points, but if they ARE the mount points then the actual mount points have to be removed and recreated to get past this!!!
Comment 14 CyD 2012-06-12 05:36:17 EDT
Actually, I would suggest launching the recovery script from the "volume delete" instruction, since it has to flush the whole setup of the volume anyway.

Why do we have to keep those attributes on the mount point?
Comment 15 Marnus van Niekerk 2012-06-12 05:46:36 EDT
Can someone tell us *exactly* how to work around this in the meantime please?

"attr -l /mountpoint" lists the gfid and volume-id attributes, but "attr -r gfid /mountpoint" says it does not exist and does not remove it!!!
Comment 16 Marnus van Niekerk 2012-06-12 06:25:24 EDT
This seems to work:

cd /mount/point
for i in `attr -lq .`; do   setfattr -x trusted.$i .; done

M
Comment 17 Jeff Darcy 2012-06-13 10:11:20 EDT
(In reply to comment #16)
> This seems to work:
> 
> cd /mount/point
> for i in `attr -lq .`; do   setfattr -x trusted.$i .; done
> 
> M

That will delete xattrs needed by other packages as well.  What you really need to clear is trusted.glusterfs.volume-id and trusted.gfid, and you should delete the .glusterfs directory as well.  I do this all the time during development, and haven't seen any failures attributable to leftover artifacts from the brick's previous incarnation.  Then again, I'm a developer.  *THIS PROCEDURE CAN LEAD TO DATA LOSS* because of GFID conflicts or inconsistencies in the AFR/DHT/stripe xattrs, and you'll be on your own in unsupported-land if that happens.  You should only ever do it on bricks that contain test data, not on anything you're responsible for keeping.
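That narrower cleanup can be sketched as follows, under the same data-loss caveat (the brick path is hypothetical, and setfattr comes from the attr package):

```shell
# DANGER: only for bricks holding disposable test data.
BRICK=/mnt/sda10/export3                      # hypothetical brick path
setfattr -x trusted.glusterfs.volume-id "$BRICK"
setfattr -x trusted.gfid "$BRICK"
rm -rf "$BRICK/.glusterfs"                    # GlusterFS's internal hardlink store
```

This is essentially what the extras/clear_xattrs.sh script later packaged with glusterfs-server automates.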
Comment 18 Marnus van Niekerk 2012-06-13 10:15:56 EDT
(In reply to comment #17)
> (In reply to comment #16)
> > This seems to work:
> > 
> > cd /mount/point
> > for i in `attr -lq .`; do   setfattr -x trusted.$i .; done
> > 
> > M
> 
> That will delete xattrs needed by other packages as well.

I realise that, but this comment was in the context of my first comment (#13) which relates to the re-use of empty bricks in a test environment.
Comment 19 Niels de Vos 2012-07-10 10:55:23 EDT
http://review.gluster.com/3644 has been filed to get extras/clear_xattrs.sh packaged in the RPM.
Comment 20 Vijay Bellur 2012-07-12 03:22:52 EDT
CHANGE: http://review.gluster.com/3644 (extras: add clear_xattrs.sh to the glusterfs-server sub-package) merged in master by Anand Avati (avati@redhat.com)
Comment 21 Niels de Vos 2012-07-12 06:51:08 EDT
http://review.gluster.com/3659 is needed as well: currently clear_xattrs.sh is not included in 'make dist', which causes the building of the RPMs to fail.
Comment 22 Vijay Bellur 2012-07-12 12:27:33 EDT
CHANGE: http://review.gluster.com/3659 (extras: install clear_xattrs.sh) merged in master by Anand Avati (avati@redhat.com)
Comment 23 Ben Turner 2012-08-13 11:02:37 EDT
*** Bug 847778 has been marked as a duplicate of this bug. ***
Comment 24 Niels de Vos 2012-09-04 05:30:28 EDT
release-3.3 changes:
- http://review.gluster.org/3897
- http://review.gluster.org/3900
Comment 25 Vijay Bellur 2012-10-25 03:47:46 EDT
CHANGE: http://review.gluster.org/3897 (extras: install clear_xattrs.sh) merged in release-3.3 by Vijay Bellur (vbellur@redhat.com)
Comment 26 Vijay Bellur 2012-10-25 03:49:35 EDT
CHANGE: http://review.gluster.org/3900 (extras: add clear_xattrs.sh to the glusterfs-server sub-package) merged in release-3.3 by Vijay Bellur (vbellur@redhat.com)