Description of problem:
Suppose there is a 3-replica volume and a linux-kernel untar is being performed on the mount point. Immediately bring the first 2 bricks down. The untar still continues because the 3rd brick is still up. After some time (once many files have been created), bring the downed bricks up and issue the volume heal command, then immediately bring the 3rd brick (which is the source brick for self-heal) down. Now there will be either stale data or some data loss on the remaining 2 bricks. On the mount point, perform the linux-kernel untar again, which creates the files and directories again (but with different gfids). Now add the 3rd brick back to the volume. The volume can end up in a complicated state where the same file/directory has different gfids on different bricks.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a 3-replica volume, start it and mount it.
2. Start untarring the linux kernel on the mount point and immediately bring the first 2 bricks down.
3. After many files and directories have been created, stop the untar and bring the downed bricks up.
4. Issue the volume heal command and immediately bring the 3rd brick down.
5. Start untarring the linux kernel on the mount point again.
6. After some time, add the 3rd brick back to the volume.

Actual results:
The 3rd brick, which already carries a volume-id, is allowed to be added back to the volume.

Expected results:
If a directory contains a volume-id, it should not be allowed to be added back to the volume, even though the volume's volume-id and the directory's volume-id are the same.
Additional info:
[2012-04-13 11:37:46.802264] W [fuse-bridge.c:291:fuse_entry_cbk] 0-glusterfs-fuse: 14529: LOOKUP() /linux-2.6.31.1/.gitignore => -1 (Input/output error)
[2012-04-13 11:37:46.912155] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/.mailmap: gfid differs on subvolume 2
[2012-04-13 11:37:46.912196] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/.mailmap: gfid differs on subvolume 2
[2012-04-13 11:37:46.912211] W [afr-common.c:1190:afr_detect_self_heal_by_iatt] 1-mirror-replicate-0: /linux-2.6.31.1/.mailmap: gfid different on subvolume
[2012-04-13 11:37:46.912237] I [afr-common.c:1335:afr_launch_self_heal] 1-mirror-replicate-0: background meta-data data missing-entry self-heal triggered. path: /linux-2.6.31.1/.mailmap, reason: lookup detected pending operations
[2012-04-13 11:37:46.926673] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/.mailmap: gfid differs on subvolume 2
[2012-04-13 11:37:46.931818] I [afr-self-heal-common.c:918:afr_sh_missing_entries_done] 1-mirror-replicate-0: split brain found, aborting selfheal of /linux-2.6.31.1/.mailmap
[2012-04-13 11:37:46.931871] E [afr-self-heal-common.c:2042:afr_self_heal_completion_cbk] 1-mirror-replicate-0: background meta-data data missing-entry self-heal failed on /linux-2.6.31.1/.mailmap
[2012-04-13 11:37:46.931940] W [fuse-bridge.c:291:fuse_entry_cbk] 0-glusterfs-fuse: 14532: LOOKUP() /linux-2.6.31.1/.mailmap => -1 (Input/output error)
[2012-04-13 11:37:46.977815] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/COPYING: gfid differs on subvolume 2
[2012-04-13 11:37:46.977865] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/COPYING: gfid differs on subvolume 2
[2012-04-13 11:37:46.977881] W [afr-common.c:1190:afr_detect_self_heal_by_iatt] 1-mirror-replicate-0: /linux-2.6.31.1/COPYING: gfid different on subvolume
[2012-04-13 11:37:46.977906] I [afr-common.c:1335:afr_launch_self_heal] 1-mirror-replicate-0: background meta-data data missing-entry self-heal triggered. path: /linux-2.6.31.1/COPYING, reason: lookup detected pending operations
[2012-04-13 11:37:46.996465] W [afr-common.c:1414:afr_conflicting_iattrs] 1-mirror-replicate-0: /linux-2.6.31.1/COPYING: gfid differs on subvolume 1
[2012-04-13 11:37:47.002091] I [afr-self-heal-common.c:918:afr_sh_missing_entries_done] 1-mirror-replicate-0: split brain found, aborting selfheal of /linux-2.6.31.1/COPYING
[2012-04-13 11:37:47.002135] E [afr-self-heal-common.c:2042:afr_self_heal_completion_cbk] 1-mirror-replicate-0: background meta-data data missing-entry self-heal failed on /linux-2.6.31.1/COPYING
[2012-04-13 11:37:47.002167] W [fuse-bridge.c:291:fuse_entry_cbk] 0-glusterfs-fuse: 14533: LOOKUP() /linux-2.6.31.1/COPYING => -1 (Input/output error)

gluster volume info

Volume Name: mirror
Type: Replicate
Volume ID: e68ec23f-140e-46fd-9d21-e2662dc175f9
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: hyperspace:/mnt/sda7/export3
Brick2: hyperspace:/mnt/sda8/export3
Brick3: hyperspace:/mnt/sda10/export3
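The volume-id referred to in the expected results is stored as an extended attribute on the brick root; it can be inspected with `getfattr -n trusted.glusterfs.volume-id -e hex <brick>` (run as root). A minimal sketch of extracting the raw id from that output, using a captured sample matching the volume above (the `parse_volume_id` helper and the sample capture are illustrative, not part of GlusterFS):

```shell
# Sample `getfattr -e hex` output for one brick of the "mirror" volume
# (the hex value is the Volume ID above, without dashes; sample is illustrative).
sample='# file: mnt/sda10/export3
trusted.glusterfs.volume-id=0xe68ec23f140e46fd9d21e2662dc175f9'

# parse_volume_id: strip the xattr name and 0x prefix, leaving the raw hex id.
parse_volume_id() {
  sed -n 's/^trusted\.glusterfs\.volume-id=0x//p'
}

vid=$(printf '%s\n' "$sample" | parse_volume_id)
echo "$vid"   # -> e68ec23f140e46fd9d21e2662dc175f9
```

Comparing this value against `Volume ID` in `gluster volume info` shows whether a brick directory was previously stamped by this (or another) volume.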
(In reply to comment #0)
> [...] give volume heal command. And immediately bring the 3rd brick (which is
> the source brick for self-heal) down.

Here the 3rd brick was removed using remove-brick (i.e. gluster volume remove-brick replica 2 <volname> <brick>) instead of killing the brick process (i.e. the graph itself was changed).
patch available @ http://review.gluster.com/3147
CHANGE: http://review.gluster.com/3147 (glusterd: Disallow (re)-using bricks that were part of any volume) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.com/3279 (mgmt/glusterd: allow volume start force) merged in master by Vijay Bellur (vijay)
patches sent for both upstream (http://review.gluster.com/3280) and 3.3 branch (http://review.gluster.com/3313)
CHANGE: http://review.gluster.com/3280 (glusterd: Fixed glusterd_brick_create_path algo.) merged in master by Vijay Bellur (vijay)
CHANGE: http://review.gluster.com/3313 (glusterd: Fixed glusterd_brick_create_path algo.) merged in release-3.3 by Vijay Bellur (vijay)
gluster volume remove-brick vol replica 2 hyperspace:/mnt/sda10/export3
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
Remove Brick commit force successful
root@hyperspace:~# gluster volume add-brick vol replica 3 hyperspace:/mnt/sda10/export3
/mnt/sda10/export3 or a prefix of it is already part of a volume

Now remove-brick followed by adding the same brick back fails. Checked with glusterfs-3.3.0qa43.
With glusterfs 3.3.0 I now have a specific problem: I cannot destroy a volume and then recreate it. On a two-node cluster:

gluster volume create foo server1:/data/brickdir server2:/data/brickdir
gluster volume start foo
...
gluster volume stop foo
gluster volume delete foo
# gluster volume info says there are no volumes
# the two peers are in agreement over this
gluster volume create foo server1:/data/brickdir server2:/data/brickdir
# Error: "/data/brickdir or a prefix of it is already part of a volume"
rm -rf /data/brickdir/.glusterfs   ## doesn't help!
rm -rf /data/brickdir              ## fixes the problem, but this is drastic

Deleting and recreating a volume was allowed in glusterfs 3.2.5. Of course it's possible that the two bricks are out of sync at the point when the volume is recreated; but this is no different to any other split-brain scenario AFAICS. IMO there needs to be (at least) a clearly-documented recovery path for this situation, which doesn't involve deleting all your data. (Aside: the glusterfs 3.3 internals, and the semantics of "gfid" and "volume-id", are not known to me; are they documented anywhere?)
To be able to 'reuse' a brick, one needs to remove the "volume-id" and "gfid" extended attributes present on the brick path. These attributes mark the brick as being, or having been, part of a volume without having been 'cleaned' appropriately. The cleaning may be required because a reused brick contains not only data but also GlusterFS metadata/state from the previous volume it was part of; this can have undesirable effects when the brick is 'added' to a 'new' volume, especially one with the same name. I will see to it that this is documented, and that a recovery path which does not require deleting data is also provided in the documentation.
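A sketch of that recovery path (the brick path is illustrative; the real commands must run as root on the brick root, and only on bricks whose data you can afford to lose). The `clear_brick` helper and its `DRY_RUN` guard are assumptions for illustration, not an official tool; with `DRY_RUN=1` it only prints what it would do, so the command list can be reviewed before anything destructive runs:

```shell
# clear_brick: emit (DRY_RUN=1) or execute the cleanup that makes a brick
# reusable: drop the two GlusterFS marker xattrs and the .glusterfs directory.
# Hypothetical helper for illustration only.
clear_brick() {
  brick=$1
  for attr in trusted.glusterfs.volume-id trusted.gfid; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "setfattr -x $attr $brick"
    else
      setfattr -x "$attr" "$brick"
    fi
  done
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "rm -rf $brick/.glusterfs"
  else
    rm -rf "$brick/.glusterfs"
  fi
}

DRY_RUN=1
clear_brick /mnt/sda10/export3
```

With `DRY_RUN=1` this prints the three cleanup commands; unset, it removes only the GlusterFS markers and leaves the data files in place.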
*** Bug 829808 has been marked as a duplicate of this bug. ***
*** Bug 829675 has been marked as a duplicate of this bug. ***
This "feature" has SERIOUS implications for newcomers testing different combinations of bricks. It is IMPOSSIBLE to create a test volume, delete it, and then create a different test volume with the same bricks. This is just STUPID. The error message should be clearer and the recovery path well documented. It is simple enough to remove test directories if they are under existing mount points, but if they ARE the mount points then the actual mount points have to be removed and recreated to get past this!!!
Actually, I would suggest launching the recovery script as part of the "volume delete" operation, since that already has to flush the whole setup of the volume. Why do we have to keep those attributes on the mount point?
Can someone tell us *exactly* how to work around this in the meantime please? "attr -l /mountpoint" lists the gfid and volume-id attributes, but "attr -r gfid /mountpoint" says it does not exist and does not remove it!!!
This seems to work:

cd /mount/point
for i in `attr -lq .`; do setfattr -x trusted.$i .; done

M
(In reply to comment #16)
> This seems to work:
>
> cd /mount/point
> for i in `attr -lq .`; do setfattr -x trusted.$i .; done

That will delete xattrs needed by other packages as well. What you really need to clear is trusted.glusterfs.volume-id and trusted.gfid, plus you should delete the .glusterfs directory as well. I do this all the time during development, and haven't seen any failures attributable to leftover artifacts from the brick's previous incarnation. Then again, I'm a developer. *THIS PROCEDURE CAN LEAD TO DATA LOSS* because of GFID conflicts or inconsistencies in the AFR/DHT/stripe xattrs, and you'll be on your own in unsupported-land if that happens. You should only ever do it on bricks that contain test data, not on anything you're responsible for keeping.
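One way to keep the convenience of the loop from comment #16 while avoiding other packages' xattrs is to filter the attribute names down to GlusterFS-owned ones before removing them. A sketch (the `filter_gluster_attrs` helper and the sample name list are illustrative; `attr -lq` prints names without the `trusted.` prefix, as in comment #16):

```shell
# filter_gluster_attrs: keep only xattr names that GlusterFS owns
# (gfid, glusterfs.*, afr.*), so attributes set by other software survive.
filter_gluster_attrs() {
  grep -E '^(gfid$|glusterfs\.|afr\.)'
}

# Example name list, one per line, as `attr -lq` might print it (sample names):
names='gfid
glusterfs.volume-id
afr.mirror-client-0
some.other.package'

printf '%s\n' "$names" | filter_gluster_attrs
```

The removal step then becomes `for i in $(attr -lq . | filter_gluster_attrs); do setfattr -x trusted.$i .; done` on the brick root; the same data-loss caveats from comment #17 still apply.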
(In reply to comment #17)
> That will delete xattrs needed by other packages as well.

I realise that, but this comment was in the context of my first comment (#13), which relates to the re-use of empty bricks in a test environment.
http://review.gluster.com/3644 has been filed to get the extras/clear_xattrs.sh packaged in the RPM.
CHANGE: http://review.gluster.com/3644 (extras: add clear_xattrs.sh to the glusterfs-server sub-package) merged in master by Anand Avati (avati)
http://review.gluster.com/3659 is needed as well for this, currently clear_xattrs.sh is not included in 'make dist' and causes the building of the RPMs to fail.
CHANGE: http://review.gluster.com/3659 (extras: install clear_xattrs.sh) merged in master by Anand Avati (avati)
*** Bug 847778 has been marked as a duplicate of this bug. ***
release-3.3 changes: - http://review.gluster.org/3897 - http://review.gluster.org/3900
CHANGE: http://review.gluster.org/3897 (extras: install clear_xattrs.sh) merged in release-3.3 by Vijay Bellur (vbellur)
CHANGE: http://review.gluster.org/3900 (extras: add clear_xattrs.sh to the glusterfs-server sub-package) merged in release-3.3 by Vijay Bellur (vbellur)