Bug 1032475 - AFR: split-brains are not resolved even after completely removing the entries from one of the replicas
Summary: AFR: split-brains are not resolved even after completely removing the entries from one of the replicas
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterfs
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: vsomyaju
QA Contact: Sudhir D
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-11-20 09:35 UTC by spandura
Modified: 2015-03-05 00:06 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-11-21 09:52:39 UTC
Target Upstream Version:



Description spandura 2013-11-20 09:35:10 UTC
Description of problem:
========================
In a 6 x 2 distribute-replicate volume, 3 sub-volumes went into a split-brain state. To resolve the split-brain, the directories and files were deleted from one replica of each of the 3 sub-volumes, and a "heal full" was started.

The directories and files were healed back to the replica, but the split-brains were not resolved.

Brick11's file (the copy later removed) before self-heal
===================================================

root@mia [Nov-20-2013- 9:10:23] >getfattr -d -e hex -m . /rhs/brick1/b11/testdir_fuse/M_file.4
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b11/testdir_fuse/M_file.4
trusted.afr.vol-dr-client-10=0x000000000000000000000000
trusted.afr.vol-dr-client-11=0x000000000000000300000000
trusted.gfid=0xfb39a65af8804d2c98e76436b34349d6
trusted.glusterfs.quota.a96737a2-a076-41fb-8f30-f13152667adc.contri=0x0000000000001000
trusted.name=0x74657374696e675f78617474725f73656c666865616c5f6f6e5f66696c6573
trusted.pgfid.a96737a2-a076-41fb-8f30-f13152667adc=0x00000003

root@mia [Nov-20-2013- 9:10:48] >
root@mia [Nov-20-2013- 9:11:07] >rm -rf /rhs/brick1/b*/*

root@mia [Nov-20-2013- 9:11:32] >getfattr -d -e hex -m . /rhs/brick1/b11/testdir_fuse/M_file.4
getfattr: /rhs/brick1/b11/testdir_fuse/M_file.4: No such file or directory


Brick12's file (the surviving copy) before self-heal
===================================================
root@wingo [Nov-20-2013- 9:10:23] >getfattr -d -e hex -m . /rhs/brick1/b12/testdir_fuse/M_file.4
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b12/testdir_fuse/M_file.4
trusted.afr.vol-dr-client-10=0x000000000000000400000000
trusted.afr.vol-dr-client-11=0x000000000000000000000000
trusted.gfid=0xfb39a65af8804d2c98e76436b34349d6
trusted.glusterfs.quota.a96737a2-a076-41fb-8f30-f13152667adc.contri=0x0000000000001000
trusted.pgfid.a96737a2-a076-41fb-8f30-f13152667adc=0x00000001

Brick11's file (previously removed) after self-heal
===================================================
root@mia [Nov-20-2013- 9:11:53] >getfattr -d -e hex -m . /rhs/brick1/b11/testdir_fuse/M_file.4
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b11/testdir_fuse/M_file.4
trusted.afr.vol-dr-client-10=0x000000000000000000000000
trusted.afr.vol-dr-client-11=0x000000000000000300000000
trusted.gfid=0xfb39a65af8804d2c98e76436b34349d6
trusted.glusterfs.quota.a96737a2-a076-41fb-8f30-f13152667adc.contri=0x0000000000001000
trusted.name=0x74657374696e675f78617474725f73656c666865616c5f6f6e5f66696c6573
trusted.pgfid.a96737a2-a076-41fb-8f30-f13152667adc=0x00000004


Brick12's file (the surviving copy) after self-heal
===================================================
root@wingo [Nov-20-2013- 9:11:50] >getfattr -d -e hex -m . /rhs/brick1/b12/testdir_fuse/M_file.4
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/b12/testdir_fuse/M_file.4
trusted.afr.vol-dr-client-10=0x000000000000000400000000
trusted.afr.vol-dr-client-11=0x000000000000000000000000
trusted.gfid=0xfb39a65af8804d2c98e76436b34349d6
trusted.glusterfs.quota.a96737a2-a076-41fb-8f30-f13152667adc.contri=0x0000000000001000
trusted.pgfid.a96737a2-a076-41fb-8f30-f13152667adc=0x00000001
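The trusted.afr values above can be read as three big-endian 32-bit counters of pending data, metadata, and entry operations, in that order (the standard AFR changelog layout; verify against your glusterfs version). A minimal bash sketch of the decoding (decode_afr is a hypothetical helper, not a gluster tool):

```shell
# Decode a trusted.afr.* hex value into its three pending-operation
# counters. Assumes the AFR changelog layout: 4 bytes data, then
# 4 bytes metadata, then 4 bytes entry, all big-endian.
decode_afr() {
  local v=${1#0x}
  echo "data=$((16#${v:0:8})) metadata=$((16#${v:8:8})) entry=$((16#${v:16:8}))"
}

decode_afr 0x000000000000000300000000
# data=0 metadata=3 entry=0
decode_afr 0x000000000000000400000000
# data=0 metadata=4 entry=0
```

Each brick's nonzero counter here is on the xattr naming the other replica (brick11 blames client-11, brick12 blames client-10), i.e. the mutual-accusation pattern that AFR reports as a metadata split-brain.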

Version-Release number of selected component (if applicable):
===============================================================
glusterfs 3.4.0.44rhs built on Nov 13 2013 08:03:02

How reproducible:
===================
Often

Steps to Reproduce:
===================
1. Create a distribute-replicate volume (6 x 2, across 4 nodes) and start the volume.

2. Create a fuse mount and an nfs mount from a client node.

3. From the fuse mount, create the testdir_fuse directory. From the nfs mount, create the testdir_nfs directory. cd into the respective directory from each mount point.

4. From the fuse mount, run "self_heal_all_file_types_script1.sh". From the nfs mount, run "self_heal_sanity_nfs_and_cifs_script1.sh".

5. While the scripts are running :

     a. Set a quota limit of 20GB on testdir_fuse and testdir_nfs.

     b. Bring down node2 and node4. 

6. Wait for the scripts to complete.

7. Once the scripts have executed successfully, bring node2 and node4 back up.

8. From testdir_fuse (on the fuse mount) and testdir_nfs (on the nfs mount), execute "du -s". (This self-healed all of the testdir_nfs contents on node2 and node4; the self-heal of the testdir_fuse contents was still pending.)

9. Once "du -s" completes successfully, from the fuse mount run "self_heal_all_file_types_script2.sh", and from the nfs mount run "self_heal_sanity_nfs_and_cifs_script2.sh".

10. Bring down node2 and node3. 

11. Once the scripts have executed successfully, bring node2 and node3 back up.

12. Node3 and node4 are now in a split-brain state.

13. Start the heal with "gluster volume heal <volume_name>".

14. Since node3 and node4 are in split-brain, remove all the entries from node3's bricks to resolve the split-brain ("rm -rf <brick_path>/*").

15. Start a full self-heal with "gluster volume heal <volume_name> full".
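Worth noting for step 14: every file on a brick also has a GFID hard link under the brick's hidden .glusterfs directory, and a glob such as "rm -rf <brick_path>/*" does not match dotfiles, so those links survive the removal. That may explain why self-heal can re-create the files with their old split-brain xattrs. A hedged sketch (gfid_path is a hypothetical helper) of mapping a trusted.gfid value, as printed by getfattr -e hex, to the link path that would also need attention:

```shell
# Map a trusted.gfid hex value to the corresponding hard-link path
# under the brick's .glusterfs directory, laid out as
# .glusterfs/<first 2 hex chars>/<next 2 hex chars>/<gfid as dashed uuid>
gfid_path() {
  local g=${1#0x}
  echo ".glusterfs/${g:0:2}/${g:2:2}/${g:0:8}-${g:8:4}-${g:12:4}-${g:16:4}-${g:20:12}"
}

# The gfid of M_file.4 from the dumps above:
gfid_path 0xfb39a65af8804d2c98e76436b34349d6
# .glusterfs/fb/39/fb39a65a-f880-4d2c-98e7-6436b34349d6
```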

Actual results:
================
Even after resolving the split-brains by completely removing the affected directories and files from the bricks of one replica, the self-healed files are still in a split-brain state.

Expected results:
==================
"heal full" after resolving the split-brain should resolve the split-brain completely. 

Additional info:
================

root@dj [Nov-20-2013- 9:10:48] >gluster v info
 
Volume Name: vol-dr
Type: Distributed-Replicate
Volume ID: 95904f4e-adb7-4ffb-bd7c-a41f7e691f39
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: dj:/rhs/brick1/b1
Brick2: fan:/rhs/brick1/b2
Brick3: dj:/rhs/brick1/b3
Brick4: fan:/rhs/brick1/b4
Brick5: dj:/rhs/brick1/b5
Brick6: fan:/rhs/brick1/b6
Brick7: mia:/rhs/brick1/b7
Brick8: wingo:/rhs/brick1/b8
Brick9: mia:/rhs/brick1/b9
Brick10: wingo:/rhs/brick1/b10
Brick11: mia:/rhs/brick1/b11
Brick12: wingo:/rhs/brick1/b12
Options Reconfigured:
features.quota: on
diagnostics.latency-measurement: on
diagnostics.count-fop-hits: on
cluster.self-heal-daemon: on
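For reference, a 6 x 2 volume with this brick layout could have been created as follows (a sketch reconstructed from the volume info above, not the reporter's actual commands; with "replica 2", consecutive bricks form the replica pairs, e.g. b11 with b12):

```shell
# Hypothetical re-creation of vol-dr from the brick list above.
gluster volume create vol-dr replica 2 \
  dj:/rhs/brick1/b1   fan:/rhs/brick1/b2 \
  dj:/rhs/brick1/b3   fan:/rhs/brick1/b4 \
  dj:/rhs/brick1/b5   fan:/rhs/brick1/b6 \
  mia:/rhs/brick1/b7  wingo:/rhs/brick1/b8 \
  mia:/rhs/brick1/b9  wingo:/rhs/brick1/b10 \
  mia:/rhs/brick1/b11 wingo:/rhs/brick1/b12
gluster volume start vol-dr
gluster volume quota vol-dr enable
```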

