Bug 831151 - Self heal fails on directories with symlinks
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: Pranith Kumar K
QA Contact: Rahul Hinduja
Duplicates: 857081
Depends On:
Blocks:
 
Reported: 2012-06-12 06:08 EDT by Robert Hassing
Modified: 2013-02-25 13:09 EST
CC: 6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-10-31 09:45:00 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Robert Hassing 2012-06-12 06:08:27 EDT
Description of problem: Self-heal fails on a symlink


Version-Release number of selected component (if applicable): 3.3


How reproducible:


Steps to Reproduce:

Used config: 2 bricks (brick1/brick2), 1 client

1. Shut down brick1
2. Put some files in the root directory of the mount point on the client
3. Bring brick1 up again
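For reference, a rough script of the same scenario follows; the volume name (repvol), the mount path /mnt/repvol, and the pkill pattern used to take brick1 down are illustrative assumptions, not details from the original setup:

# A symlink named "config" (matching the /config path in the log below)
# already exists at the volume root on both bricks; its target is arbitrary
# for this sketch.
cd /mnt/repvol
ln -s /somewhere config

# Step 1: simulate the failure of brick1 by killing its brick process.
pkill -9 -f 'glusterfsd.*brick1'

# Step 2: add files to the root of the mount while brick1 is down;
# the symlink itself is not touched.
echo data1 > newfile1
echo data2 > newfile2

# Step 3: bring brick1 back up; the next lookup on / triggers the
# background entry self-heal, which then fails on the untouched symlink.
gluster volume start repvol force
ls /mnt/repvol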
  
Actual results:

Self-heal fails on /, caused by a symlink that wasn't changed or added while brick1 was offline.

Log: (message repeated every 5 seconds)
[2012-06-12 12:06:16.066084] I [afr-common.c:1340:afr_launch_self_heal] 0-replicated-data-replicate-0: background  entry self-heal triggered. path: /, reason: lookup detected pending operations
[2012-06-12 12:06:16.071270] W [client3_1-fops.c:2457:client3_1_link_cbk] 0-replicated-data-client-0: remote operation failed: File exists (00000000-0000-0000-0000-000000000000 -> /config)
[2012-06-12 12:06:16.071968] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-replicated-data-replicate-0: background  entry self-heal failed on /
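When the entry self-heal of / keeps failing like this, the pending-operation flags can be checked directly on the brick root directories; a minimal check, with the brick export paths (/export/brick1, /export/brick2) as assumptions:

# Non-zero trusted.afr.<volume>-client-* values on a brick's root directory
# indicate pending entry operations that the background self-heal keeps retrying.
getfattr -d -m . -e hex /export/brick1 /export/brick2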



Expected results:

Successful self-heal on /




Additional info:
Comment 1 Robert Hassing 2012-06-13 10:14:22 EDT
After compiling gluster with the patched file provided and doing further testing, self-heal completes on / of the volume.

Testing in a subdirectory gives me this:

[2012-06-13 15:55:34.856780] W [client3_1-fops.c:1495:client3_1_inodelk_cbk] 0-replicated-data-client-0: remote operation failed: No such file or directory
[2012-06-13 15:55:34.857143] E [afr-self-heal-metadata.c:539:afr_sh_metadata_post_nonblocking_inodelk_cbk] 0-replicated-data-replicate-0: Non Blocking metadata inodelks failed for <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24>.
[2012-06-13 15:55:34.857182] E [afr-self-heal-metadata.c:541:afr_sh_metadata_post_nonblocking_inodelk_cbk] 0-replicated-data-replicate-0: Metadata self-heal failed for <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24>.
[2012-06-13 15:55:34.857681] W [client3_1-fops.c:418:client3_1_open_cbk] 0-replicated-data-client-0: remote operation failed: No such file or directory. Path: <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24> (00000000-0000-0000-0000-000000000000)
[2012-06-13 15:55:34.857718] E [afr-self-heal-data.c:1314:afr_sh_data_open_cbk] 0-replicated-data-replicate-0: open of <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24> failed on child replicated-data-client-0 (No such file or directory)
[2012-06-13 15:55:34.857740] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-replicated-data-replicate-0: background  meta-data data entry self-heal failed on <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24>
[2012-06-13 15:55:34.863450] E [afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-replicated-data-replicate-0: path <gfid:bbd127aa-074e-4ea1-a1bf-9c842240a89c>/tt1 on subvolume replicated-data-client-0 => -1 (No such file or directory)

Steps to reproduce:

1. cd /<VOL>/tmp
2. echo testdata > tt1
3. bring brick1 down
4. mv tt1 tt2
5. echo testdata2 > tt1
6. rm tt2
7. bring brick1 back online

In other words, we've created a split-brain situation.
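The sequence above as a runnable sketch; the mount path (/mnt/$VOL) and the pkill pattern used to take brick1 down are assumptions:

VOL=myvol
cd /mnt/$VOL/tmp                   # subdirectory on the FUSE mount
echo testdata > tt1

pkill -9 -f "glusterfsd.*brick1"   # take brick1 down (simulated power loss)

mv tt1 tt2                         # rename while brick1 is offline
echo testdata2 > tt1               # recreate tt1; it gets a new gfid
rm tt2

gluster volume start $VOL force    # bring brick1 back online
stat tt1                           # lookup triggers the background self-heal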


FYI: this subfolder did NOT contain a symlink
Comment 2 Pranith Kumar K 2012-06-13 12:37:33 EDT
Did you leave out any step?
With the steps you mentioned, everything works fine for me.

Before restarting the brick:
-------------------------------------------------------------------------
[root@pranithk-laptop tmp]# ls -l /gfs/r2_?/tmp
/gfs/r2_0/tmp:
total 4
-rw-r--r--. 2 root root 0 Jun 13 22:01 tt1

/gfs/r2_1/tmp:
total 8
-rw-r--r--. 2 root root 10 Jun 13 22:03 tt1
[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

# file: gfs/r2_1/tmp
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000004
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp/tt1
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0xc598ee1b6810437d88919b641977e7af

# file: gfs/r2_1/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000010000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x6e2b800c18e44bf78d3f595f0085405c

As you can see above, the same file has different gfids.
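A quicker way to spot this kind of mismatch is to compare just the gfid xattr of the same path on both bricks (brick paths as in this test setup):

# The two trusted.gfid values must be identical for a healthy file.
getfattr -n trusted.gfid -e hex /gfs/r2_?/tmp/tt1

# For reference, each trusted.afr.<volume>-client-N value is three big-endian
# 32-bit counters: data, metadata and entry operations pending against that
# brick. E.g. 0x000000010000000000000000 above means one pending data
# operation blamed on r2-client-0.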
---------------------------------------------------------------------
[root@pranithk-laptop tmp]# gluster volume start r2 force
Starting volume r2 has been successful
[root@pranithk-laptop tmp]# cat tt1
testdata2
[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp/tt1
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x6e2b800c18e44bf78d3f595f0085405c

# file: gfs/r2_1/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x6e2b800c18e44bf78d3f595f0085405c

[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp/
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp/
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

# file: gfs/r2_1/tmp/
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

As you can see above, everything is fine: gfids match and self-heal is successful.
-------------------------------------------------------------------------------
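Since 3.3 the CLI can also report and trigger healing directly, which avoids reading brick xattrs by hand in cases like this:

# List entries that still need healing on volume r2.
gluster volume heal r2 info

# Trigger an index self-heal immediately instead of waiting for the
# self-heal daemon's periodic crawl.
gluster volume heal r2

# Or force a full crawl of the bricks.
gluster volume heal r2 full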
Comment 3 Robert Hassing 2012-06-14 03:44:15 EDT
2 things:

1. You're bringing the brick down with the gluster command. We are simulating a power failure (hard reset).
2. When doing the same steps as mentioned above, everything heals fine, but it takes approximately 5 minutes for self-healing to start. Isn't that long? Can it start earlier? (server-to-server auto-heal without a client involved)
Comment 4 Pranith Kumar K 2012-06-14 08:31:32 EDT
1) No, I brought the brick down with kill -9 on the brick process; I just did not paste that. The gluster command I used is for restarting the brick.

2) The self-heal daemon checks whether there is anything to heal every 10 minutes, so 5 minutes is expected. We added a way to configure this interval for a future release; the patch did not make it into 3.3.0.
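For completeness: in releases after 3.3.0 the crawl interval became a volume option; a sketch assuming it is exposed as cluster.heal-timeout in seconds (not available in the 3.3.0 build discussed here):

# Shorten the self-heal daemon's crawl interval from the default 600 seconds
# to 2 minutes; option name and availability are assumptions for post-3.3.0.
gluster volume set r2 cluster.heal-timeout 120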
Comment 5 Robert Hassing 2012-06-14 10:22:59 EDT
After discussing things and some more testing, everything is looking good now.

The next step is to compile a package with the second patch provided and create a fresh cluster.
Comment 6 Robert Hassing 2012-07-24 09:24:42 EDT
The newly installed cluster runs without any problems now. Thanks!
Comment 7 Lukas Bezdicka 2012-09-24 09:10:40 EDT
*** Bug 857081 has been marked as a duplicate of this bug. ***
