Description of problem:
Self-heal fails on a symlink.

Version-Release number of selected component (if applicable):
3.3

How reproducible:

Steps to Reproduce:
Config used: 2 bricks (brick1/brick2), 1 client.
1. Shut down brick1.
2. Put some files in the root dir of the mountpoint on the client.
3. Bring brick1 up again.

Actual results:
Self-heal fails on /, caused by a symlink that wasn't changed/added while brick1 was offline.

Log (message repeated every 5 seconds):
[2012-06-12 12:06:16.066084] I [afr-common.c:1340:afr_launch_self_heal] 0-replicated-data-replicate-0: background entry self-heal triggered. path: /, reason: lookup detected pending operations
[2012-06-12 12:06:16.071270] W [client3_1-fops.c:2457:client3_1_link_cbk] 0-replicated-data-client-0: remote operation failed: File exists (00000000-0000-0000-0000-000000000000 -> /config)
[2012-06-12 12:06:16.071968] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-replicated-data-replicate-0: background entry self-heal failed on /

Expected results:
Successful self-heal on /.

Additional info:
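The steps above can be sketched as a command sequence. This is illustration only, not a runnable test: it needs a live 2-brick replica volume, and the volume name, mount path, and symlink target are assumptions (only the /config path appears in the log).

```shell
# Assumed names: volume "replicated-data", mounted at /mnt/data on the client.
# 1. Take brick1 down (a later comment clarifies kill -9 was used, simulating
#    a power failure rather than a clean "gluster volume stop").
kill -9 "$(pgrep -f 'glusterfsd.*brick1')"

# 2. Create some files, including a symlink, in the root of the mount.
cd /mnt/data
echo hello > newfile
ln -s /some/target config      # symlink created while brick1 is offline

# 3. Bring brick1 back online.
gluster volume start replicated-data force
```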
After compiling gluster with the patched file provided and further testing, self-heal completes on / of the volume. Testing in a subdir gives me this:

[2012-06-13 15:55:34.856780] W [client3_1-fops.c:1495:client3_1_inodelk_cbk] 0-replicated-data-client-0: remote operation failed: No such file or directory
[2012-06-13 15:55:34.857143] E [afr-self-heal-metadata.c:539:afr_sh_metadata_post_nonblocking_inodelk_cbk] 0-replicated-data-replicate-0: Non Blocking metadata inodelks failed for <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24>.
[2012-06-13 15:55:34.857182] E [afr-self-heal-metadata.c:541:afr_sh_metadata_post_nonblocking_inodelk_cbk] 0-replicated-data-replicate-0: Metadata self-heal failed for <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24>.
[2012-06-13 15:55:34.857681] W [client3_1-fops.c:418:client3_1_open_cbk] 0-replicated-data-client-0: remote operation failed: No such file or directory. Path: <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24> (00000000-0000-0000-0000-000000000000)
[2012-06-13 15:55:34.857718] E [afr-self-heal-data.c:1314:afr_sh_data_open_cbk] 0-replicated-data-replicate-0: open of <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24> failed on child replicated-data-client-0 (No such file or directory)
[2012-06-13 15:55:34.857740] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-replicated-data-replicate-0: background meta-data data entry self-heal failed on <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24>
[2012-06-13 15:55:34.863450] E [afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-replicated-data-replicate-0: path <gfid:bbd127aa-074e-4ea1-a1bf-9c842240a89c>/tt1 on subvolume replicated-data-client-0 => -1 (No such file or directory)

Steps to reproduce:
1. cd /<VOL>/tmp
2. echo testdata > tt1
3. Bring brick1 down.
4. mv tt1 tt2
5. echo testdata2 > tt1
6. rm tt2
7. Bring brick1 back online.

In other words, we've created a split-brain situation.

FYI: this subfolder did NOT contain a symlink.
Did you miss giving any step? With the steps you mentioned everything works fine for me.

Before restarting the brick:
-------------------------------------------------------------------------
[root@pranithk-laptop tmp]# ls -l /gfs/r2_?/tmp
/gfs/r2_0/tmp:
total 4
-rw-r--r--. 2 root root 0 Jun 13 22:01 tt1

/gfs/r2_1/tmp:
total 8
-rw-r--r--. 2 root root 10 Jun 13 22:03 tt1

[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

# file: gfs/r2_1/tmp
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000004
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp/tt1
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0xc598ee1b6810437d88919b641977e7af

# file: gfs/r2_1/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000010000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x6e2b800c18e44bf78d3f595f0085405c

As you can see above, the same file has different gfids.
---------------------------------------------------------------------
[root@pranithk-laptop tmp]# gluster volume start r2 force
Starting volume r2 has been successful
[root@pranithk-laptop tmp]# cat tt1
testdata2
[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp/tt1
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x6e2b800c18e44bf78d3f595f0085405c

# file: gfs/r2_1/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x6e2b800c18e44bf78d3f595f0085405c

[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp/
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp/
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

# file: gfs/r2_1/tmp/
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

As you can see above, everything is fine. Gfids match, self-heal is successful.
-------------------------------------------------------------------------------
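For anyone reading along, the trusted.afr values in the transcripts above can be decoded by hand. As I understand the AFR changelog format, each 12-byte value packs three big-endian 32-bit counters: pending data, metadata, and entry operations. A small sketch (the helper name is mine):

```shell
# Decode a trusted.afr changelog value into its three pending-operation
# counters: data, metadata, entry (big-endian 32-bit each).
decode_afr() {
    hex=${1#0x}                      # strip the 0x prefix
    data=$((16#${hex:0:8}))          # bytes 0-3: pending data ops
    metadata=$((16#${hex:8:8}))      # bytes 4-7: pending metadata ops
    entry=$((16#${hex:16:8}))        # bytes 8-11: pending entry ops
    echo "data=$data metadata=$metadata entry=$entry"
}

decode_afr 0x000000010000000000000000   # tt1 on r2_1: 1 pending data op
decode_afr 0x000000000000000000000004   # tmp dir on r2_1: 4 pending entry ops
```

This matches the transcript: the tmp directory on r2_1 accumulated entry operations (the renames/creates/unlinks) while brick0 was down, and tt1 had a pending data operation.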
2 things:
1. You're bringing the brick down with the gluster cmd. We are simulating a power failure (hard reset).
2. When doing the same steps as mentioned above everything heals fine, but it takes approx. 5 minutes for self-healing to start. Isn't that long? Can it start earlier? (server-to-server auto-heal without a client involved)
1) No, I brought down the brick with kill -9 of the brick process; I did not paste that. The gluster command I used was for restarting the brick.
2) The self-heal daemon checks for 'anything to heal' every 10 minutes, so 5 minutes is okay. We added a way of configuring this for a future release; the patch did not make it into 3.3.0.
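For later readers: in releases after 3.3.0 this interval is exposed as a volume option. The option name below is taken from those later releases (a sketch, not applicable to 3.3.0 itself); its default of 600 seconds matches the 10-minute check described above.

```shell
# Assumed post-3.3.0 option; lowers the self-heal daemon's crawl
# interval from the default 600 seconds to 120.
gluster volume set r2 cluster.heal-timeout 120
```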
After discussing things and some more testing, everything is looking good now. The next step is to compile a package with the second patch provided and create a fresh cluster.
The newly installed cluster runs without any problems now. Thanks!
*** Bug 857081 has been marked as a duplicate of this bug. ***