Description of problem:
Self-heal fails on a symlink.

Version-Release number of selected component (if applicable):
3.3

How reproducible:

Steps to Reproduce:
Config used: 2 bricks (brick1/brick2), 1 client.
1. Shut down brick1.
2. Put some files in the root dir of the mountpoint on the client.
3. Bring brick1 up again.

Actual results:
Self-heal fails on /, caused by a symlink that wasn't changed/added while brick1 was offline.

Log (message repeated every 5 seconds):
[2012-06-12 12:06:16.066084] I [afr-common.c:1340:afr_launch_self_heal] 0-replicated-data-replicate-0: background entry self-heal triggered. path: /, reason: lookup detected pending operations
[2012-06-12 12:06:16.071270] W [client3_1-fops.c:2457:client3_1_link_cbk] 0-replicated-data-client-0: remote operation failed: File exists (00000000-0000-0000-0000-000000000000 -> /config)
[2012-06-12 12:06:16.071968] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-replicated-data-replicate-0: background entry self-heal failed on /

Expected results:
Successful self-heal on /.

Additional info:
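The steps above can be sketched as a command sequence. This is illustration only, not a runnable test: it needs a live 2-brick replica volume, and the volume name, mount path, and symlink target are assumptions (only the /config path appears in the log).

```shell
# Assumed names: volume "replicated-data", mounted at /mnt/data on the client.
# 1. Take brick1 down (a later comment clarifies kill -9 was used, simulating
#    a power failure rather than a clean "gluster volume stop").
kill -9 "$(pgrep -f 'glusterfsd.*brick1')"

# 2. Create some files, including a symlink, in the root of the mount.
cd /mnt/data
echo hello > newfile
ln -s /some/target config      # symlink created while brick1 is offline

# 3. Bring brick1 back online.
gluster volume start replicated-data force
```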
After compiling gluster with the patched file provided and further testing, self-heal completes on / of the volume. Testing in a subdir gives me this:

[2012-06-13 15:55:34.856780] W [client3_1-fops.c:1495:client3_1_inodelk_cbk] 0-replicated-data-client-0: remote operation failed: No such file or directory
[2012-06-13 15:55:34.857143] E [afr-self-heal-metadata.c:539:afr_sh_metadata_post_nonblocking_inodelk_cbk] 0-replicated-data-replicate-0: Non Blocking metadata inodelks failed for <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24>.
[2012-06-13 15:55:34.857182] E [afr-self-heal-metadata.c:541:afr_sh_metadata_post_nonblocking_inodelk_cbk] 0-replicated-data-replicate-0: Metadata self-heal failed for <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24>.
[2012-06-13 15:55:34.857681] W [client3_1-fops.c:418:client3_1_open_cbk] 0-replicated-data-client-0: remote operation failed: No such file or directory. Path: <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24> (00000000-0000-0000-0000-000000000000)
[2012-06-13 15:55:34.857718] E [afr-self-heal-data.c:1314:afr_sh_data_open_cbk] 0-replicated-data-replicate-0: open of <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24> failed on child replicated-data-client-0 (No such file or directory)
[2012-06-13 15:55:34.857740] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-replicated-data-replicate-0: background meta-data data entry self-heal failed on <gfid:5c92a1ae-bd2d-463c-95db-bf527f649b24>
[2012-06-13 15:55:34.863450] E [afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-replicated-data-replicate-0: path <gfid:bbd127aa-074e-4ea1-a1bf-9c842240a89c>/tt1 on subvolume replicated-data-client-0 => -1 (No such file or directory)

Steps to reproduce:
1. cd /<VOL>/tmp
2. echo testdata > tt1
3. Bring brick1 down.
4. mv tt1 tt2
5. echo testdata2 > tt1
6. rm tt2
7. Bring brick1 back online.

In other words, we've created a split-brain situation.

FYI: this subfolder did NOT contain a symlink.
Did you miss giving any step? With the steps you mentioned everything works fine for me.

Before restarting the brick:
-------------------------------------------------------------------------
[root@pranithk-laptop tmp]# ls -l /gfs/r2_?/tmp
/gfs/r2_0/tmp:
total 4
-rw-r--r--. 2 root root 0 Jun 13 22:01 tt1

/gfs/r2_1/tmp:
total 8
-rw-r--r--. 2 root root 10 Jun 13 22:03 tt1

[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

# file: gfs/r2_1/tmp
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000004
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp/tt1
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.gfid=0xc598ee1b6810437d88919b641977e7af

# file: gfs/r2_1/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000010000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x6e2b800c18e44bf78d3f595f0085405c

As you can see above, the same file has different gfids.
---------------------------------------------------------------------
[root@pranithk-laptop tmp]# gluster volume start r2 force
Starting volume r2 has been successful
[root@pranithk-laptop tmp]# cat tt1
testdata2
[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp/tt1
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x6e2b800c18e44bf78d3f595f0085405c

# file: gfs/r2_1/tmp/tt1
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x6e2b800c18e44bf78d3f595f0085405c

[root@pranithk-laptop tmp]# getfattr -d -m . -e hex /gfs/r2_?/tmp/
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/tmp/
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

# file: gfs/r2_1/tmp/
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a64656661756c745f743a733000
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x5fe0a753572b41faa97c8b05753f0126

As you can see above, everything is fine. Gfids match, self-heal is successful.
-------------------------------------------------------------------------------
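For anyone reading along, the trusted.afr values in the transcripts above can be decoded by hand. As I understand the AFR changelog format, each 12-byte value packs three big-endian 32-bit counters: pending data, metadata, and entry operations. A small sketch (the helper name is mine):

```shell
# Decode a trusted.afr changelog value into its three pending-operation
# counters: data, metadata, entry (big-endian 32-bit each).
decode_afr() {
    hex=${1#0x}                      # strip the 0x prefix
    data=$((16#${hex:0:8}))          # bytes 0-3: pending data ops
    metadata=$((16#${hex:8:8}))      # bytes 4-7: pending metadata ops
    entry=$((16#${hex:16:8}))        # bytes 8-11: pending entry ops
    echo "data=$data metadata=$metadata entry=$entry"
}

decode_afr 0x000000010000000000000000   # tt1 on r2_1: 1 pending data op
decode_afr 0x000000000000000000000004   # tmp dir on r2_1: 4 pending entry ops
```

This matches the transcript: the tmp directory on r2_1 accumulated entry operations (the renames/creates/unlinks) while brick0 was down, and tt1 had a pending data operation.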
2 things:
1. You're bringing the brick down with the gluster cmd. We are simulating a power failure (hard reset).
2. When doing the same steps as mentioned above everything heals fine, but it takes approx. 5 minutes for self-healing to start. Isn't that long? Can it start earlier? (server-to-server auto-heal without a client involved)
1) No, I brought down the brick with kill -9 of the brick process; I did not paste that. The gluster command I used was for restarting the brick.
2) The self-heal daemon checks for 'anything to heal' every 10 minutes, so 5 minutes is okay. We added a way of configuring this for a future release; the patch did not make it into 3.3.0.
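For later readers: in releases after 3.3.0 this interval is exposed as a volume option. The option name below is taken from those later releases (a sketch, not applicable to 3.3.0 itself); its default of 600 seconds matches the 10-minute check described above.

```shell
# Assumed post-3.3.0 option; lowers the self-heal daemon's crawl
# interval from the default 600 seconds to 120.
gluster volume set r2 cluster.heal-timeout 120
```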
After discussing things and some more testing, everything is looking good now. The next step is to compile a package with the second patch provided and create a fresh cluster.
The newly installed cluster runs without any problems now. Thanks!
*** Bug 857081 has been marked as a duplicate of this bug. ***