978335 – afr: Self-Heal of directories are unsuccessful with error : "Non Blocking entrylks failed"

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	replicate
Sub Component:
Version:	2.1
Hardware:	x86_64
OS:	Unspecified
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Ravishankar N
QA Contact:	Rahul Hinduja
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-06-26 11:52 UTC by Rahul Hinduja
Modified:	2016-09-17 12:11 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-12-03 17:12:25 UTC
Embargoed:
Dependent Products:

Attachments
Script1.sh (3.79 KB, application/x-shellscript) 2013-07-16 10:01 UTC, spandura
Script2.sh (4.29 KB, application/x-shellscript) 2013-07-16 10:02 UTC, spandura
self_heal_all_file_types_script1.sh (3.79 KB, application/x-shellscript) 2013-07-17 07:21 UTC, spandura
self_heal_all_file_types_script2.sh (4.29 KB, application/x-shellscript) 2013-07-17 07:21 UTC, spandura

Description Rahul Hinduja 2013-06-26 11:52:35 UTC

Description of problem:
=======================

The following error messages are reported in the glustershd.log file:

[2013-06-26 21:01:32.640118] E [afr-self-heal-entry.c:2296:afr_sh_post_nonblocking_entry_cbk] 0-vol-test-replicate-0: Non Blocking entrylks failed for <gfid:8e764682-8e9e-4c1a-abf8-2beaed863ade>.
[2013-06-26 21:01:32.640841] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-vol-test-client-1: remote operation failed: No such file or directory
[2013-06-26 21:01:32.641152] W [client-rpc-fops.c:1631:client3_3_entrylk_cbk] 0-vol-test-client-1: remote operation failed: No such file or directory
[2013-06-26 21:01:32.641241] E [afr-self-heal-entry.c:2296:afr_sh_post_nonblocking_entry_cbk] 0-vol-test-replicate-0: Non Blocking entrylks failed for <gfid:d138b565-409b-47df-96c5-5e79d636df32>.
[2013-06-26 21:01:32.642012] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-vol-test-client-1: remote operation failed: No such file or directory
[2013-06-26 21:01:32.642337] W [client-rpc-fops.c:1631:client3_3_entrylk_cbk] 0-vol-test-client-1: remote operation failed: No such file or directory
[2013-06-26 21:01:32.642425] E [afr-self-heal-entry.c:2296:afr_sh_post_nonblocking_entry_cbk] 0-vol-test-replicate-0: Non Blocking entrylks failed for <gfid:312a591a-9407-4f34-ab8d-fdc2774537c3>.
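
For triage, it can help to pull the affected gfids out of glustershd.log. The following is a minimal sketch, assuming the log line format shown above; the function name `failed_entrylk_gfids` and the regular expression are illustrative and not part of GlusterFS.

```python
import re

# Matches the "Non Blocking entrylks failed" error line as quoted in this
# report and captures the timestamp, translator name and gfid.
ENTRYLK_FAILED = re.compile(
    r"^\[(?P<ts>[^\]]+)\] E \[[^\]]*afr_sh_post_nonblocking_entry_cbk\] "
    r"(?P<xlator>\S+): Non Blocking entrylks failed for "
    r"<gfid:(?P<gfid>[0-9a-f-]{36})>"
)

def failed_entrylk_gfids(lines):
    """Return the gfids that hit the entrylk failure, in first-seen order."""
    seen = []
    for line in lines:
        match = ENTRYLK_FAILED.match(line)
        if match and match.group("gfid") not in seen:
            seen.append(match.group("gfid"))
    return seen

# Sample lines copied from the description above.
sample = [
    "[2013-06-26 21:01:32.640118] E [afr-self-heal-entry.c:2296:afr_sh_post_nonblocking_entry_cbk] 0-vol-test-replicate-0: Non Blocking entrylks failed for <gfid:8e764682-8e9e-4c1a-abf8-2beaed863ade>.",
    "[2013-06-26 21:01:32.640841] W [client-rpc-fops.c:1529:client3_3_inodelk_cbk] 0-vol-test-client-1: remote operation failed: No such file or directory",
    "[2013-06-26 21:01:32.641241] E [afr-self-heal-entry.c:2296:afr_sh_post_nonblocking_entry_cbk] 0-vol-test-replicate-0: Non Blocking entrylks failed for <gfid:d138b565-409b-47df-96c5-5e79d636df32>.",
]

gfids = failed_entrylk_gfids(sample)
print(gfids)
```

The warning lines from client3_3_inodelk_cbk and client3_3_entrylk_cbk carry no gfid, so the sketch ignores them.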


Version-Release number of selected component (if applicable):


Steps carried out:
==================

1. Created and started a 1*2 replicate setup.
2. Mounted the volume on a client (FUSE).
3. Brought down brick2 (kill -9).
4. Created a large number of files and directories from the FUSE mount using:

[root@darrel vol-test]# for i in {1..10}; do cp -rf /etc etc.$i ; done
[root@darrel vol-test]# ls -lR | wc
  26482  196523 1277009
[root@darrel vol-test]# 

5. Brought brick2 back up (gluster volume start <vol-name> force).
6. These errors were observed in glustershd.log

Actual results:
================

[2013-06-26 21:01:32.642425] E [afr-self-heal-entry.c:2296:afr_sh_post_nonblocking_entry_cbk] 0-vol-test-replicate-0: Non Blocking entrylks failed for <gfid:312a591a-9407-4f34-ab8d-fdc2774537c3>.


Expected results:
=================

In this common scenario, where a brick goes down and comes back up and the self-heal daemon (shd) then heals it, no errors should be reported.

Comment 5 raghav 2013-07-01 06:58:50 UTC

Discussed this with Rahul and Pranith. The following is a summary with action items:

1) Rahul has seen cases where an entry just sits in the xattrop directory and does not get healed even after a couple of hours. A lookup on the original file then causes the heal to kick in. This is on the 3.4.0.12 branch of RHS. I was not able to reproduce this issue, though.
So he will try to reproduce it, collect the self-heal state dumps, and turn on a higher logging level for self-heal for further debugging.

2) The present way of self-heal is quite nondeterministic, in the sense that gfids are picked up from the xattrop directory in FIFO fashion. Dependencies between entries are not taken into account, which is why multiple crawls are required. This makes it hard to estimate the time taken for self-heal and to report on it. We need a mechanism to build structure among the entries to be healed. This will be taken up as a feature extension to self-heal, and the discussion will move to the gluster-devel mailing list.
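
To illustrate point 2, here is a toy model of why an order that ignores dependencies needs several crawls. It is only a sketch under stated assumptions, not AFR's actual algorithm: it assumes an entry can be healed only once its parent directory has been healed, and the names `crawl`, `crawls_needed` and `depth` are hypothetical.

```python
def crawl(pending, healed, parent_of):
    """One pass over the index in the given order; returns what is left."""
    remaining = []
    for gfid in pending:
        parent = parent_of[gfid]
        if parent is None or parent in healed:
            healed.add(gfid)
        else:
            # Parent not healed yet: leave the entry for a later crawl.
            remaining.append(gfid)
    return remaining

def crawls_needed(order, parent_of):
    """Count the passes needed to drain the index when visited in `order`."""
    pending, healed, crawls = list(order), set(), 0
    while pending:
        before = len(pending)
        pending = crawl(pending, healed, parent_of)
        crawls += 1
        if len(pending) == before:
            raise RuntimeError("no progress: a dependency can never be met")
    return crawls

def depth(gfid, parent_of):
    """Number of ancestors of an entry inside the set being healed."""
    d = 0
    while parent_of[gfid] is not None:
        gfid = parent_of[gfid]
        d += 1
    return d

# Directory names stand in for gfids; each maps to its parent.
parent_of = {"top_dir": None, "test_dir": "top_dir", "new_dir": "test_dir"}

worst_case_fifo = ["new_dir", "test_dir", "top_dir"]
parent_first = sorted(parent_of, key=lambda g: depth(g, parent_of))

fifo_crawls = crawls_needed(worst_case_fifo, parent_of)
ordered_crawls = crawls_needed(parent_first, parent_of)
print(fifo_crawls, ordered_crawls)
```

In this model the worst-case arrival order needs one crawl per directory level, while a parent-first order drains the index in a single crawl.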

Comment 7 spandura 2013-07-16 10:01:36 UTC

Created attachment 774161 [details]
Script1.sh

Comment 8 spandura 2013-07-16 10:02:29 UTC

Created attachment 774162 [details]
Script2.sh

Comment 10 spandura 2013-07-17 07:21:23 UTC

Created attachment 774652 [details]
self_heal_all_file_types_script1.sh

Comment 11 spandura 2013-07-17 07:21:55 UTC

Created attachment 774653 [details]
self_heal_all_file_types_script2.sh

Comment 12 raghav 2013-07-20 05:17:09 UTC

It looks like AFR has a problem in removing the xattrop entry when any of the directories does not have the dht related xattr key-value; when heal happens 2 things are seen which I have been able to reproduce locally:
1) the dht key-value does not get restored
2) the index gfid file is not removed from the indices directory.
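
A quick way to check for leftover index entries is to list the gfid-named files in the index directory of a brick. The sketch below assumes the index lives at <brick>/.glusterfs/indices/xattrop and that pending entries are named by gfid; the helper `pending_index_gfids` and the temporary fake brick are illustrative only.

```python
import os
import re
import tempfile

# Canonical textual form of a gfid (lowercase UUID).
GFID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")

def pending_index_gfids(brick_root):
    """List gfid-named entries under the brick's xattrop index directory."""
    index_dir = os.path.join(brick_root, ".glusterfs", "indices", "xattrop")
    return sorted(name for name in os.listdir(index_dir) if GFID_RE.match(name))

# Demonstration against a throwaway directory laid out like a brick.
with tempfile.TemporaryDirectory() as brick:
    index_dir = os.path.join(brick, ".glusterfs", "indices", "xattrop")
    os.makedirs(index_dir)
    # One stale gfid entry plus a non-gfid file that must be ignored.
    for name in ("312a591a-9407-4f34-ab8d-fdc2774537c3", "placeholder-base-file"):
        open(os.path.join(index_dir, name), "w").close()
    leftover = pending_index_gfids(brick)

print(leftover)
```

Running this against a real brick after a heal would show whether any gfid is still queued in the index.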

Triaging the issue.

This issue can be bypassed by doing a sync before powering down the machines in the test setup, so that everything gets written to disk. If that workaround is acceptable, do we still need to make this a blocker for Big Bend?

Comment 13 raghav 2013-07-22 07:27:21 UTC

One more thing: this entry in the xattrop directory is not malignant. It will be cleared on the next heal. Also, the heal on the directory has happened; only this entry remains. We need to take this into account as well before deciding whether this is a blocker.

Comment 14 raghav 2013-07-24 12:30:41 UTC

I am able to reproduce the exact issue as quoted by spandura.

These are the steps:
1) Have a 2*1 replicate cluster
2) create a directory "top_dir"
3) create a new directory under "top_dir", say "test_dir".
4) bring down the brick process on one of the bricks
5) remove the soft link created for test_dir (under the .glusterfs directory) on the backend directory of the brick whose process is down. (this is what seems to be happening when a brick volume is shut down improperly)
6) create a new directory under test_dir from the client
7) bring the brick process up

You will see that on the brick which did not go down, the xattrop directory has entries that the self-heal daemon never heals. The reason is that self-heal of directories requires the .glusterfs soft link to be present on all the bricks; otherwise the self-heal daemon fails. Since no entry heal takes place here, the gfid never gets healed, hence this issue.
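
To spot this condition on a setup, one can check whether the directory's gfid handle is present as a symlink on every brick. This is a rough sketch, assuming the backend handle layout <brick>/.glusterfs/<first two hex chars>/<next two>/<gfid>; the helper names, the fake bricks and the symlink target are placeholders for illustration.

```python
import os
import tempfile

def gfid_handle_path(brick_root, gfid):
    """Assumed backend handle path: <brick>/.glusterfs/<aa>/<bb>/<gfid>."""
    return os.path.join(brick_root, ".glusterfs", gfid[0:2], gfid[2:4], gfid)

def bricks_missing_dir_handle(brick_roots, gfid):
    """Return the bricks on which the directory's handle symlink is absent."""
    return [
        brick for brick in brick_roots
        if not os.path.islink(gfid_handle_path(brick, gfid))
    ]

gfid = "8e764682-8e9e-4c1a-abf8-2beaed863ade"

# Two throwaway directories stand in for the bricks of the replica pair.
with tempfile.TemporaryDirectory() as tmp:
    brick1 = os.path.join(tmp, "brick1")
    brick2 = os.path.join(tmp, "brick2")
    for brick in (brick1, brick2):
        os.makedirs(os.path.join(brick, ".glusterfs", gfid[0:2], gfid[2:4]))
    # Only brick1 keeps the soft link, mimicking step 5 above on brick2.
    os.symlink("placeholder-target", gfid_handle_path(brick1, gfid))
    missing = [os.path.basename(b) for b in bricks_missing_dir_handle([brick1, brick2], gfid)]

print(missing)
```

A brick reported here is one on which, per the analysis above, the self-heal daemon would fail for that directory.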

A fix for this is under discussion, as this is a design-related bug.

Comment 15 Scott Haines 2013-09-23 23:34:54 UTC

Targeting for 2.1.z U2 (Corbett) release.

Comment 21 Vivek Agarwal 2015-12-03 17:12:25 UTC

Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release for which you requested a review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.
