+++ This bug was initially created as a clone of Bug #1055489 +++

Description of problem:
=======================
The following info message is seen excessively in the glfsheal log file.

[2014-01-20 10:48:48.153464] I [afr-self-heal-common.c:2098:afr_sh_post_nb_entrylk_missing_entry_sh_cbk] 0-vol-replicate-0: Non blocking entrylks failed.

This INFO message is of no use. It does not even indicate which directory the "Non blocking entrylks failed" message refers to.

Number of messages in the log file: 4169241, and the log file is still growing with this message.

root@rhs-client12 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed" /var/log/glusterfs/glfsheal-vol.log | wc -l
4169241
root@rhs-client12 [Jan-20-2014-11:28:33] >

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.4.0.57rhs built on Jan 13 2014 06:59:05

How reproducible:


Steps to Reproduce:
======================
1. Create a 2 x 3 distribute-replicate volume (3 storage nodes with 2 bricks on each node). Start the volume. Set data-self-heal, metadata-self-heal and entry-self-heal to "off".
2. Bring down brick2 and brick5 (from node2).
3. Create a fuse mount. Create files and directories (around 45,00,000 files in each sub-volume to self-heal).
4. Bring brick2 and brick5 back online (service glusterd restart).
5. Execute "gluster volume heal <volume_name> info" from all the nodes.

Actual results:
===================
Node1:
--------------
root@rhs-client11 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed" /var/log/glusterfs/glfsheal-vol.log | wc -l
2887547

Node2:
-----------
root@rhs-client12 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed" /var/log/glusterfs/glfsheal-vol.log | wc -l
4169241

Node3:
------------
root@rhs-client13 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed" /var/log/glusterfs/glfsheal-vol.log | wc -l
1421132

Expected results:
===============


Additional info:
===================
root@rhs-client11 [Jan-20-2014-11:28:27] >gluster v info

Volume Name: vol
Type: Distributed-Replicate
Volume ID: db9d926b-63b1-448a-8d8c-9e4a9756a773
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: rhs-client11:/rhs/bricks/b1
Brick2: rhs-client12:/rhs/bricks/b1-rep1
Brick3: rhs-client13:/rhs/bricks/b1-rep2
Brick4: rhs-client11:/rhs/bricks/b2
Brick5: rhs-client12:/rhs/bricks/b2-rep1
Brick6: rhs-client13:/rhs/bricks/b2-rep2
Options Reconfigured:
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
cluster.data-self-heal: off

Number of files to self-heal:
===============================
brick1:-
=======
root@rhs-client11 [Jan-20-2014- 3:45:26] >ls -l /rhs/bricks/b2/.glusterfs/indices/xattrop/ | wc
4593810 41344283 372098553

brick3:
========
root@rhs-client13 [Jan-20-2014- 7:49:23] >ls -l /rhs/bricks/b1-rep2/.glusterfs/indices/xattrop/ | wc
4551006 40959047 368631429

brick4:-
==========
root@rhs-client11 [Jan-20-2014-10:59:54] >find /rhs/bricks/b2/.glusterfs/indices/xattrop/ | wc -l
4587245

brick6:-
=========
root@rhs-client13 [Jan-20-2014-10:59:54] >find /rhs/bricks/b2-rep2/.glusterfs/indices/xattrop/ | wc -l
4587390

--- Additional comment from Krutika Dhananjay on 2014-10-13 03:50:18 EDT ---

Preliminary analysis from the bug Description:
---------------------------------------------
1) Heal info was executed from all nodes in the cluster. The glfsheal program in turn takes locks in both the sh domain and the xlator domain before determining whether a file needs healing.
So it is quite likely that the glfsheal process on one of the nodes grabs the locks while the ones running on the other nodes attempt to take the same locks at the same time and fail.

2) The steps involved creating data on the mount while some of the bricks were down. This means that after these nodes are brought back up, index self-heal kicks in, which again takes locks of the same nature as the glfsheal processes.

This means there is lock contention between:
a) the glfsheal process on node x vs the glfsheal process on node y (x != y)
b) the self-heal daemon vs the glfsheal processes

Will run a stripped-down version of the test case (as simple as a single heal-info execution without any self-heal in progress) with 3.0.0, examine the logs, then add more participants to the test case one at a time and post the results.
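For reference, a minimal sketch of the stripped-down check described above, assuming the volume name "vol" and glfsheal log path from this report, and that the log file already exists on the node. Disabling cluster.self-heal-daemon is only a way to keep the self-heal daemon out of the run for this sketch; it is not part of the original plan.

VOL=vol
LOG=/var/log/glusterfs/glfsheal-${VOL}.log

gluster volume set "$VOL" cluster.self-heal-daemon off   # keep shd out of the picture

before=$(grep -c "Non blocking entrylks failed" "$LOG")
gluster volume heal "$VOL" info > /dev/null              # a single heal-info run
after=$(grep -c "Non blocking entrylks failed" "$LOG")
echo "messages added by one heal-info run: $((after - before))"

gluster volume set "$VOL" cluster.self-heal-daemon on    # restore shd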
REVIEW: http://review.gluster.org/8965 (cluster/afr: Fix excessive logging in glfsheal log file) posted (#1) for review on release-3.5 by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/8965 committed in release-3.5 by Niels de Vos (ndevos)
------
commit a539b29c1c28dff78fa2314deafd2948f5f8ae1a
Author: Krutika Dhananjay <kdhananj>
Date:   Wed Oct 22 04:18:59 2014 +0530

    cluster/afr: Fix excessive logging in glfsheal log file

    Wrong afr_local_t instance was being used in the missing entry sh check
    in afr_self_heal(), which was leading to entrylk failure messages of the
    following kind in the glfsheal logfile:

    [2014-10-21 12:39:04.109875] I [afr-self-heal-common.c:2146:afr_sh_post_nb_entrylk_missing_entry_sh_cbk] 0-vol-replicate-1: Non blocking entrylks failed

    The fix involves sending the right "local" to
    afr_can_start_missing_entry_gfid_self_heal().

    After fixing this, there were two more codepaths giving out too many log
    messages of the following kinds:

    [2014-10-21 22:19:29.568533] E [afr-self-heal-data.c:1611:afr_sh_data_open_cbk] 0-dis-rep-replicate-1: open of 8a858b02-0fc7-4713-9f61-8ca28dea82c0 failed on child dis-rep-client-2 (Stale file handle)

    [2014-10-21 22:19:29.577948] E [afr-self-heal-entry.c:2353:afr_sh_post_nonblocking_entry_cbk] 0-dis-rep-replicate-1: Non Blocking entrylks failed for ff9c82c4-5c0c-4ed9-b745-604a28dc352d.

    which are also fixed appropriately as part of this patch.

    Change-Id: Idd8d8e5735ee7a4ac36f369525f96e53276e0859
    BUG: 1153629
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/8965
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Niels de Vos <ndevos>
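A quick, hedged way to confirm that none of the three message kinds quoted in the commit above still flood the logs after moving to a build that contains this patch. The log locations are the usual defaults and may differ on other installs, and "failed on child" is deliberately a loose pattern that can match other open failures as well.

LOGDIR=/var/log/glusterfs
for pat in "Non blocking entrylks failed" \
           "Non Blocking entrylks failed for" \
           "failed on child"; do
    count=$(grep "$pat" "$LOGDIR"/glfsheal-*.log "$LOGDIR"/glustershd.log 2>/dev/null | wc -l)
    echo "$count occurrence(s) of: $pat"
done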
The second Beta for GlusterFS 3.5.3 has been released [1]. Please verify if the release solves this bug report for you. In case the glusterfs-3.5.3beta2 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions have been made available on [2] to make testing easier.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019359.html
[2] http://download.gluster.org/pub/gluster/glusterfs/qa-releases/3.5.3beta2/
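One possible way to spot-check the beta on a test node, sketched here only as a suggestion: the volume name "vol" is carried over from this report, and the tail/kill lines are just housekeeping for watching the log live.

gluster --version | head -1                      # confirm the 3.5.3beta2 build is installed
tail -f /var/log/glusterfs/glfsheal-vol.log &    # watch the glfsheal log while heal info runs
TAILPID=$!
gluster volume heal vol info > /dev/null
kill "$TAILPID"                                  # the earlier flood of "Non blocking entrylks failed"
                                                 # messages should no longer show up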
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.3, please reopen this bug report.

glusterfs-3.5.3 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/announce/2014-November/000042.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/