+++ This bug was initially created as a clone of Bug #1055489 +++

Description of problem:
=======================
The following info message is seen excessively in the glfsheal log file.

[2014-01-20 10:48:48.153464] I [afr-self-heal-common.c:2098:afr_sh_post_nb_entrylk_missing_entry_sh_cbk] 0-vol-replicate-0: Non blocking entrylks failed.

This INFO message is of no use. It does not even indicate which directory the "Non blocking entrylks failed" message refers to.

Number of messages in the log file: 4169241, and the log file is still growing with this message.

root@rhs-client12 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed" /var/log/glusterfs/glfsheal-vol.log | wc -l
4169241
root@rhs-client12 [Jan-20-2014-11:28:33] >

Version-Release number of selected component (if applicable):
=============================================================
glusterfs 3.4.0.57rhs built on Jan 13 2014 06:59:05

How reproducible:


Steps to Reproduce:
======================
1. Create a 2 x 3 distribute-replicate volume (3 storage nodes with 2 bricks on each node). Start the volume. Set data-self-heal, metadata-self-heal and entry-self-heal to "off".
2. Bring down brick2 and brick5 (from node2).
3. Create a fuse mount. Create files and directories (around 45,00,000 files in each sub-volume to self-heal).
4. Bring brick2 and brick5 back online (service glusterd restart).
5. Execute "gluster volume heal <volume_name> info" from all the nodes.

Actual results:
===================
Node1:
--------------
root@rhs-client11 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed" /var/log/glusterfs/glfsheal-vol.log | wc -l
2887547

Node2:
-----------
root@rhs-client12 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed" /var/log/glusterfs/glfsheal-vol.log | wc -l
4169241

Node3:
------------
root@rhs-client13 [Jan-20-2014-11:28:10] >grep "Non blocking entrylks failed" /var/log/glusterfs/glfsheal-vol.log | wc -l
1421132

Expected results:
===============


Additional info:
===================
root@rhs-client11 [Jan-20-2014-11:28:27] >gluster v info

Volume Name: vol
Type: Distributed-Replicate
Volume ID: db9d926b-63b1-448a-8d8c-9e4a9756a773
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: rhs-client11:/rhs/bricks/b1
Brick2: rhs-client12:/rhs/bricks/b1-rep1
Brick3: rhs-client13:/rhs/bricks/b1-rep2
Brick4: rhs-client11:/rhs/bricks/b2
Brick5: rhs-client12:/rhs/bricks/b2-rep1
Brick6: rhs-client13:/rhs/bricks/b2-rep2
Options Reconfigured:
cluster.entry-self-heal: off
cluster.metadata-self-heal: off
cluster.data-self-heal: off

Number of files to self-heal:
===============================
brick1:-
=======
root@rhs-client11 [Jan-20-2014- 3:45:26] >ls -l /rhs/bricks/b2/.glusterfs/indices/xattrop/ | wc
4593810 41344283 372098553

brick3:
========
root@rhs-client13 [Jan-20-2014- 7:49:23] >ls -l /rhs/bricks/b1-rep2/.glusterfs/indices/xattrop/ | wc
4551006 40959047 368631429

brick4:-
==========
root@rhs-client11 [Jan-20-2014-10:59:54] >find /rhs/bricks/b2/.glusterfs/indices/xattrop/ | wc -l
4587245

brick6:-
=========
root@rhs-client13 [Jan-20-2014-10:59:54] >find /rhs/bricks/b2-rep2/.glusterfs/indices/xattrop/ | wc -l
4587390

--- Additional comment from Krutika Dhananjay on 2014-10-13 03:50:18 EDT ---

Preliminary analysis from the bug Description:
---------------------------------------------
1) Heal info was executed from all nodes in the cluster. The glfsheal program in turn takes locks in both the sh domain and the xlator domain before determining whether a file needs healing.
So it is quite likely that the glfsheal process on one of the nodes grabs the locks while the ones running on the other nodes attempt to take the same locks at the same time and fail.

2) The steps involved creating data on the mount while some of the bricks were down. This means that after these nodes are brought back up, index self-heal kicks in, which again takes locks of the same nature as the glfsheal processes.

This means there is lock contention between:
a) the glfsheal process on node x vs the glfsheal process on node y (x != y)
b) the self-heal daemon vs the glfsheal processes

Will run a stripped-down version of the test case (as simple as a single heal-info execution without any self-heal in progress) with 3.0.0, examine the logs, then add more participants to the test case one at a time and post the results.
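For reference, a minimal sketch of the stripped-down check described above, assuming the volume name "vol" and glfsheal log path from this report, and that the log file already exists on the node. Disabling cluster.self-heal-daemon is only a way to keep the self-heal daemon out of the run for this sketch; it is not part of the original plan.

VOL=vol
LOG=/var/log/glusterfs/glfsheal-${VOL}.log

gluster volume set "$VOL" cluster.self-heal-daemon off   # keep shd out of the picture

before=$(grep -c "Non blocking entrylks failed" "$LOG")
gluster volume heal "$VOL" info > /dev/null              # a single heal-info run
after=$(grep -c "Non blocking entrylks failed" "$LOG")
echo "messages added by one heal-info run: $((after - before))"

gluster volume set "$VOL" cluster.self-heal-daemon on    # restore shd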
REVIEW: http://review.gluster.org/8965 (cluster/afr: Fix excessive logging in glfsheal log file) posted (#1) for review on release-3.5 by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/8965 committed in release-3.5 by Niels de Vos (ndevos)
------
commit a539b29c1c28dff78fa2314deafd2948f5f8ae1a
Author: Krutika Dhananjay <kdhananj>
Date:   Wed Oct 22 04:18:59 2014 +0530

    cluster/afr: Fix excessive logging in glfsheal log file

    Wrong afr_local_t instance was being used in the missing entry sh check
    in afr_self_heal(), which was leading to entrylk failure messages of the
    following kind in the glfsheal logfile:

    [2014-10-21 12:39:04.109875] I [afr-self-heal-common.c:2146:afr_sh_post_nb_entrylk_missing_entry_sh_cbk] 0-vol-replicate-1: Non blocking entrylks failed

    The fix involves sending the right "local" to
    afr_can_start_missing_entry_gfid_self_heal().

    After fixing this, there were two more codepaths giving out too many log
    messages of the following kinds:

    [2014-10-21 22:19:29.568533] E [afr-self-heal-data.c:1611:afr_sh_data_open_cbk] 0-dis-rep-replicate-1: open of 8a858b02-0fc7-4713-9f61-8ca28dea82c0 failed on child dis-rep-client-2 (Stale file handle)

    [2014-10-21 22:19:29.577948] E [afr-self-heal-entry.c:2353:afr_sh_post_nonblocking_entry_cbk] 0-dis-rep-replicate-1: Non Blocking entrylks failed for ff9c82c4-5c0c-4ed9-b745-604a28dc352d.

    which are also fixed appropriately as part of this patch.

    Change-Id: Idd8d8e5735ee7a4ac36f369525f96e53276e0859
    BUG: 1153629
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/8965
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Niels de Vos <ndevos>
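A quick, hedged way to confirm that none of the three message kinds quoted in the commit above still flood the logs after moving to a build that contains this patch. The log locations are the usual defaults and may differ on other installs, and "failed on child" is deliberately a loose pattern that can match other open failures as well.

LOGDIR=/var/log/glusterfs
for pat in "Non blocking entrylks failed" \
           "Non Blocking entrylks failed for" \
           "failed on child"; do
    count=$(grep "$pat" "$LOGDIR"/glfsheal-*.log "$LOGDIR"/glustershd.log 2>/dev/null | wc -l)
    echo "$count occurrence(s) of: $pat"
done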
The second Beta for GlusterFS 3.5.3 has been released [1]. Please verify if the release solves this bug report for you. In case the glusterfs-3.5.3beta2 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions have been made available on [2] to make testing easier.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019359.html
[2] http://download.gluster.org/pub/gluster/glusterfs/qa-releases/3.5.3beta2/
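One possible way to spot-check the beta on a test node, sketched here only as a suggestion: the volume name "vol" is carried over from this report, and the tail/kill lines are just housekeeping for watching the log live.

gluster --version | head -1                      # confirm the 3.5.3beta2 build is installed
tail -f /var/log/glusterfs/glfsheal-vol.log &    # watch the glfsheal log while heal info runs
TAILPID=$!
gluster volume heal vol info > /dev/null
kill "$TAILPID"                                  # the earlier flood of "Non blocking entrylks failed"
                                                 # messages should no longer show up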
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.5.3, please reopen this bug report.

glusterfs-3.5.3 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/announce/2014-November/000042.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/