763317 – (GLUSTER-1585) Errors during self-heal

Summary: Errors during self-heal

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	GLUSTER-1585
Product:	GlusterFS
Classification:	Community
Component:	transport
Sub Component:
Version:	mainline
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Amar Tumballi
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-09-09 12:31 UTC by Anush Shetty
Modified:	2015-12-01 16:45 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments
log file in DEBUG (380.36 KB, application/x-gzip) 2010-09-14 13:16 UTC, Amar Tumballi

Description Anush Shetty 2010-09-09 10:59:55 UTC

This was observed specifically when self-healing a large number of files (10,000). Self-heal worked for a small number of files: a test with only 10 files succeeded.

Comment 1 Anush Shetty 2010-09-09 12:31:51 UTC

In a two-subvolume replica, no files get healed from server1 to server2. Triggering self-heal produces "Failed to encode message" in the server log file, and client3_1_readdirp_cbk reports an error in the client log.

Server log:

[2010-09-09 17:24:24.540695] I [server-handshake.c:523:server_setvolume] 1724_09_09-server: accepted client from 127.0.0.1:1014
[2010-09-09 17:24:24.634869] I [server-handshake.c:523:server_setvolume] 1724_09_09-server: accepted client from 127.0.0.1:1012
[2010-09-09 17:40:38.294617] E [server.c:66:gfs_serialize_reply] : Failed to encode message

Client log:

[2010-09-09 17:40:38.282340] D [afr-self-heal-entry.c:2239:afr_sh_entry_sync_prepare] 1724_09_09-replicate-0: self-healing directory / from subvolume 1724_09_09-client-1 to 1 other
[2010-09-09 17:40:38.295105] E [client3_1-fops.c:1604:client3_1_readdirp_cbk] : error
[2010-09-09 17:40:38.295148] D [afr-self-heal-entry.c:1979:afr_sh_entry_impunge_readdir_cbk] 1724_09_09-replicate-0: readdir of / on subvolume 1724_09_09-client-1 failed (Invalid argument)
[2010-09-09 17:40:38.295475] W [afr-self-heal-common.c:584:afr_sh_pending_to_delta] afr_sh_pending_to_delta: Unable to get dict value.
[2010-09-09 17:40:38.295495] W [afr-self-heal-common.c:584:afr_sh_pending_to_delta] afr_sh_pending_to_delta: Unable to get dict value.
[2010-09-09 17:40:38.295747] D [afr-lk-common.c:415:transaction_lk_op] 1724_09_09-replicate-0: lk op is for a self heal
[2010-09-09 17:40:38.295920] I [afr-self-heal-common.c:1583:afr_self_heal_completion_cbk] 1724_09_09-replicate-0: background  entry self-heal completed on /
[2010-09-09 17:40:38.301945] D [stat-prefetch.c:3581:sp_release] 1724_09_09: cache hits: 0, cache miss: 0

Comment 2 Vijay Bellur 2010-09-09 15:53:45 UTC

Downgrading severity as it does not happen for all cases.

Comment 3 Amar Tumballi 2010-09-13 05:17:07 UTC

Found that this happens because more data is sent than the buffer can hold (a consequence of the XDR encoding).
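
For illustration only, here is a minimal sketch of how encoding a directory listing into a fixed-size buffer can fail once enough entries are packed into one reply. It is not GlusterFS code: `enc_buf`, `enc_name`, and `enc_dir_reply` are hypothetical names, and the only XDR behaviour assumed is the standard one of a 4-byte length prefix with data padded to a multiple of 4 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size reply buffer (stand-in for an RPC reply buffer). */
struct enc_buf {
    unsigned char *base;  /* start of the buffer            */
    size_t         size;  /* total capacity in bytes        */
    size_t         used;  /* bytes already filled by encode */
};

/* XDR pads variable-length data to a multiple of 4 bytes. */
size_t xdr_padded(size_t len)
{
    return (len + 3u) & ~(size_t)3u;
}

/* Encode one entry name as a 4-byte big-endian length followed by the
 * zero-padded bytes. Returns 0 on success, -1 if the entry does not fit. */
int enc_name(struct enc_buf *b, const char *name)
{
    size_t len  = strlen(name);
    size_t need = 4 + xdr_padded(len);

    if (need > b->size - b->used)
        return -1;                      /* would overrun the buffer */

    unsigned char *p = b->base + b->used;
    p[0] = (unsigned char)(len >> 24);
    p[1] = (unsigned char)(len >> 16);
    p[2] = (unsigned char)(len >> 8);
    p[3] = (unsigned char)len;
    memcpy(p + 4, name, len);
    memset(p + 4 + len, 0, xdr_padded(len) - len);
    b->used += need;
    return 0;
}

/* Encode `count` names into one reply. Fails as a whole (-1) as soon as
 * one entry does not fit, which is the "failed to encode" situation. */
int enc_dir_reply(struct enc_buf *b, const char *const *names, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (enc_name(b, names[i]) != 0)
            return -1;
    }
    return 0;
}
```

In this toy model, each short name costs 8 bytes after padding, so a 16-byte buffer holds two entries and the third makes the whole reply fail. Requesting fewer entries per reply keeps the encoded size under the buffer limit.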

Comment 4 Raghavendra G 2010-09-13 08:17:14 UTC

Is this bug observed even after commit fb0bb972dfac3c255 (fix to bug 763162)?

Comment 5 Vijay Bellur 2010-09-13 10:45:01 UTC

PATCH: http://patches.gluster.com/patch/4736 in master (afr: reduce the size of readdir request during entry-self-heal)

Comment 6 Amar Tumballi 2010-09-14 02:44:09 UTC

With the committed patch, self-heal proceeds further, but there are more issues on the server side (stack overflow due to recursion).

Comment 7 Amar Tumballi 2010-09-14 13:16:32 UTC

Created attachment 309

Comment 8 Amar Tumballi 2010-09-14 13:20:35 UTC

Noticed that the pattern is something like the following:

>afr_setattr() -> afr locking/xattrop process -> afr_lock() then tries an inodelk and gets back an EAGAIN error, so control falls through to blocking locks, and all the operations happen. After the unlock(), control goes back to afr_internal_lock_finish(), where the whole process restarts (because it goes to pre_op() instead of finishing all the operations).
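
A toy model of the pattern described above, as a sketch only: the `txn_*` names are hypothetical and this is not the AFR code. It assumes a lock-finish callback that starts the pre-op stage, and a pre-op stage whose unlock path calls lock-finish again. The `limit` parameter is an artificial cap so the unguarded case terminates instead of overflowing the stack.

```c
#include <stdbool.h>

/* Hypothetical transaction state for the toy model. */
struct txn {
    bool done;       /* set once the operation itself has run     */
    int  depth;      /* current nesting depth of pre-op calls     */
    int  max_depth;  /* deepest nesting seen                      */
    int  ops_run;    /* how many times the operation was executed */
};

void txn_lock_finish(struct txn *t, bool guard, int limit);

/* Pre-op stage: run the operation, then unlock. The unlock path calls
 * back into lock-finish, which is where the cycle can form. */
void txn_pre_op(struct txn *t, bool guard, int limit)
{
    t->depth++;
    if (t->depth > t->max_depth)
        t->max_depth = t->depth;

    if (t->depth < limit) {               /* artificial cap for the demo */
        t->ops_run++;
        t->done = true;
        txn_lock_finish(t, guard, limit); /* unlock -> lock-finish again */
    }
    t->depth--;
}

/* Lock-finish callback. Without the guard it always restarts at pre-op;
 * with the guard it returns once the operation has already completed. */
void txn_lock_finish(struct txn *t, bool guard, int limit)
{
    if (guard && t->done)
        return;
    txn_pre_op(t, guard, limit);
}
```

Calling `txn_lock_finish` on a zero-initialized `struct txn` with `guard` set to false nests until the cap is reached, which mirrors the unbounded recursion behind the stack overflow. With `guard` set to true the operation runs once and the callback returns.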

Comment 9 Amar Tumballi 2010-09-14 13:46:44 UTC

Tried removing the following lines in pump.c, and things worked fine:


diff --git a/xlators/cluster/afr/src/pump.c b/xlators/cluster/afr/src/pump.c
index 977de07..bdce874 100644
--- a/xlators/cluster/afr/src/pump.c
+++ b/xlators/cluster/afr/src/pump.c
@@ -1807,34 +1807,21 @@ fini (xlator_t *this)
 struct xlator_fops fops = {
        .lookup      = afr_lookup,
        .open        = afr_open,
-       .lk          = afr_lk,
        .flush       = afr_flush,
-       .statfs      = afr_statfs,
        .fsync       = afr_fsync,
-       .fsyncdir    = afr_fsyncdir,
        .xattrop     = afr_xattrop,
        .fxattrop    = afr_fxattrop,
-       .inodelk     = afr_inodelk,
-       .finodelk    = afr_finodelk,
-       .entrylk     = afr_entrylk,
-       .fentrylk    = afr_fentrylk,

        /* inode read */
-       .access      = afr_access,
-       .stat        = afr_stat,
-       .fstat       = afr_fstat,
-       .readlink    = afr_readlink,
        .getxattr    = pump_getxattr,
-       .readv       = afr_readv,

        /* inode write */
        .writev      = afr_writev,
        .truncate    = afr_truncate,
        .ftruncate   = afr_ftruncate,
        .setxattr    = pump_setxattr,
-        .setattr     = afr_setattr,
-       .fsetattr    = afr_fsetattr,
-       .removexattr = afr_removexattr,
+//        .setattr     = afr_setattr,
+//        .fsetattr    = afr_fsetattr,

        /* dir read */
        .opendir     = afr_opendir,

Comment 10 Amar Tumballi 2010-09-16 03:22:15 UTC

This particular bug is fixed for now, but bug 763322 is currently the blocker. Will close this one and track bug 763322.
