Bug 763317 (GLUSTER-1585) - Errors during self-heal
Summary: Errors during self-heal
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-1585
Product: GlusterFS
Classification: Community
Component: transport
Version: mainline
Hardware: All
OS: Linux
low
high
Target Milestone: ---
Assignee: Amar Tumballi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-09-09 12:31 UTC by Anush Shetty
Modified: 2015-12-01 16:45 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
log file in DEBUG (380.36 KB, application/x-gzip)
2010-09-14 13:16 UTC, Amar Tumballi

Description Anush Shetty 2010-09-09 10:59:55 UTC
This was observed specifically when self-healing a large number of files (10000 files). Self-heal worked for a small number of files: tried with only 10 files and it worked.

Comment 1 Anush Shetty 2010-09-09 12:31:51 UTC
In a 2-subvolume replica, no files get healed from server1 to server2. Triggering self-heal gives "Failed to encode message" in the server log file. client3_1_readdirp_cbk errors out on the client too.

Server log:

[2010-09-09 17:24:24.540695] I [server-handshake.c:523:server_setvolume] 1724_09_09-server: accepted client from 127.0.0.1:1014
[2010-09-09 17:24:24.634869] I [server-handshake.c:523:server_setvolume] 1724_09_09-server: accepted client from 127.0.0.1:1012
[2010-09-09 17:40:38.294617] E [server.c:66:gfs_serialize_reply] : Failed to encode message

Client log:

[2010-09-09 17:40:38.282340] D [afr-self-heal-entry.c:2239:afr_sh_entry_sync_prepare] 1724_09_09-replicate-0: self-healing directory / from subvolume 1724_09_09-client-1 to 1 other
[2010-09-09 17:40:38.295105] E [client3_1-fops.c:1604:client3_1_readdirp_cbk] : error
[2010-09-09 17:40:38.295148] D [afr-self-heal-entry.c:1979:afr_sh_entry_impunge_readdir_cbk] 1724_09_09-replicate-0: readdir of / on subvolume 1724_09_09-client-1 failed (Invalid argument)
[2010-09-09 17:40:38.295475] W [afr-self-heal-common.c:584:afr_sh_pending_to_delta] afr_sh_pending_to_delta: Unable to get dict value.
[2010-09-09 17:40:38.295495] W [afr-self-heal-common.c:584:afr_sh_pending_to_delta] afr_sh_pending_to_delta: Unable to get dict value.
[2010-09-09 17:40:38.295747] D [afr-lk-common.c:415:transaction_lk_op] 1724_09_09-replicate-0: lk op is for a self heal
[2010-09-09 17:40:38.295920] I [afr-self-heal-common.c:1583:afr_self_heal_completion_cbk] 1724_09_09-replicate-0: background  entry self-heal completed on /
[2010-09-09 17:40:38.301945] D [stat-prefetch.c:3581:sp_release] 1724_09_09: cache hits: 0, cache miss: 0

Comment 2 Vijay Bellur 2010-09-09 15:53:45 UTC
Downgrading severity as it does not happen for all cases.

Comment 3 Amar Tumballi 2010-09-13 05:17:07 UTC
Found that this happens because more data is being sent than the buffer can hold (a consequence of XDR encoding).
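
To illustrate why encoded data can outgrow a buffer sized for the raw data, here is a minimal sketch of XDR string sizing. Per RFC 4506, a variable-length string is encoded as a 4-byte length followed by the bytes, padded to a multiple of 4. The helper names (xdr_string_size, encoded_names_size, names_fit) are hypothetical and are not GlusterFS functions. A real readdirp reply also carries per-entry metadata, which this sketch ignores.

```c
#include <stddef.h>
#include <string.h>

/* XDR (RFC 4506): a variable-length string is a 4-byte length word
 * followed by the bytes, zero-padded up to a multiple of 4. */
size_t xdr_string_size(size_t len)
{
    return 4 + ((len + 3) & ~(size_t)3);
}

/* Hypothetical helper: total encoded size of `count` entry names.
 * Per-entry metadata in a real reply is deliberately left out. */
size_t encoded_names_size(const char *const *names, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += xdr_string_size(strlen(names[i]));
    return total;
}

/* Returns 1 if the encoded names fit in buf_size bytes, else 0. */
int names_fit(const char *const *names, size_t count, size_t buf_size)
{
    return encoded_names_size(names, count) <= buf_size;
}
```

A 1-byte name costs 8 bytes once encoded, so a reply sized from raw name lengths can fail to encode when the directory holds many entries.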

Comment 4 Raghavendra G 2010-09-13 08:17:14 UTC
Is this bug observed even after commit fb0bb972dfac3c255 (fix to bug 763162)?

Comment 5 Vijay Bellur 2010-09-13 10:45:01 UTC
PATCH: http://patches.gluster.com/patch/4736 in master (afr: reduce the size of readdir request during entry-self-heal)
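
The patch itself is not reproduced in this report. As a rough sketch of the general idea behind a smaller readdir request, the function below counts how many requests a listing needs when each reply is capped at a fixed size. The name requests_needed and all of its parameters are assumptions for illustration, not values or code taken from the patch.

```c
#include <stddef.h>

/* Hypothetical: number of requests needed to list `total_entries`
 * when each encoded entry costs `entry_bytes` and a single reply may
 * hold at most `max_reply_bytes`. Returns 0 if there is nothing to
 * list or if one entry cannot fit in a reply at all. */
size_t requests_needed(size_t total_entries, size_t entry_bytes,
                       size_t max_reply_bytes)
{
    if (entry_bytes == 0 || entry_bytes > max_reply_bytes)
        return 0;

    size_t per_reply = max_reply_bytes / entry_bytes;

    /* Round up so a partial final batch still counts as a request. */
    return (total_entries + per_reply - 1) / per_reply;
}
```

With these made-up numbers, 10 entries of 100 bytes fit in one 1000-byte reply, while 10000 entries need 1000 such replies; no single reply has to exceed the buffer.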

Comment 6 Amar Tumballi 2010-09-14 02:44:09 UTC
With the committed patch, self-heal proceeds further, but there are more issues on the server side (stack overflow due to recursion).

Comment 7 Amar Tumballi 2010-09-14 13:16:32 UTC
Created attachment 309

Comment 8 Amar Tumballi 2010-09-14 13:20:35 UTC
Noticed that the pattern is something like the following:

>afr_setattr() -> afr locking/xattrop process -> afr_lock() then tries an inodelk and gets back an EAGAIN error, so control goes to blocking locks, and all the operations happen. After the unlock(), control goes back to afr_internal_lock_finish(), where the whole process restarts (because it goes to pre_op() instead of finishing all the operations).
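
The restart pattern described above can be modelled with a small self-contained sketch. Every name here (struct txn, txn_pre_op, txn_lock_finish, run_txn) is hypothetical and does not mirror the actual afr code. The max_restarts cap exists only so the unguarded case terminates; it stands in for the point where a real stack would overflow.

```c
#include <errno.h>

struct txn {
    int ops_done;      /* set once the operation itself has run */
    int restarts;      /* how many times pre_op was entered */
    int guard;         /* 1 = finish when ops_done, 0 = always restart */
    int max_restarts;  /* cap so the unguarded case terminates */
};

/* The non-blocking attempt is always contended in this model. */
static int try_nonblocking_lock(void) { return -EAGAIN; }
static int blocking_lock(void) { return 0; }

static void txn_pre_op(struct txn *t);

/* Called after unlock. Without the guard it re-enters pre_op even
 * though the operation has already completed. */
static void txn_lock_finish(struct txn *t)
{
    if (t->guard && t->ops_done)
        return;                      /* finish instead of restarting */
    if (t->restarts >= t->max_restarts)
        return;                      /* stand-in for stack exhaustion */
    txn_pre_op(t);
}

static void txn_pre_op(struct txn *t)
{
    t->restarts++;
    if (try_nonblocking_lock() == -EAGAIN)
        (void)blocking_lock();       /* fall back to the blocking lock */
    t->ops_done = 1;                 /* the operation itself */
    txn_lock_finish(t);              /* control returns here after unlock */
}

/* Runs one transaction and reports how many times pre_op was entered. */
int run_txn(int guard, int max_restarts)
{
    struct txn t = { 0, 0, guard, max_restarts };
    txn_pre_op(&t);
    return t.restarts;
}
```

With the guard enabled, pre_op is entered once. Without it, the call chain keeps nesting until the cap is reached, which is the shape of recursion that would exhaust the stack.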

Comment 9 Amar Tumballi 2010-09-14 13:46:44 UTC
Tried removing the following lines in pump.c, and things worked fine:


diff --git a/xlators/cluster/afr/src/pump.c b/xlators/cluster/afr/src/pump.c
index 977de07..bdce874 100644
--- a/xlators/cluster/afr/src/pump.c
+++ b/xlators/cluster/afr/src/pump.c
@@ -1807,34 +1807,21 @@ fini (xlator_t *this)
 struct xlator_fops fops = {
        .lookup      = afr_lookup,
        .open        = afr_open,
-       .lk          = afr_lk,
        .flush       = afr_flush,
-       .statfs      = afr_statfs,
        .fsync       = afr_fsync,
-       .fsyncdir    = afr_fsyncdir,
        .xattrop     = afr_xattrop,
        .fxattrop    = afr_fxattrop,
-       .inodelk     = afr_inodelk,
-       .finodelk    = afr_finodelk,
-       .entrylk     = afr_entrylk,
-       .fentrylk    = afr_fentrylk,

        /* inode read */
-       .access      = afr_access,
-       .stat        = afr_stat,
-       .fstat       = afr_fstat,
-       .readlink    = afr_readlink,
        .getxattr    = pump_getxattr,
-       .readv       = afr_readv,

        /* inode write */
        .writev      = afr_writev,
        .truncate    = afr_truncate,
        .ftruncate   = afr_ftruncate,
        .setxattr    = pump_setxattr,
-        .setattr     = afr_setattr,
-       .fsetattr    = afr_fsetattr,
-       .removexattr = afr_removexattr,
+//        .setattr     = afr_setattr,
+//        .fsetattr    = afr_fsetattr,

        /* dir read */
        .opendir     = afr_opendir,

Comment 10 Amar Tumballi 2010-09-16 03:22:15 UTC
This particular bug is fixed for now, but bug 763322 is the current blocker. Will close this one and track bug 763322.

