Bug 763317 (GLUSTER-1585)

Summary: Errors during self-heal
Product: [Community] GlusterFS
Reporter: Anush Shetty <anush>
Component: transport
Assignee: Amar Tumballi <amarts>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: low
Version: mainline
CC: amarts, gluster-bugs, raghavendra, vijay, vraman
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Attachments:
Description: log file in DEBUG
Flags: none

Description Anush Shetty 2010-09-09 10:59:55 UTC
This was observed specifically when self-healing a large number of files (10000 files). Self-heal worked for a small number of files: a test with only 10 files succeeded.

Comment 1 Anush Shetty 2010-09-09 12:31:51 UTC
In a 2-subvolume replica, no files get healed from server1 to server2. Triggering self-heal gives "Failed to encode message" in the server log file. The client3_1_readdirp_cbk errors out too.

Server log:

[2010-09-09 17:24:24.540695] I [server-handshake.c:523:server_setvolume] 1724_09_09-server: accepted client from 127.0.0.1:1014
[2010-09-09 17:24:24.634869] I [server-handshake.c:523:server_setvolume] 1724_09_09-server: accepted client from 127.0.0.1:1012
[2010-09-09 17:40:38.294617] E [server.c:66:gfs_serialize_reply] : Failed to encode message

Client log:

[2010-09-09 17:40:38.282340] D [afr-self-heal-entry.c:2239:afr_sh_entry_sync_prepare] 1724_09_09-replicate-0: self-healing directory / from subvolume 1724_09_09-client-1 to 1 other
[2010-09-09 17:40:38.295105] E [client3_1-fops.c:1604:client3_1_readdirp_cbk] : error
[2010-09-09 17:40:38.295148] D [afr-self-heal-entry.c:1979:afr_sh_entry_impunge_readdir_cbk] 1724_09_09-replicate-0: readdir of / on subvolume 1724_09_09-client-1 failed (Invalid argument)
[2010-09-09 17:40:38.295475] W [afr-self-heal-common.c:584:afr_sh_pending_to_delta] afr_sh_pending_to_delta: Unable to get dict value.
[2010-09-09 17:40:38.295495] W [afr-self-heal-common.c:584:afr_sh_pending_to_delta] afr_sh_pending_to_delta: Unable to get dict value.
[2010-09-09 17:40:38.295747] D [afr-lk-common.c:415:transaction_lk_op] 1724_09_09-replicate-0: lk op is for a self heal
[2010-09-09 17:40:38.295920] I [afr-self-heal-common.c:1583:afr_self_heal_completion_cbk] 1724_09_09-replicate-0: background  entry self-heal completed on /
[2010-09-09 17:40:38.301945] D [stat-prefetch.c:3581:sp_release] 1724_09_09: cache hits: 0, cache miss: 0

Comment 2 Vijay Bellur 2010-09-09 15:53:45 UTC
Downgrading severity as it does not happen for all cases.

Comment 3 Amar Tumballi 2010-09-13 05:17:07 UTC
Found that this happens because the data being sent exceeds the buffer size (a result of the XDR encoding).
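
A minimal sketch of this failure mode, assuming a fixed-size reply buffer and XDR-style string encoding (4-byte length prefix, payload padded to a multiple of 4 bytes). The names reply_buf, encode_name and encode_listing are hypothetical and are not GlusterFS code; the 4096-byte capacity used in the note below is also only an assumption for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-size reply buffer; not a GlusterFS structure. */
struct reply_buf {
    unsigned char *data;
    size_t cap;   /* total capacity in bytes */
    size_t used;  /* bytes already encoded */
};

/* Append one name XDR-style: 4-byte big-endian length, then the bytes,
 * zero-padded to a multiple of 4. Returns 0 on success, -1 if the
 * entry does not fit in the remaining space. */
static int encode_name(struct reply_buf *b, const char *name)
{
    size_t len = strlen(name);
    size_t padded = (len + 3) & ~(size_t)3;
    size_t need = 4 + padded;

    if (need > b->cap - b->used)
        return -1;

    unsigned char *p = b->data + b->used;
    uint32_t n = (uint32_t)len;
    p[0] = (unsigned char)(n >> 24);
    p[1] = (unsigned char)(n >> 16);
    p[2] = (unsigned char)(n >> 8);
    p[3] = (unsigned char)n;
    memcpy(p + 4, name, len);
    memset(p + 4 + len, 0, padded - len);
    b->used += need;
    return 0;
}

/* Encode `count` entries named file00000, file00001, ...
 * Returns the number encoded, or -1 as soon as one entry overflows. */
static int encode_listing(struct reply_buf *b, int count)
{
    char name[32];

    for (int i = 0; i < count; i++) {
        snprintf(name, sizeof name, "file%05d", i);
        if (encode_name(b, name) != 0)
            return -1;
    }
    return count;
}
```

With a 4096-byte buffer, each 9-character name takes 16 encoded bytes, so a 10-entry listing fits easily while a 10000-entry listing fails partway through, which matches the small-versus-large directory behaviour reported above.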

Comment 4 Raghavendra G 2010-09-13 08:17:14 UTC
Is this bug observed even after commit fb0bb972dfac3c255 (fix to bug 763162)?

Comment 5 Vijay Bellur 2010-09-13 10:45:01 UTC
PATCH: http://patches.gluster.com/patch/4736 in master (afr: reduce the size of readdir request during entry-self-heal)
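
Going by the patch title, each readdir request during entry self-heal now asks for less data, presumably so that every reply fits in the encoding buffer. The trade-off (smaller replies, more round trips) can be sketched with simple arithmetic. The function name and all sizes here are illustrative assumptions, not values taken from the patch.

```c
#include <stddef.h>

/* Number of readdir round trips needed to list n_entries when each
 * reply may carry at most request_size bytes and each encoded entry
 * takes entry_size bytes. Returns 0 if not even one entry fits. */
static size_t readdir_round_trips(size_t n_entries, size_t entry_size,
                                  size_t request_size)
{
    if (entry_size == 0 || request_size < entry_size)
        return 0;

    size_t per_reply = request_size / entry_size;

    /* Ceiling division: a partially filled last reply still costs a trip. */
    return (n_entries + per_reply - 1) / per_reply;
}
```

For example, with 16-byte entries and a 4096-byte request size, 10 entries need one round trip and 10000 entries need 40.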

Comment 6 Amar Tumballi 2010-09-14 02:44:09 UTC
With the committed patch, self-heal proceeded further, but there are more issues on the server side (stack overflow due to recursion).

Comment 7 Amar Tumballi 2010-09-14 13:16:32 UTC
Created attachment 309

Comment 8 Amar Tumballi 2010-09-14 13:20:35 UTC
Noticed that the pattern is something like the following:

> afr_setattr() -> afr locking/xattrop process -> afr_lock() then tries an inodelk and gets back an EAGAIN error, so control goes to blocking locks, and all the operations happen. After the unlock(), control goes back to afr_internal_lock_finish(), where the whole process restarts, because it goes to pre_op() instead of finishing all the operations.
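
The restart cycle described above can be modelled in a few lines. This is only a sketch under stated assumptions: struct txn, txn_pre_op, txn_lock_finish and the guard flag are hypothetical stand-ins, not the AFR code, and the guard is just one illustrative way to break the cycle, not the fix that was applied. MAX_DEPTH exists only so that the unguarded case terminates here instead of overflowing the stack.

```c
#include <errno.h>

enum { MAX_DEPTH = 64 };  /* cap so the sketch itself cannot overflow */

struct txn {
    int depth;     /* nested restarts so far */
    int ops_done;  /* how many times the operation was performed */
    int finished;  /* set once the operation and unlock have happened */
    int guard;     /* 1 = never restart a finished transaction */
};

static void txn_pre_op(struct txn *t);

/* In this sketch the first lock attempt always reports contention. */
static int try_lock(void)
{
    return -EAGAIN;
}

/* Called after unlock. Without the guard it re-enters pre_op, which
 * repeats the whole lock/operate/unlock sequence one frame deeper. */
static void txn_lock_finish(struct txn *t)
{
    if (t->guard && t->finished)
        return;               /* transaction is done: stop here */
    if (t->depth >= MAX_DEPTH)
        return;               /* artificial stop for the sketch */
    t->depth++;
    txn_pre_op(t);
}

static void txn_pre_op(struct txn *t)
{
    if (try_lock() == -EAGAIN) {
        /* Fall back to blocking locks: lock, operate, unlock. */
        t->ops_done++;
        t->finished = 1;
        /* After unlock, control returns to the lock-finish step. */
        txn_lock_finish(t);
    }
}

/* Run one transaction and report how often the operation ran. */
static int run_txn(int guard)
{
    struct txn t = { 0, 0, 0, guard };

    txn_pre_op(&t);
    return t.ops_done;
}
```

With the guard, run_txn(1) performs the operation once. Without it, run_txn(0) repeats the operation until the depth cap is hit; in a real call chain with no cap, that nesting is what would exhaust the stack.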

Comment 9 Amar Tumballi 2010-09-14 13:46:44 UTC
Tried removing the following lines in pump.c, and things worked fine:


diff --git a/xlators/cluster/afr/src/pump.c b/xlators/cluster/afr/src/pump.c
index 977de07..bdce874 100644
--- a/xlators/cluster/afr/src/pump.c
+++ b/xlators/cluster/afr/src/pump.c
@@ -1807,34 +1807,21 @@ fini (xlator_t *this)
 struct xlator_fops fops = {
        .lookup      = afr_lookup,
        .open        = afr_open,
-       .lk          = afr_lk,
        .flush       = afr_flush,
-       .statfs      = afr_statfs,
        .fsync       = afr_fsync,
-       .fsyncdir    = afr_fsyncdir,
        .xattrop     = afr_xattrop,
        .fxattrop    = afr_fxattrop,
-       .inodelk     = afr_inodelk,
-       .finodelk    = afr_finodelk,
-       .entrylk     = afr_entrylk,
-       .fentrylk    = afr_fentrylk,

        /* inode read */
-       .access      = afr_access,
-       .stat        = afr_stat,
-       .fstat       = afr_fstat,
-       .readlink    = afr_readlink,
        .getxattr    = pump_getxattr,
-       .readv       = afr_readv,

        /* inode write */
        .writev      = afr_writev,
        .truncate    = afr_truncate,
        .ftruncate   = afr_ftruncate,
        .setxattr    = pump_setxattr,
-        .setattr     = afr_setattr,
-       .fsetattr    = afr_fsetattr,
-       .removexattr = afr_removexattr,
+//        .setattr     = afr_setattr,
+//        .fsetattr    = afr_fsetattr,

        /* dir read */
        .opendir     = afr_opendir,

Comment 10 Amar Tumballi 2010-09-16 03:22:15 UTC
This particular bug is fixed for now, but bug 763322 is the current blocker. Will close this one and track bug 763322.