Bug 764373 (GLUSTER-2641) - inconsistent profile info data when a brick goes down and comes up in a distributed replicated system
Summary: inconsistent profile info data when a brick goes down and comes up in a distributed replicated system
Keywords:
Status: CLOSED WONTFIX
Alias: GLUSTER-2641
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-31 10:37 UTC by M S Vishwanath Bhat
Modified: 2016-06-01 01:55 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
profile info results taken every 4000 secs (339.09 KB, text/plain)
2011-03-31 07:38 UTC, M S Vishwanath Bhat

Description M S Vishwanath Bhat 2011-03-31 07:38:12 UTC
Created attachment 468


Attaching the overnight profile results file.

Comment 1 M S Vishwanath Bhat 2011-03-31 10:37:10 UTC
During overnight heavy I/O tests, when a brick goes down and comes back up in a distributed replicated volume, the profile info data is inconsistent.
After the brick comes back up, the first profile info run lists the data for that brick properly, but from the second run onwards the data displayed for that particular brick stays the same even though heavy I/O is going on against it.
I am attaching the profile results file and also archiving the logs.
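
For reference, a minimal sketch of the kind of sequence used to observe this (the volume name hosdu is taken from the attached logs; the brick process PID and the workload are placeholders):

# start collecting per-brick profile data while heavy I/O runs on the client mount
gluster volume profile hosdu start
# ... heavy I/O workload running on the mount point ...

# kill one brick's glusterfsd on its server, then bring the brick back
kill <pid-of-that-brick's-glusterfsd>
gluster volume start hosdu force

# sample profile info periodically after the brick is back; the first sample
# for the restarted brick looks sane, but later samples never change
gluster volume profile hosdu info > profile_1.txt
sleep 4000
gluster volume profile hosdu info > profile_2.txt
diff profile_1.txt profile_2.txt    # the restarted brick's section stays identical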

Comment 2 shishir gowda 2011-04-01 08:46:02 UTC
From the server log, we see that there are a lot of READDIRP failures once the server is back online:

[2011-03-30 12:45:50.361394] I [server3_1-fops.c:590:server_readdir_cbk] 0-hosdu-server: 23: READDIR 0 (1) ==> 0 (No such file or directory)
[2011-03-30 12:45:50.926563] I [server3_1-fops.c:590:server_readdir_cbk] 0-hosdu-server: 634: READDIR 149 (3145732) ==> 0 (No such file or directory)

From the client logs, there seem to be a lot of split-brains which have led to failures on the brick:

[2011-03-30 13:29:22.18475] I [afr-common.c:716:afr_lookup_done] 0-hosdu-replicate-1: background  meta-data data entry self-heal triggered. path: /fileop_L1_33/fileop_L1_33_L2_28/fileop_dir_33_28_19
[2011-03-30 13:29:22.18868] I [client3_1-fops.c:1300:client3_1_entrylk_cbk] 0-hosdu-client-3: remote operation failed: No such file or directory
[2011-03-30 13:29:22.19445] I [client3_1-fops.c:1225:client3_1_inodelk_cbk] 0-hosdu-client-3: remote operation failed: No such file or directory
[2011-03-30 13:29:22.20004] I [client3_1-fops.c:1300:client3_1_entrylk_cbk] 0-hosdu-client-3: remote operation failed: No such file or directory
[2011-03-30 13:29:22.20204] I [afr-self-heal-common.c:1532:afr_self_heal_completion_cbk] 0-hosdu-replicate-1: background  meta-data data entry self-heal completed on /fileop_L1_33/fileop_L1_33_L2_28/fileop_dir_33_28_19
[2011-03-30 13:29:22.29031] I [client3_1-fops.c:2127:client3_1_opendir_cbk] 0-hosdu-client-3: remote operation failed: No such file or directory
[2011-03-30 13:29:22.29190] W [client3_1-fops.c:5037:client3_1_readdir] 0-hosdu-client-3: (36144048): failed to get fd ctx. EBADFD
[2011-03-30 13:29:22.29215] W [client3_1-fops.c:5102:client3_1_readdir] 0-hosdu-client-3: failed to send the fop: File descriptor in bad state
[2011-03-30 13:29:22.29482] I [afr-dir-read.c:171:afr_examine_dir_readdir_cbk] 0-hosdu-replicate-1:  entry self-heal triggered. path: /fileop_L1_33/fileop_L1_33_L2_28/fileop_dir_33_28_19, reason: checksums of directory differ, forced merge option set
[2011-03-30 13:29:22.29856] I [client3_1-fops.c:1300:client3_1_entrylk_cbk] 0-hosdu-client-3: remote operation failed: No such file or directory
[2011-03-30 13:29:22.30586] W [afr-common.c:110:afr_set_split_brain] (-->/usr/local/lib/glusterfs/3.2.0qa5/xlator/cluster/replicate.so(afr_sh_post_nonblocking_entry_cbk+0xab) [0x7fbba92b34dc] (-->/usr/local/lib/glusterfs/3.2.0qa5/xlator/cluster/replicate.so(afr_sh_entry_done+0xf1) [0x7fbba92aa6eb] (-->/usr/local/lib/glusterfs/3.2.0qa5/xlator/cluster/replicate.so(afr_self_heal_completion_cbk+0xcc) [0x7fbba92a68ac]))) 0-hosdu-replicate-1: invalid argument: inode
[2011-03-30 13:29:22.30617] I [afr-self-heal-common.c:1532:afr_self_heal_completion_cbk] 0-hosdu-replicate-1: background  entry self-heal completed on /fileop_L1_33/fileop_L1_33_L2_28/fileop_dir_33_28_19

Because of this, the brick is not getting any I/O, as AFR sends requests only to the other brick in the replica pair.
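
One way to confirm the state described above is to inspect the AFR changelog xattrs of the affected directory directly on the two brick backends (a hedged sketch: the export path below is a placeholder, and the trusted.afr.hosdu-client-* names are assumed from the client translator names in the log):

# on each server hosting a brick of replicate-1, dump the AFR xattrs of the
# directory reported in the self-heal messages (backend path is a placeholder)
getfattr -d -m trusted.afr -e hex /export/brick/fileop_L1_33/fileop_L1_33_L2_28/fileop_dir_33_28_19

# non-zero trusted.afr.hosdu-client-* pending counters on both copies that
# accuse each other indicate the split-brain; in that state afr directs all
# further I/O to the healthy subvolume only, which is why the restarted
# brick's profile counters stop moving.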

Closing the bug.

