Bug 764373 (GLUSTER-2641) - inconsistent profile info data when a brick goes down and comes up in a distributed replicated system
Summary: inconsistent profile info data when a brick goes down and comes up in a distributed replicated system
Keywords:
Status: CLOSED WONTFIX
Alias: GLUSTER-2641
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-31 10:37 UTC by M S Vishwanath Bhat
Modified: 2016-06-01 01:55 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
profile info results taken every 4000 secs (339.09 KB, text/plain)
2011-03-31 07:38 UTC, M S Vishwanath Bhat

Description M S Vishwanath Bhat 2011-03-31 07:38:12 UTC
Created attachment 468


Attaching the overnight profile results file.

Comment 1 M S Vishwanath Bhat 2011-03-31 10:37:10 UTC
During overnight heavy I/O tests, when a brick goes down and comes back up in a distributed replicated volume, the profile info data is inconsistent.
After the brick comes back up, the first profile info run lists the data for that brick properly, but from the second run onwards the data displayed for that particular brick stays the same even though heavy I/O is going on against it.
I am attaching the profile results file and also archiving the logs.
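
For reference, a minimal sketch of the kind of sequence used to observe this (the volume name hosdu is taken from the attached logs; the brick process PID and the workload are placeholders):

# start collecting per-brick profile data while heavy I/O runs on the client mount
gluster volume profile hosdu start
# ... heavy I/O workload running on the mount point ...

# kill one brick's glusterfsd on its server, then bring the brick back
kill <pid-of-that-brick's-glusterfsd>
gluster volume start hosdu force

# sample profile info periodically after the brick is back; the first sample
# for the restarted brick looks sane, but later samples never change
gluster volume profile hosdu info > profile_1.txt
sleep 4000
gluster volume profile hosdu info > profile_2.txt
diff profile_1.txt profile_2.txt    # the restarted brick's section stays identical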

Comment 2 shishir gowda 2011-04-01 08:46:02 UTC
From the server log, we see that there are a lot of READDIRP failures once the server is back online:

[2011-03-30 12:45:50.361394] I [server3_1-fops.c:590:server_readdir_cbk] 0-hosdu-server: 23: READDIR 0 (1) ==> 0 (No such file or directory)
[2011-03-30 12:45:50.926563] I [server3_1-fops.c:590:server_readdir_cbk] 0-hosdu-server: 634: READDIR 149 (3145732) ==> 0 (No such file or directory)

From the client logs, there seem to be a lot of split-brains which have led to failures on the brick:

[2011-03-30 13:29:22.18475] I [afr-common.c:716:afr_lookup_done] 0-hosdu-replicate-1: background  meta-data data entry self-heal triggered. path: /fileop_L1_33/fileop_L1_33_L2_28/fileop_dir_33_28_19
[2011-03-30 13:29:22.18868] I [client3_1-fops.c:1300:client3_1_entrylk_cbk] 0-hosdu-client-3: remote operation failed: No such file or directory
[2011-03-30 13:29:22.19445] I [client3_1-fops.c:1225:client3_1_inodelk_cbk] 0-hosdu-client-3: remote operation failed: No such file or directory
[2011-03-30 13:29:22.20004] I [client3_1-fops.c:1300:client3_1_entrylk_cbk] 0-hosdu-client-3: remote operation failed: No such file or directory
[2011-03-30 13:29:22.20204] I [afr-self-heal-common.c:1532:afr_self_heal_completion_cbk] 0-hosdu-replicate-1: background  meta-data data entry self-heal completed on /fileop_L1_33/fileop_L1_33_L2_28/fileop_dir_33_28_19
[2011-03-30 13:29:22.29031] I [client3_1-fops.c:2127:client3_1_opendir_cbk] 0-hosdu-client-3: remote operation failed: No such file or directory
[2011-03-30 13:29:22.29190] W [client3_1-fops.c:5037:client3_1_readdir] 0-hosdu-client-3: (36144048): failed to get fd ctx. EBADFD
[2011-03-30 13:29:22.29215] W [client3_1-fops.c:5102:client3_1_readdir] 0-hosdu-client-3: failed to send the fop: File descriptor in bad state
[2011-03-30 13:29:22.29482] I [afr-dir-read.c:171:afr_examine_dir_readdir_cbk] 0-hosdu-replicate-1:  entry self-heal triggered. path: /fileop_L1_33/fileop_L1_33_L2_28/fileop_dir_33_28_19, reason: checksums of directory differ, forced merge option set
[2011-03-30 13:29:22.29856] I [client3_1-fops.c:1300:client3_1_entrylk_cbk] 0-hosdu-client-3: remote operation failed: No such file or directory
[2011-03-30 13:29:22.30586] W [afr-common.c:110:afr_set_split_brain] (-->/usr/local/lib/glusterfs/3.2.0qa5/xlator/cluster/replicate.so(afr_sh_post_nonblocking_entry_cbk+0xab) [0x7fbba92b34dc] (-->/usr/local/lib/glusterfs/3.2.0qa5/xlator/cluster/replicate.so(afr_sh_entry_done+0xf1) [0x7fbba92aa6eb] (-->/usr/local/lib/glusterfs/3.2.0qa5/xlator/cluster/replicate.so(afr_self_heal_completion_cbk+0xcc) [0x7fbba92a68ac]))) 0-hosdu-replicate-1: invalid argument: inode
[2011-03-30 13:29:22.30617] I [afr-self-heal-common.c:1532:afr_self_heal_completion_cbk] 0-hosdu-replicate-1: background  entry self-heal completed on /fileop_L1_33/fileop_L1_33_L2_28/fileop_dir_33_28_19

Because of this, the brick is not getting any I/O, as AFR sends requests only to the other brick in the replica pair.
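
One way to confirm the state described above is to inspect the AFR changelog xattrs of the affected directory directly on the two brick backends (a hedged sketch: the export path below is a placeholder, and the trusted.afr.hosdu-client-* names are assumed from the client translator names in the log):

# on each server hosting a brick of replicate-1, dump the AFR xattrs of the
# directory reported in the self-heal messages (backend path is a placeholder)
getfattr -d -m trusted.afr -e hex /export/brick/fileop_L1_33/fileop_L1_33_L2_28/fileop_dir_33_28_19

# non-zero trusted.afr.hosdu-client-* pending counters on both copies that
# accuse each other indicate the split-brain; in that state afr directs all
# further I/O to the healthy subvolume only, which is why the restarted
# brick's profile counters stop moving.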

Closing the bug.

