Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 763097 (GLUSTER-1365)

Summary:

[3.1.0qa5-15] Self-heal doesn't happen from 1st subvolume to others

Product:

[Community] GlusterFS

Reporter:

Anush Shetty <anush>

Component:

replicate

Assignee:

Pavan Vilas Sondur <pavan>

Status:

CLOSED CURRENTRELEASE

Severity:

high

Priority:

low

Version:

3.1-alpha

CC:

gluster-bugs, raghavendra, vijay

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Attachments:

Description	Flags
Client log	none

Description Anush Shetty 2010-08-16 04:33:01 UTC

Works perfectly on replicate-only setup

Comment 1 Anush Shetty 2010-08-16 05:26:27 UTC

On a 2x2 Distribute-Replicate setup, 10000 files were created, and Server 2 and Server 3 were brought down during the process. After the files were created, the servers were brought back up and self-heal was triggered. The files were healed only on Server 2 and were not healed on Server 3.

Before self-heal:
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew1/ | wc -l
4975
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew2/ | wc -l
440
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew3/ | wc -l
616
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew4/ | wc -l
5031

Self-heal:
root@pitta:/mnt/gluster# ls -lR > /dev/null

After self-heal: 
root@pitta:/mnt/gluster# find /mnt/exportnew1/ | wc -l
4975
root@pitta:/mnt/gluster# find /mnt/exportnew2/ | wc -l
4975
root@pitta:/mnt/gluster# find /mnt/exportnew3/ | wc -l
616
root@pitta:/mnt/gluster# find /mnt/exportnew4/ | wc -l
5031
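
The counts above can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes that exportnew1/exportnew2 form one replica pair and exportnew3/exportnew4 the other (the report does not state the pairing), and the helper names `count_entries` and `unhealed_pairs` are hypothetical.

```python
import os

def count_entries(root):
    """Count root plus everything beneath it, roughly like `find root | wc -l`."""
    total = 1  # find prints the starting directory itself
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total

def unhealed_pairs(counts, pairs):
    """Return the replica pairs whose two bricks hold different entry counts."""
    return [(a, b) for a, b in pairs if counts[a] != counts[b]]

# Entry counts reported after self-heal in this bug.
after = {"exportnew1": 4975, "exportnew2": 4975,
         "exportnew3": 616, "exportnew4": 5031}

# ASSUMPTION: this pairing is inferred from the counts, not stated in the report.
pairs = [("exportnew1", "exportnew2"), ("exportnew3", "exportnew4")]

print(unhealed_pairs(after, pairs))  # [('exportnew3', 'exportnew4')]
```

On a live setup, `counts` could instead be built with `count_entries("/mnt/exportnew1")` and so on for each export directory. Equal counts do not prove the bricks are identical; they are only a quick indicator that entry self-heal ran.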

Comment 2 Anush Shetty 2010-08-16 07:18:34 UTC

This is an issue with self-heal when the first subvolume is down. It was reproduced by creating 10000 files on the mount point while the first subvolume was down. The subvolume was then brought back up, and `find . | xargs stat` was executed on the mount point to trigger self-heal.
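
The trigger step described above amounts to looking up every entry on the mount point. A minimal sketch of the same walk follows, assuming (as the report does) that a stat on each path is enough to trigger self-heal; the function name `trigger_lookup` is hypothetical. Unlike `find . | xargs stat`, it is not affected by whitespace in file names.

```python
import os

def trigger_lookup(mount_point):
    """lstat every entry under mount_point.

    Returns (visited, failed): how many entries were stat-ed successfully
    and how many raised an OSError.
    """
    visited = 0
    failed = 0
    for dirpath, dirnames, filenames in os.walk(mount_point):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                visited += 1
            except OSError:
                failed += 1
    return visited, failed
```

For the setup in this report, the call would be `trigger_lookup("/mnt/gluster")`.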

Comment 3 Anush Shetty 2010-08-17 04:15:22 UTC

Created attachment 287

Comment 4 Anush Shetty 2010-08-17 04:17:01 UTC

The logs show client3_1_readdirp_cbk erroring out:

[2010-08-17 12:35:05.626475] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.626591] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.626622] I [afr-common.c:699:afr_lookup_done] rep1: background  entry self-heal triggered. path: /
[2010-08-17 12:35:05.627512] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.628947] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.630899] D [afr-self-heal-entry.c:2291:afr_sh_entry_sync_prepare] rep1: self-healing directory / from subvolume client2 to 1 other
[2010-08-17 12:35:05.650995] E [client3_1-fops.c:1652:client3_1_readdirp_cbk] : error
[2010-08-17 12:35:05.651031] D [afr-self-heal-entry.c:2031:afr_sh_entry_impunge_readdir_cbk] rep1: readdir of / on subvolume client2 failed (Invalid argument)
[2010-08-17 12:35:05.706296] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.796214] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.798018] I [afr-self-heal-common.c:1520:afr_self_heal_completion_cbk] rep1: background  entry self-heal completed on /
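
To pick the failing calls out of a longer client log, the lines can be filtered by level and message. This is a rough sketch that assumes every line follows the `[timestamp] LEVEL [file:line:function] message` layout seen in the excerpt above; the names `parse_line` and `suspicious` are hypothetical, and the sample lines are copied from that excerpt.

```python
import re

# Layout assumed from the excerpt: [timestamp] LEVEL [file:line:function] message
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] (?P<msg>.*)$"
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None

def suspicious(lines):
    """Keep records logged at level E or whose message mentions a failure."""
    records = (parse_line(line) for line in lines)
    return [r for r in records
            if r and (r["level"] == "E" or "failed" in r["msg"])]

SAMPLE = [
    "[2010-08-17 12:35:05.650995] E [client3_1-fops.c:1652:client3_1_readdirp_cbk] : error",
    "[2010-08-17 12:35:05.651031] D [afr-self-heal-entry.c:2031:afr_sh_entry_impunge_readdir_cbk] rep1: readdir of / on subvolume client2 failed (Invalid argument)",
    "[2010-08-17 12:35:05.798018] I [afr-self-heal-common.c:1520:afr_self_heal_completion_cbk] rep1: background  entry self-heal completed on /",
]

for record in suspicious(SAMPLE):
    print(record["ts"], record["func"], record["msg"])
```

Matching on the word "failed" is a heuristic chosen for this excerpt, where the readdir failure is logged at debug level rather than as an error.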

Comment 5 Raghavendra G 2010-08-23 07:12:46 UTC

I ran the tests again and found that self-heal indeed does not happen when the stopped server starts running again. However, I did not find any errors in client_readdirp_cbk. Hence moving the bug to component afr.

Comment 6 Raghavendra G 2010-08-25 04:08:25 UTC

Errors in client_readdirp_cbk are addressed in bug 763162.

Comment 7 Vijay Bellur 2010-08-27 06:15:28 UTC

PATCH: http://patches.gluster.com/patch/4332 in master (cluster/afr: Hold ref on the right fd)