Bug 763097 (GLUSTER-1365) - [3.1.0qa5-15] Self-heal doesn't happen from 1st subvolume to others
Summary: [3.1.0qa5-15] Self-heal doesn't happen from 1st subvolume to others
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-1365
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.1-alpha
Hardware: All
OS: Linux
low
high
Target Milestone: ---
Assignee: Pavan Vilas Sondur
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-08-16 05:26 UTC by Anush Shetty
Modified: 2015-12-01 16:45 UTC
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
Client log (154.20 KB, application/x-bzip)
2010-08-17 04:15 UTC, Anush Shetty

Description Anush Shetty 2010-08-16 04:33:01 UTC
Works correctly on a replicate-only setup.

Comment 1 Anush Shetty 2010-08-16 05:26:27 UTC
On a 2x2 distribute-replicate setup, 10000 files were created, and Server 2 and Server 3 were brought down while the files were being created. After the files were created, the servers were brought back up and self-heal was triggered. The files were healed on Server 2 but not on Server 3.

Before self-heal:
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew1/ | wc -l
4975
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew2/ | wc -l
440
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew3/ | wc -l
616
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew4/ | wc -l
5031

Self-heal triggered with:
root@pitta:/mnt/gluster# ls -lR > /dev/null

After self-heal: 
root@pitta:/mnt/gluster# find /mnt/exportnew1/ | wc -l
4975
root@pitta:/mnt/gluster# find /mnt/exportnew2/ | wc -l
4975
root@pitta:/mnt/gluster# find /mnt/exportnew3/ | wc -l
616
root@pitta:/mnt/gluster# find /mnt/exportnew4/ | wc -l
5031
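The per-brick counts above can be compared within each replica pair with a small helper. This is a minimal sketch, assuming (as the numbers suggest) that exportnew1/exportnew2 form one replica pair and exportnew3/exportnew4 the other; the pairing, the function name `check_pair`, and its output format are illustrative assumptions, not part of the original report. Equal counts are only a rough signal, not proof that the bricks' contents are in sync.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): compare the number of entries
# that `find` lists on two bricks assumed to be replicas of each other.
# Returns 0 when the counts match, 1 otherwise.
check_pair() {
    a_count=$(find "$1" | wc -l)
    b_count=$(find "$2" | wc -l)
    if [ "$a_count" -eq "$b_count" ]; then
        echo "in sync: $1 ($a_count) == $2 ($b_count)"
        return 0
    else
        echo "MISMATCH: $1 ($a_count) != $2 ($b_count)"
        return 1
    fi
}
```

For example, `check_pair /mnt/exportnew1 /mnt/exportnew2` and `check_pair /mnt/exportnew3 /mnt/exportnew4`: with the post-heal counts above, the first pair would match (4975 vs 4975) and the second would not (616 vs 5031), which is the failure this bug describes.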

Comment 2 Anush Shetty 2010-08-16 07:18:34 UTC
The issue occurs in self-heal when the first subvolume has been down. Reproduced by creating 10000 files on the mount point while the first subvolume was down, then bringing it back up and running `find . | xargs stat` on the mount point to trigger self-heal.
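The reproduction steps can be sketched as two small shell functions. This is a hypothetical helper, not the reporter's script: the function names and the `file$i` naming are assumptions, and taking the first subvolume down and bringing it back up is left as a manual step because the report does not say how that was done. It uses `find -print0 | xargs -0` instead of the plain `find . | xargs stat` from the comment so that filenames containing spaces do not break the pipeline.

```shell
#!/bin/sh
# Hypothetical reproduction helpers; names and file naming are assumptions.

# Create N empty files in the given directory (the report used 10000).
create_files() {
    mount_dir=$1
    n=$2
    i=1
    while [ "$i" -le "$n" ]; do
        : > "$mount_dir/file$i"
        i=$((i + 1))
    done
}

# Stat every entry under the given directory. On a GlusterFS mount each
# stat causes a lookup, and the lookup path is where AFR decides to start
# a background self-heal (see afr_lookup_done in the log below).
trigger_self_heal() {
    (cd "$1" && find . -print0 | xargs -0 stat > /dev/null)
}
```

For example: run `create_files /mnt/gluster 10000` while the first subvolume is down, bring that subvolume back up, then run `trigger_self_heal /mnt/gluster` and compare the brick counts.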

Comment 3 Anush Shetty 2010-08-17 04:15:22 UTC
Created attachment 287

Comment 4 Anush Shetty 2010-08-17 04:17:01 UTC
The logs show client3_1_readdirp_cbk erroring out:

[2010-08-17 12:35:05.626475] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.626591] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.626622] I [afr-common.c:699:afr_lookup_done] rep1: background  entry self-heal triggered. path: /
[2010-08-17 12:35:05.627512] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.628947] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.630899] D [afr-self-heal-entry.c:2291:afr_sh_entry_sync_prepare] rep1: self-healing directory / from subvolume client2 to 1 other
[2010-08-17 12:35:05.650995] E [client3_1-fops.c:1652:client3_1_readdirp_cbk] : error
[2010-08-17 12:35:05.651031] D [afr-self-heal-entry.c:2031:afr_sh_entry_impunge_readdir_cbk] rep1: readdir of / on subvolume client2 failed (Invalid argument)
[2010-08-17 12:35:05.706296] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.796214] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.798018] I [afr-self-heal-common.c:1520:afr_self_heal_completion_cbk] rep1: background  entry self-heal completed on /
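In the excerpt above, the readdir of / on client2 fails, yet the entry self-heal is still logged as "completed", so the completion message alone does not show that anything was healed. A minimal sketch for pulling the relevant lines out of a client log, assuming the message texts seen in this excerpt; the function name `selfheal_events` and the grep patterns are illustrative assumptions, not a stable GlusterFS log interface.

```shell
#!/bin/sh
# Hypothetical helper: print self-heal start/finish lines and readdir
# failures from a GlusterFS client log. Patterns are based on the log
# excerpt in this bug and may not match other GlusterFS versions.
selfheal_events() {
    grep -E 'self-heal (triggered|completed)|self-healing directory|readdirp?_cbk|readdir of .* failed' "$1"
}
```

Reading the output in order makes it easy to spot a "failed" or "error" line sitting between "triggered" and "completed" for the same path.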

Comment 5 Raghavendra G 2010-08-23 07:12:46 UTC
I ran the tests again and found that self-heal indeed does not happen when the stopped server starts running again. However, I did not find any errors in client_readdirp_cbk, so I am moving the bug to the afr component.

Comment 6 Raghavendra G 2010-08-25 04:08:25 UTC
errors in client_readdirp_cbk are addressed in bug 763162.

Comment 7 Vijay Bellur 2010-08-27 06:15:28 UTC
PATCH: http://patches.gluster.com/patch/4332 in master (cluster/afr: Hold ref on the right fd)

