Works perfectly on replicate-only setup
On a 2x2 Distribute-Replicate setup, 10000 files were created, and Server 2 and Server 3 were brought down during the process. After the files were created, the servers were brought back up and self-heal was triggered. The files were healed only on Server 2, not on Server 3.

Before self-heal:
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew1/ | wc -l
4975
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew2/ | wc -l
440
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew3/ | wc -l
616
root@pitta:/mnt/gluster/gluster# find /mnt/exportnew4/ | wc -l
5031

Self-heal:
root@pitta:/mnt/gluster# ls -lR > /dev/null

After self-heal:
root@pitta:/mnt/gluster# find /mnt/exportnew1/ | wc -l
4975
root@pitta:/mnt/gluster# find /mnt/exportnew2/ | wc -l
4975
root@pitta:/mnt/gluster# find /mnt/exportnew3/ | wc -l
616
root@pitta:/mnt/gluster# find /mnt/exportnew4/ | wc -l
5031
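The per-brick counts above can be compared with a small helper. This is only a sketch: the function names count_entries and compare_pair are made up for illustration, and the pairing of exportnew1 with exportnew2 and of exportnew3 with exportnew4 is an assumption inferred from the counts in this report, not something the volume configuration here confirms.

```shell
#!/bin/sh
# Count every entry (files and directories) under a brick directory,
# the same way the report does with `find <brick> | wc -l`.
count_entries() {
    find "$1" | wc -l | tr -d ' '
}

# Compare two bricks that are assumed to be replicas of each other.
# Prints both counts and returns 1 when they differ.
compare_pair() {
    a=$(count_entries "$1")
    b=$(count_entries "$2")
    if [ "$a" -eq "$b" ]; then
        echo "in sync: $1=$a $2=$b"
    else
        echo "out of sync: $1=$a $2=$b"
        return 1
    fi
}
```

For the setup in this report it could be called as `compare_pair /mnt/exportnew1 /mnt/exportnew2` and `compare_pair /mnt/exportnew3 /mnt/exportnew4`. Equal counts only suggest that entry self-heal ran; they do not prove the file contents match.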
This is an issue with self-heal when the first subvolume is down. I reproduced it by creating 10000 files on the mount point while the first subvolume was down. The subvolume was then brought back up and `find . | xargs stat` was executed on the mount point to trigger self-heal.
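A slightly more defensive form of the `find . | xargs stat` trigger is sketched below. The function name trigger_self_heal is hypothetical; the only change from the command used in this report is that file names are passed NUL-separated, so names containing spaces or newlines are not split. It assumes a find and xargs that support -print0 and -0.

```shell
#!/bin/sh
# Stat every entry under the given mount point so that each one is looked up.
# -print0 / -0 keep file names with spaces or newlines intact.
trigger_self_heal() {
    find "$1" -print0 | xargs -0 stat > /dev/null
}
```

Usage would be `trigger_self_heal /mnt/gluster`, with the mount point taken from the shell prompts in this report.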
Created attachment 287
The logs show client3_1_readdirp_cbk erroring out:

[2010-08-17 12:35:05.626475] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.626591] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.626622] I [afr-common.c:699:afr_lookup_done] rep1: background entry self-heal triggered. path: /
[2010-08-17 12:35:05.627512] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.628947] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.630899] D [afr-self-heal-entry.c:2291:afr_sh_entry_sync_prepare] rep1: self-healing directory / from subvolume client2 to 1 other
[2010-08-17 12:35:05.650995] E [client3_1-fops.c:1652:client3_1_readdirp_cbk] : error
[2010-08-17 12:35:05.651031] D [afr-self-heal-entry.c:2031:afr_sh_entry_impunge_readdir_cbk] rep1: readdir of / on subvolume client2 failed (Invalid argument)
[2010-08-17 12:35:05.706296] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.796214] D [afr-common.c:544:afr_lookup_collect_xattr] rep1: entry self-heal is pending for /.
[2010-08-17 12:35:05.798018] I [afr-self-heal-common.c:1520:afr_self_heal_completion_cbk] rep1: background entry self-heal completed on /
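To pull only the error-level entries out of a client log like the one above, a filter along these lines may help. It is a sketch that assumes the log has one entry per line in the `[timestamp] LEVEL [file:line:function]` layout seen in this excerpt; the function name error_lines is hypothetical and the log path is left to the caller.

```shell
#!/bin/sh
# Print only the entries whose level field is "E" (error).
# Assumes one entry per line: [timestamp] LEVEL [file:line:function] message
error_lines() {
    grep -E '^\[[^]]+\] E \[' "$1"
}
```

Run against the excerpt above, this should leave only the client3_1_readdirp_cbk entry, since it is the only one logged at level E.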
I ran the tests again and found that self-heal is indeed not happening when the stopped server starts running again. However, I did not find any errors in client_readdirp_cbk, so I am moving the bug to the afr component.
The errors in client_readdirp_cbk are addressed in bug 763162.
PATCH: http://patches.gluster.com/patch/4332 in master (cluster/afr: Hold ref on the right fd)