Description of problem:
=======================
In a scenario where files were created and synced to the slave before shard was enabled on the master and slave volumes, any subsequent modification (append/remove) of those files on the master does not sync to the slave. In other words, files which were synced before shard was enabled are never synced again.

=> new_file, which was synced to the slave and then appended on the master, is not updated on the slave:

[root@dj master]# ls -l
total 14150
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
-rw-r--r--. 1 root root  237506 May  1  2016 file
-rw-r--r--. 1 root root 7125831 May  1  2016 new_file
[root@dj master]#
[root@dj master]# du -sh *
6.8M    after_shard
232K    file
6.8M    new_file
[root@dj master]#

[root@dj slave]# ls -l
total 9511
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
-rw-r--r--. 1 root root  237506 May  1  2016 file
-rw-r--r--. 1 root root 2375271 May  1  2016 new_file
[root@dj slave]# du -sh *
6.8M    after_shard
232K    file
2.3M    new_file
[root@dj slave]#

=> Even the removal of the files file and new_file, which were synced before shard was enabled, is not propagated; they remain on the slave:

[root@dj master]# ls -l
total 6959
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
[root@dj master]# du -sh *
6.8M    after_shard
[root@dj master]#

[root@dj slave]# ls -l
total 9511
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
-rw-r--r--. 1 root root  237506 May  1  2016 file
-rw-r--r--. 1 root root 2375271 May  1  2016 new_file
[root@dj slave]# du -sh *
6.8M    after_shard
232K    file
2.3M    new_file
[root@dj slave]#

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.7.9-2.el7rhgs.x86_64

How reproducible:
=================
2/2

Steps to Reproduce:
===================
1. Create master and slave clusters and volumes (6x2).
2. Write 2 files, both less than 4M (215K and 2.8M): file and new_file.
3. Create a geo-rep session; both files get synced to the slave successfully.
4. Enable shard on master and slave.
5. Append to new_file (2.8M) on the master so that it exceeds 4M. Now the size is 6.8M.
6. This file never gets synced to the slave (the appended data does not sync).
7. cp new_file after_shard
8. The after_shard file gets synced to the slave.

Actual results:
===============
Conclusion: Files which are synced before shard is enabled are never synced again.

Expected results:
=================
Files should have the latest modifications on the slave too.
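The steps above can be sketched as a command sequence. This is only an illustrative setup fragment, not something runnable outside a configured cluster: the volume names (master/slave), the slave host name (slavehost), and the mount point /mnt/master are assumptions, and the geo-rep session creation options may differ in a real deployment.

```shell
# Assumed setup: volumes 'master' and 'slave' (6x2) already created and
# started, with the master volume fuse-mounted at /mnt/master.

# 2. Write two files smaller than the default 4M shard block size
dd if=/dev/urandom of=/mnt/master/file     bs=1K count=215
dd if=/dev/urandom of=/mnt/master/new_file bs=1M count=3

# 3. Create and start the geo-rep session; both files sync to the slave
gluster volume geo-replication master slavehost::slave create push-pem
gluster volume geo-replication master slavehost::slave start

# 4. Enable shard on both sides only after the initial sync completed
gluster volume set master features.shard enable
gluster volume set slave  features.shard enable

# 5. Append ~4M to new_file on the master so it crosses the shard boundary;
#    with the bug present, the appended data never reaches the slave.
dd if=/dev/urandom of=/mnt/master/new_file bs=1M count=4 \
   oflag=append conv=notrunc
```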
Moving this bug out of 3.1.3 since it is not always reproducible and is not applicable to the Hyperconvergence use case.
Hit this issue again with the build: glusterfs-3.7.9-4

Master:
=======
[root@dj master]# du -sh *
279K    file
8.2M    new_file
[root@dj master]# pwd
/mnt/master
[root@dj master]#

Slave:
======
[root@dj slave]# du -sh *
279K    file
2.8M    new_file
[root@dj slave]# pwd
/mnt/slave
[root@dj slave]#

Shard feature enabled on both master and slave:
===============================================
Master:
+++++++
[root@dhcp37-182 scripts]# gluster volume info po | grep shard
features.shard: enable
[root@dhcp37-182 scripts]#

Slave:
+++++++
[root@dhcp37-122 scripts]# gluster volume info shifu | grep shard
features.shard: enable
[root@dhcp37-122 scripts]#
RCA Update:

I have verified the following things:

1. The issue is not related to quota and USS, which were enabled on the volume.
2. The changelog records the DATA entry for the problematic file.
3. Geo-replication picks up the changelog and processes it. But before syncing, geo-rep does an 'lstat' on .gfid/<gfid> on the master volume to check for the presence of the file. The lstat on .gfid/<gfid> is failing on the master, and hence the data sync is being missed. lstat is failing because the lookup of the '.gfid' virtual directory fails with ESTALE. We need to debug further why the lookup fails the first time after enabling sharding on the master.

Following are the lookup errors:

[2016-05-20 12:01:32.538645] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-00000000000d: failed to resolve (Stale file handle)
[2016-05-20 12:01:32.538663] E [fuse-bridge.c:564:fuse_lookup_resume] 0-fuse: failed to resolve path (null)
[2016-05-20 12:01:32.539294] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-00000000000d: failed to resolve (Stale file handle)
[2016-05-20 12:01:32.539319] E [fuse-bridge.c:564:fuse_lookup_resume] 0-fuse: failed to resolve path (null)
[2016-05-20 12:01:32.583722] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-00000000000d: failed to resolve (Stale file handle)
[2016-05-20 12:01:32.583761] E [fuse-bridge.c:564:fuse_lookup_resume] 0-fuse: failed to resolve path (null)
[2016-05-20 12:00:24.235843] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-master-md-cache: adding option 'cache-posix-acl' for volume 'master-md-cache' with value 'true'
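The skip behaviour described above can be sketched as follows. This is a simplified stand-in, not geo-rep's actual code: the function name check_and_sync is invented for illustration, and the ESTALE failure on '.gfid/<gfid>' is simulated with a plain missing path (any lstat failure leads to the same outcome: the entry is skipped and the data sync is missed).

```shell
#!/bin/sh
# Illustration: geo-rep only syncs an entry whose presence check (an
# lstat on .gfid/<gfid> on the master mount) succeeds.
check_and_sync() {
    path="$1"
    if stat "$path" >/dev/null 2>&1; then   # presence check succeeded
        echo "sync: $path"
    else
        echo "skip: $path"                  # lstat failed -> sync is missed
    fi
}

f=$(mktemp)
check_and_sync "$f"      # path exists -> "sync: ..."
rm -f "$f"
check_and_sync "$f"      # lstat fails  -> "skip: ..." (like the ESTALE case)
```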
Upstream patch: http://review.gluster.org/14773 (master)
Upstream mainline: http://review.gluster.org/14773
Upstream 3.8: http://review.gluster.org/14776

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
Verified with the build: glusterfs-geo-replication-3.8.4-13.el7rhgs.x86_64

After enabling shard, the files get properly synced and removed. Moving this bug to verified state.

Master:
=======
[root@dj master]# ls -l
total 1960
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 1783293 Feb  7  2017 new_file
[root@dj master]# cat /root/files/new_file >> new_file
[root@dj master]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj master]#

Slave:
======
[root@dj slave]# ls -l
total 1960
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 1783293 Feb  7  2017 new_file
[root@dj slave]#
[root@dj slave]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]#

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Master:
=======
[root@dj master]# cp new_file after_shard
[root@dj master]# ls -l
total 10961
-rw-r--r--. 1 root root 5499686 Feb  7  2017 after_shard
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj master]#

Slave:
======
[root@dj slave]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]#
[root@dj slave]# ls -l
total 10961
-rw-r--r--. 1 root root 5499686 Feb  7  2017 after_shard
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]#

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Master:
=======
[root@dj master]# rm -rf after_shard
[root@dj master]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj master]# rm -rf new_file
[root@dj master]# ls
file1
[root@dj master]# ls -l
total 219
-rw-r--r--. 1 root root 223741 Feb  7  2017 file1
[root@dj master]#

Slave:
======
[root@dj slave]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]#
[root@dj slave]#
[root@dj slave]# ls -l
total 219
-rw-r--r--. 1 root root 223741 Feb  7  2017 file1
[root@dj slave]#
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html