Bug 1332080

Summary: [geo-rep+shard]: Files which were synced to slave before enabling shard doesn't get sync/remove upon modification
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: geo-replication
Assignee: Kotresh HR <khiremat>
Status: CLOSED ERRATA
QA Contact: Rahul Hinduja <rhinduja>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: amukherj, avishwan, chrisw, csaba, khiremat, nlevinki, rcyriac
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.8.4-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-03-23 05:29:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1344908
Bug Blocks: 1351522

Description Rahul Hinduja 2016-05-02 07:01:55 UTC
Description of problem:
=======================

In a scenario where files were created on the master before shard was enabled on the master and slave volumes, and were already synced to the slave: after shard is enabled, any modification/append/removal of those files on the master is not synced to the slave.

i.e., files which were synced before shard was enabled never get synced again.


=> new_file, which was synced to the slave and then appended on the master, is not reflected on the slave.


[root@dj master]# ls -l
total 14150
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
-rw-r--r--. 1 root root  237506 May  1  2016 file
-rw-r--r--. 1 root root 7125831 May  1  2016 new_file
[root@dj master]# 
[root@dj master]# du -sh *
6.8M    after_shard
232K    file
6.8M    new_file
[root@dj master]# 


[root@dj slave]# ls -l
total 9511
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
-rw-r--r--. 1 root root  237506 May  1  2016 file
-rw-r--r--. 1 root root 2375271 May  1  2016 new_file
[root@dj slave]# du -sh *
6.8M    after_shard
232K    file
2.3M    new_file
[root@dj slave]# 

=> Even removal of the files 'file' and 'new_file', which were synced before shard was enabled, is not propagated to the slave; they remain there.

[root@dj master]# ls -l
total 6959
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
[root@dj master]# du -sh *
6.8M    after_shard
[root@dj master]# 


[root@dj slave]# ls -l
total 9511
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
-rw-r--r--. 1 root root  237506 May  1  2016 file
-rw-r--r--. 1 root root 2375271 May  1  2016 new_file
[root@dj slave]# du -sh *
6.8M    after_shard
232K    file
2.3M    new_file
[root@dj slave]# 

Version-Release number of selected component (if applicable):
==============================================================

glusterfs-3.7.9-2.el7rhgs.x86_64


How reproducible:
=================

2/2


Steps to Reproduce:
===================

1. Create master and slave clusters and volumes (6x2).
2. Write 2 files, both less than 4M ('file' at 215K and 'new_file' at 2.8M).
3. Create a geo-rep session; both files sync to the slave successfully.
4. Enable shard on the master and slave volumes (a command sketch follows these steps).
5. Append to new_file (2.8M) on the master so that it exceeds 4M; its size is now 6.8M.
6. The appended data never gets synced to the slave.
7. cp new_file after_shard
8. The after_shard file gets synced to the slave.
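
Below is a minimal command sketch of the key steps, assuming illustrative host and volume names (master-node/master-vol, slave-node/slave-vol) rather than the ones used in this setup; the shard block size is left at its default:

# On the master cluster:
gluster volume set master-vol features.shard enable
# On the slave cluster:
gluster volume set slave-vol features.shard enable
# From a client, FUSE-mount the master volume and grow the already-synced file past 4M:
mount -t glusterfs master-node:/master-vol /mnt/master
cat /root/files/new_file >> /mnt/master/new_file
# After a sync interval, compare sizes on the master and slave mounts:
du -sh /mnt/master/new_file /mnt/slave/new_file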

Actual results:
===============

Conclusion: files which were synced before shard was enabled never get synced again.


Expected results:
=================

Files should reflect the latest modifications on the slave too.

Comment 4 Aravinda VK 2016-05-10 12:16:00 UTC
Moving this bug out of 3.1.3 since it is not always reproducible and it is not applicable to the hyperconvergence use case.

Comment 5 Rahul Hinduja 2016-05-16 17:48:29 UTC
Hit this issue again with the build: glusterfs-3.7.9-4

Master:
=======
[root@dj master]# du -sh *
279K	file
8.2M	new_file
[root@dj master]# pwd
/mnt/master
[root@dj master]# 

Slave:
======
[root@dj slave]# du -sh *
279K	file
2.8M	new_file
[root@dj slave]# pwd
/mnt/slave
[root@dj slave]# 

Shard feature enabled on both master and slave:
===============================================

Master:
+++++++

[root@dhcp37-182 scripts]# gluster volume info po  | grep shard
features.shard: enable
[root@dhcp37-182 scripts]# 

Slave:
+++++++

[root@dhcp37-122 scripts]# gluster volume info shifu | grep shard
features.shard: enable
[root@dhcp37-122 scripts]#

Comment 6 Kotresh HR 2016-05-20 12:13:22 UTC
RCA Update:
I have verified the following things.

1. The issue is not related to quota or USS, which were enabled on the volume.

2. The changelog records the DATA entry for the problematic file.

3. Geo-replication picks up the changelog and processes it.

But before syncing, geo-rep does an 'lstat' on .gfid/<gfid> on the master volume to check for the presence of the file. This lstat is failing on the master, so the data sync is missed. The lstat fails because the lookup on the '.gfid' virtual directory fails with ESTALE. We still need to debug why this lookup fails the first time after sharding is enabled on the master.

Following are lookup errors:
[2016-05-20 12:01:32.538645] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-00000000000d: failed to resolve (Stale file handle)
[2016-05-20 12:01:32.538663] E [fuse-bridge.c:564:fuse_lookup_resume] 0-fuse: failed to resolve path (null)
[2016-05-20 12:01:32.539294] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-00000000000d: failed to resolve (Stale file handle)
[2016-05-20 12:01:32.539319] E [fuse-bridge.c:564:fuse_lookup_resume] 0-fuse: failed to resolve path (null)
[2016-05-20 12:01:32.583722] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-00000000000d: failed to resolve (Stale file handle)
[2016-05-20 12:01:32.583761] E [fuse-bridge.c:564:fuse_lookup_resume] 0-fuse: failed to resolve path (null)
[2016-05-20 12:00:24.235843] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-master-md-cache: adding option 'cache-posix-acl' for volume 'master-md-cache' with value 'true'
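
A sketch of how the failing pre-sync check can be reproduced by hand, assuming an auxiliary gfid mount at /mnt/gfid-master and the same illustrative volume name as above (both are assumptions, not taken from this setup):

# Mount the master volume with gfid access enabled:
mount -t glusterfs -o aux-gfid-mount master-node:/master-vol /mnt/gfid-master
# Read the gfid of the problematic file from a regular FUSE mount:
getfattr -n glusterfs.gfid.string /mnt/master/new_file
# Repeat geo-rep's presence check; in this bug it fails with ESTALE:
stat /mnt/gfid-master/.gfid/<gfid-from-previous-command>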

Comment 7 Kotresh HR 2016-06-22 09:47:45 UTC
Upstream Patch
http://review.gluster.org/14773 (master)

Comment 9 Atin Mukherjee 2016-09-17 14:51:25 UTC
Upstream mainline : http://review.gluster.org/14773
Upstream 3.8 : http://review.gluster.org/14776

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.

Comment 12 Rahul Hinduja 2017-02-07 09:54:46 UTC
Verified with the build: glusterfs-geo-replication-3.8.4-13.el7rhgs.x86_64

After enabling shard, the files get properly synced or removed. Moving this bug to the verified state.

Master:
=======

[root@dj master]# ls -l
total 1960
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 1783293 Feb  7  2017 new_file
[root@dj master]# cat /root/files/new_file >> new_file 
[root@dj master]# ls -l 
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj master]#

Slave:
======

[root@dj slave]# ls -l
total 1960
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 1783293 Feb  7  2017 new_file
[root@dj slave]# 
[root@dj slave]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]#


++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Master:
=======
[root@dj master]# cp new_file after_shard
[root@dj master]# ls -l
total 10961
-rw-r--r--. 1 root root 5499686 Feb  7  2017 after_shard
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj master]# 

Slave:
======
[root@dj slave]# ls -l 
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]# 
[root@dj slave]# ls -l
total 10961
-rw-r--r--. 1 root root 5499686 Feb  7  2017 after_shard
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]# 

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Master:
=======
[root@dj master]# rm -rf after_shard 
[root@dj master]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj master]# rm -rf new_file 
[root@dj master]# ls
file1
[root@dj master]# ls -l
total 219
-rw-r--r--. 1 root root 223741 Feb  7  2017 file1
[root@dj master]# 


Slave:
======
[root@dj slave]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]# 
[root@dj slave]# 
[root@dj slave]# ls -l
total 219
-rw-r--r--. 1 root root 223741 Feb  7  2017 file1
[root@dj slave]#

Comment 14 errata-xmlrpc 2017-03-23 05:29:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html