Description of problem:
=======================
In a scenario where files were created and synced to the slave before shard was enabled on the master and slave volumes, any subsequent modification (append/remove) of those files on the master does not sync to the slave. In other words, files which were synced before shard was enabled are never synced again.

=> new_file, which was synced to the slave and then appended on the master, is not updated on the slave:

[root@dj master]# ls -l
total 14150
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
-rw-r--r--. 1 root root  237506 May  1  2016 file
-rw-r--r--. 1 root root 7125831 May  1  2016 new_file
[root@dj master]#
[root@dj master]# du -sh *
6.8M    after_shard
232K    file
6.8M    new_file
[root@dj master]#

[root@dj slave]# ls -l
total 9511
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
-rw-r--r--. 1 root root  237506 May  1  2016 file
-rw-r--r--. 1 root root 2375271 May  1  2016 new_file
[root@dj slave]# du -sh *
6.8M    after_shard
232K    file
2.3M    new_file
[root@dj slave]#

=> Even the removal of the files file and new_file, which were synced before shard was enabled, is not propagated; they remain on the slave:

[root@dj master]# ls -l
total 6959
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
[root@dj master]# du -sh *
6.8M    after_shard
[root@dj master]#

[root@dj slave]# ls -l
total 9511
-rw-r--r--. 1 root root 7125831 May  1  2016 after_shard
-rw-r--r--. 1 root root  237506 May  1  2016 file
-rw-r--r--. 1 root root 2375271 May  1  2016 new_file
[root@dj slave]# du -sh *
6.8M    after_shard
232K    file
2.3M    new_file
[root@dj slave]#

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.7.9-2.el7rhgs.x86_64

How reproducible:
=================
2/2

Steps to Reproduce:
===================
1. Create master and slave clusters and volumes (6x2).
2. Write 2 files, both less than 4M (215K and 2.8M): file and new_file.
3. Create a geo-rep session; both files get synced to the slave successfully.
4. Enable shard on master and slave.
5. Append to new_file (2.8M) on the master so that it exceeds 4M. Now the size is 6.8M.
6. This file never gets synced to the slave (the appended data does not sync).
7. cp new_file after_shard
8. The after_shard file gets synced to the slave.

Actual results:
===============
Conclusion: Files which are synced before shard is enabled are never synced again.

Expected results:
=================
Files should have the latest modifications on the slave too.
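The steps above can be sketched as a command sequence. This is only an illustrative setup fragment, not something runnable outside a configured cluster: the volume names (master/slave), the slave host name (slavehost), and the mount point /mnt/master are assumptions, and the geo-rep session creation options may differ in a real deployment.

```shell
# Assumed setup: volumes 'master' and 'slave' (6x2) already created and
# started, with the master volume fuse-mounted at /mnt/master.

# 2. Write two files smaller than the default 4M shard block size
dd if=/dev/urandom of=/mnt/master/file     bs=1K count=215
dd if=/dev/urandom of=/mnt/master/new_file bs=1M count=3

# 3. Create and start the geo-rep session; both files sync to the slave
gluster volume geo-replication master slavehost::slave create push-pem
gluster volume geo-replication master slavehost::slave start

# 4. Enable shard on both sides only after the initial sync completed
gluster volume set master features.shard enable
gluster volume set slave  features.shard enable

# 5. Append ~4M to new_file on the master so it crosses the shard boundary;
#    with the bug present, the appended data never reaches the slave.
dd if=/dev/urandom of=/mnt/master/new_file bs=1M count=4 \
   oflag=append conv=notrunc
```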
Moving this bug out of 3.1.3 since it is not always reproducible and is not applicable to the Hyperconvergence use case.
Hit this issue again with the build: glusterfs-3.7.9-4

Master:
=======
[root@dj master]# du -sh *
279K    file
8.2M    new_file
[root@dj master]# pwd
/mnt/master
[root@dj master]#

Slave:
======
[root@dj slave]# du -sh *
279K    file
2.8M    new_file
[root@dj slave]# pwd
/mnt/slave
[root@dj slave]#

Shard feature enabled on both master and slave:
===============================================
Master:
+++++++
[root@dhcp37-182 scripts]# gluster volume info po | grep shard
features.shard: enable
[root@dhcp37-182 scripts]#

Slave:
+++++++
[root@dhcp37-122 scripts]# gluster volume info shifu | grep shard
features.shard: enable
[root@dhcp37-122 scripts]#
RCA Update:

I have verified the following things:

1. The issue is not related to quota and USS, which were enabled on the volume.
2. The changelog records the DATA entry for the problematic file.
3. Geo-replication picks up the changelog and processes it. But before syncing, geo-rep does an 'lstat' on .gfid/<gfid> on the master volume to check for the presence of the file. The lstat on .gfid/<gfid> is failing on the master, and hence the data sync is being missed. lstat is failing because the lookup of the '.gfid' virtual directory fails with ESTALE. We need to debug further why the lookup fails the first time after enabling sharding on the master.

Following are the lookup errors:

[2016-05-20 12:01:32.538645] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-00000000000d: failed to resolve (Stale file handle)
[2016-05-20 12:01:32.538663] E [fuse-bridge.c:564:fuse_lookup_resume] 0-fuse: failed to resolve path (null)
[2016-05-20 12:01:32.539294] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-00000000000d: failed to resolve (Stale file handle)
[2016-05-20 12:01:32.539319] E [fuse-bridge.c:564:fuse_lookup_resume] 0-fuse: failed to resolve path (null)
[2016-05-20 12:01:32.583722] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 0-fuse: 00000000-0000-0000-0000-00000000000d: failed to resolve (Stale file handle)
[2016-05-20 12:01:32.583761] E [fuse-bridge.c:564:fuse_lookup_resume] 0-fuse: failed to resolve path (null)
[2016-05-20 12:00:24.235843] I [MSGID: 101173] [graph.c:269:gf_add_cmdline_options] 0-master-md-cache: adding option 'cache-posix-acl' for volume 'master-md-cache' with value 'true'
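The skip behaviour described above can be sketched as follows. This is a simplified stand-in, not geo-rep's actual code: the function name check_and_sync is invented for illustration, and the ESTALE failure on '.gfid/<gfid>' is simulated with a plain missing path (any lstat failure leads to the same outcome: the entry is skipped and the data sync is missed).

```shell
#!/bin/sh
# Illustration: geo-rep only syncs an entry whose presence check (an
# lstat on .gfid/<gfid> on the master mount) succeeds.
check_and_sync() {
    path="$1"
    if stat "$path" >/dev/null 2>&1; then   # presence check succeeded
        echo "sync: $path"
    else
        echo "skip: $path"                  # lstat failed -> sync is missed
    fi
}

f=$(mktemp)
check_and_sync "$f"      # path exists -> "sync: ..."
rm -f "$f"
check_and_sync "$f"      # lstat fails  -> "skip: ..." (like the ESTALE case)
```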
Upstream patch: http://review.gluster.org/14773 (master)
Upstream mainline: http://review.gluster.org/14773
Upstream 3.8: http://review.gluster.org/14776

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
Verified with the build: glusterfs-geo-replication-3.8.4-13.el7rhgs.x86_64

After enabling shard, the files get properly synced and removed. Moving this bug to verified state.

Master:
=======
[root@dj master]# ls -l
total 1960
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 1783293 Feb  7  2017 new_file
[root@dj master]# cat /root/files/new_file >> new_file
[root@dj master]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj master]#

Slave:
======
[root@dj slave]# ls -l
total 1960
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 1783293 Feb  7  2017 new_file
[root@dj slave]#
[root@dj slave]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]#

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Master:
=======
[root@dj master]# cp new_file after_shard
[root@dj master]# ls -l
total 10961
-rw-r--r--. 1 root root 5499686 Feb  7  2017 after_shard
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj master]#

Slave:
======
[root@dj slave]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]#
[root@dj slave]# ls -l
total 10961
-rw-r--r--. 1 root root 5499686 Feb  7  2017 after_shard
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]#

++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Master:
=======
[root@dj master]# rm -rf after_shard
[root@dj master]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj master]# rm -rf new_file
[root@dj master]# ls
file1
[root@dj master]# ls -l
total 219
-rw-r--r--. 1 root root 223741 Feb  7  2017 file1
[root@dj master]#

Slave:
======
[root@dj slave]# ls -l
total 5590
-rw-r--r--. 1 root root  223741 Feb  7  2017 file1
-rw-r--r--. 1 root root 5499686 Feb  7  2017 new_file
[root@dj slave]#
[root@dj slave]#
[root@dj slave]# ls -l
total 219
-rw-r--r--. 1 root root 223741 Feb  7  2017 file1
[root@dj slave]#
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html