+++ This bug was initially created as a clone of Bug #1210965 +++
+++ This bug was initially created as a clone of Bug #1210719 +++

Description of problem:
=======================
Geo-replication is very slow and is not able to sync all the files to the slave. The geo-replication state moves to faulty very frequently.

Reproduction:
1. Create a 2-node replica master volume and a slave volume.
2. Set up geo-replication from the master volume to the slave volume.
3. Put numerous files into the master volume. (I am currently testing with 52 million files of 5 KB each. The customer's case has the same file size, with 45 million files.)
4. Check the geo-replication status and the 'df -h' and 'df -i' command output.
5. Initially, the geo-replication crawl status is Changelog Crawl.
6. After a while (in my testing it took 1-2 days in our lab environment), the crawl status changed to History Crawl and the file transfer to the slave volume appears to stop.

~~~
[Sample result]

<Master side>
[root@master1 ~]# df -h
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/vg_master1-lv_root   17G  4.8G   11G  31% /
tmpfs                           4.0G     0  4.0G   0% /dev/shm
/dev/vda1                       477M   28M  424M   7% /boot
/dev/vdb                        3.0T  130G  2.9T   5% /data

[root@master1 ~]# df -i
Filesystem                        Inodes    IUsed     IFree IUse% Mounted on
/dev/mapper/vg_master1-lv_root   1093440    64936   1028504    6% /
tmpfs                            1023965        2   1023963    1% /dev/shm
/dev/vda1                         128016       38    127978    1% /boot
/dev/vdb                       322122496 15507582 306614914    5% /data

<Slave side>
[root@slave1 geo-replication-slaves]# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/vg_slave1-lv_root  139G   20G  112G  16% /
tmpfs                          4.0G     0  4.0G   0% /dev/shm
/dev/vda1                      477M   28M  424M   7% /boot
/dev/vdb                       3.0T   35G  3.0T   2% /data
localhost:*******              3.0T   35G  3.0T   2% /mnt

[root@slave1 geo-replication-slaves]# df -i
Filesystem                        Inodes   IUsed     IFree IUse% Mounted on
/dev/mapper/vg_slave1-lv_root    9224192   46456   9177736    1% /
tmpfs                            1023966       2   1023964    1% /dev/shm
/dev/vda1                         128016      38    127978    1% /boot
/dev/vdb                       322122496 5145872 316976624    2% /data
localhost:******               322122496 5145872 316976624    2% /mnt  <--- huge difference between master and slave
~~~

~~~
<Master side>
[root@master1 ~]# df -h
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/vg_master1-lv_root   17G  3.6G   12G  23% /
tmpfs                           4.0G     0  4.0G   0% /dev/shm
/dev/vda1                       477M   28M  424M   7% /boot
/dev/vdb                        3.0T   78G  3.0T   3% /data
~~~

~~~
<Slave side>
[root@slave1 geo-replication-slaves]# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/vg_slave1-lv_root  139G  7.5G  124G   6% /
tmpfs                          4.0G     0  4.0G   0% /dev/shm
/dev/vda1                      477M   28M  424M   7% /boot
/dev/vdb                       3.0T   29G  3.0T   1% /data
~~~

Status shows History Crawl.

--- Additional comment from Bipin Kunal on 2015-04-10 08:31:35 EDT ---

RCA done by Aravinda:

Hi Bipin,

Thanks for the setup. I root-caused the issue.

Changelog processing is done in batches: if, for example, 10 changelogs are available for processing, all of them are processed at once. Entry, Meta and Data operations are collected separately; all the entry operations (CREATE, MKDIR, MKNOD, LINK, UNLINK) are executed first, then rsync is triggered for the whole batch. Stime is updated only once the complete batch is done.

In this setup, a large number of changelogs is available for processing during History Changelog processing (since the batch size is automatic). The entry operations complete, but while attempting rsync the worker goes faulty for some reason and is restarted. Since stime was not updated, it has to process all the changelogs again. While processing the same changelogs again, every CREATE gets EEXIST because all the files were already created in the previous run.

To understand this better, consider three changelogs in a batch - CHANGELOG.1428375039, CHANGELOG.1428375054 and CHANGELOG.1428375069 - each with 850 creates and 850 data operations recorded. Geo-rep picks all three files for processing and collects all the entry and data operations (2550 creates and 2550 data operations). If the worker fails after processing the 2550 entries, geo-rep has to process all 2550 entries again (but gets EEXIST).

Solution: Geo-rep should limit the batch size, either by the number of changelogs or by the number of records to be processed. Once this smaller, optimal batch is processed, stime is updated. Even if the worker crashes, geo-rep then has to repeat only a small delta instead of a large set.

Workaround: Not available. :( Geo-rep has not failed; it is just taking more time to come up to speed. If the workers do not crash, it will complete the sync. If workers crash in between, it retries the same work again, which is time-consuming. :(

I will work on this as a priority and will post the patch soon.

--
regards
Aravinda
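[Editorial note] The failure mode in the RCA above is easiest to see as code. Below is a minimal, self-contained Python sketch of the pre-fix flow - it is not the actual gsyncd implementation, and the changelog model and helper names (apply_entry_ops, trigger_rsync) are hypothetical stand-ins - showing why a worker crash between the entry operations and the stime update forces the whole batch to be replayed.

~~~
# Sketch of the pre-fix flow described in the RCA; NOT the actual gsyncd code.
# The changelog model and helper functions are hypothetical stand-ins.

stime = None  # timestamp of the last fully synced changelog


def apply_entry_ops(ops):
    # Execute CREATE/MKDIR/MKNOD/LINK/UNLINK on the slave. On a retry of the
    # same batch, every CREATE here would come back with EEXIST.
    for _op in ops:
        pass


def trigger_rsync(ops):
    # One rsync pass for the data operations of the whole batch. If the
    # worker goes faulty here, stime below is never updated.
    pass


def process_all_at_once(changelogs):
    """Pre-fix behaviour: every pending changelog goes into ONE batch."""
    global stime
    entry_ops = [op for _, ops in changelogs for op in ops if op != "DATA"]
    data_ops = [op for _, ops in changelogs for op in ops if op == "DATA"]

    apply_entry_ops(entry_ops)   # e.g. 2550 CREATEs for three changelogs
    trigger_rsync(data_ops)      # a worker crash here means...
    stime = changelogs[-1][0]    # ...stime never advances, so on restart the
                                 # worker replays all 2550 entries again


# Three changelogs of 850 creates and 850 data operations each, matching the
# example in the RCA.
logs = [(ts, ["CREATE"] * 850 + ["DATA"] * 850)
        for ts in (1428375039, 1428375054, 1428375069)]
process_all_at_once(logs)
~~~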
--- Additional comment from Anand Avati on 2015-04-11 12:02:24 EDT ---

REVIEW: http://review.gluster.org/10202 (geo-rep: Limit number of changelogs to process in batch) posted (#1) for review on master by Aravinda VK (avishwan)

--- Additional comment from Anand Avati on 2015-04-27 07:23:44 EDT ---

REVIEW: http://review.gluster.org/10202 (geo-rep: Limit number of changelogs to process in batch) posted (#2) for review on master by Aravinda VK (avishwan)

--- Additional comment from Anand Avati on 2015-04-28 13:39:45 EDT ---

COMMIT: http://review.gluster.org/10202 committed in master by Vijay Bellur (vbellur)
------
commit 428933dce2c87ea62b4f58af7d260064fade6a8b
Author: Aravinda VK <avishwan>
Date:   Sat Apr 11 20:03:47 2015 +0530

    geo-rep: Limit number of changelogs to process in batch

    Changelog processing is done in batches: if, for example, 10 changelogs
    are available for processing, all of them are processed at once. Entry,
    Meta and Data operations are collected separately; all the entry
    operations (CREATE, MKDIR, MKNOD, LINK, UNLINK) are executed first,
    then rsync is triggered for the whole batch. Stime is updated only
    once the complete batch is done.

    With a large number of changelogs in a batch, if geo-rep fails after
    the entry operations but before rsync, then on restart it starts from
    the beginning, since stime was not updated, and has to process all the
    changelogs again. While processing the same changelogs again, every
    CREATE gets EEXIST because all the files were already created in the
    previous run. This is a big performance hit.

    With this patch, geo-rep limits the number of changelogs per batch
    based on changelog file size, so that when geo-rep fails it only has
    to retry the last batch of changelogs, since stime is updated after
    each batch.

    BUG: 1210965
    Change-Id: I844448c4cdcce38a3a2e2cca7c9a50db8f5a9062
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/10202
    Reviewed-by: Kotresh HR <khiremat>
    Tested-by: NetBSD Build System
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
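[Editorial note] The commit above caps the batch by changelog file size so that stime can advance incrementally. The following Python sketch illustrates that idea under stated assumptions - MAX_BATCH_BYTES and the helper names are hypothetical, not the patch's actual constants or functions.

~~~
# Sketch of the size-limited batching idea from the patch above; NOT the
# actual gsyncd implementation. The budget and helpers are hypothetical.

MAX_BATCH_BYTES = 727040  # hypothetical per-batch size budget


def make_batches(changelogs):
    """Group (path, size) pairs so each batch stays within the budget."""
    batch, used = [], 0
    for path, size in changelogs:
        if batch and used + size > MAX_BATCH_BYTES:
            yield batch
            batch, used = [], 0
        batch.append(path)
        used += size
    if batch:
        yield batch


def process_batch(batch):
    pass  # entry operations + rsync for this batch only


def update_stime(changelog_path):
    pass  # persist the slave-time marker for the last changelog in the batch


def sync(changelogs):
    for batch in make_batches(changelogs):
        process_batch(batch)
        update_stime(batch[-1])  # stime advances after EVERY batch, so a
                                 # crash replays only the last small batch


# Example: 10 pending changelogs of ~500 KB each are split into several
# small batches instead of one large one.
sync([("CHANGELOG.%d" % (1428375039 + i * 15), 500 * 1024)
      for i in range(10)])
~~~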
REVIEW: http://review.gluster.org/10499 (geo-rep: Limit number of changelogs to process in batch) posted (#1) for review on release-3.7 by Aravinda VK (avishwan)
REVIEW: http://review.gluster.org/10499 (geo-rep: Limit number of changelogs to process in batch) posted (#2) for review on release-3.7 by Aravinda VK (avishwan)
COMMIT: http://review.gluster.org/10499 committed in release-3.7 by Venky Shankar (vshankar)
------
commit 613414e837cb5a09c3adbf2258ad691151f1c7e1
Author: Aravinda VK <avishwan>
Date:   Sat Apr 11 20:03:47 2015 +0530

    geo-rep: Limit number of changelogs to process in batch

    Changelog processing is done in batches: if, for example, 10 changelogs
    are available for processing, all of them are processed at once. Entry,
    Meta and Data operations are collected separately; all the entry
    operations (CREATE, MKDIR, MKNOD, LINK, UNLINK) are executed first,
    then rsync is triggered for the whole batch. Stime is updated only
    once the complete batch is done.

    With a large number of changelogs in a batch, if geo-rep fails after
    the entry operations but before rsync, then on restart it starts from
    the beginning, since stime was not updated, and has to process all the
    changelogs again. While processing the same changelogs again, every
    CREATE gets EEXIST because all the files were already created in the
    previous run. This is a big performance hit.

    With this patch, geo-rep limits the number of changelogs per batch
    based on changelog file size, so that when geo-rep fails it only has
    to retry the last batch of changelogs, since stime is updated after
    each batch.

    BUG: 1217930
    Change-Id: I844448c4cdcce38a3a2e2cca7c9a50db8f5a9062
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/10202
    Reviewed-by: Kotresh HR <khiremat>
    Reviewed-by: Vijay Bellur <vbellur>
    Reviewed-on: http://review.gluster.org/10499
    Tested-by: Gluster Build System <jenkins.com>
    Tested-by: NetBSD Build System
    Reviewed-by: Venky Shankar <vshankar>
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user