Bug 1000948

Summary: Dist-geo-rep: Crawling + processing for 14 million pre-existing files take very long time
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Neependra Khare <nkhare>
Component: geo-replication
Assignee: Venky Shankar <vshankar>
Status: CLOSED ERRATA
QA Contact: Neependra Khare <nkhare>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 2.1
CC: aavati, amarts, asriram, bengland, csaba, dshaks, kcleveng, kparthas, psriniva, racpatel, rhs-bugs, sdharane, vagarwal, vbhat, vkoppad
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.4.0.39rhs
Doc Type: Bug Fix
Doc Text:
Previously, when a geo-replication session was started on a master volume that already contained tens of millions of files, it took a very long time for updates to appear on the slave mount point. With this update, the issue has been fixed.
Story Points: ---
Clone Of:
: 1024465 (view as bug list)
Environment:
Last Closed: 2013-11-27 15:33:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 957769, 1024465    

Description Neependra Khare 2013-08-26 07:48:17 UTC
Description of problem:

At the master site, on a 4x2 volume, 14 million files were created, a mix of small and large files. The average size of the small files was 32 KB, and the large files were 10 GB each. After creation, geo-replication was started.

After the initial crawl, the XSYNC-CHANGELOG.1377249839 file on one of the master nodes contained 6,281,134 entries. That file was last modified on 23 August at 16:26, and as of today, 26 August at 03:20, geo-replication has not started transferring any files. As per my understanding, and from discussion with Venky, all of this time is spent processing the XSYNC-based changelog, which is essentially "pick up an entry + stat + keep it in memory". Because of this, two of the Python processes are consuming ~5.5 GB of memory.
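
To make that memory pattern concrete, a naive processor of this shape would look like the following Python sketch (illustrative only, not the actual gsyncd code; the gfid-plus-path entry format is an assumption):

    import os

    def process_xsync_changelog(changelog_path, mount_root):
        """Naive processing: stat every entry and keep it in memory.

        With millions of entries, the in-memory list grows without
        bound, matching the ~5.5 GB per-process usage seen above.
        """
        pending = []  # every entry is held until the whole file is read
        with open(changelog_path) as f:
            for line in f:
                fields = line.rstrip("\n").split(" ", 1)  # assumed format
                if len(fields) != 2:
                    continue
                gfid, path = fields
                try:
                    st = os.lstat(os.path.join(mount_root, path))  # one stat per entry
                except OSError:
                    continue
                pending.append((gfid, path, st))  # accumulates for ~6.3M entries
        return pending  # syncing starts only after the full changelog is processed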
 
By taking a gfid from the strace output and grepping for its line number in the XSYNC-CHANGELOG.1377249839 file, it looks like ~60% of the files have been processed so far. Measured the same way, the processing throughput is about 10 files/sec.
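
For reference, extrapolating from those numbers (a back-of-the-envelope estimate, assuming the rate stays constant):

    TOTAL_ENTRIES = 6_281_134              # entries in XSYNC-CHANGELOG.1377249839
    processed = int(TOTAL_ENTRIES * 0.60)  # ~60% processed so far
    rate = 10.0                            # observed throughput, files/sec

    remaining_hours = (TOTAL_ENTRIES - processed) / rate / 3600
    print(f"~{remaining_hours:.0f} more hours at {rate:g} files/sec")
    # prints "~70 more hours at 10 files/sec", i.e. roughly three more days
    # of processing before any file is transferred, at the observed rate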


Version-Release number of selected component (if applicable):
3.4.0.22rhs-2

How reproducible:
Create a large number of files on a replicated volume and then start geo-rep.
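
A scaled-down reproduction sketch (hypothetical mount path and counts; grow SMALL_FILES toward 14M and add 10 GB files to match the original dataset):

    import os

    MOUNT = "/mnt/master-vol"   # assumed master volume mount point
    SMALL_FILES = 100_000       # scale this up to approach the original dataset
    SMALL_SIZE = 32 * 1024      # 32 KB, the average small-file size above

    for i in range(SMALL_FILES):
        d = os.path.join(MOUNT, f"dir{i % 100:02d}")
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, f"f{i}"), "wb") as f:
            f.write(os.urandom(SMALL_SIZE))
    # then start geo-rep, e.g.: gluster volume geo-replication <master> <slave> start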



Actual results:
- Geo-replication does not start transferring files, even after a long wait.
- Memory usage during the crawling + processing step is very high.
- There is no way to see progress during this phase.

Expected results:
- Geo-replication should start transferring files without such a long wait.
- The memory footprint of the crawling + processing stage should be lower.
- There should be a way to see progress during this phase.


Additional info:
- From the description of the problem, processing the XSYNC-based changelog file is what takes most of the time, and the "stat" call is likely where most of that time goes.

Comment 4 Ben England 2013-10-17 18:46:40 UTC
Kaleb Keithley suggested a different approach. If you really have a lot of data to move to a remote site for the initial geo-rep sync, maybe we shouldn't be shipping it over the WAN. Some storage vendors, such as EMC, physically transport the data. This sounds bizarre and old-fashioned, but when you have to move terabytes of data it can actually be faster and cheaper in some cases.

This doesn't invalidate the suggestions above for enhancing the product, but the point is that there are physical limits to how much data you can transport over the WAN in an initial sync.

I think he was suggesting that you'd take a pair of servers intended for the remote site and ship them to the master site, attach them to the same network as the master (gaining much higher throughput because of that), make them a geo-rep slave, and do the initial sync. You'd then detach the two slave servers from the master, ship them to the remote site, reattach them, and restart geo-rep. After that you can add in the remaining slave nodes, if any, and run a rebalance on the slave volume.

There are a lot of little steps missing in this, but I think it's feasible and might be a more practical solution in cases where there really is a lot of data in a volume before we decide to geo-replicate it.

Can anyone see a reason why this wouldn't work? For example, in network configuration: are Gluster volumes bound to a particular set of IP addresses that aren't portable? Or can you relocate a Gluster volume to a different set of IP addresses without destroying it?

Comment 5 Venky Shankar 2013-10-18 07:54:38 UTC
The patch for this bug, available in glusterfs-3.4.0.35rhs, does pipeline the sync and the crawl, but individually the crawl and the sync are still single-threaded.
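
For illustration, pipelining the crawl and the sync looks roughly like this (a minimal sketch, not the actual patch; crawl_batches and sync_batch are hypothetical stand-ins):

    import queue
    import threading

    def pipeline(crawl_batches, sync_batch, depth=4):
        """Overlap crawling and syncing.

        The crawler pushes batches into a bounded queue while a sync
        thread drains it, so transfers start as soon as the first
        batch is crawled instead of after the whole crawl finishes.
        Each stage is still single-threaded, as noted above.
        """
        batches = queue.Queue(maxsize=depth)  # bounded queue caps memory use

        def syncer():
            while True:
                batch = batches.get()
                if batch is None:          # sentinel: crawl is done
                    break
                sync_batch(batch)          # e.g. rsync this batch to the slave

        t = threading.Thread(target=syncer)
        t.start()
        for batch in crawl_batches():      # yields batches of crawled entries
            batches.put(batch)             # blocks while the queue is full
        batches.put(None)
        t.join()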

I've asked Neependra to test this patch, and I am working on parallelizing both the crawl (including generation of the xsync changelogs) and the sync.

Over the next few days, I'll update this bug with the improvements and the current state of the patches.

Comment 6 Anand Avati 2013-10-22 21:43:16 UTC
Are the crawler changes (incorporating xsync-like crawling) being tracked in this same bug?

Comment 7 Neependra Khare 2013-11-12 12:04:48 UTC
I have tested with a smaller dataset and saw that data transfer starts as soon as geo-rep starts, rather than waiting for the entire initial crawl to finish.

Comment 9 Amar Tumballi 2013-11-13 09:02:31 UTC
Improvements done as part of this bug:

* Batched processing of the xsync (initial) crawl data set (see the sketch below):
  -> working as per comment #7
* Changes to the way changelog journaling is done, so we no longer need to perform any 'stat()' or 'getxattr()' on the mount point (they are done directly on the brick instead).
* Removed the code that performed an extra stat() on the slave mount before entry creation.

Making the whole crawler code multi-threaded, with parallel crawling, is not tracked as part of this bug (bug 1029799 is filed for that enhancement).
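
As a rough illustration of the batched processing in the first bullet, here is a minimal sketch (assumptions: a line-per-entry changelog format and a caller-supplied sync_batch function; the real implementation is in gsyncd):

    import itertools

    BATCH_SIZE = 8192  # assumed batch size; comment 11 reports ~8K-entry changelogs

    def process_in_batches(changelog_path, sync_batch):
        """Process an xsync changelog in fixed-size batches.

        Only one batch is held in memory at a time, and each batch is
        synced as soon as it is read, so transfers begin immediately
        and memory stays bounded regardless of the total entry count.
        """
        with open(changelog_path) as f:
            while True:
                batch = list(itertools.islice(f, BATCH_SIZE))
                if not batch:
                    break
                sync_batch(batch)  # ship this batch to the slave before reading on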

Comment 10 Amar Tumballi 2013-11-13 09:56:22 UTC
(In reply to Ben England from comment #4)

> 
> There are a lot of little steps missing in this, but I think it's feasible
> and might be a more practical solution in cases where there really is a lot
> of data in a volume before we decide to geo-replicate it.
> 
> Can anyone see a reason why this wouldn't work?  For example, in network
> configuration -- are Gluster volumes bound to particular set of IP addresses
> that aren't portable?  Or can you re-locate a Gluster volume to a different
> set of IP addresses without destroying it?

Ben,

We did test this and had it as a use case. Bug 1005155 is filed for it, and the steps are documented.

-Amar

Comment 11 Vijaykumar Koppad 2013-11-18 07:26:13 UTC
Verified on the build glusterfs-3.4.0.43rhs

With this build, after geo-rep is started with pre-populated data, it does an xsync crawl and creates XSYNC-CHANGELOGs with ~8K entries each. These XSYNC-CHANGELOGs are processed and the files are synced to the slave; we no longer have to wait for the whole file system to be crawled before files start appearing on the slave. Since this bug relates only to batched syncing, I am moving it to VERIFIED, though this mode of syncing is not as fast as changelog-based syncing; that is tracked in bug 1029799, as Amar mentioned in comment 9.

Comment 12 errata-xmlrpc 2013-11-27 15:33:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html