Bug 809675 - [FEAT] Asymptotic synchronization is not reached in an unreliable environment
Product: GlusterFS
Classification: Community
Component: geo-replication
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Assigned To: Csaba Henk
Keywords: FutureFeature
Depends On:
Reported: 2012-04-03 22:40 EDT by Csaba Henk
Modified: 2013-07-24 13:59 EDT
CC List: 4 users

See Also:
Fixed In Version: glusterfs-3.4.0
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2013-07-24 13:59:47 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments: None
Description Csaba Henk 2012-04-03 22:40:48 EDT
Description of problem:

By "asymptotic synchronization" we mean that any particular change
on master side gets synchronized to slave at some time.

The geo-rep model theoretically delivers asymptotic synchronization,
but it's not robust: if the gsyncd worker is interrupted more
frequently than the time needed for a complete crawl (e.g. due to
network failures, a panicky slave, or the aux glusterfs mount leaking
memory until it triggers the OOM killer), then some files will never
reach the slave, because the file tree is walked in a deterministic
order and each restarted crawl revisits the same prefix first.

The solution is to randomize the walk.
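The idea can be sketched as a recursive crawl that shuffles directory entries at each level before descending (a minimal illustration only; the actual gsyncd crawl is more involved, and `randomized_crawl` is a hypothetical name, not the gsyncd function):

```python
import os
import random
import tempfile

def randomized_crawl(path, visit):
    """Walk a directory tree depth-first, shuffling the entries at each
    level so that repeated, interrupted crawls don't always revisit the
    same prefix of the tree first."""
    entries = os.listdir(path)
    random.shuffle(entries)  # the key change: break the deterministic order
    for name in entries:
        full = os.path.join(path, name)
        visit(full)
        if os.path.isdir(full) and not os.path.islink(full):
            randomized_crawl(full, visit)

# Tiny demo tree: root/{f1, a/f2}
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a"))
for f in (os.path.join(root, "f1"), os.path.join(root, "a", "f2")):
    open(f, "w").close()

seen = []
randomized_crawl(root, seen.append)
# Every entry is visited exactly once, in a randomized order.
```

A complete crawl still visits everything; only the order changes, so no single subtree is permanently starved by early interruptions.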

Version-Release number of selected component (if applicable):

How reproducible:

Well reproducible, but judging whether the issue has occurred is not
easy to automate.

Steps to Reproduce:
1. Create a file tree in a volume that is too big to sync within a minute.
2. Start geo-rep with the above volume as master and an empty slave.
3. Stop and restart geo-rep every minute.
Actual results:

Some files of master never appear on slave side.

Expected results:

Eventually all files of master should appear on slave side.

Additional info:
Comment 1 Csaba Henk 2012-04-04 21:51:42 EDT
To add: the above description, while it gives good insight into the issue at hand, is a bit of an oversimplification. Assuming that the worker is always interrupted early:

- If we have a static file tree to sync over (as in the repro instructions),
it will be synced over even with a deterministic traversal.

- If there are ongoing changes in the tree, there might be deep locations in the file tree that the synchronization activity never reaches (or reaches only with very low probability), even with randomized traversal.

Regardless, randomized traversal provides a more even distribution of synchronization effort, tending toward broader coverage, something like this:

Sync coverage with deterministic traversal:

   /**  \
  /**    \
 /*       \
/*         \

Sync coverage with random traversal:

  /** ** \
 /   *  * \
/          \

So in fact, the whole thing can only be investigated
heuristically, for which purpose the actual scenario
where this came into the picture is quite suitable -- i.e.
in a case where the aux glusterfs always leaks until OOM,
untar a bunch of kernel trees under geo-rep.
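The coverage pictures above can be made concrete with a toy simulation (this is not the gsyncd code; it models the simplified picture from the original description, where each interrupted run only has time to handle the first `budget` files it encounters, and `coverage` is a made-up helper name):

```python
import random

def coverage(n_files, budget, rounds, shuffle, seed=None):
    """Simulate an interrupted worker: each round it syncs at most
    `budget` files, in list order, before being killed. With a fixed
    order it re-syncs the same prefix every time; with shuffling the
    work is spread across the whole tree over many rounds."""
    rng = random.Random(seed)
    synced = set()
    for _ in range(rounds):
        order = list(range(n_files))
        if shuffle:
            rng.shuffle(order)
        synced.update(order[:budget])
    return synced

det = coverage(100, budget=10, rounds=50, shuffle=False)
rnd = coverage(100, budget=10, rounds=50, shuffle=True, seed=0)
# det never grows past the first 10 files; rnd keeps widening its coverage.
```

With the deterministic order only the first `budget` files ever sync, no matter how many rounds run; the shuffled order keeps picking new files each round, so expected coverage approaches the full set.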
Comment 2 Anand Avati 2012-04-05 08:13:06 EDT
CHANGE: http://review.gluster.com/3079 (geo-rep / gsyncd: shuffle directory entries in crawl) merged in master by Vijay Bellur (vijay@gluster.com)
