Red Hat Bugzilla – Bug 809675
[FEAT] Asymptotic synchronization is not reached in an unreliable environment
Last modified: 2013-07-24 13:59:47 EDT
Description of problem:
By "asymptotic synchronization" we mean that any particular change
on the master side eventually gets synchronized to the slave.
The geo-rep model theoretically delivers asymptotic synchronization,
but it's not robust: if the gsyncd worker is interrupted at intervals
shorter than the time needed for a complete crawl (e.g. due to
network failures, a panicky slave, or the aux glusterfs leaking memory
until it triggers the OOM killer), then some files will never get to
the slave end, because the file tree is always walked in the same
deterministic order.
The solution is to randomize the walk.
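To make the proposal concrete, here is a minimal sketch of a crawl that shuffles the entries of each directory before descending. It is an illustration only, not the gsyncd code: the function name `crawl` and its `shuffle` and `rng` parameters are hypothetical.

```python
import os
import random


def crawl(root, shuffle=True, rng=None):
    """Yield every path under `root`, depth-first.

    With shuffle=False the entries of each directory are visited in
    sorted (deterministic) order; with shuffle=True they are visited
    in a fresh random order on every call.
    """
    rng = rng or random.Random()
    entries = sorted(os.listdir(root))
    if shuffle:
        rng.shuffle(entries)
    for name in entries:
        path = os.path.join(root, name)
        yield path
        # Descend into real directories only, never through symlinks.
        if os.path.isdir(path) and not os.path.islink(path):
            yield from crawl(path, shuffle, rng)
```

Both modes visit the same set of paths in a complete crawl. The difference only matters when the crawl is cut short: a shuffled walk does not favour the same prefix of the tree on every attempt.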
Version-Release number of selected component (if applicable):
How reproducible:
Readily reproducible, but determining whether the issue has occurred
is not easy to automate.
Steps to Reproduce:
1. Create a file tree in some volume that is too large to sync within a minute
2. Start geo-rep with the above volume as master and an empty slave
3. Every minute, stop and restart geo-rep
Actual results:
Some files on the master never appear on the slave side.
Expected results:
Eventually all files on the master should appear on the slave side.
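The stop/restart part of the reproduction steps could be driven by a small helper like the one below. This is a sketch under assumptions: it presumes the session is controlled with `gluster volume geo-replication MASTER SLAVE start|stop`, and the function name `flap_georep` and the volume names in the usage comment are hypothetical.

```python
import subprocess
import time


def flap_georep(master, slave, cycles, interval=60,
                run=subprocess.check_call, sleep=time.sleep):
    """Start geo-rep, let it run for `interval` seconds, stop it, repeat.

    `run` and `sleep` are injectable so the loop can be dry-run
    without a gluster installation.
    """
    base = ["gluster", "volume", "geo-replication", master, slave]
    for _ in range(cycles):
        run(base + ["start"])
        sleep(interval)  # shorter than a full crawl of the tree
        run(base + ["stop"])


# Hypothetical usage, assuming a session between these two volumes exists:
# flap_georep("mastervol", "slavehost::slavevol", cycles=30)
```

Checking the outcome (which files are still missing on the slave) is left out, since that is the part that is hard to automate.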
To add: the above description, while it gives good insight into the issue at hand, is a bit of an oversimplification. Assuming that the worker is always interrupted early:
- If we have a static file tree to sync over (as in the repro instructions),
it will eventually be synced over even with a deterministic traversal.
- If there are ongoing changes in the tree, there might be deep locations in the file tree that the synchronization activity never reaches (or reaches only with very low probability), even with randomized traversal.
Regardless, randomized traversal will provide a more even distribution of synchronization, tending to a broader coverage, something like this:
[Diagram: sync coverage with deterministic traversal]
[Diagram: sync coverage with random traversal]
So in fact, the whole thing can be investigated only
heuristically. The actual scenario in which this came up
is quite suitable for that purpose: in a setup where the
aux glusterfs always leaks until OOM, untar a bunch of
kernel trees under geo-rep.
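The coverage argument can also be illustrated with a toy simulation. It is a deliberately crude model, not a description of gsyncd: the tree is flattened to a list of files, every file is assumed to be modified between rounds (the "ongoing changes" case), and each crawl is cut off after `budget` files. The name `simulate` and all parameters are hypothetical.

```python
import random


def simulate(n_files, budget, rounds, shuffle, seed=0):
    """Return how often each file was synced over `rounds` interrupted crawls.

    Assumption: all files are pending again at the start of every
    round, and the worker is killed after syncing `budget` files.
    """
    rng = random.Random(seed)
    counts = [0] * n_files
    for _ in range(rounds):
        order = list(range(n_files))
        if shuffle:
            rng.shuffle(order)
        for f in order[:budget]:
            counts[f] += 1
    return counts


det = simulate(n_files=100, budget=10, rounds=200, shuffle=False)
rnd = simulate(n_files=100, budget=10, rounds=200, shuffle=True)
print("never synced, deterministic:", det.count(0))
print("never synced, randomized:   ", rnd.count(0))
```

In this model the deterministic walk only ever touches the first `budget` files, while the shuffled walk spreads the same amount of work over the whole list. A real tree adds depth, which is why deep locations can still be starved even with a randomized walk.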
CHANGE: http://review.gluster.com/3079 (geo-rep / gsyncd: shuffle directory entries in crawl) merged in master by Vijay Bellur (email@example.com)