Red Hat Bugzilla – Bug 763844
Support parallel rebalancing
Last modified: 2013-12-08 20:22:12 EST
Created attachment 383
The current approach to rebalancing is highly sub-optimal: it runs only on the node where the command was issued and moves all data through that node - twice, if the rebalance runs on node A and a file needs to be relocated from B to C, since the data flows from B to A and then from A to C. It would be preferable to rebalance in parallel, especially on larger sets of servers, and to ensure that data moves at most once instead of twice. I have put together a proof of concept (POC) showing one way to do this, and would like to develop the idea further.

The POC starts with a very small patch to dht_layout_search, which simply ensures that the temporary file created during the relocation process (which starts with ".dht.") is hashed the same way as the final file name. A parallel rebalance can then be done in two stages.
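The dht_layout_search tweak can be illustrated with a small sketch: strip the ".dht." prefix before hashing, so the in-flight temporary file resolves to the same layout range (and thus the same subvolume) as the final name. The function names here and the use of zlib.crc32 are stand-ins for illustration only; GlusterFS uses its own hash function and layout ranges.

```python
# Sketch of the POC's dht_layout_search change: normalize the ".dht." temp
# name before hashing. crc32 is a stand-in for the real DHT hash.
import zlib

DHT_TMP_PREFIX = ".dht."

def layout_name(name: str) -> str:
    """Name actually fed to the layout hash (hypothetical helper)."""
    if name.startswith(DHT_TMP_PREFIX):
        return name[len(DHT_TMP_PREFIX):]
    return name

def subvol_for(name: str, n_subvols: int) -> int:
    """Pick a subvolume by hashing the normalized file name."""
    return zlib.crc32(layout_name(name).encode()) % n_subvols

# The temporary copy and the final file now resolve to the same subvolume:
assert subvol_for(".dht.report.txt", 4) == subvol_for("report.txt", 4)
```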
(1) Manually force re-evaluation of the layout attributes, essentially the same way as is done in the current rebalance code:
# find /mnt/dht -type d | xargs getfattr -n trusted.distribute.fix.layout
(2) For each brick, run a rebalance script giving the mountpoint root and the brick root:
# rebalance.sh /mnt/dht /bricks/dht
The rebalance script does the following for each regular file *in the brick*:
(a) Check that the file is in fact a file and not just a linkfile (by looking for the linkto xattr).
(b) Try to create the temporary copy.
(c) Check whether creating the copy *in the mountpoint* caused a file to be created *in the brick*, indicating that the brick is the proper final location and no relocation is necessary, and bail out if so.
(d) Copy and move the file through the mountpoint path, much like the current rebalance code.
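Steps (a) through (d) can be modelled with a self-contained simulation, using plain dictionaries in place of real bricks and a crc32 stand-in for the DHT hash (all names here - Brick layout, target_brick, rebalance_brick - are invented for illustration; the real script works through the mountpoint and xattrs). Running the per-brick pass for every brick shows the key property claimed above: each misplaced file moves directly to its target exactly once, and a brick that receives a file re-checks it but does not move it again.

```python
# Toy model of the per-file rebalance steps (a)-(d). The serial loop below
# stands in for one rebalance process per brick running concurrently.
import zlib

N_BRICKS = 4

def target_brick(name: str) -> int:
    # stand-in for the DHT layout hash
    return zlib.crc32(name.encode()) % N_BRICKS

def rebalance_brick(bricks, src):
    """Run the per-file steps for every regular file on brick `src`."""
    moves = 0
    for name in list(bricks[src]):
        data, is_link = bricks[src][name]
        if is_link:               # (a) skip linkfiles (they carry the linkto xattr)
            continue
        dst = target_brick(name)  # (b)/(c) probe: where would the temp copy land?
        if dst == src:            # already on its proper brick, bail out
            continue
        bricks[dst][name] = (data, False)  # (d) copy + move via the mountpoint
        del bricks[src][name]
        moves += 1
    return moves

# Scatter some files across bricks arbitrarily, then rebalance every brick.
bricks = [dict() for _ in range(N_BRICKS)]
files = ["alpha", "beta", "gamma", "delta", "epsilon"]
for i, name in enumerate(files):
    bricks[i % N_BRICKS][name] = (b"data-" + name.encode(), False)

total_moves = sum(rebalance_brick(bricks, b) for b in range(N_BRICKS))

# Every file ends up on its hashed brick, no matter the order the bricks ran in.
for b in range(N_BRICKS):
    for name in bricks[b]:
        assert target_brick(name) == b
```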
I've tried this and it works, even in parallel, but there are a couple of deficiencies. The probe/check/remove/create sequence is a bit inefficient, but that can easily be reduced to a probe/check sequence in a version that uses C/Python/whatever instead of bash. There's also a bit of inefficiency in that node A might relocate a bunch of files to node B, which will then end up checking them again; at least B won't actually move them, so this isn't too bad, and there are probably ways (akin to the current unhashed-sticky-bit trick) to avoid it. The biggest gap in the POC is the absence of the open-file and file-modified checks present in the current rebalance code, which makes it unsafe for actual production use. Again, though, that's pretty easily fixed in a non-POC version.
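The file-modified check mentioned above could look something like the following sketch: snapshot (size, mtime) before copying and back out of the swap if the source changed underneath the copy. The names safe_relocate and stat_sig are hypothetical; an open-file check would additionally need something like the locking the real rebalance code uses, which is omitted here.

```python
# Minimal sketch of a file-modified safety check for a non-POC rebalance
# script (hypothetical helpers, not the actual GlusterFS implementation).
import os
import shutil
import tempfile

def stat_sig(path):
    """Cheap change signature: (size, mtime in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)

def safe_relocate(src, dst):
    """Copy src to dst, but refuse to finish if src changed mid-copy."""
    before = stat_sig(src)
    shutil.copy2(src, dst)
    if stat_sig(src) != before:  # file was modified during the copy: back out
        os.unlink(dst)
        return False
    os.unlink(src)               # safe to drop the original
    return True

# Demo: relocate a small file between two temporary paths.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "report.txt")
    dst = os.path.join(d, "report.txt.moved")
    with open(src, "w") as f:
        f.write("payload")
    moved = safe_relocate(src, dst)
    survived = os.path.exists(dst) and not os.path.exists(src)
```

A real version would take this further, e.g. holding the source open and re-checking before the final rename, but the snapshot/compare pattern is the core of the missing check.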
While this POC is clearly not complete, it involves minimal disruption to core/translator code and could provide substantial speedups for people trying to rebalance millions of files across several servers.
This is now tracked as Bug 2258. We are thinking of doing it inside the distribute translator instead of keeping the rebalance logic outside (i.e., driven from the mount point).
Will keep you posted on updates on this.
CHANGE: http://review.gluster.com/343 (that way, we can share the rebalance state with other peers) merged in master by Vijay Bellur (firstname.lastname@example.org)
CHANGE: http://review.gluster.com/407 (there were bugs introduced due to parallelizing rebalance op.) merged in master by Vijay Bellur (email@example.com)
Planning to keep the 3.4.x branch as an "internal enhancements" release without any features. So moving these bugs to the 3.4.0 target milestone.
CHANGE: http://review.gluster.com/2755 (posix: handle some internal behavior in posix_mknod()) merged in master by Anand Avati (firstname.lastname@example.org)
CHANGE: http://review.gluster.com/2540 (cluster/dht: Rebalance will be a new glusterfs process) merged in master by Vijay Bellur (email@example.com)
CHANGE: http://review.gluster.com/2737 (cluster/dht: Support for hardlink rebalance when decommissioning) merged in master by Vijay Bellur (firstname.lastname@example.org)
CHANGE: http://review.gluster.com/2873 (glusterd/rebalance: Bring in support for parallel rebalance) merged in master by Vijay Bellur (email@example.com)
*** Bug 795716 has been marked as a duplicate of this bug. ***
The rebalance process will now be started on all of the nodes that are responsible for data movement.