Bug 763844 (GLUSTER-2112)

Summary: Support parallel rebalancing

Product: [Community] GlusterFS
Component: distribute
Version: mainline
Hardware: All
OS: Linux
Severity: low
Priority: low
Status: CLOSED CURRENTRELEASE
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Reporter: Jeff Darcy <jdarcy>
Assignee: shishir gowda <sgowda>
CC: amarts, gluster-bugs, joe, nsathyan, shmohan, vijay
Last Closed: 2013-07-24 18:00:06 UTC
Bug Depends On: 763990
Bug Blocks: 817967

Attachments:
  Patch for making original/copy hashes identical (flags: none)
  Proof-of-concept rebalancing script (flags: none)

Description Jeff Darcy 2010-11-15 19:03:17 UTC
Created attachment 383

Comment 1 Jeff Darcy 2010-11-15 22:02:09 UTC
The current approach to rebalancing is highly sub-optimal: it appears to run only on the node where the command was issued and to move all data through that node (twice, if the rebalance runs on node A and a file needs to be relocated from B to C).  It would be preferable to do the rebalancing in parallel, especially on larger sets of servers, and to ensure that data moves at most once instead of twice.  I have put together a proof of concept (POC) showing one way to do this, and would like to develop the idea further.  The POC starts with a very small patch to dht_layout_search, which simply ensures that the temporary file created during the relocation process (whose name starts with ".dht.") is hashed the same way as the final name.  A parallel rebalance can then be done in two stages.

(1) Manually force re-evaluation of the layout attributes, essentially the same way as is done in the current rebalance code:

# find /mnt/dht -type d | xargs getfattr -n trusted.distribute.fix.layout

(2) For each brick, run a rebalance script giving the mountpoint root and the brick root:

# rebalance.sh /mnt/dht /bricks/dht
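
Since each invocation walks only its own brick, the per-brick passes can run concurrently.  A minimal sketch of a driver for stage (2); the hostnames, script path, and mount/brick paths here are purely illustrative:

for host in server1 server2 server3; do
    # Hypothetical: kick off the per-brick pass on every server at once.
    ssh "$host" "/usr/local/bin/rebalance.sh /mnt/dht /bricks/dht" &
done
wait    # block until every per-brick pass has finished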

The rebalance script does the following for each regular file *in the brick* (a hedged sketch of such a script appears after this list):

(a) Check that the file is in fact a file and not just a linkfile (by looking for the linkto xattr).

(b) Try to create the temporary copy.

(c) Check whether creating the copy *in the mountpoint* caused a file to be created *in the brick*, indicating that the brick is the proper final location and no relocation is necessary, and bail out if so.

(d) Copy and move the file through the mountpoint path, much like the current rebalance code.
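
For concreteness, here is a minimal bash sketch of such a script, reconstructed from steps (a)-(d) above rather than copied from the attached POC.  The linkto xattr name (trusted.glusterfs.dht.linkto) and the exact path handling are assumptions based on the description:

#!/bin/bash
# Hypothetical reconstruction of the POC rebalance script; not the attachment.
# Usage: rebalance.sh <mountpoint> <brick-root>
MOUNT=$1
BRICK=$2

find "$BRICK" -type f -print0 | while IFS= read -r -d '' bpath; do
    rel=${bpath#"$BRICK"/}            # path relative to the brick root
    dir=$(dirname "$rel")
    base=$(basename "$rel")

    # (a) Skip linkfiles, which carry the DHT linkto xattr.
    getfattr -n trusted.glusterfs.dht.linkto "$bpath" &>/dev/null && continue

    # (b) Create the temporary copy through the mountpoint; with the
    #     dht_layout_search patch it hashes to the same brick as the real name.
    tmp="$MOUNT/$dir/.dht.$base"
    : > "$tmp" || continue

    # (c) If the temp file showed up in *this* brick, the file is already in
    #     its hashed location; clean up and skip it.
    if [ -e "$BRICK/$dir/.dht.$base" ]; then
        rm -f "$tmp"
        continue
    fi

    # (d) Copy the data and rename over the original, all through the
    #     mountpoint.  (No open-file or modified-file checks here; see the
    #     deficiencies noted below.)
    cat "$MOUNT/$rel" > "$tmp" && mv -f "$tmp" "$MOUNT/$rel"
done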

I've tried this and it works, even in parallel, but there are a couple of deficiencies.  The probe/check/remove/create sequence is a bit inefficient, but that could easily be reduced to a probe/check sequence in a version that uses C/Python/whatever instead of bash.  There's also some inefficiency in that node A might relocate a bunch of files to node B, which will then end up checking them again; at least B won't actually move them, so this isn't too bad, and there are probably ways (akin to the current unhashed-sticky-bit trick) to avoid it.  The biggest gap in the POC is the absence of the open-file and file-modified checks present in the current rebalance code, which makes it unsafe for actual production use.  Again, though, that's pretty easily fixed in a non-POC version.

While this POC is clearly not complete, it involves minimal disruption to core/translator code and could provide substantial speedups for people trying to rebalance millions of files across several servers.

Comment 2 Amar Tumballi 2011-06-22 04:21:45 UTC
This is tracked with Bug 2258 now. We are thinking of doing it inside the distribute translator instead of keeping the rebalance logic outside (driven from the mount point).

Will keep you posted on updates on this.

Comment 3 Anand Avati 2011-09-09 06:20:25 UTC
CHANGE: http://review.gluster.com/343 (that way, we can share the rebalance state with other peers) merged in master by Vijay Bellur (vijay)

Comment 4 Anand Avati 2011-09-13 13:55:28 UTC
CHANGE: http://review.gluster.com/407 (there were bugs introduced due to parallelizing rebalance op.) merged in master by Vijay Bellur (vijay)

Comment 5 Amar Tumballi 2011-09-27 05:50:02 UTC
Planning to keep the 3.4.x branch as an "internal enhancements" release without any new features, so we are moving these bugs to the 3.4.0 target milestone.

Comment 6 Anand Avati 2012-02-17 06:49:03 UTC
CHANGE: http://review.gluster.com/2755 (posix: handle some internal behavior in posix_mknod()) merged in master by Anand Avati (avati)

Comment 7 Anand Avati 2012-02-19 09:31:25 UTC
CHANGE: http://review.gluster.com/2540 (cluster/dht: Rebalance will be a new glusterfs process) merged in master by Vijay Bellur (vijay)

Comment 8 Anand Avati 2012-02-19 12:47:56 UTC
CHANGE: http://review.gluster.com/2737 (cluster/dht: Support for hardlink rebalance when decommissioning) merged in master by Vijay Bellur (vijay)

Comment 9 Anand Avati 2012-03-08 05:14:33 UTC
CHANGE: http://review.gluster.com/2873 (glusterd/rebalance: Bring in support for parallel rebalance) merged in master by Vijay Bellur (vijay)

Comment 10 shishir gowda 2012-03-08 06:29:20 UTC
*** Bug 795716 has been marked as a duplicate of this bug. ***

Comment 11 shylesh 2012-05-07 07:04:46 UTC
The rebalance process is now started on all of the nodes that are responsible for data movement.
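
With this change the user-visible flow is unchanged, but the start command spawns a rebalance process on each node that hosts a brick of the volume, and status reports per-node progress.  A quick illustration (the volume name is hypothetical):

# gluster volume rebalance myvol start
# gluster volume rebalance myvol status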