Bug 763844 - (GLUSTER-2112) Support parallel rebalancing
Support parallel rebalancing
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: All Linux
Priority: low Severity: low
Assigned To: shishir gowda
Duplicates: 795716
Depends On: GLUSTER-2258
Blocks: 817967
Reported: 2010-11-15 17:02 EST by Jeff Darcy
Modified: 2013-12-08 20:22 EST
CC List: 6 users

See Also:
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-07-24 14:00:06 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
Patch for making original/copy hashes identical (917 bytes, patch)
2010-11-15 14:02 EST, Jeff Darcy
Proof-of-concept rebalancing script (808 bytes, application/x-shellscript)
2010-11-15 14:03 EST, Jeff Darcy

Description Jeff Darcy 2010-11-15 14:03:17 EST
Created attachment 383
Comment 1 Jeff Darcy 2010-11-15 17:02:09 EST
The current approach to rebalancing is highly sub-optimal: it only seems to run on the node where the command was issued, and it moves all data through that node - twice, if the rebalance is run on node A and a file needs to be relocated from B to C.  It would be preferable to do the rebalancing in parallel, especially on larger sets of servers, and to ensure that data moves at most once instead of twice.  I have put together a proof of concept showing one way to do this, and would like to develop the idea further.  The POC starts with a very small patch to dht_layout_search, which simply ensures that the temporary file created during the relocation process (whose name starts with ".dht.") will be hashed the same way as the final result.  A parallel rebalance can then be done in two stages.

(1) Manually force re-evaluation of the layout attributes, essentially the same way as is done in the current rebalance code:

# find /mnt/dht -type d | xargs getfattr -n trusted.distribute.fix.layout

(2) For each brick, run a rebalance script giving the mountpoint root and the brick root:

# rebalance.sh /mnt/dht /bricks/dht
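
For illustration only, the per-brick runs could then be started in parallel from a single admin host; the hostnames, and the assumption that rebalance.sh is on each server's PATH, are examples and not part of the POC:

for host in server1 server2 server3; do
    ssh "$host" "rebalance.sh /mnt/dht /bricks/dht" &
done
wait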

The rebalance script does the following for each regular file *in the brick* (a rough sketch follows this list):

(a) Check that the file is in fact a file and not just a linkfile (by looking for the linkto xattr).

(b) Try to create the temporary copy.

(c) Check whether creating the copy *in the mountpoint* caused a file to be created *in the brick*, indicating that the brick is the proper final location and no relocation is necessary, and bail out if so.

(d) Copy and move the file through the mountpoint path, much like the current rebalance code.
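
To make those steps concrete, here is a rough bash sketch of such a script.  It is only an outline of the logic above; it assumes linkfiles can be recognized by the trusted.glusterfs.dht.linkto xattr, and the attached POC script may differ in its details:

#!/bin/bash
# Rough sketch only -- the attached POC script may differ.
# Usage: rebalance.sh <mountpoint-root> <brick-root>
MOUNT=$1
BRICK=$2

find "$BRICK" -type f | while IFS= read -r bpath; do
    rel=${bpath#$BRICK/}
    dir=$(dirname "$rel")
    base=$(basename "$rel")

    # (a) skip linkfiles, recognized here by the linkto xattr
    getfattr -n trusted.glusterfs.dht.linkto "$bpath" >/dev/null 2>&1 && continue

    # (b) create the temporary ".dht." copy through the mountpoint; with the
    #     hashing patch it lands on whichever brick the new layout points to
    tmp="$MOUNT/$dir/.dht.$base"
    touch "$tmp" || continue

    # (c) if the temp file showed up in this brick, the file is already in
    #     its proper final location -- clean up and bail out
    if [ -e "$BRICK/$dir/.dht.$base" ]; then
        rm -f "$tmp"
        continue
    fi

    # (d) copy the data into the temp file and rename it over the original,
    #     all through the mountpoint path
    cp "$MOUNT/$rel" "$tmp" && mv "$tmp" "$MOUNT/$rel"
done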

I've tried this and it works, even in parallel, but there are a couple of deficiencies.  The probe/check/remove/create sequence is a bit inefficient, but that can easily be reduced to a probe/check sequence in a version that uses C/Python/whatever instead of bash.  There's also a bit of inefficiency in that node A might relocate a bunch of files to node B, which will then end up checking them again, but at least B won't actually move them, so this isn't too bad, and there are probably ways (akin to the current unhashed-sticky-bit trick) to avoid it.  The biggest gap in the POC is the absence of the open-file and file-modified checks present in the current rebalance code, which makes it unsafe for actual production use.  Again, though, that's pretty easily fixed in a non-POC version.

While this POC is clearly not complete, it involves minimal disruption to core/translator code and could provide substantial speedups for people trying to rebalance millions of files across several servers.
Comment 2 Amar Tumballi 2011-06-22 00:21:45 EDT
This is tracked with Bug 2258 now. We are thinking of doing it inside distribute instead of keeping the rebalance logic outside (driven from the mount point).

Will keep you posted on updates on this.
Comment 3 Anand Avati 2011-09-09 02:20:25 EDT
CHANGE: http://review.gluster.com/343 (that way, we can share the rebalance state with other peers) merged in master by Vijay Bellur (vijay@gluster.com)
Comment 4 Anand Avati 2011-09-13 09:55:28 EDT
CHANGE: http://review.gluster.com/407 (there were bugs introduced due to parallelizing rebalance op.) merged in master by Vijay Bellur (vijay@gluster.com)
Comment 5 Amar Tumballi 2011-09-27 01:50:02 EDT
Planning to keep the 3.4.x branch as an "internal enhancements" release without any features, so moving these bugs to the 3.4.0 target milestone.
Comment 6 Anand Avati 2012-02-17 01:49:03 EST
CHANGE: http://review.gluster.com/2755 (posix: handle some internal behavior in posix_mknod()) merged in master by Anand Avati (avati@redhat.com)
Comment 7 Anand Avati 2012-02-19 04:31:25 EST
CHANGE: http://review.gluster.com/2540 (cluster/dht: Rebalance will be a new glusterfs process) merged in master by Vijay Bellur (vijay@gluster.com)
Comment 8 Anand Avati 2012-02-19 07:47:56 EST
CHANGE: http://review.gluster.com/2737 (cluster/dht: Support for hardlink rebalance when decommissioning) merged in master by Vijay Bellur (vijay@gluster.com)
Comment 9 Anand Avati 2012-03-08 00:14:33 EST
CHANGE: http://review.gluster.com/2873 (glusterd/rebalance: Bring in support for parallel rebalance) merged in master by Vijay Bellur (vijay@gluster.com)
Comment 10 shishir gowda 2012-03-08 01:29:20 EST
*** Bug 795716 has been marked as a duplicate of this bug. ***
Comment 11 shylesh 2012-05-07 03:04:46 EDT
The rebalance process is now started on all the nodes which are responsible for data movement.
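
For reference, the rebalance is still driven through the usual CLI, and the status output now shows one line per node doing data migration; "testvol" below is just an example volume name:

# gluster volume rebalance testvol start
# gluster volume rebalance testvol status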
