Red Hat Bugzilla – Bug 763844
Support parallel rebalancing
Last modified: 2013-12-08 20:22:12 EST
Created attachment 383
The current approach to rebalancing is highly sub-optimal: it runs only on the node where the command was issued and moves all data through that node - twice, if the rebalance runs on node A and a file needs to be relocated from B to C, since the data flows from B to A and then from A to C. It would be preferable to rebalance in parallel, especially on larger sets of servers, and to ensure that data moves at most once instead of twice. I have put together a proof of concept (POC) showing one way to do this, and would like to develop the idea further.

The POC starts with a very small patch to dht_layout_search, which simply ensures that the temporary file created during the relocation process (which starts with ".dht.") is hashed the same way as the final file name. A parallel rebalance can then be done in two stages.
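The dht_layout_search tweak can be illustrated with a small sketch: strip the ".dht." prefix before hashing, so the in-flight temporary file resolves to the same layout range (and thus the same subvolume) as the final name. The function names here and the use of zlib.crc32 are stand-ins for illustration only; GlusterFS uses its own hash function and layout ranges.

```python
# Sketch of the POC's dht_layout_search change: normalize the ".dht." temp
# name before hashing. crc32 is a stand-in for the real DHT hash.
import zlib

DHT_TMP_PREFIX = ".dht."

def layout_name(name: str) -> str:
    """Name actually fed to the layout hash (hypothetical helper)."""
    if name.startswith(DHT_TMP_PREFIX):
        return name[len(DHT_TMP_PREFIX):]
    return name

def subvol_for(name: str, n_subvols: int) -> int:
    """Pick a subvolume by hashing the normalized file name."""
    return zlib.crc32(layout_name(name).encode()) % n_subvols

# The temporary copy and the final file now resolve to the same subvolume:
assert subvol_for(".dht.report.txt", 4) == subvol_for("report.txt", 4)
```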
(1) Manually force re-evaluation of the layout attributes, essentially the same way as is done in the current rebalance code:
# find /mnt/dht -type d | xargs getfattr -n trusted.distribute.fix.layout
(2) For each brick, run a rebalance script giving the mountpoint root and the brick root:
# rebalance.sh /mnt/dht /bricks/dht
The rebalance script does the following for each regular file *in the brick*:
(a) Check that the file is in fact a file and not just a linkfile (by looking for the linkto xattr).
(b) Try to create the temporary copy.
(c) Check whether creating the copy *in the mountpoint* caused a file to be created *in the brick*, indicating that the brick is the proper final location and no relocation is necessary, and bail out if so.
(d) Copy and move the file through the mountpoint path, much like the current rebalance code.
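Steps (a) through (d) can be modelled with a self-contained simulation, using plain dictionaries in place of real bricks and a crc32 stand-in for the DHT hash (all names here - Brick layout, target_brick, rebalance_brick - are invented for illustration; the real script works through the mountpoint and xattrs). Running the per-brick pass for every brick shows the key property claimed above: each misplaced file moves directly to its target exactly once, and a brick that receives a file re-checks it but does not move it again.

```python
# Toy model of the per-file rebalance steps (a)-(d). The serial loop below
# stands in for one rebalance process per brick running concurrently.
import zlib

N_BRICKS = 4

def target_brick(name: str) -> int:
    # stand-in for the DHT layout hash
    return zlib.crc32(name.encode()) % N_BRICKS

def rebalance_brick(bricks, src):
    """Run the per-file steps for every regular file on brick `src`."""
    moves = 0
    for name in list(bricks[src]):
        data, is_link = bricks[src][name]
        if is_link:               # (a) skip linkfiles (they carry the linkto xattr)
            continue
        dst = target_brick(name)  # (b)/(c) probe: where would the temp copy land?
        if dst == src:            # already on its proper brick, bail out
            continue
        bricks[dst][name] = (data, False)  # (d) copy + move via the mountpoint
        del bricks[src][name]
        moves += 1
    return moves

# Scatter some files across bricks arbitrarily, then rebalance every brick.
bricks = [dict() for _ in range(N_BRICKS)]
files = ["alpha", "beta", "gamma", "delta", "epsilon"]
for i, name in enumerate(files):
    bricks[i % N_BRICKS][name] = (b"data-" + name.encode(), False)

total_moves = sum(rebalance_brick(bricks, b) for b in range(N_BRICKS))

# Every file ends up on its hashed brick, no matter the order the bricks ran in.
for b in range(N_BRICKS):
    for name in bricks[b]:
        assert target_brick(name) == b
```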
I've tried this and it works, even in parallel, but there are a couple of deficiencies. The probe/check/remove/create sequence is a bit inefficient, but that can easily be reduced to a probe/check sequence in a version that uses C/Python/whatever instead of bash. There's also a bit of inefficiency in that node A might relocate a bunch of files to node B, which will then end up checking them again; at least B won't actually move them, so this isn't too bad, and there are probably ways (akin to the current unhashed-sticky-bit trick) to avoid it. The biggest gap in the POC is the absence of the open-file and file-modified checks present in the current rebalance code, which makes it unsafe for actual production use. Again, though, that's pretty easily fixed in a non-POC version.
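The file-modified check mentioned above could look something like the following sketch: snapshot (size, mtime) before copying and back out of the swap if the source changed underneath the copy. The names safe_relocate and stat_sig are hypothetical; an open-file check would additionally need something like the locking the real rebalance code uses, which is omitted here.

```python
# Minimal sketch of a file-modified safety check for a non-POC rebalance
# script (hypothetical helpers, not the actual GlusterFS implementation).
import os
import shutil
import tempfile

def stat_sig(path):
    """Cheap change signature: (size, mtime in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)

def safe_relocate(src, dst):
    """Copy src to dst, but refuse to finish if src changed mid-copy."""
    before = stat_sig(src)
    shutil.copy2(src, dst)
    if stat_sig(src) != before:  # file was modified during the copy: back out
        os.unlink(dst)
        return False
    os.unlink(src)               # safe to drop the original
    return True

# Demo: relocate a small file between two temporary paths.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "report.txt")
    dst = os.path.join(d, "report.txt.moved")
    with open(src, "w") as f:
        f.write("payload")
    moved = safe_relocate(src, dst)
    survived = os.path.exists(dst) and not os.path.exists(src)
```

A real version would take this further, e.g. holding the source open and re-checking before the final rename, but the snapshot/compare pattern is the core of the missing check.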
While this POC is clearly not complete, it involves minimal disruption to core/translator code and could provide substantial speedups for people trying to rebalance millions of files across several servers.
This is now tracked as Bug 2258. We are thinking of doing it inside the distribute translator instead of keeping the rebalance logic outside (i.e., driven from the mount point).
Will keep you posted on updates on this.
CHANGE: http://review.gluster.com/343 (that way, we can share the rebalance state with other peers) merged in master by Vijay Bellur (firstname.lastname@example.org)
CHANGE: http://review.gluster.com/407 (there were bugs introduced due to parallelizing rebalance op.) merged in master by Vijay Bellur (email@example.com)
Planning to keep the 3.4.x branch as an "internal enhancements" release without any features. So moving these bugs to the 3.4.0 target milestone.
CHANGE: http://review.gluster.com/2755 (posix: handle some internal behavior in posix_mknod()) merged in master by Anand Avati (firstname.lastname@example.org)
CHANGE: http://review.gluster.com/2540 (cluster/dht: Rebalance will be a new glusterfs process) merged in master by Vijay Bellur (email@example.com)
CHANGE: http://review.gluster.com/2737 (cluster/dht: Support for hardlink rebalance when decommissioning) merged in master by Vijay Bellur (firstname.lastname@example.org)
CHANGE: http://review.gluster.com/2873 (glusterd/rebalance: Bring in support for parallel rebalance) merged in master by Vijay Bellur (email@example.com)
*** Bug 795716 has been marked as a duplicate of this bug. ***
The rebalance process will now be started on all of the nodes that are responsible for data movement.