Bug 763844 (GLUSTER-2112)

Summary: Support parallel rebalancing

Product: [Community] GlusterFS
Component: distribute
Version: mainline
Hardware: All
OS: Linux
Severity: low
Priority: low
Status: CLOSED CURRENTRELEASE
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Reporter: Jeff Darcy <jdarcy>
Assignee: shishir gowda <sgowda>
CC: amarts, gluster-bugs, joe, nsathyan, shmohan, vijay
Last Closed: 2013-07-24 18:00:06 UTC
Bug Depends On: 763990
Bug Blocks: 817967

Attachments:
  Patch for making original/copy hashes identical (flags: none)
  Proof-of-concept rebalancing script (flags: none)

Description Jeff Darcy 2010-11-15 19:03:17 UTC
Created attachment 383

Comment 1 Jeff Darcy 2010-11-15 22:02:09 UTC
The current approach to rebalancing is highly sub-optimal: it appears to run only on the node where the command was issued and to move all data through that node (twice, if the rebalance runs on node A and a file needs to be relocated from B to C).  It would be preferable to do the rebalancing in parallel, especially on larger sets of servers, and to ensure that data moves at most once instead of twice.  I have put together a proof of concept (POC) showing one way to do this, and would like to develop the idea further.  The POC starts with a very small patch to dht_layout_search, which simply ensures that the temporary file created during the relocation process (whose name starts with ".dht.") is hashed the same way as the final name.  A parallel rebalance can then be done in two stages.

(1) Manually force re-evaluation of the layout attributes, essentially the same way as is done in the current rebalance code:

# find /mnt/dht -type d | xargs getfattr -n trusted.distribute.fix.layout

(2) For each brick, run a rebalance script giving the mountpoint root and the brick root:

# rebalance.sh /mnt/dht /bricks/dht
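
Since each invocation walks only its own brick, the per-brick passes can run concurrently.  A minimal sketch of a driver for stage (2); the hostnames, script path, and mount/brick paths here are purely illustrative:

for host in server1 server2 server3; do
    # Hypothetical: kick off the per-brick pass on every server at once.
    ssh "$host" "/usr/local/bin/rebalance.sh /mnt/dht /bricks/dht" &
done
wait    # block until every per-brick pass has finished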

The rebalance script does the following for each regular file *in the brick* (a hedged sketch of such a script appears after this list):

(a) Check that the file is in fact a file and not just a linkfile (by looking for the linkto xattr).

(b) Try to create the temporary copy.

(c) Check whether creating the copy *in the mountpoint* caused a file to be created *in the brick*, indicating that the brick is the proper final location and no relocation is necessary, and bail out if so.

(d) Copy and move the file through the mountpoint path, much like the current rebalance code.
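
For concreteness, here is a minimal bash sketch of such a script, reconstructed from steps (a)-(d) above rather than copied from the attached POC.  The linkto xattr name (trusted.glusterfs.dht.linkto) and the exact path handling are assumptions based on the description:

#!/bin/bash
# Hypothetical reconstruction of the POC rebalance script; not the attachment.
# Usage: rebalance.sh <mountpoint> <brick-root>
MOUNT=$1
BRICK=$2

find "$BRICK" -type f -print0 | while IFS= read -r -d '' bpath; do
    rel=${bpath#"$BRICK"/}            # path relative to the brick root
    dir=$(dirname "$rel")
    base=$(basename "$rel")

    # (a) Skip linkfiles, which carry the DHT linkto xattr.
    getfattr -n trusted.glusterfs.dht.linkto "$bpath" &>/dev/null && continue

    # (b) Create the temporary copy through the mountpoint; with the
    #     dht_layout_search patch it hashes to the same brick as the real name.
    tmp="$MOUNT/$dir/.dht.$base"
    : > "$tmp" || continue

    # (c) If the temp file showed up in *this* brick, the file is already in
    #     its hashed location; clean up and skip it.
    if [ -e "$BRICK/$dir/.dht.$base" ]; then
        rm -f "$tmp"
        continue
    fi

    # (d) Copy the data and rename over the original, all through the
    #     mountpoint.  (No open-file or modified-file checks here; see the
    #     deficiencies noted below.)
    cat "$MOUNT/$rel" > "$tmp" && mv -f "$tmp" "$MOUNT/$rel"
done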

I've tried this and it works, even in parallel, but there are a couple of deficiencies.  The probe/check/remove/create sequence is a bit inefficient, but that could easily be reduced to a probe/check sequence in a version that uses C/Python/whatever instead of bash.  There's also some inefficiency in that node A might relocate a bunch of files to node B, which will then end up checking them again; at least B won't actually move them, so this isn't too bad, and there are probably ways (akin to the current unhashed-sticky-bit trick) to avoid it.  The biggest gap in the POC is the absence of the open-file and file-modified checks present in the current rebalance code, which makes it unsafe for actual production use.  Again, though, that's pretty easily fixed in a non-POC version.

While this POC is clearly not complete, it involves minimal disruption to core/translator code and could provide substantial speedups for people trying to rebalance millions of files across several servers.

Comment 2 Amar Tumballi 2011-06-22 04:21:45 UTC
This is tracked with Bug 2258 now. We are thinking of doing it inside the distribute translator instead of keeping the rebalance logic outside (driven from the mount point).

Will keep you posted on updates on this.

Comment 3 Anand Avati 2011-09-09 06:20:25 UTC
CHANGE: http://review.gluster.com/343 (that way, we can share the rebalance state with other peers) merged in master by Vijay Bellur (vijay)

Comment 4 Anand Avati 2011-09-13 13:55:28 UTC
CHANGE: http://review.gluster.com/407 (there were bugs introduced due to parallelizing rebalance op.) merged in master by Vijay Bellur (vijay)

Comment 5 Amar Tumballi 2011-09-27 05:50:02 UTC
Planning to keep the 3.4.x branch as an "internal enhancements" release without any new features, so we are moving these bugs to the 3.4.0 target milestone.

Comment 6 Anand Avati 2012-02-17 06:49:03 UTC
CHANGE: http://review.gluster.com/2755 (posix: handle some internal behavior in posix_mknod()) merged in master by Anand Avati (avati)

Comment 7 Anand Avati 2012-02-19 09:31:25 UTC
CHANGE: http://review.gluster.com/2540 (cluster/dht: Rebalance will be a new glusterfs process) merged in master by Vijay Bellur (vijay)

Comment 8 Anand Avati 2012-02-19 12:47:56 UTC
CHANGE: http://review.gluster.com/2737 (cluster/dht: Support for hardlink rebalance when decommissioning) merged in master by Vijay Bellur (vijay)

Comment 9 Anand Avati 2012-03-08 05:14:33 UTC
CHANGE: http://review.gluster.com/2873 (glusterd/rebalance: Bring in support for parallel rebalance) merged in master by Vijay Bellur (vijay)

Comment 10 shishir gowda 2012-03-08 06:29:20 UTC
*** Bug 795716 has been marked as a duplicate of this bug. ***

Comment 11 shylesh 2012-05-07 07:04:46 UTC
The rebalance process is now started on all of the nodes that are responsible for data movement.
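
With this change the user-visible flow is unchanged, but the start command spawns a rebalance process on each node that hosts a brick of the volume, and status reports per-node progress.  A quick illustration (the volume name is hypothetical):

# gluster volume rebalance myvol start
# gluster volume rebalance myvol status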