Bug 1366817 - AFR returns the node uuid of the same node for every file in the replica
Summary: AFR returns the node uuid of the same node for every file in the replica
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Karthik U S
QA Contact:
URL:
Whiteboard:
Depends On: 1315781
Blocks: 1451561 1451573 1487042
 
Reported: 2016-08-13 02:01 UTC by Pranith Kumar K
Modified: 2017-09-05 17:24 UTC
CC List: 12 users

Fixed In Version: glusterfs-3.12.0
Clone Of: 1315781
Clones: 1451561 1451573 1487042
Environment:
Last Closed: 2017-09-05 17:24:45 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Comment 1 Worker Ant 2017-04-19 12:53:35 UTC
REVIEW: https://review.gluster.org/17084 (cluster/afr: Return the list of node_uuids for the subvolume) posted (#1) for review on master by Karthik U S (ksubrahm)

Comment 2 Worker Ant 2017-04-20 07:17:20 UTC
REVIEW: https://review.gluster.org/17084 (cluster/afr: Return the list of node_uuids for the subvolume) posted (#2) for review on master by Karthik U S (ksubrahm)

Comment 3 Worker Ant 2017-04-20 09:15:15 UTC
REVIEW: https://review.gluster.org/17084 (cluster/afr: Return the list of node_uuids for the subvolume) posted (#3) for review on master by Karthik U S (ksubrahm)

Comment 4 Worker Ant 2017-04-20 17:09:09 UTC
REVIEW: https://review.gluster.org/17084 (cluster/afr: Return the list of node_uuids for the subvolume) posted (#4) for review on master by Karthik U S (ksubrahm)

Comment 5 Worker Ant 2017-05-01 15:23:19 UTC
REVIEW: https://review.gluster.org/17084 (cluster/afr: Return the list of node_uuids for the subvolume) posted (#5) for review on master by N Balachandran (nbalacha)

Comment 6 Worker Ant 2017-05-01 15:23:24 UTC
REVIEW: https://review.gluster.org/17144 (cluster/dht: rebalance fixes) posted (#1) for review on master by N Balachandran (nbalacha)

Comment 7 Worker Ant 2017-05-09 18:44:54 UTC
REVIEW: https://review.gluster.org/17084 (cluster/afr: Return the list of node_uuids for the subvolume) posted (#6) for review on master by Karthik U S (ksubrahm)

Comment 8 Worker Ant 2017-05-09 18:50:45 UTC
REVIEW: https://review.gluster.org/17084 (cluster/afr: Return the list of node_uuids for the subvolume) posted (#7) for review on master by Karthik U S (ksubrahm)

Comment 9 Worker Ant 2017-05-10 16:07:30 UTC
REVIEW: https://review.gluster.org/17239 (cluster/dht: Rebalance on all nodes migrate files) posted (#1) for review on master by N Balachandran (nbalacha)

Comment 10 Worker Ant 2017-05-10 16:19:01 UTC
REVIEW: https://review.gluster.org/17239 (cluster/dht: Rebalance on all nodes should migrate files) posted (#2) for review on master by N Balachandran (nbalacha)

Comment 11 Worker Ant 2017-05-11 09:59:29 UTC
REVIEW: https://review.gluster.org/17084 (cluster/afr: Return the list of node_uuids for the subvolume) posted (#8) for review on master by Karthik U S (ksubrahm)

Comment 12 Worker Ant 2017-05-12 06:48:56 UTC
REVIEW: https://review.gluster.org/17239 (cluster/dht: Rebalance on all nodes should migrate files) posted (#3) for review on master by N Balachandran (nbalacha)

Comment 13 Xavi Hernandez 2017-05-12 07:38:17 UTC
What are the requirements for the returned list of UUIDs?

I see that they must be separated by a space, but is the order important? If a brick is down or known to be in an inconsistent state, should it be returned? I see that AFR returns a NULL UUID for down bricks. Is this required for bad and/or down bricks, or can they simply be omitted?

Comment 14 Nithya Balachandran 2017-05-12 10:16:57 UTC
Before answering Xavi's questions, here is an overview of the new approach.

The current DHT rebalance design is as follows:

1. DHT rebalance runs on all nodes, each of which has a node-uuid.
2. On startup, each rebalance process determines the list of local subvols for its node (this is so each rebalance process reduces network reads by migrating only files that reside on the same node) by getting the node-uuids for the brick root and comparing them against its own. As only the node uuid of the first brick is returned, only the node on which the first brick exists will actually have any local subvols.
3. Each rebalance process lists files on its local subvols and migrates them if required.



The new approach is as follows:

1. DHT rebalance runs on all nodes, each of which has a node-uuid.
2. On startup, each rebalance process determines the list of local subvols for its node (this is so each rebalance process reduces network reads by migrating only files that reside on the same node) by getting the node-uuids for the brick root and comparing them against its own. As the node uuids of every brick in the replica set are now returned, all the nodes will have local subvols. The node-uuids for each local subvol are saved in an array (see the sketch at the end of this comment).
3. Each rebalance process:
- lists files on its local subvols,
- hashes the gfid of each file,
- computes hash % (number of node-uuids) to get an index into the node-uuid array,
- migrates the file if the node-uuid at that index matches its own.


This requires that all node-uuids be saved in the same order in DHT on all nodes; otherwise there is no guarantee that a particular file will always be assigned to the same node (we do not want multiple nodes trying to migrate the same file).

It would make things easier if AFR/EC returned the uuids in the same order as the bricks, so that this can later be extended to use notifications to determine when a brick goes down.

If a brick is down but the node is up, I would say return the node-uuid if available, as that node can still migrate files. I am not sure how easy it will be to determine whether a node is up.

A node-uuid must be returned for bad bricks in order to keep the selection algorithm consistent and allow that node to migrate files.
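
For illustration, here is a minimal C sketch of step 2 above: parsing the space-separated, brick-ordered node-uuid list returned for a subvolume and checking whether this node's own uuid appears in it. This is not the actual DHT code; the uuids and the helper name parse_node_uuids() are made up for the example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define UUID_STR_LEN 36   /* length of a canonical textual uuid, without NUL */

/* Split "uuid1 uuid2 uuid3" into an array, preserving brick order. */
static char **parse_node_uuids(const char *xattr_value, size_t *count)
{
    char *copy = strdup(xattr_value);
    char **uuids = NULL;
    size_t n = 0;

    for (char *tok = strtok(copy, " "); tok; tok = strtok(NULL, " ")) {
        uuids = realloc(uuids, (n + 1) * sizeof(*uuids));
        uuids[n++] = strdup(tok);
    }

    free(copy);
    *count = n;
    return uuids;
}

int main(void)
{
    /* Value as it might look for a 1x3 replica; the uuids are made up. */
    const char *xattr_value =
        "a0000000-0000-0000-0000-000000000001 "
        "a0000000-0000-0000-0000-000000000002 "
        "a0000000-0000-0000-0000-000000000003";
    const char *my_uuid = "a0000000-0000-0000-0000-000000000002";

    size_t count = 0;
    char **uuids = parse_node_uuids(xattr_value, &count);

    /* The subvol is "local" if our own node uuid appears anywhere in the
     * ordered list (previously only the first entry was returned). */
    int is_local = 0;
    for (size_t i = 0; i < count; i++)
        if (strncmp(uuids[i], my_uuid, UUID_STR_LEN) == 0)
            is_local = 1;

    printf("subvol has %zu node uuids, local=%d\n", count, is_local);

    for (size_t i = 0; i < count; i++)
        free(uuids[i]);
    free(uuids);
    return 0;
}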

Comment 15 Worker Ant 2017-05-12 10:18:40 UTC
REVIEW: https://review.gluster.org/17239 (cluster/dht: Rebalance on all nodes should migrate files) posted (#4) for review on master by N Balachandran (nbalacha)

Comment 16 Pranith Kumar K 2017-05-13 03:41:01 UTC
(In reply to Nithya Balachandran from comment #14)
> Before answering Xavi's questions, here is an overview of the new approach.
> 
> The current DHT rebalance design is as follows:
> 
> 1. DHT rebalance runs on all nodes, each of which has a node-uuid
> 2. On startup, each rebalance process determines the list of local subvols
> for the node (this is so each rebalance process reduces network reads by
> migrating only those files on the same node) by getting node-uuids for the
> brick root and comparing it against its own. As only the node uuid of the
> first brick is returned, only the node on which the first brick exists will
> actually have any local subvols.
> 3. Each rebalance process lists files on its local nodes and migrates them
> if required.
> 
> 
> 
> The new approach is as follows:
> 
> 1. DHT rebalance runs on all nodes, each of which has a node-uuid
> 2. On startup, each rebalance process determines the list of local subvols
> for the node (this is so each rebalance process reduces network reads by
> migrating only those files on the same node) by getting node-uuids for the
> brick root and comparing it against its own. As all node uuids of every
> brick in the replica set are returned, all the nodes will now have local
> subvols. The node-uuids for each local subvol are saved in an array. 
> 3. Each rebalance process:
> - lists files on its local nodes,
> - hashes the file gfid for each file
> - hash % count of node-uuids to get an index into the node-uuid array. 
> - if the node-uuid at that index matches that of the current process, the
> process will migrate the file
> 
> 
> This requires that all node-uuids be saved in the same order in DHT on all
> nodes otherwise there is no guarantee that a particular file will always be
> assigned to the same node (we do not want multiple nodes trying to migrate
> the same file).
> 
> It makes it easier if AFR/EC can return the uuids in the same order as that
> of the bricks so it can be extended to use notifications to determine when a
> brick goes down.
> 
> If a brick is down, but the node is up, I would say return the node-uuid if
> available as the other node can still migrate files. I am not sure of how
> easy it will be to determine if a node is up.

At the moment we just return an all-zero uuid if the getxattr on that brick fails or if the brick is down from the beginning.

> 
> A node-uuid must be returned for bad bricks in order to keep the selection
> algorithm consistent and allow that node to migrate files.

Thanks for the detailed explanation Nithya, I don't think I could have given such a good response :-).

Xavi,
    We are trying to get this in by 17th May, so one of us will pick this up for implementation. I am pretty sure you could send this patch in an hour, so if you have the time to send the patch, just update the bz saying that you are taking it, and we will be happy to get this in.

Pranith

Comment 17 Worker Ant 2017-05-13 13:51:23 UTC
REVIEW: https://review.gluster.org/17084 (cluster/afr: Return the list of node_uuids for the subvolume) posted (#9) for review on master by Karthik U S (ksubrahm)

Comment 18 Xavi Hernandez 2017-05-15 08:20:06 UTC
Pranith, sorry for the late answer. It has been a really busy weekend...

I'll try to make the patch this morning.

Comment 19 Worker Ant 2017-05-15 09:40:58 UTC
REVIEW: https://review.gluster.org/17239 (cluster/dht: Rebalance on all nodes should migrate files) posted (#5) for review on master by N Balachandran (nbalacha)

Comment 20 Worker Ant 2017-05-15 12:59:57 UTC
REVIEW: https://review.gluster.org/17297 (cluster/ec: return all node uuids from all subvolumes) posted (#1) for review on master by Xavier Hernandez (xhernandez)

Comment 21 Worker Ant 2017-05-15 13:02:36 UTC
REVIEW: https://review.gluster.org/17297 (cluster/ec: return all node uuids from all subvolumes) posted (#2) for review on master by Xavier Hernandez (xhernandez)

Comment 22 Worker Ant 2017-05-15 20:33:43 UTC
REVIEW: https://review.gluster.org/17297 (cluster/ec: return all node uuids from all subvolumes) posted (#3) for review on master by Xavier Hernandez (xhernandez)

Comment 23 Worker Ant 2017-05-15 21:06:26 UTC
REVIEW: https://review.gluster.org/17297 (cluster/ec: return all node uuids from all subvolumes) posted (#4) for review on master by Xavier Hernandez (xhernandez)

Comment 24 Worker Ant 2017-05-15 21:13:08 UTC
REVIEW: https://review.gluster.org/17297 (cluster/ec: return all node uuids from all subvolumes) posted (#5) for review on master by Xavier Hernandez (xhernandez)

Comment 25 Worker Ant 2017-05-15 21:24:38 UTC
REVIEW: https://review.gluster.org/17297 (cluster/ec: return all node uuids from all subvolumes) posted (#6) for review on master by Xavier Hernandez (xhernandez)

Comment 26 Worker Ant 2017-05-15 21:33:50 UTC
REVIEW: https://review.gluster.org/17297 (cluster/ec: return all node uuids from all subvolumes) posted (#7) for review on master by Xavier Hernandez (xhernandez)

Comment 27 Worker Ant 2017-05-15 21:49:24 UTC
REVIEW: https://review.gluster.org/17297 (cluster/ec: return all node uuids from all subvolumes) posted (#8) for review on master by Xavier Hernandez (xhernandez)

Comment 28 Worker Ant 2017-05-16 16:01:45 UTC
COMMIT: https://review.gluster.org/17239 committed in master by Shyamsundar Ranganathan (srangana) 
------
commit b23bd3dbc2c153171d0bb1205e6804afe022a55f
Author: N Balachandran <nbalacha>
Date:   Wed May 10 21:26:28 2017 +0530

    cluster/dht: Rebalance on all nodes should migrate files
    
    Problem:
    Rebalance compares the node-uuid of a file against its own
    and migrates a file only if they match. However, the
    current behaviour in both AFR and EC is to return
    the node-uuid of the first brick in a replica set for all
    files. This means a single node ends up migrating all
    the files if the first brick of every replica set is on the
    same node.
    
    Fix:
    AFR and EC will return all node-uuids for the replica set.
    The rebalance process will divide the files to be migrated
    among all the nodes by hashing the gfid of the file and
    using that value to select a node to perform the migration.
    This patch makes the required DHT and tiering changes.
    
    Some tests in rebal-all-nodes-migrate.t will need to be
    uncommented once the AFR and EC changes are merged.
    
    Change-Id: I5ce41600f5ba0e244ddfd986e2ba8fa23329ff0c
    BUG: 1366817
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: https://review.gluster.org/17239
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Amar Tumballi <amarts>
    Reviewed-by: Jeff Darcy <jeff.us>
    Reviewed-by: Shyamsundar Ranganathan <srangana>
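
For illustration, a small C sketch of the selection rule this commit describes: hash the file's gfid, take the hash modulo the number of node uuids returned for its subvolume, and migrate the file only if the uuid at that index is this node's own. The hash used here (FNV-1a) is an arbitrary stand-in, not necessarily the one the real rebalance code uses, and should_migrate() is a hypothetical name.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simple FNV-1a over the 16-byte gfid; a placeholder hash function. */
static uint32_t gfid_hash(const unsigned char gfid[16])
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < 16; i++) {
        h ^= gfid[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns 1 if the node owning `my_uuid` should migrate this file. */
static int should_migrate(const unsigned char gfid[16],
                          const char **node_uuids, size_t count,
                          const char *my_uuid)
{
    if (count == 0)
        return 0;
    size_t idx = gfid_hash(gfid) % count;   /* index into brick-ordered list */
    return strcmp(node_uuids[idx], my_uuid) == 0;
}

int main(void)
{
    /* Made-up node uuids, in brick order, for a 1x3 replica. */
    const char *node_uuids[] = {
        "a0000000-0000-0000-0000-000000000001",
        "a0000000-0000-0000-0000-000000000002",
        "a0000000-0000-0000-0000-000000000003",
    };
    const unsigned char gfid[16] = { 0xde, 0xad, 0xbe, 0xef };

    printf("migrate on node 2: %d\n",
           should_migrate(gfid, node_uuids, 3,
                          "a0000000-0000-0000-0000-000000000002"));
    return 0;
}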

Comment 29 Worker Ant 2017-05-17 12:42:24 UTC
REVIEW: https://review.gluster.org/17084 (cluster/afr: Return the list of node_uuids for the subvolume) posted (#10) for review on master by Ravishankar N (ravishankar)

Comment 30 Worker Ant 2017-05-17 13:15:33 UTC
REVIEW: https://review.gluster.org/17084 (cluster/afr: Return the list of node_uuids for the subvolume) posted (#11) for review on master by Ravishankar N (ravishankar)

Comment 31 Worker Ant 2017-05-17 17:49:56 UTC
COMMIT: https://review.gluster.org/17297 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit bcc34ce05c1be76dae42838d55c15d3af5f80e48
Author: Xavier Hernandez <xhernandez>
Date:   Fri May 12 09:23:47 2017 +0200

    cluster/ec: return all node uuids from all subvolumes
    
    EC was returning the UUID of the brick with the smallest value. This had
    the side effect of not evenly balancing the load between bricks on
    rebalance operations.
    
    This patch modifies the common functions that combine multiple subvolume
    values into a single result to take into account the subvolume order
    and, optionally, other subvolumes that could be damaged.
    
    This makes it easier to add future features where brick order is important.
    It also makes it possible to easily identify the originating brick of each
    answer, in case some brick has a special meaning in the future.
    
    Change-Id: Iee0a4da710b41224a6dc8e13fa8dcddb36c73a2f
    BUG: 1366817
    Signed-off-by: Xavier Hernandez <xhernandez>
    Reviewed-on: https://review.gluster.org/17297
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Ashish Pandey <aspandey>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>

Comment 32 Worker Ant 2017-05-17 18:45:55 UTC
COMMIT: https://review.gluster.org/17084 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 0a50167c0a8f950f5a1c76442b6c9abea466200d
Author: karthik-us <ksubrahm>
Date:   Wed Apr 19 18:04:46 2017 +0530

    cluster/afr: Return the list of node_uuids for the subvolume
    
    Problem:
    AFR was returning the node uuid of the first node for every file if
    the replica set was healthy, which was resulting in only one node
    migrating all the files.
    
    Fix:
    With this patch AFR returns the list of node_uuids to the upper layer,
    so that it can decide which files to migrate on which node, resulting
    in improved performance. Ordering of node uuids will be maintained based
    on the ordering of the bricks. If a brick is down, then the node uuid
    for that brick will be set to all zeros.
    
    Change-Id: I73ee0f9898ae473584fdf487a2980d7a6db22f31
    BUG: 1366817
    Signed-off-by: karthik-us <ksubrahm>
    Reviewed-on: https://review.gluster.org/17084
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
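
For illustration, a rough C sketch of the behaviour this commit describes: one node uuid per brick, concatenated in brick order into a space-separated string, with an all-zero uuid substituted for bricks that are down. This is not the actual AFR patch; build_node_uuid_reply() is a hypothetical helper.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define UUID_STR_LEN 36
#define ZERO_UUID "00000000-0000-0000-0000-000000000000"

/*
 * brick_uuids[i] is the node uuid returned by brick i, or NULL if that
 * brick was down or its getxattr failed.
 */
static char *build_node_uuid_reply(const char **brick_uuids, size_t brick_count)
{
    /* one uuid per brick, plus separating spaces and the trailing NUL */
    size_t len = brick_count * (UUID_STR_LEN + 1) + 1;
    char *reply = calloc(1, len);

    for (size_t i = 0; i < brick_count; i++) {
        if (i > 0)
            strcat(reply, " ");
        strcat(reply, brick_uuids[i] ? brick_uuids[i] : ZERO_UUID);
    }
    return reply;
}

int main(void)
{
    /* Made-up uuids for a 1x3 replica; the second brick is down. */
    const char *brick_uuids[] = {
        "a0000000-0000-0000-0000-000000000001",
        NULL,
        "a0000000-0000-0000-0000-000000000003",
    };

    char *reply = build_node_uuid_reply(brick_uuids, 3);
    printf("%s\n", reply);
    free(reply);
    return 0;
}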

Comment 33 Worker Ant 2017-05-17 18:52:41 UTC
REVIEW: https://review.gluster.org/17320 (tests: Uncomment multi-node rebalance tests) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 34 Shyamsundar 2017-09-05 17:24:45 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

