Bug 1374135 - Rebalance is not considering the brick sizes while fixing the layout
Summary: Rebalance is not considering the brick sizes while fixing the layout
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 3.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact:
URL:
Whiteboard:
Depends On: 1257182 1366494
Blocks:
 
Reported: 2016-09-08 03:58 UTC by Nithya Balachandran
Modified: 2016-10-20 14:02 UTC (History)

Fixed In Version: glusterfs-3.8.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1366494
Environment:
Last Closed: 2016-10-20 14:02:35 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Nithya Balachandran 2016-09-08 03:58:37 UTC
+++ This bug was initially created as a clone of Bug #1366494 +++

+++ This bug was initially created as a clone of Bug #1257182 +++

Problem statement:
============================

Rebalance is not considering the brick sizes while fixing the layout of the volume

Steps/procedure:

1. Create a distribute volume using one brick of 100GB.
2. Mount it on a client using FUSE and create a directory and 1000 files.
3. Add a brick of 200GB from another node and run rebalance from the same node.


Actual results:
================
Though Brick2 is 200GB, it holds only 327 files while the other brick holds 676. The directory layout ranges are given below:




[root@rhs-client9 dht4]# getfattr -d -m . -e hex /rhs/brick2/dht4/data
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick2/dht4/data
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x2bcf9f94144a4decb533a419885784cc
trusted.glusterfs.dht=0x0000000100000000aaa972d0ffffffff (200 GB Brick)


[root@rhs-client4 dht4]# getfattr -d -m . -e hex /rhs/brick1/dht4/data
getfattr: Removing leading '/' from absolute path names
# file: rhs/brick1/dht4/data
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x2bcf9f94144a4decb533a419885784cc
trusted.glusterfs.dht=0x000000010000000000000000aaa972cf (100 GB Brick)
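
For reference, the trusted.glusterfs.dht value appears to decode as four big-endian 32-bit fields -- a count, a hash type, and the inclusive start and end of the brick's hash range; that field interpretation is an assumption based on the values above, and the sketch below is standalone code, not GlusterFS source. It just decodes the two ranges shown and prints their share of the hash space:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static void show_range(const char *brick, uint32_t start, uint32_t end)
{
    /* Both ends of a DHT range are inclusive, hence the +1. */
    uint64_t size = (uint64_t)end - (uint64_t)start + 1;
    printf("%-26s start=0x%08" PRIx32 " end=0x%08" PRIx32
           " size=%" PRIu64 " (%.1f%% of hash space)\n",
           brick, start, end, size, 100.0 * (double)size / 4294967296.0);
}

int main(void)
{
    /* Values taken verbatim from the getfattr output above. */
    show_range("200GB brick (rhs/brick2):", 0xaaa972d0u, 0xffffffffu);
    show_range("100GB brick (rhs/brick1):", 0x00000000u, 0xaaa972cfu);
    return 0;
}

Run, this shows the 200GB brick holding roughly one third of the hash space and the 100GB brick roughly two thirds, which matches the inverted file counts above.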



Expected results:
==================
While fixing the layout, rebalance should consider the brick sizes.


Output:
===================
[root@rhs-client4 dht4]# gluster vol status dht4
Status of volume: dht4
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhs-client4.lab.eng.blr.redhat.com:/r
hs/brick1/dht4                              49158     0          Y       20117
Brick rhs-client9.lab.eng.blr.redhat.com:/r
hs/brick2/dht4                              49157     0          Y       29628
NFS Server on localhost                     2049      0          Y       20301
NFS Server on rhs-client39.lab.eng.blr.redh
at.com                                      N/A       N/A        N       N/A  
NFS Server on rhs-client9.lab.eng.blr.redha
t.com                                       N/A       N/A        N       N/A  
 
Task Status of Volume dht4
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : b93f08b3-e59c-4e30-bd0f-b405e553bdb3
Status               : completed        

[root@rhs-client9 dht4]# df -h | grep brick2
/dev/mapper/rhel_rhs--client9-vol1  200G   60M  200G   1% /rhs/brick2

[root@rhs-client4 dht4]# df -h | grep brick1
/dev/mapper/rhgs_rhs--client4-vol1  100G   84M  100G   1% /rhs/brick1

--- Additional comment from Red Hat Bugzilla Rules Engine on 2015-08-26 08:33:08 EDT ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Raghavendra G on 2016-06-28 02:42:03 EDT ---

The ranges allocated are:

>>> 0xffffffff - 0xaaa972d0
1431735599
>>> 0xaaa972cf
2863231695

Though the ranges are in the ratio 1:2, they are allocated to the wrong bricks: the larger range is allocated to the smaller brick. This needs to be fixed.
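
To make the inversion concrete, a size-proportional split of the 32-bit hash space for these two bricks would look roughly as follows. This is only a sketch: brick capacity in GB is assumed as the weight here, whereas the actual code derives weights from the reported disk sizes.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t hash_space = 1ULL << 32;        /* 0x00000000 - 0xffffffff */
    const uint64_t weight[]   = { 100, 200 };      /* brick sizes in GB */
    const char    *name[]     = { "100GB brick", "200GB brick" };
    const uint64_t total      = weight[0] + weight[1];

    uint64_t start = 0;
    for (int i = 0; i < 2; i++) {
        /* The last brick takes whatever remains so the ranges cover the space. */
        uint64_t end = (i == 1) ? hash_space - 1
                                : start + hash_space * weight[i] / total - 1;
        printf("%s: 0x%08llx - 0x%08llx (%llu hashes, ~%.0f%%)\n",
               name[i], (unsigned long long)start, (unsigned long long)end,
               (unsigned long long)(end - start + 1),
               100.0 * (double)(end - start + 1) / (double)hash_space);
        start = end + 1;
    }
    return 0;
}

The observed layout gives the 100GB brick the ~2/3 range and the 200GB brick the ~1/3 range, the opposite of this split.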

--- Additional comment from John Skeoch on 2016-07-13 18:35:18 EDT ---

User rmekala's account has been closed

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-08-09 07:17:24 EDT ---

Since this bug has been approved for the RHGS 3.2.0 release of Red Hat Gluster Storage 3, through release flag 'rhgs-3.2.0+', and through the Internal Whiteboard entry of '3.2.0', the Target Release is being automatically set to 'RHGS 3.2.0'

--- Additional comment from Nithya Balachandran on 2016-08-16 00:51:50 EDT ---

RCA:

The volume was created with a single brick. On adding a second, much larger brick and running a rebalance, the layout is recalculated for all existing directories by calling dht_fix_layout_of_directory (). This function generates a new weighted layout in dht_selfheal_layout_new_directory () but then calls dht_selfheal_layout_maximize_overlap () on the newly generated layout. The latter function does not consider the relative brick sizes, and since the original brick had the complete layout (0x00000000-0xffffffff), the ranges are swapped to maximize the overlap with the old layout.
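
A self-contained illustration of that failure mode (simplified; these are not the real GlusterFS functions, and the boundary values are illustrative rather than the exact ones the code computes): the newly added 200GB brick has no old range, so any swap that hands the 100GB brick a bigger slice of its old full range looks like an improvement to a weight-unaware overlap check.

#include <stdio.h>
#include <stdint.h>

struct chunk { uint32_t start; uint32_t stop; };   /* inclusive range */

/* Number of hashes two inclusive ranges have in common. */
static uint64_t overlap(struct chunk a, struct chunk b)
{
    uint32_t lo = a.start > b.start ? a.start : b.start;
    uint32_t hi = a.stop  < b.stop  ? a.stop  : b.stop;
    return hi >= lo ? (uint64_t)hi - lo + 1 : 0;
}

int main(void)
{
    /* Old layout: the original 100GB brick held the full range; the newly
     * added 200GB brick held nothing, so its old overlap is always zero. */
    struct chunk old_small = { 0x00000000u, 0xffffffffu };

    /* New weighted layout: small brick ~1/3, large brick ~2/3. */
    struct chunk new_small = { 0x00000000u, 0x55555554u };
    struct chunk new_large = { 0x55555555u, 0xffffffffu };

    /* A weight-unaware check compares total overlap with the old layout
     * before and after swapping the two chunks between the bricks. */
    uint64_t keep = overlap(old_small, new_small);   /* small brick keeps 1/3 */
    uint64_t swap = overlap(old_small, new_large);   /* small brick takes 2/3 */

    printf("overlap if chunks kept   : %llu\n", (unsigned long long)keep);
    printf("overlap if chunks swapped: %llu\n", (unsigned long long)swap);
    if (swap > keep)
        printf("=> swap wins: the larger chunk lands on the 100GB brick\n");
    return 0;
}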

--- Additional comment from Jeff Darcy on 2016-08-16 09:00:27 EDT ---

Nithya's analysis is correct. We generate a new layout based on brick sizes, then attempt to optimize it for maximum overlap with the current layout. That optimization is important to minimize data movement, but unfortunately it's broken in this case because it doesn't account properly for where each range already resides. I wrote that function, BTW, so it's my fault. For now, we should probably just disable the optimization phase when we're weighting by brick size. Longer term, what we need to do is fix dht_selfheal_layout_maximize_overlap. There's a place where it tries to determine whether a particular swap would be an improvement or not. That particular calculation needs to be enhanced to account for the *actual* current and proposed locations for a range, instead of (effectively) inferring those locations from ordinal positions.
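
The short-term fix would take roughly the shape below. The helper names are hypothetical and this is not the patch itself, just the shape of the guard: run the overlap-maximization pass only when it cannot fight the weighting.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct brick_info {
    uint64_t size_bytes;    /* capacity used as the layout weight */
};

/* Hypothetical helper: true when every brick reports the same size. */
static bool all_bricks_equal(const struct brick_info *bricks, int count)
{
    for (int i = 1; i < count; i++)
        if (bricks[i].size_bytes != bricks[0].size_bytes)
            return false;
    return true;
}

/* Hypothetical guard: allow the overlap-maximization pass to run only if
 * weighted rebalance is disabled or all bricks are the same size, so it
 * can never undo a size-weighted layout. */
static bool may_maximize_overlap(bool weighted_rebalance,
                                 const struct brick_info *bricks, int count)
{
    if (!weighted_rebalance)
        return true;
    return all_bricks_equal(bricks, count);
}

int main(void)
{
    /* The two bricks from this report: 100GB and 200GB. */
    struct brick_info bricks[] = { { 100ULL << 30 }, { 200ULL << 30 } };
    printf("run overlap maximization? %s\n",
           may_maximize_overlap(true, bricks, 2) ? "yes" : "no");
    return 0;
}

With the guard returning false here, the weighted layout produced by dht_selfheal_layout_new_directory () would be used as-is.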

--- Additional comment from Worker Ant on 2016-09-06 01:48:47 EDT ---

REVIEW: http://review.gluster.org/15403 (cluster/dht: Skip layout overlap maximization on weighted rebalance) posted (#1) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2016-09-06 13:12:56 EDT ---

REVIEW: http://review.gluster.org/15403 (cluster/dht: Skip layout overlap maximization on weighted rebalance) posted (#2) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2016-09-07 04:11:50 EDT ---

REVIEW: http://review.gluster.org/15403 (cluster/dht: Skip layout overlap maximization on weighted rebalance) posted (#3) for review on master by N Balachandran (nbalacha)

--- Additional comment from Worker Ant on 2016-09-07 12:49:46 EDT ---

REVIEW: http://review.gluster.org/15403 (cluster/dht: Skip layout overlap maximization on weighted rebalance) posted (#4) for review on master by N Balachandran (nbalacha)

Comment 1 Worker Ant 2016-09-08 04:24:08 UTC
REVIEW: http://review.gluster.org/15422 (cluster/dht: Skip layout overlap maximization on weighted rebalance) posted (#1) for review on release-3.8 by N Balachandran (nbalacha)

Comment 2 Worker Ant 2016-09-09 12:27:56 UTC
REVIEW: http://review.gluster.org/15422 (cluster/dht: Skip layout overlap maximization on weighted rebalance) posted (#2) for review on release-3.8 by N Balachandran (nbalacha)

Comment 3 Niels de Vos 2016-09-12 05:39:05 UTC
All 3.8.x bugs are now reported against version 3.8 (without .x). For more information, see http://www.gluster.org/pipermail/gluster-devel/2016-September/050859.html

Comment 4 Worker Ant 2016-09-14 05:20:49 UTC
COMMIT: http://review.gluster.org/15422 committed in release-3.8 by Niels de Vos (ndevos) 
------
commit 31b045060478e1c5066f6f9c4321970fbff398de
Author: N Balachandran <nbalacha>
Date:   Thu Sep 8 09:34:46 2016 +0530

    cluster/dht: Skip layout overlap maximization on weighted rebalance
    
    During a fix-layout, dht_selfheal_layout_maximize_overlap () does not
    consider chunk sizes while calculating layout overlaps, causing smaller
    bricks to sometimes get larger ranges than larger bricks. Temporarily
    enabling this operation only if weighted rebalance is disabled
    or all bricks are the same size.
    
    > Change-Id: I5ed16cdff2551b826a1759ca8338921640bfc7b3
    > BUG: 1366494
    > Signed-off-by: N Balachandran <nbalacha>
    > Reviewed-on: http://review.gluster.org/15403
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Raghavendra G <rgowdapp>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    
    (cherry picked from commit b93692cce603006d9cb6750e08183bca742792ac)
    
    Change-Id: Icf0dd83f36912e721982bcf818a06c4b339dc974
    BUG: 1374135
    Signed-off-by: N Balachandran <nbalacha>
    Reviewed-on: http://review.gluster.org/15422
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>

Comment 5 Niels de Vos 2016-10-20 14:02:35 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.5, please open a new bug report.

glusterfs-3.8.5 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/announce/2016-October/000061.html
[2] https://www.gluster.org/pipermail/gluster-users/

