Bug 1221656 - rebalance failing on one of the nodes
Summary: rebalance failing on one of the nodes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: 3.7.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact:
URL:
Whiteboard:
Duplicates: 1225997
Depends On: 1221696 1225997
Blocks: glusterfs-3.7.2 1227262
 
Reported: 2015-05-14 13:55 UTC by SATHEESARAN
Modified: 2015-06-20 09:48 UTC
CC List: 5 users

Fixed In Version: glusterfs-3.7.2
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 1221696 1227262
Environment:
Last Closed: 2015-06-20 09:48:17 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description SATHEESARAN 2015-05-14 13:55:38 UTC
Description of problem:
-----------------------
I was using 2 RHEL 6.6 machines with the glusterfs-3.7.0beta2 builds installed. Each node has 3 bricks. After forming a cluster of these 2 nodes by peer probing, I created a 2x2 distributed-replicate volume.

Adding another pair of bricks to this volume and rebalancing resulted in the rebalance failing on one of the nodes.


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.7.0beta2 build

How reproducible:
------------------
Always

Steps to Reproduce:
-------------------
1. Create a 2-node cluster with 3 bricks per node
2. Create a 2x2 distributed-replicate volume
3. Start the volume
4. Mount the volume (FUSE, NFS)
5. Create a few files on the mount
6. Add a pair of bricks to the volume
7. Perform a rebalance (see the command sketch below)
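
For reference, a minimal command sketch of the steps above. The volume name vmstore is taken from the CLI output in comment 1; the hostnames node1/node2 and the brick paths are illustrative placeholders, not the exact setup used:

# on node1, form the 2-node cluster
gluster peer probe node2

# 2x2 distributed-replicate volume, with the two bricks of each
# replica pair split across the two nodes
gluster volume create vmstore replica 2 \
    node1:/bricks/b1 node2:/bricks/b1 \
    node1:/bricks/b2 node2:/bricks/b2
gluster volume start vmstore

# FUSE mount and a few test files
mount -t glusterfs node1:/vmstore /mnt/vmstore
for i in $(seq 1 10); do echo data > /mnt/vmstore/file$i; done

# add another replica pair, then rebalance
gluster volume add-brick vmstore replica 2 node1:/bricks/b3 node2:/bricks/b3
gluster volume rebalance vmstore start
gluster volume rebalance vmstore status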

Actual results:
---------------
Rebalance failed on the second node

Expected results:
-----------------
Rebalance should complete successfully

Comment 1 SATHEESARAN 2015-05-14 13:57:44 UTC
[root@~]# gluster volume rebalance vmstore start
volume rebalance: vmstore: success: Rebalance on vmstore has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 9372b71c-e6f4-44fb-a2e4-9707443f3457

[root@ ~]# gluster volume rebalance vmstore status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost                0        0Bytes             3             0             1            completed               1.00
                             10.70.37.58                0        0Bytes             0             3             0               failed               0.00
volume rebalance: vmstore: success:

<snip_rebalance_logs>
[2015-05-14 19:17:41.419890] I [dht-rebalance.c:2112:gf_defrag_process_dir] 0-vmstore-dht: migrate data called on /
[2015-05-14 19:17:41.424661] I [dht-common.c:3539:dht_setxattr] 0-vmstore-dht: fixing the layout of /.trashcan
[2015-05-14 19:17:41.424688] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 0 (vmstore-replicate-0): 101834 chunks
[2015-05-14 19:17:41.424699] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 1 (vmstore-replicate-1): 101834 chunks
[2015-05-14 19:17:41.424708] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 2 (vmstore-replicate-2): 101834 chunks
[2015-05-14 19:17:41.434411] I [dht-rebalance.c:2112:gf_defrag_process_dir] 0-vmstore-dht: migrate data called on /.trashcan
[2015-05-14 19:17:41.446254] I [dht-common.c:3539:dht_setxattr] 0-vmstore-dht: fixing the layout of /.trashcan/internal_op
[2015-05-14 19:17:41.446279] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 0 (vmstore-replicate-0): 101834 chunks
[2015-05-14 19:17:41.446290] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 1 (vmstore-replicate-1): 101834 chunks
[2015-05-14 19:17:41.446298] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 2 (vmstore-replicate-2): 101834 chunks
[2015-05-14 19:17:41.453365] I [dht-rebalance.c:2112:gf_defrag_process_dir] 0-vmstore-dht: migrate data called on /.trashcan/internal_op
[2015-05-14 19:17:41.458214] I [dht-common.c:3539:dht_setxattr] 0-vmstore-dht: fixing the layout of /.trashcan/internal_op
[2015-05-14 19:17:41.458542] E [dht-rebalance.c:2368:gf_defrag_settle_hash] 0-vmstore-dht: fix layout on /.trashcan/internal_op failed
[2015-05-14 19:17:41.458824] E [MSGID: 109016] [dht-rebalance.c:2528:gf_defrag_fix_layout] 0-vmstore-dht: Fix layout failed for /.trashcan
</snip_rebalance_logs>

Comment 2 SATHEESARAN 2015-05-14 13:59:17 UTC
The following is from Nithya's mail to gluster-devel about this issue:

<snip>
The rebalance failure is due to the interaction of the lookup-unhashed changes and rebalance local crawl changes.
</snip>

Comment 3 Anand Avati 2015-05-15 05:21:34 UTC
REVIEW: http://review.gluster.org/10788 (dht/rebalance : Fixed rebalance failure) posted (#1) for review on release-3.7 by N Balachandran (nbalacha)

Comment 4 Anand Avati 2015-05-28 17:50:16 UTC
REVIEW: http://review.gluster.org/10788 (dht/rebalance : Fixed rebalance failure) posted (#3) for review on release-3.7 by Shyamsundar Ranganathan (srangana)

Comment 5 Shyamsundar 2015-05-28 17:50:55 UTC
*** Bug 1225997 has been marked as a duplicate of this bug. ***

Comment 6 Niels de Vos 2015-06-02 08:20:19 UTC
The required changes to fix this bug have not made it into glusterfs-3.7.1. This bug is now getting tracked for glusterfs-3.7.2.

Comment 7 Anand Avati 2015-06-02 12:30:22 UTC
REVIEW: http://review.gluster.org/10788 (dht/rebalance : Fixed rebalance failure) posted (#5) for review on release-3.7 by N Balachandran (nbalacha)

Comment 8 Anand Avati 2015-06-03 18:26:03 UTC
REVIEW: http://review.gluster.org/10788 (dht/rebalance : Fixed rebalance failure) posted (#6) for review on release-3.7 by Shyamsundar Ranganathan (srangana)

Comment 9 Anand Avati 2015-06-04 12:23:10 UTC
COMMIT: http://review.gluster.org/10788 committed in release-3.7 by Raghavendra G (rgowdapp) 
------
commit 3e8f9c1da61bf70ed635a655e966df574d1e15cd
Author: Nithya Balachandran <nbalacha>
Date:   Thu May 14 19:33:44 2015 +0530

    dht/rebalance : Fixed rebalance failure
    
    The rebalance process determines the local subvols for the
    node it is running on and only acts on files in those subvols.
    If a dist-rep or dist-disperse volume is created on 2 nodes by
    dividing the bricks equally across the nodes, one process might
    determine it has no local_subvols.
    
    When trying to update the commit hash, the function attempts to
    lock all local subvols. On the node with no local_subvols the dht
    inode lock operation fails, in turn causing the rebalance to fail.
    
    In a dist-rep volume with 2 nodes, if brick 0 of each replica
    set is on node1 and brick 1 is on node2, node2 will find that it has
    no local subvols.
    
    Change-Id: I7d73b5b4bf1c822eae6df2e6f79bd6a1606f4d1c
    BUG:  1221656
    Signed-off-by: Nithya Balachandran <nbalacha>
    Reviewed-on-master: http://review.gluster.org/10786
    Reviewed-by: Shyamsundar Ranganathan <srangana>
    Reviewed-by: Susant Palai <spalai>
    Reviewed-on: http://review.gluster.org/10788
    Tested-by: Gluster Build System <jenkins.com>
    Tested-by: NetBSD Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>
    Tested-by: Raghavendra G <rgowdapp>
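
For illustration only (the actual change is the patch at http://review.gluster.org/10788 above; all identifiers here are hypothetical), the fix described in the commit message amounts to a guard of this shape in the rebalance locking path:

    /* Hypothetical C sketch of the guard described above: when this
     * rebalance process owns no local subvols, there is nothing to
     * lock or migrate, so skip the commit-hash inode locking instead
     * of letting the lock attempt fail and abort the rebalance. */
    if (local_subvol_count == 0) {
            /* no bricks of this volume live on this node */
            return 0;   /* treat as success, not as a failure */
    }
    /* otherwise, lock each local subvol and update the commit hash */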

Comment 10 Niels de Vos 2015-06-20 09:48:17 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.2, please reopen this bug report.

glusterfs-3.7.2 has been announced on the Gluster Packaging mailing list [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://www.gluster.org/pipermail/packaging/2015-June/000006.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

