+++ This bug was initially created as a clone of Bug #1221696 +++

+++ This bug was initially created as a clone of Bug #1221656 +++

Description of problem:
-----------------------
I was using two RHEL 6.6 machines with glusterfs-3.7.0beta2 builds installed. Each node has 3 bricks. After creating a cluster of these 2 nodes by peer probing, I created a 2X2 distributed-replicate volume. Adding another pair of bricks to this volume and rebalancing resulted in the rebalance failing on one of the nodes.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.7.0beta2 build

How reproducible:
------------------
Always

Steps to Reproduce:
-------------------
1. Create a 2 node cluster with 3 bricks per node
2. Create a distributed-replicate volume of 2X2
3. Start the volume
4. Mount the volume ( fuse, nfs )
5. Create a few files on the mount
6. Add a pair of bricks to the volume
7. Perform rebalance

Actual results:
---------------
Rebalance failed on the second node

Expected results:
-----------------
Rebalance should complete successfully

--- Additional comment from SATHEESARAN on 2015-05-14 09:57:44 EDT ---

[root@~]# gluster volume rebalance vmstore start
volume rebalance: vmstore: success: Rebalance on vmstore has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 9372b71c-e6f4-44fb-a2e4-9707443f3457

[root@ ~]# gluster volume rebalance vmstore status
Node          Rebalanced-files   size     scanned   failures   skipped   status      run time in secs
-----------   ----------------   ------   -------   --------   -------   ---------   ----------------
localhost     0                  0Bytes   3         0          1         completed   1.00
10.70.37.58   0                  0Bytes   0         3          0         failed      0.00
volume rebalance: vmstore: success:

<snip_rebalance_logs>
[2015-05-14 19:17:41.419890] I [dht-rebalance.c:2112:gf_defrag_process_dir] 0-vmstore-dht: migrate data called on /
[2015-05-14 19:17:41.424661] I [dht-common.c:3539:dht_setxattr] 0-vmstore-dht: fixing the layout of /.trashcan
[2015-05-14 19:17:41.424688] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 0 (vmstore-replicate-0): 101834 chunks
[2015-05-14 19:17:41.424699] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 1 (vmstore-replicate-1): 101834 chunks
[2015-05-14 19:17:41.424708] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 2 (vmstore-replicate-2): 101834 chunks
[2015-05-14 19:17:41.434411] I [dht-rebalance.c:2112:gf_defrag_process_dir] 0-vmstore-dht: migrate data called on /.trashcan
[2015-05-14 19:17:41.446254] I [dht-common.c:3539:dht_setxattr] 0-vmstore-dht: fixing the layout of /.trashcan/internal_op
[2015-05-14 19:17:41.446279] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 0 (vmstore-replicate-0): 101834 chunks
[2015-05-14 19:17:41.446290] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 1 (vmstore-replicate-1): 101834 chunks
[2015-05-14 19:17:41.446298] I [dht-selfheal.c:1494:dht_fix_layout_of_directory] 0-vmstore-dht: subvolume 2 (vmstore-replicate-2): 101834 chunks
[2015-05-14 19:17:41.453365] I [dht-rebalance.c:2112:gf_defrag_process_dir] 0-vmstore-dht: migrate data called on /.trashcan/internal_op
[2015-05-14 19:17:41.458214] I [dht-common.c:3539:dht_setxattr] 0-vmstore-dht: fixing the layout of /.trashcan/internal_op
[2015-05-14 19:17:41.458542] E [dht-rebalance.c:2368:gf_defrag_settle_hash] 0-vmstore-dht: fix layout on /.trashcan/internal_op failed
[2015-05-14 19:17:41.458824] E [MSGID: 109016] [dht-rebalance.c:2528:gf_defrag_fix_layout] 0-vmstore-dht: Fix layout failed for /.trashcan
</snip_rebalance_logs>

--- Additional comment from SATHEESARAN on 2015-05-14 09:59:17 EDT ---

Following is the mail conversation from Nithya to gluster-devel for this issue :

<snip>
The rebalance failure is due to the interaction of the lookup-unhashed changes and rebalance local crawl changes.
</snip>

--- Additional comment from Anand Avati on 2015-05-14 11:09:36 EDT ---

REVIEW: http://review.gluster.org/10786 (dht/rebalance : Fixed rebalance failure) posted (#1) for review on master by N Balachandran (nbalacha)

--- Additional comment from Anand Avati on 2015-05-14 11:17:36 EDT ---

REVIEW: http://review.gluster.org/10786 (dht/rebalance : Fixed rebalance failure) posted (#2) for review on master by N Balachandran (nbalacha)

--- Additional comment from Anand Avati on 2015-05-14 15:13:45 EDT ---

COMMIT: http://review.gluster.org/10786 committed in master by Shyamsundar Ranganathan (srangana)
------
commit 1cabc769c7b636f89f6f28aaa0d534401a82d4a8
Author: Nithya Balachandran <nbalacha>
Date: Thu May 14 19:33:44 2015 +0530

dht/rebalance : Fixed rebalance failure

The rebalance process determines the local subvols for the node it is running on and only acts on files in those subvols. If a dist-rep or dist-disperse volume is created on 2 nodes by dividing the bricks equally across the nodes, one process might determine it has no local_subvols. When trying to update the commit hash, the function attempts to lock all local subvols. On the node with no local_subvols the dht inode lock operation fails, in turn causing the rebalance to fail.
In a dist-rep volume with 2 nodes, if brick 0 of each replica set is on node1 and brick 1 is on node2, node2 will find that it has no local subvols.

Change-Id: I7d73b5b4bf1c822eae6df2e6f79bd6a1606f4d1c
BUG: 1221696
Signed-off-by: Nithya Balachandran <nbalacha>
Reviewed-on: http://review.gluster.org/10786
Reviewed-by: Shyamsundar Ranganathan <srangana>
Reviewed-by: Susant Palai <spalai>
Tested-by: Gluster Build System <jenkins.com>
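
For readers following the commit message, the failure scenario can be modelled with a small Python sketch. This is an illustration only and not the DHT code: it assumes a subvolume counts as local to the node that hosts its brick 0 (inferred from the example in the commit message), and the names `local_subvols`, `rebalance_status` and `VOLUME` are hypothetical. The subvolume names are taken from the rebalance log above; `node1` and `node2` are the placeholder node names from the commit message.

```python
# Toy model of the scenario in the commit message; NOT the GlusterFS implementation.
# ASSUMPTION: a subvolume (replica set) is "local" to the node hosting its brick 0.

def local_subvols(subvols, node):
    """Return the names of subvolumes treated as local to `node`."""
    return [name for name, brick_nodes in subvols.items() if brick_nodes[0] == node]


def rebalance_status(subvols, node):
    """Mimic the pre-fix outcome: locking an empty set of local subvols fails."""
    local = local_subvols(subvols, node)
    if not local:
        # Per the commit message, the dht inode lock fails when there are
        # no local subvols, which in turn fails the rebalance on that node.
        return "failed"
    return "completed"


# Each replica set lists the node hosting brick 0 first, then brick 1.
VOLUME = {
    "vmstore-replicate-0": ["node1", "node2"],
    "vmstore-replicate-1": ["node1", "node2"],
    "vmstore-replicate-2": ["node1", "node2"],
}

if __name__ == "__main__":
    for node in ("node1", "node2"):
        print(node, local_subvols(VOLUME, node), rebalance_status(VOLUME, node))
```

Under that assumption node1 owns all three subvolumes and node2 owns none, which is consistent with the status output above where one node reports completed and the other reports failed.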
*** This bug has been marked as a duplicate of bug 1221656 ***