Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1243815

Summary: DHT: REBALANCE - Rebalance crawl on a directory will never visit peer directories if fix-layout fails for any of the descendant directories
Product: [Community] GlusterFS
Reporter: Sakshi <sabansal>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: mainline
CC: bugs, nbalacha, nlevinki, rgowdapp, rkavunga, shmohan, smohan, storage-qa-internal, vbellur
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard: dht-failed-rebalance
Fixed In Version: glusterfs-4.1.3 (or later)
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1064481
Environment:
Last Closed: 2018-08-29 03:37:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1064481, 1237059
Bug Blocks:

Description Sakshi 2015-07-16 12:07:01 UTC
+++ This bug was initially created as a clone of Bug #1064481 +++

Description of problem:
Rebalance crawls the directory tree depth-first. If fix-layout fails on any descendant of a directory, rebalance exits and never visits the remaining directories at higher levels (the peers of the directory in question).

Version-Release number of selected component (if applicable):
3.4.0.59rhs-1.el6rhs.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create a 3-brick distribute volume.
2. Create a deep directory tree, say 100 levels, with directories and files at each level:

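# Each pass creates (or reuses) directory $i, descends into it, and
# populates that level with 100 subdirectories and 100 files.
for i in {1..100}
do
  mkdir -p "$i"        # -p: from the second level on, $i already exists
  cd "$i" || break     # stop if the directory cannot be entered
  for j in {1..100}
  do
    mkdir -p "$j"
    touch "file.$j"
  done
done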

4. Add 2 more bricks and run rebalance.

5. While migration is in progress and the crawl has reached, say, directory depth 50 (this can be found by monitoring the rebalance log), delete directory 50 from the mount point:

rm -rf 50/

6. After some time, rebalance reports failures saying that fix-layout failed for some directory.
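
For step 5, the depth the crawl has reached can be estimated from the "migrate data called on <path>" lines in the rebalance log. The following is a minimal sketch, assuming the log line format shown in the rebalance log excerpt under Additional info; the function name crawl_depth and the log location in the example are illustrative assumptions, not part of GlusterFS.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GlusterFS): read rebalance log lines on
# stdin and print the depth, counted as the number of path components, of the
# directory most recently passed to gf_defrag_migrate_data.
# Assumes each log entry is on one line and paths contain no whitespace.
crawl_depth() {
    awk '/migrate data called on / {
             path = $NF                    # last field is the directory path
             depth = gsub("/", "/", path)  # gsub returns the number of slashes
         }
         END { if (depth != "") print depth }'
}

# Example (the log path is an assumption; adjust it to your installation):
#   crawl_depth < /var/log/glusterfs/dht1-rebalance.log
```

The helper prints nothing if no matching line has been logged yet.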



Actual results:
Once fix-layout fails for a directory, the rebalance process exits without processing the remaining directories at higher levels. Because the crawl is depth-first, many top-level directories may never be visited, so no data is migrated from them.

Expected results:
If fix-layout fails for any directory, rebalance should continue and fix the remaining directories.
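
The expected behaviour can be illustrated with a small depth-first walk. This is a sketch only: fix_layout, crawl, failures and FAIL_NAME are hypothetical stand-ins and not GlusterFS code (the real crawl lives in dht-rebalance.c). A failure in one subtree is counted and the walk moves on to the next peer directory instead of passing the error up to the root.

```shell
#!/usr/bin/env bash
# Illustration only: hypothetical stand-ins, not GlusterFS code.

failures=0

# Stand-in for the per-directory fix-layout step. It fails when the directory
# has vanished, or when its name equals $FAIL_NAME (used to simulate an error).
fix_layout() {
    [ -d "$1" ] && [ "$(basename "$1")" != "${FAIL_NAME:-}" ]
}

# Depth-first crawl that records a failure and keeps going.
crawl() {
    local dir=$1 child
    if ! fix_layout "$dir"; then
        echo "fix-layout failed for $dir" >&2
        failures=$((failures + 1))
        return 0    # do not propagate: the caller continues with the peers
    fi
    echo "visited $dir"
    for child in "$dir"/*/; do
        [ -d "$child" ] || continue    # also skips an unmatched glob
        crawl "${child%/}"
    done
}
```

Calling crawl on a directory prints each directory it visits and leaves the number of failed directories in failures. Returning non-zero from the failure branch and aborting the loop on that status would reproduce the behaviour reported in this bug.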

Additional info:
Volume Name: dht1
Type: Distribute
Volume ID: c0abd5ee-2f93-4de8-a287-178fde6e2283
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: 10.70.35.187:/rhs/brick1/d1
Brick2: 10.70.35.187:/rhs/brick1/d2
Brick3: 10.70.35.228:/rhs/brick1/d1
Brick4: 10.70.35.228:/rhs/brick1/212
Brick5: 10.70.35.212:/rhs/brick1/d1


cluster info
----------------
10.70.35.187
10.70.35.212
10.70.35.228


rebalance logs
--------------
[2014-02-12 09:04:58.185772] I [dht-rebalance.c:1121:gf_defrag_migrate_data] 0-dht1-dht: migrate data called on /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events
[2014-02-12 09:04:58.212112] E [dht-rebalance.c:1217:gf_defrag_migrate_data] 0-dht1-dht: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events/report_Kerneloops.xml lookup failed
[2014-02-12 09:04:58.244667] I [dht-common.c:1119:dht_lookup_linkfile_cbk] 0-dht1-dht: lookup of /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events/report_Mailx.xml on dht1-client-0 (following linkfile) failed (No such file or directory)
[2014-02-12 09:04:58.245925] E [dht-rebalance.c:1217:gf_defrag_migrate_data] 0-dht1-dht: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events/report_Mailx.xml lookup failed
[2014-02-12 09:04:58.249012] I [dht-rebalance.c:1345:gf_defrag_migrate_data] 0-dht1-dht: Migration operation on dir /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events took 0.06 secs
[2014-02-12 09:04:58.249687] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-4: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.250141] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-0: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.250195] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-3: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.250247] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-1: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.291056] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-2: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.291136] E [dht-rebalance.c:1407:gf_defrag_fix_layout] 0-dht1-dht: Failed to open dir /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events
[2014-02-12 09:04:58.291158] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events
[2014-02-12 09:04:58.291341] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport
[2014-02-12 09:04:58.291519] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8
[2014-02-12 09:04:58.291847] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48
[2014-02-12 09:04:58.292138] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47
[2014-02-12 09:04:58.292315] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46
[2014-02-12 09:04:58.292573] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45
[2014-02-12 09:04:58.292707] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44
[2014-02-12 09:04:58.293231] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43
[2014-02-12 09:04:58.293455] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42
[2014-02-12 09:04:58.293836] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41
[2014-02-12 09:04:58.293914] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40
[2014-02-12 09:04:58.294245] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39
[2014-02-12 09:04:58.294444] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38
[2014-02-12 09:04:58.294859] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37
[2014-02-12 09:04:58.295116] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35
[2014-02-12 09:04:58.295419] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34
[2014-02-12 09:04:58.295672] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32
[2014-02-12 09:04:58.296050] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31
[2014-02-12 09:04:58.296328] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30
[2014-02-12 09:04:58.296598] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29
[2014-02-12 09:04:58.298708] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28
[2014-02-12 09:04:58.299179] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27
[2014-02-12 09:04:58.299522] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25
[2014-02-12 09:04:58.300027] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24
[2014-02-12 09:04:58.300687] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8
[2014-02-12 09:04:58.300908] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7
[2014-02-12 09:04:58.301004] I [dht-rebalance.c:1783:gf_defrag_status_get] 0-glusterfs: Rebalance is completed. Time taken is 5084.00 secs
[2014-02-12 09:04:58.301015] I [dht-rebalance.c:1786:gf_defrag_status_get] 0-glusterfs: Files migrated: 52862, size: 1036401138, lookups: 172572, failures: 27, skipped: 3
[2014-02-12 09:04:58.366534] W [glusterfsd.c:1099:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3c312e894d] (-->/lib64/libpthread.so.0() [0x3c31607851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x4052fd]))) 0-: received signum (15), shutting down
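
The cascade in the log above can be summarised by counting the "Fix layout failed for <path>" entries and finding the shortest failed path, which shows how far up the tree the failure propagated. The following is a minimal sketch, assuming each log entry is on a single line in the format shown above; summarize_fix_layout_failures is a hypothetical helper name and the log path in the example is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GlusterFS): read rebalance log lines on
# stdin, count the "Fix layout failed for <path>" entries, and report the
# shortest such path, i.e. the highest directory the failure reached.
# Assumes paths contain no whitespace.
summarize_fix_layout_failures() {
    awk '/Fix layout failed for / {
             count++
             if (top == "" || length($NF) < length(top)) top = $NF
         }
         END { printf "failed=%d top=%s\n", count, top }'
}

# Example (the log path is an assumption; adjust it to your installation):
#   summarize_fix_layout_failures < /var/log/glusterfs/dht1-rebalance.log
```

A shortest path that sits at the top of the crawled tree indicates that the error was propagated all the way up rather than being contained in the failing subtree.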

Comment 1 Anand Avati 2015-07-16 12:10:15 UTC
REVIEW: http://review.gluster.org/11697 (dht: Continue rebalance crawl if fix-layout fails for any one descendant directory) posted (#1) for review on master by Sakshi Bansal (sabansal)

Comment 2 Mike McCune 2016-03-28 23:31:34 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.

Comment 5 Amar Tumballi 2018-08-29 03:37:46 UTC
This update is done in bulk based on the state of the patch and the time since last activity. If the issue is still seen, please reopen the bug.