Bug 1243815 - DHT: REBALANCE - Rebalance crawl on a directory will never visit peer directories if fix-layout fails for any of the descendant directories
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact:
URL:
Whiteboard: dht-failed-rebalance
Depends On: 1064481 1237059
Blocks:
Reported: 2015-07-16 12:07 UTC by Sakshi
Modified: 2018-08-29 03:37 UTC (History)

Fixed In Version: glusterfs-4.1.3 (or later)
Doc Type: Bug Fix
Doc Text:
Clone Of: 1064481
Environment:
Last Closed: 2018-08-29 03:37:46 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Sakshi 2015-07-16 12:07:01 UTC
+++ This bug was initially created as a clone of Bug #1064481 +++

Description of problem:
Rebalance crawls the directory tree depth-first. If fix-layout fails on any descendant of a directory, rebalance exits and never visits the remaining directories at higher levels (the peers of the directory in question).

Version-Release number of selected component (if applicable):
3.4.0.59rhs-1.el6rhs.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create a 3-brick distribute volume.
2. Create a directory tree about 100 levels deep, with directories and files at each level:

for i in {1..100}
do
 # -p: from the second level on, $i already exists (created by the inner loop one level up)
 mkdir -p $i
 cd $i
 for j in {1..100}
 do
   mkdir $j
   touch file.$j
 done
done

3. Add 2 more bricks and run rebalance.

4. While migration is in progress and the crawl has reached, say, directory depth 50 (this can be seen in the rebalance log), delete directory 50 from the mount point:

rm -rf 50/

5. After some time, rebalance reports failures saying that fix-layout failed for some directory.



Actual results:
Once fix-layout fails for a directory, the rebalance process exits and does not process the remaining directories at higher levels. Because the crawl is depth-first, many top-level directories may never be visited, so no data is migrated from them.

Expected results:
If fix-layout fails for any directory, rebalance should continue and fix the other directories.
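
The expected behaviour can be illustrated with a small shell sketch of a depth-first crawl that records a failure and then moves on to peer directories. This is only an illustration under stated assumptions, not the GlusterFS implementation: `crawl`, `fix_layout`, the `failures` counter, and the `.fail` marker file are all hypothetical names invented for this example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a depth-first crawl that keeps going after a failure.
# Nothing here is GlusterFS code; fix_layout is a stand-in for the real operation.

failures=0

# Stand-in for fix-layout: fails if the directory is gone or carries a
# ".fail" marker file (the marker is an assumption used to simulate errors).
fix_layout() {
    [ -d "$1" ] && [ ! -e "$1/.fail" ]
}

# Visit a directory, then each of its subdirectories, depth-first.
crawl() {
    local dir="$1" child
    if ! fix_layout "$dir"; then
        echo "fix-layout failed for $dir" >&2
        failures=$((failures + 1))
        # Skip this subtree, but return success so that the caller
        # continues with the peer directories instead of aborting.
        return 0
    fi
    echo "visited $dir"
    for child in "$dir"/*/; do
        [ -d "$child" ] || continue   # glob matched nothing
        crawl "${child%/}"
    done
}
```

Calling `crawl` on a scratch directory and then printing `$failures` shows that directories next to a failing subtree are still visited, while the failure itself is still counted and can be reported at the end.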

Additional info:
Volume Name: dht1
Type: Distribute
Volume ID: c0abd5ee-2f93-4de8-a287-178fde6e2283
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: 10.70.35.187:/rhs/brick1/d1
Brick2: 10.70.35.187:/rhs/brick1/d2
Brick3: 10.70.35.228:/rhs/brick1/d1
Brick4: 10.70.35.228:/rhs/brick1/212
Brick5: 10.70.35.212:/rhs/brick1/d1


cluster info
----------------
10.70.35.187
10.70.35.212
10.70.35.228


rebalance logs
--------------
[2014-02-12 09:04:58.185772] I [dht-rebalance.c:1121:gf_defrag_migrate_data] 0-dht1-dht: migrate data called on /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events
[2014-02-12 09:04:58.212112] E [dht-rebalance.c:1217:gf_defrag_migrate_data] 0-dht1-dht: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events/report_Kerneloops.xml lookup failed
[2014-02-12 09:04:58.244667] I [dht-common.c:1119:dht_lookup_linkfile_cbk] 0-dht1-dht: lookup of /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events/report_Mailx.xml on dht1-client-0 (following linkfile) failed (No such file or directory)
[2014-02-12 09:04:58.245925] E [dht-rebalance.c:1217:gf_defrag_migrate_data] 0-dht1-dht: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events/report_Mailx.xml lookup failed
[2014-02-12 09:04:58.249012] I [dht-rebalance.c:1345:gf_defrag_migrate_data] 0-dht1-dht: Migration operation on dir /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events took 0.06 secs
[2014-02-12 09:04:58.249687] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-4: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.250141] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-0: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.250195] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-3: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.250247] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-1: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.291056] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-2: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.291136] E [dht-rebalance.c:1407:gf_defrag_fix_layout] 0-dht1-dht: Failed to open dir /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events
[2014-02-12 09:04:58.291158] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events
[2014-02-12 09:04:58.291341] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport
[2014-02-12 09:04:58.291519] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8
[2014-02-12 09:04:58.291847] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48
[2014-02-12 09:04:58.292138] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47
[2014-02-12 09:04:58.292315] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46
[2014-02-12 09:04:58.292573] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45
[2014-02-12 09:04:58.292707] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44
[2014-02-12 09:04:58.293231] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43
[2014-02-12 09:04:58.293455] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42
[2014-02-12 09:04:58.293836] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41
[2014-02-12 09:04:58.293914] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40
[2014-02-12 09:04:58.294245] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39
[2014-02-12 09:04:58.294444] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38
[2014-02-12 09:04:58.294859] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37
[2014-02-12 09:04:58.295116] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35
[2014-02-12 09:04:58.295419] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34
[2014-02-12 09:04:58.295672] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32
[2014-02-12 09:04:58.296050] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31
[2014-02-12 09:04:58.296328] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30
[2014-02-12 09:04:58.296598] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29
[2014-02-12 09:04:58.298708] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28
[2014-02-12 09:04:58.299179] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27
[2014-02-12 09:04:58.299522] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25
[2014-02-12 09:04:58.300027] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24
[2014-02-12 09:04:58.300687] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8
[2014-02-12 09:04:58.300908] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7
[2014-02-12 09:04:58.301004] I [dht-rebalance.c:1783:gf_defrag_status_get] 0-glusterfs: Rebalance is completed. Time taken is 5084.00 secs
[2014-02-12 09:04:58.301015] I [dht-rebalance.c:1786:gf_defrag_status_get] 0-glusterfs: Files migrated: 52862, size: 1036401138, lookups: 172572, failures: 27, skipped: 3
[2014-02-12 09:04:58.366534] W [glusterfsd.c:1099:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3c312e894d] (-->/lib64/libpthread.so.0() [0x3c31607851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x4052fd]))) 0-: received signum (15), shutting down
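
To see how far a fix-layout failure propagated in a log like the one above, the affected paths can be pulled out of the "Fix layout failed for" entries. The following is a minimal sketch, assuming each log entry sits on a single line; `failed_layout_paths` is a hypothetical helper name, and the log file path passed to it is up to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the directory path from every
# "Fix layout failed for <path>" entry in the given rebalance log.
# Assumes one log entry per line (no hard-wrapped lines).
failed_layout_paths() {
    sed -n 's/.*Fix layout failed for \(.*\)$/\1/p' "$1"
}
```

Piping the output through `wc -l` gives the number of directories for which fix-layout was reported as failed. In the log above, the paths get shorter by one component per entry, all the way up to /mv7, which is the failure propagating from the deleted directory to every ancestor.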

Comment 1 Anand Avati 2015-07-16 12:10:15 UTC
REVIEW: http://review.gluster.org/11697 (dht: Continue rebalance crawl if fix-layout fails for any one descendant directory) posted (#1) for review on master by Sakshi Bansal (sabansal)

Comment 2 Mike McCune 2016-03-28 23:31:34 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.

Comment 5 Amar Tumballi 2018-08-29 03:37:46 UTC
This update is done in bulk based on the state of the patch and the time since last activity. If the issue is still seen, please reopen the bug.

