Bug 1064481

Summary: DHT: REBALANCE - Rebalance crawl on a directory will never visit peer directories if fix-layout fails for any of the descendant directories
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: shylesh <shmohan>
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED DUPLICATE
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: unspecified
Version: 2.1
CC: nlevinki, spalai, vbellur
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1243815
Environment:
Last Closed: 2015-11-27 12:06:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1243815

Description shylesh 2014-02-12 16:54:22 UTC
Description of problem:
Rebalance crawls the directory tree depth-first. If fix-layout fails on any descendant of a directory, rebalance exits and never visits the remaining directories at higher levels (the peers of the directory in question).

Version-Release number of selected component (if applicable):
3.4.0.59rhs-1.el6rhs.x86_64

How reproducible:
always

Steps to Reproduce:
1. Create a 3-brick distribute volume.
2. Create a deep directory tree (say, 100 levels), with directories and files at each level:

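for i in {1..100}
do
  mkdir -p "$i"        # may already exist from the inner loop one level up
  cd "$i" || exit 1    # stop if the directory cannot be entered
  for j in {1..100}
  do
    mkdir -p "$j"
    touch "file.$j"
  done
done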

3. Add 2 more bricks and run rebalance.

4. While migration is in progress, say when the crawl is at directory depth 50 (this can be found by monitoring the rebalance log), delete directory 50 from the mount point:

rm -rf 50/

5. After some time, rebalance reports failures saying that fix-layout failed for some directory.
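
One way to watch the crawl depth mentioned in the steps above is to take the most recent "migrate data called on" entry from the rebalance log and count its path components. The following is a rough sketch only: the function name current_crawl_depth is made up for illustration, the log is read from stdin so no log file location is assumed, and each log entry is assumed to sit on a single line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of gluster): print the directory depth of the
# most recent "migrate data called on <path>" entry read from stdin.
current_crawl_depth() {
  grep 'migrate data called on ' |
    tail -n 1 |
    sed 's/.*migrate data called on //' |
    awk -F/ '{ print NF - 1 }'   # "/a/b/c" splits into 4 fields, depth 3
}
```

It could be fed with the rebalance log, for example by piping the log file into current_crawl_depth from a loop that sleeps between reads.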



Actual results:
Once fix-layout fails for a directory, the rebalance process exits without processing the remaining directories at higher levels. Because the crawl is depth-first, many top-level directories may never be visited, so no data is migrated from them.

Expected results:
Once fix-layout fails for any directory, rebalance should continue to fix the other directories.
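
The expected behaviour can be illustrated with a small depth-first walk that records a failure and then moves on to the next sibling instead of aborting. This is a bash sketch only, not the DHT rebalance code: fix_layout, crawl and the FAIL marker file are hypothetical stand-ins used to simulate a directory whose fix-layout fails.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the per-directory fix-layout step: it fails if the
# directory has vanished or contains a marker file named FAIL.
fix_layout() {
  [ -d "$1" ] && [ ! -e "$1/FAIL" ]
}

# Depth-first crawl that remembers failures but keeps visiting siblings.
crawl() {
  local dir=$1 child rc=0
  if ! fix_layout "$dir"; then
    echo "fix-layout failed for $dir" >&2
    return 1
  fi
  echo "visited $dir"
  for child in "$dir"/*/; do
    [ -d "$child" ] || continue       # unmatched glob: no subdirectories
    crawl "${child%/}" || rc=1        # note the failure, do not leave the loop
  done
  return "$rc"
}
```

With a tree top/{1,2,3} where top/2 carries the FAIL marker, crawl top still visits top/1 and top/3 and returns a non-zero status so the failure is not lost.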

Additional info:
Volume Name: dht1
Type: Distribute
Volume ID: c0abd5ee-2f93-4de8-a287-178fde6e2283
Status: Started
Number of Bricks: 5
Transport-type: tcp
Bricks:
Brick1: 10.70.35.187:/rhs/brick1/d1
Brick2: 10.70.35.187:/rhs/brick1/d2
Brick3: 10.70.35.228:/rhs/brick1/d1
Brick4: 10.70.35.228:/rhs/brick1/212
Brick5: 10.70.35.212:/rhs/brick1/d1


cluster info
----------------
10.70.35.187
10.70.35.212
10.70.35.228


rebalance logs
--------------
[2014-02-12 09:04:58.185772] I [dht-rebalance.c:1121:gf_defrag_migrate_data] 0-dht1-dht: migrate data called on /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events
[2014-02-12 09:04:58.212112] E [dht-rebalance.c:1217:gf_defrag_migrate_data] 0-dht1-dht: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events/report_Kerneloops.xml lookup failed
[2014-02-12 09:04:58.244667] I [dht-common.c:1119:dht_lookup_linkfile_cbk] 0-dht1-dht: lookup of /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events/report_Mailx.xml on dht1-client-0 (following linkfile) failed (No such file or directory)
[2014-02-12 09:04:58.245925] E [dht-rebalance.c:1217:gf_defrag_migrate_data] 0-dht1-dht: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events/report_Mailx.xml lookup failed
[2014-02-12 09:04:58.249012] I [dht-rebalance.c:1345:gf_defrag_migrate_data] 0-dht1-dht: Migration operation on dir /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events took 0.06 secs
[2014-02-12 09:04:58.249687] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-4: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.250141] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-0: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.250195] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-3: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.250247] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-1: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.291056] W [client-rpc-fops.c:2523:client3_3_opendir_cbk] 0-dht1-client-2: remote operation failed: No such file or directory. Path: /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events (358296be-cf50-4722-8127-87ca87d53e3b)
[2014-02-12 09:04:58.291136] E [dht-rebalance.c:1407:gf_defrag_fix_layout] 0-dht1-dht: Failed to open dir /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events
[2014-02-12 09:04:58.291158] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport/events
[2014-02-12 09:04:58.291341] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8/libreport
[2014-02-12 09:04:58.291519] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48/etc8
[2014-02-12 09:04:58.291847] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47/48
[2014-02-12 09:04:58.292138] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46/47
[2014-02-12 09:04:58.292315] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45/46
[2014-02-12 09:04:58.292573] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44/45
[2014-02-12 09:04:58.292707] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43/44
[2014-02-12 09:04:58.293231] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42/43
[2014-02-12 09:04:58.293455] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41/42
[2014-02-12 09:04:58.293836] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40/41
[2014-02-12 09:04:58.293914] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39/40
[2014-02-12 09:04:58.294245] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38/39
[2014-02-12 09:04:58.294444] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37/38
[2014-02-12 09:04:58.294859] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35/37
[2014-02-12 09:04:58.295116] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34/35
[2014-02-12 09:04:58.295419] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32/34
[2014-02-12 09:04:58.295672] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31/32
[2014-02-12 09:04:58.296050] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30/31
[2014-02-12 09:04:58.296328] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29/30
[2014-02-12 09:04:58.296598] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28/29
[2014-02-12 09:04:58.298708] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27/28
[2014-02-12 09:04:58.299179] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25/27
[2014-02-12 09:04:58.299522] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24/25
[2014-02-12 09:04:58.300027] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8/24
[2014-02-12 09:04:58.300687] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7/8
[2014-02-12 09:04:58.300908] E [dht-rebalance.c:1498:gf_defrag_fix_layout] 0-dht1-dht: Fix layout failed for /mv7
[2014-02-12 09:04:58.301004] I [dht-rebalance.c:1783:gf_defrag_status_get] 0-glusterfs: Rebalance is completed. Time taken is 5084.00 secs
[2014-02-12 09:04:58.301015] I [dht-rebalance.c:1786:gf_defrag_status_get] 0-glusterfs: Files migrated: 52862, size: 1036401138, lookups: 172572, failures: 27, skipped: 3
[2014-02-12 09:04:58.366534] W [glusterfsd.c:1099:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3c312e894d] (-->/lib64/libpthread.so.0() [0x3c31607851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x4052fd]))) 0-: received signum (15), shutting down
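
The cascade in the log above, where one failed directory makes every ancestor up to the crawl root report "Fix layout failed", can be summarised by pulling the failing paths out of the log and picking the longest one as the likely origin. This is a rough bash sketch: the function names are invented here, input is read from stdin, and each log entry is assumed to be on one line.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of gluster) for reading a rebalance log.

# Print every path for which a "Fix layout failed for <path>" entry was logged.
fix_layout_failures() {
  sed -n 's/.*Fix layout failed for //p'
}

# Print the longest failing path; when the failure propagates upwards, the
# ancestors are all shorter than the directory where it started.
failure_origin() {
  fix_layout_failures |
    awk '{ if (length($0) > max) { max = length($0); p = $0 } }
         END { if (p != "") print p }'
}
```

Run against the log in this report, failure_origin would be expected to point at the libreport/events directory, with the remaining entries being its ancestors.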



Sosreports are attached.

Comment 3 Susant Kumar Palai 2015-11-27 12:06:59 UTC

*** This bug has been marked as a duplicate of bug 1237059 ***