Bug 1002521

Summary: Rebalance failures on Distribute Volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: senaik
Component: distribute
Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED WONTFIX
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: high
Version: 2.1
CC: poelstra, rhs-bugs, spalai, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-27 12:08:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: script to create directories and files on mount point
Flags: none

Description senaik 2013-08-29 11:38:22 UTC
Created attachment 791726
script to create directories and files on mount point

Description of problem:
========================
Rebalance reports failures, with errors such as:

 failed to get statfs of /TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a0 on Volume1-client-0 (No such file or directory)


 [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-Volume1-dht: setattr of uid/gid on /TestDir0/TestDir0/TestDir4/TestDir4/TestDir3/a4 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)


Version-Release number of selected component (if applicable):
============================================================= 
3.4.0.24rhs-1.el6rhs.x86_64


How reproducible:
================ 
Not tried 


Steps to Reproduce:
==================== 
1. Create a distributed volume with 4 bricks and start it.

2. NFS-mount the volume and create files using the attached script (CreateDirAndFileTree.pl) with input values 5 5 10 5 5.
Calculate the are-equal checksum on the mount point.

3. Add 3 bricks to the volume and start rebalance.

4. Check rebalance status.

5. While rebalance is in progress, stop the volume.

6. Check status and stop rebalance 3-4 times:

gluster v rebalance Volume1 status
gluster v rebalance Volume1 stop
gluster v rebalance Volume1 status
gluster v rebalance Volume1 stop
gluster v rebalance Volume1 status
gluster v rebalance Volume1 start
gluster v rebalance Volume1 status

gluster v rebalance Volume1 status

Node         Rebalanced-files     size  scanned  failures  skipped     status  run time in secs
-----------  ----------------  -------  -------  --------  -------  ---------  ----------------
localhost                2748   18.9MB    22425         0      916  completed            183.00
10.70.34.88              2890   19.8MB    22031         0     1526  completed            183.00
10.70.34.86                17  124.0KB    20139         2     4543  completed            183.00
10.70.34.87              2700   18.5MB    22162         1     1652  completed            183.00
volume rebalance: Volume1: success:
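
The failures column is easy to overlook because the command still ends with "success:". A minimal Python sketch for flagging nodes with a non-zero failure count, assuming the status output keeps the whitespace-separated column order shown above. The function name nodes_with_failures is hypothetical and not part of any gluster tooling.

```python
def nodes_with_failures(status_output):
    """Return {node: failures} for rows whose failures column is non-zero.

    Assumed column order (taken from the output in this report):
    node, rebalanced-files, size, scanned, failures, skipped, status, run time.
    """
    failing = {}
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # blank line or the trailing "volume rebalance: ..." line
        node, files, _size, scanned, failures, skipped = fields[:6]
        # The header and separator rows have non-numeric values here.
        if not all(f.isdigit() for f in (files, scanned, failures, skipped)):
            continue
        if int(failures) > 0:
            failing[node] = int(failures)
    return failing


SAMPLE = """\
Node  Rebalanced-files  size  scanned  failures  skipped status run time in secs
----  ---------------- -----  -------- -------- -------- ------  ---------------
localhost      2748    18.9MB   22425    0        916    completed     183.00
10.70.34.88    2890    19.8MB   22031    0        1526   completed     183.00
10.70.34.86    17      124.0KB  20139    2        4543   completed     183.00
10.70.34.87    2700    18.5MB   22162    1        1652   completed     183.00
volume rebalance: Volume1: success:
"""

if __name__ == "__main__":
    print(nodes_with_failures(SAMPLE))
```

For the run above this picks out 10.70.34.86 and 10.70.34.87, the two nodes whose logs are quoted under "Additional info".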


Actual results:
=============== 
Rebalance completes, but some nodes report failed migrations.


Expected results:
================ 
Rebalance should not fail 


Additional info:
================= 

----------------------Part of log from 10.70.34.86 ----------------------------
 
[2013-08-29 10:54:49.300175] E [dht-rebalance.c:357:__dht_check_free_space] 0-Volume1-dht: failed to get statfs of /TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a0 on Volume1-client-0 (No such file or directory)
[2013-08-29 10:54:49.304562] I [dht-rebalance.c:672:dht_migrate_file] 0-Volume1-dht: /TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a1: attempting to move from Volume1-client-1 to Volume1-client-0
[2013-08-29 10:54:49.308574] W [dht-rebalance.c:374:__dht_check_free_space] 0-Volume1-dht: data movement attempted from node (Volume1-client-1) with higher disk space to a node (Volume1-client-0) with lesser disk space (/TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a1)
[2013-08-29 10:54:49.308808] E [dht-rebalance.c:1283:gf_defrag_migrate_data] 0-Volume1-dht: migrate-data failed for /TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a1
[2013-08-29 10:54:49.311805] I [dht-rebalance.c:672:dht_migrate_file] 0-Volume1-dht: /TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a3: attempting to move from Volume1-client-1 to Volume1-client-6




---------------------Part of log from 10.70.34.87--------------------- 

[2013-08-29 08:31:31.702990] W [client-rpc-fops.c:1983:client3_3_setattr_cbk] 0-Volume1-client-6: remote operation failed: No such file or directory
[2013-08-29 08:31:31.703015] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-Volume1-dht: setattr of uid/gid on /TestDir0/TestDir0/TestDir4/TestDir4/TestDir3/a4 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
[2013-08-29 08:31:31.719263] I [dht-rebalance.c:1333:gf_defrag_migrate_data] 0-Volume1-dht: Migration operation on dir /TestDir0/TestDir0/TestDir4/TestDir4/TestDir3 took 0.02 secs
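
To see which code paths produce the errors, the E-level lines can be grouped by function name. A minimal Python sketch, assuming every log line follows the layout seen in the excerpts above ([timestamp] LEVEL [file:line:function] domain: message). The regular expression and the helper name errors_by_function are illustrative assumptions, not part of gluster.

```python
import re
from collections import Counter

# Layout assumed from the excerpts in this report:
# [timestamp] LEVEL [file:line:function] domain: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<domain>\S+): (?P<msg>.*)$"
)


def errors_by_function(lines):
    """Count E-level log lines per reporting function."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "E":
            counts[match.group("func")] += 1
    return dict(counts)


# Lines copied from the 10.70.34.86 and 10.70.34.87 excerpts above.
SAMPLE = [
    "[2013-08-29 10:54:49.300175] E [dht-rebalance.c:357:__dht_check_free_space] "
    "0-Volume1-dht: failed to get statfs of "
    "/TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a0 on Volume1-client-0 "
    "(No such file or directory)",
    "[2013-08-29 10:54:49.308808] E [dht-rebalance.c:1283:gf_defrag_migrate_data] "
    "0-Volume1-dht: migrate-data failed for "
    "/TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a1",
    "[2013-08-29 08:31:31.702990] W [client-rpc-fops.c:1983:client3_3_setattr_cbk] "
    "0-Volume1-client-6: remote operation failed: No such file or directory",
    "[2013-08-29 08:31:31.703015] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] "
    "0-Volume1-dht: setattr of uid/gid on "
    "/TestDir0/TestDir0/TestDir4/TestDir4/TestDir3/a4 "
    ":<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)",
]

if __name__ == "__main__":
    for func, count in sorted(errors_by_function(SAMPLE).items()):
        print(func, count)
```

Warning and info lines are ignored, so only the three functions that logged at level E appear in the result.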

Comment 3 senaik 2013-08-29 13:33:47 UTC
Correction to step 5 of 'Steps to Reproduce':

5. While rebalance is in progress, stop the rebalance process (incorrectly written as "stop the volume").

Comment 4 senaik 2013-08-30 08:47:31 UTC
Version : 3.4.0.24rhs-1.el6rhs.x86_64
======== 

Able to reproduce the issue. Steps followed:

- Create a distributed volume with 5 bricks.

- NFS-mount the volume and run the attached script (CreateDirAndFileTree.pl) with input values 5 5 10 5 5 to create deep directories and files.

- Add 3 bricks, start rebalance, and check rebalance status 3-4 times.

- While rebalance is in progress, stop the rebalance process.

- Check rebalance status again and run the rebalance stop command; failures are listed in the output:


Node         Rebalanced-files     size  scanned  failures  skipped     status  run time in secs
-----------  ----------------  -------  -------  --------  -------  ---------  ----------------
localhost                 433    2.9MB     3812         5      178    stopped             35.00
10.70.34.88               575    3.9MB     3657         4      235    stopped             36.00
10.70.34.86                15  102.0KB     3584         5      777    stopped             35.00
10.70.34.87               481    3.3MB     3820         5      318    stopped             35.00
volume rebalance: vol11: success:



-----------------Part of log from 10.70.34.85------------------ 
[2013-08-30 06:35:34.909491] E [dht-common.c:1974:dht_vgetxattr_cbk] 0-vol11-dht: Subvolume vol11-client-1 returned -1 (No such file or directory)
[2013-08-30 06:35:34.909563] E [dht-rebalance.c:1220:gf_defrag_migrate_data] 0-vol11-dht: Failed to get node-uuid for /TestDir0/TestDir0/TestDir0/TestDir4/TestDir4/a4
.
.
[2013-08-30 06:35:42.013105] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol11-dht: setattr of uid/gid on /TestDir0/TestDir1/TestDir0/TestDir0/TestDir2/a3 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
.
.
[2013-08-30 06:36:08.770925] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1/TestDir2/TestDir1
[2013-08-30 06:36:08.771074] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1/TestDir2
[2013-08-30 06:36:08.771216] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1
[2013-08-30 06:36:08.771330] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4
[2013-08-30 06:36:08.771448] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0

---------------------part of log from 10.70.34.86-------------------- 

[2013-08-30 06:35:53.198372] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol11-dht: setattr of uid/gid on /TestDir0/TestDir2/TestDir1/TestDir4/a0 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)


[2013-08-30 06:36:08.742209] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1/TestDir2/TestDir1
[2013-08-30 06:36:08.742565] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1/TestDir2
[2013-08-30 06:36:08.742887] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1
[2013-08-30 06:36:08.743192] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4
[2013-08-30 06:36:08.743500] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0
[2013-08-30 06:36:08.743892] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 35.00 secs
[2013-08-30 06:36:08.743906] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 15, size: 104448, lookups: 3584, failures: 5, skipped: 777


-------------Part of log from 10.70.34.87------------------------- 
[2013-08-30 04:12:25.123270] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol11-dht: setattr of uid/gid on /TestDir0/TestDir1/TestDir0/TestDir1/a2 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
.
.
[2013-08-30 04:12:51.707652] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1/TestDir2
[2013-08-30 04:12:51.707776] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1
[2013-08-30 04:12:51.707880] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4
[2013-08-30 04:12:51.708015] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0
[2013-08-30 04:12:51.708145] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 35.00 secs
[2013-08-30 04:12:51.708156] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 481, size: 3431424, lookups: 3820, failures: 5, skipped: 318


----------------part of log from 10.70.34.88----------------------- 



[2013-08-30 04:12:52.229875] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir0/TestDir4
[2013-08-30 04:12:52.230010] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir0
[2013-08-30 04:12:52.230150] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4
[2013-08-30 04:12:52.230318] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0
[2013-08-30 04:12:52.230463] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 36.00 secs
[2013-08-30 04:12:52.230476] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 575, size: 4132864, lookups: 3657, failures: 4, skipped: 235
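
The summary lines logged by gf_defrag_status_get can be cross-checked against the status table in this comment. A minimal Python sketch, assuming the summary keeps the field order shown in the logs and that the CLI sizes are the byte counts divided by 1024 and rounded to one decimal. That scaling is inferred only from the numbers in this report, and the helper names parse_summary and human_size are hypothetical.

```python
import re

SUMMARY = re.compile(
    r"Files migrated: (\d+), size: (\d+), lookups: (\d+), "
    r"failures: (\d+), skipped: (\d+)"
)


def parse_summary(line):
    """Extract the counters from a gf_defrag_status_get summary line."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    keys = ("migrated", "size", "lookups", "failures", "skipped")
    return dict(zip(keys, (int(value) for value in match.groups())))


def human_size(nbytes):
    """Scale a byte count by 1024 per step (assumed to mirror the CLI column)."""
    value = float(nbytes)
    for unit in ("Bytes", "KB", "MB", "GB", "TB"):
        if value < 1024 or unit == "TB":
            return f"{value:.1f}{unit}"
        value /= 1024


# Summary line copied from the 10.70.34.86 log above.
LINE_86 = (
    "[2013-08-30 06:36:08.743906] I [dht-rebalance.c:1769:gf_defrag_status_get] "
    "0-glusterfs: Files migrated: 15, size: 104448, lookups: 3584, "
    "failures: 5, skipped: 777"
)

if __name__ == "__main__":
    summary = parse_summary(LINE_86)
    print(summary, human_size(summary["size"]))
```

With the logged values, 104448 bytes comes out as 102.0KB, 3431424 as 3.3MB and 4132864 as 3.9MB, which agrees with the size column for 10.70.34.86, 10.70.34.87 and 10.70.34.88. The "lookups" counter in the log also equals the "scanned" column for each of those nodes.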

Comment 6 senaik 2013-09-07 12:54:14 UTC
Version : glusterfs 3.4.0.32rhs
=======

Saw rebalance failures when running rebalance stop while rebalance was in progress. The mount point had 1 directory containing 500 files of 10MB each.

The issue is seen quite often when rebalance is stopped while it is in progress.

Steps followed : 
---------------
1) Create a distribute volume and start it.

2) FUSE-mount the volume and create a directory and some files in it:
for i in {1..500} ; do dd if=/dev/urandom of=f"$i" bs=10M count=1; done

3) Add a brick and start rebalance; while rebalance is running, stop it:
gluster v rebalance vol5 stop

Node         Rebalanced-files     size  scanned  failures  skipped     status  run time in secs
-----------  ----------------  -------  -------  --------  -------  ---------  ----------------
localhost                  31  310.0MB      158         1        0    stopped             15.00
10.70.34.86                38  380.0MB       53         1        0    stopped             15.00
10.70.34.88                29  290.0MB      236         1        0    stopped             15.00
10.70.34.89                25  250.0MB      358         1        0    stopped             15.00
volume rebalance: vol5: success: rebalance process may be in the middle of a file migration.
The process will be fully stopped once the migration of the file is complete.
Please check rebalance process for completion before doing any further brick related tasks on the volume.

4) Execute the rebalance stop and status commands 3-4 times.
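
For anyone re-running this, the repeated stop/status step can be scripted. A minimal Python sketch that only builds the command lines, assuming the long form "gluster volume rebalance <volume> <action>" of the commands used in this report. The helper names rebalance_cmd, stop_status_sequence and run_all are hypothetical, and run_all needs a node with the gluster CLI installed.

```python
import subprocess

ACTIONS = ("start", "stop", "status")


def rebalance_cmd(volume, action):
    """Build the argv for one rebalance sub-command."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported rebalance action: {action}")
    return ["gluster", "volume", "rebalance", volume, action]


def stop_status_sequence(volume, rounds=3):
    """Alternate stop and status, as done by hand in step 4."""
    commands = []
    for _ in range(rounds):
        commands.append(rebalance_cmd(volume, "stop"))
        commands.append(rebalance_cmd(volume, "status"))
    return commands


def run_all(commands):
    """Run each command and print its output (requires the gluster CLI)."""
    for argv in commands:
        result = subprocess.run(argv, capture_output=True, text=True)
        print(" ".join(argv))
        print(result.stdout)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in stop_status_sequence("vol5"):
        print(" ".join(argv))
```

Keeping command construction separate from execution lets the sequence be reviewed or logged before it is run against a live volume.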

--------------part of log from 10.70.34.85-------------------

[2013-09-07 15:25:23.445101] I [dht-rebalance.c:881:dht_migrate_file] 0-vol5-dht: completed migration of /dir1/f228 from subvolume vol5-client-1 to vol5-client-3
[2013-09-07 15:25:23.445733] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol5-dht: Fix layout failed for /dir1
[2013-09-07 15:25:23.445947] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 15.00 secs

--------------Part of log from 10.70.34.86-----------------------------

[2013-09-07 09:55:23.114692] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 37, size: 387973120, lookups: 53, failures: 0, skipped: 0
[2013-09-07 09:55:23.237779] I [dht-rebalance.c:881:dht_migrate_file] 0-vol5-dht: completed migration of /dir1/f237 from subvolume vol5-client-0 to vol5-client-2
[2013-09-07 09:55:23.238390] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol5-dht: Fix layout failed for /dir1
[2013-09-07 09:55:23.238612] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 15.00 secs
[2013-09-07 09:55:23.238634] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 38, size: 398458880, lookups: 53, failures: 1, skipped: 0

-------------part of log from 10.70.34.88-------------------------

[2013-09-07 09:54:36.336859] W [client-rpc-fops.c:1983:client3_3_setattr_cbk] 0-vol5-client-2: remote operation failed: No such file or directory
[2013-09-07 09:54:36.336888] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol5-dht: setattr of uid/gid on /dir1/f274 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)


[2013-09-07 09:55:23.243013] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol5-dht: Fix layout failed for /dir1
[2013-09-07 09:55:23.243195] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 15.00 secs
[2013-09-07 09:55:23.243214] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 29, size: 304087040, lookups: 236, failures: 1, skipped: 0

------------------part of log from 10.70.34.89-----------------------

[2013-09-07 09:54:36.302279] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol5-dht: setattr of uid/gid on /dir1/f266 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
[2013-09-07 09:54:36.310836] W [client-rpc-fops.c:1983:client3_3_setattr_cbk] 0-vol5-client-2: remote operation failed: No such file or directory
[2013-09-07 09:54:36.310864] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol5-dht: setattr of uid/gid on /dir1/f274 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
[2013-09-07 09:54:36.346799] W [client-rpc-fops.c:256:client3_3_mknod_cbk] 0-vol5-client-2: remote operation failed: File exists. Path: /dir1/f332
[2013-09-07 09:54:36.349005] W [client-rpc-fops.c:256:client3_3_mknod_cbk] 0-vol5-client-2: remote operation failed: File exists. Path: /dir1/f345
[2013-09-07 09:54:36.353140] I [dht-common.c:1035:dht_lookup_everywhere_cbk] 0-vol5-dht: deleting stale linkfile /dir1/f361 on vol5-client-2
[2013-09-07 09:54:36.357879] I [dht-common.c:1035:dht_lookup_everywhere_cbk] 0-vol5-dht: deleting stale linkfile /dir1/f363 on vol5-client-2


[2013-09-07 09:55:23.395009] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol5-dht: Fix layout failed for /dir1
[2013-09-07 09:55:23.395287] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 15.00 secs
[2013-09-07 09:55:23.395334] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 25, size: 262144000, lookups: 358, failures: 1, skipped: 0

---------------------------------------------------------------