| Summary: | Rebalance failures on Distribute Volume | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | senaik | ||||
| Component: | distribute | Assignee: | Nithya Balachandran <nbalacha> | ||||
| Status: | CLOSED WONTFIX | QA Contact: | storage-qa-internal <storage-qa-internal> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 2.1 | CC: | poelstra, rhs-bugs, spalai, vbellur | ||||
| Target Milestone: | --- | Keywords: | ZStream | ||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2015-11-27 12:08:25 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Attachments: |
Correction to step 5 of 'Steps to Reproduce':
5. While rebalance is in progress, stop the rebalance process (incorrectly mentioned as stop the volume)

Version : 3.4.0.24rhs-1.el6rhs.x86_64
========
Able to reproduce the issue.
Steps followed :
- Created a distributed volume with 5 bricks
- NFS mount the volume and run the attached script (CreateDirAndFileTree.pl) with input values 5 5 10 5 5 to create deep directories and files
- Add 3 bricks and start rebalance and check rebalance status 3-4 times
- While rebalance is in progress, stop rebalance process
- Check rebalance status again and execute rebalance stop command; failures are listed in the output

Node          Rebalanced-files  size     scanned   failures  skipped   status   run time in secs
----          ----------------  -----    --------  --------  --------  ------   ---------------
localhost     433               2.9MB    3812      5         178       stopped  35.00
10.70.34.88   575               3.9MB    3657      4         235       stopped  36.00
10.70.34.86   15                102.0KB  3584      5         777       stopped  35.00
10.70.34.87   481               3.3MB    3820      5         318       stopped  35.00
volume rebalance: vol11: success:

-----------------Part of log from 10.70.34.85------------------
[2013-08-30 06:35:34.909491] E [dht-common.c:1974:dht_vgetxattr_cbk] 0-vol11-dht: Subvolume vol11-client-1 returned -1 (No such file or directory)
[2013-08-30 06:35:34.909563] E [dht-rebalance.c:1220:gf_defrag_migrate_data] 0-vol11-dht: Failed to get node-uuid for /TestDir0/TestDir0/TestDir0/TestDir4/TestDir4/a4
.
.
[2013-08-30 06:35:42.013105] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol11-dht: setattr of uid/gid on /TestDir0/TestDir1/TestDir0/TestDir0/TestDir2/a3 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
.
.
[2013-08-30 06:36:08.770925] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1/TestDir2/TestDir1
[2013-08-30 06:36:08.771074] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1/TestDir2
[2013-08-30 06:36:08.771216] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1
[2013-08-30 06:36:08.771330] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4
[2013-08-30 06:36:08.771448] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0
---------------------part of log from 10.70.34.86--------------------
[2013-08-30 06:35:53.198372] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol11-dht: setattr of uid/gid on /TestDir0/TestDir2/TestDir1/TestDir4/a0 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
[2013-08-30 06:36:08.742209] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1/TestDir2/TestDir1
[2013-08-30 06:36:08.742565] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1/TestDir2
[2013-08-30 06:36:08.742887] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1
[2013-08-30 06:36:08.743192] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4
[2013-08-30 06:36:08.743500] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0
[2013-08-30 06:36:08.743892] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 35.00 secs
[2013-08-30 06:36:08.743906] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 15, size: 104448, lookups: 3584, failures: 5, skipped: 777
-------------Part of log from 10.70.34.87-------------------------
[2013-08-30 04:12:25.123270] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol11-dht: setattr of uid/gid on /TestDir0/TestDir1/TestDir0/TestDir1/a2 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
.
.
[2013-08-30 04:12:51.707652] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1/TestDir2
[2013-08-30 04:12:51.707776] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir1
[2013-08-30 04:12:51.707880] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4
[2013-08-30 04:12:51.708015] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0
[2013-08-30 04:12:51.708145] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 35.00 secs
[2013-08-30 04:12:51.708156] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 481, size: 3431424, lookups: 3820, failures: 5, skipped: 318
----------------part of log from 10.70.34.88-----------------------
[2013-08-30 04:12:52.229875] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir0/TestDir4
[2013-08-30 04:12:52.230010] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4/TestDir0
[2013-08-30 04:12:52.230150] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0/TestDir4
[2013-08-30 04:12:52.230318] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol11-dht: Fix layout failed for /TestDir0
[2013-08-30 04:12:52.230463] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 36.00 secs
[2013-08-30 04:12:52.230476] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 575, size: 4132864, lookups: 3657, failures: 4, skipped: 235

sosreports for comment 4 :
http://rhsqe-repo.lab.eng.blr.redhat.com/bugs_necessary_info/1002521/1002521_30_Aug/

Version : glusterfs 3.4.0.32rhs
=======
Rebalance reported failures when rebalance stop was issued while rebalance was running. The mount point had 1 directory containing 500 files of 10MB each.
The issue is seen quite often when rebalance is stopped while it is in progress.
Steps followed :
---------------
1) Create a distribute volume and start it
2) Fuse mount the volume and create a directory and some files in it
for i in {1..500} ; do dd if=/dev/urandom of=f"$i" bs=10M count=1; done
3) Add brick and start rebalance , while rebalance is running stop rebalance
gluster v rebalance vol5 stop
Node          Rebalanced-files  size     scanned  failures  skipped  status   run time in secs
localhost     31                310.0MB  158      1         0        stopped  15.00
10.70.34.86   38                380.0MB  53       1         0        stopped  15.00
10.70.34.88   29                290.0MB  236      1         0        stopped  15.00
10.70.34.89   25                250.0MB  358      1         0        stopped  15.00
volume rebalance: vol5: success: rebalance process may be in the middle of a file migration.
The process will be fully stopped once the migration of the file is complete.
Please check rebalance process for completion before doing any further brick related tasks on the volume.
4) execute rebalance stop and status command 3-4 times
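Steps 3 and 4 can be sketched as a small script. This is only a dry-run sketch, not the exact commands used for this report: the brick path, the 5 second wait and the helper names `run` and `reproduce` are assumptions, and by default the script only prints the gluster commands instead of executing them.

```shell
#!/bin/sh
# Dry-run sketch of steps 3 and 4. VOL matches the report; NEW_BRICK and WAIT
# are assumptions and must be adapted to the cluster under test.
VOL=${VOL:-vol5}
NEW_BRICK=${NEW_BRICK:-server1:/bricks/vol5-new}   # hypothetical brick path
WAIT=${WAIT:-5}                                    # arbitrary delay before stopping
DRY_RUN=${DRY_RUN:-1}                              # 1 = only print the commands

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    # Step 3: add a brick, start rebalance, stop it while it is still running.
    run gluster volume add-brick "$VOL" "$NEW_BRICK"
    run gluster volume rebalance "$VOL" start
    run sleep "$WAIT"
    run gluster volume rebalance "$VOL" stop
    # Step 4: repeat stop and status a few times and watch the failures column.
    for i in 1 2 3; do
        run gluster volume rebalance "$VOL" stop
        run gluster volume rebalance "$VOL" status
    done
}
```

Calling `reproduce` with the defaults prints the commands in order without touching any volume; setting DRY_RUN=0 on a disposable test cluster executes them.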
--------------part of log from 10.70.34.85-------------------
[2013-09-07 15:25:23.445101] I [dht-rebalance.c:881:dht_migrate_file] 0-vol5-dht: completed migration of /dir1/f228 from subvolume vol5-client-1 to vol5-client-3
[2013-09-07 15:25:23.445733] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol5-dht: Fix layout failed for /dir1
[2013-09-07 15:25:23.445947] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 15.00 secs
--------------Part of log from 10.70.34.86-----------------------------
[2013-09-07 09:55:23.114692] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 37, size: 387973120, lookups: 53, failures: 0, skipped: 0
[2013-09-07 09:55:23.237779] I [dht-rebalance.c:881:dht_migrate_file] 0-vol5-dht: completed migration of /dir1/f237 from subvolume vol5-client-0 to vol5-client-2
[2013-09-07 09:55:23.238390] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol5-dht: Fix layout failed for /dir1
[2013-09-07 09:55:23.238612] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 15.00 secs
[2013-09-07 09:55:23.238634] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 38, size: 398458880, lookups: 53, failures: 1, skipped: 0
-------------part of log from 10.70.34.88-------------------------
[2013-09-07 09:54:36.336859] W [client-rpc-fops.c:1983:client3_3_setattr_cbk] 0-vol5-client-2: remote operation failed: No such file or directory
[2013-09-07 09:54:36.336888] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol5-dht: setattr of uid/gid on /dir1/f274 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
[2013-09-07 09:55:23.243013] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol5-dht: Fix layout failed for /dir1
[2013-09-07 09:55:23.243195] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 15.00 secs
[2013-09-07 09:55:23.243214] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 29, size: 304087040, lookups: 236, failures: 1, skipped: 0
------------------part of log from 10.70.34.89-----------------------
[2013-09-07 09:54:36.302279] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol5-dht: setattr of uid/gid on /dir1/f266 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
[2013-09-07 09:54:36.310836] W [client-rpc-fops.c:1983:client3_3_setattr_cbk] 0-vol5-client-2: remote operation failed: No such file or directory
[2013-09-07 09:54:36.310864] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-vol5-dht: setattr of uid/gid on /dir1/f274 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
[2013-09-07 09:54:36.346799] W [client-rpc-fops.c:256:client3_3_mknod_cbk] 0-vol5-client-2: remote operation failed: File exists. Path: /dir1/f332
[2013-09-07 09:54:36.349005] W [client-rpc-fops.c:256:client3_3_mknod_cbk] 0-vol5-client-2: remote operation failed: File exists. Path: /dir1/f345
[2013-09-07 09:54:36.353140] I [dht-common.c:1035:dht_lookup_everywhere_cbk] 0-vol5-dht: deleting stale linkfile /dir1/f361 on vol5-client-2
[2013-09-07 09:54:36.357879] I [dht-common.c:1035:dht_lookup_everywhere_cbk] 0-vol5-dht: deleting stale linkfile /dir1/f363 on vol5-client-2
[2013-09-07 09:55:23.395009] E [dht-rebalance.c:1483:gf_defrag_fix_layout] 0-vol5-dht: Fix layout failed for /dir1
[2013-09-07 09:55:23.395287] I [dht-rebalance.c:1766:gf_defrag_status_get] 0-glusterfs: Rebalance is stopped. Time taken is 15.00 secs
[2013-09-07 09:55:23.395334] I [dht-rebalance.c:1769:gf_defrag_status_get] 0-glusterfs: Files migrated: 25, size: 262144000, lookups: 358, failures: 1, skipped: 0
---------------------------------------------------------------
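When comparing the per-node log excerpts above, it can help to count the error-level entries by the function that logged them and to pull the failure count from the final summary line. The following is a minimal sketch that assumes the log line layout shown in this report (`[timestamp] E [file.c:line:function] ...`); the function names `count_errors_by_function` and `last_failure_count` are made up for illustration, and the log path passed to them is whatever rebalance log file is being inspected.

```shell
#!/bin/sh
# Count error-level ("E") log entries per logging function.
# $1 is the path to a rebalance log file (chosen by the caller).
count_errors_by_function() {
    awk '
        $3 == "E" {
            loc = substr($4, 2, length($4) - 2)   # drop the surrounding [ ]
            n = split(loc, parts, ":")            # file.c : line : function
            errors[parts[n]]++
        }
        END { for (f in errors) print f, errors[f] }
    ' "$1" | sort
}

# Print the "failures" value from the last "Files migrated" summary line.
last_failure_count() {
    awk '
        /Files migrated:/ && match($0, /failures: [0-9]+/) {
            last = substr($0, RSTART + 10, RLENGTH - 10)
        }
        END { if (last != "") print last }
    ' "$1"
}
```

Warning ("W") and info ("I") entries are ignored by the first function, so the mknod "File exists" warnings in the 10.70.34.89 excerpt would not be counted.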
Created attachment 791726
script to create directories and files on mount point

Description of problem:
========================
Rebalance lists failures with errors :
failed to get statfs of /TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a0 on Volume1-client-0 (No such file or directory)
[dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-Volume1-dht: setattr of uid/gid on /TestDir0/TestDir0/TestDir4/TestDir4/TestDir3/a4 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)

Version-Release number of selected component (if applicable):
=============================================================
3.4.0.24rhs-1.el6rhs.x86_64

How reproducible:
================
Not tried

Steps to Reproduce:
====================
1. Create a distributed volume with 4 bricks and start it
2. NFS mount the volume and create files using the attached script (CreateDirAndFileTree.pl) with input values 5 5 10 5 5, then calculate are-equal checksum on mount point
3. Add 3 bricks to the volume and start rebalance
4. Check rebalance status
5. While rebalance is in progress, stop the volume
6. Checked status and stopped rebalance 3-4 times
gluster v rebalance Volume1 status
gluster v rebalance Volume1 stop
gluster v rebalance Volume1 status
gluster v rebalance Volume1 stop
gluster v rebalance Volume1 status
gluster v rebalance Volume1 start
gluster v rebalance Volume1 status

gluster v rebalance Volume1 status
Node          Rebalanced-files  size     scanned   failures  skipped   status     run time in secs
----          ----------------  -----    --------  --------  --------  ------     ---------------
localhost     2748              18.9MB   22425     0         916       completed  183.00
10.70.34.88   2890              19.8MB   22031     0         1526      completed  183.00
10.70.34.86   17                124.0KB  20139     2         4543      completed  183.00
10.70.34.87   2700              18.5MB   22162     1         1652      completed  183.00
volume rebalance: Volume1: success:

Actual results:
===============
Rebalance fails

Expected results:
================
Rebalance should not fail

Additional info:
=================
----------------------Part of log from 10.70.34.86 ----------------------------
[2013-08-29 10:54:49.300175] E [dht-rebalance.c:357:__dht_check_free_space] 0-Volume1-dht: failed to get statfs of /TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a0 on Volume1-client-0 (No such file or directory)
[2013-08-29 10:54:49.304562] I [dht-rebalance.c:672:dht_migrate_file] 0-Volume1-dht: /TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a1: attempting to move from Volume1-client-1 to Volume1-client-0
[2013-08-29 10:54:49.308574] W [dht-rebalance.c:374:__dht_check_free_space] 0-Volume1-dht: data movement attempted from node (Volume1-client-1) with higher disk space to a node (Volume1-client-0) with lesser disk space (/TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a1)
[2013-08-29 10:54:49.308808] E [dht-rebalance.c:1283:gf_defrag_migrate_data] 0-Volume1-dht: migrate-data failed for /TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a1
[2013-08-29 10:54:49.311805] I [dht-rebalance.c:672:dht_migrate_file] 0-Volume1-dht: /TestDir0/TestDir1/TestDir0/TestDir0/TestDir3/a3: attempting to move from Volume1-client-1 to Volume1-client-6
---------------------Part of log from 10.70.34.87---------------------
[2013-08-29 08:31:31.702990] W [client-rpc-fops.c:1983:client3_3_setattr_cbk] 0-Volume1-client-6: remote operation failed: No such file or directory
[2013-08-29 08:31:31.703015] E [dht-linkfile.c:287:dht_linkfile_setattr_cbk] 0-Volume1-dht: setattr of uid/gid on /TestDir0/TestDir0/TestDir4/TestDir4/TestDir3/a4 :<gfid:00000000-0000-0000-0000-000000000000> failed (No such file or directory)
[2013-08-29 08:31:31.719263] I [dht-rebalance.c:1333:gf_defrag_migrate_data] 0-Volume1-dht: Migration operation on dir /TestDir0/TestDir0/TestDir4/TestDir4/TestDir3 took 0.02 secs
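The rebalance status output in this description shows non-zero failure counts on only some nodes even though the overall status is "completed". A short filter can list just those nodes. This is a sketch that assumes the eight-column status layout shown in this report (node, rebalanced-files, size, scanned, failures, skipped, status, run time); the function name `nodes_with_failures` is made up for illustration.

```shell
#!/bin/sh
# Read "gluster volume rebalance <VOLNAME> status" output on stdin and print
# "node failures" for every node whose failures column is greater than zero.
# Header, separator and trailing message lines are skipped because they do
# not have exactly eight fields with a numeric fifth field.
nodes_with_failures() {
    awk 'NF == 8 && $5 ~ /^[0-9]+$/ && $5 > 0 { print $1, $5 }'
}
```

A typical use would be `gluster volume rebalance Volume1 status | nodes_with_failures`; an empty result means no node reported a failure.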