Bug 1063606 - DHT: REBALANCE- spurious data movements upon starting rebalance
Summary: DHT: REBALANCE- spurious data movements upon starting rebalance
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: distribute
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nithya Balachandran
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1286129
 
Reported: 2014-02-11 06:31 UTC by shylesh
Modified: 2015-11-27 11:40 UTC
CC: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 1286129
Environment:
Last Closed: 2015-11-27 11:40:33 UTC
Embargoed:



Description shylesh 2014-02-11 06:31:31 UTC
Description of problem:
Rebalance tries to migrate some files source ---> dest ---> source: a file is moved to the new brick and then moved back again.

Version-Release number of selected component (if applicable):
3.4.0.59rhs-1.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a single-brick distribute volume.
2. Create 10 files, named 1 to 10, on the mount point (all files land on the same brick).
3. Add one more brick and start rebalance:
   gluster volume rebalance <vol> start
4. Once the migration is complete (there may be some skipped files due to insufficient space), check the file distribution on the backend bricks.


Actual results:
A file was moved from brick0 to brick1, and the rebalance process on brick1's node then tried to move the same file back to brick0's node.
 


Additional info:
[root@rhs-client9 ser]# gluster v info ser
 
Volume Name: ser
Type: Distribute
Volume ID: 186b3c85-81d4-4d09-81bd-847cbd4178d3
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: rhs-client9.lab.eng.blr.redhat.com:/home/ser0  --> initial brick
Brick2: rhs-client39.lab.eng.blr.redhat.com:/home/ser1 --> brick added later

Initially ser0 held all 10 files.
After rebalance, ls output from the bricks:


brick0
--------
[root@rhs-client9 ser]# ll /home/ser0
total 5120
---------T 2 root root 1048576 Feb 11 11:04 1
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 10
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 2
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 3
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 4
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 6


brick1
=======
[root@rhs-client39 ser1]# ll
total 5120
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 1
---------T 2 root root       0 Feb 11 11:04 10
---------T 2 root root       0 Feb 11 11:04 2
---------T 2 root root       0 Feb 11 11:04 3
---------T 2 root root       0 Feb 11 11:04 4
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 5
---------T 2 root root       0 Feb 11 11:04 6
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 7
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 8
-rw-r--r-- 2 root root 1048576 Feb 11 11:02 9

Why are all the files shown on brick1 when it is supposed to hold only a few of them?
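
The ---------T entries in the listings above are zero-byte files carrying only the sticky bit, which is how DHT represents link files on the brick (see comment 3). A minimal standalone sketch of a heuristic check, assuming a POSIX system (not GlusterFS code; the authoritative marker is the trusted.glusterfs.dht.linkto xattr, which this sketch does not read):

#include <stdio.h>
#include <sys/stat.h>

/* Heuristic for a DHT link file as seen on a brick: a zero-byte
 * regular file whose permission bits are exactly the sticky bit,
 * which ls renders as ---------T. */
static int looks_like_dht_linkfile(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0)
        return 0;

    return S_ISREG(st.st_mode)
        && st.st_size == 0
        && (st.st_mode & 07777) == S_ISVTX;
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++)
        printf("%s: %s\n", argv[i],
               looks_like_dht_linkfile(argv[i]) ? "linkto" : "data");
    return 0;
}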

Rebalance log from rhs-client9.lab.eng.blr.redhat.com, which hosts the first brick
=======================================

[2014-02-11 05:34:01.030319] I [dht-rebalance.c:672:dht_migrate_file] 0-ser-dht: /1: attempting to move from ser-client-0 to ser-client-1
[2014-02-11 05:34:01.115913] I [dht-rebalance.c:881:dht_migrate_file] 0-ser-dht: completed migration of /1 from subvolume ser-client-0 to ser-client-1


This means file 1 was migrated to the other brick.


Rebalance log from rhs-client39.lab.eng.blr.redhat.com, which hosts the second brick
---------------------------------------
[2014-02-11 05:34:05.037926] I [dht-rebalance.c:1121:gf_defrag_migrate_data] 0-ser-dht: migrate data called on /
[2014-02-11 05:34:05.117622] I [dht-rebalance.c:672:dht_migrate_file] 0-ser-dht: /1: attempting to move from ser-client-1 to ser-client-0
[2014-02-11 05:34:05.121093] W [dht-rebalance.c:374:__dht_check_free_space] 0-ser-dht: data movement attempted from node (ser-client-1) with higher disk space to a node (ser-client-0) with lesser disk space (/1)


Once file 1 has been migrated, why is rebalance trying to migrate it back again?


cluster info
==============
rhs-client9.lab.eng.blr.redhat.com
rhs-client39.lab.eng.blr.redhat.com

mount point
----------
rhs-client9.lab.eng.blr.redhat.com:/ser


Attaching the sosreports.

Comment 3 vsomyaju 2014-02-12 08:24:05 UTC
From debugging, found that:

Link file creation is happening because the maximum overlap cannot be stored in the variable used for it: if the overlap range runs from 0x00000000 to 0xffffffff, the total length, which is 0xffffffff + 1, cannot be stored in a uint32_t. Different rebalance processes will therefore set different layouts, and this ends up in link file creation.
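
A minimal standalone sketch of the arithmetic (illustrative variable names, not the GlusterFS layout code): the inclusive length of the full 32-bit hash range, end - start + 1, wraps to 0 in a uint32_t, while a 64-bit accumulator holds the true value:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t start = 0x00000000;
    uint32_t end   = 0xffffffff;

    /* Inclusive length of [start, end] is end - start + 1.
     * For the full hash space this is 0x100000000, which needs
     * 33 bits, so the addition wraps to 0 in a uint32_t. */
    uint32_t overlap = end - start + 1;
    printf("overlap   = 0x%08x\n", overlap);   /* prints 0x00000000 */

    /* Widening to 64 bits before the arithmetic avoids the wrap. */
    uint64_t overlap64 = (uint64_t)end - start + 1;
    printf("overlap64 = 0x%llx\n", (unsigned long long)overlap64);  /* 0x100000000 */

    return 0;
}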

Comment 4 shylesh 2014-07-30 17:41:24 UTC
I could reproduce this bug easily on a 50-node setup.

Created a 46-brick distribute volume, created 100 files on the mount point with touch f{1..100}, added 4 more bricks, and ran rebalance.

log snippet
===========
Node 1
=======
[2014-07-30 10:30:33.793257] I [dht-rebalance.c:672:dht_migrate_file] 0-newvol-dht: /f28: attempting to move from newvol-client-28 to newvol-client-4
[2014-07-30 10:30:33.820710] I [dht-rebalance.c:881:dht_migrate_file] 0-newvol-dht: completed migration of /f28 from subvolume newvol-client-28 to newvol-client-4


Node2
=====
[2014-07-30 10:30:32.425980] I [dht-rebalance.c:672:dht_migrate_file] 0-newvol-dht: /f28: attempting to move from newvol-client-29 to newvol-client-28
[2014-07-30 10:30:32.453402] I [dht-rebalance.c:881:dht_migrate_file] 0-newvol-dht: completed migration of /f28 from subvolume newvol-client-29 to newvol-client-28

Node2 migrated the file from client-29 to client-28; later, Node1 migrated the same file from client-28 to client-4.

The logs above show that a file can be migrated again and again, which has a serious performance impact.

Comment 5 shylesh 2014-07-30 17:42:28 UTC
The build used for the testing in comment 4 is 3.4.0.59rhs-1.el6rhs.x86_64.

Comment 6 Susant Kumar Palai 2015-11-27 11:40:33 UTC
Cloning this to 3.1. To be fixed in a future release.

