Created attachment 697648 [details]
Rebalance logs, brick logs, and mount logs are attached.

Description of problem:
Rebalance process does not finish on a Distributed-Replicated volume.

Version-Release number of selected component (if applicable):

How reproducible:
Quite often

Steps to Reproduce:
----------------------
1. Create a 2x2 Distributed-Replicate volume.
2. Mount the volume and fill it with sparse files (files with holes):
   dd if=/dev/urandom of=file_with_holes bs=1M count=500 seek=100M
3. Add 2 bricks and initiate rebalance.

Actual results:
Rebalance process does not finish.

Volume Info:
--------------
Volume Name: Dist_Repl
Type: Distributed-Replicate
Volume ID: 6d1c0d01-1135-4bfb-8e52-43becb7c52ac
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.34.105:/export/B1
Brick2: 10.70.34.85:/export/B2
Brick3: 10.70.34.105:/export/B3
Brick4: 10.70.34.85:/export/B4
Brick5: 10.70.34.85:/export/B5
Brick6: 10.70.34.105:/export/B6
Options Reconfigured:
diagnostics.client-log-level: DEBUG

According to the logs, the file was being migrated from the replica-1 child to the replica-2 child. Checking getfattr on the bricks shows that trusted.glusterfs.dht.linkto is set to replica pair 2 on one brick and to replica pair 1 on the other, so the rebalance process hangs.

From Replica-1 child:
-----------------------
[root@fillmore ~]# getfattr -m . -d -e text /export/B*/file_*
getfattr: Removing leading '/' from absolute path names
# file: export/B3/file_with_holes
trusted.afr.Dist_Repl-client-2="\000\000\000\000\000\000\000\000\000\000\000"
trusted.afr.Dist_Repl-client-3="\000\000\000\000\000\000\000\000\000\000\000"
trusted.gfid="��I LV���dB ��"
trusted.glusterfs.dht.linkto="Dist_Repl-replicate-2"

From Replica-2 child:
------------------------
# file: export/B6/file_with_holes
trusted.gfid="��I LV���dB ��"
trusted.glusterfs.dht.linkto="Dist_Repl-replicate-1"

Expected results:

Additional info:
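For reference, a minimal shell sketch of the reproduction steps above; the hostnames (host1/host2) and mount point are placeholders, while brick paths follow the volume info in this report:

# 1. Create and start a 2x2 Distributed-Replicate volume (4 bricks, replica 2)
gluster volume create Dist_Repl replica 2 \
        host1:/export/B1 host2:/export/B2 host1:/export/B3 host2:/export/B4
gluster volume start Dist_Repl

# 2. Mount over FUSE and create a file with holes: seek=100M skips
#    100M blocks of 1MiB (~100TiB of holes) before writing 500MiB of data
mount -t glusterfs host1:/Dist_Repl /mnt/test
dd if=/dev/urandom of=/mnt/test/file_with_holes bs=1M count=500 seek=100M

# 3. Add one more replica pair and start rebalance
gluster volume add-brick Dist_Repl host2:/export/B5 host1:/export/B6
gluster volume rebalance Dist_Repl start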
This issue was found on gluster version 3.3.0.5rhs-40.el6rhs.x86_64
Files having the 'linkto' xattr is fine in itself; the question is why the process didn't complete.
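One quick way to confirm the inconsistency is to dump only the DHT linkto xattr from each replica pair's copy of the file; a sketch using the brick paths from the report above:

# Compare the linkto value on each replica pair's copy of the file
for brick in /export/B3 /export/B6; do
    echo "== $brick =="
    getfattr -n trusted.glusterfs.dht.linkto -e text $brick/file_with_holes
done
# Here each copy's linkto points at the *other* subvolume, so neither
# copy unambiguously identifies the real location of the data.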
The file created has an apparent size of 101TB.

[root@localhost ~]# gluster volume info

Volume Name: sng
Type: Distributed-Replicate
Volume ID: 999d7043-d36e-4201-bd8b-67298cbfffd6
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: vm1:/export/dir1
Brick2: vm1:/export/dir2
Brick3: vm1:/export/dir3
Brick4: vm1:/export/dir4
Brick5: vm1:/export/dir5
Brick6: vm1:/export/dir6
Options Reconfigured:
diagnostics.client-log-level: INFO

[root@localhost ~]# ls -lh /mnt/test/file_with_holes
-rw-r--r--. 1 root root 101T Feb 18 10:47 /mnt/test/file_with_holes

I attached to the running rebalance process and found that it was progressing:

Breakpoint 1, syncop_readv (subvol=0x20cbad0, fd=0x7f1d6c001114, size=131072, off=17886891999232, flags=0, vector=0x7f1d7377bf90, count=0x7f1d7377bfac, iobref=0x7f1d7377bf88) at syncop.c:1069

gluster> volume rebalance sng status
     Node   Rebalanced-files     size   scanned   failures        status   run time in secs
---------   ----------------   ------   -------   --------   -----------   ----------------
localhost                  0   0Bytes         1          0   in progress           64718.00
volume rebalance: sng: success:

As of now it has migrated through around 17TB (the offset field) of the file in 64718 seconds, which translates to roughly ~270MB/s. The rebalance process is continuing as expected.
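The 101TB is apparent size, not allocated data: in the dd command from the original report, seek=100M skips 100M blocks of bs=1M (100TiB of holes) before writing 500MiB, and ls -h rounds that up to 101T. A quick way to see the difference on the mount, using the paths from this comment:

# Apparent size vs. blocks actually allocated for the sparse file
ls -lh /mnt/test/file_with_holes   # ~101T (apparent size)
du -h /mnt/test/file_with_holes    # ~500M (allocated blocks)

# Throughput implied by the breakpoint above: offset / elapsed seconds
echo $(( 17886891999232 / 64718 )) # ~276000000 bytes/s, i.e. ~270MB/s

At that rate, reading through the full 100TiB apparent size would take roughly 4-5 days, which explains why the rebalance appeared hung while it was in fact still progressing.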