Bug 961632 - [RHEV-RHS] Remove-brick start does not migrate any data in RHEV-RHS setup
Summary: [RHEV-RHS] Remove-brick start does not migrate any data in RHEV-RHS setup
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: shishir gowda
QA Contact: Rejy M Cyriac
URL:
Whiteboard:
Depends On: 963896
Blocks:
 
Reported: 2013-05-10 06:47 UTC by shylesh
Modified: 2013-12-09 01:36 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.4.0.9rhs
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
virt rhev integration
Last Closed: 2013-09-23 22:29:50 UTC
Embargoed:


Attachments

Description shylesh 2013-05-10 06:47:38 UTC
Description of problem:
Remove-brick start on a distributed-replicate volume in a 6x2 configuration doesn't migrate any data.

Version-Release number of selected component (if applicable):
[root@rhs1-bb rpm]# rpm -qa | grep gluster
glusterfs-fuse-3.4.0.5rhs-1.el6rhs.x86_64
glusterfs-3.4.0.5rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.5rhs-1.el6rhs.x86_64
glusterfs-debuginfo-3.4.0.5rhs-1.el6rhs.x86_64


How reproducible:
always

Steps to Reproduce:
1. Created a 6x2 distributed-replicate volume.
2. Created 4 VMs on this volume.
3. Selected the first pair of bricks for removal:
   gluster volume remove-brick <vol> brick1 brick2 start
4. Checked the remove-brick status; it reports 'completed' within about 2 seconds and no data is migrated (see the command sketch below).
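
For reference, a minimal command sequence for this flow; server and brick names below are illustrative, not the exact ones from this setup:

# create a 6x2 distributed-replicate volume (replica 2, 12 bricks) and start it
gluster volume create <vol> replica 2 serverA:/bricks/b1 serverB:/bricks/b1 ... serverE:/bricks/b6 serverF:/bricks/b6
gluster volume start <vol>

# populate the volume (here, 4 VM images written through the RHEV storage domain mount)

# remove one replica pair (both bricks of the same replica set) and watch the migration
gluster volume remove-brick <vol> serverA:/bricks/b1 serverB:/bricks/b1 start
gluster volume remove-brick <vol> serverA:/bricks/b1 serverB:/bricks/b1 status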
 
 
Additional info:
=================

1. RHEVM hostname
=============
buzz.lab.eng.blr.redhat.com

2. RHEL (hypervisor) hostname
===============================
rhs-gp-srv4.lab.eng.blr.redhat.com

3. RHS nodes (hostname and IP address)
======================================
10.70.37.76
10.70.37.59
10.70.37.133
10.70.37.134

4. RHS node from where the gluster commands were executed
====================================================
10.70.37.76

6. Volume name
=================
Volume Name: drep
Type: Distributed-Replicate
Volume ID: 678f4caa-84b5-4c0d-8df3-87479520ed14
Status: Started
Number of Bricks: 6 x 2 = 12

7. Mount point on the clients
============================
rhs-gp-srv4.lab.eng.blr.redhat.com:/rhev/data-center/mnt/10.70.37.76:drep

8. Tentative date and time when the issue was hit
=================================================

2013-05-10 06:08 UTC




[root@rhs1-bb rpm]# gluster v info
 
Volume Name: drep
Type: Distributed-Replicate
Volume ID: 678f4caa-84b5-4c0d-8df3-87479520ed14
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.76:/brick1/drr1   -->decommissioned
Brick2: 10.70.37.59:/brick1/drr1   -->decommissioned
Brick3: 10.70.37.133:/brick1/drr2
Brick4: 10.70.37.134:/brick1/drr2
Brick5: 10.70.37.76:/brick2/drr3
Brick6: 10.70.37.59:/brick2/drr3
Brick7: 10.70.37.133:/brick2/drr4
Brick8: 10.70.37.134:/brick2/drr4
Brick9: 10.70.37.76:/brick3/drr5
Brick10: 10.70.37.59:/brick3/drr5
Brick11: 10.70.37.133:/brick4/drr6
Brick12: 10.70.37.134:/brick4/drr6
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off


command executed on 10.70.37.76

rebalance logs
=============
[2013-05-10 06:08:10.218311] I [client-handshake.c:450:client_set_lk_version_cbk] 0-drep-client-7: Server lk version = 1
[2013-05-10 06:08:10.218579] I [client-handshake.c:450:client_set_lk_version_cbk] 0-drep-client-11: Server lk version = 1
[2013-05-10 06:08:12.054031] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2013-05-10 06:08:12.058344] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2013-05-10 06:08:12.058481] I [glusterfsd-mgmt.c:56:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2013-05-10 06:08:12.059017] I [glusterfsd-mgmt.c:1544:mgmt_getspec_cbk] 0-drep-client-1: No change in volfile, continuing
[2013-05-10 06:08:12.059084] I [glusterfsd-mgmt.c:1544:mgmt_getspec_cbk] 0-drep-client-5: No change in volfile, continuing
[2013-05-10 06:08:12.059200] I [glusterfsd-mgmt.c:1544:mgmt_getspec_cbk] 0-drep-client-9: No change in volfile, continuing
[2013-05-10 06:08:17.388607] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-drep-client-1: changing port to 49152 (from 0)
[2013-05-10 06:08:17.388895] W [socket.c:515:__socket_rwv] 0-drep-client-1: readv on 10.70.37.59:24007 failed (No data available)
[2013-05-10 06:08:17.400763] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-drep-client-5: changing port to 49153 (from 0)
[2013-05-10 06:08:17.400848] I [rpc-clnt.c:1648:rpc_clnt_reconfig] 0-drep-client-9: changing port to 49154 (from 0)
[2013-05-10 06:08:17.400888] W [socket.c:515:__socket_rwv] 0-drep-client-5: readv on 10.70.37.59:24007 failed (No data available)
[2013-05-10 06:08:17.407824] W [socket.c:515:__socket_rwv] 0-drep-client-9: readv on 10.70.37.59:24007 failed (No data available)
[2013-05-10 06:08:17.415060] I [client-handshake.c:1658:select_server_supported_programs] 0-drep-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2013-05-10 06:08:17.415910] I [client-handshake.c:1658:select_server_supported_programs] 0-drep-client-9: Using Program GlusterFS 3.3


The fuse volfile carries the decommissioned-bricks entry
===================================

volume drep-dht
    type cluster/distribute
    option decommissioned-bricks drep-replicate-0
    subvolumes drep-replicate-0 drep-replicate-1 drep-replicate-2 drep-replicate-3 drep-replicate-4 drep-replicate-5
end-volume
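
As a quick cross-check (assuming the default glusterd working directory), the same option can be read from the volfiles glusterd generates on the servers:

# should name the replicate subvolume that holds the bricks being removed
grep decommissioned-bricks /var/lib/glusterd/vols/drep/*.vol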


attaching the sosreport

Comment 5 shishir gowda 2013-05-16 11:47:25 UTC
From the logs, it looks like no files needed to be migrated in this instance:

rhs1,rhs2-
[2013-05-10 06:08:17.629345] I [dht-common.c:2563:dht_setxattr] 0-drep-dht: fixing the layout of /
[2013-05-10 06:08:17.647246] I [dht-rebalance.c:1106:gf_defrag_migrate_data] 0-drep-dht: migrate data called on /
[2013-05-10 06:08:17.659776] I [dht-rebalance.c:1311:gf_defrag_migrate_data] 0-drep-dht: Migration operation on dir / took 0.01 secs

[2013-05-10 06:08:17.672548] I [dht-rebalance.c:1733:gf_defrag_status_get] 0-glusterfs: Rebalance is completed. Time taken is 0.00 secs
[2013-05-10 06:08:17.672574] I [dht-rebalance.c:1736:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 0

rhs4:
[2013-05-10 06:08:19.879331] I [dht-rebalance.c:1733:gf_defrag_status_get] 0-glusterfs: Rebalance is completed. Time taken is 2.00 secs
[2013-05-10 06:08:19.879371] I [dht-rebalance.c:1736:gf_defrag_status_get] 0-glusterfs: Files migrated: 0, size: 0, lookups: 21, failures: 0

Can you please provide ls -l output (before and after migration) from the brick being decommissioned, showing that no files were migrated from subvolume-0 even after the remove-brick command was issued?
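
For reference, that before/after snapshot could be captured on the decommissioned brick with something along these lines (brick path is an example):

cd /brick1/drr1              # brick directory being decommissioned
ls -lR .                     # file listing
du -sh *                     # space used per top-level entry
getfattr -d -m . -e hex .    # DHT layout and AFR xattrs on the brick root
# repeat the same commands once 'remove-brick ... status' reports completed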

Comment 6 shylesh 2013-05-17 07:47:22 UTC
(In reply to comment #5)
> From the logs, it looks like no files needed to be migrated in this instance:
> 
> rhs1,rhs2-
> [2013-05-10 06:08:17.629345] I [dht-common.c:2563:dht_setxattr] 0-drep-dht:
> fixing the layout of /
> [2013-05-10 06:08:17.647246] I [dht-rebalance.c:1106:gf_defrag_migrate_data]
> 0-drep-dht: migrate data called on /
> [2013-05-10 06:08:17.659776] I [dht-rebalance.c:1311:gf_defrag_migrate_data]
> 0-drep-dht: Migration operation on dir / took 0.01 secs
> 
> [2013-05-10 06:08:17.672548] I [dht-rebalance.c:1733:gf_defrag_status_get]
> 0-glusterfs: Rebalance is completed. Time taken is 0.00 secs
> [2013-05-10 06:08:17.672574] I [dht-rebalance.c:1736:gf_defrag_status_get]
> 0-glusterfs: Files migrated: 0, size: 0, lookups: 0, failures: 0
> 
> rhs4:
> [2013-05-10 06:08:19.879331] I [dht-rebalance.c:1733:gf_defrag_status_get]
> 0-glusterfs: Rebalance is completed. Time taken is 2.00 secs
> [2013-05-10 06:08:19.879371] I [dht-rebalance.c:1736:gf_defrag_status_get]
> 0-glusterfs: Files migrated: 0, size: 0, lookups: 21, failures: 0
> 
> Can you please provide ls -l output (before and after migration) from the
> brick being decommissioned, showing that no files were migrated from
> subvolume-0 even after the remove-brick command was issued?

1. RHEVM hostname
=============
buzz.lab.eng.blr.redhat.com

2. RHEL (hypervisor) hostname
===============================
rhs-gp-srv4.lab.eng.blr.redhat.com

3. RHS nodes (hostname and IP address)
======================================
10.70.37.76
10.70.37.59
10.70.37.133
10.70.37.134

4. RHS node from where the gluster commands were executed
====================================================
10.70.37.76


5. Mount point on the clients
============================
rhs-gp-srv4.lab.eng.blr.redhat.com:/rhev/data-center/mnt/10.70.37.76:drep

volume info
===========
Volume Name: drep
Type: Distributed-Replicate
Volume ID: 678f4caa-84b5-4c0d-8df3-87479520ed14
Status: Started
Number of Bricks: 16 x 2 = 32
Transport-type: tcp   
Bricks:
Brick1: 10.70.37.76:/brick1/drr1
Brick2: 10.70.37.59:/brick1/drr1
Brick3: 10.70.37.133:/brick1/drr2
Brick4: 10.70.37.134:/brick1/drr2
Brick5: 10.70.37.76:/brick2/drr3
Brick6: 10.70.37.59:/brick2/drr3
Brick7: 10.70.37.133:/brick2/drr4
Brick8: 10.70.37.134:/brick2/drr4
Brick9: 10.70.37.76:/brick3/drr5
Brick10: 10.70.37.59:/brick3/drr5
Brick11: 10.70.37.133:/brick4/drr6
Brick12: 10.70.37.134:/brick4/drr6
Brick13: 10.70.37.133:/brick5/drr9
Brick14: 10.70.37.134:/brick5/drr9
Brick15: 10.70.37.76:/brick1/drr10
Brick16: 10.70.37.59:/brick1/drr10
Brick17: 10.70.37.133:/brick1/drr11
Brick18: 10.70.37.134:/brick1/drr11
Brick19: 10.70.37.76:/brick5/drr12
Brick20: 10.70.37.59:/brick5/drr12
Brick21: 10.70.37.133:/brick6/drr13
Brick22: 10.70.37.134:/brick6/drr13
Brick23: 10.70.37.133:/brick7/drr14
Brick24: 10.70.37.134:/brick7/drr14
Brick25: 10.70.37.133:/brick6/drr15
Brick26: 10.70.37.134:/brick6/drr15
Brick27: 10.70.37.133:/brick6/drr7
Brick28: 10.70.37.134:/brick6/drr7
Brick29: 10.70.37.76:/brick6/drr18
Brick30: 10.70.37.59:/brick6/drr18
Brick31: 10.70.37.76:/brick6/drr19 ===> decommissioned bricks
Brick32: 10.70.37.59:/brick6/drr19 ===>
Options Reconfigured: 
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-uid: 36 
storage.owner-gid: 36 



Before remove-brick 
==================

[root@rhs1-bb drr19]# ls -l
total 0
drwxr-xr-x 5 vdsm kvm 45 May 16 15:22 e8a9faf9-439a-47e6-a38c-27d23b8c976f


/brick6/drr19


[root@rhs1-bb drr19]# du -sh *
4.0G    e8a9faf9-439a-47e6-a38c-27d23b8c976f
[root@rhs1-bb drr19]# pwd
/brick6/drr19
[root@rhs1-bb drr19]# du -sh .


[root@rhs1-bb drr19]# du -h *
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/dom_md
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/a3329d3c-0842-4710-bd2a-47335400a94f
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/a4222b66-18f7-4182-8893-9cbc1be00e18
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/fdf2a3ed-f477-4445-8b5c-098270778aed
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/cf2a98a9-ec3c-45a1-a3bc-ce90b5585e97
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/158c117e-71f4-43a2-8faf-a215d39363ce
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/8bb569b9-b7ad-4724-b14d-1df02f7d86bf
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/tasks
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master
1.0M    e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/32697760-8ae2-424e-87f8-296f6827ae3a
2.1G    e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/2d7b0c6e-ec09-4226-9097-34fc57dbc85d
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/7e596c1c-82ac-410c-b64f-c7b1199e274c
1.9G    e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/44671929-6924-48a4-a86f-bd253e4a8303
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/75cc0479-be7f-4310-bc30-f97311d7242b
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/b4e46d10-64dd-48b7-b4f2-8a9354c8b40e
4.0G    e8a9faf9-439a-47e6-a38c-27d23b8c976f/images
4.0G    e8a9faf9-439a-47e6-a38c-27d23b8c976f




[root@rhs1-bb drr19]# getfattr -d -m . -e hex .
# file: .
trusted.afr.drep-client-30=0x000000000000000000000000
trusted.afr.drep-client-31=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000bbbbbbbbcccccccb
trusted.glusterfs.volume-id=0x678f4caa84b54c0d8df387479520ed14



[root@rhs1-bb drr19]# gluster v remove-brick drep 10.70.37.76:/brick6/drr19 10.70.37.59:/brick6/drr19 start
volume remove-brick start: success
ID: 9ac04502-34f2-4523-8035-bafeb345d0c5
[root@rhs1-bb drr19]# gluster v remove-brick drep 10.70.37.76:/brick6/drr19 10.70.37.59:/brick6/drr19 status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes             0             0      completed             0.00
                               localhost                0        0Bytes             0             0      completed             0.00
                               localhost                0        0Bytes             0             0      completed             0.00
                               localhost                0        0Bytes             0             0      completed             0.00
                            10.70.37.134                0        0Bytes             0             0    not started             0.00



localhost is  10.70.37.76


Output from another peer (holding the other half of the replica pair), 10.70.37.59:
[root@rhs4-bb brick6]# gluster v remove-brick drep 10.70.37.76:/brick6/drr19 10.70.37.59:/brick6/drr19 status
                                    Node Rebalanced-files          size       scanned      failures         status run-time in secs
                               ---------      -----------   -----------   -----------   -----------   ------------   --------------
                               localhost                0        0Bytes            29             0      completed             2.00
                               localhost                0        0Bytes            29             0      completed             2.00
                               localhost                0        0Bytes            29             0      completed             2.00
                               localhost                0        0Bytes            29             0      completed             2.00
                            10.70.37.134                0        0Bytes             0             0    not started             0.00



After remove-brick start 
==================
[root@rhs1-bb drr19]# getfattr -d -m . -e hex `pwd`
getfattr: Removing leading '/' from absolute path names
# file: brick6/drr19
trusted.afr.drep-client-30=0x000000000000000000000000
trusted.afr.drep-client-31=0x000000000000000000000000
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000000000000
trusted.glusterfs.volume-id=0x678f4caa84b54c0d8df387479520ed14

[root@rhs1-bb drr19]# ls -l
total 0
drwxr-xr-x 5 vdsm kvm 45 May 16 15:22 e8a9faf9-439a-47e6-a38c-27d23b8c976f



[root@rhs1-bb drr19]# du -h *
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/dom_md
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/a3329d3c-0842-4710-bd2a-47335400a94f
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/a4222b66-18f7-4182-8893-9cbc1be00e18
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/fdf2a3ed-f477-4445-8b5c-098270778aed
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/cf2a98a9-ec3c-45a1-a3bc-ce90b5585e97
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/158c117e-71f4-43a2-8faf-a215d39363ce
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms/8bb569b9-b7ad-4724-b14d-1df02f7d86bf
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/vms
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master/tasks
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/master
1.0M    e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/32697760-8ae2-424e-87f8-296f6827ae3a
2.1G    e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/2d7b0c6e-ec09-4226-9097-34fc57dbc85d
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/7e596c1c-82ac-410c-b64f-c7b1199e274c
1.9G    e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/44671929-6924-48a4-a86f-bd253e4a8303
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/75cc0479-be7f-4310-bc30-f97311d7242b
0       e8a9faf9-439a-47e6-a38c-27d23b8c976f/images/b4e46d10-64dd-48b7-b4f2-8a9354c8b40e
4.0G    e8a9faf9-439a-47e6-a38c-27d23b8c976f/images
4.0G    e8a9faf9-439a-47e6-a38c-27d23b8c976f



attaching the latest sosreport

Comment 8 shishir gowda 2013-06-04 10:14:24 UTC
Looks like a duplicate of bug 963896 (fix merged downstream), where an incorrect brick/subvolume was marked as decommissioned.
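
One way to confirm on an affected build that real data was left behind is to look for non-empty regular files on the removed brick after the status shows completed; after a successful migration the brick should hold no file data (at most zero-byte DHT link files). The path below is from this setup:

find /brick6/drr19 -type f -size +0c -exec ls -lh {} \;   # non-empty files remaining => data was not migrated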

Comment 9 Rejy M Cyriac 2013-08-22 13:15:27 UTC
Verified that the remove-brick operation now migrates the data as expected.


Test Environment versions:
RHS - glusterfs-server-3.4.0.21rhs-1.el6rhs.x86_64
6x2 Distributed-Replicate volume used as a Storage Domain
Red Hat Enterprise Virtualization Manager Version: 3.2.2-0.41.el6ev
RHEVH-6.4 Hypervisor with glusterfs-3.4.0.21rhs-1.el6_4.x86_64
RHEL-6.4 Hypervisor with glusterfs-3.4.0.21rhs-1.el6_4.x86_64
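
A sketch of the verification flow, assuming the usual remove-brick workflow (volume and brick names illustrative):

gluster volume remove-brick <vol> serverA:/<brick> serverB:/<brick> start
gluster volume remove-brick <vol> serverA:/<brick> serverB:/<brick> status   # wait until every node reports 'completed' with files actually rebalanced
gluster volume remove-brick <vol> serverA:/<brick> serverB:/<brick> commit
gluster volume info <vol>                                                    # brick count should drop from 6 x 2 to 5 x 2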

Comment 10 Scott Haines 2013-09-23 22:29:50 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

