Bug 1061066 - remove-brick : "No data available" error on files from mount point after remove-brick operation
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1286134
 
Reported: 2014-02-04 10:35 UTC by spandura
Modified: 2015-11-27 11:44 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1286134
Environment:
Last Closed: 2015-11-27 11:44:30 UTC
Embargoed:



Description spandura 2014-02-04 10:35:49 UTC
Description of problem:
===========================
On AWS, we had a 4 x 3 distributed-replicate volume (12 nodes, 1 brick per node). On these nodes the SSH, glusterd (24007) and glusterfsd (49152-49200) ports were enabled. The disk usage of the volume was almost 99%.

Hence, we added 3 more EC2 instances to the storage pool. On these 3 instances only the SSH and glusterd (24007) ports were enabled; the glusterfsd ports (49152-49200) were blocked at the time the instances were created.

Added 3 bricks, one from each of these nodes, to the volume and started rebalance. After some time the rebalance failed on all the bricks of replicate-0, replicate-1, replicate-2 and replicate-3.

Rebalance completed on replicate-4. However, no data was migrated; only linkto files were created on the bricks of the replicate-4 subvolume.

Since the bricks of replicate-0, replicate-1, replicate-2 and replicate-3 cannot reach the bricks of replicate-4, rebalance fails on all bricks of subvolumes 0-3.

The rebalance process on client-0, client-1 and client-2 of the replicate-4 subvolume performs lookups and creates linkto files on its own brick for all the files that hash to its own subvolume.

Also, since each node in the subvolume cannot reach the other bricks of the subvolume, each brick marks the AFR extended attributes for data and metadata self-heal against the other bricks, driving the files into a split-brain state. The number of files in split-brain on client-0 and client-1 is the same, but client-2 has fewer files in split-brain. When lookups happen on the files and rebalance creates linkto files, the same number of files is expected to be created on all 3 bricks of the subvolume. On client-2 of the replicate-4 subvolume, however, there are fewer files in split-brain: on those files only the sticky bit was set, while the extended attribute "trusted.glusterfs.dht.linkto" and the changelogs were not set.
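
The asymmetry on client-2 can be checked by dumping the extended attributes of an affected file directly on the brick backends. The commands below are only an illustrative sketch, assuming shell access to the storage nodes; the brick path and file name are example placeholders, not taken from the actual run:

# Dump all trusted.* xattrs of one affected file on each brick of the
# replicate-4 subvolume (backend path is an illustrative example):
getfattr -d -m . -e hex /rhs/bricks/exporter/user24/TestDir0/TestDir0/a4
ls -l /rhs/bricks/exporter/user24/TestDir0/TestDir0/a4

# Per the description above, client-0/client-1 are expected to show
# trusted.afr.<volname>-client-* changelog values (pending self-heal) and,
# for link files, the trusted.glusterfs.dht.linkto xattr, whereas on client-2
# the affected entries carried only the sticky bit in the file mode.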

Added 3 more bricks to the volume, then removed the bricks of the replicate-4 subvolume. remove-brick started a rebalance process on the storage nodes hosting the replicate-4 bricks, and remove-brick completed successfully.


From the mount point, executed "find". find failed with the following errors:

root@ip-10-182-134-186 [Feb-04-2014- 6:50:11] >find /mnt/exporter/ -type f | wc
find: `/mnt/exporter/user27/TestDir0/TestDir0/TestDir0/a4': No data available
find: `/mnt/exporter/user27/TestDir0/TestDir0/TestDir1/a1': No data available
find: `/mnt/exporter/user27/TestDir0/TestDir2/TestDir3/a6': No data available
find: `/mnt/exporter/user24/TestDir0/TestDir0/TestDir0/a6': No data available
find: `/mnt/exporter/user24/TestDir0/TestDir0/a4': No data available
find: `/mnt/exporter/user24/TestDir0/TestDir3/TestDir1/a7': No data available
find: `/mnt/exporter/user24/TestDir0/TestDir3/TestDir4/a6': No data available
find: `/mnt/exporter/user24/TestDir0/TestDir3/TestDir4/a9': No data available
 310113  310113 20731483

Version-Release number of selected component (if applicable):
================================================================
glusterfs 3.4.0.57rhs built on Jan 13 2014 06:59:05

How reproducible:


Steps to Reproduce:
==========================
1. Create a 4 x 3 distributed-replicate volume. Start the volume. Create a huge set of files and directories from a FUSE mount. (A command-line sketch of these steps is given after the list.)

2. Add 3 more nodes to the storage pool. Block the glusterfsd INBOUND ports on these 3 nodes, i.e. all incoming requests on the brick ports should be blocked. Any process running on the 3 nodes can then access only its own local bricks, while the 3 nodes can still access all other bricks in the replicate-0, replicate-1, replicate-2 and replicate-3 subvolumes.

3. Add 3 bricks from the newly added 3 nodes to the volume, changing the volume type to 5 x 3.

4. Start rebalance on the volume. Rebalance fails; refer to bug https://bugzilla.redhat.com/show_bug.cgi?id=1059551

5. Enable the glusterfsd ports (49152-49200) on the replicate-4 storage nodes.

6. Restart the rebalance. Rebalance completes successfully, but the split-brain files still exist.

7. Get the file/dir count from the mount point (find <mount_point> | wc).

8. Add 3 more nodes to the storage pool. Add 3 bricks from the newly added 3 nodes, changing the volume type to 6 x 3.

9. Remove all the bricks of the replicate-4 subvolume. This starts rebalance on all the nodes of the replicate-4 subvolume.

10. After the remove-brick operation is successfully complete, "commit" the remove-brick operation.

11. From the mount point, get the file/dir count (find <mount_point> | wc).
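
For reference, a rough command-line sketch of the sequence above. Host names and brick paths are placeholders, and the firewall rule is an assumption about how the ports were blocked; the real add-brick and remove-brick commands used in this setup are shown verbatim under Additional info.

# Step 1: create and start a 4 x 3 distributed-replicate volume (12 bricks)
gluster volume create exporter replica 3 <host1>:/rhs/bricks/exporter ... <host12>:/rhs/bricks/exporter
gluster volume start exporter

# Step 2: on each of the 3 new nodes, block inbound brick (glusterfsd) ports
iptables -A INPUT -p tcp --dport 49152:49200 -j DROP

# Step 3: add one brick from each new node (volume becomes 5 x 3)
gluster volume add-brick exporter replica 3 <new1>:/rhs/bricks/exporter <new2>:/rhs/bricks/exporter <new3>:/rhs/bricks/exporter

# Step 4: start rebalance (fails while the brick ports are blocked)
gluster volume rebalance exporter start
gluster volume rebalance exporter status

# Step 5: re-open the brick ports on the new nodes
iptables -D INPUT -p tcp --dport 49152:49200 -j DROP

# Step 6: rerun rebalance
gluster volume rebalance exporter start

# Steps 8-10: add 3 more bricks (6 x 3), then remove the replicate-4 bricks
gluster volume add-brick exporter replica 3 <new4>:/rhs/bricks/exporter <new5>:/rhs/bricks/exporter <new6>:/rhs/bricks/exporter
gluster volume remove-brick exporter <new1>:/rhs/bricks/exporter <new2>:/rhs/bricks/exporter <new3>:/rhs/bricks/exporter start
gluster volume remove-brick exporter <new1>:/rhs/bricks/exporter <new2>:/rhs/bricks/exporter <new3>:/rhs/bricks/exporter status
gluster volume remove-brick exporter <new1>:/rhs/bricks/exporter <new2>:/rhs/bricks/exporter <new3>:/rhs/bricks/exporter commit

# Steps 7 and 11: file count from the mount point
find /mnt/exporter/ -type f | wc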

Actual results:
========================
root@ip-10-182-134-186 [Feb-04-2014- 6:50:11] >find /mnt/exporter/ -type f | wc
find: `/mnt/exporter/user27/TestDir0/TestDir0/TestDir0/a4': No data available
find: `/mnt/exporter/user27/TestDir0/TestDir0/TestDir1/a1': No data available
find: `/mnt/exporter/user27/TestDir0/TestDir2/TestDir3/a6': No data available
find: `/mnt/exporter/user24/TestDir0/TestDir0/TestDir0/a6': No data available
find: `/mnt/exporter/user24/TestDir0/TestDir0/a4': No data available
find: `/mnt/exporter/user24/TestDir0/TestDir3/TestDir1/a7': No data available
find: `/mnt/exporter/user24/TestDir0/TestDir3/TestDir4/a6': No data available
find: `/mnt/exporter/user24/TestDir0/TestDir3/TestDir4/a9': No data available
 310113  310113 20731483

Mount log messages :
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2014-02-04 10:06:09.775660] W [fuse-bridge.c:1134:fuse_attr_cbk] 0-glusterfs-fuse: 8757859588: STAT() /user24/TestDir0/TestDir0/a4 => -1 (No data available)
[2014-02-04 10:06:09.780251] E [dht-helper.c:777:dht_migration_complete_check_task] 12-exporter-dht: /user24/TestDir0/TestDir0/a4: failed to get the 'linkto' xattr No data available
[2014-02-04 10:06:09.784987] E [dht-helper.c:777:dht_migration_complete_check_task] 12-exporter-dht: /user24/TestDir0/TestDir0/a4: failed to get the 'linkto' xattr No data available
[2014-02-04 10:06:27.845210] E [dht-helper.c:777:dht_migration_complete_check_task] 12-exporter-dht: /user24/TestDir0/TestDir3/TestDir1/a7: failed to get the 'linkto' xattr No data available
[2014-02-04 10:06:27.850242] E [dht-helper.c:777:dht_migration_complete_check_task] 12-exporter-dht: /user24/TestDir0/TestDir3/TestDir1/a7: failed to get the 'linkto' xattr No data available
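
The errors above come from DHT treating these entries as files under migration but failing to read their "trusted.glusterfs.dht.linkto" xattr (ENODATA, i.e. "No data available"). A hedged way to confirm this from a storage node, assuming direct access to the brick backend (the path is an illustrative example):

# On a brick that still holds the stale entry, the file typically appears as a
# sticky-bit, zero-size link file:
ls -l /rhs/bricks/exporter/user24/TestDir0/TestDir0/a4
# Querying the linkto xattr directly shows whether it is actually set:
getfattr -n trusted.glusterfs.dht.linkto -e text /rhs/bricks/exporter/user24/TestDir0/TestDir0/a4
# If the xattr is absent, getfattr reports "No such attribute", which maps to
# the ENODATA ("No data available") error seen on the mount.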

Expected results:
=====================
There should not be any failures in accessing the files. 

Additional info:
======================
root@domU-12-31-39-0A-99-B2 [Feb-03-2014-11:46:38] >gluster v info 
 
Volume Name: exporter
Type: Distributed-Replicate
Volume ID: 31e01742-36c4-4fbf-bffb-bc9ae98920a7
Status: Started
Number of Bricks: 5 x 3 = 15
Transport-type: tcp
Bricks:
Brick1: domU-12-31-39-0A-99-B2.compute-1.internal:/rhs/bricks/exporter
Brick2: ip-10-194-111-63.ec2.internal:/rhs/bricks/exporter
Brick3: ip-10-182-165-181.ec2.internal:/rhs/bricks/exporter
Brick4: ip-10-46-226-179.ec2.internal:/rhs/bricks/exporter
Brick5: ip-10-83-5-197.ec2.internal:/rhs/bricks/exporter
Brick6: ip-10-159-26-108.ec2.internal:/rhs/bricks/exporter
Brick7: domU-12-31-39-07-74-A5.compute-1.internal:/rhs/bricks/exporter
Brick8: ip-10-80-109-233.ec2.internal:/rhs/bricks/exporter
Brick9: ip-10-181-128-26.ec2.internal:/rhs/bricks/exporter
Brick10: domU-12-31-39-0B-DC-01.compute-1.internal:/rhs/bricks/exporter
Brick11: ip-10-34-105-112.ec2.internal:/rhs/bricks/exporter
Brick12: ip-10-232-7-75.ec2.internal:/rhs/bricks/exporter
Brick13: domU-12-31-39-14-3E-21.compute-1.internal:/rhs/bricks/exporter
Brick14: ip-10-38-175-12.ec2.internal:/rhs/bricks/exporter
Brick15: ip-10-182-160-197.ec2.internal:/rhs/bricks/exporter

root@domU-12-31-39-0A-99-B2 [Feb-03-2014-18:01:36] >gluster v status
Status of volume: exporter
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick domU-12-31-39-0A-99-B2.compute-1.internal:/rhs/br
icks/exporter						49152	Y	19405
Brick ip-10-194-111-63.ec2.internal:/rhs/bricks/exporte
r							49152	Y	3812
Brick ip-10-182-165-181.ec2.internal:/rhs/bricks/export
er							49152	Y	3954
Brick ip-10-46-226-179.ec2.internal:/rhs/bricks/exporte
r							49152	Y	3933
Brick ip-10-83-5-197.ec2.internal:/rhs/bricks/exporter	49152	Y	8705
Brick ip-10-159-26-108.ec2.internal:/rhs/bricks/exporte
r							49152	Y	20196
Brick domU-12-31-39-07-74-A5.compute-1.internal:/rhs/br
icks/exporter						49152	Y	6553
Brick ip-10-80-109-233.ec2.internal:/rhs/bricks/exporte
r							49152	Y	6450
Brick ip-10-181-128-26.ec2.internal:/rhs/bricks/exporte
r							49152	Y	8569
Brick domU-12-31-39-0B-DC-01.compute-1.internal:/rhs/br
icks/exporter						49152	Y	7145
Brick ip-10-34-105-112.ec2.internal:/rhs/bricks/exporte
r							49152	Y	7123
Brick ip-10-232-7-75.ec2.internal:/rhs/bricks/exporter	49152	Y	3935
Brick domU-12-31-39-14-3E-21.compute-1.internal:/rhs/br
icks/exporter						49152	Y	9540
Brick ip-10-38-175-12.ec2.internal:/rhs/bricks/exporter	49152	Y	9084
Brick ip-10-182-160-197.ec2.internal:/rhs/bricks/export
er							49152	Y	9075
NFS Server on localhost					2049	Y	7543
Self-heal Daemon on localhost				N/A	Y	7550
NFS Server on domU-12-31-39-07-74-A5.compute-1.internal	2049	Y	10208
Self-heal Daemon on domU-12-31-39-07-74-A5.compute-1.in
ternal							N/A	Y	10215
NFS Server on ip-10-113-129-125.ec2.internal		2049	Y	7508
Self-heal Daemon on ip-10-113-129-125.ec2.internal	N/A	Y	7515
NFS Server on ip-10-181-128-26.ec2.internal		2049	Y	8401
Self-heal Daemon on ip-10-181-128-26.ec2.internal	N/A	Y	8408
NFS Server on ip-10-46-226-179.ec2.internal		2049	Y	12138
Self-heal Daemon on ip-10-46-226-179.ec2.internal	N/A	Y	12145
NFS Server on ip-10-182-165-181.ec2.internal		2049	Y	10799
Self-heal Daemon on ip-10-182-165-181.ec2.internal	N/A	Y	10806
NFS Server on ip-10-232-7-75.ec2.internal		2049	Y	11144
Self-heal Daemon on ip-10-232-7-75.ec2.internal		N/A	Y	11151
NFS Server on ip-10-159-26-108.ec2.internal		2049	Y	11936
Self-heal Daemon on ip-10-159-26-108.ec2.internal	N/A	Y	11943
NFS Server on ip-10-235-52-58.ec2.internal		2049	Y	16295
Self-heal Daemon on ip-10-235-52-58.ec2.internal	N/A	Y	16300
NFS Server on domU-12-31-39-0B-DC-01.compute-1.internal	2049	Y	1750
Self-heal Daemon on domU-12-31-39-0B-DC-01.compute-1.in
ternal							N/A	Y	1757
NFS Server on ip-10-80-109-233.ec2.internal		2049	Y	10757
Self-heal Daemon on ip-10-80-109-233.ec2.internal	N/A	Y	10764
NFS Server on ip-10-83-5-197.ec2.internal		2049	Y	12427
Self-heal Daemon on ip-10-83-5-197.ec2.internal		N/A	Y	12434
NFS Server on ip-10-224-6-47.ec2.internal		2049	Y	17879
Self-heal Daemon on ip-10-224-6-47.ec2.internal		N/A	Y	17883
NFS Server on domU-12-31-39-14-3E-21.compute-1.internal	2049	Y	9592
Self-heal Daemon on domU-12-31-39-14-3E-21.compute-1.in
ternal							N/A	Y	9599
NFS Server on ip-10-182-160-197.ec2.internal		2049	Y	9129
Self-heal Daemon on ip-10-182-160-197.ec2.internal	N/A	Y	9136
NFS Server on ip-10-194-111-63.ec2.internal		2049	Y	27361
Self-heal Daemon on ip-10-194-111-63.ec2.internal	N/A	Y	27368
NFS Server on ip-10-38-175-12.ec2.internal		2049	Y	9144
Self-heal Daemon on ip-10-38-175-12.ec2.internal	N/A	Y	9151
NFS Server on ip-10-34-105-112.ec2.internal		2049	Y	1447
Self-heal Daemon on ip-10-34-105-112.ec2.internal	N/A	Y	1454
 
Task Status of Volume exporter
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 421d576b-150c-4ff5-be6b-9d3acbc7c2c6
Status               : completed  

root@domU-12-31-39-0A-99-B2 [Feb-03-2014-18:01:42] >gluster volume add-brick exporter replica 3 ip-10-113-129-125.ec2.internal:/rhs/bricks/exporter ip-10-224-6-47.ec2.internal:/rhs/bricks/exporter ip-10-235-52-58.ec2.internal:/rhs/bricks/exporter ;

root@domU-12-31-39-0A-99-B2 [Feb-03-2014-18:02:34] >gluster v status
Status of volume: exporter
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick domU-12-31-39-0A-99-B2.compute-1.internal:/rhs/br
icks/exporter						49152	Y	19405
Brick ip-10-194-111-63.ec2.internal:/rhs/bricks/exporte
r							49152	Y	3812
Brick ip-10-182-165-181.ec2.internal:/rhs/bricks/export
er							49152	Y	3954
Brick ip-10-46-226-179.ec2.internal:/rhs/bricks/exporte
r							49152	Y	3933
Brick ip-10-83-5-197.ec2.internal:/rhs/bricks/exporter	49152	Y	8705
Brick ip-10-159-26-108.ec2.internal:/rhs/bricks/exporte
r							49152	Y	20196
Brick domU-12-31-39-07-74-A5.compute-1.internal:/rhs/br
icks/exporter						49152	Y	6553
Brick ip-10-80-109-233.ec2.internal:/rhs/bricks/exporte
r							49152	Y	6450
Brick ip-10-181-128-26.ec2.internal:/rhs/bricks/exporte
r							49152	Y	8569
Brick domU-12-31-39-0B-DC-01.compute-1.internal:/rhs/br
icks/exporter						49152	Y	7145
Brick ip-10-34-105-112.ec2.internal:/rhs/bricks/exporte
r							49152	Y	7123
Brick ip-10-232-7-75.ec2.internal:/rhs/bricks/exporter	49152	Y	3935
Brick domU-12-31-39-14-3E-21.compute-1.internal:/rhs/br
icks/exporter						49152	Y	9540
Brick ip-10-38-175-12.ec2.internal:/rhs/bricks/exporter	49152	Y	9084
Brick ip-10-182-160-197.ec2.internal:/rhs/bricks/export
er							49152	Y	9075
Brick ip-10-113-129-125.ec2.internal:/rhs/bricks/export
er							49152	Y	1120
Brick ip-10-224-6-47.ec2.internal:/rhs/bricks/exporter	49152	Y	21947
Brick ip-10-235-52-58.ec2.internal:/rhs/bricks/exporter	49152	Y	18979
NFS Server on localhost					2049	Y	28574
Self-heal Daemon on localhost				N/A	Y	28581
NFS Server on domU-12-31-39-07-74-A5.compute-1.internal	2049	Y	29276
Self-heal Daemon on domU-12-31-39-07-74-A5.compute-1.in
ternal							N/A	Y	29283
NFS Server on domU-12-31-39-0B-DC-01.compute-1.internal	2049	Y	17679
Self-heal Daemon on domU-12-31-39-0B-DC-01.compute-1.in
ternal							N/A	Y	17686
NFS Server on ip-10-113-129-125.ec2.internal		2049	Y	1191
Self-heal Daemon on ip-10-113-129-125.ec2.internal	N/A	Y	1198
NFS Server on ip-10-159-26-108.ec2.internal		2049	Y	852
Self-heal Daemon on ip-10-159-26-108.ec2.internal	N/A	Y	859
NFS Server on ip-10-181-128-26.ec2.internal		2049	Y	21256
Self-heal Daemon on ip-10-181-128-26.ec2.internal	N/A	Y	21263
NFS Server on ip-10-232-7-75.ec2.internal		2049	Y	5528
Self-heal Daemon on ip-10-232-7-75.ec2.internal		N/A	Y	5535
NFS Server on ip-10-182-165-181.ec2.internal		2049	Y	22788
Self-heal Daemon on ip-10-182-165-181.ec2.internal	N/A	Y	22795
NFS Server on ip-10-235-52-58.ec2.internal		2049	Y	19036
Self-heal Daemon on ip-10-235-52-58.ec2.internal	N/A	Y	19043
NFS Server on ip-10-46-226-179.ec2.internal		2049	Y	353
Self-heal Daemon on ip-10-46-226-179.ec2.internal	N/A	Y	360
NFS Server on ip-10-194-111-63.ec2.internal		2049	Y	32731
Self-heal Daemon on ip-10-194-111-63.ec2.internal	N/A	Y	32738
NFS Server on ip-10-182-160-197.ec2.internal		2049	Y	10939
Self-heal Daemon on ip-10-182-160-197.ec2.internal	N/A	Y	10946
NFS Server on ip-10-34-105-112.ec2.internal		2049	Y	17594
Self-heal Daemon on ip-10-34-105-112.ec2.internal	N/A	Y	17602
NFS Server on domU-12-31-39-14-3E-21.compute-1.internal	2049	Y	4470
Self-heal Daemon on domU-12-31-39-14-3E-21.compute-1.in
ternal							N/A	Y	4477
NFS Server on ip-10-80-109-233.ec2.internal		2049	Y	26277
Self-heal Daemon on ip-10-80-109-233.ec2.internal	N/A	Y	26284
NFS Server on ip-10-224-6-47.ec2.internal		2049	Y	22001
Self-heal Daemon on ip-10-224-6-47.ec2.internal		N/A	Y	22008
NFS Server on ip-10-38-175-12.ec2.internal		2049	Y	23147
Self-heal Daemon on ip-10-38-175-12.ec2.internal	N/A	Y	23154
NFS Server on ip-10-83-5-197.ec2.internal		2049	Y	11216
Self-heal Daemon on ip-10-83-5-197.ec2.internal		N/A	Y	11223
 
Task Status of Volume exporter
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 421d576b-150c-4ff5-be6b-9d3acbc7c2c6
Status               : completed           

root@domU-12-31-39-0A-99-B2 [Feb-03-2014-18:02:54] >gluster volume remove-brick exporter domU-12-31-39-14-3E-21.compute-1.internal:/rhs/bricks/exporter ip-10-38-175-12.ec2.internal:/rhs/bricks/exporter  ip-10-182-160-197.ec2.internal:/rhs/bricks/exporter start ;

volume remove-brick start: success
ID: 6b5b8c2b-0d3e-45b0-8ad6-e04704e7b88b

root@domU-12-31-39-0A-99-B2 [Feb-04-2014- 6:24:44] >gluster volume remove-brick exporter domU-12-31-39-14-3E-21.compute-1.internal:/rhs/bricks/exporter ip-10-38-175-12.ec2.internal:/rhs/bricks/exporter  ip-10-182-160-197.ec2.internal:/rhs/bricks/exporter status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status   run time in secs
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
domU-12-31-39-14-3E-21.compute-1.internal               21       109.5MB        310130             0             0            completed            9136.00
            ip-10-38-175-12.ec2.internal                0        0Bytes        310121             0             0            completed            9136.00
          ip-10-182-160-197.ec2.internal                0        0Bytes        333267             0             0            completed            9136.00
root@domU-12-31-39-0A-99-B2 [Feb-04-2014- 6:24:50] >

root@domU-12-31-39-0A-99-B2 [Feb-04-2014- 6:46:06] >gluster volume remove-brick importer domU-12-31-39-14-3E-21.compute-1.internal:/rhs/bricks/importer ip-10-38-175-12.ec2.internal:/rhs/bricks/importer  ip-10-182-160-197.ec2.internal:/rhs/bricks/importer commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success
root@domU-12-31-39-0A-99-B2 [Feb-04-2014- 6:46:54] >
root@domU-12-31-39-0A-99-B2 [Feb-04-2014- 6:46:56] >gluster v info exporter
 
Volume Name: exporter
Type: Distributed-Replicate
Volume ID: 31e01742-36c4-4fbf-bffb-bc9ae98920a7
Status: Started
Number of Bricks: 5 x 3 = 15
Transport-type: tcp
Bricks:
Brick1: domU-12-31-39-0A-99-B2.compute-1.internal:/rhs/bricks/exporter
Brick2: ip-10-194-111-63.ec2.internal:/rhs/bricks/exporter
Brick3: ip-10-182-165-181.ec2.internal:/rhs/bricks/exporter
Brick4: ip-10-46-226-179.ec2.internal:/rhs/bricks/exporter
Brick5: ip-10-83-5-197.ec2.internal:/rhs/bricks/exporter
Brick6: ip-10-159-26-108.ec2.internal:/rhs/bricks/exporter
Brick7: domU-12-31-39-07-74-A5.compute-1.internal:/rhs/bricks/exporter
Brick8: ip-10-80-109-233.ec2.internal:/rhs/bricks/exporter
Brick9: ip-10-181-128-26.ec2.internal:/rhs/bricks/exporter
Brick10: domU-12-31-39-0B-DC-01.compute-1.internal:/rhs/bricks/exporter
Brick11: ip-10-34-105-112.ec2.internal:/rhs/bricks/exporter
Brick12: ip-10-232-7-75.ec2.internal:/rhs/bricks/exporter
Brick13: ip-10-113-129-125.ec2.internal:/rhs/bricks/exporter
Brick14: ip-10-224-6-47.ec2.internal:/rhs/bricks/exporter
Brick15: ip-10-235-52-58.ec2.internal:/rhs/bricks/exporter

Comment 3 Susant Kumar Palai 2015-11-27 11:44:30 UTC
Cloning this to 3.1. To be fixed in future.

