Bug 867351

Summary: migrated data with "remove-brick start" unavailable until commit
Product: Red Hat Gluster Storage (Red Hat Storage)
Component: glusterd
Version: 2.0
Hardware: x86_64
OS: Linux
Status: CLOSED WORKSFORME
Severity: unspecified
Priority: medium
Reporter: Vidya Sakar <vinaraya>
Assignee: shishir gowda <sgowda>
QA Contact: SATHEESARAN <sasundar>
CC: amarts, gluster-bugs, nsathyan, redhat, rfortier, rhs-bugs, sgowda, shaines, vbellur
Doc Type: Bug Fix
Type: Bug
Clone Of: 862332
Bug Depends On: 862332
Last Closed: 2012-12-11 12:37:47 UTC

Description Vidya Sakar 2012-10-17 11:11:48 UTC
+++ This bug was initially created as a clone of Bug #862332 +++

Description of problem:
In order to retire one or more bricks from a volume, you must run a 'remove-brick start' operation, followed by 'remove-brick commit' once the migration is complete.  While the migration is in progress, each file that gets migrated becomes unavailable to clients.  Issuing the commit operation makes all migrated files available again.
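
For reference, a minimal sketch of the command sequence described above, assuming the gluster CLI syntax of this release; the volume name, host names, and brick paths are placeholders rather than values taken from this report:

  # start migrating data off the replica pair being retired
  gluster volume remove-brick testvol server1:/bricks/brick4 server2:/bricks/brick4 start

  # check migration progress (the output includes the failure count mentioned later in this report)
  gluster volume remove-brick testvol server1:/bricks/brick4 server2:/bricks/brick4 status

  # finalize the removal once the migration has completed
  gluster volume remove-brick testvol server1:/bricks/brick4 server2:/bricks/brick4 commit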

Steps to Reproduce:
1. Begin migrating data off a brick with the remove-brick start command.
2. Check the $volume-rebalance log to find a file that has been migrated (see the sketch after this list).
3. Try to access the file found in step 2; the access will fail.
4. Wait for the migration to complete.
5. If there are failures 
6. Issue the remove-brick commit operation.
7. Try to access the file again.  It will succeed.
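
A hypothetical way to carry out steps 2 and 3, assuming the default glusterfs log directory and a FUSE client mount at /mnt/testvol (neither path is given in this report, and the file name below is a placeholder):

  # step 2: find a file that the remove-brick/rebalance process reports as migrated
  grep -i migrate /var/log/glusterfs/testvol-rebalance.log

  # step 3: try to read that file through the client mount; per this report, the access fails
  stat /mnt/testvol/path/to/migrated-file
  cat /mnt/testvol/path/to/migrated-file > /dev/null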

Actual results:
Each migrated file is unavailable from the time it gets migrated until the commit operation is performed.

Expected results:
Each file should remain available after it gets migrated.  The commit operation should not be required in order to keep accessing the data.  Commit should simply finalize the removal, or, where that is ever needed, force the removal (accepting data loss) when no migration has been done.


Additional info:

Bug 770346 is similar, though apparently with that bug, the data was completely lost even after the commit.

The migration seems to be prone to failures on individual files.  No failure notification is given other than a count on the 'status' output indicating that such failures have occurred.  Such failures are guaranteed when the available disk space on one or more bricks is less than the amount of used space on the brick that is being removed, even if the volume as a whole has plenty of space.  I will file a separate bug for that problem.
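
As a rough pre-check for the disk-space condition described above (brick mount points are placeholders), the used space on the outgoing brick can be compared against the free space on the bricks that must absorb its data:

  # used space on the brick being removed
  df -h /bricks/brick4

  # free space on the remaining bricks that will receive the migrated files
  df -h /bricks/brick1 /bricks/brick2 /bricks/brick3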

I did my tests with a 4x2 distribute-replicate volume living on two nodes (each with 4 bricks), removing both replicas of the last brick.  It is likely that the same problem would happen on a pure distribute volume, but I have not tested it.
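
For context, a sketch of a comparable 4x2 distributed-replicate layout and the removal of one replica pair, with hypothetical host and brick names (the exact commands used in the test are not given in this report):

  gluster volume create testvol replica 2 \
      server1:/bricks/brick1 server2:/bricks/brick1 \
      server1:/bricks/brick2 server2:/bricks/brick2 \
      server1:/bricks/brick3 server2:/bricks/brick3 \
      server1:/bricks/brick4 server2:/bricks/brick4
  gluster volume start testvol

  # retire the last brick and its replica; both copies are removed together
  gluster volume remove-brick testvol server1:/bricks/brick4 server2:/bricks/brick4 start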

I expect to start off with 4TB drives, one brick per drive, and each brick will contain several million files.  Migrating the data off such a brick will take several hours.  We cannot afford to have that much data be unavailable for that much time.  Someday the servers with the 4TB drives will be ancient, ready for retirement.

--- Additional comment from redhat on 2012-10-02 13:10:23 EDT ---

If the volume starts out more than half full, you are likely to run into Bug 862347 at step 4.

--- Additional comment from sgowda on 2012-10-03 01:42:29 EDT ---

Hi Shawn,

Please attach the client logs (mount process) where the lookup of such files fails.  The remove-brick logs related to the files in question would also help.

--- Additional comment from redhat on 2012-10-03 15:02:03 EDT ---

As noted on Bug 862347, I completed one remove-brick run and did not run into this bug.  As of the end of that first run, all migrated files seem to be still accessible.  I will see what happens during subsequent remove-brick runs.

--- Additional comment from redhat on 2012-10-03 15:38:36 EDT ---

During the first round of testing for Bug 862347, I did not run into this bug at all.  I have no idea what's different between this run and the one where everything was unavailable.

I do plan to do another round of testing after completely deleting the volume and starting over.

Comment 2 Amar Tumballi 2012-12-11 12:37:47 UTC
Marking it the same as the source bug.