867351 – migrated data with "remove-brick start" unavailable until commit

Summary: migrated data with "remove-brick start" unavailable until commit

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterd
Sub Component:
Version:	2.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	shishir gowda
QA Contact:	SATHEESARAN
Docs Contact:
URL:
Whiteboard:
Depends On:	862332
Blocks:

Reported:	2012-10-17 11:11 UTC by Vidya Sakar
Modified:	2013-12-09 01:34 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	862332
Environment:
Last Closed:	2012-12-11 12:37:47 UTC
Embargoed:
Dependent Products:

Description Vidya Sakar 2012-10-17 11:11:48 UTC

+++ This bug was initially created as a clone of Bug #862332 +++

Description of problem:
In order to retire one or more bricks from a volume, you must do a 'remove-brick start' operation, followed by 'remove-brick commit' when the migration is complete.  When doing this, each file that gets migrated becomes unavailable to the clients.  Issuing the commit operation makes all migrated files available again.

Steps to Reproduce:
1. Begin migrating data off a brick with the remove-brick start command.
2. Check the $volume-rebalance log to find a file that has been migrated.
3. Try to access the file found in step 2.  Access will fail.
4. Wait for the migration to complete.
5. If there are failures 
6. Issue the remove-brick commit operation.
7. Try to access the file again.  It will succeed.
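
The steps above can be sketched as a command plan. This is a minimal, dry-run sketch only: it builds the `gluster volume remove-brick ... start|status|commit` command lines and prints them without executing anything. The volume name, brick paths, and probe file path are hypothetical placeholders, not values from this report, and the file to probe still has to be picked by hand from the rebalance log (step 2).

```python
import shlex
from typing import List, Tuple

GLUSTER = "gluster"  # CLI binary name; assumed to be on PATH


def remove_brick_cmd(volume: str, bricks: List[str], action: str) -> List[str]:
    """Build argv for: gluster volume remove-brick <volume> <brick>... <action>."""
    if action not in ("start", "status", "commit"):
        raise ValueError(f"unsupported action: {action}")
    if not bricks:
        raise ValueError("at least one brick is required")
    return [GLUSTER, "volume", "remove-brick", volume, *bricks, action]


def repro_plan(volume: str, bricks: List[str], probe_file: str) -> List[Tuple[str, List[str]]]:
    """Return (description, argv) pairs mirroring the reproduction steps.

    probe_file is a path on a client mount to a file that the rebalance
    log reports as already migrated; choosing it is a manual step.
    """
    return [
        ("step 1: start migration", remove_brick_cmd(volume, bricks, "start")),
        ("step 3: access migrated file (reported to fail)", ["stat", probe_file]),
        ("step 4: poll until migration completes", remove_brick_cmd(volume, bricks, "status")),
        ("step 6: commit the removal", remove_brick_cmd(volume, bricks, "commit")),
        ("step 7: access the file again (reported to succeed)", ["stat", probe_file]),
    ]


if __name__ == "__main__":
    # Hypothetical names: both replicas of the last brick pair in a 4x2 volume.
    plan = repro_plan(
        "testvol",
        ["server1:/bricks/b4", "server2:/bricks/b4"],
        "/mnt/testvol/path/to/migrated-file",
    )
    for description, argv in plan:
        print(f"# {description}")
        print(shlex.join(argv))
```

Printing rather than running the commands keeps the sketch safe to try on a machine without a Gluster installation. Note that `commit` removes the bricks from the volume, so it should only be issued once `status` reports the migration as complete.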

Actual results:
Each migrated file is unavailable from the time it gets migrated until the commit operation is performed.

Expected results:
Each file should remain available after it gets migrated.  The commit operation should not be required to continue to access data.  The commit operation should simply finalize the removal, or (when it might be required) force removal with data loss if no migration has been done.


Additional info:

Bug 770346 is similar, though apparently with that bug, the data was completely lost even after the commit.

The migration seems to be prone to failures on individual files.  No failure notification is made other than a number on the 'status' screen that such failures have occurred.  Such failures are guaranteed when the available disk space on one or more bricks is less than the amount of used space on the brick that is being removed, even if the volume as a whole has plenty of space.  I will file a separate bug for that problem.
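
The disk-space condition described above can be checked ahead of time with simple arithmetic. The following is a rough sketch of the reporter's stated condition only, not of Gluster's internal placement logic: it flags every remaining brick whose free space is smaller than the used space on the brick being removed. The brick names and byte counts are hypothetical.

```python
from typing import Dict, List


def bricks_at_risk(used_on_removed: int, free_on_remaining: Dict[str, int]) -> List[str]:
    """Return remaining bricks whose free space is below the removed brick's used space.

    Both arguments are in the same unit (bytes here). A non-empty result
    matches the condition under which the report says migration failures occur.
    """
    return sorted(
        name for name, free in free_on_remaining.items() if free < used_on_removed
    )


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical figures for the three bricks that stay in the volume.
    remaining = {
        "server1:/bricks/b1": 500 * GIB,
        "server1:/bricks/b2": 200 * GIB,
        "server1:/bricks/b3": 350 * GIB,
    }
    print(bricks_at_risk(300 * GIB, remaining))
```

With these sample figures only the brick with 200 GiB free is flagged, even though the remaining bricks together have far more free space than the 300 GiB to be migrated.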

I did my tests with a 4x2 distribute-replicate volume living on two nodes (each with 4 bricks), removing both replicas of the last brick.  It is likely that the same problem would happen on a pure distribute volume, but I have not tested it.

I expect to start off with 4TB drives, one brick per drive, and each brick will contain several million files.  Migrating the data off such a brick will take several hours.  We cannot afford to have that much data be unavailable for that much time.  Someday the servers with the 4TB drives will be ancient, ready for retirement.

--- Additional comment from redhat on 2012-10-02 13:10:23 EDT ---

If the volume starts out more than half full, you are likely to run into Bug 862347 at step 4.

--- Additional comment from sgowda on 2012-10-03 01:42:29 EDT ---

Hi Shawn,

Please attach the client logs (mount process) where the lookup of such files fails. The remove-brick logs related to the files in question would also help.

--- Additional comment from redhat on 2012-10-03 15:02:03 EDT ---

As noted on Bug 862347, I completed one remove-brick run and did not run into this bug.  As of the end of that first run, all migrated files seem to be still accessible.  I will see what happens during subsequent remove-brick runs.

--- Additional comment from redhat on 2012-10-03 15:38:36 EDT ---

During the first round of testing for Bug 862347, I did not run into this bug at all.  I have no idea what's different between this run and the one where everything was unavailable.

I do plan to do another round of testing after completely deleting the volume and starting over.

Comment 2 Amar Tumballi 2012-12-11 12:37:47 UTC

Marking it the same as the source bug.
