Red Hat Bugzilla – Bug 867351
migrated data with "remove-brick start" unavailable until commit
Last modified: 2013-12-08 20:34:08 EST
+++ This bug was initially created as a clone of Bug #862332 +++
Description of problem:
In order to retire one or more bricks from a volume, you must do a 'remove-brick start' operation, followed by 'remove-brick commit' when the migration is complete. When doing this, each file that gets migrated becomes unavailable to the clients. Issuing the commit operation makes all migrated files available again.
Steps to Reproduce:
1. Begin migrating data off a brick with the remove-brick start command.
2. Check the $volume-rebalance log to find a file that has been migrated.
3. Try to access the file found in step 2. Access will fail.
4. Wait for the migration to complete.
5. If there are failures
6. Issue the remove-brick commit operation.
7. Try to access the file again. It will succeed.
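The reproduction steps above correspond roughly to the following CLI sequence. The volume name `testvol`, the server/brick paths, and the client mount point are hypothetical stand-ins; the rebalance log path varies by installation. This requires a running GlusterFS cluster, so it is a sketch rather than a runnable script.

```shell
# Start draining the brick (both replicas of the last distribute subvolume).
gluster volume remove-brick testvol \
    server1:/bricks/brick4 server2:/bricks/brick4 start

# Watch progress; the failure count in this output is the only
# notification that individual file migrations failed.
gluster volume remove-brick testvol \
    server1:/bricks/brick4 server2:/bricks/brick4 status

# Find a file that has already been migrated (log path varies by install).
grep 'complete' /var/log/glusterfs/testvol-rebalance.log | head

# From a client mount, try to read a migrated file. This is where the bug
# shows: access fails until the commit below is issued.
cat /mnt/testvol/path/to/migrated-file

# After migration finishes, finalize the removal. Migrated files become
# accessible again only after this step.
gluster volume remove-brick testvol \
    server1:/bricks/brick4 server2:/bricks/brick4 commit
```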
Actual results:
Each migrated file is unavailable from the time it gets migrated until the commit operation is performed.
Expected results:
Each file should remain available after it gets migrated. The commit operation should not be required for continued access to data; it should simply finalize the removal, or (where forcing is required) complete the removal with data loss when no migration has been done.
Bug 770346 is similar, though apparently with that bug, the data was completely lost even after the commit.
The migration seems to be prone to failures on individual files. The only failure notification is a failure count in the 'status' output. Such failures are guaranteed when the available disk space on one or more remaining bricks is less than the used space on the brick being removed, even if the volume as a whole has plenty of space. I will file a separate bug for that problem.
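A crude pre-flight check for the space condition described above can be scripted. The byte counts below are example figures standing in for real numbers; in practice the per-brick free space would be read with `df` on each brick filesystem.

```shell
#!/bin/sh
# Example figures only; replace with real values from 'df' on each brick.
used_on_departing=3000000000          # bytes used on the brick being removed
free_on_remaining="5000000000 2500000000 4000000000"  # free bytes per remaining brick

# DHT hashes files across the remaining bricks, so any single brick with
# less free space than the departing brick's used data risks migration
# failures (per the observation above), even if the volume has room overall.
for free in $free_on_remaining; do
    if [ "$free" -lt "$used_on_departing" ]; then
        echo "WARNING: remaining brick has only $free bytes free"
    fi
done
```

With the example figures, this flags the 2.5 GB brick as a migration-failure risk.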
I did my tests with a 4x2 distribute-replicate volume living on two nodes (each with 4 bricks), removing both replicas of the last brick. It is likely that the same problem would happen on a pure distribute volume, but I have not tested it.
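For reference, a 4x2 distribute-replicate volume like the one tested could be created along these lines (host names and brick paths are hypothetical; removing `brick4` on both servers then drops the last replica pair without changing the replica count):

```shell
gluster volume create testvol replica 2 \
    server1:/bricks/brick1 server2:/bricks/brick1 \
    server1:/bricks/brick2 server2:/bricks/brick2 \
    server1:/bricks/brick3 server2:/bricks/brick3 \
    server1:/bricks/brick4 server2:/bricks/brick4
gluster volume start testvol
```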
I expect to start off with 4TB drives, one brick per drive, with several million files per brick. Migrating the data off such a brick will take several hours; we cannot afford to have that much data be unavailable for that long. Eventually the servers holding those drives will themselves be ready for retirement, so brick removal has to work at this scale.
--- Additional comment from email@example.com on 2012-10-02 13:10:23 EDT ---
If the volume starts out more than half full, you are likely to run into Bug 862347 at step 4.
--- Additional comment from firstname.lastname@example.org on 2012-10-03 01:42:29 EDT ---
Please attach the client logs (mount process) where the lookup of such files fails. The remove-brick logs related to the files in question would also help.
--- Additional comment from email@example.com on 2012-10-03 15:02:03 EDT ---
As noted on Bug 862347, I completed one remove-brick run and did not run into this bug. As of the end of that first run, all migrated files seem to be still accessible. I will see what happens during subsequent remove-brick runs.
--- Additional comment from firstname.lastname@example.org on 2012-10-03 15:38:36 EDT ---
During the first round of testing for Bug 862347, I did not run into this bug at all. I have no idea what's different between this run and the one where everything was unavailable.
I do plan to do another round of testing after completely deleting the volume and starting over.
Marking it the same as the source bug.