Description of problem:
-----------------------
After starting remove-brick on a volume, the bricks were brought down. The status of the remove-brick operations is now shown as stopped.

[root@rhs ~]# gluster v status
Status of volume: test_dis
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.147:/rhs/brick1/b1                       N/A     N       N/A
Brick 10.70.37.147:/rhs/brick1/b2                       N/A     N       N/A
NFS Server on localhost                                 2049    Y       7094

Task Status of Volume test_dis
------------------------------------------------------------------------------
Task                 : Remove brick
ID                   : b3b23f85-f5d5-4e48-a673-4c93a02177ad
Removed bricks:
10.70.37.147:/rhs/brick1/b1
Status               : stopped

IMO, brick processes going down, while remove-brick is in progress, should result in a failure in the remove-brick operation, and should not cause it to 'stop'. The status should be shown as failed, instead of stopped.

Version-Release number of selected component (if applicable):
glusterfs 3.4.0.35.1u2rhs

How reproducible:
Always

Steps to Reproduce:
1. Create a distribute volume with 2 bricks, start it, mount it and create data at the mount point.
2. Start remove-brick operation on one of the bricks.
3. Kill glusterfsd processes.
4. Check volume status.

Actual results:
The status of the remove-brick operation is shown as stopped.

Expected results:
The status of the remove-brick operation should be shown as failed, not stopped.

Additional info:
sosreports attached.
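For anyone re-running this, the reproduction steps above could be scripted roughly as follows. This is only a sketch: the volume name, host and brick paths are taken from the status output in this report, while the mount point, the `dd` invocation used to create data, and the use of `pkill` to kill the brick processes are assumptions. By default it only prints the commands; enough data has to exist on the brick being removed for the migration to still be running when the bricks are killed.

```python
import shlex
import subprocess

HOST = "10.70.37.147"          # from the status output in this report
VOLUME = "test_dis"            # from the status output in this report
BRICKS = [f"{HOST}:/rhs/brick1/b1", f"{HOST}:/rhs/brick1/b2"]
MOUNT_POINT = "/mnt/test_dis"  # assumption: any empty directory on the client


def repro_commands():
    """Return the reproduction steps as a list of argv lists."""
    return [
        # Step 1: create a 2-brick distribute volume, start it, mount it, add data.
        ["gluster", "volume", "create", VOLUME, *BRICKS],
        ["gluster", "volume", "start", VOLUME],
        ["mount", "-t", "glusterfs", f"{HOST}:/{VOLUME}", MOUNT_POINT],
        # Assumption: placeholder data set; adjust so migration takes a while.
        ["dd", "if=/dev/urandom", f"of={MOUNT_POINT}/file1", "bs=1M", "count=100"],
        # Step 2: start remove-brick on one of the bricks.
        ["gluster", "volume", "remove-brick", VOLUME, BRICKS[0], "start"],
        # Step 3: kill the brick processes (assumption: pkill is available).
        ["pkill", "glusterfsd"],
        # Step 4: check the volume and task status.
        ["gluster", "volume", "status", VOLUME],
    ]


def main(execute=False):
    """Print each command; run it as well only when execute=True."""
    for cmd in repro_commands():
        print(shlex.join(cmd))
        if execute:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main(execute=False)
```

Passing `execute=True` would run the commands for real, which should only be done as root on a disposable test node.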
Find sosreport at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1028995/
When brick processes are killed while remove-brick is going on, the status of the remove-brick operation is shown as stopped, instead of failed. glusterfs now expects either a commit or stop of this operation, before another task can be started. This causes RHSC engine to display the task as stopped. However, neither Commit nor Retain is enabled in the UI. So, a user using the Console can neither start a new task, nor commit/stop the previous task. This is causing a problem for RHSC.
The above happens with rebalance as well. When brick processes are killed, the rebalance status is shown as stopped, instead of failed.
Under review at https://code.engineering.redhat.com/gerrit/16981
The status of the remove-brick task is now shown as 'failed'. But, when the user tries to start another task, say, rebalance, the following message is seen:

volume rebalance: dis_vol: failed: A remove-brick task on volume dis_vol is not yet committed. Either commit or stop the remove-brick task.

If the remove-brick task was a failure, then the user should not be expected to perform any other operation on the task, like commit or stop. Moving to ASSIGNED.
(In reply to Shruti Sampat from comment #6)
> The status of the remove-brick task is now shown as 'failed'. But, when the
> user tries to start another task, say, rebalance, the following message is
> seen -
>
> volume rebalance: dis_vol: failed: A remove-brick task on volume dis_vol is
> not yet committed. Either commit or stop the remove-brick task.
>
> If the remove-brick task was a failure, then the user should not be expected
> to perform any other operation on the task, like commit or stop. Moving to
> ASSIGNED.

This bug report was about the rebalance status being displayed as stopped instead of failed when a brick is killed, which has been fixed. The issue of not being able to start a rebalance/remove-brick once a remove-brick fails is unrelated to this bug, and is a newer bug. This would have happened even if this bug didn't exist.

The issue is with how glusterd tracks a remove-brick process. In glusterd's eyes a remove-brick task is only completed after a commit or a stop is issued. This is because, unlike a rebalance, remove-brick requires changes to the volume information, which will be on hold. These changes need to be either committed or reverted manually before doing further operations.

Please open another bug for this issue, so that we will be able to track it correctly. If you are being blocked by this, you can do a 'remove-brick stop' command to revert the volume changes. You should be able to continue testing after that.

I'm moving this bug back to ON_QA. If you need any more clarification regarding this you can talk to me directly.
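The tracking behaviour described above can be illustrated with a small toy model. This is not glusterd's code: the `Volume` class, its method names and the `TaskStatus` values are all hypothetical, and the only thing taken from this thread is the rule that a remove-brick stays pending until a commit or stop is issued, together with the error text quoted in comment #6.

```python
from enum import Enum


class TaskStatus(Enum):
    IN_PROGRESS = "in progress"
    STOPPED = "stopped"
    FAILED = "failed"
    COMPLETED = "completed"


class Volume:
    """Toy model of a volume with at most one pending remove-brick task."""

    def __init__(self, name, bricks):
        self.name = name
        self.bricks = list(bricks)
        self.pending_remove = None   # brick whose removal is not yet committed
        self.remove_status = None

    def _require_no_pending(self):
        # A pending remove-brick blocks every other task, whatever its status.
        if self.pending_remove is not None:
            raise RuntimeError(
                f"A remove-brick task on volume {self.name} is not yet "
                "committed. Either commit or stop the remove-brick task."
            )

    def remove_brick_start(self, brick):
        self._require_no_pending()
        self.pending_remove = brick
        self.remove_status = TaskStatus.IN_PROGRESS

    def brick_processes_killed(self):
        # With the fix for this bug, the task is reported as failed, not stopped.
        if self.remove_status is TaskStatus.IN_PROGRESS:
            self.remove_status = TaskStatus.FAILED

    def remove_brick_stop(self):
        # Revert the held volume changes; the brick stays in the volume.
        if self.pending_remove is not None:
            self.pending_remove = None
            self.remove_status = TaskStatus.STOPPED

    def remove_brick_commit(self):
        # Apply the held volume changes; the brick leaves the volume.
        if self.pending_remove is not None:
            self.bricks.remove(self.pending_remove)
            self.pending_remove = None
            self.remove_status = TaskStatus.COMPLETED

    def rebalance_start(self):
        self._require_no_pending()
        return "rebalance started"


if __name__ == "__main__":
    vol = Volume("dis_vol", ["b1", "b2"])
    vol.remove_brick_start("b1")
    vol.brick_processes_killed()
    try:
        vol.rebalance_start()
    except RuntimeError as err:
        print(f"volume rebalance: {vol.name}: failed: {err}")
    vol.remove_brick_stop()
    print(vol.rebalance_start())
```

In this model a failed remove-brick still counts as pending, which is why the rebalance is refused until 'remove-brick stop' (or commit) clears it.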
Thanks for the clarification. Will open another bug for the other issue, that is, the user being required to perform commit or stop before another task can be started. Marking this one as verified.
Can you please verify the doc text for technical accuracy?
The doc text looks fine.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-0208.html