Bug 1028995 - When a brick process is killed while remove-brick is in progress, the status of the remove-brick task is shown as stopped

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	glusterfs
Sub Component:
Version:	2.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	RHGS 2.1.2
Assignee:	Kaushal
QA Contact:	Shruti Sampat
Docs Contact:
URL:
Whiteboard:
Depends On:	1038452
Blocks:

Reported:	2013-11-11 12:49 UTC by Shruti Sampat
Modified:	2015-05-13 16:29 UTC
CC List:	10 users
Fixed In Version:	glusterfs-3.4.0.49rhs
Doc Type:	Bug Fix
Doc Text:	Previously, when brick process was terminated while remove-brick was in progress, the status of the remove-brick operation was displayed as 'stopped'. With this fix, the status is displayed appropriately.
Clone Of:
Clones:	1038452
Environment:
Last Closed:	2014-02-25 08:02:49 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2014:0208	0	normal	SHIPPED_LIVE	Red Hat Storage 2.1 enhancement and bug fix update #2	2014-02-25 12:20:30 UTC

Description Shruti Sampat 2013-11-11 12:49:20 UTC

Description of problem:
-----------------------
After starting remove-brick on a volume, the bricks were brought down. The status of the remove-brick operation is now shown as stopped.

[root@rhs ~]# gluster v status
Status of volume: test_dis
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.70.37.147:/rhs/brick1/b1                       N/A     N       N/A
Brick 10.70.37.147:/rhs/brick1/b2                       N/A     N       N/A
NFS Server on localhost                                 2049    Y       7094
 
Task Status of Volume test_dis
------------------------------------------------------------------------------
Task                 : Remove brick        
ID                   : b3b23f85-f5d5-4e48-a673-4c93a02177ad
Removed bricks:     
10.70.37.147:/rhs/brick1/b1
Status               : stopped             


IMO, brick processes going down, while remove-brick is in progress, should result in a failure in the remove-brick operation, and should not cause it to 'stop'. The status should be shown as failed, instead of stopped.

Version-Release number of selected component (if applicable):
glusterfs 3.4.0.35.1u2rhs

How reproducible:
Always

Steps to Reproduce:
1. Create a distribute volume with 2 bricks, start it, mount it and create data at the mount point.
2. Start remove-brick operation on one of the bricks.
3. Kill glusterfsd processes.
4. Check volume status.
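
The steps above can be sketched as a short command list. This is only an illustration under stated assumptions: the volume name, host, and brick paths are taken from the sample output in this report, while the mount point `/mnt/test_dis`, the `dd` data file, and the helper names `repro_commands` and `run_all` are hypothetical. By default the script only prints the commands; it executes them only when `dry_run=False` is passed, and that should be done on a disposable test node.

```python
import subprocess


def repro_commands(volume="test_dis", host="10.70.37.147",
                   bricks=("/rhs/brick1/b1", "/rhs/brick1/b2"),
                   mount_point="/mnt/test_dis"):
    """Build the command sequence for the reproduction steps (nothing is run here)."""
    brick_specs = [f"{host}:{path}" for path in bricks]
    return [
        # Step 1: create a 2-brick distribute volume, start it, mount it, write data.
        ["gluster", "volume", "create", volume, *brick_specs],
        ["gluster", "volume", "start", volume],
        ["mount", "-t", "glusterfs", f"{host}:/{volume}", mount_point],
        ["dd", "if=/dev/urandom", f"of={mount_point}/file1", "bs=1M", "count=100"],
        # Step 2: start remove-brick on one of the bricks.
        ["gluster", "volume", "remove-brick", volume, brick_specs[0], "start"],
        # Step 3: kill the brick processes.
        ["pkill", "glusterfsd"],
        # Step 4: check the volume status, including the task status section.
        ["gluster", "volume", "status", volume],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("+", " ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(repro_commands())
```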

Actual results:
The status of the remove-brick operation is shown as stopped.

Expected results:
The status of the remove-brick operation should be shown as failed, not stopped.
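
A minimal sketch of how this expectation could be checked mechanically, assuming the Task Status section keeps the layout shown in the description. The names `parse_task_status`, `status_is_failed`, and `SAMPLE` are illustrative and are not part of gluster; the sample text is copied from the output above.

```python
import re

# Task Status section as reported in this bug (trailing spaces trimmed).
SAMPLE = """\
Task Status of Volume test_dis
------------------------------------------------------------------------------
Task                 : Remove brick
ID                   : b3b23f85-f5d5-4e48-a673-4c93a02177ad
Removed bricks:
10.70.37.147:/rhs/brick1/b1
Status               : stopped
"""

# "Key   : value" lines; whitespace is required before the colon so that
# brick specs such as host:/path are not mistaken for fields.
FIELD = re.compile(r"^([A-Za-z][A-Za-z ]*?)\s+:\s*(.*)$")


def parse_task_status(text):
    """Return the fields of one Task Status section as a dict."""
    task = {"Removed bricks": []}
    in_bricks = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"-"} or line.startswith("Task Status of Volume"):
            continue
        if line.startswith("Removed bricks"):
            in_bricks = True
            continue
        match = FIELD.match(line)
        if match:
            in_bricks = False
            task[match.group(1)] = match.group(2).strip()
        elif in_bricks:
            task["Removed bricks"].append(line)
    return task


def status_is_failed(task):
    """True when the task reports the status this bug expects after a brick is killed."""
    return task.get("Status") == "failed"


if __name__ == "__main__":
    task = parse_task_status(SAMPLE)
    print(task["Task"], "->", task["Status"], "| expected 'failed':", status_is_failed(task))
```

Run against the sample from this report, the check returns False, which matches the actual result described above.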

Additional info:
sosreports attached.

Comment 1 Shruti Sampat 2013-11-11 12:53:31 UTC

Find sosreport at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1028995/

Comment 3 Dusmant 2013-11-11 13:21:52 UTC

When brick processes are killed while remove-brick is going on, the status of the remove-brick operation is shown as stopped, instead of failed. glusterfs now expects either a commit or stop of this operation, before another task can be started.

This causes the RHSC engine to display the task as stopped. But neither Commit nor Retain is enabled in the UI. So, a user using the Console can neither start a new task, nor commit/stop the previous task.

This is causing a problem for RHSC...

Comment 4 RamaKasturi 2013-11-21 08:25:20 UTC

The above happens with rebalance as well.

When brick processes are killed, the rebalance status is shown as stopped, instead of failed.

Comment 5 Kaushal 2013-12-09 05:00:12 UTC

Under review at https://code.engineering.redhat.com/gerrit/16981

Comment 6 Shruti Sampat 2013-12-12 12:26:28 UTC

The status of the remove-brick task is now shown as 'failed'. But, when the user tries to start another task, say, rebalance, the following message is seen - 

volume rebalance: dis_vol: failed: A remove-brick task on volume dis_vol is not yet committed. Either commit or stop the remove-brick task.

If the remove-brick task was a failure, then the user should not be expected to perform any other operation on the task, like commit or stop. Moving to ASSIGNED.

Comment 7 Kaushal 2013-12-13 07:41:42 UTC

(In reply to Shruti Sampat from comment #6)
> The status of the remove-brick task is now shown as 'failed'. But, when the
> user tries to start another task, say, rebalance, the following message is
> seen - 
> 
> volume rebalance: dis_vol: failed: A remove-brick task on volume dis_vol is
> not yet committed. Either commit or stop the remove-brick task.
> 
> If the remove-brick task was a failure, then the user should not be expected
> to perform any other operation on the task, like commit or stop. Moving to
> ASSIGNED.

This bug report was about the rebalance status being displayed as stopped instead of failed when a brick is killed, which has been fixed.

The issue of not being able to start a rebalance/remove-brick once a remove-brick fails is unrelated to this bug, and is a newer bug. This would have happened even if this bug didn't exist. The issue is with how glusterd tracks a remove-brick process. In glusterd's eyes, a remove-brick task is only completed after a commit or a stop is issued. This is because, unlike a rebalance, remove-brick requires changes to the volume information, which will be on hold. These changes need to be either committed or reverted before doing further operations, and this needs to be done manually.

Please open another bug for this issue, so that we will be able to track it correctly. If you are being blocked by this, you can do a 'remove-brick stop' command to revert the volume changes. You should be able to continue testing after that.
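
The tracking behaviour described above can be pictured with a toy model. This is not glusterd code: the class `Volume`, the exception `TaskPending`, and the method names are hypothetical, and the "committed" label exists only in this sketch. The only text taken from the report is the rebalance error message from comment #6.

```python
class TaskPending(Exception):
    """Raised when a new task is refused because a remove-brick is unresolved."""


class Volume:
    """Toy model: a remove-brick stays pending until it is committed or stopped."""

    def __init__(self, name):
        self.name = name
        self.remove_brick_pending = False
        self.remove_brick_status = None

    def start_remove_brick(self):
        # Volume information changes are put on hold from this point.
        self.remove_brick_pending = True
        self.remove_brick_status = "in progress"

    def brick_killed(self):
        # With the fix, a killed brick marks the task as failed, not stopped.
        # The task is still pending: failure does not resolve it.
        if self.remove_brick_status == "in progress":
            self.remove_brick_status = "failed"

    def stop_remove_brick(self):
        # Reverts the on-hold volume changes (the workaround from this comment).
        self.remove_brick_pending = False
        self.remove_brick_status = "stopped"

    def commit_remove_brick(self):
        # Applies the on-hold volume changes; "committed" is a label for this sketch only.
        self.remove_brick_pending = False
        self.remove_brick_status = "committed"

    def start_rebalance(self):
        if self.remove_brick_pending:
            raise TaskPending(
                f"volume rebalance: {self.name}: failed: A remove-brick task on "
                f"volume {self.name} is not yet committed. Either commit or stop "
                "the remove-brick task."
            )
        return "started"


if __name__ == "__main__":
    vol = Volume("dis_vol")
    vol.start_remove_brick()
    vol.brick_killed()
    try:
        vol.start_rebalance()
    except TaskPending as err:
        print(err)
    vol.stop_remove_brick()
    print("rebalance:", vol.start_rebalance())
```

In this model, a failed remove-brick still blocks a rebalance until a stop or commit is issued, which is the situation reported in comment #6.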

I'm moving this bug back to ON_QA. If you need any more clarification regarding this you can talk to me directly.

Comment 8 Shruti Sampat 2013-12-13 10:35:57 UTC

Thanks for the clarification. Will open another bug for the other issue, that is, the user being required to perform commit or stop before another task can be started. Marking this one as verified.

Comment 9 Pavithra 2014-01-03 06:21:07 UTC

Can you please verify the doc text for technical accuracy?

Comment 10 Kaushal 2014-01-03 07:07:47 UTC

The doc text looks fine.

Comment 12 errata-xmlrpc 2014-02-25 08:02:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html
