Description of problem:
=======================
Have a two-node cluster with a Distributed-Replicate volume mounted as FUSE, with enough data on it. Started removing a replica brick set, which triggered a rebalance. While the rebalance was in progress, restarted glusterd on the node from which data migration was happening, then tried to commit the remove-brick. The commit succeeded even though data migration had not completed.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-17

How reproducible:
=================
Every time

Steps to Reproduce:
===================
1. Have a two-node cluster with a Distributed-Replicate volume (2 x 2)
2. Mount the volume as FUSE and write enough data
3. Start removing a replica brick set // triggers data migration
4. Using remove-brick status, identify the brick node from which data migration is happening
5. Restart glusterd on the node identified in step 4 while the rebalance is in progress
6. Try to commit the remove-brick // commit succeeds without failure

Actual results:
===============
remove-brick commit succeeds even though the rebalance has not completed.

Expected results:
=================
remove-brick commit should not happen while the rebalance is in progress.

Additional info:
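For reference, the reproduction steps above map to a command sequence along these lines (volume name, brick paths, and host names are assumptions for illustration; run step 5 on the node identified in step 4):

```shell
# 3. Start removing one replica brick pair -- this triggers data migration
gluster volume remove-brick testvol node1:/bricks/b3 node2:/bricks/b4 start

# 4. Identify the node from which data migration is happening
gluster volume remove-brick testvol node1:/bricks/b3 node2:/bricks/b4 status

# 5. On the node found in step 4, restart glusterd mid-rebalance
systemctl restart glusterd

# 6. Attempt the commit -- before the fix this wrongly succeeds
gluster volume remove-brick testvol node1:/bricks/b3 node2:/bricks/b4 commit
```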
RCA: Whether a remove-brick operation is in progress is tracked by a per-volume flag, 'decommission_is_in_progress'. This flag is not persisted, so on a glusterd restart the information is lost and all the validations that block remove-brick commit while a rebalance is in progress are skipped. I agree with QE that this is a potential data-loss situation and should be considered a *blocker*. I've posted a fix upstream: http://review.gluster.org/#/c/13323/
Workaround for this bug: after restarting glusterd and before performing remove-brick commit, the user should check remove-brick status. If the status shows the remove-brick still in progress, the user should not perform the remove-brick commit operation.
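The workaround can be scripted as a guard around the commit. A minimal sketch, assuming the status text contains "in progress" while migration is running and "completed" when done (the helper name and sample status strings are hypothetical; in practice `$status` would come from `gluster volume remove-brick ... status`):

```shell
# Return 0 (safe to commit) only when the captured remove-brick
# status text reports completion; otherwise refuse.
safe_to_commit() {
  case "$1" in
    *"in progress"*) return 1 ;;  # migration still running: do not commit
    *completed*)     return 0 ;;  # all nodes done: commit is safe
    *)               return 1 ;;  # anything unexpected: err on the safe side
  esac
}

safe_to_commit "node1: rebalance in progress" && echo "commit now" || echo "wait"   # prints: wait
safe_to_commit "node1: completed"             && echo "commit now" || echo "wait"   # prints: commit now
```

Note that, as discussed in the following comments, this check is only trustworthy once the status statistics themselves are reliable after a glusterd restart.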
I don't think comment #5 is valid unless we pull in https://bugzilla.redhat.com/show_bug.cgi?id=1302968. As per the current code, after a restart glusterd can never reconnect to the ongoing rebalance daemon, which means the statistics are stale. So executing remove-brick status after a glusterd restart cannot indicate the rebalance completion status of all the nodes with the current code.
Yes Atin, comment #5 is valid only once https://bugzilla.redhat.com/show_bug.cgi?id=1302968 is pulled in.
Looks good now :)
The fix is now available in rhgs-3.1.3 branch, hence moving the state to Modified.
Verified this bug using the build "glusterfs-3.7.9-1". Repeated the reproduction steps mentioned in the description section. The fix is working properly: it no longer allows committing the remove-brick operation while data migration is in progress after a glusterd restart, and the rebalance continues after the glusterd restart as well. With these details, moving this bug to the next state.
LGTM :) But why was the flag moved to '?'?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240