Bug 1303125

Summary:	After glusterd restart, remove-brick commit succeeds even though data migration has not completed.
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Byreddy <bsrirama>
Component:	glusterd	Assignee:	Atin Mukherjee <amukherj>
Status:	CLOSED ERRATA	QA Contact:	Byreddy <bsrirama>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.1	CC:	amukherj, asrivast, byarlaga, lbailey, rcyriac, rhs-bugs, sankarshan, smohan, storage-qa-internal, vbellur
Target Milestone:	---	Keywords:	ZStream
Target Release:	RHGS 3.1.3
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	glusterfs-3.7.9-1	Doc Type:	Bug Fix
Doc Text:	Previously, when glusterd was restarted on a node while rebalance was still in progress, remove-brick commits succeeded even though rebalance was not yet complete. This resulted in data loss. This update ensures that remove-brick commits fail with appropriate log messages when rebalance is in progress.	Story Points:	---
Clone Of:
Clones:	1303269		Environment:
Last Closed:	2016-06-23 05:05:52 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1268895, 1299184, 1302968, 1303269, 1310972

Description Byreddy 2016-01-29 15:52:08 UTC

Description of problem:
=======================
On a two-node cluster with a Distributed-Replicate volume, mounted over FUSE and populated with enough data, I started removing a replica brick set, which triggered a rebalance. While the rebalance was in progress, I restarted glusterd on the node from which data was being migrated, and then tried to commit the remove-brick. The commit succeeded even though data migration had not completed.


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-17


How reproducible:
=================
Every time


Steps to Reproduce:
====================
1. Create a two-node cluster with a Distributed-Replicate volume (2 x 2).
2. Mount the volume over FUSE and write enough data.
3. Start removing a replica brick set. // this triggers the data migration
4. Using remove-brick status, identify the node whose brick is migrating data.
5. Restart glusterd on the node identified in step 4 while the rebalance is in progress.
6. Try to commit the remove-brick. // the commit succeeds instead of failing
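
The steps above can be sketched as a small Python driver that builds the corresponding gluster CLI invocations. This is only an illustration: the volume name, hostnames and brick paths are hypothetical, the use of `systemctl` to restart glusterd is an assumption about the node's init system, and the restart has to be run on the node identified in step 4. By default the script only prints the commands.

```python
import subprocess

# Hypothetical names, not taken from the bug report.
VOLUME = "testvol"
BRICKS = ["node1:/bricks/b2", "node2:/bricks/b2"]  # one replica pair of a 2 x 2 volume


def remove_brick_cmd(action, volume=VOLUME, bricks=BRICKS):
    """Return the argv for `gluster volume remove-brick <vol> <bricks> <action>`."""
    if action not in ("start", "status", "commit"):
        raise ValueError("unsupported remove-brick action: " + action)
    return ["gluster", "volume", "remove-brick", volume, *bricks, action]


def reproduction_plan():
    """Commands for steps 3 to 6, in order."""
    return [
        remove_brick_cmd("start"),             # step 3: triggers data migration
        remove_brick_cmd("status"),            # step 4: find the migrating node
        ["systemctl", "restart", "glusterd"],  # step 5: assumed restart command, run on that node
        remove_brick_cmd("commit"),            # step 6: should fail while migration is in progress
    ]


def run(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in plan:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(reproduction_plan())
```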

Actual results:
===============
The remove-brick commit succeeds even though the rebalance has not completed.


Expected results:
=================
The remove-brick commit should fail while the rebalance is in progress.

Additional info:

Comment 4 Atin Mukherjee 2016-01-30 04:08:17 UTC

RCA:

Whether a remove-brick operation is in progress is tracked by the 'decommission_is_in_progress' flag in the volume. This flag is not persisted, so on a glusterd restart the information is lost and all the validations that block a remove-brick commit while rebalance is in progress are skipped. I agree with QE that this is a potential data loss situation and should be considered a *blocker*.

I've posted a fix upstream: http://review.gluster.org/#/c/13323/
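
A toy Python model may help illustrate the failure mode described in this RCA. It is not glusterd code: the `Daemon` class, the JSON store and the `persist_flag` switch are all made up for the sketch, and only the flag name 'decommission_is_in_progress' comes from the analysis. A "restart" is modelled by constructing a new `Daemon` over the same store file.

```python
import json
import os
import tempfile


class Daemon:
    """Toy stand-in for a management daemon; hypothetical, not glusterd."""

    def __init__(self, store_path, persist_flag):
        self.store_path = store_path
        self.persist_flag = persist_flag
        self.volumes = {}  # volume name -> in-memory decommission_is_in_progress
        if os.path.exists(store_path):
            with open(store_path) as f:
                for name, rec in json.load(f).items():
                    # A flag missing from the store comes back as False.
                    self.volumes[name] = rec.get("decommission_is_in_progress", False)

    def _store(self):
        data = {}
        for name, flag in self.volumes.items():
            data[name] = {"decommission_is_in_progress": flag} if self.persist_flag else {}
        with open(self.store_path, "w") as f:
            json.dump(data, f)

    def remove_brick_start(self, name):
        self.volumes[name] = True
        self._store()

    def remove_brick_commit(self, name, migration_complete):
        """Return True if the commit is accepted."""
        # The migration check only runs when the flag says a
        # remove-brick is in progress; without the flag it is skipped.
        if self.volumes[name] and not migration_complete:
            return False
        return True


def commit_after_restart(persist_flag):
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "volinfo.json")
        Daemon(path, persist_flag).remove_brick_start("testvol")
        restarted = Daemon(path, persist_flag)  # simulate a daemon restart
        return restarted.remove_brick_commit("testvol", migration_complete=False)


if __name__ == "__main__":
    print("flag not persisted, commit accepted:", commit_after_restart(False))
    print("flag persisted, commit accepted:", commit_after_restart(True))
```

In this model the commit is accepted after a restart only when the flag is not written to the store, which mirrors the lost-state behaviour described here; the actual fix is the upstream review linked in this comment.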

Comment 5 Gaurav Kumar Garg 2016-02-02 06:42:58 UTC

The workaround for this bug is that, after restarting glusterd and before performing the remove-brick commit, the user should check the remove-brick status. If the status shows the operation as in progress, the user should not perform the remove-brick commit.

Comment 6 Atin Mukherjee 2016-02-03 03:39:41 UTC

I don't think comment 5 is valid unless we pull in https://bugzilla.redhat.com/show_bug.cgi?id=1302968 . With the current code, glusterd can never reconnect to the ongoing rebalance daemon after a restart, which means the statistics are stale. So running remove-brick status after a glusterd restart cannot indicate the rebalance completion status of all the nodes.

Comment 7 Gaurav Kumar Garg 2016-02-03 04:36:49 UTC

Yes Atin, comment 5 is valid only once https://bugzilla.redhat.com/show_bug.cgi?id=1302968 is pulled in.

Comment 19 Atin Mukherjee 2016-02-11 04:37:03 UTC

Looks good now :)

Comment 21 Atin Mukherjee 2016-03-22 12:02:45 UTC

The fix is now available in the rhgs-3.1.3 branch, hence moving the state to Modified.

Comment 23 Byreddy 2016-04-06 04:21:48 UTC

Verified this bug using the build "glusterfs-3.7.9-1".

Repeated the reproduction steps from the description section. The fix is working properly: after a glusterd restart, the remove-brick operation cannot be committed while data migration is in progress.

The rebalance also continues after the glusterd restart.


With these details, moving this bug to the next state.

Comment 26 Atin Mukherjee 2016-06-06 07:01:20 UTC

LGTM :) but why was the flag moved to '?'

Comment 29 errata-xmlrpc 2016-06-23 05:05:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240