Bug 1303125 - After GlusterD restart, Remove-brick commit happening even though data migration not completed.
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
3.1
x86_64 Linux
unspecified Severity high
: ---
: RHGS 3.1.3
Assigned To: Atin Mukherjee
Byreddy
Keywords: ZStream
Depends On:
Blocks: 1302968 1268895 1299184 1303269 1310972
Reported: 2016-01-29 10:52 EST by Byreddy
Modified: 2016-09-17 12:45 EDT (History)
10 users

See Also:
Fixed In Version: glusterfs-3.7.9-1
Doc Type: Bug Fix
Doc Text:
Previously, when glusterd was restarted on a node while rebalance was still in progress, remove-brick commits succeeded even though rebalance was not yet complete. This resulted in data loss. This update ensures that remove-brick commits fail with appropriate log messages when rebalance is in progress.
Story Points: ---
Clone Of:
: 1303269
Environment:
Last Closed: 2016-06-23 01:05:52 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Byreddy 2016-01-29 10:52:08 EST
Description of problem:
=======================
We have a two-node cluster with a Distributed-Replicate volume, mounted via FUSE and populated with a reasonable amount of data. We started removing a replica brick set, which triggered a rebalance. While the rebalance was in progress, we restarted glusterd on the node from which data was being migrated, then tried to commit the remove-brick. The commit succeeded even though data migration had not completed.


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-17


How reproducible:
=================
Every time


Steps to Reproduce:
====================
1. Have a two-node cluster with a Distributed-Replicate volume (2x2).
2. Mount the volume via FUSE and write a reasonable amount of data.
3. Start removing a replica brick set. // this triggers data migration
4. Using remove-brick status, identify the node from which data is being migrated.
5. Restart glusterd on the node identified in step 4 while the rebalance is in progress.
6. Try to commit the remove-brick. // the commit succeeds when it should not
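The steps above can be sketched as a dry-run script. This is a minimal sketch: the volume name and brick paths are hypothetical placeholders, and `run` only prints each command so the sketch is safe to execute anywhere; drop `run` to execute the commands on a real two-node cluster.

```shell
# Dry-run sketch of the reproduction steps; names/paths are placeholders.
run() { echo "+ $*"; }            # print the command instead of executing it

VOL=distrep                       # 2x2 distributed-replicate volume
B3=node1:/bricks/b3               # replica brick pair being removed
B4=node2:/bricks/b4

# Steps 1-2: volume created, FUSE-mounted, and populated with data (not shown).
# Step 3: start the remove-brick, which triggers data migration.
run gluster volume remove-brick "$VOL" "$B3" "$B4" start
# Step 4: identify the node from which data is being migrated.
run gluster volume remove-brick "$VOL" "$B3" "$B4" status
# Step 5: restart glusterd on that node while the rebalance is in progress.
run systemctl restart glusterd
# Step 6: commit; before the fix this succeeded despite incomplete migration.
run gluster volume remove-brick "$VOL" "$B3" "$B4" commit
```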

Actual results:
===============
The remove-brick commit succeeds even though the rebalance has not completed.


Expected results:
=================
remove-brick commit should not happen when rebalance is in progress.

Additional info:
Comment 4 Atin Mukherjee 2016-01-29 23:08:17 EST
RCA:

Whether a remove-brick operation is in progress is tracked by a per-volume flag, 'decommission_is_in_progress'. This flag is not persisted, so on a glusterd restart the information is lost and all the validations that block a remove-brick commit while rebalance is in progress are skipped. I agree with QE that this is a potential data-loss situation and should be considered a *blocker*.

I've posted a fix in upstream http://review.gluster.org/#/c/13323/
Comment 5 Gaurav Kumar Garg 2016-02-02 01:42:58 EST
The workaround for this bug: after restarting glusterd, and before performing a remove-brick commit, the user should check the remove-brick status. If the status is still in progress, the user should not perform the remove-brick commit operation.
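The workaround check can be sketched as a small shell helper. This is a minimal sketch, not gluster's own tooling: the status parsing assumes the per-node status column of the usual `remove-brick ... status` output contains the words "in progress", and the volume/brick names in the usage comment are hypothetical.

```shell
# check_rebalance_done STATUS_TEXT
# Returns 0 (safe to commit) only if no node reports "in progress"
# in the remove-brick status output passed as the first argument.
check_rebalance_done() {
    if printf '%s\n' "$1" | grep -qi "in progress"; then
        return 1    # migration still running on some node: do NOT commit
    fi
    return 0
}

# Intended use on a real cluster (volume/brick names are hypothetical):
#   status=$(gluster volume remove-brick distrep \
#              node1:/bricks/b3 node2:/bricks/b4 status)
#   check_rebalance_done "$status" && gluster volume remove-brick distrep \
#              node1:/bricks/b3 node2:/bricks/b4 commit
```

Note the caveat from comment 6: after a glusterd restart the status statistics themselves can be stale until the fix for bug 1302968 is pulled in, so this check alone is not sufficient on unpatched builds.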
Comment 6 Atin Mukherjee 2016-02-02 22:39:41 EST
I don't think comment 5 is valid unless we pull in https://bugzilla.redhat.com/show_bug.cgi?id=1302968. With the current code, on a restart glusterd can never reconnect to the ongoing rebalance daemon, which means the statistics are stale. So executing remove-brick status after a glusterd restart cannot indicate the rebalance completion status of all the nodes.
Comment 7 Gaurav Kumar Garg 2016-02-02 23:36:49 EST
Yes Atin, comment 5 is valid only once https://bugzilla.redhat.com/show_bug.cgi?id=1302968 is pulled in.
Comment 19 Atin Mukherjee 2016-02-10 23:37:03 EST
Looks good now :)
Comment 21 Atin Mukherjee 2016-03-22 08:02:45 EDT
The fix is now available in rhgs-3.1.3 branch, hence moving the state to Modified.
Comment 23 Byreddy 2016-04-06 00:21:48 EDT
Verified this bug using the build "glusterfs-3.7.9-1".

Repeated the reproduction steps from the description section. The fix works correctly: it no longer allows committing the remove-brick operation while data migration is in progress after a glusterd restart.

Rebalance also continues after the glusterd restart.

With these details, moving this bug to the next state.
Comment 26 Atin Mukherjee 2016-06-06 03:01:20 EDT
LGTM :) But why was the flag moved to '?'?
Comment 29 errata-xmlrpc 2016-06-23 01:05:52 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240
