Bug 1400092

Summary: Increasing replica count while I/O is in progress can lead to replica inconsistency
Product: Red Hat Gluster Storage Reporter: Ravishankar N <ravishankar>
Component: glusterd Assignee: hari gowtham <hgowtham>
Status: CLOSED CANTFIX QA Contact: Bala Konda Reddy M <bmekala>
Severity: medium Docs Contact:
Priority: medium    
Version: rhgs-3.2 CC: amukherj, anepatel, apaladug, bkunal, bmohanra, ccalhoun, hgowtham, jcall, nchilaka, pkarampu, ravishankar, rcyriac, rhs-bugs, sabose, storage-qa-internal, vbellur
Target Milestone: --- Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
Performing add-brick to increase replica count while I/O is going on can lead to data loss. Workaround: Ensure that increasing replica count is done offline, i.e. without clients accessing the volume.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-11-20 10:08:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1351530, 1632148    

Description Ravishankar N 2016-11-30 12:49:00 UTC
Steps:
1. Created a 2x2 volume (bricks b1 to b4) using a 2 node cluster, and fuse mounted it on a client.
2. Brought down one brick, b1.
3. Started small file creation on the mount.
4. Performed add-brick to convert it to an arbiter volume, i.e. from 2x2 to 2x(2+1), using a 3rd node for the newly added bricks b5 and b6.
5. Ran `volume start force` to bring b1 back up.
6. I/O was still going on.
7. After I/O and self-heal completed, it was found that a few files were missing on the newly added brick b5 (but present in the other bricks of the replica, i.e. b1 and b2). heal-info showed zero entries.
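
For reference, the CLI side of the steps above can be sketched roughly as follows. This only builds the argument lists and does not run anything. The volume name, host names and brick paths are hypothetical placeholders, not taken from the report. The report also gives no commands for bringing down b1, mounting the volume or generating the I/O, so those steps are left as comments.

```python
# Sketch only: builds the gluster CLI invocations for the reproduction steps.
# Volume name, hosts and brick paths are hypothetical placeholders.

def reproduction_commands(volume="testvol"):
    """Return the gluster commands for the steps, as argv lists (not executed)."""
    return [
        # Step 1: 2x2 volume on a 2 node cluster; (b1, b2) and (b3, b4) are the replica pairs.
        ["gluster", "volume", "create", volume, "replica", "2",
         "node1:/bricks/b1", "node2:/bricks/b2",
         "node1:/bricks/b3", "node2:/bricks/b4"],
        ["gluster", "volume", "start", volume],
        # Steps 2 and 3 (bring down b1, start small file creation on the fuse
        # mount) are done outside the gluster CLI and are not modelled here.
        # Step 4: convert to 2x(2+1) by adding one arbiter brick per replica pair.
        ["gluster", "volume", "add-brick", volume, "replica", "3", "arbiter", "1",
         "node3:/bricks/b5", "node3:/bricks/b6"],
        # Step 5: bring b1 back up.
        ["gluster", "volume", "start", volume, "force"],
        # Step 7: check pending heals once I/O has finished.
        ["gluster", "volume", "heal", volume, "info"],
    ]


if __name__ == "__main__":
    for argv in reproduction_commands():
        print(" ".join(argv))
```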

Problem:
When add-brick was performed, the shd got the updated volfile first and it did a conservative merge (as expected) and reset the pending xattrs for entry-heal.

The fuse mount was still operating on the old graph (with replica 2) and hence the creates did not happen on b5, until the fuse mount also got the new graph after which the creates went to all bricks.
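
The ordering problem can be illustrated with a small toy model. This is not Gluster code: bricks are plain sets of file names, a "graph" is just the list of bricks a process writes to, and the pending xattrs are reduced to a single set of bricks flagged for entry heal. It also ignores b1 being down and assumes that add-brick flags the new brick for heal. It only shows how creates made on the old graph after the shd has finished its merge end up missing on b5 with nothing left pending.

```python
# Toy model of the race described above. Not Gluster code; every name here is
# made up for illustration.

def simulate_add_brick_race():
    bricks = {"b1": set(), "b2": set(), "b5": set()}
    pending_entry_heal = set()        # stand-in for the pending xattrs

    old_graph = ["b1", "b2"]          # replica 2, before add-brick
    new_graph = ["b1", "b2", "b5"]    # after add-brick

    def create(name, graph):
        # A create only reaches the bricks in the graph the caller is using.
        for brick in graph:
            bricks[brick].add(name)

    client_graph = old_graph
    create("f1", client_graph)

    # add-brick: in this model, the new brick is flagged as needing entry heal.
    pending_entry_heal.add("b5")

    # The shd gets the new volfile first and does a conservative merge:
    # every brick ends up with the union of all entries, then pending is reset.
    union = set().union(*(bricks[b] for b in new_graph))
    for b in new_graph:
        bricks[b] |= union
    pending_entry_heal.clear()

    # The fuse client is still on the old graph, so these skip b5.
    create("f2", client_graph)
    create("f3", client_graph)

    # The client finally switches to the new graph.
    client_graph = new_graph
    create("f4", client_graph)

    missing_on_b5 = (bricks["b1"] | bricks["b2"]) - bricks["b5"]
    return missing_on_b5, pending_entry_heal


if __name__ == "__main__":
    missing, pending = simulate_add_brick_race()
    print("missing on b5:", sorted(missing))
    print("bricks pending heal:", sorted(pending))
```

In this model f2 and f3 are missing on b5 while nothing is flagged for heal, which matches heal-info showing zero entries in step 7.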


This is a gluster infra problem but is serious when replicate comes into the picture:

- If it were a plain distribute vol, the effect of the fuse client doing I/O on the old graph is that the files may get hashed based on the old layout.

- When replication is involved, this can lead to data loss:
In the above example the files were present in b1 and b2 and not b5. If for some reason, *later on*, an I/O happens which makes b5 the source for entry heal, then the heal will delete those files from b1 and b2.
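
A minimal sketch of why this ends in data loss, again as a toy model rather than actual AFR code: entry heal is reduced to "make every sink's directory listing match the source", and the file names are invented.

```python
# Toy model: entry heal makes every other brick match the source brick.
# Not Gluster code; brick contents and file names are invented.

def entry_heal_from(source, bricks):
    """Return the brick contents after an entry heal with `source` as source."""
    return {name: set(bricks[source]) for name in bricks}


# State left behind by the race: f2 and f3 never reached b5.
bricks = {
    "b1": {"f1", "f2", "f3"},
    "b2": {"f1", "f2", "f3"},
    "b5": {"f1"},
}

after = entry_heal_from("b5", bricks)
lost = bricks["b1"] - after["b1"]   # files removed from b1 by the heal

if __name__ == "__main__":
    print("deleted from b1 and b2:", sorted(lost))
```

Because b5 never saw f2 and f3, a heal that trusts b5 treats them as entries to remove, so the only good copies on b1 and b2 are deleted.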


We need to document this as a known issue, i.e. doing an add-brick to increase the replica count should only be done offline, when no I/O is going on.

Comment 6 Bhavana 2017-03-13 15:33:54 UTC
Edited the doc text slightly for the release notes.

Comment 11 Anand Paladugu 2018-07-03 18:29:14 UTC
Atin: Any inputs that you can provide w.r.t. this issue? It's pretty old, but a customer enquired about this as it's preventing them from increasing bricks without shutting down the production environment ...