Bug 1417177 - Split brain resolution must check for all the bricks to be up to avoid serving inconsistent data (visible on x3 or more)
Summary: Split brain resolution must check for all the bricks to be up to avoiding ser...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: replicate
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Ravishankar N
QA Contact: Karan Sandha
URL:
Whiteboard:
Depends On:
Blocks: 1351528 1351530 1417522 1420982 1420983 1420984
Reported: 2017-01-27 12:23 UTC by nchilaka
Modified: 2017-03-23 06:04 UTC (History)

Fixed In Version: glusterfs-3.8.4-15
Doc Type: Bug Fix
Doc Text:
Earlier, the split-brain resolution commands would erroneously resolve split-brains if two bricks that blamed each other were available, but a correct source brick was unavailable. This has now been corrected so that split-brain resolution commands will work only when all bricks are available and true split-brain conditions are present.
Clone Of:
: 1417522 (view as bug list)
Environment:
Last Closed: 2017-03-23 06:04:08 UTC



Links
System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2017:0486 | normal | SHIPPED_LIVE | Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update | 2017-03-23 09:18:45 UTC

Description nchilaka 2017-01-27 12:23:43 UTC
Description of problem:
======================
Automatic split-brain resolution must take effect only when all the bricks are up; otherwise we would serve inconsistent or undesired data, as explained below.




Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a 1x3 volume (client-side quorum is enabled by default) with bricks b1, b2, and b3.
Also set the favorite-child policy to, say, mtime (automatic resolution of split-brain).

2. FUSE-mount the volume on three different clients as follows:
c1: can ping only b1 and b2, not b3
c2: can ping only b2 and b3, not b1
c3: can ping all bricks

3. Create a file, say f1, from c3 ==> f1 is now available on all bricks.
4. Append a line from c1 (say line-c1) and a line from c2 (say line-c2) to file f1.
 That means b2 will mark b1 pending for line-c2,
 and b2 will also mark b3 pending for line-c1.

That means b2 has the only good copy.

5. Bring down b2.
6. Heal info will now show f1 as in split-brain, as b1 blames b3 and b3 blames b1.

Ideally the file should now return an I/O error for new writes.
7. However, automatic split-brain resolution will now pick file f1 for resolving.
But that is wrong, as the good copy is on b2, which is down.

Once the file is resolved, users can access f1 again, which must not be allowed: the contents of the actual good copy are lost when b2 comes back up, because b1 and b3 now blame b2 and b2 is healed from them.
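
The blame pattern in the steps above can be modeled with a short sketch. This is only an illustration under simplifying assumptions: pending marks are plain Python sets rather than AFR's on-disk xattrs, and the names apply_write and find_sources are hypothetical, not GlusterFS functions.

```python
# Simplified model of AFR pending ("blame") marks.
# Assumption: blame is tracked as sets of brick names, not real xattrs.
BRICKS = ("b1", "b2", "b3")


def apply_write(pending, reachable):
    """Record a write that landed only on the `reachable` bricks.

    Every brick that took the write blames every brick that missed it.
    """
    for writer in reachable:
        for missed in BRICKS:
            if missed not in reachable:
                pending[writer].add(missed)


def find_sources(pending, up):
    """Return the bricks in `up` that no other brick in `up` blames."""
    return sorted(
        b for b in up
        if not any(b in pending[other] for other in up if other != b)
    )


pending = {b: set() for b in BRICKS}
apply_write(pending, {"b1", "b2"})  # c1 appends line-c1; b3 misses it
apply_write(pending, {"b2", "b3"})  # c2 appends line-c2; b1 misses it

all_up = find_sources(pending, set(BRICKS))
b2_down = find_sources(pending, {"b1", "b3"})
print(all_up)   # ['b2']
print(b2_down)  # []
```

With all three bricks visible, b2 is the single unblamed source, so there is no split-brain. With b2 hidden, b1 and b3 blame each other and no source remains, which is the state that heal info reports as split-brain.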


Expected behavior:
1) b2, which has the good copy, is down, hence no further writes must be allowed.
2) When b2 comes back up, it must be the source for b1 and b3, instead of the file being healed via automatic split-brain resolution with b2 marked as the bad copy.


Solution:
Make sure automatic split-brain resolution doesn't take effect on an AFR replica set when even one of the bricks is down.
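
A minimal sketch of such a guard follows. It is not the code from the actual patch; the function name can_resolve_split_brain and the set-based blame representation are assumptions made for illustration.

```python
def can_resolve_split_brain(pending, up, all_bricks):
    """Allow resolution only if every brick is up and every brick is blamed.

    `pending` maps each brick to the set of bricks it blames (hypothetical
    representation, not AFR's on-disk format).
    """
    # Refuse outright if any brick of the replica set is unreachable,
    # because the unreachable brick might hold the only good copy.
    if set(up) != set(all_bricks):
        return False
    # True split-brain: no brick is left unblamed by the others.
    return all(
        any(b in pending[other] for other in all_bricks if other != b)
        for b in all_bricks
    )


bricks = ["b1", "b2", "b3"]

# Scenario from this report: b2 blames b1 and b3, and nobody blames b2.
scenario = {"b1": {"b3"}, "b2": {"b1", "b3"}, "b3": {"b1"}}
# A genuine split-brain: every brick is blamed by the other two.
true_split_brain = {
    "b1": {"b2", "b3"},
    "b2": {"b1", "b3"},
    "b3": {"b1", "b2"},
}

print(can_resolve_split_brain(scenario, ["b1", "b3"], bricks))      # False
print(can_resolve_split_brain(scenario, bricks, bricks))            # False
print(can_resolve_split_brain(true_split_brain, bricks, bricks))    # True
```

The first call is refused because b2 is down, the second because b2 is a valid source once it is visible, and only the third, where all bricks are up and all are blamed, is treated as resolvable.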



Actual results:


Expected results:


Additional info:

Comment 2 Ravishankar N 2017-01-30 04:34:13 UTC
Upstream patch: https://review.gluster.org/#/c/16476/

Comment 3 nchilaka 2017-01-31 09:17:37 UTC
Changing the title by removing the term "automatic", as it is possible to hit this even with non-automatic, CLI-based split-brain resolution.

Comment 4 Ravishankar N 2017-02-10 04:41:33 UTC
Downstream patch https://code.engineering.redhat.com/gerrit/#/c/97384

Comment 10 errata-xmlrpc 2017-03-23 06:04:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

