Description of problem:
======================
Automatic split-brain resolution must take effect only when all the bricks are up; otherwise we would be serving inconsistent or undesired data, as explained below.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a 1x3 volume (client-side quorum is enabled by default) with, say, bricks b1, b2, and b3. Also set the favorite child policy to, say, mtime (automatic resolution of split-brain).
2. FUSE-mount the volume on three different clients as follows:
   c1: can ping only b1 and b2, not b3
   c2: can ping only b2 and b3, not b1
   c3: can ping all bricks
3. Create a file, say f1, from c3. This means f1 is now available on all bricks.
4. Append line-c1 from c1 and line-c2 from c2 to f1. This means b2 will mark b1 pending for line-c2, and b2 will also mark b3 pending for line-c1, so b2 has the only good copy.
5. Bring down b2.
6. heal info will now show f1 as in split-brain, as b1 blames b3 and b3 blames b1. Ideally the file should now return an I/O error for new writes.
7. However, automatic split-brain resolution will pick f1 for resolving. That is wrong, because the good copy is on b2, which is down. Once it is resolved, users can access f1, which must not be allowed: the contents of the actual good copy are lost when b2 comes back up, because b2 gets healed now that b1 and b3 blame it.

Expected behavior:
1) b2, which has the good copy, is down, hence no further writes must be allowed.
2) When b2 comes back up, it must be the source for b1 and b3, instead of healing via automatic split-brain resolution and marking b2 as the bad copy.

Solution:
Make sure automatic split-brain resolution doesn't take effect on an AFR replica set when even one of the bricks is down.

Actual results:

Expected results:

Additional info:
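To make the scenario above easier to follow, here is a small toy model of the blame bookkeeping and of the proposed guard. It is only a sketch under simplifying assumptions: it is not the AFR code, and every name in it (`blames_after_writes`, `unblamed_sources`, `pick_source`, the `mtimes` argument) is hypothetical and exists purely for illustration. Each brick that receives a write is modeled as blaming every brick that missed it, and a brick counts as a good source only if no other reachable brick blames it.

```python
from typing import Dict, List, Optional, Set

BRICKS = ["b1", "b2", "b3"]


def blames_after_writes(writes: List[Set[str]]) -> Dict[str, Set[str]]:
    """For each write, every brick that received it blames the bricks that missed it."""
    blames: Dict[str, Set[str]] = {b: set() for b in BRICKS}
    for reached in writes:
        missed = set(BRICKS) - reached
        for brick in reached:
            blames[brick] |= missed
    return blames


def unblamed_sources(blames: Dict[str, Set[str]], up: List[str]) -> List[str]:
    """Bricks in `up` that no other brick in `up` blames."""
    return [b for b in up
            if not any(b in blames[other] for other in up if other != b)]


def pick_source(blames: Dict[str, Set[str]], up: List[str],
                mtimes: Optional[Dict[str, int]] = None) -> Optional[str]:
    """Return the brick to heal from, or None to signal an I/O error."""
    sources = unblamed_sources(blames, up)
    if sources:
        return sources[0]
    # Proposed guard: the split-brain seen by the reachable bricks may be
    # only apparent if any brick is down, so refuse to resolve it.
    if set(up) != set(BRICKS):
        return None
    # All bricks are up and all are blamed: a real split-brain.
    # Apply an mtime-style favorite child policy (illustrative only).
    if mtimes is None:
        return None
    return max(up, key=lambda b: mtimes[b])


# Step 4: c1 reaches b1 and b2, c2 reaches b2 and b3.
blames = blames_after_writes([{"b1", "b2"}, {"b2", "b3"}])
all_up = pick_source(blames, ["b1", "b2", "b3"])   # b2 is the only unblamed brick
b2_down = pick_source(blames, ["b1", "b3"])        # guard refuses to resolve
```

With all bricks reachable, the model picks b2 as the source, which matches the expected behavior. With b2 down, b1 and b3 blame each other, and the guard returns `None` instead of falling through to the favorite child policy.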
Upstream patch: https://review.gluster.org/#/c/16476/
Changing the title by removing the term "automatic", as it is possible to hit this with non-automatic, CLI-based split-brain resolution as well.
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/97384
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html