Bug 1378547

Summary: Asynchronous Unsplit-brain still causes Input/Output Error on system calls
Product: [Community] GlusterFS
Component: replicate
Version: 3.8
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Simon Turcotte-Langevin <simon.turcotte-langevin>
Assignee: Ravishankar N <ravishankar>
CC: bugs, pkarampu, ravishankar, simon.turcotte-langevin
Keywords: Triaged
Fixed In Version: glusterfs-3.8.8
Clones: 1386188, 1387501 (view as bug list)
Bug Blocks: 1386188, 1387501, 1403121, 1403577
Type: Bug
Last Closed: 2017-01-16 12:26:06 UTC

Description Simon Turcotte-Langevin 2016-09-22 17:33:05 UTC
Description of problem:

The unsplit-brain mechanism is triggered along with the self-healing mechanism. Since the self-healing mechanism is asynchronous, the unsplit-brain mechanism is as well. Therefore, even though the split-brain is eventually resolved, every system call made before that happens fails with an Input/Output Error. This pushes the responsibility back onto the client application, which has to retry the system call, which in turn wastes resources.
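
For example, until the heal completes, a client application is reduced to a retry loop like this (a sketch, using the mount point and file name from the steps below):

    until cat /vol/test 2>/dev/null; do sleep 1; done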

The self-heal mechanism should remain asynchronous, but the right version according to the favorite child policy should be resolved synchronously, so that the Input/Output Error never reaches the caller.

Version-Release number of selected component (if applicable):
3.8.4-1

How reproducible:
Create a split-brained file and observe that the first read always fails with an Input/Output Error.

Steps to Reproduce:
1. Set cluster.entry-self-heal, cluster.data-self-heal and cluster.metadata-self-heal to on, and cluster.favorite-child-policy to mtime (commands sketched after this list)
2. Create a split-brained file
3. Cat the split-brained file -> Ensure that an Input/Output Error is raised
4. Cat the file again ~1sec later -> Ensure that the file was healed
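
For reference, step 1 maps to these commands (a sketch, assuming a volume named vol):

    gluster volume set vol cluster.entry-self-heal on
    gluster volume set vol cluster.data-self-heal on
    gluster volume set vol cluster.metadata-self-heal on
    gluster volume set vol cluster.favorite-child-policy mtime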

Actual results:
[root@host vol]# cat test
cat: test: Input/output error
[root@host vol]# cat test
[root@host vol]#

Expected results:
[root@host vol]# cat test
[root@host vol]#


Additional info:

Comment 1 Worker Ant 2016-12-09 07:18:04 UTC
REVIEW: http://review.gluster.org/16091 (afr: allow I/O when favorite-child-policy is enabled) posted (#1) for review on release-3.8 by Ravishankar N (ravishankar)

Comment 2 Ravishankar N 2016-12-12 09:04:09 UTC
Hello Simon, 

Would it be possible for you to test the patch (Comment #1) and see if you find any problems with it? The 3.8 maintainer is concerned about taking the patch in since it is relatively large: http://www.gluster.org/pipermail/maintainers/2016-December/001866.html

Thanks,
Ravi

Comment 3 Simon Turcotte-Langevin 2016-12-12 15:27:16 UTC
Hello Ravi,

Firstly, thank you very much for the efforts on this issue; it is much appreciated. We will be testing the patch on the latest 3.8 sources and will execute our benchmark to see if there are any issues. We will also test whether synchronous heals happen as expected.

Thanks,
Simon

Comment 4 Simon Turcotte-Langevin 2016-12-13 19:01:56 UTC
Hello again Ravi,

We're currently testing 3.8.5-1 with the patch you've given us applied to it. When all self-heals are on, the file is un-split-brained synchronously, as expected.

However, if self-heal is set to off and the favorite child policy is set, a deadlock occurs.

Steps to reproduce:

 1) gluster volume set vol cluster.entry-self-heal off
    gluster volume set vol cluster.data-self-heal off
    gluster volume set vol cluster.metadata-self-heal off
    gluster volume set vol cluster.favorite-child-policy mtime
 2) node 1:
    setfattr --name=trusted.afr.gv0-client-0 --value=0x100000000000000000000000 /brick/test
    setfattr --name=trusted.afr.gv0-client-1 --value=0x000000000000000000000000 /brick/test
    setfattr --name=trusted.afr.gv0-client-2 --value=0x000000000000000000000000 /brick/test
 3) node 2:
    setfattr --name=trusted.afr.gv0-client-0 --value=0x000000000000000000000000 /brick/test
    setfattr --name=trusted.afr.gv0-client-1 --value=0x010000000000000000000000 /brick/test
    setfattr --name=trusted.afr.gv0-client-2 --value=0x000000000000000000000000 /brick/test
 4) node 3:
    setfattr --name=trusted.afr.gv0-client-0 --value=0x000000000000000000000000 /brick/test
    setfattr --name=trusted.afr.gv0-client-1 --value=0x000000000000000000000000 /brick/test
    setfattr --name=trusted.afr.gv0-client-2 --value=0x001000000000000000000000 /brick/test
 5) cat /vol/test # Never resolves.

Expected:

 a) The unsplit-brain mechanism triggers and returns the right version, or;
 b) The unsplit-brain mechanism does not trigger because self-heal is toggled off.


For our use case, a) is preferred if possible. If that is not feasible, b) should be honored.

Thanks,
Simon

Comment 5 Ravishankar N 2016-12-14 06:17:57 UTC
Hi Simon, Thanks a lot for testing!

While the steps you described do cause a hang due to an infinite inode-refresh loop, the xattr values you set on the back end are not a valid scenario.
You have set them in such a way that each brick blames itself (i.e. trusted.afr.gv0-client-0 on the 1st brick, trusted.afr.gv0-client-1 on the 2nd brick, etc.). This is not possible in AFR-v2 (i.e. glusterfs-3.6 onwards), where each brick can carry xattrs blaming only the other bricks when some I/O fails.

You could retest by setting something like this:

(1)
1st brick: set trusted.afr.gv0-client-1 and trusted.afr.gv0-client-2
2nd brick: set trusted.afr.gv0-client-0 and trusted.afr.gv0-client-2
3rd brick: set trusted.afr.gv0-client-0 and trusted.afr.gv0-client-1.

Then things should work. 
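
For example (a sketch reusing the setfattr syntax from comment #4; the pending-counter values are illustrative):

    node 1: setfattr --name=trusted.afr.gv0-client-1 --value=0x000000010000000000000000 /brick/test
            setfattr --name=trusted.afr.gv0-client-2 --value=0x000000010000000000000000 /brick/test
    node 2: setfattr --name=trusted.afr.gv0-client-0 --value=0x000000010000000000000000 /brick/test
            setfattr --name=trusted.afr.gv0-client-2 --value=0x000000010000000000000000 /brick/test
    node 3: setfattr --name=trusted.afr.gv0-client-0 --value=0x000000010000000000000000 /brick/test
            setfattr --name=trusted.afr.gv0-client-1 --value=0x000000010000000000000000 /brick/test

You can verify the xattrs on each node with:

    getfattr -d -m . -e hex /brick/test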

(2) Alternatively, you can also bring bricks up/down while I/O is going on. But for replica 3 it is difficult to cause a split-brain with the up/down method (it works fine for replica 2).
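
A rough sketch of the up/down method for replica 2 (assuming the client-side self-heals are off as in comment #4, and disabling the self-heal daemon so no heal slips in between the steps):

    gluster volume set vol cluster.self-heal-daemon off
    gluster volume status vol          # note the brick PIDs
    kill -9 <pid-of-brick-1>           # take the first brick down
    echo version-a > /vol/test         # this write lands only on brick 2
    gluster volume start vol force     # bring brick 1 back up
    kill -9 <pid-of-brick-2>           # take the second brick down before any heal
    echo version-b > /vol/test         # this write lands only on brick 1
    gluster volume start vol force     # the two bricks now blame each other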

I'm leaving a need-info for you to test and see if you find any issues with approaches (1) or (2).


Also, if you are able to hit the state where each brick blames itself (like in comment #4) without manually setting the xattrs, please raise a bug for it.

Thanks again,
Ravi

Comment 6 Simon Turcotte-Langevin 2016-12-14 18:42:01 UTC
Hello Ravishankar,

This is very interesting indeed; thanks for the insight into how the mechanism works under the hood. I've tested it with the right xattrs, and it works perfectly, without needing the self-healing options. This is exactly the behaviour we were looking for.

As for the tests, I've sent you an email detailing the issues we've found so far. However, the backport patch for this fix does not seem to introduce any additional issues on 3.8.

Thanks,
Simon

Comment 7 Niels de Vos 2016-12-15 05:06:12 UTC
Many thanks for the testing, Simon! I've marked this to be included in the next (3.8.8) minor release. The release is targeted for ~10th of January.

Comment 8 Worker Ant 2017-01-04 11:53:51 UTC
REVIEW: http://review.gluster.org/16091 (afr: allow I/O when favorite-child-policy is enabled) posted (#2) for review on release-3.8 by Ravishankar N (ravishankar)

Comment 9 Worker Ant 2017-01-04 11:53:55 UTC
REVIEW: http://review.gluster.org/16322 (afr: Ignore event_generation checks post inode refresh for write txns) posted (#1) for review on release-3.8 by Ravishankar N (ravishankar)

Comment 10 Worker Ant 2017-01-08 10:48:53 UTC
COMMIT: http://review.gluster.org/16091 committed in release-3.8 by Niels de Vos (ndevos) 
------
commit c539e23023abe743770287439ebe81989a732728
Author: Ravishankar N <ravishankar>
Date:   Fri Dec 9 07:14:17 2016 +0000

    afr: allow I/O when favorite-child-policy is enabled
    
    Problem:
    Currently, I/O on a split-brained file fails even when the
    favorite-child-policy is set until the self-heal is complete.
    
    Fix:
    If a valid 'source' is found using the set favorite-child-policy, inspect
    and reset the afr pending xattrs on the 'sinks' (inside appropriate locks),
    refresh the inode and then proceed with the read or write transaction.
    
    The resetting itself happens in the self-heal code and hence can also
    happen in the client side background-heal or by the shd's index-heal in
    addition to the txn code path explained above. When it happens in via
    heal, we also add checks in undo-pending to not reset the sink xattrs
    again.
    
    > Reviewed-on: http://review.gluster.org/15673
    > Tested-by: Pranith Kumar Karampuri <pkarampu>
    > Smoke: Gluster Build System <jenkins.org>
    > Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    
    Change-Id: Ic8c1317720cb26bd114b6fe6af4e58c73b864626
    BUG: 1378547
    Signed-off-by: Ravishankar N <ravishankar>
    Reported-by: Simon Turcotte-Langevin <simon.turcotte-langevin>
    Reviewed-on: http://review.gluster.org/16091
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>

Comment 11 Worker Ant 2017-01-08 10:50:44 UTC
REVIEW: http://review.gluster.org/16322 (afr: Ignore event_generation checks post inode refresh for write txns) posted (#2) for review on release-3.8 by Niels de Vos (ndevos)

Comment 12 Worker Ant 2017-01-08 11:15:32 UTC
COMMIT: http://review.gluster.org/16322 committed in release-3.8 by Niels de Vos (ndevos) 
------
commit 268a1c1100ca661095d5606d0248e038bdbefd49
Author: Ravishankar N <ravishankar>
Date:   Wed Jan 4 17:21:35 2017 +0530

    afr: Ignore event_generation checks post inode refresh for write txns
    
    Backport of http://review.gluster.org/#/c/16205/
    
    Before http://review.gluster.org/#/c/16091/, after inode refresh, we
    failed read txns in case of EIO or event_generation being zero. For
    write transactions, the check was only for EIO. 16091 re-factored the
    code to fail both read and write when event_generation=0. This seems to
    have caused a regression as explained in the BZ.
    
    This patch restores that behaviour in afr_txn_refresh_done().
    
    Change-Id: Id763ed2d420b6d045d4505893a18959d998c91a3
    BUG: 1378547
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/16322
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>
    Smoke: Gluster Build System <jenkins.org>

Comment 13 Niels de Vos 2017-01-16 12:26:06 UTC
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.8, please open a new bug report.

glusterfs-3.8.8 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2017-January/000064.html
[2] https://www.gluster.org/pipermail/gluster-users/