Bug 870565 - Open can succeed on split-brain files due to race

| Field | Value | Field | Value |
|---|---|---|---|
| Product | [Community] GlusterFS | Reporter | Jeff Darcy <jdarcy> |
| Component | replicate | Assignee | Jeff Darcy <jdarcy> |
| Status | CLOSED CURRENTRELEASE | QA Contact | |
| Severity | urgent | Docs Contact | |
| Priority | urgent | | |
| Version | mainline | CC | bugs, gluster-bugs, pkarampu |
| Target Milestone | --- | | |
| Target Release | --- | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | | | |
| Fixed In Version | glusterfs-3.6.0beta1 | Doc Type | Bug Fix |
| Doc Text | | Story Points | --- |
| Clone Of | | | |
| | 972021 (view as bug list) | Environment | |
| Last Closed | 2014-11-11 08:23:27 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 972021 | | |
Description
Jeff Darcy 2012-10-26 21:21:12 UTC
http://review.gluster.org/4142 posted for review.

Pranith Kumar Karampuri:

Jeff, I observed that a similar issue exists even after the split-brain is fixed: the first open after fixing the split-brain still fails, because the unwind of the lookup happens before the split-brain status is reset in the inode ctx. This behaviour has nothing to do with whether the file is a fresh lookup or not. I am sending a patch that sets the split-brain status before the self-heal unwinds the fop, to prevent the race.

Pranith

Jeff Darcy:

Sounds like you're seeing a different bug with a similar symptom, and should have reported it earlier. Applied http://review.gluster.org/#change,4198, which seems to fit your description (but it was associated with another bug, so I'm not sure). The simple test described above still fails. With 4059+4142 (above) it passes reliably. The bug you're fixing might be real, but it's not this one.

Pranith Kumar Karampuri:

I am beginning to believe they could be different. Sorry for the confusion :-(. I tried re-creating the issue myself, but failed to do so on master. I think I misunderstood the steps in the bug. It is failing with EIO for me. Could you tell me where I went wrong in the following steps:

[root@pranithk-laptop ~]# getfattr -d -m . -e hex /gfs/r2_?/a
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/a
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000210000000000000000
trusted.gfid=0x5402607ab48147cb95b1a696bafc1234

# file: gfs/r2_1/a
trusted.afr.r2-client-0=0x000000210000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x5402607ab48147cb95b1a696bafc1234

[root@pranithk-laptop ~]# mount | grep "/mnt/r2"
[root@pranithk-laptop ~]# mount -t glusterfs pranithk-laptop:/r2 /mnt/r2
[root@pranithk-laptop ~]# cd /mnt/r2
[root@pranithk-laptop r2]# cat a
cat: a: Input/output error

[root@pranithk-laptop r2]# gluster volume info r2

Volume Name: r2
Type: Replicate
Volume ID: 2d92a896-beaf-465f-b83c-f80508b47a8f
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: pranithk-laptop:/gfs/r2_0
Brick2: pranithk-laptop:/gfs/r2_1
Options Reconfigured:
cluster.self-heal-daemon: off
diagnostics.brick-log-level: DEBUG
diagnostics.client-log-level: DEBUG

Thanks in advance for the help.
Pranith.
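For context on the transcript above: the trusted.afr.r2-client-N xattrs are AFR's pending-operation changelog, and the first four bytes of each value count data operations still pending on the named brick (0x21 here). Each brick records pending operations against the other while showing none against itself, i.e. the bricks accuse each other, which is the data split-brain pattern that makes the mount return EIO for the file. A minimal sketch of that mutual-accusation check, with assumed names (not the actual AFR source):

```c
/*
 * Minimal sketch of the data split-brain pattern shown by the getfattr
 * output above.  All names here are assumptions for illustration, not the
 * actual GlusterFS/AFR code.  pending[i][j] is the first (data) counter of
 * trusted.afr.<vol>-client-j as stored on brick i; a non-zero value means
 * "brick i has writes recorded as still pending on brick j".
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool
is_data_split_brain(const uint32_t pending[2][2])
{
        /* Brick 0 accuses brick 1 while considering itself clean ... */
        bool b0_accuses_b1 = pending[0][1] != 0 && pending[0][0] == 0;
        /* ... and brick 1 accuses brick 0 while considering itself clean. */
        bool b1_accuses_b0 = pending[1][0] != 0 && pending[1][1] == 0;

        /* Mutual accusation means there is no agreed-upon good copy, so
         * reads must fail with EIO rather than pick a copy at random. */
        return b0_accuses_b1 && b1_accuses_b0;
}

int
main(void)
{
        /* Counters taken from the transcript: 0x21 pending data ops each. */
        const uint32_t pending[2][2] = {
                { 0x00, 0x21 },   /* xattrs read on brick gfs/r2_0 */
                { 0x21, 0x00 },   /* xattrs read on brick gfs/r2_1 */
        };

        printf("split-brain: %s\n",
               is_data_split_brain(pending) ? "yes (open/read gets EIO)" : "no");
        return 0;
}
```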
Jeff Darcy:

Current patches are Pranith's http://review.gluster.org/4198 (in AFR) and my http://review.gluster.org/4215 (in AFR and quick-read).

REVIEW: http://review.gluster.org/6578 (cluster/afr: avoid race due to afr_is_transaction_running()) posted (#1) for review on master by Ravishankar N (ravishankar)

COMMIT: http://review.gluster.org/6578 committed in master by Vijay Bellur (vbellur)

------

commit f9698036fcc1ceedea19110139400d0cf4a54c9a
Author: Ravishankar N <ravishankar>
Date:   Mon Dec 23 09:32:22 2013 +0000

    cluster/afr: avoid race due to afr_is_transaction_running()

    Problem:
    ------------------------------------------
    afr_lookup_perform_self_heal() {
        if (afr_is_transaction_running())
            goto out;
        else
            afr_launch_self_heal();
    }
    ------------------------------------------
    When 2 clients simultaneously access a file in split-brain, one of them
    acquires the inode lock and proceeds with afr_launch_self_heal() (which
    eventually fails and sets "sh-failed" in the callback). The second client
    meanwhile bails out of afr_lookup_perform_self_heal() because
    afr_is_transaction_running() returns true due to the lock obtained by
    client-1. Consequently, in client-2 "sh-failed" does not get set in the
    dict, causing the quick-read translator to *not* invalidate the inode,
    thereby serving data randomly from one of the bricks.

    Fix:
    If a possible split-brain is detected on lookup, forcefully traverse the
    afr_launch_self_heal() code path in afr_lookup_perform_self_heal().

    Change-Id: I316f9f282543533fd3c958e4b63ecada42c2a14f
    BUG: 870565
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/6578
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Varun Shastry <vshastry>

(A simplified sketch of this condition change is included at the end of this report.)

A beta release for GlusterFS 3.6.0 has been released [1]. Please verify if the release solves this bug report for you. In case the glusterfs-3.6.0beta1 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update (possibly an "updates-testing" repository) infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users
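Referring back to the commit above: a minimal sketch of the condition change it describes for afr_lookup_perform_self_heal(), with assumed names and types (not the actual GlusterFS code). The point is that a possible split-brain now forces the self-heal launch path even while another client's transaction holds the inode lock:

```c
/*
 * Sketch only: assumed names and types, not the actual GlusterFS source.
 * What matters is the changed condition, not the surrounding machinery.
 */
#include <stdbool.h>

struct afr_lookup_state {
        bool transaction_running;    /* another client holds the inode lock      */
        bool possible_split_brain;   /* conflicting pending xattrs seen in lookup */
};

/* Before the fix: any running transaction made lookup skip self-heal, so the
 * second client never saw "sh-failed" and quick-read kept serving the inode. */
static bool
should_launch_self_heal_before_fix(const struct afr_lookup_state *s)
{
        return !s->transaction_running;
}

/* After the fix: a possible split-brain forcefully takes the
 * afr_launch_self_heal() path, so its failure is reported on every client
 * and quick-read invalidates the inode instead of serving random data. */
static bool
should_launch_self_heal_after_fix(const struct afr_lookup_state *s)
{
        return s->possible_split_brain || !s->transaction_running;
}

int
main(void)
{
        /* Client 2 during the race: client 1 already holds the inode lock. */
        struct afr_lookup_state s = { .transaction_running = true,
                                      .possible_split_brain = true };

        /* Old condition skips self-heal; new condition launches it. */
        return (should_launch_self_heal_after_fix(&s) &&
                !should_launch_self_heal_before_fix(&s)) ? 0 : 1;
}
```

The self-heal of a genuine split-brain still fails; what changes is that the failure is now reported on both clients, so quick-read invalidates the inode instead of serving data from a random brick.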