Procedure:

(1) Mount a volume with a file in split-brain state.
(2) DO NOT lookup/stat/etc. the file.
(3) Cat the file.

Result is that you'll actually get the contents of one replica (which one is based on the first-to-respond algorithm and thus effectively random). If you try to cat it again, you'll get EIO.

The reason is that the lookup for the first open kicks off a self-heal *in the background*. If the open happens before the self-heal is complete, it can succeed and reads can occur from the chosen replica. By the time of the second open, the self-heal has completed and marked the file as split-brain, so that one fails as it should.

This affects more than split-brain, actually. If a file merely needs self-heal, it's possible that the first cat will return data from the replica that's about to get stomped - if it was the first to respond but also the one that's missing updates (according to the other's changelog). We shouldn't be allowing the contents to be read in either case. As painful as it's going to be, we need to make that first open wait for the resolution of the self-heal that started with the first lookup.
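For illustration only, a minimal sketch of the ordering the last paragraph asks for: open must wait until the self-heal launched by the first lookup has delivered its verdict. All names here are hypothetical and the real AFR code is event-driven (STACK_WIND/STACK_UNWIND) rather than thread-blocking, so this shows the required serialization, not the implementation.

/* toy model: open() blocks until the background heal started by
 * lookup() has recorded whether the file is in split-brain */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct heal_state {
    pthread_mutex_t lock;
    pthread_cond_t  done;
    bool            in_progress;
    bool            split_brain;
};

/* lookup path: mark the heal as running before launching it */
static void heal_begin(struct heal_state *st)
{
    pthread_mutex_lock(&st->lock);
    st->in_progress = true;
    pthread_mutex_unlock(&st->lock);
}

/* self-heal completion: record the verdict, then wake any waiters */
static void heal_finish(struct heal_state *st, bool split_brain)
{
    pthread_mutex_lock(&st->lock);
    st->split_brain = split_brain;
    st->in_progress = false;
    pthread_cond_broadcast(&st->done);
    pthread_mutex_unlock(&st->lock);
}

/* open path: block until the verdict is in, then honour it */
static int open_may_proceed(struct heal_state *st)
{
    int ret;

    pthread_mutex_lock(&st->lock);
    while (st->in_progress)
        pthread_cond_wait(&st->done, &st->lock);
    ret = st->split_brain ? -1 : 0;   /* -1: caller unwinds with EIO */
    pthread_mutex_unlock(&st->lock);

    return ret;
}

static void *background_heal(void *arg)
{
    sleep(1);                         /* pretend the heal takes a while */
    heal_finish(arg, true);           /* verdict: file is in split-brain */
    return NULL;
}

int main(void)
{
    struct heal_state st = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, false
    };
    pthread_t heal;

    heal_begin(&st);                  /* "first lookup" kicks off a heal */
    pthread_create(&heal, NULL, background_heal, &st);

    /* with the wait in place, the very first open already sees EIO
     * instead of reading from a randomly chosen replica */
    printf("open: %s\n", open_may_proceed(&st) ? "EIO" : "ok");

    pthread_join(heal, NULL);
    return 0;
}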
http://review.gluster.org/4142 posted for review.
Also http://review.gluster.org/4059.
Jeff, I observed that a similar issue exists even after the split-brain is fixed: the first open after fixing the split-brain still fails, because the lookup is unwound before the split-brain flag is reset in the inode ctx. This behaviour has nothing to do with whether the file is a fresh lookup or not. I am sending a patch which sets the split-brain status before the self-heal unwinds the fop, to prevent the race.

Pranith
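To make that ordering concrete, a tiny self-contained sketch (hypothetical stand-in names, not the AFR source): the split-brain flag in the inode ctx has to be updated before the lookup is unwound, because the client can issue the next open the instant the unwind returns.

#include <stdbool.h>
#include <stdio.h>

struct inode_ctx { bool split_brain; };

/* stand-in for updating the AFR inode ctx */
static void set_split_brain(struct inode_ctx *ctx, bool value)
{
    ctx->split_brain = value;
}

/* stand-in for unwinding the lookup: once this returns, the client may
 * immediately open the file and consult the ctx */
static void lookup_unwind(const struct inode_ctx *ctx)
{
    printf("lookup unwound, split_brain=%d\n", ctx->split_brain);
}

int main(void)
{
    struct inode_ctx ctx = { .split_brain = true };

    /* after the xattrs are repaired, self-heal re-evaluates the file:
     * reset the flag FIRST, then unwind, so the very next open sees the
     * healthy state instead of the stale split-brain marker */
    set_split_brain(&ctx, false);
    lookup_unwind(&ctx);
    return 0;
}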
Sounds like you're seeing a different bug with a similar symptom, and should have reported it earlier.
Applied http://review.gluster.org/#change,4198 which seems to fit your description (but was associated with another bug so I'm not sure). The simple test described above still fails. With 4059+4142 (above) it passes reliably. The bug you're fixing might be real, but it's not this one.
I am beginning to believe they could be different. Sorry for the confusion :-(. I tried re-creating the issue myself, but failed to reproduce it on master. I think I misunderstood the steps in the bug. It is failing with EIO for me. Could you tell me where I went wrong in the following steps:

[root@pranithk-laptop ~]# getfattr -d -m . -e hex /gfs/r2_?/a
getfattr: Removing leading '/' from absolute path names
# file: gfs/r2_0/a
trusted.afr.r2-client-0=0x000000000000000000000000
trusted.afr.r2-client-1=0x000000210000000000000000
trusted.gfid=0x5402607ab48147cb95b1a696bafc1234

# file: gfs/r2_1/a
trusted.afr.r2-client-0=0x000000210000000000000000
trusted.afr.r2-client-1=0x000000000000000000000000
trusted.gfid=0x5402607ab48147cb95b1a696bafc1234

[root@pranithk-laptop ~]# mount | grep "/mnt/r2"
[root@pranithk-laptop ~]# mount -t glusterfs pranithk-laptop:/r2 /mnt/r2
[root@pranithk-laptop ~]# cd /mnt/r2
[root@pranithk-laptop r2]# cat a
cat: a: Input/output error
[root@pranithk-laptop r2]# gluster volume info r2

Volume Name: r2
Type: Replicate
Volume ID: 2d92a896-beaf-465f-b83c-f80508b47a8f
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: pranithk-laptop:/gfs/r2_0
Brick2: pranithk-laptop:/gfs/r2_1
Options Reconfigured:
cluster.self-heal-daemon: off
diagnostics.brick-log-level: DEBUG
diagnostics.client-log-level: DEBUG

Thanks in advance for the help.
Pranith.
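For readers of the getfattr output above: each trusted.afr.<volname>-client-<N> value is 12 bytes, read as three network-byte-order 32-bit counters of pending data, metadata, and entry operations. Here both bricks carry a non-zero data counter (0x21 = 33) blaming the other, which is exactly what AFR reports as a data split-brain. A small illustrative decoder along those lines (not part of GlusterFS, names invented):

/* decode an AFR changelog xattr: 12 bytes = three big-endian uint32
 * counters for pending data, metadata, and entry operations */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    unsigned char buf[12];
    uint32_t data, metadata, entry;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <file-on-brick> <xattr-name>\n", argv[0]);
        return 1;
    }

    if (getxattr(argv[1], argv[2], buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror("getxattr");
        return 1;
    }

    memcpy(&data,     buf,     4);
    memcpy(&metadata, buf + 4, 4);
    memcpy(&entry,    buf + 8, 4);

    printf("%s on %s: pending data=%u metadata=%u entry=%u\n",
           argv[2], argv[1], ntohl(data), ntohl(metadata), ntohl(entry));
    return 0;
}

Run against the bricks above (e.g. on /gfs/r2_0/a with trusted.afr.r2-client-1) it would report data=33; the mirror-image counters on the two bricks are the split-brain signature.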
Current patches are Pranith's http://review.gluster.org/4198 (in AFR) and my http://review.gluster.org/4215 (in AFR and quick-read).
REVIEW: http://review.gluster.org/6578 (cluster/afr: avoid race due to afr_is_transaction_running()) posted (#1) for review on master by Ravishankar N (ravishankar)
COMMIT: http://review.gluster.org/6578 committed in master by Vijay Bellur (vbellur)
------
commit f9698036fcc1ceedea19110139400d0cf4a54c9a
Author: Ravishankar N <ravishankar>
Date:   Mon Dec 23 09:32:22 2013 +0000

    cluster/afr: avoid race due to afr_is_transaction_running()

    Problem:
    ------------------------------------------
    afr_lookup_perform_self_heal() {
            if (afr_is_transaction_running())
                    goto out;
            else
                    afr_launch_self_heal();
    }
    ------------------------------------------
    When 2 clients simultaneously access a file in split-brain, one of them
    acquires the inode lock and proceeds with afr_launch_self_heal() (which
    eventually fails and sets "sh-failed" in the callback). The second
    client meanwhile bails out of afr_lookup_perform_self_heal() because
    afr_is_transaction_running() returns true due to the lock obtained by
    client-1. Consequently, in client-2, "sh-failed" does not get set in the
    dict, causing the quick-read translator to *not* invalidate the inode,
    thereby serving data randomly from one of the bricks.

    Fix: If a possible split-brain is detected on lookup, forcefully
    traverse the afr_launch_self_heal() code path in
    afr_lookup_perform_self_heal().

    Change-Id: I316f9f282543533fd3c958e4b63ecada42c2a14f
    BUG: 870565
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/6578
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Varun Shastry <vshastry>
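A compile-able toy illustrating the decision change described in the commit (all names invented for illustration; the real logic lives in afr_lookup_perform_self_heal in the AFR translator):

#include <stdbool.h>
#include <stdio.h>

static bool transaction_running  = true;  /* client-1 already holds the lock */
static bool possible_split_brain = true;  /* lookup saw both bricks blaming each other */

static void launch_self_heal(void)
{
    /* the heal fails for split-brain, but its callback sets "sh-failed",
     * which makes quick-read invalidate its cached inode */
    printf("self-heal launched; sh-failed will be set on failure\n");
}

int main(void)
{
    /* Old behaviour: bail out whenever a transaction is running, so
     * client-2 never sets sh-failed and quick-read keeps serving
     * possibly stale cached data:
     *     if (transaction_running) return 0;
     */

    /* Fixed behaviour: a suspected split-brain always takes the
     * self-heal path, even if another client already holds the lock */
    if (transaction_running && !possible_split_brain)
        return 0;

    launch_self_heal();
    return 0;
}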
A beta release for GlusterFS 3.6.0 has been released [1]. Please verify whether this release solves the issue reported in this bug. In case the glusterfs-3.6.0beta1 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED. Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update (possibly an "updates-testing" repository) infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report. glusterfs-3.6.1 has been announced [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users