Bug 830134

Summary: NFS Mount doesn't report "I/O Error" when a file is in split-brain state
Product: [Community] GlusterFS
Reporter: Shwetha Panduranga <shwetha.h.panduranga>
Component: replicate
Assignee: Jeff Darcy <jdarcy>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.3-beta
CC: gluster-bugs, jdarcy, rfortier
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 853682 (view as bug list)
Environment:
Last Closed: 2013-07-24 17:55:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 853682, 855913, 858497

Description Shwetha Panduranga 2012-06-08 10:31:32 UTC
Description of problem:
-----------------------
In a replicate volume, when a file is in split-brain state, cat on that file from the NFS mount should report "I/O Error". Instead, the read succeeds and returns the file contents.

Version-Release number of selected component (if applicable):
------------------------------------------------------------
3.3.0qa45

How reproducible:
----------------
Often

Steps to Reproduce:
---------------------
1. Create a replicate volume (1x2: brick1 and brick2).
2. Set self-heal-daemon off for the volume.
3. Start the volume.
4. Create an NFS mount.
5. Create a directory <testdir> from the NFS mount.
6. Create a file <testdir/file> from the NFS mount.
7. Bring down "brick1".
8. From the NFS mount execute: echo "TestCase: Test Split-Brain. Brick1 is down now" > testdir/file
9. Bring back the brick "brick1".
10. Bring down "brick2".
11. From the NFS mount execute: echo "TestCase: Test Split-Brain. Brick2 is down now" > testdir/file
12. Bring back the brick "brick2".
13. From the mount execute: cat testdir/file
  
Actual results:
----------------
[06/08/12 - 21:18:14 root@APP-CLIENT1 ~]# cd /mnt/nfsc1; cat testdir/file
TestCase: Test Split-Brain. Brick2 is down now


Expected results:
-------------------
Should report I/O Error
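
A minimal sketch of how the expected result could be checked from the client, assuming Python is available there. The helper name read_reports_eio is hypothetical and not part of GlusterFS; it only wraps a plain read and looks for errno EIO, which is what "I/O Error" from cat corresponds to.

```python
import errno


def read_reports_eio(path):
    """Return True if reading `path` fails with EIO, False if the read succeeds.

    Any other error (missing file, permission denied, ...) is re-raised so it
    is not mistaken for either outcome.
    """
    try:
        with open(path, "rb") as f:
            f.read()
    except OSError as e:
        if e.errno == errno.EIO:
            return True
        raise
    return False
```

With the mount from this report, read_reports_eio("/mnt/nfsc1/testdir/file") would be expected to return True once the file is in split-brain; the behaviour reported here corresponds to it returning False.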

Additional info:
---------------

[06/08/12 - 21:13:48 root@APP-SERVER1 ~]# gluster v info
 
Volume Name: dstore
Type: Replicate
Volume ID: 03c2125d-c86a-45d3-abbe-7f83567d2d0b
Status: Created
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.2.35:/export_sdb/dir1
Brick2: 192.168.2.36:/export_sdb/dir1
Options Reconfigured:
cluster.self-heal-daemon: off

Brick1 data:-
-------------

[06/08/12 - 21:23:30 root@APP-SERVER1 ~]# getfattr -d -m . -e hex /export_sdb/dir1/testdir/file 
getfattr: Removing leading '/' from absolute path names
# file: export_sdb/dir1/testdir/file
trusted.afr.dstore-client-0=0x000000000000000000000000
trusted.afr.dstore-client-1=0x000000010000000100000000
trusted.gfid=0x910b72d06aa842efa8300b16df998741

[06/08/12 - 21:23:31 root@APP-SERVER1 ~]# getfattr -d -m . -e hex /export_sdb/dir1/testdir/
getfattr: Removing leading '/' from absolute path names
# file: export_sdb/dir1/testdir/
trusted.gfid=0x617d018b908042dbb16d14a0d084b224

[06/08/12 - 21:23:33 root@APP-SERVER1 ~]# getfattr -d -m . -e hex /export_sdb/dir1/
getfattr: Removing leading '/' from absolute path names
# file: export_sdb/dir1/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0x03c2125dc86a45d3abbe7f83567d2d0b

Brick2 data:-
------------

[06/08/12 - 21:24:01 root@APP-SERVER2 ~]# getfattr -d -m . -e hex /export_sdb/dir1/testdir/file 
getfattr: Removing leading '/' from absolute path names
# file: export_sdb/dir1/testdir/file
trusted.afr.dstore-client-0=0x000000010000000100000000
trusted.afr.dstore-client-1=0x000000000000000000000000
trusted.gfid=0x910b72d06aa842efa8300b16df998741

[06/08/12 - 21:24:01 root@APP-SERVER2 ~]# getfattr -d -m . -e hex /export_sdb/dir1/testdir/
getfattr: Removing leading '/' from absolute path names
# file: export_sdb/dir1/testdir/
trusted.gfid=0x617d018b908042dbb16d14a0d084b224

[06/08/12 - 21:24:03 root@APP-SERVER2 ~]# getfattr -d -m . -e hex /export_sdb/dir1/
getfattr: Removing leading '/' from absolute path names
# file: export_sdb/dir1/
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.volume-id=0x03c2125dc86a45d3abbe7f83567d2d0b
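
The trusted.afr.* values above can be decoded to confirm the split-brain. The sketch below assumes the usual AFR changelog layout of three big-endian 32-bit pending counters (data, metadata, entry), and that dstore-client-0 and dstore-client-1 refer to Brick1 and Brick2 respectively. The function names are hypothetical helpers, not GlusterFS APIs.

```python
import struct


def parse_afr_changelog(hex_value):
    """Decode a trusted.afr.* value into (data, metadata, entry) pending counts."""
    text = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    # Three unsigned 32-bit integers in network (big-endian) byte order.
    return struct.unpack(">III", raw)


def is_data_split_brain(brick1_xattrs, brick2_xattrs, volume="dstore"):
    """True if each brick records pending data operations against the other."""
    b1_on_b2 = parse_afr_changelog(brick1_xattrs["trusted.afr.%s-client-1" % volume])
    b2_on_b1 = parse_afr_changelog(brick2_xattrs["trusted.afr.%s-client-0" % volume])
    return b1_on_b2[0] > 0 and b2_on_b1[0] > 0


# Values copied from the getfattr output in this report.
brick1 = {
    "trusted.afr.dstore-client-0": "0x000000000000000000000000",
    "trusted.afr.dstore-client-1": "0x000000010000000100000000",
}
brick2 = {
    "trusted.afr.dstore-client-0": "0x000000010000000100000000",
    "trusted.afr.dstore-client-1": "0x000000000000000000000000",
}

print(parse_afr_changelog(brick1["trusted.afr.dstore-client-1"]))  # (1, 1, 0)
print(is_data_split_brain(brick1, brick2))  # True
```

Each brick holds a non-zero data and metadata counter against the other and zeros against itself, so neither copy can be chosen as the heal source.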

Note:- A subsequent rm of the file from the mount also succeeds
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Mount Output:-
------------
[06/08/12 - 21:24:22 root@APP-CLIENT1 testdir]# rm file 
rm: remove regular file `file'? y

Brick1 :-
-------

[06/08/12 - 21:24:37 root@APP-SERVER1 ~]# getfattr -d -m . -e hex /export_sdb/dir1/testdir/file
getfattr: /export_sdb/dir1/testdir/file: No such file or directory

[06/08/12 - 21:25:48 root@APP-SERVER1 ~]# getfattr -d -m . -e hex /export_sdb/dir1/testdir/
getfattr: Removing leading '/' from absolute path names
# file: export_sdb/dir1/testdir/
trusted.gfid=0x617d018b908042dbb16d14a0d084b224


Brick2:-
--------

[06/08/12 - 21:24:47 root@APP-SERVER2 ~]# getfattr -d -m . -e hex /export_sdb/dir1/testdir/file
getfattr: /export_sdb/dir1/testdir/file: No such file or directory

[06/08/12 - 21:25:52 root@APP-SERVER2 ~]# getfattr -d -m . -e hex /export_sdb/dir1/testdir/
getfattr: Removing leading '/' from absolute path names
# file: export_sdb/dir1/testdir/
trusted.gfid=0x617d018b908042dbb16d14a0d084b224

Comment 1 Krishna Srinivas 2012-09-10 12:30:23 UTC
Pranith, replicate was not returning EIO in a case like this (note that it is an anonymous fd read). Can you take a look?

Comment 2 Pranith Kumar K 2012-09-11 02:20:16 UTC
NFS does not perform lookups. AFR depends on the lookup fop to detect a split-brain and report it, so with NFS no EIOs are seen. This is a known issue.

Comment 3 Jeff Darcy 2012-10-09 15:32:03 UTC
Submitted http://review.gluster.org/4050 to bump mtime/ctime on getattr requests (which NFS uses to check cache freshness) and force a new lookup.  When the self-heal done as part of the lookup fails due to split brain or GFID mismatch, the NFS client gets EIO back.

Comment 4 Vijay Bellur 2012-10-19 14:16:44 UTC
CHANGE: http://review.gluster.org/4058 (nfs: do lookup on getattr after brick-status change) merged in master by Vijay Bellur (vbellur)

Comment 5 Jeff Darcy 2012-10-26 21:10:23 UTC
*** Bug 830121 has been marked as a duplicate of this bug. ***