1220347 – Read operation on a file which is in split-brain condition is successful

Keywords:
Status:	CLOSED DUPLICATE of bug 1229226
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	replicate
Sub Component:
Version:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1223758 1224709

Reported:	2015-05-11 11:54 UTC by Shruti Sampat
Modified:	2015-06-09 16:29 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Clones:	1223758 1224709
Environment:
Last Closed:	2015-06-09 16:29:05 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Shruti Sampat 2015-05-11 11:54:03 UTC

Description of problem:
------------------------

`cat' on a file that was in split-brain condition was successful. This should ideally fail with `Input/output error'.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.7.0beta1-0.69.git1a32479.el6.x86_64

How reproducible:
------------------
Always

Steps to Reproduce:
--------------------

1. Create a distributed-replicate volume and mount it via fuse.
2. Create a file `1' on the mount point -
# touch 1
3. Bring down one brick in the replica pair where `1' resides.
# kill -9 <pid-of-brick-process>
4. Write to the file -
# echo "Hello" > 1
5. Start the volume with the `force' option to bring the killed brick back up.
6. Bring down the other brick in the replica pair and write to the file again -
# echo "World" > 1
7. `cat' the file -
# cat 1
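
The steps above can be sketched as a small script. This is only an illustration, not something taken from the report: the volume name, the mount point and the two brick PIDs are placeholders, and the `run' helper and the DRY_RUN switch are hypothetical conveniences. The brick PIDs would have to be looked up by hand, for example from `gluster volume status'.

```shell
#!/bin/sh
# Sketch of the reproduction steps. VOL, MNT and the brick PIDs are
# placeholders (assumptions), not values from the report.
VOL="${VOL:-testvol}"
MNT="${MNT:-/mnt/testvol}"
DRY_RUN="${DRY_RUN:-1}"   # default: only print the commands

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

repro_split_brain() {
    pid_a="$1"   # PID of the first brick in the replica pair holding the file
    pid_b="$2"   # PID of the second brick in the same replica pair
    run touch "$MNT/1"
    run kill -9 "$pid_a"                   # step 3: first brick down
    run sh -c "echo Hello > '$MNT/1'"      # step 4: write lands on second brick only
    run gluster volume start "$VOL" force  # step 5: restart the killed brick
    run kill -9 "$pid_b"                   # step 6: second brick down
    run sh -c "echo World > '$MNT/1'"      #         write lands on first brick only
    run cat "$MNT/1"                       # step 7: expected to fail with EIO
}
```

With the default DRY_RUN=1, calling `repro_split_brain <pid-a> <pid-b>' only prints the commands; setting DRY_RUN=0 would execute them against a real test volume.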

Actual results:
----------------

# cat 1
World

Expected results:
------------------

`cat' should fail with `Input/output error'.

Additional info:
-----------------

The volume configuration -

# gluster volume info 2-test
 
Volume Name: 2-test
Type: Distributed-Replicate
Volume ID: 0e312bd3-0473-4fdc-ba2f-7df53b9e9683
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp37-126.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick2: dhcp37-123.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick3: dhcp37-98.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick4: dhcp37-54.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick5: dhcp37-210.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick6: dhcp37-59.lab.eng.blr.redhat.com:/rhs/brick4/b1
Brick7: dhcp37-126.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick8: dhcp37-123.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick9: dhcp37-98.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick10: dhcp37-54.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick11: dhcp37-210.lab.eng.blr.redhat.com:/rhs/brick5/b1
Brick12: dhcp37-59.lab.eng.blr.redhat.com:/rhs/brick5/b1
Options Reconfigured:
performance.readdir-ahead: on
cluster.self-heal-daemon: off
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
features.uss: enable
features.quota: on
performance.write-behind: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.quick-read: off
performance.open-behind: off
features.bitrot: on
features.scrub: Active
diagnostics.client-log-level: DEBUG

Comment 1 Ravishankar N 2015-05-11 12:44:51 UTC

Observations from debugging the setup.

When the mount process was debugged with gdb, it was observed that afr_lookup_done() calls afr_inode_read_subvol_reset(). Consequently, when afr_read_txn() and then afr_read_txn_refresh_done() are called, we bail out because there are no read subvols, and the client gets EIO.

When no gdb was attached, the client again began reading stale data. On further examination, it was observed that fuse sends the following FOPS when 'cat' was performed on the mount:

1) fuse_fop_resume-->fuse_lookup_resume
2) fuse_fop_resume-->fuse_open_resume
3) fuse_fop_resume-->fuse_getattr_resume-->afr_fstat-->afr_read_txn-->bail out with EIO.
4) fuse_fop_resume-->fuse_flush_resume

However, when 'cat' was run in rapid succession, (3) was not called, i.e., only fuse_lookup_resume, fuse_open_resume and fuse_flush_resume were called. Since fuse did not send the getattr, the client did not get the EIO and was served data from the kernel cache. The data returned was always the one written to the latest brick, "World" in this case.

I don't think we should hit the issue if we 1) perform a drop_caches on the existing mount, 2) do a remount, or 3) mount with the options attribute-timeout and entry-timeout set to zero to begin with.
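
For reference, the three options could look roughly like the commands below. This is a sketch under assumptions: the server name, volume name and mount point are placeholders, and the helper functions only print the commands rather than running them. The drop_caches and mount commands need root.

```shell
#!/bin/sh
# Sketch of the three workarounds; SERVER, VOL and MNT are placeholders.
SERVER="${SERVER:-server1}"
VOL="${VOL:-testvol}"
MNT="${MNT:-/mnt/testvol}"

# 1) Drop the kernel page, dentry and inode caches on the client.
drop_caches_cmd() {
    echo "sync; echo 3 > /proc/sys/vm/drop_caches"
}

# 2) Remount the volume.
remount_cmd() {
    echo "umount $MNT && mount -t glusterfs $SERVER:/$VOL $MNT"
}

# 3) Mount with attribute and entry caching disabled from the start.
nocache_mount_cmd() {
    echo "mount -t glusterfs -o attribute-timeout=0,entry-timeout=0 $SERVER:/$VOL $MNT"
}

drop_caches_cmd
remount_cmd
nocache_mount_cmd
```

Option 3 trades some performance for correctness, since with both timeouts at zero the kernel has to revalidate entries and attributes on every access instead of using cached values.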

Comment 2 Shruti Sampat 2015-05-11 15:06:39 UTC

> 
> I don't think we should hit the issue if we perform a 1) drop_caches on the
> existing mount, or 2) do a remount or 3)mount with the options 
> attribute-timeout and entry-timeout set to zero to begin with.

Tried each of the above 3 and did not hit the issue.

Comment 3 Raghavendra Talur 2015-05-19 13:45:32 UTC

Can this be closed now that it is proved to be the kernel cache in action? Or can
this be taken as a feature?

Ravi, I guess you can decide.

Comment 4 Ravishankar N 2015-05-19 14:09:32 UTC

Raghavendra G has suggested a fix where we set attribute-timeout to zero for files that are in split-brain, forcing fuse to send a fuse_getattr_resume(). I'll send a patch for it; let us see if it is acceptable. Keeping the bug open until then.

Comment 5 Jeff Darcy 2015-06-09 16:29:05 UTC

Closing this as a duplicate of 1229226 (instead of the other way around) because there's more discussion there.

*** This bug has been marked as a duplicate of bug 1229226 ***