Bug 1387501

Summary:	Asynchronous Unsplit-brain still causes Input/Output Error on system calls
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Ravishankar N <ravishankar>
Component:	replicate	Assignee:	Ravishankar N <ravishankar>
Status:	CLOSED ERRATA	QA Contact:	Nag Pavan Chilakam <nchilaka>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.2	CC:	amukherj, pkarampu, rhinduja, rhs-bugs, storage-qa-internal
Target Milestone:	---	Keywords:	Triaged
Target Release:	RHGS 3.2.0
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	glusterfs-3.8.4-6	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1378547	Environment:
Last Closed:	2017-03-23 06:13:44 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1378547
Bug Blocks:	1351528, 1386188, 1403121

Description Ravishankar N 2016-10-21 06:09:28 UTC

+++ This bug was initially created as a clone of Bug #1378547 +++

Description of problem:

The unsplit-brain mechanism is triggered along with the self-healing mechanism. Since the self-healing mechanism is asynchronous, so is the unsplit-brain mechanism. Therefore, even though the split-brain is eventually resolved, every system call made before that happens fails with an Input/Output Error (EIO). This pushes the responsibility back to the client application, which must retry the system call, which in turn wastes resources.
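
To illustrate the burden this places on clients, here is a minimal sketch of the retry logic an application would need while the heal is still pending. The function name `read_with_retry`, its parameters, and the retry counts are hypothetical and chosen only for illustration; nothing here is part of GlusterFS.

```python
import errno
import time


def read_with_retry(path, attempts=3, delay=1.0, opener=open, sleep=time.sleep):
    """Read a file, retrying only when the failure is EIO.

    Hypothetical client-side workaround: `attempts` and `delay` are
    illustrative values, not recommendations from the bug report.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(path, "rb") as handle:
                return handle.read()
        except OSError as exc:
            # Re-raise anything that is not EIO, and give up on the last try.
            if exc.errno != errno.EIO or attempt == attempts:
                raise
            # Wait for the asynchronous heal to finish before retrying.
            sleep(delay)
```

Each retry costs an extra round of system calls plus the delay, which is the wasted work described above.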

The self-heal mechanism should remain asynchronous, but the correct version according to the favorite child policy should be resolved synchronously, so that the Input/Output Error does not occur.
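
As a rough illustration of what resolving the policy synchronously means for the mtime policy, the sketch below picks the copy with the latest modification time before the read is served. The `ReplicaCopy` type and `pick_source_by_mtime` function are hypothetical names for this example and do not reflect the actual replicate translator code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaCopy:
    brick: str    # hypothetical brick label, e.g. "brick0"
    mtime: float  # modification time in seconds since the epoch


def pick_source_by_mtime(copies):
    """Return the copy with the latest mtime (assumed reading of the mtime policy)."""
    if not copies:
        raise ValueError("no copies to choose from")
    # max() keeps the first of several copies that share the latest mtime.
    return max(copies, key=lambda copy: copy.mtime)
```

With the source chosen up front, the read can be served from that copy while the actual data heal continues in the background.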

Version-Release number of selected component (if applicable):
3.8.4-1

How reproducible:
Create a split-brained file and assert that the first read still always causes an Input/Output Error.

Steps to Reproduce:
1. Set cluster.entry-self-heal to on, cluster.data-self-heal to on, cluster.metadata-self-heal to on and cluster.favorite-child-policy to mtime
2. Create a split-brained file
3. Cat the split-brained file -> Ensure that an Input/Output Error is raised
4. Cat the file again ~1sec later -> Ensure that the file was healed
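
Step 1 can be scripted. The sketch below builds the corresponding `gluster volume set` commands for the four options listed above. The volume name passed in is a placeholder, and the helper names `build_volume_set_commands` and `apply_options` are hypothetical.

```python
import subprocess

# Options and values taken from step 1 of the reproduction steps.
REPRO_OPTIONS = {
    "cluster.entry-self-heal": "on",
    "cluster.data-self-heal": "on",
    "cluster.metadata-self-heal": "on",
    "cluster.favorite-child-policy": "mtime",
}


def build_volume_set_commands(volume, options=REPRO_OPTIONS):
    """Return one `gluster volume set` argv list per option."""
    return [
        ["gluster", "volume", "set", volume, key, value]
        for key, value in options.items()
    ]


def apply_options(volume, options=REPRO_OPTIONS, runner=subprocess.run):
    """Run each command; `runner` is injectable so the logic can be dry-run."""
    for command in build_volume_set_commands(volume, options):
        runner(command, check=True)
```

For example, `apply_options("testvol")` would apply all four settings to a volume named `testvol` (an assumed name) on a host where the gluster CLI is available.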

Actual results:
[root@host vol]# cat test
cat: test: Input/output error
[root@host vol]# cat test
[root@host vol]#

Expected results:
[root@host vol]# cat test
[root@host vol]#


Additional info:

Comment 4 Ravishankar N 2016-11-08 01:10:37 UTC

Upstream patch: http://review.gluster.org/#/c/15673/4

Comment 5 Ravishankar N 2016-11-28 08:49:28 UTC

Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/91354/

Comment 7 Nag Pavan Chilakam 2017-01-16 10:35:50 UTC

on_qa verification:



steps run:
1. Set cluster.entry-self-heal to on, cluster.data-self-heal to on, cluster.metadata-self-heal to on and cluster.favorite-child-policy to mtime
2. Create a split-brained file
3. Cat the split-brained file -> Ensure that an Input/Output Error is raised
4. Cat the file again ~1sec later -> Ensure that the file was healed

The files are healed based on the latest mtime, hence moving to verified.

Test version: 3.8.4-11

Comment 9 errata-xmlrpc 2017-03-23 06:13:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html