1387501 – Asynchronous Unsplit-brain still causes Input/Output Error on system calls

Bug 1387501 - Asynchronous Unsplit-brain still causes Input/Output Error on system calls

Summary: Asynchronous Unsplit-brain still causes Input/Output Error on system calls

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	replicate
Sub Component:
Version:	rhgs-3.2
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.2.0
Assignee:	Ravishankar N
QA Contact:	Nag Pavan Chilakam
Docs Contact:
URL:
Whiteboard:
Depends On:	1378547
Blocks:	1351528 1386188 1403121
TreeView+	depends on / blocked

Reported:	2016-10-21 06:09 UTC by Ravishankar N
Modified:	2017-03-23 06:13 UTC (History)
CC List:	5 users (show)
Fixed In Version:	glusterfs-3.8.4-6
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1378547
Environment:
Last Closed:	2017-03-23 06:13:44 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:0486	0	normal	SHIPPED_LIVE	Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update	2017-03-23 09:18:45 UTC

Description Ravishankar N 2016-10-21 06:09:28 UTC

+++ This bug was initially created as a clone of Bug #1378547 +++

Description of problem:

The unsplit-brain mechanism is triggered along the self-healing mechanism. Since the self-healing mechanism is asynchronous, so is the unsplit-brain mechanism. Therefore, even tough the split-brain is resolved eventually, all system calls made before this happens causes an IOE to occur. This pushes the responsibility back to the client application, which needs to retry the system call, which in turn cause a waste of resources.

The self-heal mechanism should still be asynchronous, but the right version of the favorite child policy should be resolved synchronously to prevent the Input/Output exception to occur.

Version-Release number of selected component (if applicable):
3.8.4-1

How reproducible:
Create a split-brained file and assert that the first read still always causes an Input/Output Error.

Steps to Reproduce:
1. Set cluster.entry-self-heal to on, cluster.data-self-heal to on, cluster.metadata-self-heal to on and cluster.favorite-child-policy to mtime
2. Create a split-brained file
3. Cat the split-brained file -> Ensure that an Input/Output Error is raised
4. Cat the file again ~1sec later -> Ensure that the file was healed

Actual results:
[root@host vol]# cat test
cat: test: Input/output error
[root@host vol]# cat test
[root@host vol]#

Expected results:
[root@host vol]# cat test
[root@host vol]#


Additional info:

Comment 4 Ravishankar N 2016-11-08 01:10:37 UTC

Upstream patch: http://review.gluster.org/#/c/15673/4

Comment 5 Ravishankar N 2016-11-28 08:49:28 UTC

Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/91354/

Comment 7 Nag Pavan Chilakam 2017-01-16 10:35:50 UTC

on_qa verification:



steps run:
1. Set cluster.entry-self-heal to on, cluster.data-self-heal to on, cluster.metadata-self-heal to on and cluster.favorite-child-policy to mtime
2. Create a split-brained file
3. Cat the split-brained file -> Ensure that an Input/Output Error is raised
4. Cat the file again ~1sec later -> Ensure that the file was healed

the files are getting healed based on latest mtime
hence moving to verified

test version:3.8.4-11

Comment 9 errata-xmlrpc 2017-03-23 06:13:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

Note You need to log in before you can comment on or make changes to this bug.