Bug 762908 (GLUSTER-1176) - Two replica failures prevent self-heal even when one node recovers
Summary: Two replica failures prevent self-heal even when one node recovers
Keywords:
Status: CLOSED WORKSFORME
Alias: GLUSTER-1176
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-07-20 08:42 UTC by Shehjar Tikoo
Modified: 2011-08-16 10:46 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: RTP
Mount Type: nfs
Documentation: DP
CRM:
Verified Versions:



Description Shehjar Tikoo 2010-07-20 06:46:29 UTC
(In reply to comment #0)
> In a three replica setup, if one replica goes down, IO continues as normal to
> the two remaining replicas, and self-heals when this replica comes back up.
> 
> When two replicas go down, IO continues to the one remaining replica. When one
> replica comes back up, self-heal allows IO on the source replica to continue in
> parallel to self-heal on the destination replica.


The correct behaviour is to pause IO on all replicas and allow only self-heal to proceed on the replica that came back up. Right now, parallel self-heal and IO on the source replica result in data corruption on the recovering replica.

Comment 1 Shehjar Tikoo 2010-07-20 08:42:53 UTC
In a three-replica setup, if one replica goes down, IO continues as normal to the two remaining replicas, and the downed replica self-heals when it comes back up.

When two replicas go down, IO continues to the one remaining replica. When one replica comes back up, self-heal allows IO on the source replica to continue in parallel with self-heal on the destination replica.

Comment 2 Raghavendra Bhat 2011-04-07 05:25:28 UTC
I did a similar kind of test: created a 3-replica volume, started it, and mounted it over NFS. I started a dd of a 2 GB file and brought down 2 bricks. After around 500 MB had been copied, I brought the 2 bricks back up. Once the file was entirely written, I checked the md5sum of the file on all the bricks; it was the same on all the bricks and on the mount point. It seems to be working fine.

Comment 3 Shehjar Tikoo 2011-04-07 06:06:08 UTC
The problem shows up when only one of the downed nodes is brought back up, not both. Leave the third replica down and let the IO complete, then check the md5sum. Make sure you're using an input other than /dev/zero for dd.
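
A minimal sketch of that repro, for anyone re-running it. The volume name, hostnames and brick paths are placeholders, and "taking a brick down" here means killing its glusterfsd process on that host (the exact mechanism may differ by version):

    # Create and start a 3-replica volume (names and paths are placeholders)
    gluster volume create testvol replica 3 host1:/bricks/b1 host2:/bricks/b2 host3:/bricks/b3
    gluster volume start testvol

    # NFS-mount it on a client and start writing non-zero data
    mount -t nfs -o vers=3 host1:/testvol /mnt/testvol
    dd if=/dev/urandom of=/mnt/testvol/bigfile bs=1M count=2048 &

    # While dd is running: take two bricks down (kill glusterfsd on host2 and host3),
    # then bring back only ONE of them and leave the third replica down.

    # Once dd finishes, compare checksums on the mount and on the backend bricks
    md5sum /mnt/testvol/bigfile        # on the client
    md5sum /bricks/b1/bigfile          # on host1 (the replica that stayed up)
    md5sum /bricks/b2/bigfile          # on host2 (the recovered replica)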

Comment 4 Amar Tumballi 2011-04-25 09:33:13 UTC
Please update the status of this bug, as it has been more than 6 months since it was filed (bug id < 2000).

Please resolve it with the proper resolution if it is no longer valid. If it is still valid but not critical, move it to 'enhancement' severity.

Comment 5 Shehjar Tikoo 2011-04-26 03:11:03 UTC
Still valid.

Comment 6 Vijaykumar 2011-08-16 07:46:44 UTC
I created a 3-replica volume, started it, and mounted it as an NFS mount point. I started a dd of 2 GB with if=/dev/urandom at the mount point, then brought down 2 bricks. I was monitoring the du of the file on all the bricks and at the mount. After the two bricks went down, the file size kept increasing on the one brick that was up and stayed constant on the other two. When I brought up one of the bricks that were down, IO stopped at the mount point until self-healing was complete; after that, the file size on both bricks and at the mount point started increasing simultaneously. I have performed this test some 7 to 8 times, and it consistently worked fine.
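
The kind of monitoring described above can be done with something like the following sketch (paths are the same placeholders as in the earlier repro sketch; run each loop on the respective host, in its own terminal):

    # Watch the file grow on the client mount and on each backend brick while dd runs
    while true; do du -sh /mnt/testvol/bigfile; sleep 5; done    # on the client
    while true; do du -sh /bricks/b1/bigfile; sleep 5; done      # on host1 (stayed up)
    while true; do du -sh /bricks/b2/bigfile; sleep 5; done      # on host2 (brought back up)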

