Bug 836101 - Recurring unhealable split-brain
Product: GlusterFS
Classification: Community
Component: replicate
Hardware: x86_64 Linux
Version: unspecified
Severity: urgent
Assigned To: Pranith Kumar K
Status: Reopened
Depends On:
Reported: 2012-06-28 02:35 EDT by Johannes Martin
Modified: 2013-02-25 06:07 EST (History)
CC List: 3 users

Doc Type: Bug Fix
Last Closed: 2013-02-22 06:31:20 EST
Type: Bug

Attachments: None
Description Johannes Martin 2012-06-28 02:35:21 EDT
Description of problem:
I have various character device files stored on a GlusterFS volume. For some reason, after upgrading from 3.1.2 to 3.3.0, one of these files got into a split-brain condition. Deleting one of the replicas does not resolve the split-brain condition.

Version-Release number of selected component (if applicable):
glusterfs 3.3.0 built on Jun 24 2012 22:48:03
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>

How reproducible:
Not sure.

Steps to Reproduce:
1. on client: ls /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2
--> ls: cannot access /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2: Input/output error
2. on server: rm /media/gluster/brick0/vz/var-lib-vz/private/6003/dev/ttyp2
3. on client: ls /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2
crw-rw-rw- 1 root tty 3, 2 Dec 10  2008 /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2
4. on client: ls /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2
ls: cannot access /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2: Input/output error
Actual results:
I/O error because of split brain condition.

Expected results:
Split brain healed.

Additional info:
Excerpt from client log file:
[2012-06-28 08:31:31.854210] E [afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-vz-replicate-0: path /var-lib-vz/private/6003/dev/ttyp2 on subvolume vz-client-1 => -1 (No such file or directory)
[2012-06-28 08:31:31.856216] E [afr-self-heal-metadata.c:481:afr_sh_metadata_fix] 0-vz-replicate-0: Unable to self-heal permissions/ownership of '/var-lib-vz/private/6003/dev/ttyp2' (possible split-brain). Please fix the file on all backend volumes
[2012-06-28 08:31:31.856528] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-vz-replicate-0: background  meta-data data entry missing-entry gfid self-heal failed on /var-lib-vz/private/6003/dev/ttyp2
[2012-06-28 08:31:33.859259] W [afr-self-heal-data.c:831:afr_lookup_select_read_child_by_txn_type] 0-vz-replicate-0: /var-lib-vz/private/6003/dev/ttyp2: Possible split-brain

getfattr -d -m trusted.gfid -e hex /media/gluster/brick0/vz/var-lib-vz/private/6003/dev/ttyp2
on either server yields no result, so the solution from https://bugzilla.redhat.com/show_bug.cgi?id=825559 cannot be applied here.
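For reference, the usual way to confirm split-brain on a replica pair is to compare the AFR changelog extended attributes of the file on each brick. A minimal sketch, using the brick path and volume name ("vz") from this report; this is illustrative and must be run as root on each server, and the exact `trusted.afr.*` attribute names depend on the volume layout:

```shell
# Illustrative sketch: inspect the AFR changelog xattrs on each brick to see
# which replica blames which. Brick path and file are taken from this report.
FILE=/media/gluster/brick0/vz/var-lib-vz/private/6003/dev/ttyp2

# Dump all trusted.* xattrs (trusted.gfid plus trusted.afr.vz-client-N):
getfattr -d -m trusted -e hex "$FILE"

# Non-zero trusted.afr.* counters on BOTH bricks, each blaming the other,
# indicate split-brain. A missing trusted.gfid (as observed in this bug)
# means the gfid-based repair from bug 825559 cannot be applied.

# On GlusterFS 3.3, the self-heal daemon can also list suspected files:
gluster volume heal vz info split-brain
```

Here, both replicas carrying pending changelog counters for each other is what makes the file unhealable automatically; the absence of `trusted.gfid` is the unusual part of this report.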
Comment 1 Pranith Kumar K 2012-07-02 06:32:01 EDT

*** This bug has been marked as a duplicate of bug 832305 ***
Comment 2 Johannes Martin 2012-09-06 05:26:22 EDT
(In reply to comment #1)
> *** This bug has been marked as a duplicate of bug 832305 ***

I don't think this bug is really a duplicate of bug 832305.

I applied the patch from 832305, deleted the inaccessible files on one brick, and checked that they were accessible again. They were indeed recreated on the brick where I had deleted them, and the files were accessible through the glusterfs mount.

A couple hours later, the rsync process that syncs some other non-glusterfs mount to the glusterfs-mount reported errors again, and the files were again inaccessible.
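The manual repair described above (delete the bad replica on one brick, then let self-heal recreate it) can be sketched as follows. This is an assumption-laden illustration, not a confirmed procedure for this bug: paths are taken from the report, and the gfid-hardlink step is hypothetical here because `getfattr` returned no `trusted.gfid` for this file. On GlusterFS 3.3, non-directory files normally have a hardlink under the brick's `.glusterfs` directory that should be removed along with the replica, or the stale copy can resurface:

```shell
# Illustrative sketch of the manual split-brain repair, assuming a
# GlusterFS 3.3 brick layout. Run on ONE server only.
BRICK=/media/gluster/brick0/vz
BAD="$BRICK/var-lib-vz/private/6003/dev/ttyp2"

# 1. Remove the bad replica from this brick:
rm "$BAD"

# 2. If the file's gfid is known (getfattr -n trusted.gfid -e hex), also
#    remove its hardlink under .glusterfs; e.g. for gfid aabbccdd-... :
#    rm "$BRICK/.glusterfs/aa/bb/aabbccdd-..."
#    (Not possible in this bug, since trusted.gfid was missing.)

# 3. Trigger self-heal by looking the file up through the client mount:
ls -l /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2
```

If the split-brain recurs after this, as reported here, the pending changelog counters are presumably being re-marked by ongoing writes, which is why a reproducible test case was requested below.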
Comment 3 Pranith Kumar K 2012-09-06 05:41:11 EDT
Could you please provide a test case to re-create the issue on our setup?
Comment 4 Pranith Kumar K 2012-09-22 23:26:35 EDT
Any updates on the test case to re-create the problem?

Thanks in advance for your help.
Comment 5 Johannes Martin 2012-09-28 01:14:47 EDT
Sorry for taking so long to get back to you. I'm currently in the process of upgrading the OS on the server (from Debian Lenny to Squeeze) and will then recreate the gluster shares from scratch and try to reproduce the problem.
Comment 6 Vijay Bellur 2012-12-11 00:30:53 EST
Any luck with re-creating the problem?
Comment 7 Johannes Martin 2012-12-17 03:56:55 EST
Sorry, I haven't had any time to work on this again. Maybe early next year.
Comment 8 Pranith Kumar K 2013-02-22 06:31:20 EST
Please feel free to re-open the bug with the data requested.
Comment 9 Johannes Martin 2013-02-25 06:02:52 EST
Sorry again for the slow response. 

I recreated the shares about three weeks ago and I've been running the rsync that originally led to the split brain daily so far without any problems. So I assume the problem is solved now.

Maybe there was some problem with the migration from pre-3.3.0 GlusterFS to 3.3.0 that led to the permanent split-brain.
Comment 10 Pranith Kumar K 2013-02-25 06:07:42 EST
Thanks for the response. We shall keep the bug closed for now.

