Bug 836101 - Recurring unhealable split-brain
Status: CLOSED INSUFFICIENT_DATA
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.3.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Assigned To: Pranith Kumar K
Reported: 2012-06-28 02:35 EDT by Johannes Martin
Modified: 2013-02-25 06:07 EST
CC: 3 users

Doc Type: Bug Fix
Last Closed: 2013-02-22 06:31:20 EST
Type: Bug


Attachments: None
Description Johannes Martin 2012-06-28 02:35:21 EDT
Description of problem:
I have various character device files stored on a glusterfs volume. For some reason, after upgrading from 3.1.2 to 3.3.0, one of these files got into a split-brain condition. Deleting one of the replicas does not resolve the split-brain.
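
For context, the usual manual split-brain fix on a 3.3 replica pair removes not just the bad copy but also its gfid hard-link under the brick's .glusterfs directory (a sketch; the gfid value below is purely illustrative, and as noted at the end of this description this particular file has no trusted.gfid xattr, which may be why a plain delete did not stick):

# On the brick holding the bad copy, note the gfid first:
getfattr -n trusted.gfid -e hex /media/gluster/brick0/vz/var-lib-vz/private/6003/dev/ttyp2
# Remove the bad copy and its gfid hard-link (path is derived from the gfid):
rm /media/gluster/brick0/vz/var-lib-vz/private/6003/dev/ttyp2
rm /media/gluster/brick0/vz/.glusterfs/ab/cd/abcdef01-...   # illustrative gfid path
# Then trigger self-heal by looking the file up from a client mount:
stat /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2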

Version-Release number of selected component (if applicable):
glusterfs 3.3.0 built on Jun 24 2012 22:48:03
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>


How reproducible:
Not sure.

Steps to Reproduce:
1. on client: ls /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2
--> ls: cannot access /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2: Input/output error
2. on server: rm /media/gluster/brick0/vz/var-lib-vz/private/6003/dev/ttyp2
3. on client: ls /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2
crw-rw-rw- 1 root tty 3, 2 Dec 10  2008 /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2
4. on client: ls /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2
ls: cannot access /server/gluster/vz/var-lib-vz/private/6003/dev/ttyp2: Input/output error
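
The flip-flop between steps 3 and 4 matches entry self-heal recreating the name while the metadata split-brain persists. One way to check this (a sketch; the trusted.afr.vz-client-* xattr names are assumed from the volume name "vz") is to dump the AFR changelog xattrs of the parent directory on both servers:

# Non-zero pending counters on both bricks, each blaming the other,
# indicate split-brain:
getfattr -d -m trusted.afr -e hex /media/gluster/brick0/vz/var-lib-vz/private/6003/dev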
  
Actual results:
I/O error because of the split-brain condition.

Expected results:
Split-brain healed.

Additional info:
Excerpt from client log file:
[2012-06-28 08:31:31.854210] E [afr-self-heal-common.c:1087:afr_sh_common_lookup_resp_handler] 0-vz-replicate-0: path /var-lib-vz/private/6003/dev/ttyp2 on subvolume vz-client-1 => -1 (No such file or directory)
[2012-06-28 08:31:31.856216] E [afr-self-heal-metadata.c:481:afr_sh_metadata_fix] 0-vz-replicate-0: Unable to self-heal permissions/ownership of '/var-lib-vz/private/6003/dev/ttyp2' (possible split-brain). Please fix the file on all backend volumes
[2012-06-28 08:31:31.856528] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-vz-replicate-0: background  meta-data data entry missing-entry gfid self-heal failed on /var-lib-vz/private/6003/dev/ttyp2
[2012-06-28 08:31:33.859259] W [afr-self-heal-data.c:831:afr_lookup_select_read_child_by_txn_type] 0-vz-replicate-0: /var-lib-vz/private/6003/dev/ttyp2: Possible split-brain
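
On 3.3 the self-heal daemon also tracks split-brain candidates, so they can be listed from any server without grepping client logs (volume name "vz" taken from the 0-vz-replicate-0 log prefix above):

gluster volume heal vz info split-brain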


Running 
getfattr -d -m trusted.gfid -e hex /media/gluster/brick0/vz/var-lib-vz/private/6003/dev/ttyp2
on either server yields no result, so the solution from https://bugzilla.redhat.com/show_bug.cgi?id=825559 cannot be applied here.
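
A possible next step when a single-name getfattr query comes back empty (a suggestion, not from the original report): dump all trusted.* xattrs, since getfattr matches only user.* names by default and prints nothing for trusted.* without root privileges:

# Run as root; an empty answer here means the brick really has no gfid/afr xattrs:
getfattr -d -m . -e hex /media/gluster/brick0/vz/var-lib-vz/private/6003/dev/ttyp2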
Comment 1 Pranith Kumar K 2012-07-02 06:32:01 EDT

*** This bug has been marked as a duplicate of bug 832305 ***
Comment 2 Johannes Martin 2012-09-06 05:26:22 EDT
(In reply to comment #1)
> 
> *** This bug has been marked as a duplicate of bug 832305 ***

I don't think this bug is really a duplicate of bug 832305.

I applied the patch from bug 832305, deleted the inaccessible files on one brick, and checked the result: the files were indeed recreated on the brick where I had deleted them, and they were accessible again through the glusterfs mount.
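
For reference, 3.3 can also crawl the whole volume and heal everything it finds, which avoids relying on per-file lookups after deleting replicas (sketch, using this report's volume name):

gluster volume heal vz full
gluster volume heal vz info    # check what is still pending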

A couple of hours later, the rsync process that syncs another, non-glusterfs mount to the glusterfs mount reported errors again, and the files were once more inaccessible.
Comment 3 Pranith Kumar K 2012-09-06 05:41:11 EDT
Could you please provide a test case so we can re-create the issue on our setup?
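
A sketch of the kind of reproduction script being asked for (entirely hypothetical: host and brick names are placeholders, and the alternate kill-and-modify pattern is the generic way to force a metadata split-brain, not confirmed as what happened in this report):

# Create a 2-way replica volume with a char device on it:
gluster volume create vz replica 2 host1:/bricks/b1 host2:/bricks/b2
gluster volume start vz
mount -t glusterfs host1:/vz /mnt
mknod /mnt/ttyp2 c 3 2
# Change metadata while each brick is down in turn
# (assumes self-heal has not synced in between):
pkill -f 'glusterfsd.*b2'          # take brick 2 down
chown root:tty /mnt/ttyp2
gluster volume start vz force      # bring it back
pkill -f 'glusterfsd.*b1'          # take brick 1 down
chmod 666 /mnt/ttyp2
gluster volume start vz force
# Each brick now blames the other for metadata changes:
ls -l /mnt/ttyp2                   # expected: Input/output error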
Comment 4 Pranith Kumar K 2012-09-22 23:26:35 EDT
Johannes,
    Any updates on the test-case to re-create the problem?

Thanks in advance for your help,
Pranith
Comment 5 Johannes Martin 2012-09-28 01:14:47 EDT
Sorry for taking so long to get back to you. I'm currently in the process of upgrading the OS on the server (from Debian Lenny to Squeeze) and will then recreate the gluster shares from scratch and try to reproduce the problem.
Comment 6 Vijay Bellur 2012-12-11 00:30:53 EST
Any luck with re-creating the problem?
Comment 7 Johannes Martin 2012-12-17 03:56:55 EST
Sorry, I haven't had any time to work on this again. Maybe early next year.
Comment 8 Pranith Kumar K 2013-02-22 06:31:20 EST
Please feel free to re-open the bug with the data requested.
Comment 9 Johannes Martin 2013-02-25 06:02:52 EST
Sorry again for the slow response. 

I recreated the shares about three weeks ago, and the rsync job that originally led to the split-brain has been running daily without any problems since. So I assume the problem is solved now.

Maybe there was some problem with the migration from pre-3.3.0 glusterfs to 3.3.0 that led to the permanent split-brain.
Comment 10 Pranith Kumar K 2013-02-25 06:07:42 EST
Johannes,
    Thanks for the response. We shall keep the bug closed for now.

Pranith.
