Bug 765194 (GLUSTER-3462)

Summary:	[glusterfs-3.3.0qa6]: split brain happened when one of the servers brought down
Product:	[Community] GlusterFS	Reporter:	Raghavendra Bhat <rabhat>
Component:	replicate	Assignee:	Pranith Kumar K <pkarampu>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Raghavendra Bhat <rabhat>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	pre-release	CC:	gluster-bugs, rfortier, vbellur
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	glusterfs-3.4.0	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2013-07-24 17:29:08 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:	glusterfs-3.3.0qa43.	Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	817967

Description Raghavendra Bhat 2011-08-22 10:31:43 UTC

Split brain happened when one of the servers was brought down and then brought back up.

Situation:

When one of the servers was brought down, afr had already done the xattrop on the extended attribute of the file corresponding to that server, while the xattr corresponding to the other server on that file was still pending. The server was brought down before that second xattrop reached it, so its copy of the file still indicates that an operation is pending on the other subvolume. Further changes to the file from the mount point happened only on the server that was up, which modified its xattrs on the file to indicate that operations are pending on the other subvolume. When the down server is brought back up, each copy of the file marks the other as pending, which leads to a split-brain situation.
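
The situation above can be modelled with a small sketch. This is an illustration only, not GlusterFS code: the `pending` matrix, `find_sources` and `is_split_brain` are hypothetical names, and the counts are made-up values. It assumes a two-brick replica where `pending[i][j]` is the number of operations that brick i's xattrs record as pending on brick j.

```python
# Illustrative model only; names and values are assumptions, not GlusterFS internals.

def find_sources(pending):
    """Return the bricks that no other brick marks as having pending operations."""
    n = len(pending)
    return [
        j for j in range(n)
        if all(pending[i][j] == 0 for i in range(n) if i != j)
    ]


def is_split_brain(pending):
    """Split brain: something is pending, yet every brick is accused by another."""
    n = len(pending)
    any_pending = any(
        pending[i][j] > 0 for i in range(n) for j in range(n) if i != j
    )
    return any_pending and not find_sources(pending)


# Brick 0 went down after clearing its own entry but before clearing the
# entry for brick 1. Brick 1 then took further writes alone.
reported_case = [
    [0, 1],  # brick 0 still records 1 pending operation on brick 1
    [3, 0],  # brick 1 records 3 pending operations on brick 0
]

# Same outage, but brick 0 cleared both entries before going down.
clean_case = [
    [0, 0],
    [3, 0],
]

print(is_split_brain(reported_case), find_sources(reported_case))
print(is_split_brain(clean_case), find_sources(clean_case))
```

In the second case brick 1 is the only unaccused copy, so it can act as the source for self-heal. In the first case no copy is unaccused, which matches the "possible split-brain" error in the log below.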


[2011-08-22 15:13:23.132624] I [afr-inode-write.c:340:afr_trigger_open_fd_self_heal] 0-mirror-replicate-0:  data missing-entry gfid self-heal triggered. path: /passwd, reason: Replicate up down flush, data lock is held
[2011-08-22 15:13:23.154583] I [afr-common.c:1225:afr_launch_self_heal] 0-mirror-replicate-0: background  data missing-entry gfid self-heal triggered. path: /passwd
[2011-08-22 15:13:23.461168] I [afr-self-heal-common.c:1210:afr_sh_missing_entries_lookup_done] 0-mirror-replicate-0: No sources for dir of /passwd, in missing entry self-heal, continuing with the rest of the self-heals
[2011-08-22 15:13:23.499398] E [afr-self-heal-data.c:683:afr_sh_data_fix] 0-mirror-replicate-0: Unable to self-heal contents of '/passwd' (possible split-brain). Please delete the file from all but the preferred subvolume.
[2011-08-22 15:13:23.499541] E [afr-self-heal-common.c:2019:afr_self_heal_completion_cbk] 0-mirror-replicate-0: background  data missing-entry gfid self-heal failed on /passwd
[2011-08-22 15:13:24.160815] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 5324247: LOOKUP() /passwd => -1 (Input/output error)
[2011-08-22 15:13:24.161618] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 5324252: LOOKUP() /passwd => -1 (Input/output error)
[2011-08-22 15:15:39.128152] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 5338631: LOOKUP() /passwd => -1 (Input/output error)
[2011-08-22 15:15:39.136005] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 5338633: LOOKUP() /passwd => -1 (Input/output error)
[2011-08-22 15:15:39.215522] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 5338634: LOOKUP() /passwd => -1 (Input/output error)
[2011-08-22 15:15:40.264977] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 5338643: LOOKUP() /passwd => -1 (Input/output error)
[2011-08-22 15:15:40.313234] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 5338661: LOOKUP() /passwd => -1 (Input/output error)
[2011-08-22 15:15:40.646996] W [fuse-bridge.c:184:fuse_entry_cbk] 0-glusterfs-fuse: 5338711: LOOKUP() /passwd => -1 (Input/output error)

Comment 1 Pranith Kumar K 2012-04-16 16:10:22 UTC

For this bug to hit, the brick process needs to die exactly after the change-log for itself is decremented to zero and before the pending change-log for the other subvolume is decremented to zero. This is a corner case that is very difficult to hit. It is not a blocker.
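
The timing window described in this comment can be sketched as follows. This is a simplified model, not the afr implementation: `post_op` and `crash_after` are hypothetical names, and it assumes the two change-log decrements on a brick are applied as separate steps, own entry first, as the comment describes.

```python
# Simplified model of the crash window; not GlusterFS code.

def post_op(changelog, self_idx, crash_after=None):
    """Decrement each change-log counter on one brick, own entry first.

    changelog   -- list of pending counters stored on this brick
    self_idx    -- index of the counter that refers to this brick itself
    crash_after -- number of decrements applied before the brick dies
                   (None means the brick survives the whole post-op)
    """
    order = [self_idx] + [i for i in range(len(changelog)) if i != self_idx]
    result = list(changelog)
    for step, idx in enumerate(order):
        if crash_after is not None and step >= crash_after:
            break  # brick process died here; remaining counters stay set
        result[idx] -= 1
    return result


# After the pre-op, brick 0 holds one pending count for each subvolume.
after_pre_op = [1, 1]

completed = post_op(after_pre_op, self_idx=0)
died_in_window = post_op(after_pre_op, self_idx=0, crash_after=1)

print(completed)
print(died_in_window)
```

Only a crash between the two decrements leaves the brick with a zero count for itself and a non-zero count for the other subvolume, which is why the case is hard to reproduce.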

Comment 2 Vidya Sakar 2012-04-16 16:16:37 UTC

Agree and removing blocker flag as discussed.

Comment 3 Anand Avati 2012-04-16 18:20:13 UTC

CHANGE: http://review.gluster.com/3149 (cluster/afr: increment change log with correct byte order) merged in master by Vijay Bellur (vijay)

Comment 4 Vijay Bellur 2012-05-18 12:37:39 UTC

Not a blocker. Removing blocker flag.

Comment 5 Anand Avati 2012-05-19 03:30:38 UTC

CHANGE: http://review.gluster.com/3226 (cluster/afr: Enforce order in pre/post op) merged in master by Anand Avati (avati)

Comment 6 Raghavendra Bhat 2012-05-24 09:24:30 UTC

Checked with glusterfs-3.3.0qa43. Brought down the brick many times while tests were running on the mount point, and the bug did not occur.