Bug 1408660

Summary:	Setup a CentOS 7 VM to test split-brain-favorite-child-policy.t failures
Product:	[Community] GlusterFS	Reporter:	Nigel Babu <nigelb>
Component:	project-infrastructure	Assignee:	bugs <bugs>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	mainline	CC:	bugs, gluster-infra, ravishankar
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-01-12 04:39:55 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Nigel Babu 2016-12-26 09:08:20 UTC

split-brain-favorite-child-policy.t fails consistently on CentOS 7. Create a test VM for Ravi so he can run the tests and figure out what's wrong.

Comment 1 Nigel Babu 2016-12-26 11:56:53 UTC

Machine created and access granted to Ravi.

Comment 2 Ravishankar N 2016-12-26 12:32:24 UTC

Thanks for the setup, Nigel. What is happening is this:

When `TEST dd if=/dev/urandom of=$M0/file bs=1024 count=1024` is run with a brick down on Fedora, CentOS 6, etc., there are only pending data heals because writevs are the only FOPs hitting the file.

# getfattr -d -m . -e hex /d/backends/patchy*/file
getfattr: Removing leading '/' from absolute path names
# file: d/backends/patchy0/file
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.patchy-client-1=0x000000020000000000000000
trusted.bit-rot.version=0x02000000000000005861074c0007eb8b
trusted.gfid=0x14f0e574e9154e36b5158176582de68d

But when the same test is run on CentOS 7, there is also a removexattr FOP coming in:
afr_removexattr (frame=0x7f61e00163cc, this=0x7f61e800b8a0, loc=0x7f61dc03a33c, name=0x7f61dc02b760 "security.ima", xdata=0x7f61e001310c) at afr-inode-write.c:1827

Since a brick is down, the dirty xattr is set in the pre-op, and since security.ima is not present on the brick, the cbk gets an op-ret of -1 and errno=ENODATA. This is treated as a symmetric error, so the FOP is treated as a success and dirty is not unset. Thus we have:

[root@centos-7-test glusterfs]# g /d/backends/patchy*/file
getfattr: Removing leading '/' from absolute path names
# file: d/backends/patchy0/file
trusted.afr.dirty=0x000000000000000100000000 <------- This is not cleared.
trusted.afr.patchy-client-1=0x000000020000000000000000
trusted.bit-rot.version=0x02000000000000005861074c0007eb8b
trusted.gfid=0x14f0e574e9154e36b5158176582de68d
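
To make the two getfattr outputs easier to compare, here is a small sketch that decodes the AFR changelog values. It assumes the usual layout of these xattrs: 12 bytes holding three 32-bit big-endian counters for data, metadata and entry operations, in that order. The helper name decode_afr_xattr is made up for illustration and is not part of GlusterFS.

```python
import struct


def decode_afr_xattr(hex_value):
    """Split a 12-byte AFR changelog value into its three counters.

    Assumed layout: three 32-bit big-endian integers, in the order
    data, metadata, entry.
    """
    text = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


# Values copied from the getfattr outputs in this comment.
print("patchy-client-1:", decode_afr_xattr("0x000000020000000000000000"))
print("dirty (CentOS 7):", decode_afr_xattr("0x000000000000000100000000"))
```

Under that assumption, trusted.afr.patchy-client-1 carries only data pending counts in both runs, while the CentOS 7 trusted.afr.dirty value has its metadata counter set to 1, which is consistent with the metadata self-heal described next.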


Now when both bricks are up, metadata self-heal happens and updates the ctimes on the bricks as a part of undo-pending. Since this is done in a for loop, the 2nd brick will have the latest ctime.


This breaks the assumption in the .t that the 1st brick has the latest ctime (which would have been the case had it not been for the removexattr FOP), resulting in a heal in the opposite direction and hence failing the md5sum comparison check in the .t.


I will fix the .t to note ctime of both bricks from the back end and then pick up the one with the latest ctime as source. I'm retaining the machine until I test and send the patch.
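
The idea of reading the ctime from both bricks and choosing the newer one can be sketched as below. This is only an illustration in Python, not the actual change to the .t (which is a shell test); the function names and the temporary files standing in for the backend copies of the file are assumptions.

```python
import os
import tempfile


def brick_ctimes(paths):
    """Map each backend path to its ctime in nanoseconds."""
    return {path: os.stat(path).st_ctime_ns for path in paths}


def latest_ctime_brick(ctimes):
    """Return the path with the largest ctime (first one wins on a tie)."""
    if not ctimes:
        raise ValueError("no bricks given")
    return max(ctimes, key=ctimes.get)


# Demo with two temporary files standing in for the brick copies.
with tempfile.TemporaryDirectory() as tmp:
    bricks = [os.path.join(tmp, name) for name in ("patchy0_file", "patchy1_file")]
    for path in bricks:
        with open(path, "w") as handle:
            handle.write("data")
    os.chmod(bricks[1], 0o600)  # a metadata change updates ctime
    source = latest_ctime_brick(brick_ctimes(bricks))
    print("expected heal source:", os.path.basename(source))
```

Choosing the source from the observed ctimes, rather than assuming it is the 1st brick, keeps the check valid regardless of which brick the self-heal touched last.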

Comment 3 Ravishankar N 2016-12-27 05:48:36 UTC

Sent http://review.gluster.org/#/c/16288 against BZ 1408757. Nigel, will update here once the patch gets merged, so that you can take back the machine.

Comment 4 Ravishankar N 2017-01-03 11:10:06 UTC

Nigel, the patch has been merged in master. Feel free to take back the CentOS and NetBSD machines.

Comment 5 Nigel Babu 2017-01-12 04:39:55 UTC

NetBSD machine returned to pool and CentOS 7 machine killed.