Bug 1401969

Summary:	Bringing down data bricks in cyclic order results in arbiter brick becoming the source for heal.
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	RamaKasturi <knarra>
Component:	arbiter	Assignee:	Karthik U S <ksubrahm>
Status:	CLOSED ERRATA	QA Contact:	SATHEESARAN <sasundar>
Severity:	unspecified	Docs Contact:
Priority:	high
Version:	rhgs-3.2	CC:	amukherj, anepatel, bkunal, ksubrahm, nchilaka, ravishankar, rcyriac, rhinduja, rhs-bugs, sabose, sasundar, sheggodu, srmukher, storage-qa-internal
Target Milestone:	---
Target Release:	RHGS 3.4.0
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	glusterfs-3.12.2-2	Doc Type:	Bug Fix
Doc Text:	Previously, when bricks went down in a particular order while parallel I/O was in progress, the arbiter brick became the source for data heal. This led to data being unavailable, since arbiter bricks store only metadata. With this fix, arbiter brick will not be marked as source.	Story Points:	---
Clone Of:
Clones:	1482064		Environment:
Last Closed:	2018-09-04 06:29:55 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1439657, 1482064, 1516313, 1566131
Bug Blocks:	1433896, 1503134

Description RamaKasturi 2016-12-06 13:32:15 UTC

Description of problem:
When the data bricks in an arbiter volume are brought down in a cyclic manner, I see that the arbiter brick becomes the source for heal. This should not happen, as this brick contains only metadata.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-5.el7rhgs.x86_64

How reproducible:
Hit it once

Steps to Reproduce:
1. Install the HC stack on arbiter volumes.
2. Start doing I/O on the VMs.
3. While I/O is in progress, bring down one of the bricks. After some time, bring that brick back up and bring down another data brick.
4. After some time, bring up the brick that is down. I observed that a few VMs get paused and the arbiter brick becomes the source for the other two bricks.

Actual results:
VMs are getting paused, and I see that the arbiter brick becomes the source for the other two bricks.

Expected results:
The arbiter brick should not become the source for the other two bricks, as it does not hold any data.

Additional info:

Comment 2 RamaKasturi 2016-12-06 13:41:45 UTC

Volume info :
================
[root@rhsqa-grafton1 ~]# gluster volume info data
 
Volume Name: data
Type: Replicate
Volume ID: 09d43f7c-a6a2-4f4d-b781-c36e53a48bca
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.79:/rhgs/brick2/data
Brick2: 10.70.36.80:/rhgs/brick2/data
Brick3: 10.70.36.81:/rhgs/brick2/data (arbiter)
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
performance.strict-o-direct: on
network.ping-timeout: 30
user.cifs: off
cluster.granular-entry-heal: on

gluster volume heal info output on data volume:
==============================================
[root@rhsqa-grafton1 ~]# gluster volume heal data info
Brick 10.70.36.79:/rhgs/brick2/data
/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73 
Status: Connected
Number of entries: 1

Brick 10.70.36.80:/rhgs/brick2/data
/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73 
Status: Connected
Number of entries: 1

Brick 10.70.36.81:/rhgs/brick2/data
/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73 
Status: Connected
Number of entries: 1

fattrs on the first node:
============================
[root@rhsqa-grafton1 ~]# getfattr -d -m . -e hex /rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.data-client-1=0x0000156a0000000000000000
trusted.afr.dirty=0x000000010000000000000000
trusted.bit-rot.version=0x0200000000000000583ebacf000b18d6
trusted.gfid=0x46744dafdde147758967c233e249f707
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000035af0000000000000000000000000000001c1f600000000000000000

fattrs on the second node:
=============================
getfattr -d -m . -e hex /rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.data-client-0=0x000000010000000000000000
trusted.afr.dirty=0x000000010000000000000000
trusted.bit-rot.version=0x0300000000000000583ecac6000e17f7
trusted.gfid=0x46744dafdde147758967c233e249f707
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000035af0000000000000000000000000000001c1f600000000000000000

fattrs on the third node:
==============================
[root@rhsqa-grafton3 ~]# getfattr -d -m . -e hex /rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
getfattr: Removing leading '/' from absolute path names
# file: rhgs/brick2/data/f3b0e738-03e9-49a1-886c-aa021cd8badb/images/ce010932-d6aa-4755-a445-f6ba9e508e88/f997ad93-101d-47e6-b0eb-37164a617d73
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.data-client-0=0x000000010000000000000000
trusted.afr.data-client-1=0x0000156a0000000000000000
trusted.afr.dirty=0x000000010000000000000000
trusted.bit-rot.version=0x0200000000000000583eb09a000b126b
trusted.gfid=0x46744dafdde147758967c233e249f707
trusted.glusterfs.shard.block-size=0x0000000020000000
trusted.glusterfs.shard.file-size=0x0000000035af0000000000000000000000000000001c1f600000000000000000
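
To read the dumps above, it may help to decode the trusted.afr.* values. The sketch below assumes the usual AFR changelog layout of three big-endian 32-bit counters (data, metadata, entry pending). The helper name decode_afr is hypothetical and exists only for illustration; the hex strings are copied from the getfattr output in this comment.

```python
import struct


def decode_afr(hex_value):
    """Split a trusted.afr.* value into data/metadata/entry pending counts.

    Assumption: the value is 12 bytes holding three big-endian
    32-bit counters, in the order data, metadata, entry.
    """
    text = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected a 12-byte changelog value")
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


# Values copied from the getfattr output of the three nodes.
xattrs = {
    "first node": {"trusted.afr.data-client-1": "0x0000156a0000000000000000"},
    "second node": {"trusted.afr.data-client-0": "0x000000010000000000000000"},
    "third node": {
        "trusted.afr.data-client-0": "0x000000010000000000000000",
        "trusted.afr.data-client-1": "0x0000156a0000000000000000",
    },
}

decoded = {
    node: {name: decode_afr(value) for name, value in attrs.items()}
    for node, attrs in xattrs.items()
}

for node, attrs in decoded.items():
    for name, counts in attrs.items():
        print(node, name, counts)
```

Under that assumption, every non-zero counter is in the data part: the first node holds 5482 pending data operations against data-client-1, the second node holds 1 against data-client-0, and the third node (the arbiter) holds both.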

Comment 3 RamaKasturi 2016-12-06 13:50:28 UTC

sosreports can be found at the link:
=====================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1401969/

Comment 4 Nag Pavan Chilakam 2016-12-07 10:40:53 UTC

Is this similar to https://bugzilla.redhat.com/show_bug.cgi?id=1361518 - Files not able to heal after arbiter and data bricks were rebooted?

Comment 5 Ravishankar N 2016-12-07 10:54:19 UTC

(In reply to nchilaka from comment #4)
> is this similar to https://bugzilla.redhat.com/show_bug.cgi?id=1361518 -
> Files not able to heal after arbiter and data bricks were rebooted ?

No, those are zero-byte files where the arbiter is used as the source brick when entry self-heal creates a new entry.

Comment 8 RamaKasturi 2017-04-12 14:00:37 UTC

I do see that the first two data bricks in the volume blame each other (the first brick says that the second one needs healing, and the second one says that the first one needs healing).
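
The mutual blame can be put into a small model. This is an illustrative sketch, not AFR's actual source-selection code: it assumes a brick can be a heal source only if no other brick holds a pending marker against it. The function name unblamed_bricks and the brick labels are hypothetical.

```python
def unblamed_bricks(blames):
    """Return the bricks that no other brick accuses of needing heal.

    `blames` maps each brick to the set of bricks it holds pending
    markers against (a simplified view of the trusted.afr.* xattrs).
    """
    accused = set()
    for accuser, targets in blames.items():
        accused.update(t for t in targets if t != accuser)
    return sorted(b for b in blames if b not in accused)


# Blame pattern reported in this bug: the two data bricks accuse each
# other, and the arbiter accuses both of them.
blames = {
    "brick1": {"brick2"},
    "brick2": {"brick1"},
    "arbiter": {"brick1", "brick2"},
}

print(unblamed_bricks(blames))
```

In this simplified view the arbiter is the only brick left unblamed, which matches the symptom of the arbiter ending up as the source.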

Comment 9 Sahina Bose 2017-04-18 05:59:24 UTC

Ravi, can you provide doc text with workaround?

Comment 13 SATHEESARAN 2017-06-16 07:53:15 UTC

Tested with the RHGS 3.3.0 interim build (glusterfs-3.8.4-28.el7rhgs), and I could hit this issue consistently, along with the other issue of split-brain on an arbiter volume (BZ 1384983).

A very simple test is to:
1. Create an arbiter volume 1x(2+1) with bricks brick1, brick2, and arbiter.
2. Fuse mount it on any RHEL 7 client.
3. Run some app (dd, truncate, etc.) on a single file.
4. Kill brick2.
5. Sleep for 3 seconds.
6. Bring up brick2, sleep for 3 seconds, kill the arbiter.
7. Sleep for 3 seconds.
8. Bring up the arbiter, sleep for 3 seconds, kill brick1.
9. Sleep for 3 seconds.
10. Continue from step 4.

When the above steps are repeated, I observed that I ended up either in split-brain (BZ 1384983) or with the arbiter becoming the source of heal.
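
The loop in steps 4 to 10 can be written down as a dry-run plan. This is only a sketch of the ordering: it does not run any gluster command, the action names "kill", "start" and "sleep" are hypothetical labels, and how a brick is actually killed or restarted is left out because the steps do not say.

```python
from itertools import islice


def brick_cycle(pause=3):
    """Yield (action, argument) pairs for steps 4 to 10, repeating forever."""
    while True:
        yield ("kill", "brick2")      # step 4
        yield ("sleep", pause)        # step 5
        yield ("start", "brick2")     # step 6
        yield ("sleep", pause)
        yield ("kill", "arbiter")
        yield ("sleep", pause)        # step 7
        yield ("start", "arbiter")    # step 8
        yield ("sleep", pause)
        yield ("kill", "brick1")
        yield ("sleep", pause)        # step 9
        # Step 10 loops back to step 4. The listed steps do not say when
        # brick1 is brought back up, so this plan does not say either.


# Print one full pass through the cycle.
for action, argument in islice(brick_cycle(), 10):
    print(action, argument)
```

Keeping the plan as data makes it easy to review the ordering before wiring it to whatever mechanism is used to stop and start bricks in a given test setup.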

Comment 16 Karthik U S 2017-07-12 06:18:50 UTC

There is a race that leads to this situation.

This happens when eager-lock is on, which allows 2 writes to happen in parallel on an FD. The first write fails on one brick and, before its post-op marks the pending xattrs, another write comes in parallel. This second write does the inode refresh and gets the readables. Since the xattrs have not yet been marked on disk, the refresh sees both data bricks as readable and sets that in the inode context.
The in-flight split-brain check sees both data bricks as readable and allows the second write. This write fails on the other brick and succeeds on the previously failed brick.
Now one write has failed on the first data brick and the other has failed on the second data brick. The post-op then completes for both writes and marks pending on both bricks, leading to the arbiter becoming the source.
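
The ordering problem can be reproduced in a toy model. This is a simplified sketch, not the AFR code: bricks are plain dictionary entries, the names simulate, post_op and readable_data_bricks are hypothetical, and the in-flight split-brain check is assumed to allow a write only if it succeeded on at least one data brick that the refresh saw as readable.

```python
BRICKS = ["data0", "data1", "arbiter"]
DATA_BRICKS = ["data0", "data1"]


def readable_data_bricks(disk):
    """Data bricks that no on-disk pending marker accuses."""
    blamed = set().union(*disk.values())
    return [b for b in DATA_BRICKS if b not in blamed]


def post_op(disk, failed_on):
    """Every brick where the write succeeded marks the failed brick pending."""
    for brick in BRICKS:
        if brick != failed_on:
            disk[brick].add(failed_on)


def simulate(racy):
    # disk[accuser] is the set of bricks that accuser marks as pending.
    disk = {b: set() for b in BRICKS}
    # Write 1 has failed on data0. Without the race, its post-op reaches
    # the disk before write 2 refreshes the inode.
    if not racy:
        post_op(disk, "data0")
    # Write 2 refreshes the inode and will fail on data1.
    readable = readable_data_bricks(disk)
    # Assumed check: write 2 must succeed on some readable data brick.
    write2_allowed = any(b != "data1" for b in readable)
    if write2_allowed:
        post_op(disk, "data1")
    if racy:
        post_op(disk, "data0")  # write 1's delayed post-op
    return disk, readable_data_bricks(disk)


for racy in (False, True):
    disk, good = simulate(racy)
    print("racy" if racy else "ordered", "-> good data bricks:", good)
```

With the post-op ordered before the refresh, the second write is rejected in this model and data1 stays a good copy. With the race, both data bricks end up blamed and only the arbiter is left unblamed.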

Comment 20 Karthik U S 2017-08-30 05:12:24 UTC

Upstream patch: https://review.gluster.org/#/c/18049/

Comment 26 Karthik U S 2017-12-26 09:08:12 UTC

Upstream patch: https://review.gluster.org/#/c/19045/

Comment 30 SATHEESARAN 2018-08-27 09:52:38 UTC

Tested with RHGS 3.4.0 nightly build - glusterfs-3.12.2-16.el7rhgs with the following steps:

1. Created a 1x(2+1) arbitrated replicate volume and used it as a storage domain in RHV.
2. Created a few VMs with their boot disks on this domain.
3. Ran some I/O inside the VMs.
4. Killed the first brick and waited for 10 minutes.
5. Brought the brick back up and waited until self-heal was complete.
6. Repeated steps 4 and 5 for the second and third bricks.
7. Repeated steps 4, 5, and 6 for 100 iterations.

Everything worked well. The arbiter never became the source of heal.

Comment 31 Srijita Mukherjee 2018-09-03 15:38:28 UTC

Have updated the doc text. Kindly review and confirm

Comment 32 Karthik U S 2018-09-03 16:02:47 UTC

Made a small change. Rest looks good to me.

Comment 33 Srijita Mukherjee 2018-09-03 16:08:48 UTC

Have updated the doc text. Kindly review and confirm

Comment 34 Karthik U S 2018-09-03 16:16:38 UTC

In the last sentence, please change "arbiter bricks are not considered as source" to "arbiter brick will not be marked as source", because whether a brick is a source or a sink is decided from the pending changelogs set on the file. With this fix, we no longer allow the data pending part of the pending changelog xattrs to be set if it is an arbiter brick, which was happening before. Considering a brick as source and marking it as source are two different things.

Comment 35 errata-xmlrpc 2018-09-04 06:29:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607