+++ This bug was initially created as a clone of Bug #1333406 +++

Description of problem:
=====================
After bringing bricks down and back up, VMs are getting paused.

Version-Release number of selected component (if applicable):
=============
glusterfs-server-3.7.9-2.el7rhgs.x86_64

How reproducible:

Steps to Reproduce:
=====================
1. Create a 1x3 volume and host a few VMs on the gluster volumes
2. Log in to the VMs and run a script to populate data (using dd)
3. While IO is going on, bring down one of the bricks; after some time bring it back up and bring down another brick
4. After some time, bring up the downed brick and bring down another brick. During this brick down/up cycle, a few VMs were observed to get paused

Actual results:
==================
Virtual machines are getting paused

Expected results:
=================
VMs should not be paused

Additional info:
===================
[root@zod ~]# gluster vol info

Volume Name: data
Type: Replicate
Volume ID: 5021c1f8-0b2f-4b34-92ea-a087afe84ce3
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/rhgs/data/data-brick1
Brick2: server2:/rhgs/data/data-brick2
Brick3: server3:/rhgs/data/data-brick3
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
nfs.disable: on
cluster.shd-max-threads: 16

Volume Name: engine
Type: Replicate
Volume ID: 5e14889a-0ffc-415f-8fbd-259451972c46
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/rhgs/engine/engine-brick1
Brick2: server2:/rhgs/engine/engine-brick2
Brick3: server3:/rhgs/engine/engine-brick3
Options Reconfigured:
cluster.shd-max-threads: 16
nfs.disable: on
cluster.data-self-heal-algorithm: full
performance.low-prio-threads: 32
features.shard-block-size: 512MB
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on

Volume Name: vmstore
Type: Replicate
Volume ID: edd3e117-138e-437b-9e65-319084fecc4b
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: server1:/rhgs/vmstore/vmstore-brick1
Brick2: server2:/rhgs/vmstore/vmstore-brick2
Brick3: server3:/rhgs/vmstore/vmstore-brick3
Options Reconfigured:
cluster.shd-max-threads: 16
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
nfs.disable: on
[root@zod ~]#

--- Additional comment from Sahina Bose on 2016-05-19 05:42:59 EDT ---

This bug is related to the cyclic network outage test causing a file to end up in split-brain, and it is not a very likely scenario.
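For reference, below is a minimal CLI sketch of the cyclic brick down/up cycle described in the steps above, assuming the 'data' volume and the server1..server3 brick paths from the vol info output. The sleep intervals and the in-guest dd loop are placeholders, not the exact values used in the original test.

#!/bin/bash
# Hedged sketch of the cyclic brick down/up test, assuming the 'data' volume
# and the brick layout shown above. Run from any node in the trusted pool.

VOL=data
BRICKS=("server1:/rhgs/data/data-brick1"
        "server2:/rhgs/data/data-brick2"
        "server3:/rhgs/data/data-brick3")

# Inside each guest, something like this keeps writes going on the VM image:
#   while true; do dd if=/dev/urandom of=/var/tmp/fill bs=1M count=1024 oflag=direct; done

for BRICK in "${BRICKS[@]}"; do
    HOST=${BRICK%%:*}     # e.g. server1
    BPATH=${BRICK##*:}    # e.g. /rhgs/data/data-brick1

    # Bring the brick down by killing its glusterfsd process on that node.
    ssh "$HOST" "pkill -f \"glusterfsd.*${BPATH}\""

    sleep 120             # let IO continue with only two replicas online

    # Bring the brick back; 'start force' respawns any missing brick processes.
    gluster volume start "$VOL" force

    sleep 120             # give self-heal a head start before the next brick goes down
done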
--- Additional comment from Krutika Dhananjay on 2016-07-18 01:39:27 EDT ---

(In reply to RajeshReddy from comment #0)
> Description of problem:
> =====================
> After bringing bricks down and back up, VMs are getting paused.
>
> Version-Release number of selected component (if applicable):
> =============
> glusterfs-server-3.7.9-2.el7rhgs.x86_64
>
> How reproducible:
>
> Steps to Reproduce:
> =====================
> 1. Create a 1x3 volume and host a few VMs on the gluster volumes
> 2. Log in to the VMs and run a script to populate data (using dd)
> 3. While IO is going on, bring down one of the bricks; after some time
> bring it back up and bring down another brick
> 4. After some time, bring up the downed brick and bring down another brick.
> During this brick down/up cycle, a few VMs were observed to get paused
>
> Actual results:
> ==================
> Virtual machines are getting paused
>
> Expected results:
> =================
> VMs should not be paused

Just wondering whether it is possible at all to keep the VM from pausing in this scenario. The best we can do is prevent the shard/VM image from going into split-brain when bricks are brought offline and back online in cyclic order, which means the VM(s) will _still_ pause (with EROFS?) at some point; only this time, after the particular file/shard is healed, IO may be resumed from inside the VM without requiring manual intervention to fix the split-brain.

@Pranith: Are the above statements correct? Or is there a way to actually keep the VM from pausing?

-Krutika

--- Additional comment from Pranith Kumar K on 2016-07-18 06:14:41 EDT ---

You are correct: we can't prevent VMs from getting paused. We only need to make sure that split-brains won't happen. Please note that this case may lead to the VM image going extremely bad, but all we can guarantee is that the file does not go into split-brain.
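As a point of reference for the discussion above, a quick way to confirm whether any image or shard actually ended up in split-brain after such a cycle is the standard gluster heal CLI, run here against the three volume names from this report:

# Check heal backlog and split-brain state for the VM-hosting volumes.
for vol in data engine vmstore; do
    echo "== $vol =="
    gluster volume heal "$vol" info              # entries still pending heal
    gluster volume heal "$vol" info split-brain  # entries in split-brain, if any
done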
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought on and off in cyclic order) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#3) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#4) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#5) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#6) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15145 (cluster/afr: Bug fixes in txn codepath) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15145 (cluster/afr: Bug fixes in txn codepath) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: http://review.gluster.org/15145 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit 79b9ad3dfa146ef29ac99bf87d1c31f5a6fe1fef
Author: Krutika Dhananjay <kdhananj>
Date:   Fri Aug 5 12:18:05 2016 +0530

    cluster/afr: Bug fixes in txn codepath

    AFR sets the transaction.pre_op[] array even before actually doing the
    pre-op on-disk. Therefore, AFR must consider not only the pre_op[] array
    but also the failed_subvols[] information before setting the
    pre_op_done[] flag. This patch fixes that.

    Change-Id: I78ccd39106bd4959441821355a82572659e3affb
    BUG: 1363721
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/15145
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Ravishankar N <ravishankar>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Reviewed-by: Anuradha Talur <atalur>
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
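For context on what the pre-op bookkeeping in the commit above controls on disk: the AFR pre-op marks the replica's changelog/dirty xattrs on each brick before a write goes through, and those markers can be inspected directly on the brick backend. A hedged example follows, using the data brick path from this report; the shard name under .shard is a made-up placeholder.

# Inspect the AFR changelog xattrs that the pre-op/post-op steps maintain,
# directly on a brick backend (the shard GFID/index below is hypothetical).
# trusted.afr.dirty and trusted.afr.<vol>-client-N are the markers involved;
# non-zero values indicate pending heal or blame against a replica.
getfattr -d -m . -e hex \
    /rhgs/data/data-brick1/.shard/0f2b1c34-aaaa-bbbb-cccc-123456789abc.7

# Illustrative output fields:
#   trusted.afr.dirty=0x000000010000000000000000
#   trusted.afr.data-client-2=0x000000020000000000000000
#   trusted.gfid=0x...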
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#7) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#8) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#9) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#10) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#11) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#12) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#13) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#14) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: http://review.gluster.org/15080 (cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order) posted (#15) for review on master by Pranith Kumar Karampuri (pkarampu)
COMMIT: http://review.gluster.org/15080 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit fcb5b70b1099d0379b40c81f35750df8bb9545a5
Author: Krutika Dhananjay <kdhananj>
Date:   Thu Jul 28 21:29:59 2016 +0530

    cluster/afr: Prevent split-brain when bricks are brought off and on in cyclic order

    When the bricks are brought offline and then online in cyclic order while
    writes are in progress on a file, thanks to inode refresh in write txns,
    AFR will mostly fail the write attempt when the only good copy is offline.
    However, there is still a remote possibility that the file will run into
    split-brain if the brick that has the lone good copy goes offline *after*
    the inode refresh but *before* the write txn completes (I call it in-flight
    split-brain in the patch for ease of reference), requiring intervention
    from the admin to resolve the split-brain before IO can resume normally
    on the file.

    To get around this, the patch does the following things:
    i)   retains the dirty xattrs on the file
    ii)  avoids marking the last of the good copies as bad (or accused) in
         case it is the one to go down during the course of a write
    iii) fails that particular write with the appropriate errno

    This way, we still have one good copy left despite the split-brain
    situation, which, when it is back online, will be chosen as the source
    to do the heal.

    Change-Id: I9ca634b026ac830b172bac076437cc3bf1ae7d8a
    BUG: 1363721
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/15080
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Ravishankar N <ravishankar>
    Reviewed-by: Oleksandr Natalenko <oleksandr>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
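A hedged verification sketch for the behaviour the commit above describes, using the 'data' volume from this report; the libvirt domain name is a placeholder, and in an oVirt/RHV setup the engine may resume the guest automatically instead.

# 1. After the cyclic brick outage, the affected write fails and the VM pauses,
#    but no file or shard should be reported as being in split-brain:
gluster volume heal data info split-brain      # expect zero entries per brick

# 2. With all bricks back online, wait for the pending heals to drain; the
#    retained dirty xattrs let self-heal pick the surviving good copy as source:
watch -n 10 'gluster volume heal data info | grep "Number of entries"'

# 3. Once healing is done, the paused guest can be resumed without any manual
#    split-brain resolution (domain name 'vm01' is hypothetical):
virsh resume vm01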
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/