Bug 1333406

Summary: [HC]: After bringing the bricks down and up, VMs are getting paused
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: RajeshReddy <rmekala>
Component: replicate
Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: high
Priority: high
Version: rhgs-3.1
CC: amukherj, mzywusko, pkarampu, rcyriac, rhinduja, rhs-bugs, sabose, sasundar
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8.4-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1363721 (view as bug list)
Environment:
Last Closed: 2017-03-23 05:29:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1277939, 1351522, 1363721, 1367270, 1367272

Description RajeshReddy 2016-05-05 12:32:06 UTC
Description of problem:
=====================
After bringing the bricks down and back up, VMs are getting paused

Version-Release number of selected component (if applicable):
=============
glusterfs-server-3.7.9-2.el7rhgs.x86_64

How reproducible:


Steps to Reproduce:
=====================
1. Create a 1x3 replicate volume and host a few VMs on the gluster volume
2. Log in to the VMs and run a script to populate data (using dd)
3. While I/O is going on, bring down one of the bricks; after some time, bring that brick back up and bring down another brick
4. After some time, bring the downed brick back up and bring down yet another brick. During this cycle of bringing bricks down and up, a few VMs were observed to get paused (a sketch of one down/up cycle follows this list)
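The exact commands used are not recorded in this report; the following is a minimal sketch of one brick down/up cycle under I/O, assuming the "data" volume and hosts shown under "Additional info" below. The brick PID placeholder is hypothetical and would come from the status output.

# Inside a VM: keep continuous write I/O running on its disk
dd if=/dev/urandom of=/var/tmp/fill.img bs=1M count=4096 oflag=direct

# On one of the gluster nodes: find the brick process and kill it to simulate a brick going down
gluster volume status data
kill -9 <brick-pid-from-status-output>

# Bring the brick back up; "start force" restarts only the offline brick process
gluster volume start data force

# Check self-heal progress before taking down the next brick
gluster volume heal data info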

Actual results:
==================
Virtual machines are getting paused 


Expected results:
=================
VMs should not be paused

Additional info:
===================
[root@zod ~]# gluster vol info
 
Volume Name: data
Type: Replicate
Volume ID: 5021c1f8-0b2f-4b34-92ea-a087afe84ce3
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/data/data-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/data/data-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/data/data-brick3
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
nfs.disable: on
cluster.shd-max-threads: 16
 
Volume Name: engine
Type: Replicate
Volume ID: 5e14889a-0ffc-415f-8fbd-259451972c46
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick3
Options Reconfigured:
cluster.shd-max-threads: 16
nfs.disable: on
cluster.data-self-heal-algorithm: full
performance.low-prio-threads: 32
features.shard-block-size: 512MB
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on
 
Volume Name: vmstore
Type: Replicate
Volume ID: edd3e117-138e-437b-9e65-319084fecc4b
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick3
Options Reconfigured:
cluster.shd-max-threads: 16
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
nfs.disable: on
[root@zod ~]#
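For reference, options like those listed above are typically applied with "gluster volume set". The sketch below is not taken from this report; it shows how a comparable profile could be applied to the data volume, using the values from the output above. The "virt" group option sets several of the performance/quorum options in one shot.

# Sketch only: apply a virt-store profile and the volume-specific values shown above
gluster volume set data group virt
gluster volume set data features.shard on
gluster volume set data features.shard-block-size 512MB
gluster volume set data storage.owner-uid 36
gluster volume set data storage.owner-gid 36
gluster volume set data cluster.shd-max-threads 16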

Comment 2 RajeshReddy 2016-05-05 13:34:48 UTC
sosreports are available @ rhsqe-repo.lab.eng.blr.redhat.com:/home/repo/sosreports/bug.1333406

Comment 3 Sahina Bose 2016-05-19 09:42:59 UTC
This bug is related to a cyclic network outage test causing a file to end up in split-brain. As this is not a likely scenario, removing it from the 3.1.3 target.

Comment 6 Pranith Kumar K 2016-07-18 10:14:41 UTC
You are correct; we can't prevent the VMs from getting paused. We only need to make sure that split-brains won't happen. Please note that this case may leave the VM image in a very bad state, but all we can guarantee is that the file does not go into split-brain.
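For context, whether any file (for example a VM image shard) has ended up in split-brain can be checked per volume; a minimal example, assuming the "data" volume from this setup:

gluster volume heal data info split-brain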

Comment 7 Atin Mukherjee 2016-08-09 04:24:57 UTC
Upstream mainline patch http://review.gluster.org/15080 posted for review.

Comment 9 Atin Mukherjee 2016-09-17 14:47:10 UTC
Upstream mainline : http://review.gluster.org/15080
                    http://review.gluster.org/15145

Upstream 3.8 : http://review.gluster.org/15221
               http://review.gluster.org/15164
               

And the fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.

Comment 12 SATHEESARAN 2017-01-31 06:59:42 UTC
Tested with RHGS 3.2.0 interim build (glusterfs-3.8.4-12.el7rhgs) with the following steps:

1. Created a replica 3 volume and used it as a data domain in RHV
2. While continuous I/O was happening on the VMs, killed the first brick
3. After some time brought the down brick back up, and a few minutes later killed the second brick
4. After some time brought the down brick back up, and a few minutes later killed the third brick
5. After some time brought the down brick back up, and a few minutes later killed the first brick again

After all these steps, I haven't seen any hiccups with the VMs; the VMs are healthy post reboot and there are no problems. (A sketch of the checks typically run between each bring-up and the next kill follows below.)
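Not part of the original comment: a minimal sketch of the per-cycle check between bringing a killed brick back and killing the next one, assuming the same "data" volume:

gluster volume start data force     # restart only the brick process that was killed
gluster volume heal data info       # wait until "Number of entries: 0" is reported for all bricks
gluster volume status data          # confirm every brick is online before the next kill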

Comment 14 errata-xmlrpc 2017-03-23 05:29:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html