Description of problem:
=====================
After bringing bricks down and back up, VMs are getting paused.

Version-Release number of selected component (if applicable):
=============
glusterfs-server-3.7.9-2.el7rhgs.x86_64

How reproducible:

Steps to Reproduce:
=====================
1. Create a 1x3 volume and host a few VMs on the gluster volumes
2. Log in to the VMs and run a script to populate data (using dd)
3. While I/O is going on, bring down one of the bricks; after some time bring that brick back up and bring down another brick
4. After some time, bring up the down brick and bring down another brick

During this brick down/up cycle, a few VMs were observed to get paused (a sketch of the brick down/up commands is given after the volume info below).

Actual results:
==================
Virtual machines are getting paused.

Expected results:
=================
VMs should not be paused.

Additional info:
===================
[root@zod ~]# gluster vol info

Volume Name: data
Type: Replicate
Volume ID: 5021c1f8-0b2f-4b34-92ea-a087afe84ce3
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/data/data-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/data/data-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/data/data-brick3
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
nfs.disable: on
cluster.shd-max-threads: 16

Volume Name: engine
Type: Replicate
Volume ID: 5e14889a-0ffc-415f-8fbd-259451972c46
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick3
Options Reconfigured:
cluster.shd-max-threads: 16
nfs.disable: on
cluster.data-self-heal-algorithm: full
performance.low-prio-threads: 32
features.shard-block-size: 512MB
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on

Volume Name: vmstore
Type: Replicate
Volume ID: edd3e117-138e-437b-9e65-319084fecc4b
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick3
Options Reconfigured:
cluster.shd-max-threads: 16
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
nfs.disable: on
[root@zod ~]#
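For reference, the brick down/up cycle in steps 3 and 4 was driven from the server side roughly as sketched below. This is an illustrative sketch against the 'data' volume shown above; <pid-of-data-brick> is a placeholder, not a value from the actual run.

# list brick processes and their PIDs, then kill one brick process to simulate the brick going down
gluster volume status data
kill -KILL <pid-of-data-brick>

# later, bring the downed brick back up (restarts any offline brick processes for the volume)
gluster volume start data force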
sosreports are available @ rhsqe-repo.lab.eng.blr.redhat.com:/home/repo/sosreports/bug.1333406
This bug is related to a cyclic network outage test causing a file to end up in split-brain. As this is not a likely scenario, removing it from the 3.1.3 target.
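For context, whether any file on a replica volume has actually gone into split-brain can be checked with the heal command below; the volume name 'data' is taken from the volume info in this report.

# list files currently in split-brain on the replica volume
gluster volume heal data info split-brain

Any entries listed here will keep returning I/O errors until the split-brain is resolved.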
You are correct, we can't prevent the VMs from getting paused. We only need to make sure that split-brains won't happen. Please note that this case may still leave the VM image in a very bad state, but all we can guarantee is that the file does not go into split-brain.
Upstream mainline patch http://review.gluster.org/15080 posted for review.
Upstream mainline:
http://review.gluster.org/15080
http://review.gluster.org/15145

Upstream 3.8:
http://review.gluster.org/15221
http://review.gluster.org/15164

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
Tested with RHGS 3.2.0 interim build (glusterfs-3.8.4-12.el7rhgs) with the following steps:
1. Created a replica 3 volume and used it as a data domain in RHV
2. While continuous I/O was happening on the VMs, killed the first brick
3. After some time, brought up the down brick and, a few minutes later, killed the second brick
4. After some time, brought up the down brick and, another few minutes later, killed the third brick
5. After some time, brought up the down brick and, another few minutes later, killed the first brick

After all these steps I did not see any hiccups with the VMs; the VMs were healthy post reboot and there were no problems.
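As an additional sanity check between the brick-kill cycles above (an assumption on my part, not something spelled out in the verification comment), pending self-heals can be confirmed to have drained before killing the next brick:

# should eventually report 'Number of entries: 0' for each brick once heals complete
gluster volume heal data info

# and confirm no files are stuck in split-brain
gluster volume heal data info split-brain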
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html