Bug 1333406

Summary: [HC]: After bringing the bricks down and up, VMs are getting paused
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: RajeshReddy <rmekala>
Component: replicate
Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: high
Priority: high
Version: rhgs-3.1
CC: amukherj, mzywusko, pkarampu, rcyriac, rhinduja, rhs-bugs, sabose, sasundar
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8.4-1
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1363721 (view as bug list)
Environment:
Last Closed: 2017-03-23 05:29:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1277939, 1351522, 1363721, 1367270, 1367272

Description RajeshReddy 2016-05-05 12:32:06 UTC
Description of problem:
=====================
After bringing the bricks down and back up, VMs are getting paused

Version-Release number of selected component (if applicable):
=============
glusterfs-server-3.7.9-2.el7rhgs.x86_64

How reproducible:


Steps to Reproduce:
=====================
1. Create a 1x3 replicate volume and host a few VMs on the gluster volume
2. Log in to the VMs and run a script to populate data (using dd)
3. While I/O is going on, bring down one of the bricks; after some time, bring that brick back up and bring down another brick
4. After some time, bring the downed brick back up and bring down yet another brick. During this cycle of bringing bricks down and up, a few VMs were observed to get paused (a sketch of one down/up cycle follows this list)
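The exact commands used are not recorded in this report; the following is a minimal sketch of one brick down/up cycle under I/O, assuming the "data" volume and hosts shown under "Additional info" below. The brick PID placeholder is hypothetical and would come from the status output.

# Inside a VM: keep continuous write I/O running on its disk
dd if=/dev/urandom of=/var/tmp/fill.img bs=1M count=4096 oflag=direct

# On one of the gluster nodes: find the brick process and kill it to simulate a brick going down
gluster volume status data
kill -9 <brick-pid-from-status-output>

# Bring the brick back up; "start force" restarts only the offline brick process
gluster volume start data force

# Check self-heal progress before taking down the next brick
gluster volume heal data info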

Actual results:
==================
Virtual machines are getting paused 


Expected results:
=================
VMs should not be paused

Additional info:
===================
[root@zod ~]# gluster vol info
 
Volume Name: data
Type: Replicate
Volume ID: 5021c1f8-0b2f-4b34-92ea-a087afe84ce3
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/data/data-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/data/data-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/data/data-brick3
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
nfs.disable: on
cluster.shd-max-threads: 16
 
Volume Name: engine
Type: Replicate
Volume ID: 5e14889a-0ffc-415f-8fbd-259451972c46
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick3
Options Reconfigured:
cluster.shd-max-threads: 16
nfs.disable: on
cluster.data-self-heal-algorithm: full
performance.low-prio-threads: 32
features.shard-block-size: 512MB
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on
 
Volume Name: vmstore
Type: Replicate
Volume ID: edd3e117-138e-437b-9e65-319084fecc4b
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick3
Options Reconfigured:
cluster.shd-max-threads: 16
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
nfs.disable: on
[root@zod ~]#
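For reference, options like those listed above are typically applied with "gluster volume set". The sketch below is not taken from this report; it shows how a comparable profile could be applied to the data volume, using the values from the output above. The "virt" group option sets several of the performance/quorum options in one shot.

# Sketch only: apply a virt-store profile and the volume-specific values shown above
gluster volume set data group virt
gluster volume set data features.shard on
gluster volume set data features.shard-block-size 512MB
gluster volume set data storage.owner-uid 36
gluster volume set data storage.owner-gid 36
gluster volume set data cluster.shd-max-threads 16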

Comment 2 RajeshReddy 2016-05-05 13:34:48 UTC
sosreports are available @ rhsqe-repo.lab.eng.blr.redhat.com:/home/repo/sosreports/bug.1333406

Comment 3 Sahina Bose 2016-05-19 09:42:59 UTC
This bug is related to a cyclic network outage test causing a file to end up in split-brain. As this is not a likely scenario, removing it from the 3.1.3 target.

Comment 6 Pranith Kumar K 2016-07-18 10:14:41 UTC
You are correct; we can't prevent the VMs from getting paused. We only need to make sure that split-brains won't happen. Please note that this case may leave the VM image in a very bad state, but all we can guarantee is that the file does not go into split-brain.
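For context, whether any file (for example a VM image shard) has ended up in split-brain can be checked per volume; a minimal example, assuming the "data" volume from this setup:

gluster volume heal data info split-brain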

Comment 7 Atin Mukherjee 2016-08-09 04:24:57 UTC
Upstream mainline patch http://review.gluster.org/15080 posted for review.

Comment 9 Atin Mukherjee 2016-09-17 14:47:10 UTC
Upstream mainline : http://review.gluster.org/15080
                    http://review.gluster.org/15145

Upstream 3.8 : http://review.gluster.org/15221
               http://review.gluster.org/15164
               

And the fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.

Comment 12 SATHEESARAN 2017-01-31 06:59:42 UTC
Tested with RHGS 3.2.0 interim build (glusterfs-3.8.4-12.el7rhgs) with the following steps:

1. Created a replica 3 volume and used it as a data domain in RHV
2. While continuous I/O was happening on the VMs, killed the first brick
3. After some time brought the down brick back up, and a few minutes later killed the second brick
4. After some time brought the down brick back up, and a few minutes later killed the third brick
5. After some time brought the down brick back up, and a few minutes later killed the first brick again

After all these steps, I haven't seen any hiccups with the VMs; the VMs are healthy post reboot and there are no problems. (A sketch of the checks typically run between each bring-up and the next kill follows below.)
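Not part of the original comment: a minimal sketch of the per-cycle check between bringing a killed brick back and killing the next one, assuming the same "data" volume:

gluster volume start data force     # restart only the brick process that was killed
gluster volume heal data info       # wait until "Number of entries: 0" is reported for all bricks
gluster volume status data          # confirm every brick is online before the next kill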

Comment 14 errata-xmlrpc 2017-03-23 05:29:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html