Description of problem:
=====================
After bringing bricks down and back up, VMs are getting paused.

Version-Release number of selected component (if applicable):
=============
glusterfs-server-3.7.9-2.el7rhgs.x86_64

How reproducible:

Steps to Reproduce:
=====================
1. Create a 1x3 volume and host a few VMs on the gluster volumes
2. Log in to the VMs and run a script to populate data (using dd)
3. While I/O is going on, bring down one of the bricks; after some time bring that brick back up and bring down another brick
4. After some time, bring up the down brick and bring down another brick

During this brick down/up cycle, a few VMs were observed to get paused (a sketch of the brick down/up commands is given after the volume info below).

Actual results:
==================
Virtual machines are getting paused.

Expected results:
=================
VMs should not be paused.

Additional info:
===================
[root@zod ~]# gluster vol info

Volume Name: data
Type: Replicate
Volume ID: 5021c1f8-0b2f-4b34-92ea-a087afe84ce3
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/data/data-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/data/data-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/data/data-brick3
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
nfs.disable: on
cluster.shd-max-threads: 16

Volume Name: engine
Type: Replicate
Volume ID: 5e14889a-0ffc-415f-8fbd-259451972c46
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/engine/engine-brick3
Options Reconfigured:
cluster.shd-max-threads: 16
nfs.disable: on
cluster.data-self-heal-algorithm: full
performance.low-prio-threads: 32
features.shard-block-size: 512MB
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on

Volume Name: vmstore
Type: Replicate
Volume ID: edd3e117-138e-437b-9e65-319084fecc4b
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sulphur.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick1
Brick2: tettnang.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick2
Brick3: zod.lab.eng.blr.redhat.com:/rhgs/vmstore/vmstore-brick3
Options Reconfigured:
cluster.shd-max-threads: 16
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
nfs.disable: on
[root@zod ~]#
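For reference, the brick down/up cycle in steps 3 and 4 was driven from the server side roughly as sketched below. This is an illustrative sketch against the 'data' volume shown above; <pid-of-data-brick> is a placeholder, not a value from the actual run.

# list brick processes and their PIDs, then kill one brick process to simulate the brick going down
gluster volume status data
kill -KILL <pid-of-data-brick>

# later, bring the downed brick back up (restarts any offline brick processes for the volume)
gluster volume start data force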
sosreports are available @ rhsqe-repo.lab.eng.blr.redhat.com:/home/repo/sosreports/bug.1333406
This bug is related to a cyclic network outage test causing a file to end up in split-brain. As this is not a likely scenario, removing it from the 3.1.3 target.
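For context, whether any file on a replica volume has actually gone into split-brain can be checked with the heal command below; the volume name 'data' is taken from the volume info in this report.

# list files currently in split-brain on the replica volume
gluster volume heal data info split-brain

Any entries listed here will keep returning I/O errors until the split-brain is resolved.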
You are correct, we can't prevent the VMs from getting paused. We only need to make sure that split-brains won't happen. Please note that this case may still leave the VM image in a very bad state, but all we can guarantee is that the file does not go into split-brain.
Upstream mainline patch http://review.gluster.org/15080 posted for review.
Upstream mainline:
http://review.gluster.org/15080
http://review.gluster.org/15145

Upstream 3.8:
http://review.gluster.org/15221
http://review.gluster.org/15164

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
Tested with RHGS 3.2.0 interim build (glusterfs-3.8.4-12.el7rhgs) with the following steps:
1. Created a replica 3 volume and used it as a data domain in RHV
2. While continuous I/O was happening on the VMs, killed the first brick
3. After some time, brought up the down brick and, a few minutes later, killed the second brick
4. After some time, brought up the down brick and, another few minutes later, killed the third brick
5. After some time, brought up the down brick and, another few minutes later, killed the first brick

After all these steps I did not see any hiccups with the VMs; the VMs were healthy post reboot and there were no problems.
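As an additional sanity check between the brick-kill cycles above (an assumption on my part, not something spelled out in the verification comment), pending self-heals can be confirmed to have drained before killing the next brick:

# should eventually report 'Number of entries: 0' for each brick once heals complete
gluster volume heal data info

# and confirm no files are stuck in split-brain
gluster volume heal data info split-brain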
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html