Bug 1405302 - vm does not boot up when first data brick in the arbiter volume is killed.
Summary: vm does not boot up when first data brick in the arbiter volume is killed.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: arbiter
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Ravishankar N
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-2 1351528
 
Reported: 2016-12-16 07:05 UTC by RamaKasturi
Modified: 2017-03-23 05:58 UTC
CC: 7 users

Fixed In Version: glusterfs-3.8.4-10
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-23 05:58:12 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Description RamaKasturi 2016-12-16 07:05:53 UTC
Description of problem:
Bring down the first data brick in the arbiter volume and create a VM. Once the VM installation finishes, power off the VM. Bring the first brick back up and start the VM. The VM does not boot and drops to the grub prompt.

As suggested by Vijay, I also tried starting the VM without bringing the first brick back up, and in that case the VM boots without any issues.

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-8.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 1 x (2 + 1) arbiter volume.
2. Now bring down the first data brick in the volume.
3. Create a vm.
4. Once the VM installation finishes, power off the VM.
5. Bring the first brick (which was down) back up and start the VM.
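
The steps above can be sketched with the gluster CLI. This is a hedged sketch, not the exact commands used: hostnames and brick paths follow the vmstore layout shown in comment 2, and <brick-pid> is a placeholder for the PID reported by `gluster volume status`.

```
# 1. Create a 1 x (2 + 1) arbiter volume and start it.
gluster volume create vmstore replica 3 arbiter 1 \
    10.70.36.82:/rhgs/brick2/vmstore \
    10.70.36.83:/rhgs/brick2/vmstore \
    10.70.36.84:/rhgs/brick2/vmstore
gluster volume start vmstore

# 2. Bring down the first data brick: note its PID and kill the process.
gluster volume status vmstore        # note the PID of Brick1
kill <brick-pid>

# 3-4. Create the VM on the volume; once installation finishes, power it off.

# 5. Bring the downed brick back up ("start force" restarts only
#    offline bricks), then start the VM.
gluster volume start vmstore force
```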

Actual results:
The VM does not boot and drops to the grub prompt.

Expected results:
The VM should boot without any issues.

Additional info:

Comment 2 RamaKasturi 2016-12-16 07:16:16 UTC
gluster volume info on vmstore:
====================================
[root@rhsqa-grafton4 ~]# gluster volume info vmstore
 
Volume Name: vmstore
Type: Replicate
Volume ID: 3d67c0ad-5084-4190-a4b5-c468994ca084
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.82:/rhgs/brick2/vmstore
Brick2: 10.70.36.83:/rhgs/brick2/vmstore
Brick3: 10.70.36.84:/rhgs/brick2/vmstore (arbiter)
Options Reconfigured:
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
network.ping-timeout: 30
user.cifs: off
performance.strict-o-direct: on
client.ssl: on
server.ssl: on
auth.ssl-allow: 10.70.36.84,10.70.36.82,10.70.36.83
cluster.granular-entry-heal: enable
cluster.use-compound-fops: on

Comment 3 RamaKasturi 2016-12-16 07:16:40 UTC
sosreports can be found at the link below:
=============================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1405302/

Comment 6 Ravishankar N 2016-12-20 06:54:24 UTC
From the errors in rhev-data-center-mnt-glusterSD-10.70.36.82\:_vmstore.log on grafton5, it looks like the same problem as BZ 1404982 (comment #5). I am providing a test build to Kasturi with the same fix ("protocol/client: fix op_errno handling, was unused variable") applied on top of the latest downstream code (HEAD @ tag: v3.8.4-9, origin/rhgs-3.2.0, rhgs-3.2.0) to see if it fixes the issue.

Comment 7 Atin Mukherjee 2016-12-20 09:52:30 UTC
upstream mainline patch http://review.gluster.org/#/c/16205/ posted for review.

Comment 8 RamaKasturi 2016-12-20 14:03:10 UTC
Hi Ravi,

   Tested this issue with glusterfs-3.8.4-5.el7rhgs.x86_64. I tried the steps mentioned in the description three times, but I was not able to hit the issue.

Thanks
kasturi

Comment 9 Ravishankar N 2016-12-20 15:45:29 UTC
(In reply to RamaKasturi from comment #8)
> Hi Ravi,
> 
>    Tested this issue with glusterfs-3.8.4-5.el7rhgs.x86_64. I tried the
> steps mentioned in the description thrice. But i was not able to hit the
> issue.
> 
> Thanks
> kasturi

Thanks Kasturi. If we are able to hit the issue with glusterfs-3.8.4-6, then we have a reasonably small number of commits over which to run a git bisect and find the offending one. Please give it a try on v3.8.4-6 as well.
Thanks!
Ravi

Comment 10 Ravishankar N 2016-12-22 06:26:16 UTC
Just for the record: after comment #9, Kasturi tried a couple of test builds (thanks a lot, Kasturi!) and we were not able to hit the issue with some modifications made to the original patch posted in comment #7.

Comment 11 Ravishankar N 2016-12-22 06:27:49 UTC
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/93560/

Comment 13 RamaKasturi 2016-12-29 07:23:55 UTC
Will verify this bug once the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1400057 lands.

Comment 14 RamaKasturi 2017-01-13 09:50:41 UTC
Verified, and it works fine with build glusterfs-3.8.4-11.el7rhgs.x86_64.

Brought the first brick down in the volume, created a VM, and installed the OS. Once the VM was installed, I powered it off, brought the first brick back up, and the VM booted successfully.

Comment 15 RamaKasturi 2017-01-13 09:51:13 UTC
Moving this to the verified state.

Comment 16 RamaKasturi 2017-01-13 09:52:25 UTC
[root@rhsqa-grafton4 ~]# gluster volume info vmstore
 
Volume Name: vmstore
Type: Replicate
Volume ID: 2f8938c2-26d3-4912-a6e0-bc12b76146d0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.82:/rhgs/brick1/vmstore
Brick2: 10.70.36.83:/rhgs/brick1/vmstore
Brick3: 10.70.36.84:/rhgs/brick1/vmstore (arbiter)
Options Reconfigured:
auth.ssl-allow: 10.70.36.84,10.70.36.82,10.70.36.83
server.ssl: on
client.ssl: on
cluster.granular-entry-heal: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 4MB
storage.owner-gid: 36
storage.owner-uid: 36
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

Comment 18 errata-xmlrpc 2017-03-23 05:58:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

