Bug 1405302
| Summary: | vm does not boot up when first data brick in the arbiter volume is killed. | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | RamaKasturi <knarra> |
| Component: | arbiter | Assignee: | Ravishankar N <ravishankar> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.2 | CC: | amukherj, pkarampu, rcyriac, rhinduja, rhs-bugs, sasundar, storage-qa-internal |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.2.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.8.4-10 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-03-23 05:58:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1277939, 1351528 | | |
Description
RamaKasturi, 2016-12-16 07:05:53 UTC
gluster volume info on vmstore:
===============================

[root@rhsqa-grafton4 ~]# gluster volume info vmstore
Volume Name: vmstore
Type: Replicate
Volume ID: 3d67c0ad-5084-4190-a4b5-c468994ca084
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.82:/rhgs/brick2/vmstore
Brick2: 10.70.36.83:/rhgs/brick2/vmstore
Brick3: 10.70.36.84:/rhgs/brick2/vmstore (arbiter)
Options Reconfigured:
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
network.ping-timeout: 30
user.cifs: off
performance.strict-o-direct: on
client.ssl: on
server.ssl: on
auth.ssl-allow: 10.70.36.84,10.70.36.82,10.70.36.83
cluster.granular-entry-heal: enable
cluster.use-compound-fops: on

sosreports can be found in the link below:
==========================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1405302/

From the errors in rhev-data-center-mnt-glusterSD-10.70.36.82\:_vmstore.log on grafton5, this looks like the same problem as BZ 1404982 (comment #5). I am providing a test build to Kasturi with the same fix ("protocol/client: fix op_errno handling, was unused variable") on top of the latest downstream code (HEAD @ tag: v3.8.4-9, origin/rhgs-3.2.0, rhgs-3.2.0) to see if it fixes the issue.

Upstream mainline patch http://review.gluster.org/#/c/16205/ posted for review.

Hi Ravi,

Tested this issue with glusterfs-3.8.4-5.el7rhgs.x86_64. I tried the steps mentioned in the description three times, but I was not able to hit the issue.
Thanks,
kasturi

(In reply to RamaKasturi from comment #8)
> Hi Ravi,
>
> Tested this issue with glusterfs-3.8.4-5.el7rhgs.x86_64. I tried the
> steps mentioned in the description thrice. But i was not able to hit the
> issue.
>
> Thanks
> kasturi

Thanks Kasturi. If we are able to hit the issue with glusterfs-3.8.4-6, then we have a reasonably small number of commits over which to run a git bisect and find the offending one. Please give it a try on v3.8.4-6 as well. Thanks!
Ravi

Just for the record: after comment #9, Kasturi tried a couple of test builds (thanks a lot, Kasturi!) and we were not able to hit the issue with some modifications made to the original patch posted in comment #7. Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/93560/

Will verify this bug once the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1400057 lands.

Verified and works fine with build glusterfs-3.8.4-11.el7rhgs.x86_64. Brought the first brick down in the volume, created a VM, and installed the OS. Once the VM was installed, I powered off the VM, brought the first brick up, and saw that the VM booted successfully. Moving this to verified state.
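The verification flow described above can be sketched as the following CLI sequence. This is a non-runnable sketch, not the tester's exact commands: it requires a live 2+1 arbiter cluster, the brick PID placeholder must be filled in by hand, and the VM create/install/power-off steps (done through the oVirt UI here) are only indicated as comments.

```
# Sketch only; assumes a started 2+1 arbiter volume named "vmstore".

# 1. Find and kill the glusterfsd process serving the first data brick
#    (run on the host that owns Brick1). <BRICK1-PID> comes from the
#    "Online"/"Pid" columns of the status output.
gluster volume status vmstore
kill -9 <BRICK1-PID>

# 2. With Brick1 down, create a VM whose disk lives on vmstore and
#    install the OS (done via the oVirt/RHV management UI).

# 3. Power off the VM, then bring the killed brick back online.
gluster volume start vmstore force

# 4. Optionally wait for self-heal to catch the brick up, then boot
#    the VM and confirm it comes up cleanly.
gluster volume heal vmstore info
```

The point of step 3 is that `volume start ... force` respawns only the missing brick process; the bug was that the VM failed to boot after this brick came back, which the glusterfs-3.8.4-10 fix resolves.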
[root@rhsqa-grafton4 ~]# gluster volume info vmstore
Volume Name: vmstore
Type: Replicate
Volume ID: 2f8938c2-26d3-4912-a6e0-bc12b76146d0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.36.82:/rhgs/brick1/vmstore
Brick2: 10.70.36.83:/rhgs/brick1/vmstore
Brick3: 10.70.36.84:/rhgs/brick1/vmstore (arbiter)
Options Reconfigured:
auth.ssl-allow: 10.70.36.84,10.70.36.82,10.70.36.83
server.ssl: on
client.ssl: on
cluster.granular-entry-heal: on
user.cifs: off
network.ping-timeout: 30
performance.strict-o-direct: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
performance.low-prio-threads: 32
features.shard-block-size: 4MB
storage.owner-gid: 36
storage.owner-uid: 36
cluster.data-self-heal-algorithm: full
features.shard: on
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: off
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html