Bug 1487645

Summary:	block-volume creation fails when one of the nodes is down in a 3-node RHGS cluster
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	krishnaram Karthick <kramdoss>
Component:	heketi	Assignee:	Raghavendra Talur <rtalur>
Status:	CLOSED ERRATA	QA Contact:	krishnaram Karthick <kramdoss>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	cns-3.6	CC:	akhakhar, annair, fcami, hchiramm, jarrpa, madam, mliyazud, mzywusko, pprakash, rcyriac, rhs-bugs, rreddy, rtalur, sselvan, storage-qa-internal
Target Milestone:	---
Target Release:	CNS 3.6
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	heketi-5.0.0-12 rhgs-volmanager-docker-5.0.0-16	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-10-11 07:09:46 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1445448

Description krishnaram Karthick 2017-09-01 13:26:30 UTC

Description of problem:
When one node is down in a 3-node CRS configuration and a PVC request is made with HA set to 2, block-device creation should succeed. However, it fails and the PVC remains in the Pending state.

# oc get pvc/test-03
NAME      STATUS    VOLUME    CAPACITY   ACCESSMODES   STORAGECLASS     AGE
test-03   Pending                                      glusterblockip   5m

Snippet from the heketi logs:

[sshexec] ERROR 2017/09/01 13:19:28 /src/github.com/heketi/heketi/pkg/utils/ssh/ssh.go:173: Failed to run command [/bin/bash -c 'gluster-block create vol_314b6b07b1e82a528a7bd1e2d2d00d20/blockvol_22433e831468f01ec243fb0f06ac4897  ha 2 auth enable  10.70.46.1,10.70.47.105 1G --json'] on dhcp47-25.lab.eng.blr.redhat.com:22: Err[Process exited with status 255]: Stdout [{ "RESULT": "FAIL", "errCode": 255, "errMsg": "failed to configure on 10.70.47.105 : No route to host" }

This issue was seen in a CRS configuration, but it is expected to occur in CNS as well.
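
The log shows that `gluster-block create ... --json` reports the failure as a JSON object with `RESULT`, `errCode`, and `errMsg` keys. Below is a minimal sketch of how a caller could turn that output into a readable error. It assumes the output has exactly the failure shape seen in the log above. The names `parse_gluster_block_result` and `GlusterBlockError` are hypothetical and are not part of heketi.

```python
import json


class GlusterBlockError(Exception):
    """Raised when gluster-block reports RESULT == "FAIL" (hypothetical helper)."""

    def __init__(self, code, message):
        super().__init__(f"gluster-block failed (errCode {code}): {message}")
        self.code = code
        self.message = message


def parse_gluster_block_result(stdout):
    """Parse the JSON printed by `gluster-block ... --json`.

    Assumes the failure shape seen in the heketi log:
    {"RESULT": "FAIL", "errCode": <int>, "errMsg": <str>}.
    Any other RESULT value is returned unchanged as a dict.
    """
    result = json.loads(stdout)
    if result.get("RESULT") == "FAIL":
        raise GlusterBlockError(result.get("errCode"), result.get("errMsg", ""))
    return result


# Stdout copied from the heketi log in this bug.
LOG_STDOUT = ('{ "RESULT": "FAIL", "errCode": 255, "errMsg": '
              '"failed to configure on 10.70.47.105 : No route to host" }')

try:
    parse_gluster_block_result(LOG_STDOUT)
except GlusterBlockError as err:
    print(err)
```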

Version-Release number of selected component (if applicable):
cns-deploy-5.0.0-25.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 3-node CRS setup with heketi in OpenShift.
2. Bring down one of the nodes.
3. Try to create a block device with ha:2.

Actual results:
Block-volume creation fails.

Expected results:
Block-volume creation should succeed, as 2 nodes are still available for this request to complete.
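
The expected behavior amounts to choosing `ha` reachable nodes before invoking gluster-block, instead of passing it a host that is down (10.70.47.105 in the log). The sketch below is hypothetical and is not heketi's actual fix. The following are all assumptions made for illustration:

- the names `tcp_probe`, `select_block_hosts`, and `gluster_block_create_args`;
- the TCP probe against glusterd's port 24007;
- the node and volume names in the usage lines.

```python
import socket
from typing import Callable, Iterable, List


def tcp_probe(host: str, port: int = 24007, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Probing glusterd's port 24007 is an assumption; any port the node serves would do.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, "No route to host"
        return False


def select_block_hosts(candidates: Iterable[str], ha: int,
                       probe: Callable[[str], bool] = tcp_probe) -> List[str]:
    """Pick the first `ha` candidate hosts that pass `probe`.

    Raises ValueError if fewer than `ha` hosts are reachable, rather than
    handing an unreachable host to gluster-block.
    """
    chosen = []
    for host in candidates:
        if probe(host):
            chosen.append(host)
            if len(chosen) == ha:
                return chosen
    raise ValueError(f"need {ha} reachable hosts, found {len(chosen)}: {chosen}")


def gluster_block_create_args(volume: str, block: str, hosts: List[str],
                              size: str, auth: bool = True) -> List[str]:
    """Build a gluster-block create argv mirroring the command in the heketi log."""
    args = ["gluster-block", "create", f"{volume}/{block}", "ha", str(len(hosts))]
    if auth:
        args += ["auth", "enable"]
    args += [",".join(hosts), size, "--json"]
    return args


# Hypothetical 3-node pool with node2 down; a fake probe stands in for tcp_probe.
up = {"node1", "node3"}
hosts = select_block_hosts(["node1", "node2", "node3"], ha=2, probe=lambda h: h in up)
print(gluster_block_create_args("vol_x", "blockvol_y", hosts, "1G"))
# ['gluster-block', 'create', 'vol_x/blockvol_y', 'ha', '2', 'auth', 'enable', 'node1,node3', '1G', '--json']
```

This sketch probes hosts serially, so each unreachable host can cost up to `timeout` seconds before it is skipped. Probing the candidates in parallel would bound that cost to a single timeout.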

Additional info:

Comment 3 Humble Chirammal 2017-09-14 11:40:36 UTC

The next heketi build will have the fix, and as soon as the build is available, we will move this to ON_QA.

Comment 4 Raghavendra Talur 2017-09-17 20:56:18 UTC

Although the patch is in the build, this is not going to work as expected with Kubernetes/OpenShift because of how labels are provided. Will send another patch on top to eliminate the problem.

Comment 5 Humble Chirammal 2017-09-19 06:26:18 UTC

This is fixed in cns-deploy v45.

Comment 6 krishnaram Karthick 2017-09-19 11:38:55 UTC

Verified in build - cns-deploy-5.0.0-46.el7rhgs.x86_64

When one node is brought down on a 3-node TSP and a block device is created with an HA count of 2, the block device is created successfully.

However, it takes ~3 minutes to create the block device when one node is down. I'll raise a separate bug to reduce this delay.

For now, this bug will be moved to VERIFIED.

Comment 7 errata-xmlrpc 2017-10-11 07:09:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2879