Bug 1487645

Summary: block-volume creation fails when one of the nodes is down in a 3-node RHGS cluster
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: krishnaram Karthick <kramdoss>
Component: heketi
Assignee: Raghavendra Talur <rtalur>
Status: CLOSED ERRATA
QA Contact: krishnaram Karthick <kramdoss>
Severity: high
Docs Contact:
Priority: unspecified
Version: cns-3.6
CC: akhakhar, annair, fcami, hchiramm, jarrpa, madam, mliyazud, mzywusko, pprakash, rcyriac, rhs-bugs, rreddy, rtalur, sselvan, storage-qa-internal
Target Milestone: ---
Target Release: CNS 3.6
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: heketi-5.0.0-12 rhgs-volmanager-docker-5.0.0-16
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-10-11 07:09:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1445448

Description krishnaram Karthick 2017-09-01 13:26:30 UTC
Description of problem:
When one node is down in a 3-node CRS configuration and a PVC request is made with an HA count of 2, block-device creation should succeed. However, it fails and the PVC remains in the Pending state.

# oc get pvc/test-03
NAME      STATUS    VOLUME    CAPACITY   ACCESSMODES   STORAGECLASS     AGE
test-03   Pending                                      glusterblockip   5m

Snippet from heketi logs:

[sshexec] ERROR 2017/09/01 13:19:28 /src/github.com/heketi/heketi/pkg/utils/ssh/ssh.go:173: Failed to run command [/bin/bash -c 'gluster-block create vol_314b6b07b1e82a528a7bd1e2d2d00d20/blockvol_22433e831468f01ec243fb0f06ac4897  ha 2 auth enable  10.70.46.1,10.70.47.105 1G --json'] on dhcp47-25.lab.eng.blr.redhat.com:22: Err[Process exited with status 255]: Stdout [{ "RESULT": "FAIL", "errCode": 255, "errMsg": "failed to configure on 10.70.47.105 : No route to host" }
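
For triage, the JSON that gluster-block prints on stdout can be pulled out of a log line like the one above. The following is a minimal Python sketch, not part of heketi or gluster-block; the helper names extract_stdout_json and failed_host are hypothetical, and the message pattern is assumed from this single log line only.

```python
import json
import re

# Abbreviated copy of the heketi log line quoted in this report. The command
# text is shortened to "..." here; the Stdout payload is copied verbatim.
LOG_LINE = (
    "[sshexec] ERROR 2017/09/01 13:19:28 ... "
    "Err[Process exited with status 255]: "
    'Stdout [{ "RESULT": "FAIL", "errCode": 255, '
    '"errMsg": "failed to configure on 10.70.47.105 : No route to host" }'
)


def extract_stdout_json(line):
    """Return the JSON object that follows 'Stdout [' in a log line, or None."""
    pos = line.find("Stdout [")
    if pos == -1:
        return None
    brace = line.find("{", pos)
    if brace == -1:
        return None
    try:
        # raw_decode stops at the end of the first JSON value, so trailing
        # log text (or a missing closing bracket) does not matter.
        obj, _end = json.JSONDecoder().raw_decode(line, brace)
    except json.JSONDecodeError:
        return None
    return obj


def failed_host(err_msg):
    """Split 'failed to configure on <host> : <reason>' into (host, reason).

    The message format is an assumption based on the one example in this bug.
    """
    match = re.search(r"failed to configure on (\S+)\s*:\s*(.+)", err_msg)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


if __name__ == "__main__":
    result = extract_stdout_json(LOG_LINE)
    if result is not None and result.get("RESULT") == "FAIL":
        print(result["errCode"], failed_host(result["errMsg"]))
```

Applied to the line above, this identifies 10.70.47.105 as the host that gluster-block could not reach, that is, one of the two hosts passed to the create command.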

This issue was seen in a CRS configuration, but is expected to occur in CNS as well.

Version-Release number of selected component (if applicable):
cns-deploy-5.0.0-25.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a 3-node CRS setup with heketi in OpenShift.
2. Fail one of the nodes.
3. Try to create a block device with ha:2.

Actual results:
Block-volume creation fails.

Expected results:
Block-volume creation should succeed, as 2 nodes are available to complete this request.
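
The expected behaviour can be illustrated with a small Python sketch: choose the HA hosts only from nodes that are currently reachable, and fail only when fewer than the requested count remain. This is an illustration of the expectation, not heketi's actual fix; pick_block_hosts and the is_reachable callback are hypothetical, and 192.0.2.10 is a placeholder for the third node, whose address does not appear in this report.

```python
def pick_block_hosts(candidates, is_reachable, ha_count):
    """Return the first ha_count reachable hosts from candidates.

    is_reachable is a caller-supplied function (hypothetical here) that
    reports whether a host can currently be contacted.
    """
    reachable = [host for host in candidates if is_reachable(host)]
    if len(reachable) < ha_count:
        raise ValueError(
            "need %d reachable hosts, found %d" % (ha_count, len(reachable))
        )
    return reachable[:ha_count]


if __name__ == "__main__":
    # Two addresses come from the failing command in this report; the third
    # is a placeholder (assumption), since the report does not list it.
    nodes = ["10.70.46.1", "10.70.47.105", "192.0.2.10"]
    down = {"10.70.47.105"}  # the host that returned "No route to host"
    hosts = pick_block_hosts(nodes, lambda h: h not in down, ha_count=2)
    print(",".join(hosts))
```

With one node down and ha 2, two hosts remain and the request can proceed; a request for ha 3 in the same state would be rejected.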

Additional info:

Comment 3 Humble Chirammal 2017-09-14 11:40:36 UTC
The next heketi build will have the fix. As soon as the build is available, we will move this to ON_QA.

Comment 4 Raghavendra Talur 2017-09-17 20:56:18 UTC
Although the patch is in the build, it is not going to work as expected with Kubernetes/OpenShift because of how labels are provided. I will send another patch on top to eliminate the problem.

Comment 5 Humble Chirammal 2017-09-19 06:26:18 UTC
This is fixed in cns-deploy v45.

Comment 6 krishnaram Karthick 2017-09-19 11:38:55 UTC
Verified in build - cns-deploy-5.0.0-46.el7rhgs.x86_64

When one node is brought down in a 3-node TSP and a block device is requested with an HA count of 2, the block device gets created.

However, it takes ~3 minutes to create the block device when one node is down. I'll raise a separate bug to reduce this delay.

For now, this bug will be moved to VERIFIED.

Comment 7 errata-xmlrpc 2017-10-11 07:09:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:2879