Bug 1599645

Summary: Heketi fails to create block device when one out of two block hosting volumes is down
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rachael <rgeorge>
Component: heketi
Assignee: John Mulligan <jmulligan>
Status: CLOSED WONTFIX
QA Contact: Rachael <rgeorge>
Severity: high
Docs Contact:
Priority: unspecified
Version: cns-3.10
CC: hchiramm, jmulligan, kramdoss, pprakash, rgeorge, rhs-bugs, rtalur, sankarshan, storage-qa-internal
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-04-24 15:23:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1641915

Description Rachael 2018-07-10 09:43:03 UTC
Description of problem:

On a CNS setup, two block hosting volumes were created and block devices were created on both of them. One of the block hosting volumes was then stopped and the gluster-block-target.service was restarted. After the services came back up, a block PVC was created. The creation failed because heketi tried to create the block device on the block hosting volume that was down.

[kubeexec] ERROR 2018/07/10 09:30:22 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:242: Failed to run command [gluster-block create vol_0c4e24dba9d66bfd28cb3d7a9871397f/test-vol_glusterfs_mongodb-6_dadfb7e7-8423-11e8-9d0f-0a580a830009  ha 3 auth enable prealloc full 10.70.43.230,10.70.43.19,10.70.43.53 1GiB --json] on glusterfs-storage-krlwr: Err[command terminated with exit code 5]: Stdout [{ "RESULT": "FAIL", "errCode": 5, "errMsg": "Check if volume vol_0c4e24dba9d66bfd28cb3d7a9871397f is operational" }
]: Stderr []


On every retry attempt, heketi picked the same (stopped) block hosting volume and failed, even though another block hosting volume with sufficient space was available.
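
For reference, the condition gluster-block complains about ("Check if volume ... is operational", errCode 5) can be confirmed manually from any gluster pod; a minimal check, using the two block hosting volumes from this report:

# gluster volume info vol_0c4e24dba9d66bfd28cb3d7a9871397f | grep ^Status
Status: Stopped
# gluster volume info vol_c43f0247bd000e52369dde72cc428342 | grep ^Status
Status: Started

The complete volume info for both block hosting volumes follows.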

Volume Name: vol_0c4e24dba9d66bfd28cb3d7a9871397f
Type: Replicate
Volume ID: 4fc33195-9d1a-41b6-8c09-f82f1ca3cfa7
Status: Stopped
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.43.230:/var/lib/heketi/mounts/vg_86720e336d04af17d0d1df0de77fabe5/brick_cd20c546d7596fb88e9b264ba82b91d3/brick
Brick2: 10.70.43.19:/var/lib/heketi/mounts/vg_4f884c058ce59bce86912bbf803ba1d5/brick_1d15b816fe5749bdbd00add791ac18fc/brick
Brick3: 10.70.43.53:/var/lib/heketi/mounts/vg_a27e963d67f8a834e0048d9cf200c5cc/brick_72e518b267fc011b747875056fa355b0/brick
Options Reconfigured:
server.allow-insecure: on
user.cifs: off
features.shard-block-size: 64MB
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: disable
performance.strict-o-direct: on
performance.readdir-ahead: off
performance.open-behind: off
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: on
 
Volume Name: vol_c43f0247bd000e52369dde72cc428342
Type: Replicate
Volume ID: bd0c5acc-6a67-4d18-9c8f-bea8674712fe
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.43.230:/var/lib/heketi/mounts/vg_86720e336d04af17d0d1df0de77fabe5/brick_259857c4b93ac58f3f4416308f23c656/brick
Brick2: 10.70.43.19:/var/lib/heketi/mounts/vg_4f884c058ce59bce86912bbf803ba1d5/brick_3da36f9b37c449e259a2e38781e7aec2/brick
Brick3: 10.70.43.53:/var/lib/heketi/mounts/vg_a27e963d67f8a834e0048d9cf200c5cc/brick_821191c35768c513cf977fc6f71f340f/brick
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.open-behind: off
performance.readdir-ahead: off
performance.strict-o-direct: on
network.remote-dio: disable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
features.shard-block-size: 64MB
user.cifs: off
server.allow-insecure: on
cluster.brick-multiplex: on


Version-Release number of selected component (if applicable):

# oc version
oc v3.10.0-0.67.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO


# oc rsh heketi-storage-1-4f7vz rpm -qa|grep heketi
python-heketi-7.0.0-2.el7rhgs.x86_64
heketi-client-7.0.0-2.el7rhgs.x86_64
heketi-7.0.0-2.el7rhgs.x86_64


# rpm -qa|grep gluster
glusterfs-client-xlators-3.8.4-54.12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.12.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.12.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-api-3.8.4-54.12.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.12.el7rhgs.x86_64
glusterfs-server-3.8.4-54.12.el7rhgs.x86_64


How reproducible: 1/1


Steps to Reproduce:
1. Create two block hosting volumes
2. Create block devices on both volumes
3. Stop one of the block hosting volumes
4. Restart the gluster-block-target service
5. Create a new block device
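
A command-level sketch of steps 3-5 (the hosting volume name is taken from this report; the PVC manifest name is a placeholder and assumes a gluster-block storage class; the gluster commands are run inside a gluster pod):

# gluster volume stop vol_0c4e24dba9d66bfd28cb3d7a9871397f
# systemctl restart gluster-block-target
# oc create -f block-pvc.yaml
# oc get pvc

The new PVC is expected to stay Pending while heketi retries the create against the stopped hosting volume.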

Actual results:
Block device creation fails because heketi repeatedly attempts to create the block device on the stopped block hosting volume.


Expected results:
Block device creation should succeed, with heketi selecting an operational block hosting volume that has sufficient free space (as illustrated below).
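
As a sanity check of the expected path, the same gluster-block command from the failure log, pointed at the started hosting volume (the block device name here is a placeholder), would be expected to succeed when run from a gluster pod:

# gluster-block create vol_c43f0247bd000e52369dde72cc428342/test-block ha 3 auth enable prealloc full 10.70.43.230,10.70.43.19,10.70.43.53 1GiB --json

Heketi should route new block device requests to this started volume (or any other operational block hosting volume with sufficient free space) instead of repeatedly retrying the stopped one.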

Additional info:
Logs will be attached soon