Bug 1596035

Summary: On a 4 node setup heketi block volume creation fails when a node is powered off
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: heketi
Version: cns-3.10
Hardware: x86_64
OS: Linux
Severity: urgent
Priority: unspecified
Status: CLOSED ERRATA
Reporter: vinutha <vinug>
Assignee: John Mulligan <jmulligan>
QA Contact: vinutha <vinug>
CC: hchiramm, jmulligan, pprakash, rhs-bugs, rtalur, sankarshan, sselvan, storage-qa-internal, vinug
Target Release: CNS 3.10
Type: Bug
Last Closed: 2018-09-12 09:23:45 UTC
Bug Blocks: 1568862

Description vinutha 2018-06-28 06:23:13 UTC
Description of problem:
On a 4 node setup heketi block volume creation fails when a node is powered off 

Version-Release number of selected component (if applicable):
# rpm -qa| grep openshift
atomic-openshift-clients-3.10.0-0.67.0.git.0.ccd325f.el7.x86_64
openshift-ansible-roles-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch
atomic-openshift-docker-excluder-3.10.0-0.67.0.git.0.ccd325f.el7.noarch
atomic-openshift-excluder-3.10.0-0.67.0.git.0.ccd325f.el7.noarch
atomic-openshift-3.10.0-0.67.0.git.0.ccd325f.el7.x86_64
openshift-ansible-docs-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch
openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch
atomic-openshift-hyperkube-3.10.0-0.67.0.git.0.ccd325f.el7.x86_64
atomic-openshift-node-3.10.0-0.67.0.git.0.ccd325f.el7.x86_64
openshift-ansible-playbooks-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch

# oc rsh glusterfs-storage-77pm2 
sh-4.2# rpm -qa| grep gluster
glusterfs-libs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-3.8.4-54.12.el7rhgs.x86_64
glusterfs-api-3.8.4-54.12.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.12.el7rhgs.x86_64
glusterfs-server-3.8.4-54.12.el7rhgs.x86_64
gluster-block-0.2.1-20.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.12.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.12.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.12.el7rhgs.x86_64


# oc rsh heketi-storage-1-bccs6 
sh-4.2# rpm -qa | grep heketi
python-heketi-7.0.0-1.el7rhgs.x86_64
heketi-client-7.0.0-1.el7rhgs.x86_64
heketi-7.0.0-1.el7rhgs.x86_64


How reproducible:
2/2 (reproduced on both attempts)

Steps to Reproduce:

1. Create a 4-node CNS setup using ansible with ha count=3.
2. Create 20 file volumes of 10 GB each.
3. Power off 1 node manually.
4. Create 20 block volumes of size 5 GB using heketi:
# for i in {1..20} ; do heketi-cli blockvolume create --name=b-vol-$i --size=5 ; sleep 10; done
5. Block volume creation fails with the error "Failed to allocate new block volume: No space" even though there is sufficient space on the 3 nodes that are up (see the capacity check below).
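
For reference, the per-node free space can be read from the same topology output that is snipped below. A quick filter over it, assuming GNU grep on the client host:

# heketi-cli topology info | grep -E 'Node Id|State|Free'

Note that heketi node state is administrative: a node that has merely been powered off continues to show State: online in this output until it is disabled with heketi-cli node disable.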

-- snip of heketi topology output before block volume creation --
        Node Id: b2a1923cf742cedfa9ea81ea78e304ed
        State: online
        Cluster Id: 14321f429bcd4c1c6e77017fd714ebe8
        Zone: 1
        Management Hostnames: dhcp47-70.lab.eng.blr.redhat.com
        Storage Hostnames: 10.70.47.70
        Devices:
                Id:543b45dce8a977adfd6000c0eedd4cd4   Name:/dev/sdd            State:online    Size (GiB):199     Used (GiB):90      Free (GiB):109
                        Bricks:
                                Id:12737b9e396ff18f196869d210ce1e92   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_543b45dce8a977adfd6000c0eedd4cd4/brick_12737b9e396ff18f196869d210ce1e92/brick
                                Id:1c8e02a6d92a37e652cbd5d18140af50   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_543b45dce8a977adfd6000c0eedd4cd4/brick_1c8e02a6d92a37e652cbd5d18140af50/brick
                                Id:a2f7169567380ab6fdaa56073547dee6   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_543b45dce8a977adfd6000c0eedd4cd4/brick_a2f7169567380ab6fdaa56073547dee6/brick
                                Id:b2dcb6da44c6e31f553e6f64f4e9b627   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_543b45dce8a977adfd6000c0eedd4cd4/brick_b2dcb6da44c6e31f553e6f64f4e9b627/brick
                                Id:c409fde9644556c3f1149c12fd697fda   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_543b45dce8a977adfd6000c0eedd4cd4/brick_c409fde9644556c3f1149c12fd697fda/brick
                                Id:eba5e57293b817a54b2f0e236f95df76   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_543b45dce8a977adfd6000c0eedd4cd4/brick_eba5e57293b817a54b2f0e236f95df76/brick
                Id:b87af79398381da12d08232eda69c722   Name:/dev/sde            State:online    Size (GiB):199     Used (GiB):42      Free (GiB):157
                        Bricks:
                                Id:2d5c878c07be7b52491099cff96910bc   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_b87af79398381da12d08232eda69c722/brick_2d5c878c07be7b52491099cff96910bc/brick
                                Id:400b7b1f147f2cce6f702f0e37e49a07   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_b87af79398381da12d08232eda69c722/brick_400b7b1f147f2cce6f702f0e37e49a07/brick
                                Id:5087442f15656c1b6c4ff0cc81c5cfd7   Size (GiB):2       Path: /var/lib/heketi/mounts/vg_b87af79398381da12d08232eda69c722/brick_5087442f15656c1b6c4ff0cc81c5cfd7/brick
                                Id:aef10fc9ec18aca6e1f276ba977a72f8   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_b87af79398381da12d08232eda69c722/brick_aef10fc9ec18aca6e1f276ba977a72f8/brick

        Node Id: fb1bbfb8c67284475bc21bb279007e8c
        State: online
        Cluster Id: 14321f429bcd4c1c6e77017fd714ebe8
        Zone: 1
        Management Hostnames: dhcp46-210.lab.eng.blr.redhat.com
        Storage Hostnames: 10.70.46.210
        Devices:
                Id:adc7f09c5f19dd31c57ef72b3f2cb64e   Name:/dev/sde            State:online    Size (GiB):199     Used (GiB):110     Free (GiB):89
                        Bricks:
                                Id:4df3403eb063f838027c4c7b0423250b   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_adc7f09c5f19dd31c57ef72b3f2cb64e/brick_4df3403eb063f838027c4c7b0423250b/brick
                                Id:67e2e56375e36deb99e15dc08eecce56   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_adc7f09c5f19dd31c57ef72b3f2cb64e/brick_67e2e56375e36deb99e15dc08eecce56/brick
                                Id:aedfe6854af18bb22fb86ea07c2a1465   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_adc7f09c5f19dd31c57ef72b3f2cb64e/brick_aedfe6854af18bb22fb86ea07c2a1465/brick
                                Id:b5c57402d83c5c92eeae822633e245ae   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_adc7f09c5f19dd31c57ef72b3f2cb64e/brick_b5c57402d83c5c92eeae822633e245ae/brick
                                Id:bb8c0f58429a0bf0ee0ce6c80475e8f9   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_adc7f09c5f19dd31c57ef72b3f2cb64e/brick_bb8c0f58429a0bf0ee0ce6c80475e8f9/brick
                                Id:bdbfa8fe8e94583e0087bff49a012170   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_adc7f09c5f19dd31c57ef72b3f2cb64e/brick_bdbfa8fe8e94583e0087bff49a012170/brick
                                Id:c41287d78b90060e63f2c7dd8934072d   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_adc7f09c5f19dd31c57ef72b3f2cb64e/brick_c41287d78b90060e63f2c7dd8934072d/brick
                                Id:c716f837c0e94ee690f83cf61adee31b   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_adc7f09c5f19dd31c57ef72b3f2cb64e/brick_c716f837c0e94ee690f83cf61adee31b/brick
                                Id:ecc661dd93b2453a95a6ae6b48c2ea8a   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_adc7f09c5f19dd31c57ef72b3f2cb64e/brick_ecc661dd93b2453a95a6ae6b48c2ea8a/brick
                Id:ecd528618df514d2b1061c62afd86999   Name:/dev/sdd            State:online    Size (GiB):199     Used (GiB):62      Free (GiB):137
                        Bricks:
                                Id:1b50e4eb496c7c2b325ff4b11708198f   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_ecd528618df514d2b1061c62afd86999/brick_1b50e4eb496c7c2b325ff4b11708198f/brick
                                Id:3e1256bd97344681519613e92546a779   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_ecd528618df514d2b1061c62afd86999/brick_3e1256bd97344681519613e92546a779/brick
                                Id:7b642187486250dca487691e0bef5e81   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_ecd528618df514d2b1061c62afd86999/brick_7b642187486250dca487691e0bef5e81/brick
                                Id:80c00b04b09f566c65ebcb6f975c4fa2   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_ecd528618df514d2b1061c62afd86999/brick_80c00b04b09f566c65ebcb6f975c4fa2/brick
                                Id:9947ee0999f188218a42f588c1677cc3   Size (GiB):10      Path: /var/lib/heketi/mounts/vg_ecd528618df514d2b1061c62afd86999/brick_9947ee0999f188218a42f588c1677cc3/brick
                                Id:9db57e4184a91f004d608f492b94433d   Size (GiB):2       Path: /var/lib/heketi/mounts/vg_ecd528618df514d2b1061c62afd86999/brick_9db57e4184a91f004d608f492b94433d/brick

-------------snip ------------

Actual results:
Block volume creation fails with the error "Failed to allocate new block volume: No space" even though there is sufficient space on the 3 nodes that are up.

Expected results:
Block volume creation should succeed, since the 3 remaining nodes are up and can satisfy ha=3.

Additional info:
logs attached

Comment 2 vinutha 2018-06-28 06:27:28 UTC
As expected, it is not possible to log in to the pod hosted on the powered-off node:
# oc rsh glusterfs-storage-2x6qg
Error from server: error dialing backend: dial tcp 10.70.46.29:10250: getsockopt: no route to host

-- error messages from the block volume create loop --

# for i in {1..20} ; do heketi-cli blockvolume create --name=b-vol-$i --size=5 ; sleep 10; done 
Error: Unable to execute command on glusterfs-storage-fwh6t:
Error: Failed to allocate new block volume: No space
Error: Failed to allocate new block volume: No space
Error: Failed to allocate new block volume: No space
Error: Failed to allocate new block volume: No space
Error: Failed to allocate new block volume: No space
Error: Failed to allocate new block volume: No space
Error: Failed to allocate new block volume: No space
Error: Failed to allocate new block volume: No space
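
The rollback failures in the heketi log below originate in gluster itself: a volume delete is refused while a peer is down. This can be confirmed from any surviving gluster pod (pod name taken from the log below; expect one peer reported as disconnected):

# oc rsh glusterfs-storage-rqcpl gluster peer status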

--- heketi log snip ---------------------

[kubeexec] DEBUG 2018/06/28 04:40:48 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:246: Host: dhcp46-210.lab.eng.blr.redhat.com Pod: glusterfs-storage-rqcpl Command: gluster --mode=script volume stop vol_5d6a205b63f2455ba8a825d96102020a force
Result: volume stop: vol_5d6a205b63f2455ba8a825d96102020a: success
[heketi] WARNING 2018/06/28 04:40:49 failed to delete volume 5d6a205b63f2455ba8a825d96102020a via dhcp46-210.lab.eng.blr.redhat.com: Unable to delete volume vol_5d6a205b63f2455ba8a825d96102020a: Unable to execute command on glusterfs-storage-rqcpl: volume delete: vol_5d6a205b63f2455ba8a825d96102020a: failed: Some of the peers are down
[kubeexec] ERROR 2018/06/28 04:40:49 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:242: Failed to run command [gluster --mode=script volume delete vol_5d6a205b63f2455ba8a825d96102020a] on glusterfs-storage-rqcpl: Err[command terminated with exit code 1]: Stdout []: Stderr [volume delete: vol_5d6a205b63f2455ba8a825d96102020a: failed: Some of the peers are down
]
[cmdexec] ERROR 2018/06/28 04:40:49 /src/github.com/heketi/heketi/executors/cmdexec/volume.go:153: Unable to delete volume vol_5d6a205b63f2455ba8a825d96102020a: Unable to execute command on glusterfs-storage-rqcpl: volume delete: vol_5d6a205b63f2455ba8a825d96102020a: failed: Some of the peers are down
[heketi] ERROR 2018/06/28 04:40:49 /src/github.com/heketi/heketi/apps/glusterfs/volume_entry.go:372: failed to delete volume in cleanup: no hosts available (4 total)
[heketi] ERROR 2018/06/28 04:40:49 /src/github.com/heketi/heketi/apps/glusterfs/operations.go:873: Error on create volume rollback: failed to clean up volume: 5d6a205b63f2455ba8a825d96102020a
[heketi] ERROR 2018/06/28 04:40:49 /src/github.com/heketi/heketi/apps/glusterfs/operations.go:1183: Create Block Volume Rollback error: failed to clean up volume: 5d6a205b63f2455ba8a825d96102020a
[heketi] ERROR 2018/06/28 04:40:49 /src/github.com/heketi/heketi/apps/glusterfs/operations.go:1185: Create Block Volume Failed: Unable to execute command on glusterfs-storage-fwh6t: 
[asynchttp] INFO 2018/06/28 04:40:49 asynchttp.go:292: Completed job bb4a4d0a2034567ed4d452e0c3aed545 in 25.416868303s
[negroni] Started GET /queue/bb4a4d0a2034567ed4d452e0c3aed545
[negroni] Completed 500 Internal Server Error in 403.83µs
[negroni] Started POST /blockvolumes
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #0
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #0
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #1
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #0
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #1
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #2
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #0
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #1
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #2
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #3
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #4
[heketi] INFO 2018/06/28 04:40:59 Allocating brick set #5
[heketi] ERROR 2018/06/28 04:40:59 /src/github.com/heketi/heketi/apps/glusterfs/operations.go:1167: Create Block Volume Build Failed: No space
[negroni] Completed 500 Internal Server Error in 28.972259ms

----------------------
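
A side effect worth checking: the log shows "volume stop ... success" followed by a failed delete and "failed to clean up volume", so the stopped block-hosting volume vol_5d6a205b63f2455ba8a825d96102020a may be left behind on the gluster side. A cross-check of heketi's view against gluster's, assuming a reachable surviving pod:

# heketi-cli volume list
# oc rsh glusterfs-storage-rqcpl gluster volume list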

Comment 11 Raghavendra Talur 2018-07-10 19:22:30 UTC
The test case for the title "On a 4 node setup heketi block volume creation fails when a node is powered off" should be:

1. create a 4 node cluster
2. bring down any one of the 4 nodes
3. create a blockvolume with ha count 3 (remember, if you are using heketi-cli you must specify it explicitly; see the example command below)
4. the blockvolume should get created

Corner case not part of the test case:
If a node goes down *during* the blockvolume create operation, heketi won't pick other nodes.
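
A minimal form of step 3 with the ha count passed explicitly, assuming the --ha flag of the heketi-client 7.x used in this setup:

# heketi-cli blockvolume create --name=b-vol-1 --size=5 --ha=3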

Comment 17 errata-xmlrpc 2018-09-12 09:23:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2686