Description of problem:
++++++++++++++++++++++++++
We were scaling up to create at least 100 app pods with block devices bind-mounted to them. Each PVC+pod pair was created at an interval of 10 s, and each PVC requested 1 GB. While creating the 90th to 100th PVC in the namespace "fiotest", the new block device creations failed with "No space left on device". oc describe pvc listed error messages like the following:

"Failed to provision volume with StorageClass "gluster-block": failed to create volume: [heketi] failed to create volume: Unable to execute command on glusterfs-storage-fj674:"

Note:
++++++++
1. The system had one block-hosting volume - 8317f50bf66dd1bf02a1d7de68ee280a - which now has only 1 GB free. For subsequent block device creations, a new block-hosting volume should therefore have been created automatically; the requests should not have failed with "[No space left on device]". Could it be that the creations failed because the free size in 8317f50bf66dd1bf02a1d7de68ee280a equals the newly requested PVC size of 1 GB? (A quick way to confirm the free size is sketched after the heketi log below.)
2. The PVC requests were also 1 GB each. The available space in 8317f50bf66dd1bf02a1d7de68ee280a was 7 GB before we started creating 10 new devices at an interval of 10 s; 6 PVCs were created successfully and the subsequent 4 failed. The current available size in 8317f50bf66dd1bf02a1d7de68ee280a is 1 GB.

1. Error message from heketi
++++++++++++++++++++++++++++
[kubeexec] ERROR 2018/07/23 16:27:13 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:242: Failed to run command [gluster-block create vol_8317f50bf66dd1bf02a1d7de68ee280a/blk_fiotest_pvcam3jjv0vwr_401e17c6-8e95-11e8-888e-0a580a810202 ha 3 auth enable prealloc full 10.70.46.1,10.70.46.175,10.70.46.75 1GiB --json] on glusterfs-storage-fj674: Err[command terminated with exit code 28]: Stdout [{ "RESULT": "FAIL", "errCode": 28, "errMsg": "Not able to create storage for vol_8317f50bf66dd1bf02a1d7de68ee280a\/blk_fiotest_pvcam3jjv0vwr_401e17c6-8e95-11e8-888e-0a580a810202 [No space left on device]" } ]: Stderr []
[kubeexec] ERROR 2018/07/23 16:27:13 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:242: Failed to run command [gluster-block delete vol_8317f50bf66dd1bf02a1d7de68ee280a/blk_fiotest_pvcam3jjv0vwr_401e17c6-8e95-11e8-888e-0a580a810202 --json] on glusterfs-storage-fj674: Err[command terminated with exit code 2]: Stdout [{ "RESULT": "FAIL", "errCode": 2, "errMsg": "block vol_8317f50bf66dd1bf02a1d7de68ee280a\/blk_fiotest_pvcam3jjv0vwr_401e17c6-8e95-11e8-888e-0a580a810202 doesn't exist" } ]: Stderr []
[cmdexec] ERROR 2018/07/23 16:27:13 /src/github.com/heketi/heketi/executors/cmdexec/block_volume.go:102: Unable to delete volume blk_fiotest_pvcam3jjv0vwr_401e17c6-8e95-11e8-888e-0a580a810202: Unable to execute command on glusterfs-storage-fj674:
[heketi] ERROR 2018/07/23 16:27:13 /src/github.com/heketi/heketi/apps/glusterfs/operations.go:816: Error executing create block volume: Unable to execute command on glusterfs-storage-fj674:
[cmdexec] INFO 2018/07/23 16:27:13 Check Glusterd service status in node dhcp46-1.lab.eng.blr.redhat.com
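The free capacity heketi tracks for this block-hosting volume can be confirmed from the heketi pod. A minimal sketch, assuming heketi-cli inside the pod is already pointed at the local server and authenticated (otherwise --user/--secret have to be added); the exact fields in the output vary between heketi versions:

# Block-hosting volumes are tagged in the volume list; the info output shows total size and remaining block space
oc rsh heketi-storage-1-6st7z heketi-cli volume list
oc rsh heketi-storage-1-6st7z heketi-cli volume info 8317f50bf66dd1bf02a1d7de68ee280a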
2. Error message from gluster pod glusterfs-storage-fj674
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
175,10.70.46.75 authmode=1 size=1073741824, rbsize=0 [at block_svc_routines.c+3778 :<block_create_cli_1_svc_st>]
[2018-07-23 16:27:13.373883] ERROR: failed while creating block file in gluster volume volume: vol_8317f50bf66dd1bf02a1d7de68ee280a block: blk_fiotest_pvcam3jjv0vwr_401e17c6-8e95-11e8-888e-0a580a810202 file: 5354b6f8-038f-4a2e-8e30-d5d5fd6e684c host: 10.70.46.1,10.70.46.175,10.70.46.75 [at block_svc_routines.c+3868 :<block_create_cli_1_svc_st>]
[2018-07-23 16:27:13.518944] INFO: delete cli request, volume=vol_8317f50bf66dd1bf02a1d7de68ee280a blockname=blk_fiotest_pvcam3jjv0vwr_401e17c6-8e95-11e8-888e-0a580a810202 [at block_svc_routines.c+4493 :<block_delete_cli_1_svc_st>]
[2018-07-23 16:27:13.523342] ERROR: block with name blk_fiotest_pvcam3jjv0vwr_401e17c6-8e95-11e8-888e-0a580a810202 doesn't exist in the volume vol_8317f50bf66dd1bf02a1d7de68ee280a [at block_svc_routines.c+4528 :<block_delete_cli_1_svc_st>]
[2018-07-23 16:27:13.812625] INFO: delete cli request, volume=vol_8317f50bf66dd1bf02a1d7de68ee280a blockname=blk_fiotest_pvcam3jjv0vwr_401e17c6-8e95-11e8-888e-0a580a810202 [at block_svc_routines.c+4493 :<block_delete_cli_1_svc_st>]

These create requests didn't even reach the other glusterfs pods.

3. Error message from oc describe pvc
+++++++++++++++++++++++++++++++++++++
[root@dhcp47-178 openshift_scalability]# for i in `oc get pvc -n fiotest|grep -i pending |awk '{print$1}' `; do echo $i; echo +++++++++++++; oc describe pvc $i -n fiotest; echo ""; done
pvc8otjs11sbc
+++++++++++++
Name:          pvc8otjs11sbc
Namespace:     fiotest
StorageClass:  gluster-block
Status:        Pending
Volume:
Labels:        <none>
Annotations:   control-plane.alpha.kubernetes.io/leader={"holderIdentity":"5b1157a2-8e51-11e8-888e-0a580a810202","leaseDurationSeconds":15,"acquireTime":"2018-07-23T16:28:13Z","renewTime":"2018-07-23T16:45:27Z","lea...
               volume.beta.kubernetes.io/storage-class=gluster-block
               volume.beta.kubernetes.io/storage-provisioner=gluster.org/glusterblock
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
  Type     Reason                Age                 From                                                           Message
  ----     ------                ----                ----                                                           -------
  Normal   Provisioning          38m (x14 over 53m)  gluster.org/glusterblock 5b1157a2-8e51-11e8-888e-0a580a810202  External provisioner is provisioning volume for claim "fiotest/pvc8otjs11sbc"
  Warning  ProvisioningFailed    38m (x14 over 53m)  gluster.org/glusterblock 5b1157a2-8e51-11e8-888e-0a580a810202  Failed to provision volume with StorageClass "gluster-block": failed to create volume: [heketi] failed to create volume: Unable to execute command on glusterfs-storage-fj674:
  Normal   ExternalProvisioning  3m (x461 over 53m)  persistentvolume-controller                                    waiting for a volume to be created, either by external provisioner "gluster.org/glusterblock" or manually created by system administrator
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Note: All services were up and running in the 3 glusterfs pods.
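The ~1 GB free figure can also be cross-checked from the Gluster side. A rough sketch run against one of the glusterfs pods (field names depend on the gluster build in the pod):

# Per-brick free disk space for the block-hosting volume, plus the block devices already carved out of it
oc rsh glusterfs-storage-fj674 gluster volume status vol_8317f50bf66dd1bf02a1d7de68ee280a detail
oc rsh glusterfs-storage-fj674 gluster-block list vol_8317f50bf66dd1bf02a1d7de68ee280a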
4. Loop used to create pvc volumes #90 to #100 under fiotest
---------------------------------------------------------------
[root@dhcp47-178 openshift_scalability]# python cluster-loader.py -f content/fio/fio-parameters.yaml && date
oc v3.10.18
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://dhcp47-178.lab.eng.blr.redhat.com:8443
openshift v3.10.18
kubernetes v1.10.0+b81c8f8
forking fiotest
project.project.openshift.io/fiotest
templates: [{'num': 10, 'file': './content/fio/fio-template.json', 'parameters': [{'STORAGE_CLASS': 'gluster-block'}, {'STORAGE_SIZE': '1Gi'}, {'MOUNT_PATH': '/mnt/pvcmount'}, {'DOCKER_IMAGE': 'r7perffio'}]}]
persistentvolumeclaim "pvct06rlmfdbe" created
pod "fio-pod-tshdz" created
persistentvolumeclaim "pvcnniamy8cel" created   <---- successfully created
pod "fio-pod-brrl4" created
persistentvolumeclaim "pvcg50nakvc1l" created   <---- successfully created
pod "fio-pod-rkjkr" created
persistentvolumeclaim "pvcwuafokgfam" created   <---- successfully created
pod "fio-pod-jxddb" created
persistentvolumeclaim "pvcae5cmlfiad" created   <---- successfully created
pod "fio-pod-gqbth" created
persistentvolumeclaim "pvcl4lylp2rvh" created   <---- successfully created
pod "fio-pod-dh4w4" created
persistentvolumeclaim "pvcam3jjv0vwr" created   <----- no backend block device created
pod "fio-pod-q59dr" created
persistentvolumeclaim "pvcvqw113p5na" created   <----- no backend block device created
pod "fio-pod-bc4p9" created
persistentvolumeclaim "pvcbggembg3pu" created   <----- no backend block device created
pod "fio-pod-rsjxc" created
persistentvolumeclaim "pvc8otjs11sbc" created   <----- no backend block device created
pod "fio-pod-zfr7r" created
Mon Jul 23 21:58:33 IST 2018
------------------------------------------------------------------------

**********************************************************************

Version-Release number of selected component (if applicable):
++++++++++++++++++++++++++
[root@dhcp47-178 ~]# oc version
oc v3.10.18
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp47-178.lab.eng.blr.redhat.com:8443
openshift v3.10.18
kubernetes v1.10.0+b81c8f8

[root@dhcp47-178 ~]# oc rsh glusterfs-storage-
glusterfs-storage-4ffb2  glusterfs-storage-9bjx9  glusterfs-storage-fj674

[root@dhcp47-178 ~]# oc rsh glusterfs-storage-4ffb2 rpm -qa|grep gluster
glusterfs-libs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-3.8.4-54.15.el7rhgs.x86_64
glusterfs-api-3.8.4-54.15.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.15.el7rhgs.x86_64
glusterfs-server-3.8.4-54.15.el7rhgs.x86_64
gluster-block-0.2.1-22.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-54.15.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.15.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.15.el7rhgs.x86_64

[root@dhcp47-178 ~]# oc rsh heketi-storage-1-6st7z rpm -qa|grep heketi
python-heketi-7.0.0-4.el7rhgs.x86_64
heketi-client-7.0.0-4.el7rhgs.x86_64
heketi-7.0.0-4.el7rhgs.x86_64
[root@dhcp47-178 ~]#

How reproducible:
++++++++++++++++++++++++++
Once

Steps to Reproduce:
++++++++++++++++++++++++++
1. Start a script to create PVCs of 1 GB when the free size of the lone block-hosting volume present is also 1 GB (a minimal stand-alone loop is sketched after the Actual results below).
2. Confirm that a new block-hosting volume is created and the 1 GB PVCs are carved out of it.

Actual results:
++++++++++++++++++++++++++
With only 1 GB free in the lone block-hosting volume, a new block device of 1 GB could not be created on the setup [No space left on device]. A new block-hosting volume was also not created to provide space for new block devices.
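For reference, the cluster-loader run above can be reduced to a stand-alone loop that only creates the PVCs. This is a sketch, assuming the gluster-block StorageClass shown earlier; the claim name prefix is illustrative and the fio pods from the template are omitted:

# Create ten 1 GiB gluster-block PVCs in the fiotest namespace, one every 10 seconds
for i in $(seq 1 10); do
  oc create -n fiotest -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  generateName: pvc-blocktest-
spec:
  storageClassName: gluster-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
  sleep 10
done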
Expected results:
++++++++++++++++++++++++++
When the existing block-hosting volume does not have enough free space, a new block-hosting volume should be created automatically to satisfy subsequent block device create requests.

It looks like we hit this issue because the free size in vol_8317f50bf66dd1bf02a1d7de68ee280a (100 GB in size) is 1 GB and the new PVC request is also for 1 GB. Instead of creating a new block-hosting volume, heketi tried to provision the 1 GB PVC from the existing volume and failed.
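One more data point that may be worth adding to the triage: whether heketi is configured to auto-create block-hosting volumes at all, and at what size it creates them. A sketch, assuming the usual CNS deployment where these knobs appear as environment variables on the heketi pod and as options in /etc/heketi/heketi.json (the exact names and path may differ in this build):

# Auto-create and sizing knobs for block-hosting volumes, e.g. HEKETI_AUTO_CREATE_BLOCK_HOSTING_VOLUME / HEKETI_BLOCK_HOSTING_VOLUME_SIZE
oc rsh heketi-storage-1-6st7z env | grep -i block_hosting
oc rsh heketi-storage-1-6st7z grep -i block_hosting /etc/heketi/heketi.json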