All attempts at block PVC delete throw 'Delete Block Volume Build Failed' but eventually the delete succeeds

Description of problem:
============================
In the OCS 3.11.1 build, while deleting block PVCs, we see error messages in the heketi logs and sometimes also in the gluster-blockd logs, but the delete finally succeeds. We want to confirm why there are so many "ERROR" log messages for even a single PVC deletion, when the deletion itself succeeds.

Attempted 4-5 block deletions and noticed the following:
____________________________________________________________________________

1. Even though block deletion is quite fast, for a single PVC delete request we see multiple error messages like the following in heketi:
++++++++++++++++++++++++++++++++++++++++++++++++++++
e.g.
[negroni] Started DELETE /blockvolumes/4032fca686696705df44e42128747a98
[heketi] ERROR 2018/11/22 17:58:36 /src/github.com/heketi/heketi/apps/glusterfs/operations.go:941: Pending block volume 4032fca686696705df44e42128747a98 can not be deleted
---------------------------------------
2. We also saw the following error message on the gluster-blockd side, once after a create and once after a delete.
++++++++++++++++++++++++++++++++++++++++++++
"glusterBlockVolumeInit(e56279db502865659c524b763c9957bf) failed [at block_svc_routines.c+4974 :<block_list_cli_1_svc_st>]"

[2018-11-22 09:57:05.757325] INFO: create request, volume=vol_e56279db502865659c524b763c9957bf blockname=blk_glusterfs_block50_f0ac8d19-ee3c-11e8-9bfd-0a580a820008 blockhosts=10.70.46.100,10.70.47.20,10.70.47.18 filename=1f403680-a878-4d08-b636-1ae851e9279e authmode=1 passwd=bbebb862-30f9-4798-84fb-9f860f8d7740 size=3221225472 [at block_svc_routines.c+4270 :<block_create_common>]
[2018-11-22 09:57:07.859370] INFO: command exit code, 0 [at block_svc_routines.c+4446 :<block_create_common>]
[2018-11-22 09:58:36.383064] ERROR: glusterBlockVolumeInit(9afa8e97bae9f92bdb19376c320fbcb3) failed [at block_svc_routines.c+4974 :<block_list_cli_1_svc_st>]
[2018-11-22 09:59:01.714822] ERROR: glusterBlockVolumeInit(e56279db502865659c524b763c9957bf) failed [at block_svc_routines.c+4974 :<block_list_cli_1_svc_st>]
[2018-11-22 10:14:49.681918] INFO: command exit code, 0 [at block_svc_routines.c+4446 :<block_create_common>]
[2018-11-22 10:16:06.034960] INFO: delete request, blockname=blk_glusterfs_file1_6f8f6a2d-ee3f-11e8-a6eb-0a580a810007 filename=f40d4303-4dfc-4838-b153-50ae1f86c94c [at block_svc_routines.c+4689 :<block_delete_1_svc_st>]
[2018-11-22 10:16:09.802075] INFO: command exit code, 0 [at block_svc_routines.c+4720 :<block_delete_1_svc_st>]
[2018-11-22 10:17:03.289160] ERROR: glusterBlockVolumeInit(e56279db502865659c524b763c9957bf) failed [at block_svc_routines.c+4974 :<block_list_cli_1_svc_st>]
-----------------------------------------------------------------
3. Even though the delete of the block volume succeeds, we still see additional "Started DELETE" requests for the same volume after the successful delete. These then error out with "Completed 404 Not Found in 91.097µs".
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
What could be the reason for retrying the delete even after the delete has succeeded?
Similar behavior was seen:

[kubeexec] DEBUG 2018/11/22 17:58:41 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:246: Host: dhcp47-20.lab.eng.blr.redhat.com Pod: glusterfs-storage-fkmqh Command: bash -c "set -o pipefail && gluster-block delete vol_e56279db502865659c524b763c9957bf/blk_glusterfs_block52_6ff02eac-ee7f-11e8-a6eb-0a580a810007 --json |tee /dev/stderr" Result: { "SUCCESSFUL ON": [ "10.70.47.18", "10.70.47.20", "10.70.46.100" ], "RESULT": "SUCCESS" }
[negroni] Started DELETE /blockvolumes/4032fca686696705df44e42128747a98
[negroni] Started GET /queue/a06268cce654c493d44efcab2d471ee4
[negroni] Completed 200 OK in 172.054µs
[heketi] INFO 2018/11/22 17:58:41 Delete Block Volume succeeded
[asynchttp] INFO 2018/11/22 17:58:41 asynchttp.go:292: Completed job a06268cce654c493d44efcab2d471ee4 in 5.047771833s
[heketi] ERROR 2018/11/22 17:58:41 /src/github.com/heketi/heketi/apps/glusterfs/operations_manage.go:89: Delete Block Volume Build Failed: Id not found
[negroni] Completed 500 Internal Server Error in 163.907849ms
[negroni] Started DELETE /blockvolumes/4032fca686696705df44e42128747a98
[negroni] Completed 404 Not Found in 202.273µs
[negroni] Started GET /queue/a06268cce654c493d44efcab2d471ee4
[negroni] Completed 204 No Content in 170.067µs
[negroni] Started DELETE /blockvolumes/4032fca686696705df44e42128747a98
[negroni] Completed 404 Not Found in 200.545µs
[negroni] Started DELETE /blockvolumes/4032fca686696705df44e42128747a98
[negroni] Completed 404 Not Found in 91.097µs
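The server-side pattern in the log above (first DELETE accepted, a concurrent DELETE rejected because the volume is still pending, a late DELETE answered with "Id not found"/404) can be modeled with a small toy state machine. This is a hypothetical sketch, not heketi's actual implementation; the type and function names are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Toy model of the states seen in the logs: a volume whose delete is in
// flight is "pending"; a second DELETE while pending is rejected, and a
// DELETE after the async job completes gets "not found".
type volumeStore struct {
	mu      sync.Mutex
	pending map[string]bool // volume id -> delete job in flight
	exists  map[string]bool // volume id -> volume present
}

var (
	errPending  = errors.New("pending block volume can not be deleted") // surfaces as HTTP 500 in the logs
	errNotFound = errors.New("id not found")                            // surfaces as HTTP 404
)

func newStore(ids ...string) *volumeStore {
	s := &volumeStore{pending: map[string]bool{}, exists: map[string]bool{}}
	for _, id := range ids {
		s.exists[id] = true
	}
	return s
}

// startDelete begins an asynchronous delete, mirroring heketi's job queue.
func (s *volumeStore) startDelete(id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.exists[id] {
		return errNotFound
	}
	if s.pending[id] {
		return errPending
	}
	s.pending[id] = true
	return nil
}

// finishDelete marks the async job complete and removes the volume.
func (s *volumeStore) finishDelete(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.pending, id)
	delete(s.exists, id)
}

func main() {
	const id = "4032fca686696705df44e42128747a98"
	s := newStore(id)

	fmt.Println(s.startDelete(id)) // first DELETE: accepted
	fmt.Println(s.startDelete(id)) // retry while pending: "can not be deleted"
	s.finishDelete(id)             // async job completes ("Delete Block Volume succeeded")
	fmt.Println(s.startDelete(id)) // late retry: "id not found" (the 404s in the log)
}
```

Under this model, every extra DELETE a client sends before the first one finishes produces exactly one of the error lines seen above, even though the underlying delete succeeds once.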
Version-Release number of selected component (if applicable):

# oc rsh glusterfs-storage-dnqkj rpm -qa | grep gluster
glusterfs-fuse-3.12.2-27.el7rhgs.x86_64
python2-gluster-3.12.2-27.el7rhgs.x86_64
glusterfs-server-3.12.2-27.el7rhgs.x86_64
gluster-block-0.2.1-29.el7rhgs.x86_64
glusterfs-api-3.12.2-27.el7rhgs.x86_64
glusterfs-cli-3.12.2-27.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-27.el7rhgs.x86_64
glusterfs-libs-3.12.2-27.el7rhgs.x86_64
glusterfs-3.12.2-27.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-27.el7rhgs.x86_64

# oc rsh heketi-storage-1-6xdhf rpm -qa | grep heketi
heketi-client-8.0.0-1.el7rhgs.x86_64
heketi-8.0.0-1.el7rhgs.x86_64

# oc version = 3.11.16

How reproducible:
++++++++++++++++
5/5

Steps to Reproduce:
1. Create block PVCs and check the heketi and gluster-block logs. Check for messages similar to "glusterBlockVolumeInit failed".
2. Delete a block PVC and check the heketi and gluster-block logs. Check for numerous "Delete Block Volume Build Failed: The target exists, contains other items, or is in use." messages.
From Heketi's perspective, it appears to be doing the correct thing. What happens is that clients repeatedly request a delete of the same volume before that volume has been deleted by the first request. Every request to delete a volume that is already being deleted gets the "Build Failed: The target exists, contains other items, or is in use." message.

This makes me wonder whether there could be an issue where the block provisioner "spams" heketi with many requests to delete the same volume. Can we either get the block provisioner logs from this same time frame, or reproduce again and this time capture both the block provisioner logs and the heketi logs?
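If the provisioner is indeed re-sending DELETE without regard for the server's response, a client-side loop like the following would avoid the spam. This is a hypothetical sketch of what a well-behaved client could do, not the real provisioner code; `deleteWithBackoff` and the status-code mapping are assumptions based on the responses seen in the logs (500 while the delete is pending, 404 once the volume is gone).

```go
package main

import (
	"fmt"
	"time"
)

// deleteWithBackoff sends a DELETE (via the supplied send func, which returns
// an HTTP status code) and reacts to the server's state instead of blindly
// retrying: 404 means the volume is already gone (idempotent success), and
// 500 ("pending ... can not be deleted") means a delete is still in flight,
// so we back off before trying again.
func deleteWithBackoff(send func() int, maxTries int) error {
	delay := 100 * time.Millisecond
	for i := 0; i < maxTries; i++ {
		switch code := send(); code {
		case 200, 202: // delete accepted or completed
			return nil
		case 404: // already deleted by an earlier request: treat as success
			return nil
		case 500: // delete still pending on the server: wait, then retry
			time.Sleep(delay)
			delay *= 2 // exponential backoff
		default:
			return fmt.Errorf("unexpected status %d", code)
		}
	}
	return fmt.Errorf("delete still pending after %d attempts", maxTries)
}

func main() {
	// Simulated server: two "pending" responses, then the volume is gone.
	responses := []int{500, 500, 404}
	i := 0
	send := func() int { c := responses[i]; i++; return c }
	fmt.Println(deleteWithBackoff(send, 5)) // <nil>
}
```

With this behavior each PVC delete would produce at most a handful of requests, instead of the long run of 500s and 404s visible in the heketi log.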