Bug 1652922

Summary: Attempts at block pvc delete throw 'Delete Block Volume Build Failed' but eventually all deletes succeed
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: heketi
Version: ocs-3.11
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Keywords: ZStream
Reporter: Neha Berry <nberry>
Assignee: John Mulligan <jmulligan>
QA Contact: Neha Berry <nberry>
CC: hchiramm, kramdoss, madam, nberry, rhs-bugs, sankarshan, storage-qa-internal
Type: Bug
Last Closed: 2019-02-11 22:43:37 UTC

Description Neha Berry 2018-11-23 15:10:05 UTC
All attempts at block PVC delete throw 'Delete Block Volume Build Failed' but eventually the delete succeeds

Description of problem:
============================

In the OCS 3.11.1 build, while deleting block PVCs, we see error messages in the heketi logs and sometimes also in the gluster-blockd logs, but the delete eventually succeeds.

We wanted to confirm what could be causing so many "Error" log messages for even a single PVC deletion, when the deletion itself succeeds.


We attempted 4-5 block deletions and noticed the following:
____________________________________________________________________________


1. Even though block deletion is quite fast, for a single PVC delete request we see multiple error messages like the following in heketi (see the sketch after the excerpt):
++++++++++++++++++++++++++++++++++++++++++++++++++++
e.g. 
[negroni] Started DELETE /blockvolumes/4032fca686696705df44e42128747a98
[heketi] ERROR 2018/11/22 17:58:36 /src/github.com/heketi/heketi/apps/glusterfs/operations.go:941: Pending block volume 4032fca686696705df44e42128747a98 can not be deleted
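
The log line above suggests heketi's operations framework rejects a delete while another operation on the same volume is still in flight. Below is a minimal Go sketch of that guard pattern; it is illustrative only, not heketi's actual operations code, and the type and function names are made up:

package main

import (
	"errors"
	"fmt"
	"sync"
)

// volumeStore is a stand-in for heketi's db: a volume id is marked
// pending while an operation on it is in flight.
type volumeStore struct {
	mu      sync.Mutex
	pending map[string]bool
}

var errPending = errors.New("pending block volume can not be deleted")

// buildDelete models the "build" phase that fails in the log above:
// only the first delete request claims the volume; later requests for
// the same id are rejected until the operation completes.
func (s *volumeStore) buildDelete(id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.pending[id] {
		return errPending
	}
	s.pending[id] = true
	return nil
}

func main() {
	s := &volumeStore{pending: map[string]bool{}}
	fmt.Println(s.buildDelete("4032fca686696705df44e42128747a98")) // <nil>: first request wins
	fmt.Println(s.buildDelete("4032fca686696705df44e42128747a98")) // rejected: still pending
}

Under that pattern, one error line per extra delete request is expected while the first delete is still running.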


---------------------------------------


2. We also saw the following error message on the gluster-blockd side, once after a create and once after a delete.
++++++++++++++++++++++++++++++++++++++++++++

"glusterBlockVolumeInit(e56279db502865659c524b763c9957bf) failed [at block_svc_routines.c+4974 :<block_list_cli_1_svc_st>]"


[2018-11-22 09:57:05.757325] INFO: create request, volume=vol_e56279db502865659c524b763c9957bf blockname=blk_glusterfs_block50_f0ac8d19-ee3c-11e8-9bfd-0a580a820008 blockhosts=10.70.46.100,10.70.47.20,10.70.47.18 filename=1f403680-a878-4d08-b636-1ae851e9279e authmode=1 passwd=bbebb862-30f9-4798-84fb-9f860f8d7740 size=3221225472 [at block_svc_routines.c+4270 :<block_create_common>]
[2018-11-22 09:57:07.859370] INFO: command exit code, 0 [at block_svc_routines.c+4446 :<block_create_common>]
[2018-11-22 09:58:36.383064] ERROR: glusterBlockVolumeInit(9afa8e97bae9f92bdb19376c320fbcb3) failed [at block_svc_routines.c+4974 :<block_list_cli_1_svc_st>]
[2018-11-22 09:59:01.714822] ERROR: glusterBlockVolumeInit(e56279db502865659c524b763c9957bf) failed [at block_svc_routines.c+4974 :<block_list_cli_1_svc_st>]


[2018-11-22 10:14:49.681918] INFO: command exit code, 0 [at block_svc_routines.c+4446 :<block_create_common>]
[2018-11-22 10:16:06.034960] INFO: delete request, blockname=blk_glusterfs_file1_6f8f6a2d-ee3f-11e8-a6eb-0a580a810007 filename=f40d4303-4dfc-4838-b153-50ae1f86c94c [at block_svc_routines.c+4689 :<block_delete_1_svc_st>]
[2018-11-22 10:16:09.802075] INFO: command exit code, 0 [at block_svc_routines.c+4720 :<block_delete_1_svc_st>]
[2018-11-22 10:17:03.289160] ERROR: glusterBlockVolumeInit(e56279db502865659c524b763c9957bf) failed [at block_svc_routines.c+4974 :<block_list_cli_1_svc_st>]


-----------------------------------------------------------------

3. Even though the block volume delete succeeds, we still see more "Started DELETE" requests for the same volume after the successful delete. These then error out with "Completed 404 Not Found".
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
What could be the reason for retrying the delete even after it has succeeded? Similar behavior was seen on the other delete attempts as well. (A sketch of how a client could treat these 404s follows the log excerpt below.)



[kubeexec] DEBUG 2018/11/22 17:58:41 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:246: Host: dhcp47-20.lab.eng.blr.redhat.com Pod: glusterfs-storage-fkmqh Command: bash -c "set -o pipefail && gluster-block delete vol_e56279db502865659c524b763c9957bf/blk_glusterfs_block52_6ff02eac-ee7f-11e8-a6eb-0a580a810007 --json |tee /dev/stderr"
Result: { "SUCCESSFUL ON": [ "10.70.47.18", "10.70.47.20", "10.70.46.100" ], "RESULT": "SUCCESS" }
[negroni] Started DELETE /blockvolumes/4032fca686696705df44e42128747a98
[negroni] Started GET /queue/a06268cce654c493d44efcab2d471ee4
[negroni] Completed 200 OK in 172.054µs
[heketi] INFO 2018/11/22 17:58:41 Delete Block Volume succeeded
[asynchttp] INFO 2018/11/22 17:58:41 asynchttp.go:292: Completed job a06268cce654c493d44efcab2d471ee4 in 5.047771833s
[heketi] ERROR 2018/11/22 17:58:41 /src/github.com/heketi/heketi/apps/glusterfs/operations_manage.go:89: Delete Block Volume Build Failed: Id not found
[negroni] Completed 500 Internal Server Error in 163.907849ms
[negroni] Started DELETE /blockvolumes/4032fca686696705df44e42128747a98
[negroni] Completed 404 Not Found in 202.273µs
[negroni] Started GET /queue/a06268cce654c493d44efcab2d471ee4
[negroni] Completed 204 No Content in 170.067µs
[negroni] Started DELETE /blockvolumes/4032fca686696705df44e42128747a98
[negroni] Completed 404 Not Found in 200.545µs
[negroni] Started DELETE /blockvolumes/4032fca686696705df44e42128747a98
[negroni] Completed 404 Not Found in 91.097µs
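
Note that once the first DELETE removes the volume id from heketi's database, any retried DELETE of the same id can only get 404, which is why the 404s above look harmless. A hypothetical client-side helper is sketched in Go below; the endpoint path matches the log above, but the base URL and retry policy are assumptions, not the actual provisioner code. It treats 404 as "already deleted" and stops retrying:

package main

import (
	"fmt"
	"net/http"
	"time"
)

// deleteBlockVolume issues DELETE /blockvolumes/<id> and retries with
// backoff, treating 404 as success since the volume is already gone.
func deleteBlockVolume(base, id string) error {
	for attempt := 1; attempt <= 5; attempt++ {
		req, err := http.NewRequest(http.MethodDelete, base+"/blockvolumes/"+id, nil)
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		resp.Body.Close()
		switch resp.StatusCode {
		case http.StatusOK, http.StatusAccepted, http.StatusNoContent:
			return nil // delete accepted or finished
		case http.StatusNotFound:
			return nil // already deleted: do not retry
		}
		time.Sleep(2 * time.Second) // 500 etc.: back off and retry
	}
	return fmt.Errorf("delete of %s did not succeed after retries", id)
}

func main() {
	// base URL is a placeholder for illustration only
	_ = deleteBlockVolume("http://heketi:8080", "4032fca686696705df44e42128747a98")
}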


Version-Release number of selected component (if applicable):

  # oc rsh glusterfs-storage-dnqkj rpm -qa|grep gluster
glusterfs-fuse-3.12.2-27.el7rhgs.x86_64
python2-gluster-3.12.2-27.el7rhgs.x86_64
glusterfs-server-3.12.2-27.el7rhgs.x86_64
gluster-block-0.2.1-29.el7rhgs.x86_64
glusterfs-api-3.12.2-27.el7rhgs.x86_64
glusterfs-cli-3.12.2-27.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-27.el7rhgs.x86_64
glusterfs-libs-3.12.2-27.el7rhgs.x86_64
glusterfs-3.12.2-27.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-27.el7rhgs.x86_64

# oc rsh heketi-storage-1-6xdhf rpm -qa|grep heketi
heketi-client-8.0.0-1.el7rhgs.x86_64
heketi-8.0.0-1.el7rhgs.x86_64

# oc version => 3.11.16



How reproducible:
++++++++++++++++
5/5

Steps to Reproduce:
1. Create block PVCs and check the heketi and gluster-block logs for messages similar to "glusterBlockVolumeInit failed".
2. Delete block PVCs and check the heketi and gluster-block logs for the many "Delete Block Volume Build Failed: The target exists, contains other items, or is in use." messages.

Comment 3 John Mulligan 2018-11-28 17:57:25 UTC
From Heketi's perspective it appears to be doing the correct thing. What happens is that clients repeatedly request deletion of the same volume before that volume has been deleted by the first request. Every request to delete a volume that is already being deleted gets the "Build Failed: The target exists, contains other items, or is in use." message.

This makes me wonder whether there could be an issue where the block provisioner "spams" heketi with many requests to delete the same volume; a rough simulation of that follows.
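
As a simulation of that suspected interaction (an assumption until we have the provisioner logs), the following self-contained Go sketch models a server whose first delete takes a few seconds while a client loop re-issues DELETE every second. The retries fail while the first delete is in flight and get 404 afterwards, matching the pattern in the heketi log above:

package main

import (
	"fmt"
	"sync"
	"time"
)

// server is a toy stand-in for heketi's delete handling.
type server struct {
	mu      sync.Mutex
	pending bool // first delete in flight
	deleted bool // volume gone from the db
}

func (s *server) delete() string {
	s.mu.Lock()
	if s.deleted {
		s.mu.Unlock()
		return "404 Not Found"
	}
	if s.pending {
		s.mu.Unlock()
		return "500 Delete Block Volume Build Failed"
	}
	s.pending = true
	s.mu.Unlock()

	time.Sleep(3 * time.Second) // the real gluster-block delete takes time

	s.mu.Lock()
	s.pending, s.deleted = false, true
	s.mu.Unlock()
	return "200 OK"
}

func main() {
	s := &server{}
	go s.delete() // first request starts the real delete
	for i := 0; i < 4; i++ {
		time.Sleep(time.Second)
		fmt.Println(s.delete()) // impatient client retries every second
	}
	// prints 500s while the first delete runs, then 404s once it is done
}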

Can we either get the block provisioner logs from this same time frame, or reproduce again and this time capture both the block provisioner logs and the heketi logs?