This bug is somewhat similar to Bug 1624738 (the only exception being that heketi-cli blockvolume list doesn't display the ghost block device IDs) and https://bugzilla.redhat.com/show_bug.cgi?id=1634745.
Description of problem:
++++++++++++++++++++++++++
TC being run = CNS-1285 - Target side failures - Brick failure on block hosting volume
# oc get pods
NAME                                          READY     STATUS    RESTARTS   AGE
cirrosblock1-1-9r7lv                          1/1       Running   0          1h
glusterblock-storage-provisioner-dc-1-cvbx9   1/1       Running   3          4d
glusterfs-storage-g4slk                       1/1       Running   2          4d
glusterfs-storage-jc66v                       1/1       Running   0          3h
glusterfs-storage-rz6zt                       1/1       Running   2          4d
glusterfs-storage-z22n9                       1/1       Running   2          4d
heketi-storage-1-6fwjq                        1/1       Running   3          4d
Steps Performed
-----------------
1. Created 10 BVs (block PVCs) on a BHV (vol_2b5bc5e6bd4036c82e9e93846d92e13f) of 100GB size; all succeeded. Free size left on the BHV = 70GB.
2. Started a loop to create 10 more block PVCs (names: new-block{1..10}):
for i in {1..10}; do ./pvc-create.sh new-block$i 3; date ; sleep 2; done
Start time = Wed Oct 3 19:12:42 IST 2018
End time = Wed Oct 3 19:13:02 IST 2018
3. Immediately after starting Step #2, killed 2 bricks of the BHV (see the brick-kill sketch after step 4).
4. As expected, the PVCs stayed in Pending state as 2 bricks were down.
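A minimal sketch of the brick-kill step in #3, assuming the brick PIDs are read from "gluster volume status" (the pod and volume names are the ones from the output above; <BRICK_PID> is a placeholder):
# From the master node, enter one of the glusterfs pods (repeat on a second pod for the second brick):
oc rsh glusterfs-storage-g4slk
gluster volume status vol_2b5bc5e6bd4036c82e9e93846d92e13f   # note the brick PIDs in the "Pid" column
kill -9 <BRICK_PID>                                          # kills the brick process hosted on this node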
Issue seen in current build:
=============================
5. Checked the heketi logs: "ghost" BV IDs were still getting created, and this ultimately brought the free space of the BHV vol_2b5bc5e6bd4036c82e9e93846d92e13f down from 70GB to 1GB.
6. Once the first BHV had only 1GB free due to the creation of numerous ghost IDs (as seen in the heketi logs), a new BHV (vol_cf75072b5ef16c3b8e85f5fd3b4cab58) was created.
7. The pending PVC requests were then fulfilled from the new BHV and all went to BOUND state, even though the first BHV still had 2 bricks DOWN.
8. The free space reported at the gluster backend does not match the free space of the first BHV in heketi.
9. For BHV vol_2b5bc5e6bd4036c82e9e93846d92e13f, 2 pods show 32GB used but one pod - glusterfs-storage-g4slk - shows 23GB. This is also a bug: a mismatch between the 3 bricks of the same volume (see the verification sketch below).
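A rough sketch of how the mismatch in steps 8 and 9 can be checked, assuming the standard heketi-cli and gluster commands (exact output fields may vary between versions; the heketi volume ID is the BHV name without the vol_ prefix):
# Heketi's view of the block-hosting volume:
heketi-cli volume info 2b5bc5e6bd4036c82e9e93846d92e13f   # free size and block volume list as heketi sees them
heketi-cli blockvolume list                               # block volume IDs known to heketi
heketi-cli server operations info                         # any in-flight/stale operations
# Gluster's view of the same volume, from inside a glusterfs pod:
oc rsh glusterfs-storage-g4slk
gluster volume status vol_2b5bc5e6bd4036c82e9e93846d92e13f detail   # per-brick "Disk Space Free"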
Observations for similar test steps in earlier versions of heketi
===================================================================
Once 2 bricks of a BHV were killed, the PVC requests stayed in Pending state until the BHV bricks were brought back up.
No in-flight or ghost BV IDs were seen in the heketi db dump. Also, a second BHV was never created, as the free space of the first BHV (whose 2 bricks were down) was still intact.
Version-Release number of selected component (if applicable):
+++++++++++++++++++++++++++++++++++++++++++++++++
OC version = v3.11.15
Heketi version from heketi pod =
++++++++
sh-4.2# rpm -qa|grep heketi
heketi-client-7.0.0-13.el7rhgs.x86_64
heketi-7.0.0-13.el7rhgs.x86_64
Heketi client version from master node
+++++
# rpm -qa|grep heketi
heketi-client-7.0.0-13.el7rhgs.x86_64
Gluster version
++++++
sh-4.2# rpm -qa|grep gluster
glusterfs-libs-3.12.2-18.1.el7rhgs.x86_64
glusterfs-3.12.2-18.1.el7rhgs.x86_64
glusterfs-api-3.12.2-18.1.el7rhgs.x86_64
python2-gluster-3.12.2-18.1.el7rhgs.x86_64
glusterfs-fuse-3.12.2-18.1.el7rhgs.x86_64
glusterfs-server-3.12.2-18.1.el7rhgs.x86_64
gluster-block-0.2.1-27.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-18.1.el7rhgs.x86_64
glusterfs-cli-3.12.2-18.1.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-18.1.el7rhgs.x86_64
sh-4.2# rpm -qa|grep tcmu-runner
tcmu-runner-1.2.0-25.el7rhgs.x86_64
sh-4.2#
How reproducible:
++++++++++
Tried only once as of now.
Steps to Reproduce:
+++++++++++++++++
1. Create a few block PVCs and check the free size of the BHV.
2. Start a loop to create around 10 PVCs.
3. Immediately kill 2 bricks of the BHV and check the PVC status.
4. Check the heketi logs for the spurious BVs still getting created and for their entries being present in the heketi db dump.
# heketi-cli server operations info
5. Check the used space of the BHV from heketi - it shows all space used up.
6. Check that a new BHV is created and the pending PVCs are carved from this new BHV (see the sketch after this list).
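A condensed sketch of the checks in steps 3-6, using the pod names from the output above (the grep patterns are illustrative, not exact log messages):
oc get pvc | grep Pending                        # PVCs stay Pending while the bricks are down
oc logs heketi-storage-1-6fwjq | grep -i block   # ghost/spurious BV creations showing up in the heketi logs
heketi-cli server operations info                # in-flight operations
heketi-cli volume list                           # a second block-hosting volume appears once the first one "fills up"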
Actual results:
+++++++++++
Since the 2 bricks of a BHV are down, the BHV is getting filled up with "ghost" BV IDs, and ultimately a new BHV is getting created to service the pending PVCs.
Thus, the space in the first BHV is assumed to be full, though the gluster backend still reflects enough free space.
Expected results:
++++++++++++
As seen in older runs of these test cases, the PVCs used to stay in Pending state and no spurious BVs were seen in the heketi db dump. Thus, the space of the original BHV was kept intact.
Once the bricks were brought ONLINE, the pending PVCs were carved out from the same first BHV (a sketch of bringing the bricks back online follows).
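For reference, a minimal sketch of bringing the killed bricks back online and re-checking the PVCs, assuming "gluster volume start ... force" is used to restart the dead brick processes:
# From inside any glusterfs pod:
gluster volume start vol_2b5bc5e6bd4036c82e9e93846d92e13f force   # restarts any bricks that are not running
gluster volume status vol_2b5bc5e6bd4036c82e9e93846d92e13f        # all bricks should show Online: Y
# Back on the master node:
oc get pvc                                                        # pending PVCs should now move to Bound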
Comment 5 - krishnaram Karthick - 2018-10-08 07:06:51 UTC
Proposing this bug as a blocker. QE feels this is a release blocker, as it comes under the heketi stability category. We've had several unhappy customers for the same reason, and providing a cleanup script isn't really helping.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2019:0286