Bug 2094576 - Sub optimal block storage allocation for small volumes from OCP
Summary: Sub optimal block storage allocation for small volumes from OCP
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.5.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.5.3
Assignee: Albert Esteve
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks: 2094577
 
Reported: 2022-06-07 22:41 UTC by Germano Veit Michel
Modified: 2022-11-16 12:56 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, small qcow2 volumes on block storage were allocated 2.5 GiB (one chunk), regardless of the requested capacity. As a result, space was wasted because volumes were allocated well beyond their capacity. In this release, volumes with a capacity smaller than one chunk use their capacity, rounded up to the next extent, as the initial size. For example, capacities smaller than one extent (128 MiB) result in 128 MiB being allocated as the initial size.
Clone Of:
Environment:
Last Closed: 2022-10-03 14:00:44 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments


Links
Github oVirt vdsm pull 293: blockVolume: min sparse allocation to one extent (open, last updated 2022-08-04 15:12:06 UTC)
Red Hat Issue Tracker RHV-46368 (last updated 2022-06-07 23:08:22 UTC)
Red Hat Knowledge Base (Solution) 6962340 (last updated 2022-06-07 23:07:02 UTC)

Description Germano Veit Michel 2022-06-07 22:41:04 UTC
Description of problem:

OCP running on RHV can request small volumes to be used as PVs.
I'm opening this against the engine initially, but the fix may be needed only in VDSM; I'm unsure whether the engine should send an initial size for this or not.

See this example.

1. A RHV client sends this via API:

DEBUG:root:< POST /ovirt-engine/api/disks HTTP/1.1
DEBUG:root:< <disk>
DEBUG:root:<   <description>My disk</description>
DEBUG:root:<   <format>cow</format>
DEBUG:root:<   <name>mydisk</name>
DEBUG:root:<   <provisioned_size>10485760</provisioned_size>
DEBUG:root:<   <storage_domains>
DEBUG:root:<     <storage_domain>
DEBUG:root:<       <name>iSCSI</name>
DEBUG:root:<     </storage_domain>
DEBUG:root:<   </storage_domains>
DEBUG:root:< </disk>

So it's creating a small 10 MiB disk on a block storage domain named iSCSI.

2. The engine sends this to the SPM. Note that it only sends imageSizeInBytes (10 MiB), while imageInitialSizeInBytes is 0, so it is up to VDSM to decide what it is going to allocate as the LV size on the block storage.

2022-06-08 08:07:55,150+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (default task-9) [0763eb29-71b3-409c-a934-3c6b44369595] START, CreateVolumeVDSCommand(
CreateVolumeVDSCommandParameters:{
  storagePoolId='f35823fc-e31c-11ec-9be5-525400990001', 
  ignoreFailoverLimit='false', 
  storageDomainId='3a88a033-3ba2-4963-a63e-dca4c3e591f6', 
  imageGroupId='3e2b2a5b-be13-40fb-b95d-8b411fd5899e', 
  imageSizeInBytes='10485760', 
  volumeFormat='COW', 
  newImageId='107f9dfc-a20c-42e3-a04f-31cbffed3aa9', 
  imageType='Sparse', 
  newImageDescription='{"DiskAlias":"mydisk","DiskDescription":"My disk"}', 
  imageInitialSizeInBytes='0', 
  imageId='00000000-0000-0000-0000-000000000000', 
  sourceImageGroupId='00000000-0000-0000-0000-000000000000', 
  shouldAddBitmaps='false', 
  legal='true', 
  sequenceNumber='1', 
  bitmap='null'}
), log id: 2401414f

3. VDSM ends up here to determine the LV allocation size:

   319	    def calculate_volume_alloc_size(
   ...
   342	        if preallocate == sc.SPARSE_VOL:
   343	            # Sparse qcow2
   344	            if initial_size:
   345	                # TODO: if initial_size == max_size, we exceed the max_size
   346	                # here. This should be fixed, but first we must check that
   347	                # engine is not assuming that vdsm will increase initial size
   348	                # like this.
   349	                alloc_size = int(initial_size * QCOW_OVERHEAD_FACTOR)
   350	            else:
   351	                chunk_size_mb = config.getint("irs",
   352	                                              "volume_utilization_chunk_mb")       <------- gets this, as initial_size (imageInitialSizeInBytes from engine) is 0/None
   353	                alloc_size = chunk_size_mb * MiB


And since this patch, the chunk size is 2.5 GiB instead of 1 GiB.

commit a90335c8f6c5719c3d095dd3f6feb69be4c1c184
Date:   Tue Feb 22 07:59:36 2022 +0200

    config: Increase thin provisioning threshold


4. So we end up with a 2.5G LV holding a 10M qcow2:

LV                                   VG                                   Attr       LSize                                                                    
107f9dfc-a20c-42e3-a04f-31cbffed3aa9 3a88a033-3ba2-4963-a63e-dca4c3e591f6 -wi-------   2.50g  

# qemu-img info /dev/3a88a033-3ba2-4963-a63e-dca4c3e591f6/107f9dfc-a20c-42e3-a04f-31cbffed3aa9 | grep virtual
virtual size: 10 MiB (10485760 bytes)

CAP=10485760
CTIME=1654639675
DESCRIPTION={"DiskAlias":"mydisk","DiskDescription":"My disk"}
DISKTYPE=DATA
DOMAIN=3a88a033-3ba2-4963-a63e-dca4c3e591f6
FORMAT=COW
GEN=0
IMAGE=3e2b2a5b-be13-40fb-b95d-8b411fd5899e
LEGALITY=LEGAL
PUUID=00000000-0000-0000-0000-000000000000
SEQ=1
TYPE=SPARSE
VOLTYPE=LEAF
EOF

In the DB:


              image_guid              |   size   | actual_size 
--------------------------------------+----------+-------------
 107f9dfc-a20c-42e3-a04f-31cbffed3aa9 | 10485760 |  2684354560


And the discrepancy tool picks it up, as it's designed to detect a snapshot bug that makes the base volume grow.

 Checking storage domain 'iSCSI' (3a88a033-3ba2-4963-a63e-dca4c3e591f6) from data-center  'Default'
	 Image 107f9dfc-a20c-42e3-a04f-31cbffed3aa9 size on the storage(2684354560) is larger than the capacity(10485760)

I can easily make it stop flagging this, but the whole thing looks bad. It seems we could have a 128 MiB LV here; still a waste,
but not as much. Then I could make the discrepancy tool ignore up to 128 MiB only, not up to 2.5 GiB.


Version-Release number of selected component (if applicable):
ovirt-engine-4.5.0.7-0.9.el8ev.noarch
vdsm-4.50.0.13-1.el8ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Use the example below to create a 10 MiB disk - just change 2**30 to 2**20 (a minimal sketch of the change follows the link):
   https://raw.githubusercontent.com/oVirt/python-ovirt-engine-sdk4/main/examples/add_floating_disk.py
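
   A minimal sketch of that change, assuming placeholder connection details (engine URL, credentials, CA file) and the storage domain name used above:

   import ovirtsdk4 as sdk
   import ovirtsdk4.types as types

   connection = sdk.Connection(
       url='https://engine.example.com/ovirt-engine/api',  # placeholder engine URL
       username='admin@internal',
       password='password',
       ca_file='ca.pem',
   )
   disks_service = connection.system_service().disks_service()
   disks_service.add(
       types.Disk(
           name='mydisk',
           description='My disk',
           format=types.DiskFormat.COW,
           provisioned_size=10 * 2**20,   # 10 MiB instead of the example's 2**30 (1 GiB)
           storage_domains=[types.StorageDomain(name='iSCSI')],
       )
   )
   connection.close()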

Actual results:
* Bad use of resources; this can fill up the VG quickly if there are many such volumes

Expected results:
* Can it use the minimum LV extent size (128M) at least? It's still a waste for small volumes, but not as bad as 2.5G

Comment 2 Nir Soffer 2022-06-08 09:06:27 UTC
The engine does not need to send an initial size in this case; it should send the disk size.
Vdsm should allocate a 128 MiB LV for this volume, but I'm not sure we tested allocation
of such tiny disks.

Avihai tells me that the smallest disk we test is 1 GiB. I think this should be easy
to fix in vdsm, so you can move the bug to vdsm.

Comment 3 Germano Veit Michel 2022-06-08 21:33:01 UTC
Right, maybe adjusting the logic in calculate_volume_alloc_size() is enough.

Moving to VDSM, thank you.

Comment 4 Arik 2022-06-13 14:36:29 UTC
We don't really understand why such small disks are used, but if they are needed it would be better to use file storage or, if all disks are that small, to change the chunk size.

Comment 5 Nir Soffer 2022-06-13 19:19:48 UTC
Changing the chunk size to a tiny size (e.g. 128 MiB) is very bad and
will cause VMs to pause quickly, even with the recent improvements in the
mailbox.

This should be a trivial fix and a good first issue.

Comment 6 Germano Veit Michel 2022-06-16 02:51:46 UTC
Nir, would you mind explaining the solution that will be implemented here? Are you going to allocate 128 MiB for anything that's below 128 MiB?

Asking so I can move on with BZ 2094577 and stop flagging these small volumes.

Thanks!

Comment 7 Arik 2022-06-16 06:15:38 UTC
(In reply to Nir Soffer from comment #5)
> Changing the chunk size to tiny size (e.g. 128 MiB) is very bad and
> will cause VMs to pause quickly even with recent improvements in the
> mailbox.

Right, but note the "if all disks are that small" part that came before that - IIRC the installer of OCP on RHV was changed to create preallocated disks for the VMs, so if all the thin-provisioned disks are those small PVs...

Anyway, we had a similar issue with the metadata volumes (for hibernated VMs or snapshots with memory). I don't remember all the details of what was discussed back then, but eventually we ended up creating them as raw+preallocated. Wouldn't it make sense to do the same for those disks?

Comment 8 Nir Soffer 2022-06-16 07:25:26 UTC
(In reply to Arik from comment #7)
> (In reply to Nir Soffer from comment #5)
> > Changing the chunk size to tiny size (e.g. 128 MiB) is very bad and
> > will cause VMs to pause quickly even with recent improvements in the
> > mailbox.
> 
> Right but note the "if all disks are that small" part that came before that
> - IIRC the installer of OCP on RHV was changed to create preallocated disks
> for the VMs so if all the thin provisioned disks are those small PVs..
> 
> Anyway, we had a similar issue with the metadata volumes (for hibernated vms
> or snapshots with memory) and I don't remember all the details of what was
> discussed back then but eventually we ended up setting them with
> raw+preallocated, wouldn't it make sense to do the same for those disks?

For metadata disks we wrongly used qcow2 format and then corrupted the disk
by writing over the qcow2 header.

In this case the user asked to create qcow2 disk with small virtual size
(10m).

disk = disks_service.add(
    types.Disk(
        name='mydisk',
        description='My disk',
        format=types.DiskFormat.COW,
        provisioned_size=10 * 2**20,
        storage_domains=[
            types.StorageDomain(
                name='mydata'
            )
        ]
    )
)

Vdsm ignored the requested virtual size and allocated one chunk
(2.5g) instead of one extent (128m). This is a trivial error in
the code computing the logical volume size for a new disk.

Comment 9 Nir Soffer 2022-06-16 07:37:51 UTC
(In reply to Germano Veit Michel from comment #6)
> Nir, would you mind explaning the solution that will be done here? Are you
> going to allocate 128M for anything thats below 128MB?

Yes, with block storage we cannot allocate less than 128 MiB. If you create
10 MiB disk you will get a 128 MiB logical volume with 10 MiB qcow2 image.
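
A minimal sketch (not the actual vdsm patch, see the linked pull request) of that computation for sparse qcow2 volumes without an explicit initial size, assuming a 128 MiB extent and the 2.5 GiB chunk default:

MiB = 1024**2
EXTENT_SIZE = 128 * MiB     # LVM extent size used by RHV block storage domains
CHUNK_SIZE = 2560 * MiB     # volume_utilization_chunk_mb default (2.5 GiB)

def round_up(value, step):
    # Round value up to the next multiple of step.
    return ((value + step - 1) // step) * step

def sparse_initial_alloc(capacity):
    # At most one chunk, at least one extent, and no more than the
    # requested capacity rounded up to the next extent.
    return min(CHUNK_SIZE, max(EXTENT_SIZE, round_up(capacity, EXTENT_SIZE)))

print(sparse_initial_alloc(10 * MiB))   # 134217728 bytes == 128 MiB, not 2.5 GiB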

Comment 10 Arik 2022-06-16 07:55:50 UTC
(In reply to Nir Soffer from comment #8)
> In this case the user asked to create qcow2 disk with small virtual size
> (10m).
> 
> disk = disks_service.add(
>     types.Disk(
>         name='mydisk',
>         description='My disk',
>         format=types.DiskFormat.COW,
>         provisioned_size=10 * 2**20,
>         storage_domains=[
>             types.StorageDomain(
>                 name='mydata'
>             )
>         ]
>     )
> )
> 
> Vdsm ignored the requested virtual size and allocated one chunk
> (2.5g) instead of one extent (128m). This is a trivial error in
> the code computing the logical volume size for a new disk.

Yeah, but in this case it's not really the user that asks for that; it's the OCP client that determines the format, I suppose.
Any reason to have them thin-provisioned? From my point of view, converting such small volumes to RAW+preallocated is also OK if it's just for OCP on RHV and they cannot change it at the moment...

Comment 11 Nir Soffer 2022-06-16 08:03:16 UTC
(In reply to Arik from comment #10)
> (In reply to Nir Soffer from comment #8)
> > In this case the user asked to create qcow2 disk with small virtual size
> > (10m).
> > 
> > disk = disks_service.add(
> >     types.Disk(
> >         name='mydisk',
> >         description='My disk',
> >         format=types.DiskFormat.COW,
> >         provisioned_size=10 * 2**20,
> >         storage_domains=[
> >             types.StorageDomain(
> >                 name='mydata'
> >             )
> >         ]
> >     )
> > )
> > 
> > Vdsm ignored the requested virtual size and allocated one chunk
> > (2.5g) instead of one extent (128m). This is a trivial error in
> > the code computing the logical volume size for a new disk.
> 
> Yeah, but in this case it's not really the user that asks that, it's the OCP
> client that determines the format I suppose
> Any reason to have them thin-provisioned? from my point of view, converting
> such small volumes to RAW+preallocated is also ok if it's just for OCP on
> RHV and they cannot change it at the moment..

I don't have any idea why they create small qcow2 disks, but it is clear that
the vdsm behavior is wrong.

Germano, do you know why these tiny disks are created? I don't see a real use
case for a 10 MiB disk except testing.

Comment 12 Germano Veit Michel 2022-06-16 22:26:21 UTC
(In reply to Nir Soffer from comment #11)
> Germano, do you know why these tiny disks are created? I don't see a real use
> case for a 10 MiB disk except testing.

This would come from "apps" running on OCP. I'm not aware of a PVC required by the
default cluster components that would need such small storage.

It depends on which app is running and what its needs are; it's not in our hands,
as one can decide to run anything in OCP. With microservices and friends, it may be
expected?

The customer has just one of these on their entire OCP cluster; everything else is 1G+ or 2.5G+.

Comment 13 Arik 2022-06-19 07:19:09 UTC
(In reply to Germano Veit Michel from comment #12)
> Customer has just one of these on his entire OCP cluster, everything else is
> 1G+ or 2.5G+.

OK, that makes more sense.
So I'm not sure about the cost-effectiveness of dealing with this: I see the point of fixing incorrect VDSM behavior, but this change would also require a change on the engine side, and that scenario should really not be that common. Nir mentioned possible corruption with metadata volumes in comment 8 (I may not have been involved in those discussions), but even without the corruption, we chose not to mess with such small volumes when we got reports of 1G being allocated for them on block storage. I don't think this bug really justifies reconsidering that at this point...

Comment 14 Germano Veit Michel 2022-06-19 21:31:52 UTC
Arik,

I'm not sure if I'm missing something - are you saying this should be WONTFIX?
What on the engine side needs to be fixed here? Isn't it just a matter of not using volume_utilization_chunk_mb as the initial size in calculate_volume_alloc_size()?

But indeed, this is not a severe bug, and I can make the discrepancy tool ignore wasted storage space under 2.5G so it is not triggered by these. It's just not very correct...

Comment 15 Arik 2022-06-20 06:16:08 UTC
(In reply to Nir Soffer from comment #2)
> Engine does not need to send initial size in this case. It should send the
> disk size.

OK, I misinterpreted this part. If "disk size" refers to the virtual size of the disk, I believe that is sent already, and then we don't need to change the engine side.
But even then, what's the point of defining thin-provisioned (sparse) disks of this size? We don't want the mailbox mechanism to come into play and monitor those volumes, right? I'm not suggesting to close it as WONTFIX; I suggest applying the same approach that was applied for metadata volumes (regardless of the reason - whether it was to prevent corruption as Nir wrote, or not to allocate the previous chunk size of 1G), which is to create them as preallocated volumes.

Comment 16 Germano Veit Michel 2022-06-20 07:47:01 UTC
(In reply to Arik from comment #15) 
> but even then, what's the point of defining thin-provisioned (sparse) disks
> in this size? we don't want the mailbox mechanism to come into play and
> monitor those volumes, right?
True, a tiny sparse disk saves nothing. But it could be extended later (who knows);
a client can do whatever it wants, and RHV should do what it is told via the API.

We need to test what the latest version of the oVirt CSI driver is doing; there
might be another bug there.

> I'm not suggest to close it as WONTFIX, I
> suggest to apply the same approach that was applied for metadata volumes
> (regardless of the reason - whether it was to prevent corrupted as Nir wrote
> or not to allocate the previous chunk size of 1G) which is to create them as
> preallocated volumes

You mean to automatically switch thin to preallocated if the size is smaller than X,
regardless of what the client requests? What if it's extended later? These are
not system disks, and the user/client can in theory do anything with them.

And blocking it may break existing things...

Comment 17 Arik 2022-06-20 08:48:09 UTC
Right, unlike the metadata volumes, these volumes can be extended - but that's OK, preallocated volumes can be extended too. If you mean that they can be extended up to some size for which it would make sense to use thin provisioning - I'm not sure.

Comment 18 Michal Skrivanek 2022-07-30 09:37:17 UTC
What's the plan?
It doesn't seem worth addressing if it's a single PV in the whole setup, i.e. not frequent.

Comment 19 Arik 2022-07-31 09:48:37 UTC
(In reply to Michal Skrivanek from comment #18)
> it doesn't seem to be worth addressing if it's a single PV in the whole
> setup, i.e. not frequent.

Yeah, and Germano changed rhv-image-discrepancies (bz 2094577) so that this would be less visible. But it really sounds like a fairly easy fix, and if with that we can also improve the experience of OCP on RHV in cases where many workloads ask for such small volumes, then I think it is worth giving what Nir suggested above a try.

Comment 20 meital avital 2022-08-04 07:37:29 UTC
Due to QE capacity, we are not going to cover this issue in our automation

Comment 22 Casper (RHV QE bot) 2022-09-21 08:30:37 UTC
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align relevant severity, flags and keywords to raise PM_Score, or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).

Comment 25 Casper (RHV QE bot) 2022-10-03 14:00:44 UTC
This bug has low overall severity and passed an automated regression suite, and is not going to be further verified by QE. If you believe special care is required, feel free to re-open to ON_QA status.

