Description of problem:

OCP running on RHV can request some smaller volumes to be used as PVs. I'm opening this against the engine initially, but the fix may be just in VDSM; I'm unsure whether the engine should send an initial size for this or not. See this example.

1. A RHV client sends this via the API:

DEBUG:root:< POST /ovirt-engine/api/disks HTTP/1.1
DEBUG:root:< <disk>
DEBUG:root:<   <description>My disk</description>
DEBUG:root:<   <format>cow</format>
DEBUG:root:<   <name>mydisk</name>
DEBUG:root:<   <provisioned_size>10485760</provisioned_size>
DEBUG:root:<   <storage_domains>
DEBUG:root:<     <storage_domain>
DEBUG:root:<       <name>iSCSI</name>
DEBUG:root:<     </storage_domain>
DEBUG:root:<   </storage_domains>
DEBUG:root:< </disk>

So it's creating a small 10M disk on a block storage domain named iSCSI.

2. The engine sends this to the SPM. Note it only sends imageSizeInBytes (10M), but imageInitialSizeInBytes is 0, so it's up to VDSM to decide what it's going to allocate as the LV size on the block storage.

2022-06-08 08:07:55,150+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (default task-9) [0763eb29-71b3-409c-a934-3c6b44369595] START, CreateVolumeVDSCommand(
 CreateVolumeVDSCommandParameters:{
  storagePoolId='f35823fc-e31c-11ec-9be5-525400990001',
  ignoreFailoverLimit='false',
  storageDomainId='3a88a033-3ba2-4963-a63e-dca4c3e591f6',
  imageGroupId='3e2b2a5b-be13-40fb-b95d-8b411fd5899e',
  imageSizeInBytes='10485760',
  volumeFormat='COW',
  newImageId='107f9dfc-a20c-42e3-a04f-31cbffed3aa9',
  imageType='Sparse',
  newImageDescription='{"DiskAlias":"mydisk","DiskDescription":"My disk"}',
  imageInitialSizeInBytes='0',
  imageId='00000000-0000-0000-0000-000000000000',
  sourceImageGroupId='00000000-0000-0000-0000-000000000000',
  shouldAddBitmaps='false',
  legal='true',
  sequenceNumber='1',
  bitmap='null'}
), log id: 2401414f

3. VDSM ends up here to define the LV size:

319 def calculate_volume_alloc_size(
...
342     if preallocate == sc.SPARSE_VOL:
343         # Sparse qcow2
344         if initial_size:
345             # TODO: if initial_size == max_size, we exceed the max_size
346             # here. This should be fixed, but first we must check that
347             # engine is not assuming that vdsm will increase initial size
348             # like this.
349             alloc_size = int(initial_size * QCOW_OVERHEAD_FACTOR)
350         else:
351             chunk_size_mb = config.getint("irs",
352                 "volume_utilization_chunk_mb")  <------- gets this, as initial_size (imageInitialSizeInBytes from engine) is 0/None
353             alloc_size = chunk_size_mb * MiB

And since this patch, this is now 2.5G instead of 1G:

commit a90335c8f6c5719c3d095dd3f6feb69be4c1c184
Date:   Tue Feb 22 07:59:36 2022 +0200

    config: Increase thin provisioning threshold

4. So we end up with a 2.5G LV holding a 10M qcow2:

LV                                   VG                                   Attr       LSize
107f9dfc-a20c-42e3-a04f-31cbffed3aa9 3a88a033-3ba2-4963-a63e-dca4c3e591f6 -wi------- 2.50g

# qemu-img info /dev/3a88a033-3ba2-4963-a63e-dca4c3e591f6/107f9dfc-a20c-42e3-a04f-31cbffed3aa9 | grep virtual
virtual size: 10 MiB (10485760 bytes)

CAP=10485760
CTIME=1654639675
DESCRIPTION={"DiskAlias":"mydisk","DiskDescription":"My disk"}
DISKTYPE=DATA
DOMAIN=3a88a033-3ba2-4963-a63e-dca4c3e591f6
FORMAT=COW
GEN=0
IMAGE=3e2b2a5b-be13-40fb-b95d-8b411fd5899e
LEGALITY=LEGAL
PUUID=00000000-0000-0000-0000-000000000000
SEQ=1
TYPE=SPARSE
VOLTYPE=LEAF
EOF

In the DB:

              image_guid              |   size   | actual_size
--------------------------------------+----------+-------------
 107f9dfc-a20c-42e3-a04f-31cbffed3aa9 | 10485760 |  2684354560

And the discrepancy tool picks it up, as it's designed to catch a snapshot bug that makes the base volume grow:

Checking storage domain 'iSCSI' (3a88a033-3ba2-4963-a63e-dca4c3e591f6) from data-center 'Default'
  Image 107f9dfc-a20c-42e3-a04f-31cbffed3aa9 size on the storage(2684354560) is larger than the capacity(10485760)

I can easily make it stop picking this up, but the whole thing looks bad. It seems we could have a 128M LV here, still a waste but not as much.
And then I can make the discrepancy tool ignore only 128M of waste, not up to 2.5G.

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.0.7-0.9.el8ev.noarch
vdsm-4.50.0.13-1.el8ev.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Use the script below to create a 10M disk, just change 2**30 to 2**20:
https://raw.githubusercontent.com/oVirt/python-ovirt-engine-sdk4/main/examples/add_floating_disk.py

Actual results:
* Bad use of resources; this can fill up the VG quickly if there are many such disks

Expected results:
* Can it use the minimum LV extent size (128M) at least? It's still a waste for small volumes, but not as bad as 2.5G.
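The expected behavior could be sketched roughly as follows. This is a hypothetical illustration, not the actual vdsm code: the constants (128 MiB extent size, 1.1 qcow2 overhead factor, 2560 MiB default chunk) and the function names are assumptions based on this report, and the idea is simply to cap the default chunk allocation at the disk's virtual size (plus qcow2 overhead), rounded up to whole extents.

```python
# Hypothetical sketch of a fixed allocation computation (not vdsm code).
# Assumed values, taken from this bug report:
MiB = 1024**2
EXTENT_SIZE = 128 * MiB        # block storage domain extent size
QCOW_OVERHEAD_FACTOR = 1.1     # qcow2 metadata overhead estimate
CHUNK_SIZE = 2560 * MiB        # volume_utilization_chunk_mb default (2.5G)

def round_up_to_extent(size):
    """Round a byte size up to a whole number of LVM extents."""
    return ((size + EXTENT_SIZE - 1) // EXTENT_SIZE) * EXTENT_SIZE

def sparse_cow_alloc_size(virtual_size, initial_size=None):
    """Initial LV size for a sparse qcow2 volume on block storage."""
    if initial_size:
        alloc = int(initial_size * QCOW_OVERHEAD_FACTOR)
    else:
        # Never allocate more than the disk can actually grow to
        # (plus qcow2 overhead), instead of always taking one chunk.
        alloc = min(CHUNK_SIZE, int(virtual_size * QCOW_OVERHEAD_FACTOR))
    return round_up_to_extent(alloc)

# A 10 MiB disk gets one 128 MiB extent instead of a 2.5 GiB chunk:
print(sparse_cow_alloc_size(10 * MiB) // MiB)   # 128
```

Large disks would still get the full chunk under this sketch; only disks smaller than the chunk size would see a smaller initial LV.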
Engine does not need to send an initial size in this case. It should send the disk size. Vdsm should allocate a 128m LV for this volume, but I'm not sure we tested allocation of such tiny disks. Avihai tells me that the smallest disk we test is 1g. I think this should be easy to fix in vdsm, so you can move the bug to vdsm.
Right, maybe adjusting the logic in calculate_volume_alloc_size() is enough. Moving to VDSM, thank you.
We don't really understand why such small disks would be used, but in case they are needed it would be better to use file storage or, if all disks are that small, to change the chunk size.
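For reference, the chunk size mentioned here is the `volume_utilization_chunk_mb` option in the `[irs]` section of the vdsm configuration (the same option read by calculate_volume_alloc_size in the description). A hypothetical override might look like this; the value shown is only an example, and the next comment explains why lowering it is risky:

```ini
# /etc/vdsm/vdsm.conf -- hypothetical example, not a recommendation.
# Smaller chunks mean thin volumes are extended in smaller steps,
# which increases the risk of VM pauses under heavy write load.
[irs]
volume_utilization_chunk_mb = 1024
```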
Changing the chunk size to a tiny size (e.g. 128 MiB) is very bad and will cause VMs to pause quickly, even with the recent improvements in the mailbox. This should be a trivial fix and a good first issue.
Nir, would you mind explaining the solution that will be done here? Are you going to allocate 128M for anything that's below 128M? Asking so I can move on with BZ2094577 to stop flagging these small volumes. Thanks!
(In reply to Nir Soffer from comment #5)
> Changing the chunk size to a tiny size (e.g. 128 MiB) is very bad and
> will cause VMs to pause quickly, even with the recent improvements in the
> mailbox.

Right, but note the "if all disks are that small" part that came before that - IIRC the installer of OCP on RHV was changed to create preallocated disks for the VMs, so all the remaining thin-provisioned disks are those small PVs..

Anyway, we had a similar issue with the metadata volumes (for hibernated VMs or snapshots with memory). I don't remember all the details of what was discussed back then, but eventually we ended up setting them as raw+preallocated. Wouldn't it make sense to do the same for those disks?
(In reply to Arik from comment #7)
> (In reply to Nir Soffer from comment #5)
> > Changing the chunk size to a tiny size (e.g. 128 MiB) is very bad and
> > will cause VMs to pause quickly, even with the recent improvements in the
> > mailbox.
>
> Right, but note the "if all disks are that small" part that came before that
> - IIRC the installer of OCP on RHV was changed to create preallocated disks
> for the VMs, so all the remaining thin-provisioned disks are those small PVs..
>
> Anyway, we had a similar issue with the metadata volumes (for hibernated VMs
> or snapshots with memory). I don't remember all the details of what was
> discussed back then, but eventually we ended up setting them as
> raw+preallocated. Wouldn't it make sense to do the same for those disks?

For metadata disks we wrongly used the qcow2 format and then corrupted the disk by writing over the qcow2 header.

In this case the user asked to create a qcow2 disk with a small virtual size (10m):

    disk = disks_service.add(
        types.Disk(
            name='mydisk',
            description='My disk',
            format=types.DiskFormat.COW,
            provisioned_size=10 * 2**20,
            storage_domains=[
                types.StorageDomain(
                    name='mydata'
                )
            ]
        )
    )

Vdsm ignored the requested virtual size and allocated one chunk (2.5g) instead of one extent (128m). This is a trivial error in the code computing the logical volume size for a new disk.
(In reply to Germano Veit Michel from comment #6)
> Nir, would you mind explaining the solution that will be done here? Are you
> going to allocate 128M for anything that's below 128M?

Yes, with block storage we cannot allocate less than 128 MiB. If you create a 10 MiB disk you will get a 128 MiB logical volume holding a 10 MiB qcow2 image.
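The 128 MiB floor described here can be illustrated with a small rounding helper. This is only a sketch: the 128 MiB extent size is taken from this thread, qcow2 overhead is ignored for simplicity, and the function name is made up for illustration.

```python
# Illustration of the 128 MiB LV floor on block storage (sketch only).
EXTENT_SIZE_MIB = 128  # block storage domain extent size, per this thread

def lv_size_mib(virtual_size_mib):
    """Smallest LV (in MiB) holding a disk of the given virtual size,
    rounded up to whole extents; qcow2 overhead is ignored here."""
    extents = -(-virtual_size_mib // EXTENT_SIZE_MIB)  # ceiling division
    return max(1, extents) * EXTENT_SIZE_MIB

for size in (10, 128, 200, 1024):
    print(size, "->", lv_size_mib(size))
# 10 -> 128, 128 -> 128, 200 -> 256, 1024 -> 1024
```

So any disk of 128 MiB or less would cost exactly one extent, and larger sparse disks would round up to the next extent boundary.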
(In reply to Nir Soffer from comment #8)
> In this case the user asked to create a qcow2 disk with a small virtual size
> (10m):
>
>     disk = disks_service.add(
>         types.Disk(
>             name='mydisk',
>             description='My disk',
>             format=types.DiskFormat.COW,
>             provisioned_size=10 * 2**20,
>             storage_domains=[
>                 types.StorageDomain(
>                     name='mydata'
>                 )
>             ]
>         )
>     )
>
> Vdsm ignored the requested virtual size and allocated one chunk
> (2.5g) instead of one extent (128m). This is a trivial error in
> the code computing the logical volume size for a new disk.

Yeah, but in this case it's not really the user that asks for that; it's the OCP client that determines the format, I suppose. Any reason to have them thin-provisioned? From my point of view, converting such small volumes to RAW+preallocated is also ok if it's just for OCP on RHV and they cannot change it at the moment.
(In reply to Arik from comment #10)
> (In reply to Nir Soffer from comment #8)
> > In this case the user asked to create a qcow2 disk with a small virtual
> > size (10m):
> >
> >     disk = disks_service.add(
> >         types.Disk(
> >             name='mydisk',
> >             description='My disk',
> >             format=types.DiskFormat.COW,
> >             provisioned_size=10 * 2**20,
> >             storage_domains=[
> >                 types.StorageDomain(
> >                     name='mydata'
> >                 )
> >             ]
> >         )
> >     )
> >
> > Vdsm ignored the requested virtual size and allocated one chunk
> > (2.5g) instead of one extent (128m). This is a trivial error in
> > the code computing the logical volume size for a new disk.
>
> Yeah, but in this case it's not really the user that asks for that; it's the
> OCP client that determines the format, I suppose. Any reason to have them
> thin-provisioned? From my point of view, converting such small volumes to
> RAW+preallocated is also ok if it's just for OCP on RHV and they cannot
> change it at the moment.

I don't have any idea why they create small qcow2 disks, but it is clear that vdsm's behavior is wrong.

Germano, do you know why these tiny disks are created? I don't see a real use case for a 10 MiB disk except testing.
(In reply to Nir Soffer from comment #11)
> Germano, do you know why these tiny disks are created? I don't see a real
> use case for a 10 MiB disk except testing.

This would come from "apps" running on OCP. I'm not aware of a PVC required by the default cluster components that would need such small storage. It depends on which app is running and what its needs are; it's not in our hands, as one can decide to run anything in OCP. With microservices and friends, it may be expected?

The customer has just one of these on his entire OCP cluster; everything else is 1G+ or 2.5G+.
(In reply to Germano Veit Michel from comment #12)
> The customer has just one of these on his entire OCP cluster; everything
> else is 1G+ or 2.5G+.

OK, that makes more sense. So I'm not sure about the cost-effectiveness of dealing with this. I see the point of fixing incorrect behavior of VDSM, but this change would also require a change on the engine side, and that scenario should really not be that common. Nir mentioned possible corruption with metadata volumes in comment 8 (I may not have been involved in those discussions), but even without the corruption we chose not to mess with such small volumes when we got reports of 1G being allocated for them on block storage. I don't think this bug really justifies reconsidering that at this point.
Arik, I'm not sure if I'm missing something; are you saying this should be WONTFIX? What on the engine side needs to be fixed here? Isn't it just a matter of not using volume_utilization_chunk_mb as the initial size in calculate_volume_alloc_size()?

But indeed, this is not a severe bug, and I can make the discrepancy tool ignore wasted storage space under 2.5G so it's not triggered by these. It's just not very correct...
(In reply to Nir Soffer from comment #2)
> Engine does not need to send an initial size in this case. It should send
> the disk size.

OK, I misinterpreted this part. If "disk size" refers to the virtual size of the disk, I believe that is sent already, and then we don't need to change the engine side.

But even then, what's the point of defining thin-provisioned (sparse) disks of this size? We don't want the mailbox mechanism to come into play and monitor those volumes, right? I'm not suggesting to close it as WONTFIX; I suggest applying the same approach that was applied to the metadata volumes (regardless of the reason, whether it was to prevent corruption as Nir wrote, or to avoid allocating the previous chunk size of 1G), which is to create them as preallocated volumes.
(In reply to Arik from comment #15)
> But even then, what's the point of defining thin-provisioned (sparse) disks
> of this size? We don't want the mailbox mechanism to come into play and
> monitor those volumes, right?

True, a tiny sparse disk saves nothing. But it could be extended later (who knows); a client can do whatever it wants, and RHV should do whatever it's being told via the API. We need to test what the latest version of the oVirt CSI driver is doing; there might be another bug there.

> I'm not suggesting to close it as WONTFIX; I suggest applying the same
> approach that was applied to the metadata volumes (regardless of the reason,
> whether it was to prevent corruption as Nir wrote, or to avoid allocating
> the previous chunk size of 1G), which is to create them as preallocated
> volumes.

You mean to automatically switch thin to prealloc if the size is smaller than some X, regardless of what the client requests? What if it's extended later? These are not system disks, and the user/client can in theory do anything with them. Blocking that may break existing things...
Right, unlike the metadata volumes these volumes can be extended, but that's ok: preallocated volumes can be extended too. If you mean that they can be extended up to some size at which it would make sense to use thin provisioning - I'm not sure.
What's the plan? It doesn't seem to be worth addressing if it's a single PV in the whole setup, i.e. not frequent.
(In reply to Michal Skrivanek from comment #18)
> It doesn't seem to be worth addressing if it's a single PV in the whole
> setup, i.e. not frequent.

Yeah, and Germano changed rhv-image-discrepancies (bz 2094577) so that this would be less visible. But it really sounds like a fairly easy fix, and if with it we can also improve the experience of OCP on RHV in cases with many workloads asking for such small volumes, then I think it's worth giving what Nir suggested above a try.
Due to QE capacity, we are not going to cover this issue in our automation.
This bug has low overall severity and is not going to be further verified by QE. If you believe special care is required, feel free to properly align relevant severity, flags and keywords to raise PM_Score, or use one of the Bumps ('PrioBumpField', 'PrioBumpGSS', 'PrioBumpPM', 'PrioBumpQA') in Keywords to raise its PM_Score above the verification threshold (1000).
This bug has low overall severity and passed an automated regression suite, and is not going to be further verified by QE. If you believe special care is required, feel free to re-open to ON_QA status.