Description of problem:
When cloning a VM from a QCOW2-based template on a block-based storage domain, the system creates a VM whose disk ends up with actual size equal to virtual size. While the VM is still being added, the VM's disk appears to have the same actual and virtual sizes as the template it was cloned from; only after the system finishes adding the VM does the disk's actual size get updated to match the virtual size. This behavior wastes a significant amount of storage space.

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.10-0.17.el8ev.noarch

Steps to Reproduce:
1. Create a template of a QCOW2 disk on block-based storage (iSCSI).
2. Validate that the virtual size of the template disk is larger than its actual size.
3. Create a VM from the template via the UI (in the Resource Allocation window choose: format -> QCOW2, target -> iSCSI, disk profile -> iSCSI).
4. After approving the creation of the VM, the UI shows that the VM disk's virtual and actual sizes are identical to the template's (virtual size > actual size). Wait until the VM creation completes; the actual and virtual sizes then become equal.

Actual results:
The actual and virtual sizes of the VM disk created from the template on a block-based storage domain using QCOW2 format are the same.

Expected results:
On this type of storage domain with this disk format, the virtual size of the disk should be larger than the actual size (similar to the template's actual and virtual sizes).

Additional info:
Info regarding the template disk I used:

<name>latest-rhel-guest-image-7.9-infra</name>
<description>latest-rhel-guest-image-7.9-infra (b200508)</description>
<link href="/ovirt-engine/api/disks/9321e99e-ddd2-4be2-a41c-b35d5c7b5ff5/disksnapshots" rel="disksnapshots"/>
<link href="/ovirt-engine/api/disks/9321e99e-ddd2-4be2-a41c-b35d5c7b5ff5/permissions" rel="permissions"/>
<link href="/ovirt-engine/api/disks/9321e99e-ddd2-4be2-a41c-b35d5c7b5ff5/statistics" rel="statistics"/>
<actual_size>2482417664</actual_size>
<alias>latest-rhel-guest-image-7.9-infra</alias>
<backup>none</backup>
<content_type>data</content_type>
<format>cow</format>
<image_id>7bd24e3e-7959-4bb5-b45a-02d361521104</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>10737418240</provisioned_size>
<qcow_version>qcow2_v3</qcow_version>
<shareable>false</shareable>
<sparse>true</sparse>
<status>ok</status>
<storage_type>image</storage_type>
<total_size>2482417664</total_size>
<wipe_after_delete>false</wipe_after_delete>
<disk_profile href="/ovirt-engine/api/diskprofiles/969fb64c-e4ab-4ae2-a8e2-772438eb519e" id="969fb64c-e4ab-4ae2-a8e2-772438eb519e"/>
<quota href="/ovirt-engine/api/datacenters/3a5f14bd-cf8c-42df-8e84-12a3ba0920dc/quotas/ff9114d3-ca94-428d-b15b-ac000759c0e0" id="ff9114d3-ca94-428d-b15b-ac000759c0e0"/>
<storage_domains>
  <storage_domain href="/ovirt-engine/api/storagedomains/4bdc8de8-73fc-4787-982d-e1cc181d3afc" id="4bdc8de8-73fc-4787-982d-e1cc181d3afc"/>
  <storage_domain href="/ovirt-engine/api/storagedomains/a52e665a-1472-4e21-a1bb-8e17817305bc" id="a52e665a-1472-4e21-a1bb-8e17817305bc"/>
  <storage_domain href="/ovirt-engine/api/storagedomains/4276d749-c178-4428-bee1-1e5fe4b642bc" id="4276d749-c178-4428-bee1-1e5fe4b642bc"/>
  <storage_domain href="/ovirt-engine/api/storagedomains/94fb6c00-a543-4003-9099-c016634b7e28" id="94fb6c00-a543-4003-9099-c016634b7e28"/>
  <storage_domain href="/ovirt-engine/api/storagedomains/e6ad2044-eb89-4d33-b326-dbeadc120b52" id="e6ad2044-eb89-4d33-b326-dbeadc120b52"/>
  <storage_domain href="/ovirt-engine/api/storagedomains/80fb91a1-ee40-48fb-9703-37d485a38783" id="80fb91a1-ee40-48fb-9703-37d485a38783"/>
  <storage_domain href="/ovirt-engine/api/storagedomains/af38f95c-14d5-46bc-8de2-84c4f35f7825" id="af38f95c-14d5-46bc-8de2-84c4f35f7825"/>
  <storage_domain href="/ovirt-engine/api/storagedomains/3dac53ad-0502-411e-803b-028d5cd02a86" id="3dac53ad-0502-411e-803b-028d5cd02a86"/>
  <storage_domain href="/ovirt-engine/api/storagedomains/fb11a4fb-cff4-441f-8597-6c25d285204b" id="fb11a4fb-cff4-441f-8597-6c25d285204b"/>
</storage_domains>
</disk>

Info regarding the disk of the VM that I created using the template:

<name>latest-rhel-guest-image-7.9-infra</name>
<description>latest-rhel-guest-image-7.9-infra (b200508)</description>
<link href="/ovirt-engine/api/disks/e6745f45-6d14-46a7-b0ba-e628160949a1/disksnapshots" rel="disksnapshots"/>
<link href="/ovirt-engine/api/disks/e6745f45-6d14-46a7-b0ba-e628160949a1/permissions" rel="permissions"/>
<link href="/ovirt-engine/api/disks/e6745f45-6d14-46a7-b0ba-e628160949a1/statistics" rel="statistics"/>
<actual_size>10871635968</actual_size>
<alias>latest-rhel-guest-image-7.9-infra</alias>
<backup>none</backup>
<content_type>data</content_type>
<format>cow</format>
<image_id>5d0fbc89-6494-4c6d-9e99-21b23d64b848</image_id>
<propagate_errors>false</propagate_errors>
<provisioned_size>10737418240</provisioned_size>
<qcow_version>qcow2_v3</qcow_version>
<shareable>false</shareable>
<sparse>true</sparse>
<status>ok</status>
<storage_type>image</storage_type>
<total_size>10871635968</total_size>
<wipe_after_delete>false</wipe_after_delete>
<disk_profile href="/ovirt-engine/api/diskprofiles/79369169-2c78-482e-b529-9ec6f588a64b" id="79369169-2c78-482e-b529-9ec6f588a64b"/>
<quota href="/ovirt-engine/api/datacenters/3a5f14bd-cf8c-42df-8e84-12a3ba0920dc/quotas/ff9114d3-ca94-428d-b15b-ac000759c0e0" id="ff9114d3-ca94-428d-b15b-ac000759c0e0"/>
<storage_domains>
  <storage_domain href="/ovirt-engine/api/storagedomains/94fb6c00-a543-4003-9099-c016634b7e28" id="94fb6c00-a543-4003-9099-c016634b7e28"/>
</storage_domains>
</disk>
*** Bug 1932794 has been marked as a duplicate of this bug. ***
*** Bug 2025585 has been marked as a duplicate of this bug. ***
*** Bug 1459455 has been marked as a duplicate of this bug. ***
(In reply to Amit Sharir from comment #0)

Are you sure you cloned from a qcow2 template? I could not reproduce this on my setup. When I clone a qcow2 template I get the expected disk, similar to the template disk.

There are no logs attached to this bug, so we really cannot do anything with it.

Please add engine and vdsm logs showing how you created the template and how you cloned the VM.

Also please provide the output of "qemu-img measure" for both the template disk and the cloned disk:

    qemu-img measure -O qcow2 /dev/{storage-domain-id}/{lv-name}
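For reference, a sketch of what the requested output looks like; the path placeholder is unchanged from the command above and the numbers here are only illustrative (borrowed from a measure result quoted later in this bug), not from the affected disks:

    # qemu-img measure -O qcow2 /dev/{storage-domain-id}/{lv-name}
    required size: 2473590784
    fully allocated size: 10739318784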
Amit, see https://bugzilla.redhat.com/2025585#c8 - it shows how cloning a cow template creates a small logical volume, as expected.
(In reply to Nir Soffer from comment #4)
> (In reply to Amit Sharir from comment #0)
>
> Are you sure you cloned from a qcow2 template? I could not reproduce this
> on my setup. When I clone a qcow2 template I get the expected disk,
> similar to the template disk.
>
> There are no logs attached to this bug, so we really cannot do anything
> with it.
>
> Please add engine and vdsm logs showing how you created the template and
> how you cloned the VM.
>
> Also please provide the output of "qemu-img measure" for both the template
> disk and the cloned disk:
>
>     qemu-img measure -O qcow2 /dev/{storage-domain-id}/{lv-name}

Yes, I am sure. You can also see the info regarding the template disk I used in the description of the problem.

In addition, this is probably not a new issue, since I saw old bugs that refer to the same problem (bugs 1932794 and 1459455, which Ilan and Eyal opened).

I will add the credentials of my environment in a private comment so you can see and reproduce the issue. I will also add the relevant logs and info you requested.
The attached logs do not include the entire flow.

Searching for new disk API calls:

$ grep dca6948d-f3f3-40e6-b931-9ef858d7c28d *.log | grep START
vdsm1.log:2021-12-22 14:59:07,981+0200 INFO (jsonrpc/6) [vdsm.api] START getVolumeInfo(sdUUID='e6ad2044-eb89-4d33-b326-dbeadc120b52', spUUID='3a5f14bd-cf8c-42df-8e84-12a3ba0920dc', imgUUID='293fd177-edd1-44e7-96fc-00dd8f5177a5', volUUID='dca6948d-f3f3-40e6-b931-9ef858d7c28d') from=::ffff:10.46.12.152,60714, flow_id=75093178-091a-422f-9846-8d0138a2c458, task_id=602ac622-9dcb-49a3-b137-de432e89ea07 (api:48)
vdsm1.log:2021-12-22 14:59:08,303+0200 INFO (jsonrpc/0) [vdsm.api] START sdm_copy_data(job_id='e6626162-6f9e-4e09-9283-c3a53202c573', source={'endpoint_type': 'div', 'prepared': False, 'sd_id': 'e6ad2044-eb89-4d33-b326-dbeadc120b52', 'img_id': '9321e99e-ddd2-4be2-a41c-b35d5c7b5ff5', 'vol_id': '7bd24e3e-7959-4bb5-b45a-02d361521104'}, destination={'generation': 0, 'endpoint_type': 'div', 'prepared': False, 'sd_id': 'e6ad2044-eb89-4d33-b326-dbeadc120b52', 'img_id': '293fd177-edd1-44e7-96fc-00dd8f5177a5', 'vol_id': 'dca6948d-f3f3-40e6-b931-9ef858d7c28d'}, copy_bitmaps=False) from=::ffff:10.46.12.152,60714, flow_id=75093178-091a-422f-9846-8d0138a2c458, task_id=61c29d55-57a4-403b-a6ea-faa6235175ca (api:48)

Searching for template API calls:

$ grep 7bd24e3e-7959-4bb5-b45a-02d361521104 *.log | grep START
vdsm1.log:2021-12-22 14:59:08,303+0200 INFO (jsonrpc/0) [vdsm.api] START sdm_copy_data(job_id='e6626162-6f9e-4e09-9283-c3a53202c573', source={'endpoint_type': 'div', 'prepared': False, 'sd_id': 'e6ad2044-eb89-4d33-b326-dbeadc120b52', 'img_id': '9321e99e-ddd2-4be2-a41c-b35d5c7b5ff5', 'vol_id': '7bd24e3e-7959-4bb5-b45a-02d361521104'}, destination={'generation': 0, 'endpoint_type': 'div', 'prepared': False, 'sd_id': 'e6ad2044-eb89-4d33-b326-dbeadc120b52', 'img_id': '293fd177-edd1-44e7-96fc-00dd8f5177a5', 'vol_id': 'dca6948d-f3f3-40e6-b931-9ef858d7c28d'}, copy_bitmaps=False) from=::ffff:10.46.12.152,60714, flow_id=75093178-091a-422f-9846-8d0138a2c458, task_id=61c29d55-57a4-403b-a6ea-faa6235175ca (api:48)

Interesting, we see one measure log:

$ grep measure *.log
vdsm2.log:2021-12-22 12:05:40,318+0200 INFO (jsonrpc/7) [vdsm.api] START measure(sdUUID='f956805b-af47-471e-a17e-6d6435dfecb4', imgUUID='551dadfa-8834-45f6-98fd-fb6818399efe', volUUID='f8e164a5-9eb5-4af4-a407-ebdc11c8d5fd', dest_format=4, backing=True) from=::ffff:10.46.12.140,42934, flow_id=bc8ebace-2934-46fd-9671-a74de32787be, task_id=75231874-37c0-404e-9b87-ae884398d620 (api:48)
vdsm2.log:2021-12-22 12:05:40,456+0200 INFO (jsonrpc/7) [vdsm.api] FINISH measure return={'result': {'bitmaps': 0, 'required': 3393847296, 'fully-allocated': 10739318784}} from=::ffff:10.46.12.140,42934, flow_id=bc8ebace-2934-46fd-9671-a74de32787be, task_id=75231874-37c0-404e-9b87-ae884398d620 (api:54)

But this is for another volume on another storage domain, and it was done 3 hours before the copy_data job.
The 3 logs start at:

$ head -1 *.log
==> vdsm1.log <==
2021-12-22 13:01:03,292+0200 DEBUG (check/loop) [storage.check] START check '/rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge__storage5__nfs__1/4bdc8de8-73fc-4787-982d-e1cc181d3afc/dom_md/metadata' (delay=0.01) (check:289)

==> vdsm2.log <==
2021-12-22 12:01:02,773+0200 INFO (jsonrpc/7) [api.host] START getAllVmStats() from=::1,35614 (api:48)

==> vdsm3.log <==
2021-12-22 14:01:01,620+0200 DEBUG (check/loop) [storage.check] START check '/rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge__storage2__nfs__1/22bba973-e222-4e33-aca8-d78cd53cb260/dom_md/metadata' (delay=0.01) (check:289)

And end at:

$ tail -n1 *.log
==> vdsm1.log <==
2021-12-22 15:02:40,596+0200 DEBUG (periodic/0) [storage.TaskManager.Task] (Task='d762362f-dd3a-4191-946d-9abe3170639f') ref 0 aborting False (task:1000)

==> vdsm2.log <==
2021-12-22 15:27:22,107+0200 WARN (dhcp-monitor) [root] Nic ovirtmgmt is not configured for IPv6 monitoring. (dhcp_monitor:174)

==> vdsm3.log <==
2021-12-22 15:31:40,543+0200 INFO (ioprocess/1942204) [IOProcessClient] (glusterSD/gluster01.lab.eng.tlv2.redhat.com:_GE__storage2__volume02) ioprocess was terminated by signal 9 (__init__:200)

Based on this, the engine did not measure the template before the copy, which explains why we create a new volume with the wrong size. But I don't see the createVolume log, which must happen a few seconds before copy_data, and this does not make any sense.

Amit, are these complete unmodified logs from the 3 hosts in your env?
After debugging on Amit's environment, the issue is this:

If we create a new template and it exists only on one block storage domain, cloning from the template creates a new disk with the correct size:

# lvs -o vg_name,lv_name,size,tags | grep bd99b34b-3bfc-4053-86ea-03dbdd854902
  678d1452-27f0-4497-9a75-aa55837061be 7aa3d821-2b32-4451-8f16-9c6ae01850c5 2.62g IU_bd99b34b-3bfc-4053-86ea-03dbdd854902,MD_9,PU_00000000-0000-0000-0000-000000000000

In vdsm log, we see:

# grep measure vdsm-new-temlate.log
2021-12-22 19:50:09,752+0200 INFO (jsonrpc/3) [vdsm.api] START measure(sdUUID='678d1452-27f0-4497-9a75-aa55837061be', imgUUID='5b752f2d-e246-4020-855b-7cd69d35d717', volUUID='2d4868b1-2566-4b07-a462-519baa5a4723', dest_format=4, backing=True) from=::ffff:10.46.12.152,50592, flow_id=bfb58300-f349-4144-92c6-5a195e6317fa, task_id=532fa56c-ee8d-4ed7-8364-1f81c8c919ed (api:48)
2021-12-22 19:50:09,939+0200 INFO (jsonrpc/3) [vdsm.api] FINISH measure return={'result': {'bitmaps': 0, 'required': 2473590784, 'fully-allocated': 10739318784}} from=::ffff:10.46.12.152,50592, flow_id=bfb58300-f349-4144-92c6-5a195e6317fa, task_id=532fa56c-ee8d-4ed7-8364-1f81c8c919ed (api:54)

And the engine asks to create the destination volume with the expected initial_size:

# grep 'START createVolume' vdsm-new-temlate.log
2021-12-22 19:50:09,976+0200 INFO (jsonrpc/1) [vdsm.api] START createVolume(sdUUID='678d1452-27f0-4497-9a75-aa55837061be', spUUID='2ece5cff-fa68-4310-b779-0fbf831e09e3', imgUUID='bd99b34b-3bfc-4053-86ea-03dbdd854902', size='10737418240', volFormat=4, preallocate=2, diskType='DATA', volUUID='7aa3d821-2b32-4451-8f16-9c6ae01850c5', desc='{"DiskAlias":"","DiskDescription":""}', srcImgUUID='00000000-0000-0000-0000-000000000000', srcVolUUID='00000000-0000-0000-0000-000000000000', initialSize='2473590784', addBitmaps=False) from=::ffff:10.46.12.152,51370, flow_id=bfb58300-f349-4144-92c6-5a195e6317fa, task_id=6113c0a9-4fef-4df3-8f4f-4b2d9e3bc28b (api:48)

But if we use latest-rhel-guest-image-7.9-infra, which has disks on 9 storage domains (iscsi_0, iscsi_1, iscsi_2, nfs_0, nfs_1, nfs_2, test_glsuter_0, test_gluster_1, test_gluster_2), we don't run measure during the copy flow:

# grep measure vdsm-old-temlate.log
(nothing)

And the engine asks to create the destination with an initial size of 9.0G:

# grep 'START createVolume' vdsm-old-temlate.log
2021-12-22 20:17:24,712+0200 INFO (jsonrpc/5) [vdsm.api] START createVolume(sdUUID='678d1452-27f0-4497-9a75-aa55837061be', spUUID='2ece5cff-fa68-4310-b779-0fbf831e09e3', imgUUID='df14019b-69fd-4e5c-ad45-0a989a731448', size='10737418240', volFormat=4, preallocate=2, diskType='DATA', volUUID='21c23eff-1ca0-4c87-afba-52ee36be9d1f', desc='{"DiskAlias":"","DiskDescription":""}', srcImgUUID='00000000-0000-0000-0000-000000000000', srcVolUUID='00000000-0000-0000-0000-000000000000', initialSize='9761289310', addBitmaps=False) from=::ffff:10.46.12.152,51370, flow_id=60490c7d-2aa5-4bd9-8206-708f976d50cf, task_id=8537e87f-6ef8-47cd-a826-d88b61eb856a (api:48)

Vdsm will create a volume of 9761289310 * 1.1 bytes, which gives a volume size of 10.12g.

Why does the engine not call measure?
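A rough check of that 10.12g figure (my arithmetic; the 10% margin is stated above, while rounding up to 128 MiB LVM extents on block domains is my assumption):

    9761289310 bytes * 1.1 = 10737418241 bytes, just over 10 GiB (10737418240 bytes)
    rounded up to the next 128 MiB extent: 10368 MiB = 10.125 GiB, which lvs shows as 10.12g
    (this also matches the 10871635968-byte actual_size reported for the cloned disk in the description)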
Looking in CopyImageGroupWithDataCommand.java:

    private Long determineTotalImageInitialSize(DiskImage sourceImage,
            VolumeFormat destFormat,
            Guid srcDomain) {
        // Check if we have a host in the DC capable of running the measure volume verb,
        // otherwise fallback to the legacy method
        Guid hostId = imagesHandler.getHostForMeasurement(sourceImage.getStoragePoolId(), sourceImage.getId());
        // We are collapsing the chain, so we want to measure the leaf to get the size
        // of the entire chain

When copying a template image, there are no chains - a template has a single image.

This search returns all the images with the template disk id from multiple storage domains, not all the snapshots of a single disk:

        List<DiskImage> images = diskImageDao.getAllSnapshotsForImageGroup(sourceImage.getId());

This sort is probably meaningless, since unrelated images from different storage domains have no order:

        imagesHandler.sortImageList(images);

This assumes that the last image is the leaf - but since we have images from multiple storage domains, we just get a random image from the list, from a random storage domain:

        DiskImage leaf = images.get(images.size() - 1);

Since we have a random image from a random domain, and we have 3 block domains and 6 file domains, there is a good chance to get an image on a file domain:

        if (hostId == null || (leaf.getActive() && !leaf.getStorageTypes().get(0).isBlockDomain())) {

So we get into this legacy code, which uses crappy heuristics to determine the size:

            return imagesHandler.determineTotalImageInitialSize(getDiskImage(),
                    getParameters().getDestinationFormat(),
                    getParameters().getSrcDomain(),
                    getParameters().getDestDomain());
        } else {

instead of this code, which uses qemu-img measure to get a good estimate:

            MeasureVolumeParameters parameters = new MeasureVolumeParameters(leaf.getStoragePoolId(),
                    srcDomain, leaf.getId(), leaf.getImageId(), destFormat.getValue());
            parameters.setParentCommand(getActionType());
            parameters.setEndProcedure(EndProcedure.PARENT_MANAGED);
            parameters.setVdsRunningOn(hostId);
            parameters.setCorrelationId(getCorrelationId());
            ActionReturnValue actionReturnValue = runInternalAction(ActionType.MeasureVolume, parameters,
                    ExecutionHandler.createDefaultContextForTasks(getContext()));

            if (!actionReturnValue.getSucceeded()) {
                throw new RuntimeException("Could not measure volume");
            }

            return actionReturnValue.getActionReturnValue();
        }
    }

I think the fix is (a simplified sketch of this selection logic follows below):

- choose which storage domain we want to copy from
- check if the source image is a template
- for template, measure the single image
- otherwise get the snapshots for the image group id for the selected storage domain and sort them to get the leaf

Benny, what do you think?
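To make the proposed selection logic concrete, here is a minimal, self-contained Java sketch. This is not the engine code: DiskImage, the DAO lookup and the command class are replaced by a plain record and hypothetical names (ImageOnDomain, pickVolumeToMeasure). It only illustrates restricting the disk's images to the chosen source storage domain before looking for the leaf, and treating a template's single image as its own leaf; in the real command the selected volume would then be measured via MeasureVolume instead of the legacy size heuristic.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.UUID;

    // Simplified stand-in for the engine's DiskImage: one volume of a disk on one storage domain.
    record ImageOnDomain(UUID storageDomainId, UUID imageId, UUID parentId) {}

    public class MeasureTargetPicker {

        // Nil UUID marks "no parent", i.e. the base volume of a chain (a template has only this one).
        static final UUID NO_PARENT = new UUID(0, 0);

        // Pick the volume to measure when copying a disk from srcDomain: restrict the disk's
        // images to that domain first, then take the leaf, i.e. the image that no other image
        // on the same domain uses as its parent.
        static ImageOnDomain pickVolumeToMeasure(List<ImageOnDomain> allImagesOfDisk, UUID srcDomain) {
            List<ImageOnDomain> onDomain = allImagesOfDisk.stream()
                    .filter(img -> img.storageDomainId().equals(srcDomain))
                    .toList();
            if (onDomain.isEmpty()) {
                throw new IllegalArgumentException("disk has no image on domain " + srcDomain);
            }
            // A template has exactly one image per domain, so it is trivially its own leaf.
            if (onDomain.size() == 1) {
                return onDomain.get(0);
            }
            // For a snapshot chain, the leaf is the image that is nobody's parent.
            Set<UUID> parents = new HashSet<>();
            for (ImageOnDomain img : onDomain) {
                parents.add(img.parentId());
            }
            return onDomain.stream()
                    .filter(img -> !parents.contains(img.imageId()))
                    .findFirst()
                    .orElseThrow(() -> new IllegalStateException("no leaf found - broken chain?"));
        }

        public static void main(String[] args) {
            UUID iscsiDomain = UUID.randomUUID();
            UUID nfsDomain = UUID.randomUUID();
            UUID templateVolume = UUID.randomUUID();
            // The same template image exists on two domains; only the copy on the source
            // domain is considered, never a random image from another domain.
            List<ImageOnDomain> images = List.of(
                    new ImageOnDomain(iscsiDomain, templateVolume, NO_PARENT),
                    new ImageOnDomain(nfsDomain, templateVolume, NO_PARENT));
            System.out.println(pickVolumeToMeasure(images, iscsiDomain));
        }
    }

Filtering by storage domain first makes the template case trivial and keeps the leaf search meaningful for real snapshot chains.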
Removing the needinfo, since Nir finished the investigation in comment #13.
(In reply to Nir Soffer from comment #13)
> I think the fix is:
>
> - choose which storage domain we want to copy from
> - check if the source image is a template
> - for template, measure the single image
> - otherwise get the snapshots for the image group id for the selected
>   storage domain and sort them to get the leaf
>
> Benny, what do you think?

Sounds right, posted proposed patch https://gerrit.ovirt.org/c/ovirt-engine/+/118153
Benny, would you post https://gerrit.ovirt.org/c/ovirt-engine/+/118266 on GitHub?
(In reply to Arik from comment #18)
> Benny, would you post https://gerrit.ovirt.org/c/ovirt-engine/+/118266 on
> GitHub?

As discussed, this patch is an attempt at an optimization and is not directly related to this bug, so I'm moving it to MODIFIED.
Verified. The virtual size of the disk is bigger than the actual size after reproducing all the steps.

Versions:
ovirt-engine-4.5.0.2-0.7.el8ev
vdsm-4.50.0.12-1.el8ev.x86_64
This bugzilla is included in the oVirt 4.5.0 release, published on April 20th 2022. Since the problem described in this bug report should be resolved in the oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.