Bug 1993454 - Improve ImageIO import performance
Summary: Improve ImageIO import performance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.8.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Matthew Arnold
QA Contact: Tzahi Ashkenazi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-13 08:01 UTC by Tzahi Ashkenazi
Modified: 2022-03-16 15:53 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-16 15:51:21 UTC
Target Upstream Version:
Embargoed:




Links
* GitHub: kubevirt/containerized-data-importer pull 2052 (open) - Use ImageIO extents API to copy raw images more efficiently. Last updated: 2021-12-16 19:23:00 UTC
* Red Hat Product Errata: RHSA-2022:0947. Last updated: 2022-03-16 15:53:09 UTC

Description Tzahi Ashkenazi 2021-08-13 08:01:47 UTC
Description of problem:

While collecting the first baseline results for migrations from RHV using MTV with RHV as the provider, we noticed some suspicious results.

Test details:
The VMs run RHEL 7.6 and are located on an FC storage domain on a single RHV host (version > rhv-release-4.4.7-6); the target storage class is OCS.
I verified that the VMs were created correctly, their disks are filled correctly, and the disk types are correct.

Below are the results from the baseline cycles:

1. 100 GB sparse thin disk, 2 GB actual size (rhel76-100gb-os-only-thin): migration took 2m26s / 2m34s
2. 100 GB sparse thin disk, 35 GB actual size (rhel76-100gb-30usage-thin): migration took 4m9s / 4m10s
3. 100 GB sparse thin disk, 72 GB actual size (rhel76-100gb-70usage-thin): migration took 3m38s / 3m48s


Open questions:
- Why does it take 2.5 minutes to transfer a 2 GB disk?
- Why is the 72 GB sparse disk faster than the 35 GB sparse disk?
- We see the same behavior on different OCP clouds (cloud20 and cloud38).


Importer issues (items 1-3 are feedback from Nir Soffer; see the hedged sketches after this list):
1. Not using image extents - the importer performs one GET request for the whole image.
2. Copying the zeroes over the wire, and likely writing all the zeroes to the target disk.
3. The importer pod first asks for the TotalSize of the oVirt disk object, and only checks Content-Length if the total size was zero.
4. Importer pod progress does not update correctly - Tzahi needs to check if importer memory is decreasing as disk flushes/writes happen.
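
Items 1 and 2 point at the approach in the linked PR ("Use ImageIO extents API to copy raw images more efficiently"): ask the ovirt-imageio daemon which byte ranges are allocated and transfer only those. Below is a minimal Go sketch of that idea, assuming the ovirt-imageio HTTP API shape (GET <image-url>/extents?context=zero returning start/length/zero entries, plus ranged GETs for data); the function and variable names are illustrative, not the actual CDI code.

    // Hypothetical sketch, not the actual CDI implementation.
    package imageio

    import (
        "encoding/json"
        "fmt"
        "io"
        "net/http"
        "os"
    )

    // extent mirrors one entry returned by
    // GET <image-url>/extents?context=zero.
    type extent struct {
        Start  int64 `json:"start"`
        Length int64 `json:"length"`
        Zero   bool  `json:"zero"`
    }

    // copyAllocated copies only the allocated byte ranges of a raw image
    // from an ovirt-imageio URL into dst, skipping zero extents entirely.
    func copyAllocated(imageURL string, dst *os.File) error {
        // Ask the imageio daemon which ranges actually contain data.
        resp, err := http.Get(imageURL + "/extents?context=zero")
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        var extents []extent
        if err := json.NewDecoder(resp.Body).Decode(&extents); err != nil {
            return err
        }

        for _, e := range extents {
            if e.Zero {
                continue // a sparse target already reads as zeroes
            }
            if err := copyRange(imageURL, dst, e); err != nil {
                return err
            }
        }
        return nil
    }

    // copyRange fetches a single data extent with an HTTP Range request
    // and writes it at the matching offset in the target file.
    func copyRange(imageURL string, dst *os.File, e extent) error {
        req, err := http.NewRequest(http.MethodGet, imageURL, nil)
        if err != nil {
            return err
        }
        req.Header.Set("Range",
            fmt.Sprintf("bytes=%d-%d", e.Start, e.Start+e.Length-1))

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusPartialContent {
            return fmt.Errorf("unexpected status %q for range request", resp.Status)
        }

        if _, err := dst.Seek(e.Start, io.SeekStart); err != nil {
            return err
        }
        _, err = io.Copy(dst, resp.Body)
        return err
    }

This is why the 2 GB-usage disk should transfer in seconds rather than minutes: zero extents are never fetched or written.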


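For item 3, a short sketch of the size-detection order as described: prefer the oVirt disk object's total size, and fall back to the HTTP Content-Length only when the reported size is zero. The names here (imageSize, diskTotalSize, transferURL) are hypothetical, not the actual importer code.

    // Hypothetical sketch of the size-detection fallback in item 3.
    package imageio

    import (
        "fmt"
        "net/http"
    )

    // imageSize returns the number of bytes the importer should expect:
    // the oVirt disk object's size when available, otherwise the
    // Content-Length of the transfer URL.
    func imageSize(diskTotalSize int64, transferURL string) (int64, error) {
        if diskTotalSize > 0 {
            return diskTotalSize, nil
        }
        // Fallback: issue a HEAD request and trust Content-Length.
        resp, err := http.Head(transferURL)
        if err != nil {
            return 0, err
        }
        defer resp.Body.Close()
        if resp.ContentLength < 0 {
            return 0, fmt.Errorf("no Content-Length reported for %s", transferURL)
        }
        return resp.ContentLength, nil
    }
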

The full logs from the baseline results can be found under:
    https://drive.google.com/drive/folders/1d1PPrwNZ8XHmU-SPwhLBI89lPguKTeXp?usp=sharing


Logs:
1. ovirt-imageio-daemon.log
2. rhel76-100gb-30usage-thin.log
3. rhel76-100gb-70usage-thin.log
4. rhel76-100gb-os-only-2gb-usage-thin.log
5. vdsm.log


Version-Release number of selected component (if applicable):
Cloud20
OCP-48
CNV-48-451
OCS-48
MTV-2.1.0-44

Comment 1 Fabien Dupont 2021-09-08 13:26:23 UTC
Retargeting to 4.10.0 to have enough time to investigate the solution and test it. It will also allow time to use govirtclient internally.

Comment 2 Matthew Arnold 2022-01-25 17:20:39 UTC
This should be fixed in CNV v4.10.0-524.

Comment 3 Tzahi Ashkenazi 2022-01-30 10:18:21 UTC
Tested again using red02 as the provider, with the same VMs:

1. 100 GB sparse thin disk, 2 GB actual size (rhel76-100gb-os-only-thin): migration took 00:00:18; data copy speed: 111.1 MB/s
2. 100 GB sparse thin disk, 35 GB actual size (rhel76-100gb-30usage-thin): migration took 00:04:59; data copy speed: 100.33 MB/s
3. 100 GB sparse thin disk, 72 GB actual size (rhel76-100gb-70usage-thin): migration took 00:10:02; data copy speed: 116.28 MB/s

These results are much more reasonable than the previous ones:

 * In the previous cycle it took 2.5 minutes to transfer the 2 GB disk; the same data now took 18 seconds.
 * Previously the 72 GB sparse disk was faster than the 35 GB sparse disk; now the 35 GB disk is faster than the 72 GB disk, as expected.

The average data copy speed is about 100 MB/s (roughly 800 Mbit/s).

Version-Release number of selected component (if applicable):
* Cloud10
* OCP-4.10
* OCS-4.9.1
* CNV-v4.10.0-598
* MTV-2.3.0-15

The full logs from the baseline results can be found under:
          https://drive.google.com/drive/folders/1IT7dKmfacIvMpyccQ593DgMQuu0vI-Ym?usp=sharing

Comment 8 errata-xmlrpc 2022-03-16 15:51:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947

