Bug 1993454 - Improve ImageIO import performance
Summary: Improve ImageIO import performance
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.8.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Matthew Arnold
QA Contact: Tzahi Ashkenazi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-13 08:01 UTC by Tzahi Ashkenazi
Modified: 2022-03-16 15:53 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-16 15:51:21 UTC
Target Upstream Version:
Embargoed:




Links
* GitHub: kubevirt/containerized-data-importer pull 2052 (open) - Use ImageIO extents API to copy raw images more efficiently. Last updated: 2021-12-16 19:23:00 UTC
* Red Hat Product Errata: RHSA-2022:0947. Last updated: 2022-03-16 15:53:09 UTC

Description Tzahi Ashkenazi 2021-08-13 08:01:47 UTC
Description of problem:

While collecting the first baseline results for migrations from RHV using MTV with RHV as the provider, we noticed some suspicious results.

Test details:
The VMs run RHEL 7.6 and are located on an FC storage domain on a single RHV host (version > rhv-release-4.4.7-6); the target storage class is OCS.
I verified that the VMs were created correctly, their disks are filled correctly, and the disk types are correct.

Below are the results from the baseline cycles:

1. 100 GB sparse thin disk, 2 GB actual size (rhel76-100gb-os-only-thin): migration took 2m26s / 2m34s
2. 100 GB sparse thin disk, 35 GB actual size (rhel76-100gb-30usage-thin): migration took 4m9s / 4m10s
3. 100 GB sparse thin disk, 72 GB actual size (rhel76-100gb-70usage-thin): migration took 3m38s / 3m48s


Open questions:
- Why does it take 2.5 minutes to transfer a 2 GB disk?
- Why is the 72 GB sparse disk faster than the 35 GB sparse disk?
- We see the same behavior on different OCP clouds (cloud20 and cloud38).


Importer issues (items 1-3 are feedback from Nir Soffer; see the hedged sketches after this list):
1. Not using image extents - the importer performs one GET request for the whole image.
2. Copying the zeroes over the wire, and likely writing all the zeroes to the target disk.
3. The importer pod first asks for the TotalSize of the oVirt disk object, and only checks Content-Length if the total size was zero.
4. Importer pod progress does not update correctly - Tzahi needs to check if importer memory is decreasing as disk flushes/writes happen.
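
Items 1 and 2 point at the approach in the linked PR ("Use ImageIO extents API to copy raw images more efficiently"): ask the ovirt-imageio daemon which byte ranges are allocated and transfer only those. Below is a minimal Go sketch of that idea, assuming the ovirt-imageio HTTP API shape (GET <image-url>/extents?context=zero returning start/length/zero entries, plus ranged GETs for data); the function and variable names are illustrative, not the actual CDI code.

    // Hypothetical sketch, not the actual CDI implementation.
    package imageio

    import (
        "encoding/json"
        "fmt"
        "io"
        "net/http"
        "os"
    )

    // extent mirrors one entry returned by
    // GET <image-url>/extents?context=zero.
    type extent struct {
        Start  int64 `json:"start"`
        Length int64 `json:"length"`
        Zero   bool  `json:"zero"`
    }

    // copyAllocated copies only the allocated byte ranges of a raw image
    // from an ovirt-imageio URL into dst, skipping zero extents entirely.
    func copyAllocated(imageURL string, dst *os.File) error {
        // Ask the imageio daemon which ranges actually contain data.
        resp, err := http.Get(imageURL + "/extents?context=zero")
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        var extents []extent
        if err := json.NewDecoder(resp.Body).Decode(&extents); err != nil {
            return err
        }

        for _, e := range extents {
            if e.Zero {
                continue // a sparse target already reads as zeroes
            }
            if err := copyRange(imageURL, dst, e); err != nil {
                return err
            }
        }
        return nil
    }

    // copyRange fetches a single data extent with an HTTP Range request
    // and writes it at the matching offset in the target file.
    func copyRange(imageURL string, dst *os.File, e extent) error {
        req, err := http.NewRequest(http.MethodGet, imageURL, nil)
        if err != nil {
            return err
        }
        req.Header.Set("Range",
            fmt.Sprintf("bytes=%d-%d", e.Start, e.Start+e.Length-1))

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusPartialContent {
            return fmt.Errorf("unexpected status %q for range request", resp.Status)
        }

        if _, err := dst.Seek(e.Start, io.SeekStart); err != nil {
            return err
        }
        _, err = io.Copy(dst, resp.Body)
        return err
    }

This is why the 2 GB-usage disk should transfer in seconds rather than minutes: zero extents are never fetched or written.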


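For item 3, a short sketch of the size-detection order as described: prefer the oVirt disk object's total size, and fall back to the HTTP Content-Length only when the reported size is zero. The names here (imageSize, diskTotalSize, transferURL) are hypothetical, not the actual importer code.

    // Hypothetical sketch of the size-detection fallback in item 3.
    package imageio

    import (
        "fmt"
        "net/http"
    )

    // imageSize returns the number of bytes the importer should expect:
    // the oVirt disk object's size when available, otherwise the
    // Content-Length of the transfer URL.
    func imageSize(diskTotalSize int64, transferURL string) (int64, error) {
        if diskTotalSize > 0 {
            return diskTotalSize, nil
        }
        // Fallback: issue a HEAD request and trust Content-Length.
        resp, err := http.Head(transferURL)
        if err != nil {
            return 0, err
        }
        defer resp.Body.Close()
        if resp.ContentLength < 0 {
            return 0, fmt.Errorf("no Content-Length reported for %s", transferURL)
        }
        return resp.ContentLength, nil
    }
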

The full logs from the baseline results can be found under:
    https://drive.google.com/drive/folders/1d1PPrwNZ8XHmU-SPwhLBI89lPguKTeXp?usp=sharing


Logs:
1. ovirt-imageio-daemon.log
2. rhel76-100gb-30usage-thin.log
3. rhel76-100gb-70usage-thin.log
4. rhel76-100gb-os-only-2gb-usage-thin.log
5. vdsm.log


Version-Release number of selected component (if applicable):
Cloud20
OCP-48
CNV-48-451
OCS-48
MTV-2.1.0-44

Comment 1 Fabien Dupont 2021-09-08 13:26:23 UTC
Retargeting to 4.10.0 to have enough time to investigate the solution and test it. It will also allow time to use govirtclient internally.

Comment 2 Matthew Arnold 2022-01-25 17:20:39 UTC
This should be fixed in CNV v4.10.0-524.

Comment 3 Tzahi Ashkenazi 2022-01-30 10:18:21 UTC
Tested again using red02 as the provider, with the same VMs:

1. 100 GB sparse thin disk, 2 GB actual size (rhel76-100gb-os-only-thin): migration took 00:00:18; data copy speed: 111.1 MB/s
2. 100 GB sparse thin disk, 35 GB actual size (rhel76-100gb-30usage-thin): migration took 00:04:59; data copy speed: 100.33 MB/s
3. 100 GB sparse thin disk, 72 GB actual size (rhel76-100gb-70usage-thin): migration took 00:10:02; data copy speed: 116.28 MB/s

These results are much more reasonable than the previous ones:

 * In the previous cycle it took 2.5 minutes to transfer the 2 GB disk; the same data now took 18 seconds.
 * Previously the 72 GB sparse disk was faster than the 35 GB sparse disk; now the 35 GB disk is faster than the 72 GB disk, as expected.

The average data copy speed is about 100 MB/s (roughly 800 Mbit/s).

Version-Release number of selected component (if applicable):
* Cloud10
* OCP-4.10
* OCS-4.9.1
* CNV-v4.10.0-598
* MTV-2.3.0-15

The full logs from the baseline results can be found under:
          https://drive.google.com/drive/folders/1IT7dKmfacIvMpyccQ593DgMQuu0vI-Ym?usp=sharing

Comment 8 errata-xmlrpc 2022-03-16 15:51:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947

