Bug 1511891 - qemu-img: slow disk move/clone/import
Summary: qemu-img: slow disk move/clone/import
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ovirt-4.3.0
Target Release: 4.3.0
Assignee: Nir Soffer
QA Contact: guy chen
URL:
Whiteboard:
Depends On:
Blocks: 1621211
 
Reported: 2017-11-10 11:23 UTC by Andreas Bleischwitz
Modified: 2021-12-10 15:24 UTC
CC List: 17 users

Fixed In Version: v4.30.3
Doc Type: Enhancement
Doc Text:
Previously, copying volumes to preallocated disks was slower than necessary and did not make optimal use of available network resources. In the current release, qemu-img uses out-of-order writing to improve the speed of write operations by up to six times. These operations include importing, moving, and copying large disks to preallocated storage.
Clone Of:
: 1621211 (view as bug list)
Environment:
Last Closed: 2019-05-08 12:35:59 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments
Detailed test results 100/19g sparse image (2.98 KB, text/plain) - 2018-08-15 20:45 UTC, Nir Soffer
Detailed test results 100/52g sparse image (3.57 KB, text/plain) - 2018-08-15 20:47 UTC, Nir Soffer
Detailed test results 100/86g sparse image (4.48 KB, text/plain) - 2018-08-15 20:47 UTC, Nir Soffer
Parallel dd test script for file storage (244 bytes, application/x-shellscript) - 2018-08-15 20:49 UTC, Nir Soffer
Parallel dd test script for block storage (231 bytes, application/x-shellscript) - 2018-08-15 20:49 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43462 0 None None None 2021-09-09 12:51:13 UTC
Red Hat Product Errata RHBA-2019:1077 0 None None None 2019-05-08 12:36:29 UTC
oVirt gerrit 93787 0 'None' MERGED storage: Optimize copy to raw block volumes 2021-01-25 22:14:45 UTC
oVirt gerrit 93846 0 'None' MERGED qemuimg: Add unordered_writes convert option 2021-01-25 22:14:45 UTC
oVirt gerrit 93847 0 'None' MERGED storage: Optimize copy to raw block volumes 2021-01-25 22:14:44 UTC
oVirt gerrit 93861 0 'None' MERGED sd: Move supportsSparseness to StorageDomainManifest 2021-01-25 22:15:28 UTC

Description Andreas Bleischwitz 2017-11-10 11:23:51 UTC
Description of problem:
Currently qemu-img has long run-times when importing disks from an export domain. During the import we were unable to identify the bottleneck causing the long run-time.

Version-Release number of selected component (if applicable):
vdsm-4.19.31-1.el7ev.x86_64
qemu-img-rhev-2.9.0-16.el7_4.8.x86_64

How reproducible:
Any time during import of a large disk from an NFS-based export domain.

Steps to Reproduce:
1. Export a sufficiently large (100G) virtual machine to an NFS export domain
2. Import that machine to a FC-based storage domain
3.

Actual results:
The import runs but takes a very long time.
- Network is *not* saturated
- FC-device is *not* saturated
- CPU is *not* running on 100% load
- There is lots of free memory

Expected results:
- Either the network, the FC device, or the CPU reaches a limit

Additional info:

Comment 7 Allon Mureinik 2017-11-14 10:21:43 UTC
returning needinfo to signify we're still waiting for the logs

Comment 11 Yaniv Lavi 2018-04-02 08:35:57 UTC
How slow is the import? Can you provide numbers for this process?

Comment 12 Yaniv Lavi 2018-04-02 08:36:31 UTC
I also suggest you tell the customer to use a data domain with its import ability.

Comment 16 Tal Nisan 2018-05-03 10:37:36 UTC
Nir, can you please have a look? We might need some tweaking of qemu-img there

Comment 17 Nir Soffer 2018-05-28 19:16:10 UTC
Andreas, what is the original disk format?

I have seen very slow qemu-img copies on fast server and storage (XtrmIO) when
copying raw preallocated volume to raw preallocated volume.

Comment 18 Andreas Bleischwitz 2018-05-29 07:11:56 UTC
Hi Nir,

We initially went the route of exporting the VMs from SAN to an NFS-based export domain - I assume QCOW2 is used for that. While the export was not remarkably slow, the import took much longer.
As we were then told to use an additional storage domain for the migration, we used a second SAN-based storage domain and copied the disks from the old to the new storage domain. This turned out to be even slower than the import from the NFS export domain.
I can no longer provide any numbers, and the migration is now close to finished, so we no longer have the ability to re-run a proper export/import process.

The effect should be visible regardless of the environment. All they had was a VM with close to 2TB of disk.

Comment 26 Nir Soffer 2018-06-12 17:08:50 UTC
Mordechay, you did not mention how you copied the image - did you use qemu-img
manually, or move the disk via the engine?

Also, the content of the image matters. Can you attach to this bug the output of:

    qemu-img info /path/to/image

    qemu-img map --output json /path/to/image

We need to run this on the source image *before* the copy.

Finally, you did not mention which NFS version was used. NFS 4.2 supports
sparseness, so qemu-img can copy sparse parts much faster (using fallocate()
instead of copying zeros).
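
For illustration only (paths and sizes are placeholders, not from this setup),
this is the kind of fallocate() based operation qemu-img can use when the
destination supports sparse files, and how to check that holes were preserved:

    # Punch a hole instead of writing zeros - the fallocate() call qemu-img can
    # rely on when the destination file system (e.g. NFS 4.2) supports it:
    fallocate --punch-hole --offset 1GiB --length 1GiB /path/to/dst-image

    # Compare allocated vs. virtual size to see whether holes were preserved:
    du -h /path/to/dst-image
    du -h --apparent-size /path/to/dst-image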

It will also be interesting to compare the same copy using ovirt-imageio's new
cio code.

You can test using this patch:
https://gerrit.ovirt.org/#/c/85640/
 
To install this, you can download the patch from gerrit:

    git fetch git://gerrit.ovirt.org/ovirt-imageio refs/changes/40/85640/26 && \
        git checkout FETCH_HEAD

Then run this from the common directory:

    export PYTHONPATH=.

    time python test/tdd.py /path/to/source /path/to/destination

Comment 29 Nir Soffer 2018-06-14 12:54:50 UTC
Raz, we need to reproduce this on real hardware and storage. Mordechay did some
tests (see comment 23) but we don't have enough info about the tests.

For testing I'll need a decent host (leopard04/03 would be best, but buri/ucs
should also be good), and iSCSI/FC/NFS storage (XtremIO would be best).

Comment 43 Nir Soffer 2018-06-22 18:35:12 UTC
Andreas, can you give details about the destination storage server?

In comment 32 we learned that the destination storage server is a VM. Is this the
same setup that you reported, or a different setup?

If the issue is running an NFS server on a VM, this bug should move to qemu; it is
not related to qemu-img.

Comment 44 Nir Soffer 2018-06-22 18:38:45 UTC
Adding back needinfo for Raz, removed by mistake in an earlier comment.

We are blocked waiting for a fast server and storage for reproducing this issue.

Comment 47 Raz Tamir 2018-06-22 19:46:10 UTC
Daniel,

As this is a performance related issue, please provide the required HW for testing

Comment 49 Raz Tamir 2018-06-23 21:04:10 UTC
Setting the needinfo again

Comment 57 Nir Soffer 2018-08-15 20:38:20 UTC
I tested image copy performance with raw format, using the new -W option
of qemu-img convert.

I did not test copying qcow2 to raw/qcow2 files, for two reasons: qemu
is the only tool that can read the qcow2 format, and the new -W option causes
fragmentation of the qcow2 file, and I'm not sure how this affects
performance of the guest.


## Tested images

I tested copying 3 versions of a sparse image:

size  format  data    #holes
----------------------------
100G    raw    19%      6352
100G    raw    52%     15561
100G    raw    86%     24779

For reference here is a Fedora 27 image created by virt-builder.

  6G    raw    19%        73

The images are fairly fragmented - this makes it harder for qemu-img to
get good performance, since qemu-img has to deal with a lot of small
chunks of data.
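
For reference, numbers like the data percentage and hole count above can be
derived from "qemu-img map" output. A rough recipe (assuming jq is available;
file names are placeholders):

  qemu-img map --output json 100g.img > map.json

  # number of holes (unallocated extents):
  jq '[.[] | select(.data == false)] | length' map.json

  # bytes of allocated data:
  jq '[.[] | select(.data == true) | .length] | add' map.json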

The 19G image was created like this:

- Install Fedora 28 server on 100G FC disk
- yum-builddep kernel
- get current kernel tree
- configure using "make olddefconfig"
- make

The 52G image was created from the 19G image by duplicating the linux
build tree twice.

The 86G image was created from the 52G image by adding 2 more duplicates
of the linux tree.


## Tested hardware

Tested on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz server with 40
cores, connected to XtremIO storage via 4G FC HBAs, with 4 paths to
storage.

The NFS server is another server with the same spec, exporting a LUN from
XtremIO formatted with xfs over a single 10G NIC. The export is
mounted using NFS 4.2.
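
The exact export and mount options were not captured here; a typical setup
matching the description above might look like this (device, export path, and
host names are placeholders):

  # On the NFS server: format the XtremIO LUN with xfs and export it
  mkfs.xfs /dev/mapper/xtremio-lun
  mount /dev/mapper/xtremio-lun /export
  echo '/export *(rw,no_root_squash)' >> /etc/exports
  exportfs -r

  # On the host: mount with NFS 4.2 so sparse copies can use fallocate()
  mount -t nfs -o vers=4.2 nfs-server:/export /nfs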


## Tested commands

I compared these commands:

1. qemu-img

  qemu-img convert -p -f raw -O raw -t none -T none src-img dst-img

  This is how RHV copies images since 3.6.

2. qemu-img/-W

  qemu-img convert -p -f raw -O raw -t none -T none -W src-img dst-img

3. dd

For block:

  blkdiscard -z -p 32m dst-img

  dd if=src-img of=dst-img bs=8M iflag=direct oflag=direct conv=sparse,fsync

For file:

  truncate -s 0 dst-img

  truncate -s 100g dst-img

  dd if=src-img of=dst-img bs=8M iflag=direct oflag=direct conv=sparse,fsync

This command is not the same as qemu-img - it treats holes smaller than
the block size (8M) as data. But I think this is good enough.

4. parallel dd

For block:

  blkdiscard -z -p 32m dst-img

  dd if=src-img of=dst-img bs=8M count=6400 iflag=direct oflag=direct \
      conv=sparse,fsync &

  dd if=src-img of=dst-img bs=8M count=6400 seek=6400 skip=6400 iflag=direct \
      oflag=direct conv=sparse,fsync &

For file:

  truncate -s 0 dst-img

  truncate -s 100g dst-img

  dd if=src-img of=dst-img bs=8M count=6400 iflag=direct oflag=direct \
      conv=notrunc,sparse,fsync &

  dd if=src-img of=dst-img bs=8M count=6400 seek=6400 skip=6400 \
      iflag=direct oflag=direct conv=notrunc,sparse,fsync &


The parallel dd commands are not very efficient with very sparse images,
since one process finishes before the other, but they are a good way to show
the possible improvement.
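
A wrapper equivalent to the file-storage commands above might look like the
sketch below (illustration only, not the attached test script; it assumes a
100G image copied as 12800 blocks of 8M, split into two halves of 6400 blocks):

  #!/bin/sh
  # Two-way parallel dd copy for file storage - illustration only.
  SRC=$1
  DST=$2

  truncate -s 0 "$DST"
  truncate -s 100g "$DST"

  dd if="$SRC" of="$DST" bs=8M count=6400 iflag=direct oflag=direct \
      conv=notrunc,sparse,fsync &
  dd if="$SRC" of="$DST" bs=8M count=6400 seek=6400 skip=6400 \
      iflag=direct oflag=direct conv=notrunc,sparse,fsync &
  wait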


## Versions

# rpm -q qemu-img-rhev coreutils
qemu-img-rhev-2.10.0-21.el7_5.4.x86_64
coreutils-8.22-21.el7.x86_64

# uname -r
3.10.0-862.6.3.el7.x86_64


## Setup

Before testing copy to FC volume, I discarded the volume:

   blkdiscard -p 32m dst-img

When copying to NFS, I truncated the volume:

   truncate -s0 dst-img


## Basic read/write throughput

For reference, here is the rate we can read or write on this setup:

# dd if=/nfs/100-86g.img of=/dev/null bs=8M count=12800 iflag=direct conv=sparse
107374182400 bytes (107 GB) copied, 116.292 s, 923 MB/s

# dd if=/dev/zero of=dst-fc1 bs=8M count=12800 oflag=direct conv=fsync
107374182400 bytes (107 GB) copied, 151.491 s, 709 MB/s

# dd if=/dev/zero of=/nfs/upload.img bs=8M count=12800 oflag=direct conv=fsync
107374182400 bytes (107 GB) copied, 296.105 s, 363 MB/s


## Copying from NFS 4.2 to FC storage domain

This is how raw templates are copied from an export domain or from an NFS data
domain to an FC domain, as mentioned in comment 0, and how disks are copied
when moving disks between storage domains.

Time in seconds.

image       qemu-img    qemu-img/-W      dd    parallel-dd
----------------------------------------------------------
100/19G          242             41      165           128
100/52G          658            119      197           144
100/86G         1230            189      238           132

We can see that qemu-img gives poor results, and it is worse for less
sparse images. This reproduces the issue mentioned in comment 0. 1230
seconds for 100G is 83 MiB/s.

With the new -W option qemu-img is the fastest with a very sparse image,
since it does not need to read the holes, using SEEK_DATA/SEEK_HOLE.
I did not test NFS < 4.2, where qemu has to read all the data and
detect zeros manually like dd.

But we can see that simple parallel dd can be faster for fully allocated
images, when qemu-img has to read most of the image. This shows there is
room for optimization in qemu-img, even with -W.


## Copying from FC storage domain to FC storage domain

This is how disks are copied between storage domains.

Time in seconds.

image       qemu-img    qemu-img/-W      dd    parallel-dd
----------------------------------------------------------
100/19G          383            194      178           141
100/52G          802            282      230           167
100/86G         1229            371      287           154

In this case qemu-img and dd do not have any info on sparseness of the
source image and must detect zeros manually.

qemu-img with the -W option is again significantly faster, but even
simple dd beats the default qemu-img. The difference grows as the image
contains more data.


## Copying from FC storage domain to NFS 4.2 storage domain

This is how disks are copied between storage domains, or how disks are
copied to export domain, mentioned in comment 0.

Time in seconds.

image       qemu-img    qemu-img/-W      dd    parallel-dd
----------------------------------------------------------
100/19G          215            194      200           n/a
100/52G          347            292      301           n/a
100/86G          493            379      398           340

qemu-img with the new -W option is about as fast as simple dd, but parallel
dd is faster.

However, using -W will cause fragmentation in the destination file
system, so I don't think we should use this option here. Maybe we need to
test how VM performance is affected by disks copied to NFS storage
using -W.
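
One possible way to quantify that fragmentation (an assumption, not something
tested here) is to compare extent counts on the NFS server, where the
underlying xfs files live, since FIEMAP is generally not available through an
NFS client:

  # Run on the NFS server against the underlying xfs files (paths are placeholders):
  filefrag /export/path/dst-copied-with-W.img
  filefrag /export/path/dst-copied-without-W.img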


## Summary

qemu-img without the -W option is very slow now. When we moved to using
qemu-img in 3.6 it was faster than dd. Maybe we did not test it properly
(we used a 1M buffer size in dd), or maybe there was a performance
regression in qemu-img since RHEL 7.2.

This is the patch that moved us to using only qemu-img for copying images:
https://github.com/oVirt/vdsm/commit/0b61c4851a528fd6354d9ab77a68085c41f35dc9

We should use -W for copying to raw volumes on block storage.
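
A sketch of that policy (illustration only; the actual change is in the vdsm
gerrit patches listed in the Links section above):

  # Pass -W only when the destination is a raw volume on block storage.
  # SRC_PATH, DST_PATH, SRC_FORMAT and DST_FORMAT are placeholders.
  UNORDERED=""
  if [ "$DST_FORMAT" = "raw" ] && [ -b "$DST_PATH" ]; then
      UNORDERED="-W"
  fi
  qemu-img convert -p -f "$SRC_FORMAT" -O "$DST_FORMAT" -t none -T none \
      $UNORDERED "$SRC_PATH" "$DST_PATH"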

Using dd for block-to-block and block-to-NFS copies is faster, but we want
to use a single tool for copying images. We will try to improve qemu-img
performance for this use case.

qemu-img 3.0 supports copy offloading; we need to test whether it gives better
performance for block-to-block copies.

I'll open a qemu-img bug to track the performance issues.

Comment 58 Nir Soffer 2018-08-15 20:45:54 UTC
Created attachment 1476302 [details]
Detailed test results 100/19g sparse image

Comment 59 Nir Soffer 2018-08-15 20:47:03 UTC
Created attachment 1476303 [details]
Detailed test results 100/52g sparse image

Comment 60 Nir Soffer 2018-08-15 20:47:35 UTC
Created attachment 1476304 [details]
Detailed test results 100/86g sparse image

Comment 61 Nir Soffer 2018-08-15 20:49:03 UTC
Created attachment 1476305 [details]
Parallel dd test script for file storage

Comment 62 Nir Soffer 2018-08-15 20:49:37 UTC
Created attachment 1476306 [details]
Parallel dd test script for block storage

Comment 63 Yaniv Lavi 2018-08-23 13:07:08 UTC
We are in the blocker-only stage of 4.2.6.
This change requires full regression testing as this is a key flow.
Therefore I think we should wait for 4.2.7 to merge this.

Comment 64 Elad 2018-08-23 13:35:24 UTC
Removing qa_ack+ as this won't be part of 4.2.6

Comment 66 Yaniv Kaul 2018-08-27 05:41:54 UTC
It'd be interesting to test with ddpt and friends from the sg3 libs (instead of parallel dd). Also, parallel dd can run with up to 4-8 processes - one per path to the storage, for example.

Comment 67 Daniel Gur 2018-09-05 10:29:36 UTC
Guy, Nir tested it on the Leopard hosts with the NFS storage you gave him.

Comment 68 Assen Totin 2018-09-19 09:19:34 UTC
I just stumbled onto this ticket while looking for a reason why disk image copy is so slow. I have oVirt 4.2.5, and one of my hypervisors is a new dual-Xeon machine with 256 GB RAM connected to a SAN over a dedicated 10 Gbps iSCSI link (plus an additional 1 Gbps for external networking).

I installed a minimal Linux OS on a 500 GB disk image with thick provisioning, shut down the VM, and ran a disk copy to the same SAN volume using the oVirt UI. With no VMs running on this host, the copy rate is less than 100 Mbps as per my Zabbix monitoring (as expected, all traffic comes and goes over the 10 Gbps SAN link). CPU load is 0.2-0.3 and only a fraction of the memory is used.

This result is really, really bad. No CPU, network or memory saturation. I don't really understand why qemu should be involved at all in a simple disk image copy (or move) if there is no image conversion or resizing and the disk is thick-provisioned. RHEV should be smart enough to figure out the best strategy for disk copy/move and utilise the available resources.

Comment 69 guy chen 2018-10-31 14:17:25 UTC
I ran the following setup:

VM with 2 disks :

disk 1:
Preallocated
Size 10 GB
disk 2 :
Thin provisioned 
Virtual size 10 GB
Actual size 3 GB 

I have a system with 1 fiber channel SD and an NFS export domain.
Tested imports and exports, once with 4.2.7 (vdsm-4.20.43-1) and a second run with 4.2.6 (vdsm-4.20.39.1-1).
On 4.2.6 the import took one minute and 8 seconds and the export took 2 minutes and 49 seconds.
On 4.2.7 the import took 49 seconds and the export took 2 minutes and 18 seconds.

So we do see an improvement between 4.2.6 and 4.2.7.

Comment 70 Nir Soffer 2018-10-31 15:34:53 UTC
(In reply to guy chen from comment #69)
> I run the following setup :
> 
> VM with 2 disks :
> 
> disk 1:
> Preallocated
> Size 10 GB
> disk 2 :
> Thin provisioned 
> Virtual size 10 GB
> Actual size 3 GB 

Can you test with bigger disks, like 100G?

Comment 0 mentions a 2T disk - I don't think we need to waste time on such a huge disk,
but with a 10G disk a lot of time is spent in the engine's inefficient polling.

Also, the improvement applies only when the destination is raw format on block storage,
so we don't expect faster exports, only faster imports or move/copy disk operations.

Comment 71 Nir Soffer 2018-11-23 19:51:27 UTC
(In reply to Assen Totin from comment #68)
RHV uses the most sophisticated tool available for copying images, supporting all image
formats and using the most efficient code for reading, writing, and zeroing on many
file systems and storage backends.

Did you try with 4.2.7? Copying raw images from iSCSI to iSCSI should be much
faster now; see comment 57 for examples using FC storage.

According to your description, the copy speed in your case depends on the read
throughput from the source image and, since the image is mostly unallocated, on the
speed of writing zeros to the storage.

If you want to understand in more detail why copies are slow on your system, please
create a 500g raw disk for testing, and activate both the source and destination disks
on the storage:

    lvchange -ay domain-uuid/src-volume-uuid
    lvchange -ay domain-uuid/dst-volume-uuid

Before running these tests, please find the dm-NNN devices for these LVs, and
collect io stats while performing io. You can find them using:

    ls -l /dev/domain-uuid/{src-volume-uuid,dst-volume-uuid}

Checking how fast we can read data and detect zeros from storage:

    iostat -xdm dm-xxx 2 >> read-iostat.log

    time dd if=/dev/domain-uuid/src-volume-uuid of=/dev/null \
        bs=8M iflag=direct conv=sparse status=progress

Checking how fast we can write zeros:

    iostat -xdm dm-yyy 2 >> zero-iostat.log

    time blkdiscard -z -p 32m /dev/domain-uuid/dst-volume-uuid

Checking how fast dd can copy the image:

    iostat -xdm dm-xxx dm-yyy 2 >> dd-copy-iostat.log

    time dd if=/dev/domain-uuid/src-volume-uuid \
        of=/dev/domain-uuid/dst-volume-uuid \
        bs=8M iflag=direct oflag=direct conv=sparse status=progress

Checking how fast qemu-img copies the image:

    iostat -xdm dm-xxx dm-yyy 2 >> qemu-img-convert-iostat.log

    time qemu-img convert -p -f raw -O raw -t none -T none -W \
        /dev/domain-uuid/src-volume-uuid /dev/domain-uuid/dst-volume-uuid

Please share the output of the commands and iostat logs.

If qemu-img convert is not fast enough when copying raw images, this should be
improved in qemu-img, not worked around in RHV by using another tool. That will
benefit all users instead of only RHV users.

In 4.3 we will support storage offloading using cinderlib. If the Cinder driver
for your storage supports efficient cloning, such an operation may be much faster
or even instantaneous (e.g. using copy-on-write).

Comment 72 guy chen 2018-11-27 14:56:13 UTC
I have tested importing a VM with a 100GB preallocated disk on vdsm 4.30.3 vs 4.20.34; it shows a nice improvement:

vdsm-4.30.3-1
Duration : 1m18s

vdsm-4.20.34-1
Started: Nov 27, 2018, 12:33:27 PM
Completed: Nov 27, 2018, 12:35:16 PM
Duration : 1m49s

Comment 73 Daniel Gur 2018-11-29 09:53:30 UTC
Thank you, Guy.
I see about the same improvement rate (30%) as in the clone of this BZ that we verified in 4.2.7
(Bug 1621211).
Moving to verified.

Comment 76 errata-xmlrpc 2019-05-08 12:35:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1077

Comment 77 Daniel Gur 2019-08-28 13:11:27 UTC
sync2jira

Comment 78 Daniel Gur 2019-08-28 13:15:38 UTC
sync2jira

