Bug 1621211 - [downstream clone - 4.2.7] qemu-img: slow disk move/clone/import
Summary: [downstream clone - 4.2.7] qemu-img: slow disk move/clone/import
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.1.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ovirt-4.2.7
Target Release: ---
Assignee: Nir Soffer
QA Contact: guy chen
URL:
Whiteboard:
Depends On: 1511891
Blocks:
 
Reported: 2018-08-23 14:18 UTC by RHV bug bot
Modified: 2021-12-10 17:07 UTC
CC List: 19 users

Fixed In Version: v4.20.40
Doc Type: Enhancement
Doc Text:
Previously, copying volumes to preallocated disks was slower than necessary and did not make optimal use of available network resources. In the current release, qemu-img uses out-of-order writing to improve the speed of write operations by up to six times. These operations include importing, moving, or copying large disks to preallocated storage.
Clone Of: 1511891
Environment:
Last Closed: 2018-11-05 15:02:07 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
  Red Hat Issue Tracker RHV-43586 (2021-09-09 15:26:53 UTC)
  Red Hat Product Errata RHEA-2018:3478 (2018-11-05 15:02:51 UTC)
  oVirt gerrit 93787, master, MERGED: storage: Optimize copy to raw block volumes (2020-05-27 21:10:00 UTC)
  oVirt gerrit 93846, ovirt-4.2, MERGED: qemuimg: Add unordered_writes convert option (2020-05-27 21:10:00 UTC)
  oVirt gerrit 93847, ovirt-4.2, MERGED: storage: Optimize copy to raw block volumes (2020-05-27 21:09:59 UTC)
  oVirt gerrit 93861, ovirt-4.2, MERGED: sd: Move supportsSparseness to StorageDomainManifest (2020-05-27 21:09:59 UTC)

Description RHV bug bot 2018-08-23 14:18:34 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1511891 +++
======================================================================

Description of problem:
qemu-img currently has long run-times when importing disks from an export domain. While the import was running, we were unable to identify the bottleneck causing such long run-times.

Version-Release number of selected component (if applicable):
vdsm-4.19.31-1.el7ev.x86_64
qemu-img-rhev-2.9.0-16.el7_4.8.x86_64

How reproducible:
Any time, when importing a large disk from an NFS-based export domain.

Steps to Reproduce:
1. Export a sufficiently large (100G) virtual machine to an NFS export domain
2. Import that machine into an FC-based storage domain
3.

Actual results:
The import runs but takes a very long time:
- Network is *not* saturated
- FC-device is *not* saturated
- CPU is *not* running on 100% load
- There is lots of free memory

Expected results:
- Either the network, the FC device, or the CPU should be reaching a limit

Additional info:

(Originally by Andreas Bleischwitz)

Comment 8 RHV bug bot 2018-08-23 14:20:21 UTC
returning needinfo to signify we're still waiting for the logs

(Originally by amureini)

Comment 12 RHV bug bot 2018-08-23 14:21:19 UTC
How slow is the import? Can you provide numbers for this process?

(Originally by ylavi)

Comment 13 RHV bug bot 2018-08-23 14:21:34 UTC
I also suggest you tell the customer to use a data domain with its import ability.

(Originally by ylavi)

Comment 17 RHV bug bot 2018-08-23 14:22:32 UTC
Nir, can you please have a look? We might need some tweaking of qemu-img there

(Originally by Tal Nisan)

Comment 18 RHV bug bot 2018-08-23 14:22:48 UTC
Andreas, what is the original disk format?

I have seen very slow qemu-img copies on a fast server and storage (XtremIO) when
copying a raw preallocated volume to a raw preallocated volume.

(Originally by Nir Soffer)

Comment 19 RHV bug bot 2018-08-23 14:23:04 UTC
Hi Nir,

we initially went the route of exporting the VMs from SAN to an NFS-based export domain - I assume QCOW2 is used for that. While the export was not remarkably slow, the import took much longer.
As we were then told to use an additional storage domain for that migration, we used a second SAN-based storage domain and copied the disks from the old to the new storage domain. This turned out to be even slower than the import from the NFS export domain.
I can no longer provide any numbers, and the migration is now close to finished, so we no longer have the ability to re-run a proper export/import process.

The effect should be visible regardless of the environment. All they had was a VM with close to 2 TB of disk.

(Originally by Andreas Bleischwitz)

Comment 27 RHV bug bot 2018-08-23 14:24:57 UTC
Mordechay, you did not mention how you copied the image - did you use qemu-img
manually, or move the disk via the engine?

Also, the content of the image matters. Can you attach to this bug the output of:

    qemu-img info /path/to/image

    qemu-img map --output json /path/to/image

We need to run this on the source image *before* the copy.

Finally, you did not mention which NFS version was used. NFS 4.2 supports
sparseness, so qemu-img can copy sparse parts much, much faster (using fallocate()
instead of copying zeros).
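
For reference, two quick checks that can answer the NFS-version and sparseness
questions on the host (assuming the standard nfs-utils and coreutils tools; the
path is a placeholder):

    # Show the negotiated NFS version (vers=) for the export domain mount:
    nfsstat -m

    # Compare allocated size with apparent size to see how sparse the image is:
    du -h /path/to/image
    du -h --apparent-size /path/to/image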

It will also be interesting to compare the same copy using the new ovirt-imageio
cio code.

You can test using this patch:
https://gerrit.ovirt.org/#/c/85640/
 
To install this, you can download the patch from gerrit:

    git fetch git://gerrit.ovirt.org/ovirt-imageio refs/changes/40/85640/26 && \
        git checkout FETCH_HEAD

Then run this from the common directory:

    export PYTHONPATH=.

    time python test/tdd.py /path/to/source /path/to/destination

(Originally by Nir Soffer)

Comment 30 RHV bug bot 2018-08-23 14:25:40 UTC
Raz, we need to reproduce this on real hardware and storage. Mordechay did some
tests (see comment 23), but we don't have enough info about them.

For testing I'll need a decent host (leopard04/03 would be best, but buri/ucs
should also be good), and iSCSI/FC/NFS storage (XtremIO would be best).

(Originally by Nir Soffer)

Comment 44 RHV bug bot 2018-08-23 14:28:58 UTC
Andreas, can you give details about the destination storage server?

In comment 32 we learned that the destination storage server is a VM. Is this the
same setup that you reported, or a different setup?

If the issue is running an NFS server on a VM, this bug should move to qemu; it is
not related to qemu-img.

(Originally by Nir Soffer)

Comment 45 RHV bug bot 2018-08-23 14:29:14 UTC
Adding back the needinfo for Raz, removed by mistake by a commenter.

We are blocked waiting for a fast server and storage for reproducing this issue.

(Originally by Nir Soffer)

Comment 48 RHV bug bot 2018-08-23 14:29:54 UTC
Daniel,

As this is a performance-related issue, please provide the required hardware for testing.

(Originally by Raz Tamir)

Comment 50 RHV bug bot 2018-08-23 14:30:25 UTC
Setting the needinfo again

(Originally by Raz Tamir)

Comment 58 RHV bug bot 2018-08-23 14:32:18 UTC
I tested image copy performance with the raw format, using the new -W option
of qemu-img convert.

I did not test copying qcow2 to raw/qcow2 files, for two reasons: qemu is
the only tool that can read the qcow2 format, and the new -W option causes
fragmentation of the qcow2 file, and I'm not sure how this affects guest
performance.


## Tested images

I tested copying 3 versions of a sparse image:

size  format  data    #holes
----------------------------
100G    raw    19%      6352
100G    raw    52%     15561
100G    raw    86%     24779

For reference here is a Fedora 27 image created by virt-builder.

  6G    raw    19%        73

The images are fairly fragmented - this makes it harder for qemu-img to
get good performance, since it has to deal with a lot of small chunks
of data.
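
One way to get numbers like the #holes column above (not necessarily how they
were produced here - qemu-img map lists only the allocated extents, and the
holes sit between them):

    # Count the allocated data extents; the number of holes between them is the
    # same, plus or minus one.
    qemu-img map /path/to/image | tail -n +2 | wc -l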

The 19G image was created like this:

- Install Fedora 28 server on 100G FC disk
- yum-builddep kernel
- get current kernel tree
- configure using "make olddefconfig"
- make

The 52G image was created from the 19G image by duplicating the linux
build tree twice.

The 86G image was created from the 52G image by adding 2 more duplicates
of the linux tree.


## Tested hardware

Tested on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz server with 40
cores, connected to XtremIO storage via 4G FC HBAs, with 4 paths to the
storage.

The NFS server is another server with the same spec, exporting an XtremIO
LUN formatted with xfs, over a single 10G NIC. The export is mounted
using NFS 4.2.


## Tested commands

I compared these commands:

1. qemu-img

  qemu-img convert -p -f raw -O raw -t none -T none src-img dst-img

  This is how RHV copies images since 3.6.

2. qemu-img/-W

  qemu-img convert -p -f raw -O raw -t none -T none -W src-img dst-img

3. dd

For block:

  blkdiscard -z -p 32m dst-img

  dd if=src-img of=dst-img bs=8M iflag=direct oflag=direct conv=sparse,fsync

For file:

  truncate -s 0 dst-img

  truncate -s 100g dst-img

  dd if=src-img of=dst-img bs=8M iflag=direct oflag=direct conv=sparse,fsync

This command is not the same as qemu-img - it treats holes smaller than
the block size (8M) as data. But I think this is good enough.

4. parallel dd

For block:

  blkdiscard -z -p 32m dst-img

  dd if=src-img of=dst-img bs=8M count=6400 iflag=direct oflag=direct \
      conv=sparse,fsync &

  dd if=src-img of=dst-img bs=8M count=6400 seek=6400 skip=6400 iflag=direct \
      oflag=direct conv=sparse,fsync &

For file:

  truncate -s 0 dst-img

  truncate -s 100g dst-img

  dd if=src-img of=dst-img bs=8M count=6400 iflag=direct oflag=direct \
      conv=notrunc,sparse,fsync &

  dd if=src-img of=dst-img bs=8M count=6400 seek=6400 skip=6400 \
      iflag=direct oflag=direct conv=notrunc,sparse,fsync &


The parallel dd commands are not very efficient with very sparse images,
since one process finishes before the other, but they are a good way to
show the possible improvement.
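
For completeness, here is a minimal sketch of the block-storage variant as one
script, with the final wait that the inline commands above leave implicit (SRC
and DST are placeholder paths; the full test scripts are attached in comments
62 and 63):

    # Parallel dd copy to a block destination: two halves of 6400 x 8M blocks
    # cover the 100G image. SRC and DST are placeholders.
    SRC=/path/to/src-img
    DST=/path/to/dst-img

    blkdiscard -z -p 32m "$DST"

    dd if="$SRC" of="$DST" bs=8M count=6400 \
        iflag=direct oflag=direct conv=sparse,fsync &
    dd if="$SRC" of="$DST" bs=8M count=6400 seek=6400 skip=6400 \
        iflag=direct oflag=direct conv=sparse,fsync &

    # Wait for both halves; each dd fsyncs its own writes before exiting.
    wait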


## Versions

# rpm -q qemu-img-rhev coreutils
qemu-img-rhev-2.10.0-21.el7_5.4.x86_64
coreutils-8.22-21.el7.x86_64

# uname -r
3.10.0-862.6.3.el7.x86_64


## Setup

Before testing a copy to an FC volume, I discarded the volume:

   blkdiscard -p 32m dst-img

When copying to NFS, I truncated the volume:

   truncate -s0 dst-img


## Basic read/write throughput

For reference, here is the rate we can read or write on this setup:

# dd if=/nfs/100-86g.img of=/dev/null bs=8M count=12800 iflag=direct conv=sparse
107374182400 bytes (107 GB) copied, 116.292 s, 923 MB/s

# dd if=/dev/zero of=dst-fc1 bs=8M count=12800 oflag=direct conv=fsync
107374182400 bytes (107 GB) copied, 151.491 s, 709 MB/s

# dd if=/dev/zero of=/nfs/upload.img bs=8M count=12800 oflag=direct conv=fsync
107374182400 bytes (107 GB) copied, 296.105 s, 363 MB/s
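
As a rough sanity check (simple arithmetic from the rates above, assuming reads
and writes overlap perfectly, so the slower side sets the floor for a full 100G
copy):

    NFS -> FC copy:  limited by the 709 MB/s FC write  -> at least ~151 s
    FC -> NFS copy:  limited by the 363 MB/s NFS write -> at least ~296 s

The 1230-second qemu-img result below is roughly 8x above the NFS -> FC floor.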


## Copying from NFS 4.2 to FC storage domain

This is how raw templates are copied from an export domain or from an NFS
data domain to an FC domain, as mentioned in comment 0, and how disks are
copied when moving them between storage domains.

Time in seconds.

image       qemu-img    qemu-img/-W      dd    parallel-dd
----------------------------------------------------------
100/19G          242             41      165           128
100/52G          658            119      197           144
100/86G         1230            189      238           132

We can see that qemu-img gives poor results, and it is worse for less
sparse images. This reproduces the issue mentioned in comment 0: 1230
seconds for 100G is 83 MiB/s.

With the new -W option, qemu-img is the fastest with a very sparse image,
since it does not need to read the holes (using SEEK_DATA/SEEK_HOLE).
I did not test NFS < 4.2, where qemu has to read all the data and
detect zeros manually, like dd.

But we can see that simple parallel dd can be faster for fully allocated
images, when qemu-img has to read most of the image. This shows there is
room for optimization in qemu-img, even with -W.


## Copying from FC storage domain to FC storage domain

This is how disks are copied between storage domains.

Time in seconds.

image       qemu-img    qemu-img/-W      dd    parallel-dd
----------------------------------------------------------
100/19G          383            194      178           141
100/52G          802            282      230           167
100/86G         1229            371      287           154

In this case qemu-img and dd do not have any info on the sparseness of the
source image and must detect zeros manually.

qemu-img with the -W option is again significantly faster, but even
simple dd is faster than plain qemu-img. The difference grows as the
image contains more data.


## Copying from FC storage domain to NFS 4.2 storage domain

This is how disks are copied between storage domains, and how disks are
copied to an export domain, as mentioned in comment 0.

Time in seconds.

image       qemu-img    qemu-img/-W      dd    parallel-dd
----------------------------------------------------------
100/19G          215            194      200           n/a
100/52G          347            292      301           n/a
100/86G          493            379      398           340

qemu-img with the new -W option is faster, on par with simple dd, but
parallel dd is faster still.

However, using -W will cause fragmentation in the destination file
system, so I don't think we should use this option here. Maybe we need to
test how VM performance is affected by disks copied to NFS storage
using -W.
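
If we do test that, a simple way to quantify the fragmentation of the
destination file would be filefrag from e2fsprogs, which also works on XFS
(the file names here are placeholders):

    # Fewer extents means less fragmentation; compare copies made with and
    # without -W to the same NFS/XFS export.
    filefrag /nfs/copy-with-w.img
    filefrag /nfs/copy-without-w.img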


## Summary

qemu-img without the -W option is very slow now. When we moved to using
qemu-img in 3.6, it was faster than dd. Maybe we did not test it properly
(we used a 1M buffer size in dd), or maybe there has been a performance
regression in qemu-img since RHEL 7.2.

This is the patch that moved us to using only qemu-img for copying images:
https://github.com/oVirt/vdsm/commit/0b61c4851a528fd6354d9ab77a68085c41f35dc9

We should use -W for copying to raw volumes on block storage.
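
As an illustration only (placeholders, not the actual vdsm change - that is in
the linked gerrit patches), the intended behavior is roughly:

    # Enable out-of-order writes only when the destination is a raw volume on
    # block storage (a block device). SRC, DST and DST_FORMAT are placeholders.
    OPTS="-p -f raw -O raw -t none -T none"
    if [ "$DST_FORMAT" = "raw" ] && [ -b "$DST" ]; then
        OPTS="$OPTS -W"
    fi
    qemu-img convert $OPTS "$SRC" "$DST"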

Using dd for block-to-block and block-to-NFS copies is faster, but we want
to use a single tool for copying images. We will try to improve qemu-img
performance for this use case.

qemu-img 3.0 supports copy offloading; we need to test whether it gives
better performance for block-to-block copies.

I'll open a qemu-img bug to track the performance issues.

(Originally by Nir Soffer)

Comment 59 RHV bug bot 2018-08-23 14:32:35 UTC
Created attachment 1476302 [details]
Detailed test results 100/19g sparse image

(Originally by Nir Soffer)

Comment 60 RHV bug bot 2018-08-23 14:32:50 UTC
Created attachment 1476303 [details]
Detailed test results 100/52g sparse image

(Originally by Nir Soffer)

Comment 61 RHV bug bot 2018-08-23 14:33:06 UTC
Created attachment 1476304 [details]
Detailed test results 100/86g sparse image

(Originally by Nir Soffer)

Comment 62 RHV bug bot 2018-08-23 14:33:20 UTC
Created attachment 1476305 [details]
Parallel dd test script for file storage

(Originally by Nir Soffer)

Comment 63 RHV bug bot 2018-08-23 14:33:33 UTC
Created attachment 1476306 [details]
Parallel dd test script for block storage

(Originally by Nir Soffer)

Comment 64 RHV bug bot 2018-08-23 14:33:47 UTC
We are in the blocker-only stage of 4.2.6.
This change requires full regression testing as this is a key flow.
Therefore I think we should wait for 4.2.7 to merge this.

(Originally by ylavi)

Comment 65 RHV bug bot 2018-08-23 14:34:02 UTC
Removing qa_ack+ as this won't be part of 4.2.6

(Originally by Elad Ben Aharon)

Comment 66 Daniel Gur 2018-09-05 09:33:47 UTC
Nir, are you merging it in the coming 4.2.7 build, this sprint?

Comment 67 Daniel Gur 2018-09-05 10:34:51 UTC
Guy, Nir tested it on our Leopard hosts with the NFS storage you gave him:

## Tested hardware

Tested on Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz server with 40
cores, connected to XtremIO storage via 4G FC HBAs, and 4 paths to
storage.

The NFS server is another server with same spec, exporting a LUN from
XtremIO formatted using xfs over single 10G nic. The NFS server is
mounted using NFS 4.2.

Comment 68 Nir Soffer 2018-09-05 17:45:53 UTC
(In reply to Daniel Gur from comment #66)
> Nir, are you merging it in the coming 4.2.7 build, this sprint?

This was already merged, should be available in first 4.2.7 build.

Comment 69 Steffen Froemer 2018-09-10 19:11:09 UTC
(In reply to Nir Soffer from comment #68)
> (In reply to Daniel Gur from comment #66)
> > Nir, are you merging it in the coming 4.2.7 build, this sprint?
> 
> This was already merged, should be available in first 4.2.7 build.

What will be required regarding RHEL hosts? Will only the command used for storage migration change, or are there any package dependencies on the hypervisor?

Comment 71 Nir Soffer 2018-10-11 10:54:10 UTC
(In reply to Steffen Froemer from comment #69)
> What will be required regarding RHEL hosts?
There are no new requirements; we use qemu-img options introduced in the latest
version, which is already required by vdsm.
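
For completeness, a quick sanity check on a host (nothing new to install; these
packages are already required):

    # The qemu-img-rhev build used in the tests above (2.10) already provides
    # the -W convert option.
    rpm -q vdsm qemu-img-rhev
    qemu-img --version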

Comment 72 Steve Goodman 2018-10-29 08:30:11 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field.

 

The documentation team will review, edit, and approve the text.

 

If this bug does not require doc text, please set the 'requires_doc_text' flag to -.

Comment 73 Nir Soffer 2018-11-04 12:04:34 UTC
(In reply to Steve Goodman from comment #72)
Doc text updated.

Comment 74 guy chen 2018-11-05 08:39:56 UTC
I ran the following setup:

VM with 2 disks :

disk 1:
Preallocated
Size 100 GB
disk 2 :
Thin provisioned 
Virtual size 10 GB
Actual size 3 GB 

I have a system with one Fibre Channel SD and an NFS export domain, version 4.2.
I tested the import from NFS, once with 4.2.7 (vdsm-4.20.43-1) and a second time with 4.2.6 (vdsm-4.20.39.1-1).
On 4.2.6 the import took 7 minutes and 41 seconds.
On 4.2.7 the import took 5 minutes and 48 seconds.

Thus, on 4.2.7 the import is significantly faster.

Comment 76 errata-xmlrpc 2018-11-05 15:02:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:3478

Comment 79 Daniel Gur 2019-08-28 13:13:23 UTC
sync2jira

Comment 80 Daniel Gur 2019-08-28 13:17:36 UTC
sync2jira

