Description of problem:
qemu-img currently has very long run times when importing disks from an export domain. During the import we were unable to identify the bottleneck causing the long run time.

Version-Release number of selected component (if applicable):
vdsm-4.19.31-1.el7ev.x86_64
qemu-img-rhev-2.9.0-16.el7_4.8.x86_64

How reproducible:
Every time when importing a larger disk from an NFS-based export domain

Steps to Reproduce:
1. Export a sufficiently large (100G) virtual machine to an NFS export domain
2. Import that machine to an FC-based storage domain

Actual results:
The import runs for a very long time, while:
- the network is *not* saturated
- the FC device is *not* saturated
- the CPU is *not* running at 100% load
- there is lots of free memory

Expected results:
Either the network, the FC device, or the CPU reaches a limit.

Additional info:
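For completeness, a minimal sketch of how the "not saturated" observations above could be checked while an import is running, using standard RHEL tools (device and interface names are placeholders, not part of the original report):

    # CPU and memory snapshot
    top -b -n 1 | head -20
    # Per-device throughput and %util for the FC multipath/dm devices
    iostat -xdm 2
    # NIC throughput for the link carrying the NFS traffic
    sar -n DEV 2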
Returning the needinfo to signal that we're still waiting for the logs.
How slow is the import? Can you provide numbers for this process?
I also suggest telling the customer to use a data domain with its import capability.
Nir, can you please have a look? We might need some tweaking of qemu-img there.
Andreas, what is the original disk format? I have seen very slow qemu-img copies on a fast server and storage (XtremIO) when copying a raw preallocated volume to a raw preallocated volume.
Hi Nir, we initially took the approach of exporting the VMs from SAN to an NFS-based export domain - I assume QCOW2 is used for that. While the export was not remarkably slow, the import took much longer. As we were then told to use an additional storage domain for the migration, we used a second SAN-based storage domain and copied the disks from the old to the new storage domain. This turned out to be even slower than the import from the NFS export domain. I can no longer provide any numbers, and the migration is now close to finished, so we no longer have the ability to re-run a proper export/import for measurement. The effect should be visible regardless of the environment. All they had was a VM with close to 2TB of disk.
Mordechay, you did not mention how you copied the image - did you use qemu-img manually or move the disk via engine?

Also, the content of the image matters. Can you attach to this bug the output of:

    qemu-img info /path/to/image
    qemu-img map --output json /path/to/image

We need to run this on the source image *before* the copy.

Finally, you did not mention which NFS version was used. NFS 4.2 supports sparseness, so qemu-img can copy sparse parts much faster (using fallocate() instead of copying zeros).

It will also be interesting to compare the same copy using the new ovirt-imageio cio code. You can test using this patch: https://gerrit.ovirt.org/#/c/85640/

To install this, you can download the patch from gerrit:

    git fetch git://gerrit.ovirt.org/ovirt-imageio refs/changes/40/85640/26 && \
        git checkout FETCH_HEAD

Then run this from the common directory:

    export PYTHONPATH=.
    time python test/tdd.py /path/to/source /path/to/destination
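For reference, a quick hedged way to check which NFS version a mount is actually using (the grep pattern is a placeholder for the relevant mount point):

    # Prints mount options, including vers=..., for every NFS mount
    nfsstat -m
    # Or look at the specific mount under the RHV data center directory
    mount | grep /rhev/data-center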
Raz, we need to reproduce this on real hardware and storage. Mordechay did some tests (see comment 23) but we don't have enough info about them. For testing I'll need a decent host (leopard04/03 would be best, but buri/ucs should also be good) and iSCSI/FC/NFS storage (XtremIO would be best).
Andreas, can you give details about the destination storage server? In comment 32 we learned that the destination storage server is a VM. Is this the same setup that you reported, or a different setup? If the issue is running an NFS server on a VM, this bug should move to qemu; it is not related to qemu-img.
Adding back the needinfo for Raz, removed by mistake by one of the commenters. We are blocked waiting for a fast server and storage to reproduce this issue.
Daniel, as this is a performance-related issue, please provide the required HW for testing.
Setting the needinfo again
I tested copy image performance with raw format, using the new -W option in qemu-img convert.

I did not test copying qcow2 to raw/qcow2 files, for two reasons: qemu is the only tool that can read the qcow2 format, and the new -W option causes fragmentation of the qcow2 file, and I'm not sure how this affects performance of the guest.

## Tested images

I tested copying 3 versions of a sparse image:

    size  format  data  #holes
    --------------------------
    100G  raw     19%    6352
    100G  raw     52%   15561
    100G  raw     86%   24779

For reference, here is a Fedora 27 image created by virt-builder:

    6G    raw     19%      73

The images are fairly fragmented - this makes it harder for qemu-img to get good performance, since qemu-img has to deal with lots of small chunks of data.

The 19G image was created like this:
- Install Fedora 28 server on a 100G FC disk
- yum-builddep kernel
- get the current kernel tree
- configure using "make olddefconfig"
- make

The 52G image was created from the 19G image by duplicating the linux build tree twice. The 86G image was created from the 52G image by adding 2 more duplicates of the linux tree.

## Tested hardware

Tested on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz server with 40 cores, connected to XtremIO storage via 4G FC HBAs, with 4 paths to the storage. The NFS server is another server with the same spec, exporting a LUN from XtremIO formatted with xfs, over a single 10G nic. The NFS export is mounted using NFS 4.2.

## Tested commands

I compared these commands:

1. qemu-img

    qemu-img convert -p -f raw -O raw -t none -T none src-img dst-img

This is how RHV copies images since 3.6.

2. qemu-img/-W

    qemu-img convert -p -f raw -O raw -t none -T none -W src-img dst-img

3. dd

For block:

    blkdiscard -z -p 32m dst-img
    dd if=src-img of=dst-img bs=8M iflag=direct oflag=direct conv=sparse,fsync

For file:

    truncate -s 0 dst-img
    truncate -s 100g dst-img
    dd if=src-img of=dst-img bs=8M iflag=direct oflag=direct conv=sparse,fsync

This command is not the same as qemu-img - it treats holes smaller than the block size (8M) as data. But I think this is good enough.

4. parallel dd

For block:

    blkdiscard -z -p 32m dst-img
    dd if=src-img of=dst-img bs=8M count=6400 iflag=direct oflag=direct \
        conv=sparse,fsync &
    dd if=src-img of=dst-img bs=8M count=6400 seek=6400 skip=6400 iflag=direct \
        oflag=direct conv=sparse,fsync &

For file:

    truncate -s 0 dst-img
    truncate -s 100g dst-img
    dd if=src-img of=dst-img bs=8M count=6400 iflag=direct oflag=direct \
        conv=notrunc,sparse,fsync &
    dd if=src-img of=dst-img bs=8M count=6400 seek=6400 skip=6400 \
        iflag=direct oflag=direct conv=notrunc,sparse,fsync &

The parallel dd commands are not very efficient with very sparse images, since one process finishes before the other, but they are a good way to show the possible improvement.
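As a side note, a hedged way to derive the "data" and "#holes" figures for an image from qemu-img map output (assumes jq is installed; this is not necessarily how the numbers above were produced):

    qemu-img map --output json src-img > map.json
    # bytes of allocated data (divide by the virtual size to get the data %)
    jq '[.[] | select(.data) | .length] | add' map.json
    # number of unallocated extents (holes)
    jq '[.[] | select(.data | not)] | length' map.json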
## Versions

    # rpm -q qemu-img-rhev coreutils
    qemu-img-rhev-2.10.0-21.el7_5.4.x86_64
    coreutils-8.22-21.el7.x86_64
    # uname -r
    3.10.0-862.6.3.el7.x86_64

## Setup

Before testing a copy to an FC volume, I discarded the volume:

    blkdiscard -p 32m dst-img

When copying to NFS, I truncated the volume:

    truncate -s0 dst-img

## Basic read/write throughput

For reference, here is the rate we can read or write on this setup:

    # dd if=/nfs/100-86g.img of=/dev/null bs=8M count=12800 iflag=direct conv=sparse
    107374182400 bytes (107 GB) copied, 116.292 s, 923 MB/s

    # dd if=/dev/zero of=dst-fc1 bs=8M count=12800 oflag=direct conv=fsync
    107374182400 bytes (107 GB) copied, 151.491 s, 709 MB/s

    # dd if=/dev/zero of=/nfs/upload.img bs=8M count=12800 oflag=direct conv=fsync
    107374182400 bytes (107 GB) copied, 296.105 s, 363 MB/s

## Copying from NFS 4.2 to FC storage domain

This is how raw templates are copied from an export domain or from an NFS data domain to an FC domain, as mentioned in comment 0, or how disks are copied when moving disks between storage domains. Time in seconds.

    image    qemu-img  qemu-img/-W   dd   parallel-dd
    -------------------------------------------------
    100/19G     242         41       165      128
    100/52G     658        119       197      144
    100/86G    1230        189       238      132

We can see that qemu-img gives poor results, and it is worse for less sparse images. This reproduces the issue mentioned in comment 0. 1230 seconds for 100G is 83 MiB/s.

With the new -W option qemu-img is the fastest with a very sparse image, since it does not need to read the holes, using SEEK_DATA/SEEK_HOLE. I did not test NFS < 4.2, where qemu has to read all the data and detect zeros manually like dd. But we can see that simple parallel dd can be faster for fully allocated images, when qemu-img has to read most of the image. This shows there is room for optimization in qemu-img, even with -W.

## Copying from FC storage domain to FC storage domain

This is how disks are copied between storage domains. Time in seconds.

    image    qemu-img  qemu-img/-W   dd   parallel-dd
    -------------------------------------------------
    100/19G     383        194       178      141
    100/52G     802        282       230      167
    100/86G    1229        371       287      154

In this case qemu-img and dd do not have any info on the sparseness of the source image and must detect zeros manually. qemu-img with the -W option is again significantly faster, but even simple dd is faster. The difference is bigger as the image contains more data.

## Copying from FC storage domain to NFS 4.2 storage domain

This is how disks are copied between storage domains, or how disks are copied to an export domain, as mentioned in comment 0. Time in seconds.

    image    qemu-img  qemu-img/-W   dd   parallel-dd
    -------------------------------------------------
    100/19G     215        194       200      n/a
    100/52G     347        292       301      n/a
    100/86G     493        379       398      340

qemu-img with the new -W option is about as fast as simple dd, but parallel dd is faster. However, using -W will cause fragmentation in the destination file system, so I don't think we should use this option. Maybe we need to test how VM performance is affected by disks copied to NFS storage using -W.

## Summary

qemu-img without the -W option is very slow now. When we moved to qemu-img in 3.6 it was faster than dd. Maybe we did not test it properly (we used 1M buffer size in dd), or maybe there was a performance regression in qemu-img since RHEL 7.2.

This is the patch moving to use only qemu-img for copying images:
https://github.com/oVirt/vdsm/commit/0b61c4851a528fd6354d9ab77a68085c41f35dc9

We should use -W for copying to raw volumes on block storage.
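As a side note, when experimenting with alternative copy commands or options it is worth verifying that the destination still matches the source. A minimal hedged sketch (paths are placeholders, not the exact volumes used above):

    # Byte-level comparison of two raw images; reports the first mismatch, if any
    qemu-img compare -f raw -F raw src-img dst-img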
Using dd for block-to-block and block-to-NFS copies is faster, but we want to use a single tool for copying images. We will try to improve qemu-img performance for this use case. qemu-img 3.0 supports copy offloading; we need to test whether it gives better performance for block-to-block copies. I'll open a qemu-img bug to track the performance issues.
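A hedged way to check whether copy offloading is actually kicking in during such a test is to trace the copy syscalls - assuming the offload path goes through copy_file_range() and the host kernel and strace are recent enough to know that syscall (SCSI-level offload such as XCOPY would not show up this way):

    # Count copy_file_range calls made by qemu-img during the convert
    strace -f -c -e trace=copy_file_range \
        qemu-img convert -p -f raw -O raw -t none -T none src-img dst-img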
Created attachment 1476302 [details] Detailed test results 100/19g sparse image
Created attachment 1476303 [details] Detailed test results 100/52g sparse image
Created attachment 1476304 [details] Detailed test results 100/86g sparse image
Created attachment 1476305 [details] Parallel dd test script for file storage
Created attachment 1476306 [details] Parallel dd test script for block storage
We are in the blocker-only stage of 4.2.6. This change requires full regression testing as this is a key flow. Therefore I think we should wait for 4.2.7 to merge this.
Removing qa_ack+ as this won't be part of 4.2.6
It'd be interesting to test with ddpt and friends from the sg3 libs (instead of parallel dd). Also, parallel dd can run with up to 4-8 processes - one per path to the storage, for example.
Guy, Nir tested it on our Leopards with the NFS storage you gave him.
I just stumbled onto this ticket while looking for a reason why disk image copying is so slow. I have oVirt 4.2.5, and one of my hypervisors is a new dual-Xeon machine with 256 GB RAM connected to a SAN over a dedicated 10 Gbps iSCSI link (plus an additional 1 Gbps for external networking). I installed a minimal Linux OS on a 500 GB thick-provisioned disk image, shut down the VM, and ran a disk copy to the same SAN volume using the oVirt UI. With no VMs running on this host, the copying rate is less than 100 Mbps according to my Zabbix monitoring (as expected, all traffic comes and goes over the 10 Gbps SAN link). CPU load is 0.2-0.3 and only a fraction of the memory is used. This result is really, really bad. No CPU, network, or memory saturation. I don't really understand why qemu should be involved at all in a simple disk image copy (or move) if there is no image conversion or resizing and the disk is thick-provisioned. RHEV should be smart enough to figure out the best strategy for a disk copy/move and utilise the available resources.
I ran the following setup:

VM with 2 disks:

disk 1:
  Preallocated
  Size 10 GB
disk 2:
  Thin provisioned
  Virtual size 10 GB
  Actual size 3 GB

I have a system with one fibre channel SD and an NFS export domain. I tested import and export, once with 4.2.7 (vdsm-4.20.43-1) and a second time with 4.2.6 (vdsm-4.20.39.1-1).

On 4.2.6 the import took 1 minute and 8 seconds and the export took 2 minutes and 49 seconds.
On 4.2.7 the import took 49 seconds and the export took 2 minutes and 18 seconds.

So we do see an improvement between 4.2.6 and 4.2.7.
(In reply to guy chen from comment #69)
> I ran the following setup:
>
> VM with 2 disks:
>
> disk 1:
>   Preallocated
>   Size 10 GB
> disk 2:
>   Thin provisioned
>   Virtual size 10 GB
>   Actual size 3 GB

Can you test with bigger disks, like 100G? Comment 0 mentions a 2T disk - I don't think we need to waste time on such a huge disk, but with a 10G disk a lot of time is spent in the engine's inefficient polling.

Also, the improvement applies only when the destination is raw format on block storage, so we don't expect a faster export, only a faster import or move/copy disk.
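A minimal hedged sketch of how such a 100G test disk could be pre-filled with non-zero data before the import/copy test, so that zero detection does not dominate the result (the path is a placeholder, and this is only one possible way to prepare the disk):

    # Fill the first ~50G of the 100G test volume with random data (6400 x 8M blocks)
    dd if=/dev/urandom of=/path/to/test-volume bs=8M count=6400 \
        oflag=direct conv=fsync status=progress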
(In reply to Assen Totin from comment #68)

RHV is using the most sophisticated tool for copying images, supporting all image formats and using the most efficient code for reading, writing, and zeroing on many file systems and storage backends.

Did you try with 4.2.7? Copying raw images from iSCSI to iSCSI should be much faster now; see comment 57 for examples using FC storage.

According to your description, the copy speed in your case depends on the read throughput from the source image and, since the image is mostly unallocated, on the speed of writing zeros to the storage.

If you want to understand more about why copies are slow on your system, please create a 500g raw disk for testing, and activate both the source and destination disks on storage:

    lvchange -ay domain-uuid/src-volume-uuid
    lvchange -ay domain-uuid/dst-volume-uuid

Before running these tests, please find the dm-NNN devices for these lvs, and collect io stats while performing io. You can find them using:

    ls -l /dev/domain-uuid/{src-volume-uuid,dst-volume-uuid}

Checking how fast we can read data and detect zeros from storage:

    iostat -xdm dm-xxx 2 >> read-iostat.log
    time dd if=/dev/domain-uuid/src-volume-uuid of=/dev/null \
        bs=8M iflag=direct conv=sparse status=progress

Checking how fast we can zero:

    iostat -xdm dm-yyy 2 >> zero-iostat.log
    time blkdiscard -z -p 32m /dev/domain-uuid/dst-volume-uuid

Checking how fast dd can copy the image:

    iostat -xdm dm-xxx dm-yyy 2 >> dd-copy-iostat.log
    time dd if=/dev/domain-uuid/src-volume-uuid \
        of=/dev/domain-uuid/dst-volume-uuid \
        bs=8M iflag=direct oflag=direct conv=sparse status=progress

Checking how fast qemu-img copies the image:

    iostat -xdm dm-xxx dm-yyy 2 >> qemu-img-convert-iostat.log
    time qemu-img convert -p -f raw -O raw -t none -T none -W \
        /dev/domain-uuid/src-volume-uuid /dev/domain-uuid/dst-volume-uuid

Please share the output of the commands and the iostat logs.

If qemu-img convert is not fast enough when copying raw images, this should be improved in qemu-img, not in RHV by using another tool. This will benefit all users instead of only RHV users.

In 4.3 we will support storage offloading using cinderlib. If the Cinder driver for your storage supports efficient cloning, such an operation may be much faster or even instantaneous (e.g. using copy on write).
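A hedged one-liner for turning one of those iostat logs into an average throughput figure afterwards - this assumes the RHEL 7 sysstat column layout of iostat -xdm, where rMB/s and wMB/s are the 6th and 7th columns; adjust the column numbers if your sysstat version differs:

    awk '/^dm-/ { r += $6; w += $7; n++ }
         END { if (n) printf "avg read %.1f MB/s, avg write %.1f MB/s\n", r/n, w/n }' read-iostat.log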
I have tested importing a VM with a 100GB preallocated disk on vdsm 4.30.3 vs 4.20.34; it shows a nice improvement:

vdsm-4.30.3-1
Duration: 1m18s

vdsm-4.20.34-1
Started: Nov 27, 2018, 12:33:27 PM
Completed: Nov 27, 2018, 12:35:16 PM
Duration: 1m49s
Thank you Guy. I see about the same improvement rate (30%) as in the clone of this BZ that we verified in 4.2.7 (bug 1621211). Moving to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1077