Description of problem:

Downloading images using the imageio-client example[1] is much slower than
uploading. Here is an example uploading an image from a file system on a fast
SSD to another file system on the same SSD. Storage is set up as described in
the example documentation.

$ hyperfine "./imageio-client --insecure upload /scratch/fedora34.qcow2 https://localhost:54322/images/dst"
Benchmark #1: ./imageio-client --insecure upload /scratch/fedora34.qcow2 https://localhost:54322/images/dst
  Time (mean ± σ):     28.141 s ±  0.739 s    [User: 3.265 s, System: 25.765 s]
  Range (min … max):   26.543 s … 29.169 s    10 runs

Downloading the same file:

$ hyperfine "./imageio-client --insecure download --format=raw https://localhost:54322/images/src /scratch/dst.raw"
Benchmark #1: ./imageio-client --insecure download --format=raw https://localhost:54322/images/src /scratch/dst.raw
  Time (mean ± σ):     75.367 s ±  1.487 s    [User: 3.490 s, System: 35.968 s]
  Range (min … max):   71.802 s … 77.123 s    10 runs

Download is 2.67 times slower.

Upload pipeline is:

    src image -> imageio client -> imageio server -> qemu-nbd -> dst image

Download pipeline is:

    src image -> qemu-nbd -> imageio server -> imageio client -> dst image

So we expect similar throughput.

Version-Release number of selected component (if applicable):
2.2.0

How reproducible:
Always

Steps to Reproduce:
1. Download a disk or full backup
2. Compare to uploading the same disk

[1] https://github.com/oVirt/ovirt-imageio/blob/master/examples/imageio-client
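The 2.67 factor is simply the ratio of the hyperfine mean times; a quick
check using the numbers above:

$ echo "scale=2; 75.367 / 28.141" | bc
2.67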
Testing this requires storage with consistent performance on both sides, so
it is best done in the scale lab.

We need to test single and multiple downloads (or full backups), local and
remote.

When testing remote download, ovirtmgmt must be on a fast network (e.g. 10g).
There is no point in testing on a 1g network.
Nir, is this fix already delivered in the new imageio version? Can we move it to ON_QA?
The fix is merged but I did not release a version with this fix yet. I can create a release (ovirt-imageio 2.3.0) for 4.4.9.
(In reply to Nir Soffer from comment #1)
> Testing this requires storage with consistent performance on both sides,
> so it is best done in the scale lab.
>
> We need to test single and multiple downloads (or full backups), local
> and remote.
>
> When testing remote download, ovirtmgmt must be on a fast network
> (e.g. 10g). There is no point in testing on a 1g network.

Mordechai, according to this comment by Nir we need scale lab stability and
performance team skills. Can you please assist?
(In reply to Avihai from comment #4)
> (In reply to Nir Soffer from comment #1)
> > Testing this requires storage with consistent performance on both sides,
> > so it is best done in the scale lab.
> >
> > We need to test single and multiple downloads (or full backups), local
> > and remote.
> >
> > When testing remote download, ovirtmgmt must be on a fast network
> > (e.g. 10g). There is no point in testing on a 1g network.
>
> Mordechai, according to this comment by Nir we need scale lab stability and
> performance team skills.
> Can you please assist?

ACK - moving to Tzahi.
Attached are the results for a baseline for a single disk (without the fix).

It seems that the download is faster than the upload when both the source
file location and the destination file location are on an NVMe device with an
xfs file system:

[root@f02-h22-000-r640 ]# df -TH | grep nvme
/dev/nvme0n1p1 xfs  3.2T  453G  2.8T  15% /Nvme_Disk

* Version: rhv-release-4.4.9-3-001.noarch
* Ovirtmgmt interface speed: 25Gb
* 72GB disk size
* Download time: 97 sec
* Download speed using ibmonitor tool on the ovirtmgmt interface: ~733MB/s
* Upload time: 121 sec
* Upload speed using ibmonitor tool on the ovirtmgmt interface: ~720MB/s
* NVMe device used for the download location and for the source file location

Disk info:

qemu-img info /Nvme_Disk/images/src/rhel76.qcow2
image: /Nvme_Disk/images/src/rhel76.qcow2
file format: qcow2
virtual size: 100 GiB (107374182400 bytes)
disk size: 72.2 GiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk.py -c my-engine 28ef5767-7a21-4f23-81a4-248f5eb33f66 /Nvme_Disk/images/dst/rhel76_download.qcow2
[ 0.0 ] Connecting...
[ 0.3 ] Creating image transfer...
[ 2.2 ] Transfer ID: 78e8d056-aba3-4786-819f-171bada75d74
[ 2.2 ] Transfer host name: f01-h07-000-r640.rdu2.scalelab.redhat.com
[ 2.2 ] Downloading disk...
[ 100.00% ] 100.00 GiB, 94.08 seconds, 1.06 GiB/s
[ 96.3 ] Finalizing image transfer...
[ 97.3 ] Download completed successfully

real    1m37.462s
user    0m53.649s
sys     2m1.754s

[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py -c my-engine --disk-sparse --disk-format qcow2 --sd-name L0_Group_0 /Nvme_Disk/images/src/rhel76.qcow2
[ 0.0 ] Checking image...
[ 0.1 ] Image format: qcow2
[ 0.1 ] Disk format: cow
[ 0.1 ] Disk content type: data
[ 0.1 ] Disk provisioned size: 107374182400
[ 0.1 ] Disk initial size: 70458408960
[ 0.1 ] Disk name: rhel76.qcow2
[ 0.1 ] Disk backup: False
[ 0.1 ] Connecting...
[ 0.1 ] Creating disk...
[ 17.6 ] Disk ID: be1b3ff0-c9b1-4f28-9158-a8e64fe60f17
[ 17.6 ] Creating image transfer...
[ 19.6 ] Transfer ID: 6f216723-b74f-4a67-b60b-3b6d6f6083cf
[ 19.6 ] Transfer host name: f01-h07-000-r640.rdu2.scalelab.redhat.com
[ 19.6 ] Uploading image...
[ 100.00% ] 100.00 GiB, 102.01 seconds, 1003.78 MiB/s
[ 121.6 ] Finalizing image transfer...
[ 123.6 ] Upload completed successfully

real    2m3.766s
user    0m27.414s
sys     0m55.710s
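Note that the progress line reports throughput against the 100 GiB virtual
size; against the 72.2 GiB of actually allocated data the download rate is
lower:

$ echo "scale=2; 72.2 / 94.08" | bc   # GiB/s of allocated data
.76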
I see now that I tested this only with raw format:

$ hyperfine "./imageio-client --insecure download --format=raw https://localhost:54322/images/src /scratch/dst.raw"

It is possible that the issue exists only with raw format. Let's repeat this
test using:

python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py \
    -c my-engine \
    --disk-format raw \
    --sd-name L0_Group_0 \
    /Nvme_Disk/images/src/rhel76.qcow2

python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk.py \
    -c my-engine \
    --format raw \
    28ef5767-7a21-4f23-81a4-248f5eb33f66 \
    /Nvme_Disk/images/dst/rhel76_download.raw

If this does not reproduce the issue, we need to test with other storage. I
think the most useful setup will be a fast NFS server, since most users will
download images to NFS and not to local storage.
Using --disk-format raw does not make any difference in the results; the
download is still faster than the upload:

[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk.py -c my-engine --format raw 28ef5767-7a21-4f23-81a4-248f5eb33f66 /Nvme_Disk/images/dst/rhel76_download3.qcow2
[ 0.0 ] Connecting...
[ 0.2 ] Creating image transfer...
[ 7.0 ] Transfer ID: a9736bd4-7492-4419-b53b-066d0781c4cc
[ 7.0 ] Transfer host name: f02-h22-000-r640.rdu2.scalelab.redhat.com
[ 7.0 ] Downloading disk...
[ 100.00% ] 100.00 GiB, 96.53 seconds, 1.04 GiB/s
[ 103.6 ] Finalizing image transfer...
[ 106.6 ] Download completed successfully

real    1m46.769s
user    0m4.051s
sys     1m40.957s

[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py -c my-engine --disk-format raw --disk-sparse --disk-format qcow2 --sd-name L0_Group_0 /Nvme_Disk/images/src/rhel76.qcow2
[ 0.0 ] Checking image...
[ 0.1 ] Image format: qcow2
[ 0.1 ] Disk format: cow
[ 0.1 ] Disk content type: data
[ 0.1 ] Disk provisioned size: 107374182400
[ 0.1 ] Disk initial size: 70458408960
[ 0.1 ] Disk name: rhel76.qcow2
[ 0.1 ] Disk backup: False
[ 0.1 ] Connecting...
[ 0.1 ] Creating disk...
[ 18.8 ] Disk ID: 23728563-c286-433f-83e8-1b30b2ea3961
[ 18.8 ] Creating image transfer...
[ 21.7 ] Transfer ID: 9e135d82-216f-44c2-84e7-cf964fa8103d
[ 21.7 ] Transfer host name: f02-h22-000-r640.rdu2.scalelab.redhat.com
[ 21.7 ] Uploading image...
[ 100.00% ] 100.00 GiB, 96.89 seconds, 1.03 GiB/s
[ 118.6 ] Finalizing image transfer...
[ 131.7 ] Upload completed successfully

real    2m11.900s
user    0m3.502s
sys     1m6.453s
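For comparison with the 2.67 ratio in the original report, the transfer
phases here are practically equal (download 96.53s vs. upload 96.89s):

$ echo "scale=2; 96.53 / 96.89" | bc
.99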
Tested the download again to an NFS share at Nir's request.

NFS server running on server f01-h07-000-r640.rdu2.scalelab.redhat.com:

[root@f02-h22-000-r640 ~]# df -TH
Filesystem                       Type  Size  Used Avail Use% Mounted on
172.16.12.7:/Nvme_NFS/nfs_share  nfs4  3.2T   93G  3.2T   3% /Download_Folder_f01-h07

[root@f02-h22-000-r640 ~]# showmount -e 172.16.12.7
Export list for 172.16.12.7:
/Nvme_NFS/nfs_share

* Download to NFS share:

[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk.py -c my-engine --format raw 28ef5767-7a21-4f23-81a4-248f5eb33f66 /Download_Folder_f01-h07/rhel76.qcow2
[ 0.0 ] Connecting...
[ 0.2 ] Creating image transfer...
[ 4.3 ] Transfer ID: 6a988189-2c17-4589-9cda-7f2467f2e820
[ 4.3 ] Transfer host name: f02-h22-000-r640.rdu2.scalelab.redhat.com
[ 4.3 ] Downloading disk...
[ 100.00% ] 100.00 GiB, 105.90 seconds, 966.93 MiB/s
[ 110.2 ] Finalizing image transfer...
[ 117.3 ] Download completed successfully

real    1m57.472s
user    0m3.888s
sys     2m3.904s

* Download to local NVMe disk using xfs filesystem:

[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk.py -c my-engine --format raw 28ef5767-7a21-4f23-81a4-248f5eb33f66 /Nvme_Disk/images/dst/rhel76_download3.qcow2
[ 0.0 ] Connecting...
[ 0.2 ] Creating image transfer...
[ 7.0 ] Transfer ID: a9736bd4-7492-4419-b53b-066d0781c4cc
[ 7.0 ] Transfer host name: f02-h22-000-r640.rdu2.scalelab.redhat.com
[ 7.0 ] Downloading disk...
[ 100.00% ] 100.00 GiB, 96.53 seconds, 1.04 GiB/s
[ 103.6 ] Finalizing image transfer...
[ 106.6 ] Download completed successfully

real    1m46.769s
user    0m4.051s
sys     1m40.957s

The download to the NFS share is about 11 seconds slower compared to the
local NVMe disk.
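The ~11 second figure comes from the wall-clock (real) times:

$ echo "117.472 - 106.769" | bc
10.703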
It looks like the tested qemu-nbd version includes this upstream fix:
https://github.com/qemu/qemu/commit/09615257058a0ae87b837bb041f56f7312d9ead8

But this fix is expected only in qemu 6.2.0, or the tested version is not
qemu-nbd 6.0.0, where we found the performance issue. Maybe this was a
regression in qemu 6.0.0.

On my rhel 8.5 nightly host I have:

$ rpm -q qemu-img
qemu-img-6.0.0-31.module+el8.5.0+12787+aaa8bdfa.x86_64

Which qemu-img package did you test? To avoid the confusion next time, please
always include the output of "rpm -qa" in the bug.

Additionally, how is the NFS server configured? Please share the output of
"exportfs -v" on the NFS server. For example, if you used the "async" export
option, it will hide the issue, since writes are acknowledged before they
reach stable storage.

If this was not tested with qemu-img 6.0.0, please make sure you are using
the right repos for RHEL 8.5 nightly and upgrade qemu-kvm to the latest
version.

On a setup reproducing this issue, copying an image to qemu-nbd is much
slower with the defaults, using --cache=writethrough. To ensure that we can
reproduce the issue, let's test copying with qemu-nbd first.

1. Create the destination image:

   qemu-img create /mountpoint/dst.img 100g

2. Start qemu-nbd with the default cache:

   qemu-nbd -t -f raw -k /tmp/nbd.sock /mountpoint/dst.img

3. Copy the image using qemu-img:

   time qemu-img convert -f qcow2 -O raw -W /path/to/src.qcow2 nbd+unix://?socket=/tmp/nbd.sock

Repeat this test using --cache=writethrough and --cache=writeback:

   qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writethrough /mountpoint/dst.img

and:

   qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writeback /mountpoint/dst.img

If we don't see any difference with qemu-img convert, we will not be able to
reproduce the issue in this environment.
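For convenience, a minimal shell sketch automating the comparison above
(/mountpoint and /path/to/src.qcow2 are placeholders, as in the steps):

for cache in writethrough writeback; do
    rm -f /mountpoint/dst.img
    qemu-img create -f raw /mountpoint/dst.img 100g

    # Serve the destination image over a unix socket with the given cache mode.
    qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=$cache /mountpoint/dst.img &
    nbd_pid=$!
    sleep 1  # give qemu-nbd time to create the socket

    echo "cache=$cache"
    time qemu-img convert -f qcow2 -O raw -W /path/to/src.qcow2 \
        'nbd+unix://?socket=/tmp/nbd.sock'

    kill $nbd_pid
    wait $nbd_pid 2>/dev/null
    rm -f /tmp/nbd.sock
done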
[root@f02-h22-000-r640 ~]# rpm -qa | grep 'imageio\|rhv\|qemu-img'
qemu-img-6.0.0-31.module+el8.5.0+12787+aaa8bdfa.x86_64
ovirt-imageio-daemon-2.2.0-1.el8ev.x86_64
rhv-release-4.4.9-3-001.noarch

[root@f01-h07-000-r640 ~]# exportfs -v
/Nvme_NFS/nfs_share
		<world>(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)

The dst.img file is on the NFS share:
=====================================

1. --cache=writethrough

[root@f02-h22-000-r640 ~]# time qemu-img convert -f qcow2 -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock

real    1m54.411s
user    0m4.568s
sys     0m35.419s

2. --cache=writeback

[root@f02-h22-000-r640 ~]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock

real    1m46.236s
user    0m4.140s
sys     0m34.223s

3. With the "-T none" flag on the qemu-img command:

[root@f02-h22-000-r640 ~]# time qemu-img convert -f qcow2 -p -T none -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock

real    1m48.589s
user    0m4.702s
sys     0m18.806s

The dst.img file is on /tmp on the local SSD of the OS:
=======================================================

1. --cache=writethrough

[root@f02-h22-000-r640 ~]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock

real    2m46.037s
user    0m4.448s
sys     0m32.398s

2. --cache=writeback

[root@f02-h22-000-r640 ~]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock

real    2m33.580s
user    0m4.042s
sys     0m32.810s
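Computing the relative gain from writeback out of the wall-clock (real)
times above, converted to seconds:

# NFS share: writethrough 114.411s vs. writeback 106.236s
$ echo "scale=1; (114.411 - 106.236) * 100 / 106.236" | bc
7.6

# Local SSD: writethrough 166.037s vs. writeback 153.580s
$ echo "scale=1; (166.037 - 153.580) * 100 / 153.580" | bc
8.1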
It looks like we cannot reproduce the slowdown with --cache=writethrough in
this environment. In both cases (NFS, local SSD) --cache=writeback is only
about 8% faster. In my tests it was 260% faster.

An interesting detail: "qemu-img convert" from a local file to NFS took 114
seconds, while downloading the same image from imageio to NFS took 106.

To complete the testing we need to upgrade to imageio 2.3.0 and compare the
download times for the same image to the local NVMe and to the NFS server.
Tested again on the version which contains the fix: rhv-release-4.4.9-4

[root@f02-h22-000-r640 yum.repos.d]# rpm -qa | grep 'imageio\|rhv\|qemu-img'
ovirt-imageio-client-2.3.0-1.el8ev.x86_64
qemu-img-6.0.0-31.module+el8.5.0+12787+aaa8bdfa.x86_64
rhv-release-4.4.9-4-001.noarch
ovirt-imageio-daemon-2.3.0-1.el8ev.x86_64
ovirt-imageio-common-2.3.0-1.el8ev.x86_64

Source file (located on the local NVMe device):

[root@f02-h22-000-r640 yum.repos.d]# qemu-img info /Nvme_Disk/images/src/rhel76.qcow2
image: /Nvme_Disk/images/src/rhel76.qcow2
file format: qcow2
virtual size: 100 GiB (107374182400 bytes)
disk size: 72.2 GiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

NFS test results:

qemu-nbd with --cache=writethrough (default), destination file located on the
NFS share (on the secondary host, backed by an NVMe disk):

qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writethrough /Download_Folder_f01-h07/dst.img

[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
    (100.00/100%)

real    1m57.154s
user    0m3.755s
sys     0m30.081s

qemu-nbd with --cache=writeback, destination file located on the NFS share
(on the secondary host, backed by an NVMe disk):

qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writeback /Download_Folder_f01-h07/dst.img

[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
    (100.00/100%)

real    1m41.654s
user    0m4.312s
sys     0m32.607s

Local NVMe disk results:

qemu-nbd with --cache=writethrough (default), destination file located on the
NVMe disk (3TB, file system: xfs):

qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writethrough /Nvme_Disk/images/dst/dst.img

[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
    (100.00/100%)

real    0m55.259s
user    0m4.598s
sys     0m34.488s

qemu-nbd with --cache=writeback, destination file located on the NVMe disk
(3TB, file system: xfs):

qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writeback /Nvme_Disk/images/dst/dst.img

[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
    (100.00/100%)

real    0m39.708s
user    0m4.087s
sys     0m33.705s

Local SSD disk results (on the OS disk: /tmp):

qemu-nbd with --cache=writethrough (default), destination file located on the
SSD disk (500GB, file system: xfs):

qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writethrough /tmp/dst.img

[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
    (100.00/100%)

real    2m36.232s
user    0m4.759s
sys     0m34.133s

qemu-nbd with --cache=writeback, destination file located on the SSD disk
(500GB, file system: xfs):

[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
    (100.00/100%)

real    2m35.635s
user    0m4.050s
sys     0m32.260s
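On the local NVMe disk the cache mode has a much bigger effect; writethrough
is about 39% slower than writeback in these runs:

$ echo "scale=1; (55.259 - 39.708) * 100 / 39.708" | bc
39.1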
Please ignore comments 15-16; I have attached the correct table results.

I fixed the section "Destination Local Nvme" for version rhv-4.4.9-3
imageio-2.2.0-1, which is N/A: that version was tested only using
download_disk.py and not using qemu-img convert.

Summary comparison of both versions:

* Source file located on an NVMe disk, file system xfs:
  /Nvme_Disk/images/src/rhel76.qcow2

+-----------------------------+--------------------------------------+--------------------------------------+--------------------------------------+
| Version                     | Destination NFS server               | Destination Local SSD                | Destination Local Nvme               |
|                             +-----------------+--------------------+-----------------+--------------------+-----------------+--------------------+
|                             | cache=writeback | cache=writethrough | cache=writeback | cache=writethrough | cache=writeback | cache=writethrough |
+-----------------------------+-----------------+--------------------+-----------------+--------------------+-----------------+--------------------+
| rhv-4.4.9-3 imageio-2.2.0-1 | 1m46.236s       | 1m54.411s          | 2m33.580s       | 2m46.037s          | N/A             | N/A                |
| rhv-4.4.9-4 imageio-2.3.0-1 | 1m41.654s       | 1m57.154s          | 2m35.635s       | 2m36.232s          | 0m39.708s       | 0m55.259s          |
+-----------------------------+-----------------+--------------------+-----------------+--------------------+-----------------+--------------------+
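To compare table entries numerically, the m:ss values can be converted to
seconds, e.g. for the NFS writeback entry:

$ echo "1*60 + 46.236" | bc
106.236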
I think we can move to VERIFIED. We could not reproduce the original issue,
and we don't see any significant difference between upload and download
times.
This bugzilla is included in oVirt 4.4.9 release, published on October 20th
2021.

Since the problem described in this bug report should be resolved in oVirt
4.4.9 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.