Bug 1990656
| Summary: | Downloading images much slower than uploading | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-imageio | Reporter: | Nir Soffer <nsoffer> |
| Component: | Client | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Tzahi Ashkenazi <tashkena> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.2.0 | CC: | aefrat, ahadas, bugs, eshenitz, mlehrer, nsoffer, sgordon, tashkena, tnisan |
| Target Milestone: | ovirt-4.4.9 | Keywords: | Performance, ZStream |
| Target Release: | 2.3.0 | Flags: | pm-rhel: ovirt-4.4+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-imageio-2.3.0-1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-21 07:27:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Nir Soffer
2021-08-05 20:54:52 UTC
Testing this requires storage with consistent performance on both sides, so it is best done in the scale lab. We need to test single and multiple downloads (or full backups), local and remote. When testing remote download, ovirtmgmt must be on a fast network (e.g. 10g). There is no point in testing on a 1g network.

Nir, is this fix already delivered in the new imageio version? Can we move it to ON_QA?

The fix is merged but I did not release a version with this fix yet. I can create a release (ovirt-imageio 2.3.0) for 4.4.9.

(In reply to Nir Soffer from comment #1)
> Testing this requires storage with consistent performance on both sides,
> so it is best done in the scale lab.
>
> We need to test single and multiple downloads (or full backups), local
> and remote.
>
> When testing remote download, ovirtmgmt must be on a fast network
> (e.g. 10g). There is no point in testing on a 1g network.

Mordechai, according to this comment by Nir we need scale lab stability and performance team skills. Can you please assist?

(In reply to Avihai from comment #4)
> Mordechai, according to this comment by Nir we need scale lab stability and
> performance team skills.
> Can you please assist?

ACK - moving to Tzahi.

Attached are the results for a baseline for a single disk (without the fix):
It seems that the download speed is faster than the upload speed when both the source file location and the destination file location are on an NVMe device using XFS as the file system:
[root@f02-h22-000-r640 ]# df -TH |grep nvme
/dev/nvme0n1p1 xfs 3.2T 453G 2.8T 15% /Nvme_Disk
* Version : rhv-release-4.4.9-3-001.noarch
* Ovirtmgmt interface speed : 25Gb
* 72GB disk size
* Download time: 97 sec
* Download speed using ibmonitor tool on ovirtmgmt interface > ~733 MB/s
* Upload time: 121 sec
* Upload speed using ibmonitor tool on ovirtmgmt > ~720 MB/s
* NVMe device used for both the source file location and the download location
Disk info:
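As a sanity check (not part of the original report), the reported transfer rates are consistent with virtual size divided by transfer time, taking the transfer-phase times from the progress lines of the transcripts that follow (the wall-clock times in the bullets above also include connection setup and finalization):

```shell
# Sanity check: rate = virtual size / transfer-phase time,
# figures taken from the download/upload progress lines below
awk 'BEGIN {
    printf "download: %.2f GiB/s\n", 100 / 94.08          # matches reported 1.06 GiB/s
    printf "upload:   %.2f MiB/s\n", 100 * 1024 / 102.01  # close to reported 1003.78 MiB/s
}'
```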
qemu-img info /Nvme_Disk/images/src/rhel76.qcow2
image: /Nvme_Disk/images/src/rhel76.qcow2
file format: qcow2
virtual size: 100 GiB (107374182400 bytes)
disk size: 72.2 GiB
cluster_size: 65536
Format specific information:
compat: 1.1
compression type: zlib
lazy refcounts: false
refcount bits: 16
corrupt: false
extended l2: false
[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk.py -c my-engine 28ef5767-7a21-4f23-81a4-248f5eb33f66 /Nvme_Disk/images/dst/rhel76_download.qcow2
[ 0.0 ] Connecting...
[ 0.3 ] Creating image transfer...
[ 2.2 ] Transfer ID: 78e8d056-aba3-4786-819f-171bada75d74
[ 2.2 ] Transfer host name: f01-h07-000-r640.rdu2.scalelab.redhat.com
[ 2.2 ] Downloading disk...
[ 100.00% ] 100.00 GiB, 94.08 seconds, 1.06 GiB/s
[ 96.3 ] Finalizing image transfer...
[ 97.3 ] Download completed successfully
real 1m37.462s
user 0m53.649s
sys 2m1.754s
[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py -c my-engine --disk-sparse --disk-format qcow2 --sd-name L0_Group_0 /Nvme_Disk/images/src/rhel76.qcow2
[ 0.0 ] Checking image...
[ 0.1 ] Image format: qcow2
[ 0.1 ] Disk format: cow
[ 0.1 ] Disk content type: data
[ 0.1 ] Disk provisioned size: 107374182400
[ 0.1 ] Disk initial size: 70458408960
[ 0.1 ] Disk name: rhel76.qcow2
[ 0.1 ] Disk backup: False
[ 0.1 ] Connecting...
[ 0.1 ] Creating disk...
[ 17.6 ] Disk ID: be1b3ff0-c9b1-4f28-9158-a8e64fe60f17
[ 17.6 ] Creating image transfer...
[ 19.6 ] Transfer ID: 6f216723-b74f-4a67-b60b-3b6d6f6083cf
[ 19.6 ] Transfer host name: f01-h07-000-r640.rdu2.scalelab.redhat.com
[ 19.6 ] Uploading image...
[ 100.00% ] 100.00 GiB, 102.01 seconds, 1003.78 MiB/s
[ 121.6 ] Finalizing image transfer...
[ 123.6 ] Upload completed successfully
real 2m3.766s
user 0m27.414s
sys 0m55.710s
I see now that I tested this only with raw format:
$ hyperfine "./imageio-client --insecure download --format=raw https://localhost:54322/images/src /scratch/dst.raw"
It is possible that the issue exists only with raw format. Let's repeat this test
using:
python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py \
-c my-engine \
--disk-format raw \
--sd-name L0_Group_0 \
/Nvme_Disk/images/src/rhel76.qcow2
python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk.py \
-c my-engine \
--format raw \
28ef5767-7a21-4f23-81a4-248f5eb33f66 \
/Nvme_Disk/images/dst/rhel76_download.raw
If this does not reproduce the issue, we need to test with different storage.
I think the most useful setup will be a fast NFS server, since most users will
download images to NFS and not to local storage.
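For reference, whether the NFS export uses "sync" or "async" matters for this kind of measurement: with "async" the server acknowledges writes before they reach disk, which can mask a client-side flushing problem. A minimal export line keeping the synchronous behavior (hypothetical example, not the actual configuration of this lab):

```
# /etc/exports — hypothetical example; "sync" makes the server commit
# writes to stable storage before replying, while "async" would hide
# write latency and skew the comparison
/Nvme_NFS/nfs_share *(rw,sync,no_root_squash)
```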
Using --disk-format raw doesn't make any difference in the results; the download is still faster than the upload:

[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk.py -c my-engine --format raw 28ef5767-7a21-4f23-81a4-248f5eb33f66 /Nvme_Disk/images/dst/rhel76_download3.qcow2
[ 0.0 ] Connecting...
[ 0.2 ] Creating image transfer...
[ 7.0 ] Transfer ID: a9736bd4-7492-4419-b53b-066d0781c4cc
[ 7.0 ] Transfer host name: f02-h22-000-r640.rdu2.scalelab.redhat.com
[ 7.0 ] Downloading disk...
[ 100.00% ] 100.00 GiB, 96.53 seconds, 1.04 GiB/s
[ 103.6 ] Finalizing image transfer...
[ 106.6 ] Download completed successfully

real 1m46.769s
user 0m4.051s
sys 1m40.957s

[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/upload_disk.py -c my-engine --disk-format raw --disk-sparse --disk-format qcow2 --sd-name L0_Group_0 /Nvme_Disk/images/src/rhel76.qcow2
[ 0.0 ] Checking image...
[ 0.1 ] Image format: qcow2
[ 0.1 ] Disk format: cow
[ 0.1 ] Disk content type: data
[ 0.1 ] Disk provisioned size: 107374182400
[ 0.1 ] Disk initial size: 70458408960
[ 0.1 ] Disk name: rhel76.qcow2
[ 0.1 ] Disk backup: False
[ 0.1 ] Connecting...
[ 0.1 ] Creating disk...
[ 18.8 ] Disk ID: 23728563-c286-433f-83e8-1b30b2ea3961
[ 18.8 ] Creating image transfer...
[ 21.7 ] Transfer ID: 9e135d82-216f-44c2-84e7-cf964fa8103d
[ 21.7 ] Transfer host name: f02-h22-000-r640.rdu2.scalelab.redhat.com
[ 21.7 ] Uploading image...
[ 100.00% ] 100.00 GiB, 96.89 seconds, 1.03 GiB/s
[ 118.6 ] Finalizing image transfer...
[ 131.7 ] Upload completed successfully

real 2m11.900s
user 0m3.502s
sys 1m6.453s

Tested the download process again to an NFS share at Nir's request.
NFS server running on server > f01-h07-000-r640.rdu2.scalelab.redhat.com

[root@f02-h22-000-r640 ~]# df -TH
Filesystem Type Size Used Avail Use% Mounted on
172.16.12.7:/Nvme_NFS/nfs_share nfs4 3.2T 93G 3.2T 3% /Download_Folder_f01-h07

[root@f02-h22-000-r640 ~]# showmount -e 172.16.12.7
Export list for 172.16.12.7:
/Nvme_NFS/nfs_share *

Download to NFS share:
[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk.py -c my-engine --format raw 28ef5767-7a21-4f23-81a4-248f5eb33f66 /Download_Folder_f01-h07/rhel76.qcow2
[ 0.0 ] Connecting...
[ 0.2 ] Creating image transfer...
[ 4.3 ] Transfer ID: 6a988189-2c17-4589-9cda-7f2467f2e820
[ 4.3 ] Transfer host name: f02-h22-000-r640.rdu2.scalelab.redhat.com
[ 4.3 ] Downloading disk...
[ 100.00% ] 100.00 GiB, 105.90 seconds, 966.93 MiB/s
[ 110.2 ] Finalizing image transfer...
[ 117.3 ] Download completed successfully

real 1m57.472s
user 0m3.888s
sys 2m3.904s

Download to local NVMe disk using XFS filesystem:
[root@f02-h22-000-r640 ~]# time python3 /usr/share/doc/python3-ovirt-engine-sdk4/examples/download_disk.py -c my-engine --format raw 28ef5767-7a21-4f23-81a4-248f5eb33f66 /Nvme_Disk/images/dst/rhel76_download3.qcow2
[ 0.0 ] Connecting...
[ 0.2 ] Creating image transfer...
[ 7.0 ] Transfer ID: a9736bd4-7492-4419-b53b-066d0781c4cc
[ 7.0 ] Transfer host name: f02-h22-000-r640.rdu2.scalelab.redhat.com
[ 7.0 ] Downloading disk...
[ 100.00% ] 100.00 GiB, 96.53 seconds, 1.04 GiB/s
[ 103.6 ] Finalizing image transfer...
[ 106.6 ] Download completed successfully

real 1m46.769s
user 0m4.051s
sys 1m40.957s

The download time to the NFS share is 11 sec slower compared to the local NVMe disk.

It looks like the qemu-nbd version tested includes this upstream fix:
https://github.com/qemu/qemu/commit/09615257058a0ae87b837bb041f56f7312d9ead8
But this fix is expected in qemu 6.2.0. Or the tested version is not qemu-nbd 6.0.0, where we found the performance issue. Maybe this was a regression in qemu 6.0.0. On my rhel 8.5 nightly host I have:

$ rpm -q qemu-img
qemu-img-6.0.0-31.module+el8.5.0+12787+aaa8bdfa.x86_64

Which qemu-img package did you test? To avoid the confusion next time, please always include the output of "rpm -qa" in the bug.

Additionally, how is the NFS server configured? Please share the output of "exportfs -v" on the NFS server. For example, if you used the "async" option, it will hide the issue with unwanted

If this was not tested with qemu-img 6.0.0, please make sure you are using the right repos for RHEL 8.5 nightly and upgrade qemu-kvm to the latest version.

On a setup reproducing this issue, copying an image to qemu-nbd is much slower with the defaults, using --cache=writethrough. To ensure that we can reproduce the issue, let's first test copying with qemu-nbd.

1. Create the destination image:

qemu-img create /mountpoint/dst.img 100g

2. Start qemu-nbd with the default cache:

qemu-nbd -t -f raw -k /tmp/nbd.sock /mountpoint/dst.img

3. Copy the image using qemu-img:

time qemu-img convert -f qcow2 -O raw -W /path/to/src.qcow2 nbd+unix://?socket=/tmp/nbd.sock

Repeat this test using --cache=writethrough and --cache=writeback:

qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writethrough /mountpoint/dst.img

and:

qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writeback /mountpoint/dst.img

If we don't see any difference with qemu-img convert, we will not be able to reproduce the issue in this environment.

[root@f02-h22-000-r640 ~]# rpm -qa |grep 'imageio\|rhv\|qemu-img'
qemu-img-6.0.0-31.module+el8.5.0+12787+aaa8bdfa.x86_64
ovirt-imageio-daemon-2.2.0-1.el8ev.x86_64
rhv-release-4.4.9-3-001.noarch
[root@f01-h07-000-r640 ~]# exportfs -v
/Nvme_NFS/nfs_share
<world>(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
The dst.img file is on the NFS share:
======================
1. --cache=writethrough
[root@f02-h22-000-r640 ~]# time qemu-img convert -f qcow2 -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
real 1m54.411s
user 0m4.568s
sys 0m35.419s
2. --cache=writeback
[root@f02-h22-000-r640 ~]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
real 1m46.236s
user 0m4.140s
sys 0m34.223s
3. With the "-T none" flag on the qemu-img command:
[root@f02-h22-000-r640 ~]# time qemu-img convert -f qcow2 -p -T none -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
real 1m48.589s
user 0m4.702s
sys 0m18.806s
The dst.img file is on /tmp on the local SSD of the OS:
=========================================================
1. --cache=writethrough
[root@f02-h22-000-r640 ~]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
real 2m46.037s
user 0m4.448s
sys 0m32.398s
2. --cache=writeback
[root@f02-h22-000-r640 ~]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
real 2m33.580s
user 0m4.042s
sys 0m32.810s
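The cache-mode comparison above can be scripted; a sketch follows (dry run: it prints the commands rather than executing them, since qemu-nbd started with -t keeps serving and must be stopped between runs; paths are the ones used in this environment):

```shell
#!/bin/sh
# Dry-run sketch of the qemu-nbd cache-mode comparison: prints the
# commands for each cache mode (remove the echo prefixes to run them).
# Paths below are the ones used in this environment; adjust as needed.
SRC=/Nvme_Disk/images/src/rhel76.qcow2
DST=/Nvme_Disk/images/dst/dst.img
SOCK=/tmp/nbd.sock

for cache in writethrough writeback; do
    echo "== cache=$cache =="
    echo "qemu-img create -f raw $DST 100g"
    # -t keeps qemu-nbd running after the client disconnects, so it
    # must be stopped (e.g. killed) between the two runs
    echo "qemu-nbd -t -f raw -k $SOCK --cache=$cache $DST"
    echo "time qemu-img convert -f qcow2 -p -O raw -W $SRC nbd+unix://?socket=$SOCK"
done
```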
It looks like we cannot reproduce the slowdown with --cache=writethrough in this environment. In both cases (NFS, local SSD) --cache=writeback is 9.2% faster. In my tests it was 260% faster.

An interesting detail: "qemu-img convert" from a local file to NFS took 114 seconds, while downloading the same image from imageio to NFS took 106.

To complete testing we need to upgrade to imageio 2.3.0 and compare the download times for the same image to local NVMe and to the NFS server.

Tested again on the version which contains the fix: rhv-release-4.4.9-4
[root@f02-h22-000-r640 yum.repos.d]# rpm -qa |grep 'imageio\|rhv\|qemu-img'
ovirt-imageio-client-2.3.0-1.el8ev.x86_64
qemu-img-6.0.0-31.module+el8.5.0+12787+aaa8bdfa.x86_64
rhv-release-4.4.9-4-001.noarch
ovirt-imageio-daemon-2.3.0-1.el8ev.x86_64
ovirt-imageio-common-2.3.0-1.el8ev.x86_64
Source file (located on local NVMe device):
[root@f02-h22-000-r640 yum.repos.d]# qemu-img info /Nvme_Disk/images/src/rhel76.qcow2
image: /Nvme_Disk/images/src/rhel76.qcow2
file format: qcow2
virtual size: 100 GiB (107374182400 bytes)
disk size: 72.2 GiB
cluster_size: 65536
Format specific information:
compat: 1.1
compression type: zlib
lazy refcounts: false
refcount bits: 16
corrupt: false
extended l2: false
NFS test results:
qemu-nbd method > --cache=writethrough (default)
destination file located on the NFS share (on secondary host, backed by an NVMe disk):
qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writethrough /Download_Folder_f01-h07/dst.img
[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
(100.00/100%)
real 1m57.154s
user 0m3.755s
sys 0m30.081s
qemu-nbd method > --cache=writeback
destination file located on the NFS share (on secondary host, backed by an NVMe disk):
qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writeback /Download_Folder_f01-h07/dst.img
[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
(100.00/100%)
real 1m41.654s
user 0m4.312s
sys 0m32.607s
Local NVMe disk results:
qemu-nbd method > --cache=writethrough (default)
destination file located on NVMe disk (3TB, XFS file system):
qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writethrough /Nvme_Disk/images/dst/dst.img
[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
(100.00/100%)
real 0m55.259s
user 0m4.598s
sys 0m34.488s
qemu-nbd method > --cache=writeback
destination file located on NVMe disk (3TB, XFS file system):
qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writeback /Nvme_Disk/images/dst/dst.img
[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
(100.00/100%)
real 0m39.708s
user 0m4.087s
sys 0m33.705s
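On local NVMe the gap between cache modes is large; a quick check of the relative speedup, using the two wall-clock times above (not part of the original report):

```shell
# writethrough: 0m55.259s, writeback: 0m39.708s (local NVMe runs above)
awk 'BEGIN { printf "writeback is %.0f%% faster\n", (55.259 / 39.708 - 1) * 100 }'
# prints: writeback is 39% faster
```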
Local SSD disk results (on the OS disk: /tmp):
qemu-nbd method > --cache=writethrough (default)
destination file located on SSD disk (500GB, XFS file system):
qemu-nbd -t -f raw -k /tmp/nbd.sock --cache=writethrough /tmp/dst.img
[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
(100.00/100%)
real 2m36.232s
user 0m4.759s
sys 0m34.133s
qemu-nbd method > --cache=writeback
destination file located on SSD disk (500GB, XFS file system):
[root@f02-h22-000-r640 yum.repos.d]# time qemu-img convert -f qcow2 -p -O raw -W /Nvme_Disk/images/src/rhel76.qcow2 nbd+unix://?socket=/tmp/nbd.sock
(100.00/100%)
real 2m35.635s
user 0m4.050s
sys 0m32.260s
Please ignore comments 15-16; I have attached the correct table results.

Fixed the section "Destination Local Nvme" for version rhv-4.4.9-3 imageio-2.2.0-1, which is N/A: that version was only tested using download_disk.py, not using qemu-img convert.

Summary comparison of both versions:

* source file located on > NVMe disk, XFS file system > /Nvme_Disk/images/src/rhel76.qcow2

| Version | NFS: writeback | NFS: writethrough | Local SSD: writeback | Local SSD: writethrough | Local NVMe: writeback | Local NVMe: writethrough |
|---|---|---|---|---|---|---|
| rhv-4.4.9-3 imageio-2.2.0-1 | 1m46.236s | 1m54.411s | 2m33.580s | 2m46.037s | N/A | N/A |
| rhv-4.4.9-4 imageio-2.3.0-1 | 1m41.654s | 1m57.154s | 2m35.635s | 2m36.232s | 0m39.708s | 0m55.259s |

I think we can move to verified. We could not reproduce the original issue, but we don't see any significant difference between upload and download times.

This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.