Description of problem:
When pulling in an image from a URL, the process is relatively quick. Cloning the same image can take more than twice as long.

URL import:
I0921 02:55:45.864641 1 importer.go:51] Starting importer
I0921 02:55:45.864703 1 importer.go:112] begin import process
I0921 02:55:47.270804 1 data-processor.go:277] Calculating available size
I0921 02:55:47.270846 1 data-processor.go:289] Checking out file system volume size.
I0921 02:55:47.270857 1 data-processor.go:297] Request image size not empty.
I0921 02:55:47.270867 1 data-processor.go:302] Target size 63999987712.
I0921 02:55:47.270937 1 data-processor.go:206] New phase: Convert
I0921 02:55:47.270947 1 data-processor.go:212] Validating image
I0921 02:55:47.305466 1 qemu.go:212] 0.00
I0921 02:55:50.229006 1 qemu.go:212] 1.00
I0921 02:55:52.395782 1 qemu.go:212] 2.01
I0921 02:55:55.187488 1 qemu.go:212] 3.01
I0921 02:55:57.186483 1 qemu.go:212] 4.01
I0921 02:55:59.466189 1 qemu.go:212] 5.01
I0921 02:56:01.727444 1 qemu.go:212] 6.01
...
I0921 02:59:46.867102 1 qemu.go:212] 98.52
I0921 02:59:49.740095 1 qemu.go:212] 99.52
I0921 02:59:53.023658 1 data-processor.go:206] New phase: Resize
W0921 02:59:53.032533 1 data-processor.go:264] Available space less than requested size, resizing image to available space 63999987712.
I0921 02:59:53.032546 1 data-processor.go:270] Expanding image size to: 63999987712
I0921 02:59:53.038400 1 data-processor.go:206] New phase: Complete
I0921 02:59:53.038473 1 importer.go:175] Import complete
~ 4 min

Clone:
VOLUME_MODE=filesystem MOUNT_POINT=/var/run/cdi/clone/source /var/run/cdi/clone/source / UPLOAD_BYTES=63999991808 ./
I0921 02:44:33.570091 11 clone-source.go:107] content_type is "filesystem-clone"
I0921 02:44:33.570143 11 clone-source.go:108] upload_bytes is 63999991808
I0921 02:44:33.570151 11 clone-source.go:118] Starting cloner target ./disk.img
I0921 02:44:33.897187 11 clone-source.go:130] Set header to filesystem-clone
I0921 02:44:34.570339 11 prometheus.go:69] 0.03
I0921 02:44:35.570494 11 prometheus.go:69] 0.06
I0921 02:44:36.570672 11 prometheus.go:69] 0.09
I0921 02:44:37.570815 11 prometheus.go:69] 0.11
I0921 02:44:38.570915 11 prometheus.go:69] 0.14
I0921 02:44:39.571040 11 prometheus.go:69] 0.16
I0921 02:44:40.571179 11 prometheus.go:69] 0.18
I0921 02:44:41.571291 11 prometheus.go:69] 0.20
...
I0921 02:53:53.638620 11 prometheus.go:69] 15.12
I0921 02:53:54.130793 11 clone-source.go:97] Wrote 9686231040 bytes
I0921 02:53:54.138618 11 clone-source.go:148] Response body:
I0921 02:53:54.138630 11 clone-source.go:150] clone complete
~ 10 min

Clone (smaller PVC):
VOLUME_MODE=filesystem MOUNT_POINT=/var/run/cdi/clone/source /var/run/cdi/clone/source / UPLOAD_BYTES=13857230848 ./
I0921 03:39:17.785854 11 clone-source.go:107] content_type is "filesystem-clone"
I0921 03:39:17.785903 11 clone-source.go:108] upload_bytes is 13857230848
I0921 03:39:17.785909 11 clone-source.go:118] Starting cloner target ./disk.img
I0921 03:39:18.136714 11 clone-source.go:130] Set header to filesystem-clone
I0921 03:39:18.786125 11 prometheus.go:69] 0.11
I0921 03:39:19.786244 11 prometheus.go:69] 0.26
I0921 03:39:20.786381 11 prometheus.go:69] 0.43
I0921 03:39:21.786481 11 prometheus.go:69] 0.52
...
I0921 03:49:25.858458 11 prometheus.go:69] 69.69
I0921 03:49:26.858543 11 prometheus.go:69] 69.87
I0921 03:49:27.126005 11 clone-source.go:97] Wrote 9686231040 bytes
I0921 03:49:27.133934 11 clone-source.go:148] Response body:
I0921 03:49:27.133944 11 clone-source.go:150] clone complete
~ 10 min

Version-Release number of selected component (if applicable):
2.5

How reproducible:
100%

Steps to Reproduce:
1. Install CNV
2. Pull in an image from a URL (time the process)
3. Clone the same image (time the process)
4. Compare the results

Actual results:
Cloning is much slower for an image of the same size.

Expected results:
Since the image is the same size, the two methods should be comparable in speed.

Additional info:
Backend storage in this case is Portworx; however, I suspect we would see the same behavior regardless of which storage backend is used.
I wonder if this is due to CPU resource limits on the clone source or target pods, since we are streaming the data through gzip to reduce network traffic. Michael, could you please take a look? Cloning is even more important for 2.5, and I'd like to make sure we're performing acceptably.
Before investigating CPU limits, it would be helpful to know a little more about the import use case. What is the format of the file (qcow2 or raw)? What is the physical size? It looks like the virtual size is 63G.
Here is a snippet of the DataVolume they were testing when instantiating a VM:

  #dataVolumeTemplates:
  #  - apiVersion: cdi.kubevirt.io/v1alpha1
  #    kind: DataVolume
  #    metadata:
  #      name: fedora-32-disk-0
  #    spec:
  #      pvc:
  #        accessModes:
  #          - ReadWriteMany
  #        resources:
  #          requests:
  #            storage: 10Gi
  #        storageClassName: px-repl2-file
  #        #volumeMode: Filesystem
  #      source:
  #        http:
  #          url: >-
  #            https://files.caas.<customer redacted>.com:9443/ova/Fedora-Cloud-Base-32-1.6.x86_64.qcow2
  dataVolumeTemplates:
    - apiVersion: cdi.kubevirt.io/v1alpha1
      kind: DataVolume
      metadata:
        name: fedora-32-disk-0
      spec:
        pvc:
          accessModes:
            - ReadWriteMany
          resources:
            requests:
              storage: 10Gi
          storageClassName: px-repl2-file
          #volumeMode: Filesystem
        source:
          pvc:
            name: fedora-cloud-base-32-v1
            namespace: vm-images

They were switching between importing the Fedora image from the URL (the commented-out block) and cloning from a PVC (the active block). My thought was that maybe the cloning process copies the entire 10 GB PVC, whereas the HTTP import only transfers the initial qcow2 file, which I believe is over 1 GB.
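For reference, the gap between a qcow2 file's physical and virtual size is easy to confirm with qemu-img. Below is a minimal standalone sketch (not CDI code) that parses `qemu-img info --output=json`; the image filename is taken from the URL above, and qemu-img is assumed to be on the PATH:

// sizecheck.go - compares a qcow2 file's physical ("actual-size") and
// virtual ("virtual-size") sizes by shelling out to qemu-img.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os/exec"
)

// imgInfo holds the subset of qemu-img's JSON output we care about.
type imgInfo struct {
	VirtualSize int64  `json:"virtual-size"` // size the guest sees
	ActualSize  int64  `json:"actual-size"`  // bytes actually used on disk
	Format      string `json:"format"`
}

func main() {
	out, err := exec.Command("qemu-img", "info", "--output=json",
		"Fedora-Cloud-Base-32-1.6.x86_64.qcow2").Output()
	if err != nil {
		log.Fatalf("qemu-img info failed: %v", err)
	}
	var info imgInfo
	if err := json.Unmarshal(out, &info); err != nil {
		log.Fatalf("cannot parse qemu-img output: %v", err)
	}
	fmt.Printf("format=%s physical=%d bytes virtual=%d bytes\n",
		info.Format, info.ActualSize, info.VirtualSize)
}

An HTTP import only has to move the physical qcow2 payload over the network, while a filesystem clone streams whatever is on the provisioned PVC, which is one plausible source of the difference.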
After doing some profiling, I discovered that the clone process was spending a lot of CPU time in gzip. Migrating to snappy made clone performance comparable to HTTP import. I suspect this fix will go a long way toward addressing the performance variation described in this bug, but of course I cannot be sure, as there are a lot of different variables at play. In my test, there was an approximately 6x improvement in cloning the base Fedora cloud image.
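For context on the shape of the fix, here is a minimal standalone sketch (not the actual CDI cloner code) of the pattern involved: the source data is compressed before it goes over the network, and replacing the gzip writer with a snappy framed writer trades a little compression ratio for far less CPU per byte. It assumes the github.com/golang/snappy package; the file name is a placeholder.

// snappystream.go - illustrates swapping a gzip-compressed stream for a
// snappy-compressed one when piping a disk image over the network.
package main

import (
	"compress/gzip"
	"io"
	"os"

	"github.com/golang/snappy"
)

// compressGzip copies src into dst through a gzip stream (the old approach):
// best ratio, but CPU-heavy enough to bottleneck a CPU-limited pod.
func compressGzip(dst io.Writer, src io.Reader) error {
	zw := gzip.NewWriter(dst)
	if _, err := io.Copy(zw, src); err != nil {
		return err
	}
	return zw.Close()
}

// compressSnappy copies src into dst through a snappy framed stream
// (the new approach): much cheaper on CPU, slightly larger output.
func compressSnappy(dst io.Writer, src io.Reader) error {
	sw := snappy.NewBufferedWriter(dst)
	if _, err := io.Copy(sw, src); err != nil {
		return err
	}
	return sw.Close()
}

func main() {
	// Placeholder usage: stream disk.img to stdout with snappy framing.
	f, err := os.Open("disk.img")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := compressSnappy(os.Stdout, f); err != nil {
		panic(err)
	}
}

With this pattern, the receiving side would wrap the connection in the matching decompressor (snappy.NewReader instead of gzip.NewReader).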
Compared timings for the hostpath-provisioner storage class with the Fedora 32 image both pre- and post-change (HCO v2.5.0-270 vs 329), and there definitely seems to be an improvement:

Pre-change:
Import: 1:35 minutes
Clone: 5:33 minutes

Post-change:
Import: 1:03 minutes
Clone: 38 seconds

Keep in mind this is not bare metal (PSI).

Verified on CNV 2.5.0, KubeVirt v0.34.0-13-g2cc7d61, CDI (Containerized Data Importer) v1.23.6
HCO image: registry.redhat.io/container-native-virtualization/hyperconverged-cluster-operator@sha256:631d9868743d52fba5fb5a089acc418328e0c33c55ae2319e5e06c24a4323124
CSV creation time: 2020-10-11 19:19:07

@Peter I see Adam already has a clone for this purpose; I will try to get my hands on a 2.4.x environment to sample pre-change behavior.
Release notes PR is now merged: https://github.com/openshift/openshift-docs/pull/27019
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Virtualization 2.5.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:5127