Bug 1885964

Summary: Image Cloning Slow Compared to URL
Product: Container Native Virtualization (CNV)
Reporter: Benjamin Schmaus <bschmaus>
Component: Storage
Assignee: Michael Henriksen <mhenriks>
Status: CLOSED ERRATA
QA Contact: Alex Kalenyuk <akalenyu>
Severity: high
Docs Contact:
Priority: unspecified
Version: 2.5.0
CC: akalenyu, alitke, cnv-qe-bugs, mhenriks, pelauter, sjhala
Target Milestone: ---
Flags: pelauter: needinfo-
Target Release: 2.5.0
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: virt-cdi-operator-container-v2.5.0-17 hco-bundle-registry-container-v2.5.0-329
Doc Type: Release Note
Doc Text:
Host-assisted cloning performance has been improved by using a more efficient compression algorithm
Story Points: ---
Clone Of:
Cloned As: 1887844 (view as bug list)
Environment:
Last Closed: 2020-11-17 13:24:56 UTC
Type: Bug
Embargoed:
Bug Depends On:    
Bug Blocks: 1887844    

Description Benjamin Schmaus 2020-10-07 11:04:23 UTC
Description of problem:

Pulling in an image from a URL is relatively quick, but cloning the same image can take twice as long.

 URL

I0921 02:55:45.864641       1 importer.go:51] Starting importer

I0921 02:55:45.864703       1 importer.go:112] begin import process

I0921 02:55:47.270804       1 data-processor.go:277] Calculating available size

I0921 02:55:47.270846       1 data-processor.go:289] Checking out file system volume size.

I0921 02:55:47.270857       1 data-processor.go:297] Request image size not empty.

I0921 02:55:47.270867       1 data-processor.go:302] Target size 63999987712.

I0921 02:55:47.270937       1 data-processor.go:206] New phase: Convert

I0921 02:55:47.270947       1 data-processor.go:212] Validating image

I0921 02:55:47.305466       1 qemu.go:212] 0.00

I0921 02:55:50.229006       1 qemu.go:212] 1.00

I0921 02:55:52.395782       1 qemu.go:212] 2.01

I0921 02:55:55.187488       1 qemu.go:212] 3.01

I0921 02:55:57.186483       1 qemu.go:212] 4.01

I0921 02:55:59.466189       1 qemu.go:212] 5.01

I0921 02:56:01.727444       1 qemu.go:212] 6.01

...

I0921 02:59:46.867102       1 qemu.go:212] 98.52

I0921 02:59:49.740095       1 qemu.go:212] 99.52

I0921 02:59:53.023658       1 data-processor.go:206] New phase: Resize

W0921 02:59:53.032533       1 data-processor.go:264] Available space less than requested size, resizing image to available space 63999987712.

I0921 02:59:53.032546       1 data-processor.go:270] Expanding image size to: 63999987712

I0921 02:59:53.038400       1 data-processor.go:206] New phase: Complete

I0921 02:59:53.038473       1 importer.go:175] Import complete

~ 4 min

 Clone

VOLUME_MODE=filesystem

MOUNT_POINT=/var/run/cdi/clone/source

/var/run/cdi/clone/source /

UPLOAD_BYTES=63999991808

./

I0921 02:44:33.570091      11 clone-source.go:107] content_type is "filesystem-clone"

I0921 02:44:33.570143      11 clone-source.go:108] upload_bytes is 63999991808

I0921 02:44:33.570151      11 clone-source.go:118] Starting cloner target

./disk.img

I0921 02:44:33.897187      11 clone-source.go:130] Set header to filesystem-clone

I0921 02:44:34.570339      11 prometheus.go:69] 0.03

I0921 02:44:35.570494      11 prometheus.go:69] 0.06

I0921 02:44:36.570672      11 prometheus.go:69] 0.09

I0921 02:44:37.570815      11 prometheus.go:69] 0.11

I0921 02:44:38.570915      11 prometheus.go:69] 0.14

I0921 02:44:39.571040      11 prometheus.go:69] 0.16

I0921 02:44:40.571179      11 prometheus.go:69] 0.18

I0921 02:44:41.571291      11 prometheus.go:69] 0.20

...

I0921 02:53:53.638620      11 prometheus.go:69] 15.12

I0921 02:53:54.130793      11 clone-source.go:97] Wrote 9686231040 bytes

I0921 02:53:54.138618      11 clone-source.go:148] Response body:

I0921 02:53:54.138630      11 clone-source.go:150] clone complete

~ 10 min



 Clone Small

VOLUME_MODE=filesystem

MOUNT_POINT=/var/run/cdi/clone/source

/var/run/cdi/clone/source /

UPLOAD_BYTES=13857230848

./

I0921 03:39:17.785854      11 clone-source.go:107] content_type is "filesystem-clone"

I0921 03:39:17.785903      11 clone-source.go:108] upload_bytes is 13857230848

I0921 03:39:17.785909      11 clone-source.go:118] Starting cloner target

./disk.img

I0921 03:39:18.136714      11 clone-source.go:130] Set header to filesystem-clone

I0921 03:39:18.786125      11 prometheus.go:69] 0.11

I0921 03:39:19.786244      11 prometheus.go:69] 0.26

I0921 03:39:20.786381      11 prometheus.go:69] 0.43

I0921 03:39:21.786481      11 prometheus.go:69] 0.52

...

I0921 03:49:25.858458      11 prometheus.go:69] 69.69

I0921 03:49:26.858543      11 prometheus.go:69] 69.87

I0921 03:49:27.126005      11 clone-source.go:97] Wrote 9686231040 bytes

I0921 03:49:27.133934      11 clone-source.go:148] Response body:

I0921 03:49:27.133944      11 clone-source.go:150] clone complete

~ 10 min


Version-Release number of selected component (if applicable):
2.5

How reproducible:
100%

Steps to Reproduce:
1. Install CNV
2. Pull in an image from URL (time the process)
3. Clone same image (time the process)
4. Compare results

Actual results:
Cloning is much slower than importing for a same-sized image.

Expected results:
Since the image is the same size, the two methods should be comparable in speed.

Additional info:
Backend storage in this case is Portworx; however, I suspect we would see the same behavior regardless of which storage backend is used.

Comment 1 Adam Litke 2020-10-07 12:43:34 UTC
I wonder if this is due to CPU resource limits on the clone source or target pods, since we are filtering the data through gzip to reduce network traffic.  Michael, could you please take a look?  Cloning is even more important for 2.5 and I'd like to make sure we're performing acceptably.
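For context, the clone source streams the raw disk image through a compressor before it goes over the network. A minimal sketch of that kind of pipeline, assuming gzip at its default level (illustrative only, not the actual clone-source code; the disk path is taken from the logs above and stdout stands in for the upload connection):

package main

import (
	"compress/gzip"
	"io"
	"os"
)

func main() {
	// Source disk in a filesystem-mode PVC (path as seen in the clone logs above).
	src, err := os.Open("/var/run/cdi/clone/source/disk.img")
	if err != nil {
		panic(err)
	}
	defer src.Close()

	// In the real cloner the writer is the upload connection; stdout stands in here.
	zw := gzip.NewWriter(os.Stdout)
	defer zw.Close()

	// Every byte is deflated before it hits the wire, so under a pod CPU limit
	// the compressor, not the network, can become the bottleneck.
	if _, err := io.Copy(zw, src); err != nil {
		panic(err)
	}
}

With a CPU-limited container, a throttled compressor caps the effective transfer rate regardless of the available bandwidth.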

Comment 2 Michael Henriksen 2020-10-07 16:19:56 UTC
Before investigating CPU limits it would be helpful to know a little more about the import use case.  

What is the format of the file (qcow2 or raw)?  What is the physical size?  Looks like the virtual size is 63G.

Comment 3 Benjamin Schmaus 2020-10-07 20:03:38 UTC
Here is a snippet of the DataVolume they were testing when instantiating a VM:

  #dataVolumeTemplates:
  #  - apiVersion: cdi.kubevirt.io/v1alpha1
  #    kind: DataVolume
  #    metadata:
  #      name: fedora-32-disk-0
  #    spec:
  #      pvc:
  #        accessModes:
  #          - ReadWriteMany
  #        resources:
  #          requests:
  #            storage: 10Gi
  #        storageClassName: px-repl2-file
  #        #volumeMode: Filesystem
  #      source:
  #        http:
  #          url: >-
  #            https://files.caas.<customer redacted>.com:9443/ova/Fedora-Cloud-Base-32-1.6.x86_64.qcow2
  dataVolumeTemplates:
    - apiVersion: cdi.kubevirt.io/v1alpha1
      kind: DataVolume
      metadata:
        name: fedora-32-disk-0
      spec:
        pvc:
          accessModes:
            - ReadWriteMany
          resources:
            requests:
              storage: 10Gi
          storageClassName: px-repl2-file
          #volumeMode: Filesystem
        source:
          pvc:
            name: fedora-cloud-base-32-v1
            namespace: vm-images

They were switching between importing the Fedora image from the URL and cloning from a PVC, as in the example above.

My thought was that maybe the cloning process copies the entire 10 GB PVC, whereas the Fedora import only transfers the initial qcow2, which I believe is >1 GB.
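One way to check that hypothesis is to compare the apparent size of the raw disk.img in the source PVC with the space actually allocated to it; a sparse file can report the full PVC size while holding far less data. A minimal sketch (Linux only; the path is illustrative, matching the mount point in the clone logs):

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	path := "/var/run/cdi/clone/source/disk.img"

	fi, err := os.Stat(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		fmt.Fprintln(os.Stderr, "stat details unavailable")
		os.Exit(1)
	}

	fmt.Printf("apparent size: %d bytes\n", fi.Size())
	fmt.Printf("allocated:     %d bytes\n", st.Blocks*512) // st_blocks is counted in 512-byte units
}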

Comment 4 Michael Henriksen 2020-10-08 14:26:59 UTC
After doing some profiling, I discovered that the clone process was spending a lot of CPU in gzip.  Migrating to snappy made clone performance comparable to HTTP import.

I suspect this fix will go a long way toward addressing the performance variation described in this bug, though of course I cannot be sure, as there are many variables at play.  In my test, there was an approximately 6x improvement in cloning the base Fedora cloud image.
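For reference, the kind of change described here amounts to swapping the stream compressor. A minimal sketch using github.com/golang/snappy (illustrative only, not the actual CDI patch):

package main

import (
	"io"
	"os"

	"github.com/golang/snappy"
)

func main() {
	// Stream stdin to snappy-framed stdout; in CDI the reader would be the
	// source disk image and the writer the clone upload connection.
	sw := snappy.NewBufferedWriter(os.Stdout)
	defer sw.Close()

	// Snappy trades compression ratio for much lower CPU cost, so a
	// CPU-limited clone pod can keep the network busy instead of stalling
	// on deflate.
	if _, err := io.Copy(sw, os.Stdin); err != nil {
		panic(err)
	}
}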

Comment 8 Alex Kalenyuk 2020-10-13 13:59:18 UTC
Compared timings for the hostpath-provisioner storage class with a Fedora 32 image both pre- and post-change (HCO v2.5.0-270 vs. v2.5.0-329), and it definitely seems like there is an improvement:

Pre-change:
Import: 1:35 minutes
Clone: 5:33 minutes

Post-change:
Import: 1:03 minutes
Clone: 38 seconds

Keep in mind this is not bare metal (PSI).

Verified on CNV 2.5.0, KubeVirt: v0.34.0-13-g2cc7d61, CDI: Containerized Data Importer v1.23.6
HCO image: registry.redhat.io/container-native-virtualization/hyperconverged-cluster-operator@sha256:631d9868743d52fba5fb5a089acc418328e0c33c55ae2319e5e06c24a4323124
CSV creation time: 2020-10-11 19:19:07

@Peter
I see Adam already has a clone for this purpose; I will try to get my hands on a 2.4.x environment to sample the pre-change behavior.

Comment 12 Shikha Jhala 2020-11-05 18:58:46 UTC
Release notes PR is now merged: https://github.com/openshift/openshift-docs/pull/27019

Comment 15 errata-xmlrpc 2020-11-17 13:24:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Virtualization 2.5.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:5127