Bug 1827793 - CDI: invalid large qcow2 is imported successfully instead of being rejected
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Maya Rashish
QA Contact: dalia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-24 19:46 UTC by Natalie Gavrielov
Modified: 2021-07-27 14:21 UTC (History)
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 14:20:49 UTC
Target Upstream Version:
Embargoed:
alitke: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github kubevirt containerized-data-importer pull 1466 0 None closed Validate image fits in filesystem in a lot more cases. take filesystem overhead into account when resizing. 2021-02-14 18:19:43 UTC
Github kubevirt containerized-data-importer pull 1612 0 None closed [release-v1.28] Validate image fits in filesystem in a lot more cases. take filesystem overhead into account when resizi... 2021-03-29 12:21:19 UTC
Github kubevirt containerized-data-importer pull 1759 0 None closed Remove invalid-qcow-large-size.img tests. 2021-04-26 20:00:50 UTC
Red Hat Product Errata RHSA-2021:2920 0 None None None 2021-07-27 14:21:52 UTC

Description Natalie Gavrielov 2020-04-24 19:46:11 UTC
Description of problem:
Previously, importing an invalid large-size qcow would fail and the data volume status would be "Failed".
After recent code changes we expect the data volume to remain in status "Import in progress", but instead the import of an invalid large-size qcow now finishes successfully and the data volume status is "Succeeded".


Version-Release number of selected component:
CNV 2.3

How reproducible:
100%

Steps to Reproduce:
Create a Data volume that imports an invalid qcow:
$ qemu-img info invalid-qcow-large-size.img
image: invalid-qcow-large-size.img
file format: qcow
virtual size: 152 TiB (167125767422464 bytes)
disk size: 4 KiB
cluster_size: 4096
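To illustrate why a 4 KiB file can report a 152 TiB virtual size: in the qcow (version 1) format the virtual size is just a header field, so a tiny file can claim any size. The sketch below packs a minimal header with the values from the output above (virtual size 167125767422464 bytes, cluster_size 4096). This is an illustration based on QEMU's qcow v1 header layout, not how the original afl-generated file was produced; the function name is hypothetical.

```python
import struct

# A qcow v1 header is 48 big-endian bytes; the "size" field alone decides
# the virtual size, so a header-only file can claim 152 TiB while using
# almost no disk space.
QCOW_MAGIC = 0x514649FB  # 'Q' 'F' 'I' 0xfb

def build_qcow_v1_header(virtual_size: int, cluster_bits: int = 12) -> bytes:
    """Pack a minimal qcow (version 1) header claiming `virtual_size` bytes."""
    return struct.pack(
        ">IIQIIQBBHIQ",
        QCOW_MAGIC,    # magic
        1,             # version 1 ("qcow", not qcow2)
        0,             # backing_file_offset (no backing file)
        0,             # backing_file_size
        0,             # mtime
        virtual_size,  # claimed virtual size in bytes
        cluster_bits,  # 12 -> cluster_size 4096, as in the report
        9,             # l2_bits
        0,             # padding
        0,             # crypt_method (none)
        48,            # l1_table_offset (right after the header)
    )

# 167125767422464 bytes is the ~152 TiB value qemu-img reports above.
header = build_qcow_v1_header(167125767422464)
```

Writing just this header to a file yields something qemu-img describes much like the image above: a tiny on-disk footprint with an enormous claimed virtual size.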

This is dv.yaml:
$ cat invalid-qcow2.yaml
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: invalid-qcow2
spec:
  source:
      http:
         url: "<URL>/invalid-qcow-large-size.img"
  pvc:
    storageClassName: "hostpath-provisioner"
    volumeMode: Filesystem
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: "1Gi"


Actual results:
Import finishes successfully.

Expected results:
- The import should fail.
- A message in the log should explain what is wrong.
- The data volume should be in status "Import in progress" (since the "Failed" status is no longer available).

Additional info:

$ qemu-img info invalid-qcow-large-size.img
image: invalid-qcow-large-size.img
file format: qcow
virtual size: 152 TiB (167125767422464 bytes)
disk size: 4 KiB
cluster_size: 4096

$ oc logs -f importer-invalid-qcow2
I0423 18:22:35.872083       1 importer.go:51] Starting importer
I0423 18:22:35.872296       1 importer.go:107] begin import process
I0423 18:22:35.891275       1 data-processor.go:275] Calculating available size
I0423 18:22:35.893735       1 data-processor.go:283] Checking out file system volume size.
I0423 18:22:35.893779       1 data-processor.go:287] Request image size not empty.
I0423 18:22:35.893808       1 data-processor.go:292] Target size 1Gi.
I0423 18:22:35.893954       1 data-processor.go:205] New phase: Convert
I0423 18:22:35.893966       1 data-processor.go:211] Validating image
I0423 18:22:35.972342       1 qemu.go:212] 0.00
I0423 18:22:36.006629       1 data-processor.go:205] New phase: Resize
I0423 18:22:36.024594       1 data-processor.go:268] Expanding image size to: 1Gi
I0423 18:22:36.060015       1 data-processor.go:205] New phase: Complete
I0423 18:22:36.061071       1 importer.go:160] Import complete
[cloud-user@ocp-psi-executor cnv-tests]$ oc get pods
No resources found in default namespace.
[cloud-user@ocp-psi-executor cnv-tests]$ oc get dv
NAME            PHASE       PROGRESS   AGE
invalid-qcow2   Succeeded   100.0%     24s

$ oc get dv invalid-qcow2 -oyaml
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  creationTimestamp: "2020-04-23T18:22:31Z"
  generation: 5
  name: invalid-qcow2
  namespace: default
  resourceVersion: "13225787"
  selfLink: /apis/cdi.kubevirt.io/v1alpha1/namespaces/default/datavolumes/invalid-qcow2
  uid: 8c9f1818-daa7-43b2-87f0-ff7bb37fd423
spec:
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 1Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  source:
    http:
      url: <URL>s/invalid_qcow_images/invalid-qcow-large-size.img
status:
  phase: Succeeded
  progress: 100.0%

Comment 1 Nelly Credi 2020-04-26 07:21:56 UTC
Since this is a regression, I'm targeting 2.3,
but this is a negative flow and I don't think the impact is big,
unless it eventually blocks some other flow.

@Natalie, do you see a risk in pushing this out of 2.3? Can the DV be deleted manually/when CDI is removed or upgraded? (I assume the answer is yes)
What would have happened if you tried to create a VM using this DV?

Comment 2 Natalie Gavrielov 2020-05-06 12:25:03 UTC
Adam, can you please take a look and maybe assign it to someone?

Comment 3 Maya Rashish 2020-05-20 13:28:11 UTC
So, there are several reasons that could have caused this particular qcow2 to fail:

1. The method of generation (afl) means it might be rejected as invalid by newer qemu-img; awels@ mentioned this happens with newer versions.
2. The backing store is ext4, which cannot support single files beyond a certain number of terabytes, so a 152TB sparse image file cannot be created.
3. For one version, CNV didn't allow sparse images.

I changed #3, causing this regression. However, I think we want to keep using sparse files.
If we are relying on #2 alone to cause this failure, then it's too dependent on the user's setup.

#1, IMO, should probably not be relied upon for testing purposes, because it's possible to generate a valid file with the same characteristics.
It's also not testing CDI, but rather a component it uses (qemu-img).


The original intention when creating the large qcow was to use a lot of CPU resources.
It might be worthwhile to create a test that uses a non-sparse huge image to cause the same failure.

Comment 4 Adam Litke 2020-05-27 18:13:48 UTC
@Maya. I think we should understand which specific case it is in order to decide whether we should keep the test. This test used to pass and now it doesn't, and I would like to know why.

Comment 5 Maya Rashish 2020-05-28 00:58:25 UTC
The test is marked as being linked to this bugzilla bug report now:

@pytest.mark.parametrize(
    ("dv_name", "file_name"),
    [
        pytest.param(
            "large-size",
            "invalid-qcow-large-size.img",
            marks=(
                pytest.mark.polarion("CNV-2553"),
                pytest.mark.bugzilla(
                    1827793, skip_when=lambda bug: bug.status not in BUG_STATUS_CLOSED,
                ),
            ),
        ),
    ],
)


The change that made it fail is enabling sparse images again.

Comment 6 Ying Cui 2020-06-03 08:10:56 UTC
(In reply to Nelly Credi from comment #1)
> Since this is a regression, Im targeting to 2.3,
> but this is a negative flow and I don't think the impact is big, 
> unless this will eventually block some other flow. 
> 
> @Natalie, do you see a risk in pushing this out of 2.3? Can the DV be
> deleted manually/when CDI is removed or upgraded? (I assume the answer is
> yes)
> what would have happened if you tried to create a VM using this DV?

Clearing the needinfo; this was already moved to 2.4.

Comment 7 Maya Rashish 2020-06-03 13:44:43 UTC
It's possible I was wrong; I'll have to spend some time thinking about this issue and looking at the surrounding code.
Adam said we want to fail when the virtual size is larger than the space we have, even if we could theoretically handle it until the VM uses all the space.
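The eventual fix (PR 1466) validates that the image fits in the filesystem and takes filesystem overhead into account when resizing. A minimal sketch of that kind of check is below; the function name and the 5.5% overhead default are assumptions for illustration, not CDI's actual code.

```python
# Hypothetical sketch of the validation PR 1466 describes: reject an image
# whose claimed virtual size cannot fit in the target filesystem once a
# reserved overhead fraction is accounted for. Names and the overhead
# default are illustrative assumptions, not CDI's implementation.

def image_fits(virtual_size: int, fs_capacity: int, overhead: float = 0.055) -> bool:
    usable = int(fs_capacity * (1 - overhead))
    return virtual_size <= usable

GiB = 1 << 30
TiB = 1 << 40

# A claimed 152 TiB virtual size can never fit on a 1 GiB PVC, so the
# import should fail up front rather than succeed.
assert not image_fits(152 * TiB, 1 * GiB)
assert image_fits(900 * (1 << 20), 1 * GiB)  # 900 MiB fits with room for overhead
```

With a check like this, the 152 TiB claim in the bug's image is rejected regardless of whether the backing filesystem happens to be ext4.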

Comment 8 Adam Litke 2020-06-23 19:59:40 UTC
Is there a PR for this bug yet?

Comment 9 Adam Litke 2020-06-23 20:00:02 UTC
Missed 2.4.  Pushing out.

Comment 10 Maya Rashish 2020-06-24 06:53:53 UTC
There's no PR for it yet.
I am not able to reproduce it on the version of CDI that became 2.4, nor on the latest master.
I am wondering whether the problem was limited to 2.3.

Comment 11 Maya Rashish 2020-07-01 12:30:02 UTC
Might depend on storage class, I'll retry.

Comment 12 Adam Litke 2020-10-12 18:55:07 UTC
Not a blocker and we've run out of time so pushing.

Comment 13 Adam Litke 2021-02-18 21:15:15 UTC
Moving back to POST since the bug won't be fixed until the cherrypick PR #1612 is merged.

Comment 14 Maya Rashish 2021-03-31 10:51:40 UTC
This is failing QA.

My likely response will be to try to convince myself and others that this scenario shouldn't fail, because that is better for users.
I need to look into it a bit, which will take some time, but I would like to work on something else I believe is more important first.

This image is fairly contrived (152TB, happened to fail because of an ext4 filesystem limit for a while - see comment #3).
There is a test for this scenario which isn't as contrived upstream:
[test_id:2329] Should fail to import images that require too much space fail given a large virtual size QCOW2 file
[test_id:2329] Should fail to import images that require too much space fail given a large physical size QCOW2 file

That uses a 2GB image on a 500Mi DV instead of a malformed 152TB image that fails for varying reasons:
- In the past it failed downstream because ext4 cannot create 152TB files (way outside the scope of our testing)
- Newer qemu-img and nbdkit versions consider it a malformed file

Can we kick this out of 2.6.1?

Comment 15 Maya Rashish 2021-03-31 12:53:56 UTC
Importer logs, for convenience:
I0331 11:41:42.940380       1 importer.go:52] Starting importer
I0331 11:41:42.940481       1 importer.go:134] begin import process
I0331 11:41:42.950297       1 data-processor.go:357] Calculating available size
I0331 11:41:42.950349       1 data-processor.go:369] Checking out file system volume size.
I0331 11:41:42.950367       1 data-processor.go:377] Request image size not empty.
I0331 11:41:42.950379       1 data-processor.go:382] Target size 500Mi.
I0331 11:41:42.950480       1 util.go:39] deleting file: /data/lost+found
I0331 11:41:42.952174       1 data-processor.go:239] New phase: Convert
I0331 11:41:42.952200       1 data-processor.go:245] Validating image
I0331 11:41:42.984712       1 qemu.go:237] 0.00
I0331 11:41:43.014400       1 data-processor.go:239] New phase: Resize
W0331 11:41:43.023240       1 data-processor.go:343] Available space less than requested size, resizing image to available space 495452160.
I0331 11:41:43.023259       1 data-processor.go:349] Expanding image size to: 495452160
I0331 11:41:43.037423       1 data-processor.go:245] Validating image
I0331 11:41:43.044954       1 data-processor.go:239] New phase: Complete
I0331 11:41:43.045063       1 importer.go:212] Import Complete
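The warning in the Resize phase above shows the importer clamping the requested size to the space actually available (495452160 bytes instead of the 500Mi request). A minimal sketch of that clamping, using the numbers from the log; the function name is illustrative, not CDI's actual code.

```python
# Sketch of the clamping behaviour visible in the Resize phase of the log:
# when available space is less than the requested size, the image is
# expanded only to the available space.

def resize_target(requested: int, available: int) -> int:
    return min(requested, available)

requested = 500 * (1 << 20)   # the 500Mi DV request = 524288000 bytes
available = 495452160         # available bytes reported in the log
assert resize_target(requested, available) == 495452160
```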

Comment 16 Adam Litke 2021-04-14 12:41:14 UTC
Maya, what are the next steps?

Comment 17 Maya Rashish 2021-04-18 14:03:13 UTC
An interesting tidbit about this image is that qemu-img from the CNV downstream packages identifies the image differently.

With fedora (similar to upstream):
$ qemu-img --version; qemu-img info invalid-qcow-large-size.img 
qemu-img version 5.1.0 (qemu-5.1.0-9.fc33)
Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
image: invalid-qcow-large-size.img
file format: qcow
virtual size: 152 TiB (167125767422464 bytes)
disk size: 4 KiB
cluster_size: 4096

With downstream's qemu-img:
$ ./usr/bin/qemu-img --version; ./usr/bin/qemu-img info invalid-qcow-large-size.img 
qemu-img version 5.1.0 (qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1)
Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
image: invalid-qcow-large-size.img
file format: raw
virtual size: 1 KiB (1024 bytes)
disk size: 4 KiB

It considers the file so invalid that it isn't treated as qcow at all, but as a raw file.
I am inclined to remove this check.
We do have tests for importing a too-large image that don't use a malformed image.
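The differing outputs above come down to format probing: qemu-img recognizes a qcow file by its magic bytes and version, and anything it cannot probe falls back to being treated as raw. The sketch below is a greatly simplified illustration of that fallback, not qemu's actual probing logic, which is far more involved.

```python
import struct

# Simplified sketch: qcow files start with the magic bytes 'QFI\xfb'
# followed by a big-endian version number. A header the prober rejects
# makes the same bytes show up as "file format: raw", as in the
# downstream qemu-img output above.

def probe_format(header: bytes) -> str:
    if len(header) >= 8:
        magic, version = struct.unpack(">II", header[:8])
        if magic == 0x514649FB and version in (1, 2, 3):
            return "qcow" if version == 1 else "qcow2"
    return "raw"

assert probe_format(struct.pack(">II", 0x514649FB, 1) + b"\0" * 40) == "qcow"
assert probe_format(b"\0" * 48) == "raw"  # unrecognized header -> raw
```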

Comment 18 Adam Litke 2021-04-26 20:02:13 UTC
Hi Dalia.  We decided that the failing test case was invalid and have removed it.  Moving to ON_QA for your approval.  If you approve you can mark it VERIFIED and we'll close.

Comment 21 errata-xmlrpc 2021-07-27 14:20:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920

