Description of problem:

Previously, importing an invalid qcow with a huge virtual size would fail and the data volume status would be "Failed". After recent code changes we expect the data volume to remain in status "import in progress", but instead the import of the invalid large-size qcow now finishes successfully and the data volume status is "Succeeded".

Version-Release number of selected component:

CNV 2.3

How reproducible:

100%

Steps to Reproduce:

Create a data volume that imports an invalid qcow (the oc commands used to drive this are sketched at the end of this comment):

$ qemu-img info invalid-qcow-large-size.img
image: invalid-qcow-large-size.img
file format: qcow
virtual size: 152 TiB (167125767422464 bytes)
disk size: 4 KiB
cluster_size: 4096

This is dv.yaml:

$ cat invalid-qcow2.yaml
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: invalid-qcow2
spec:
  source:
    http:
      url: "<URL>/invalid-qcow-large-size.img"
  pvc:
    storageClassName: "hostpath-provisioner"
    volumeMode: Filesystem
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: "1Gi"

Actual results:

Import finishes successfully.

Expected results:

- The import should fail.
- A message in the log should explain what is wrong.
- The data volume should be in status "import in progress" (since the failed status is no longer available).

Additional info:

$ qemu-img info invalid-qcow-large-size.img
image: invalid-qcow-large-size.img
file format: qcow
virtual size: 152 TiB (167125767422464 bytes)
disk size: 4 KiB
cluster_size: 4096

$ oc logs -f importer-invalid-qcow2
I0423 18:22:35.872083 1 importer.go:51] Starting importer
I0423 18:22:35.872296 1 importer.go:107] begin import process
I0423 18:22:35.891275 1 data-processor.go:275] Calculating available size
I0423 18:22:35.893735 1 data-processor.go:283] Checking out file system volume size.
I0423 18:22:35.893779 1 data-processor.go:287] Request image size not empty.
I0423 18:22:35.893808 1 data-processor.go:292] Target size 1Gi.
I0423 18:22:35.893954 1 data-processor.go:205] New phase: Convert
I0423 18:22:35.893966 1 data-processor.go:211] Validating image
I0423 18:22:35.972342 1 qemu.go:212] 0.00
I0423 18:22:36.006629 1 data-processor.go:205] New phase: Resize
I0423 18:22:36.024594 1 data-processor.go:268] Expanding image size to: 1Gi
I0423 18:22:36.060015 1 data-processor.go:205] New phase: Complete
I0423 18:22:36.061071 1 importer.go:160] Import complete

[cloud-user@ocp-psi-executor cnv-tests]$ oc get pods
No resources found in default namespace.

[cloud-user@ocp-psi-executor cnv-tests]$ oc get dv
NAME            PHASE       PROGRESS   AGE
invalid-qcow2   Succeeded   100.0%     24s

$ oc get dv invalid-qcow2 -oyaml
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  creationTimestamp: "2020-04-23T18:22:31Z"
  generation: 5
  name: invalid-qcow2
  namespace: default
  resourceVersion: "13225787"
  selfLink: /apis/cdi.kubevirt.io/v1alpha1/namespaces/default/datavolumes/invalid-qcow2
  uid: 8c9f1818-daa7-43b2-87f0-ff7bb37fd423
spec:
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 1Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  source:
    http:
      url: <URL>s/invalid_qcow_images/invalid-qcow-large-size.img
status:
  phase: Succeeded
  progress: 100.0%
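For completeness, the commands used to drive this reproduction look roughly as follows (the image must be hosted on an HTTP server reachable from the cluster; <URL> stands for that server's address):

$ oc create -f invalid-qcow2.yaml        # create the DataVolume defined above
$ oc get dv invalid-qcow2 -w             # watch the phase; it ends in Succeeded instead of failing or staying in progress
$ oc logs -f importer-invalid-qcow2      # follow the importer pod while the import runs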
Since this is a regression, I'm targeting 2.3, but this is a negative flow and I don't think the impact is big, unless it will eventually block some other flow.

@Natalie, do you see a risk in pushing this out of 2.3? Can the DV be deleted manually, or when CDI is removed or upgraded? (I assume the answer is yes.) What would have happened if you had tried to create a VM using this DV?
Adam, can you please take a look and maybe assign it to someone?
So, there are several reasons that could have caused this particular qcow2 to fail:

1. The method of generation (afl) means it might be rejected as invalid by newer qemu-img; awels@ mentioned this happens with newer qemu-img.
2. The backing store is ext4, which cannot hold single files over a certain number of terabytes, so a 152TB sparse image file cannot be created on it.
3. For one version, CNV didn't allow sparse images.

I changed #3, causing this regression. However, I think we want to keep using sparse files. If we rely on #2 alone to cause this failure, the test is too dependent on the user's setup. #1, IMO, should not be relied upon for testing purposes, because it's possible to generate a valid file with the same characteristics; it also tests qemu-img rather than CDI itself. The original intention when creating the large qcow was to use a lot of CPU resources. It might be worthwhile to create a test that uses a non-sparse huge image to cause the same failure, as sketched below.
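A minimal sketch of the sparse vs. non-sparse alternative, assuming qemu-img is available locally; the file names and sizes below are illustrative, not the ones used by any existing test:

# Sparse: huge virtual size, almost nothing allocated on disk. This is what the
# current image relies on, and why its behaviour depends on the backing filesystem.
$ qemu-img create -f qcow2 sparse-huge.qcow2 152T

# Non-sparse: preallocate the data so the file itself is big, making the import
# fail on capacity regardless of filesystem or qemu-img version. Pick a size
# larger than the target PVC but small enough to actually host.
$ qemu-img create -f qcow2 -o preallocation=falloc nonsparse-huge.qcow2 10G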
@Maya, I think we should understand which specific case this is in order to decide whether we should keep the test. This test used to pass and now it doesn't, and I would like to know why.
The test is marked as being linked to this Bugzilla bug report now:

@pytest.mark.parametrize(
    ("dv_name", "file_name"),
    [
        pytest.param(
            "large-size",
            "invalid-qcow-large-size.img",
            marks=(
                pytest.mark.polarion("CNV-2553"),
                pytest.mark.bugzilla(
                    1827793,
                    skip_when=lambda bug: bug.status not in BUG_STATUS_CLOSED,
                ),
            ),
        ),

The change that made it fail is enabling sparse images again.
(In reply to Nelly Credi from comment #1)
> Since this is a regression, I'm targeting 2.3, but this is a negative flow
> and I don't think the impact is big, unless it will eventually block some
> other flow.
>
> @Natalie, do you see a risk in pushing this out of 2.3? Can the DV be
> deleted manually, or when CDI is removed or upgraded? (I assume the answer
> is yes.) What would have happened if you had tried to create a VM using
> this DV?

Clearing the needinfo; this has already moved to 2.4.
It's possible I was wrong; I'll have to spend some time thinking about this issue and looking at the surrounding code. Adam said we want to fail when the virtual size is larger than the space we have, even if we can theoretically handle it until the VM uses all the space.
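For illustration only (this is not CDI code; the paths and tools here are assumptions), that check boils down to comparing the declared virtual size with the space actually available on the target volume:

# Hypothetical standalone version of the check; CDI implements this in Go inside the importer.
$ VIRTUAL_SIZE=$(qemu-img info --output=json invalid-qcow-large-size.img | jq '."virtual-size"')
$ AVAILABLE=$(df --output=avail -B1 /data | tail -n1)
$ [ "$VIRTUAL_SIZE" -gt "$AVAILABLE" ] && echo "virtual size exceeds available space, refuse the import"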
Is there a PR for this bug yet?
Missed 2.4. Pushing out.
There's no PR for it yet. I am not able to reproduce it on the version of CDI that became 2.4, nor on the latest master. I am wondering whether the problem was limited to 2.3.
Might depend on storage class, I'll retry.
Not a blocker and we've run out of time so pushing.
Moving back to POST since the bug won't be fixed until the cherry-pick PR #1612 is merged.
This is failing QA. My likely response is going to be trying to convince myself and others that this scenario shouldn't fail, because that is better for users. I need to look into it a bit, which will take some time, but I would like to work on something else I believe is more important first.

This image is fairly contrived (152TB, and it happened to fail because of an ext4 filesystem limit for a while - see comment #3). There is an upstream test for this scenario that isn't as contrived:

[test_id:2329] Should fail to import images that require too much space fail given a large virtual size QCOW2 file
[test_id:2329] Should fail to import images that require too much space fail given a large physical size QCOW2 file

That uses a 2GB image on a 500Mi DV (sketched at the end of this comment) instead of a malformed 152TB image that fails for various reasons:
- In the past it failed downstream because of ext4, which cannot create 152TB files (well outside our scope of testing).
- Newer qemu-img and nbdkit versions consider it a malformed file.

Can we kick this out of 2.6.1?
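A rough sketch of that less contrived scenario (this is not the upstream test code; the DV name, URL, and apiVersion are placeholders following the example in the bug description):

# Create a QCOW2 whose virtual size exceeds the 500Mi target PVC.
$ qemu-img create -f qcow2 large-virtual.qcow2 2G

# Import it into a deliberately undersized DataVolume.
$ cat <<EOF | oc create -f -
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: too-large-virtual
spec:
  source:
    http:
      url: "<URL>/large-virtual.qcow2"
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 500Mi
EOF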
Importer logs for convenience:

I0331 11:41:42.940380 1 importer.go:52] Starting importer
I0331 11:41:42.940481 1 importer.go:134] begin import process
I0331 11:41:42.950297 1 data-processor.go:357] Calculating available size
I0331 11:41:42.950349 1 data-processor.go:369] Checking out file system volume size.
I0331 11:41:42.950367 1 data-processor.go:377] Request image size not empty.
I0331 11:41:42.950379 1 data-processor.go:382] Target size 500Mi.
I0331 11:41:42.950480 1 util.go:39] deleting file: /data/lost+found
I0331 11:41:42.952174 1 data-processor.go:239] New phase: Convert
I0331 11:41:42.952200 1 data-processor.go:245] Validating image
I0331 11:41:42.984712 1 qemu.go:237] 0.00
I0331 11:41:43.014400 1 data-processor.go:239] New phase: Resize
W0331 11:41:43.023240 1 data-processor.go:343] Available space less than requested size, resizing image to available space 495452160.
I0331 11:41:43.023259 1 data-processor.go:349] Expanding image size to: 495452160
I0331 11:41:43.037423 1 data-processor.go:245] Validating image
I0331 11:41:43.044954 1 data-processor.go:239] New phase: Complete
I0331 11:41:43.045063 1 importer.go:212] Import Complete
Maya, what are the next steps?
An interesting tidbit about this image is that qemu-img from the CNV downstream packages identifies the image differently.

With Fedora (similar to upstream):

$ qemu-img --version; qemu-img info invalid-qcow-large-size.img
qemu-img version 5.1.0 (qemu-5.1.0-9.fc33)
Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
image: invalid-qcow-large-size.img
file format: qcow
virtual size: 152 TiB (167125767422464 bytes)
disk size: 4 KiB
cluster_size: 4096

With downstream's qemu-img:

$ ./usr/bin/qemu-img --version; ./usr/bin/qemu-img info invalid-qcow-large-size.img
qemu-img version 5.1.0 (qemu-kvm-5.1.0-14.module+el8.3.0+8790+80f9c6d8.1)
Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
image: invalid-qcow-large-size.img
file format: raw
virtual size: 1 KiB (1024 bytes)
disk size: 4 KiB

It considers the image so invalid that it isn't treated as qcow at all, but as a RAW file. I am inclined to remove this check. We do have tests for importing a too-large image without using a malformed image.
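One way to make that disagreement explicit, assuming qemu-img is at hand, is to force the format instead of letting it be probed; a build that no longer accepts this header should then error out instead of silently falling back to raw (the exact behaviour and error text vary by build, so treat this as a sketch):

# Default: the format is probed, and the two builds disagree silently.
$ qemu-img info invalid-qcow-large-size.img

# Forcing the format turns the disagreement into an explicit success or failure.
$ qemu-img info -f qcow invalid-qcow-large-size.img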
Hi Dalia. We decided that the failing test case was invalid and have removed it. Moving to ON_QA for your approval. If you approve you can mark it VERIFIED and we'll close.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2920