Bug 1971182

Summary: [RFE] Use "qemu:allocation-depth" meta context to report holes
Product: [oVirt] ovirt-imageio
Component: Common
Version: 2.1.1
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Nir Soffer <nsoffer>
Assignee: Nir Soffer <nsoffer>
QA Contact: Evelina Shames <eshames>
CC: bugs, eblake, eshenitz, michal.skrivanek, sfishbai, sgordon, tnisan
Target Milestone: ovirt-4.4.7
Target Release: 2.2.0
Keywords: FutureFeature, ZStream
Flags: sbonazzo: ovirt-4.4?, michal.skrivanek: blocker-, pm-rhel: planning_ack?, pm-rhel: devel_ack+, pm-rhel: testing_ack+
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ovirt-imageio-2.2.0-1
Doc Type: Bug Fix
Doc Text:
Cause: An incompatible change in qemu 6.0.0 could confuse ovirt-imageio into reporting a hole for a zeroed area in a qcow2 image.
Consequence: Downloading a single snapshot could produce a corrupted image that cannot be restored later.
Fix: ovirt-imageio now uses a different qemu API that provides reliable information about holes in qcow2 images.
Result: Downloading a single snapshot now works with qemu 6.0.0.
Last Closed: 2021-07-28 14:16:50 UTC
Type: Bug
oVirt Team: Storage

Description Nir Soffer 2021-06-12 18:50:05 UTC
Description of problem:

Since ovirt 4.4.4 ovirt-imageio supports transferring single snapshots:
- https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/list_disk_snapshots.py
- https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/download_disk_snapshot.py
- https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/upload_disk.py

This feature depends on detecting unallocated areas in single qcow2
images, and was implemented using the standard NBD "base:allocation" meta
context in qemu-nbd.

Using "base:allocation" is not entirely correct, since NBD server may omit
information about holes, but it works since ovirt-imageio uses only qemu-nbd.
Unfortunately qemu 6.0.0 change the way zeroed clusters in qcow2 images are
reported (bug 1968693). Previously they were reported as:

    NBD_STATE_ZERO

But with qemu 6.0.0 they are reported as:

    NBD_STATE_ZERO | NBD_STATE_HOLE

This change is considered a bug fix in qemu, and it is not possible to revert
this change in upstream qemu.

We can fix this issue using the new "qemu:allocation-depth" meta context
introduced in qemu 5.2.0. This meta context exposes reliable (not optional)
information about unallocated areas in a qcow2 image.

Change the imageio NBD client to use "qemu:allocation-depth", and use it to
report holes.
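
For illustration, here is a minimal sketch (not the actual imageio client
code) of how extents from the two meta contexts could be combined. The flag
values follow the NBD protocol, and "depth == 0 means unallocated" follows
the qemu documentation for this context; the extent lists and their matching
boundaries are assumptions made up for the example:

    # Sketch only - not the real ovirt-imageio nbd client.
    # "base:allocation" returns NBD_STATE_* flags per extent.
    # "qemu:allocation-depth" returns the depth at which data is allocated:
    # 0 = unallocated, 1 = allocated in the exported image, n = allocated
    # n levels down the backing chain.

    NBD_STATE_HOLE = 0x1
    NBD_STATE_ZERO = 0x2

    def merge_extents(alloc_extents, depth_extents):
        """Merge (length, flags) and (length, depth) extent lists that are
        assumed to describe the same range with the same boundaries."""
        merged = []
        for (length, flags), (_, depth) in zip(alloc_extents, depth_extents):
            merged.append({
                "length": length,
                # Zero status still comes from "base:allocation".
                "zero": bool(flags & NBD_STATE_ZERO),
                # A hole is reported only when the cluster is not allocated
                # in the exported image at all, regardless of how
                # "base:allocation" flags it.
                "hole": depth == 0,
            })
        return merged

    # A zeroed cluster in a qcow2 snapshot: with qemu 6.0.0 "base:allocation"
    # reports NBD_STATE_ZERO | NBD_STATE_HOLE, but "qemu:allocation-depth"
    # reports depth=1, so it is not treated as a hole.
    print(merge_extents([(65536, NBD_STATE_ZERO | NBD_STATE_HOLE)],
                        [(65536, 1)]))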

Version-Release number of selected component (if applicable):
2.1.1

With this change, uploading and downloading single snapshots should work with
both qemu 5.2.0 (RHEL 8.4) and qemu 6.0.0 (CentOS Stream).

Comment 2 Nir Soffer 2021-06-13 14:04:13 UTC
To reproduce the issue we need to download a snapshot using the NBD
backend. This flow is used by backup applications that use snapshot
based backups.

1. Install a host using RHEL 8.5 AV nightly

You should have this qemu version:
$ rpm -q qemu-kvm
qemu-kvm-6.0.0-17.module+el8.5.0+11173+c9fce0bb.x86_64

2. Create a vm
3. Add a thin virtio-scsi disk to the vm

Inside the guest:

# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0    6G  0 disk 
├─sda1   8:1    0    1M  0 part 
├─sda2   8:2    0    1G  0 part /boot
├─sda3   8:3    0  615M  0 part [SWAP]
└─sda4   8:4    0  4.4G  0 part /
sdb      8:16   0   10G  0 disk 
sr0     11:0    1 1024M  0 rom  

4. Write data to the first cluster of disk /dev/sdb

    # echo "data from base" > /dev/sdb
    # sync

5. Create snapshot including the second disk

6. Zero the first cluster of disk sdb

In the guest run:

    # fallocate --punch-hole --length 64k /dev/sdb

In the guest we can no longer see the "data from base":

    # dd if=/dev/sdb bs=512 count=1 status=none | hexdump
    0000000 0000 0000 0000 0000 0000 0000 0000 0000
    *
    0000200

The first cluster contains zeroes now.

7. Stop the vm

8. List the second disk snapshots.

In this example the disk id is 2b720ee1-baad-48d8-bfdb-582946b82448

$ python3 list_disk_snapshots.py -c engine-dev 2b720ee1-baad-48d8-bfdb-582946b82448
[
  {
    "actual_size": 1073741824,
    "format": "cow",
    "id": "dcb903e7-5567-4eb8-a240-85cbea1dce12",
    "parent": null,
    "status": "ok"
  },
  {
    "actual_size": 1073741824,
    "format": "cow",
    "id": "fb14e6e4-00a7-48de-b7e0-d23996889cbd",
    "parent": "dcb903e7-5567-4eb8-a240-85cbea1dce12",
    "status": "ok"
  }
]

9. Download base image (id=dcb903e7-5567-4eb8-a240-85cbea1dce12)

In this example the disk is on the storage domain "iscsi-00".

$ python download_disk_snapshot.py -c engine-dev iscsi-00 dcb903e7-5567-4eb8-a240-85cbea1dce12 base.qcow2

10. Download top disk snapshot (id=fb14e6e4-00a7-48de-b7e0-d23996889cbd)

$ python3 download_disk_snapshot.py -c engine-dev iscsi-00 fb14e6e4-00a7-48de-b7e0-d23996889cbd --backing-file base.qcow2 top.qcow2

11. Create a checksum of the disk

$ python3 checksum_disk.py -c engine-dev 2b720ee1-baad-48d8-bfdb-582946b82448
{
    "algorithm": "blake2b",
    "block_size": 4194304,
    "checksum": "b6dcdb509ec27d672ab91ddf6289365469668fa0d6a5de5cbc594c0ea3102825"
}

12. Create a checksum of the downloaded image

$ python3 checksum_image.py top.qcow2
{
  "algorithm": "blake2b",
  "block_size": 4194304,
  "checksum": "fcb7d96381087e6a6d5b07421342d90dd32e6ad5a21b99f37372f29dcd491a7e"
}

The checksums do not match; the downloaded image does not contain the
same data as the original disk.

This flow should be tested like other image transfer flows.
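
For reference, here is a rough sketch of a block-based blake2b checksum over
the guest-visible data of a local image. This is only an illustration of the
idea; it is not guaranteed to reproduce the exact checksums printed by
checksum_disk.py / checksum_image.py, which use the SDK and imageio's own
checksum implementation:

    # Illustration only: flatten the image (and its backing chain) to raw,
    # then hash it in 4 MiB blocks with blake2b.
    import hashlib
    import subprocess
    import tempfile

    BLOCK_SIZE = 4 * 1024**2  # matches "block_size": 4194304 above

    def image_checksum(path, fmt="qcow2"):
        with tempfile.NamedTemporaryFile(suffix=".raw") as tmp:
            # Convert to raw so we hash the data exactly as a guest would
            # read it, including data coming from the backing file.
            subprocess.run(
                ["qemu-img", "convert", "-f", fmt, "-O", "raw", path, tmp.name],
                check=True)
            outer = hashlib.blake2b()
            with open(tmp.name, "rb") as f:
                while True:
                    block = f.read(BLOCK_SIZE)
                    if not block:
                        break
                    # Hash each block's digest into an outer hash.
                    outer.update(hashlib.blake2b(block).digest())
        return {"algorithm": "blake2b", "block_size": BLOCK_SIZE,
                "checksum": outer.hexdigest()}

    print(image_checksum("top.qcow2"))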


The reason is that the top disk snapshot (fb14e6e4-00a7-48de-b7e0-d23996889cbd)
was reported as empty by qemu-nbd during the download, so download_disk_snapshot.py
created an empty qcow2 image:

$ qemu-img map --output json top.qcow2 
[{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 327680},
{ "start": 65536, "length": 10737352704, "depth": 1, "zero": true, "data": false}]

The first cluster of the top image was zeroed in the guest, so we expect to see:

$ qemu-img map --output json top.qcow2 
[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
{ "start": 65536, "length": 10737352704, "depth": 1, "zero": true, "data": false}]

So "data from base" is exposed to the guest.

To see what the guest would see, we can convert the disk to raw format:

$ qemu-img convert -f qcow2 -O raw top.qcow2 top.img

Looking at the first cluster shows data from base:

$ dd if=top.img bs=512 count=1 status=none
data from base

If we had a file system on this disk, the file system would be corrupted 
in the downloaded image.
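
To automate this check, a small sketch along these lines can parse the
"qemu-img map --output json" output used above and verify that the first
cluster of the downloaded top image is a zero cluster recorded in the top
layer itself (depth 0) instead of being read from the backing file:

    # Sketch: check the first extent of top.qcow2 downloaded in step 10.
    # Run it in the directory containing top.qcow2 and base.qcow2.
    import json
    import subprocess

    out = subprocess.run(
        ["qemu-img", "map", "--output", "json", "top.qcow2"],
        check=True, capture_output=True, text=True).stdout

    first = json.loads(out)[0]

    if first["depth"] == 0 and first["zero"] and not first["data"]:
        print("OK: first cluster is a zero cluster in the top image")
    else:
        print("BAD: first cluster is exposed from the backing file:", first)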

Comment 3 Nir Soffer 2021-06-13 23:58:36 UTC
Real flow reproducing the issue.

1. Create test disk 

$ virt-builder fedora-32 -o fedora-32.qcow2 \
    --format=qcow2 \
    --hostname=f32 \
    --ssh-inject=root \
    --root-password=password:root \
    --selinux-relabel \
    --install=qemu-guest-agent

2. Upload disk to iscsi/fc storage

$ upload_disk.py -c my_engine --sd-name my_sd --disk-sparse fedora-32.qcow2

4. Create vm with disk using:

interface: virtio-scsi
enable-discard: yes

It looks like discard cannot be enabled while attaching the disk to a new VM.
Edit the disk after creating the VM to enable discard.

5. Start the vm

6. Create a big file

In the guest run:

    # dd if=/dev/urandom bs=1M count=1024 of=big-file conv=fsync

7. Create snapshot 1

8. Delete the file and trim

In the guest run:

   # rm big-file
   # fstrim -av

This creates a lot of zeroed clusters in the VM's active disk snapshot.
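
If you want to confirm this on the host, a rough sketch like the following
counts the clusters that the active layer records as zero, using qemu-img map
as above. The volume path is a placeholder that depends on the storage domain
layout, and -U (--force-share) is needed while the VM is still running:

    # Sketch: count zero extents recorded in the active (top) layer.
    # VOLUME_PATH is a placeholder - locate the active volume of the disk
    # on the host, and run from its directory so a relative backing file
    # can be resolved.
    import json
    import subprocess

    VOLUME_PATH = "/path/to/active/volume"  # placeholder

    out = subprocess.run(
        ["qemu-img", "map", "--output", "json", "-U", VOLUME_PATH],
        check=True, capture_output=True, text=True).stdout

    zeroed = [e for e in json.loads(out) if e["depth"] == 0 and e["zero"]]
    print("zero extents in the top layer:", len(zeroed),
          "bytes:", sum(e["length"] for e in zeroed))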

9. Shutdown the vm

10. List disk snapshots

$ ./list_disk_snapshots.py -c my_engine 04e20159-443a-447e-bc2c-5620515137dc
[
  {
    "actual_size": 4026531840,
    "format": "cow",
    "id": "f18550e7-d2b2-427a-bbd0-d39d5d93cdf1",
    "parent": null,
    "status": "ok"
  },
  {
    "actual_size": 1073741824,
    "format": "cow",
    "id": "11e79858-d972-4e50-a4d3-501f759c09d7",
    "parent": "f18550e7-d2b2-427a-bbd0-d39d5d93cdf1",
    "status": "ok"
  }
]

11. Download base image

$ ./download_disk_snapshot.py -c my_engine my_sd f18550e7-d2b2-427a-bbd0-d39d5d93cdf1  base.qcow2

12. Download top image rebasing on top of base.qcow2

$  ./download_disk_snapshot.py -c my_engine my_sd 11e79858-d972-4e50-a4d3-501f759c09d7 --backing-file base.qcow2 snap1.qcow2

13. Create checksum for the original disk

$ ./checksum_disk.py -c my_engine 04e20159-443a-447e-bc2c-5620515137dc
{
    "algorithm": "blake2b",
    "block_size": 4194304,
    "checksum": "73942588b8c2734598d9636499a1324392305056ddd293a04c207de0e56d39c4"
}

14. Create checksum for the downloaded image

$ ./checksum_image.py snap1.qcow2
{
  "algorithm": "blake2b",
  "block_size": 4194304,
  "checksum": "73942588b8c2734598d9636499a1324392305056ddd293a04c207de0e56d39c4"
}

The checksums must match.

15. Upload downloaded image to new disk

$ ./upload_disk.py -c my_engine --sd-name my_sd --disk-format qcow2 --disk-sparse snap1.qcow2

16. Create new vm from this disk and start the vm

The VM must boot normally.

Comment 5 Nir Soffer 2021-06-20 12:10:19 UTC
Note that the fix requires vdsm >= 4.40.70.4.

Comment 6 Nir Soffer 2021-07-04 12:04:03 UTC
Notes for testing:

1. Reproduce the issue with qemu 6.0.0

This is possible only with ovirt-imageio < 2.2.0-1. I reproduced this
on Fedora 32 and RHEL 8.5.

2. Testing with RHEL 8.4

RHEL 8.4 provides qemu 5.2.0. The flows described in comment 2 and
comment 3 can be tested with this version.

ovirt-imageio 2.2.0-1 changes the way we get zero extent information from
qemu. We want to make sure that the new way does not introduce
regressions.
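
One way to sanity-check the reported extents during such testing is to query
imageio's extents API for a running image transfer. A rough sketch, assuming
an image transfer was already started with the SDK; the transfer URL, ticket
id and CA file below are placeholders:

    # Sketch: print the extents that ovirt-imageio reports for a transfer.
    # TRANSFER_URL and CA_FILE are placeholders - take them from the
    # ImageTransfer created with the SDK.
    import json
    import ssl
    from urllib.request import urlopen

    TRANSFER_URL = "https://host:54322/images/TICKET-ID"  # placeholder
    CA_FILE = "ca.pem"  # placeholder

    ctx = ssl.create_default_context(cafile=CA_FILE)
    with urlopen(TRANSFER_URL + "/extents", context=ctx) as resp:
        extents = json.loads(resp.read())

    # With the fix, a zeroed cluster in a single qcow2 snapshot should be
    # reported as zero but not as a hole.
    for extent in extents:
        print(extent)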

Comment 7 Nir Soffer 2021-07-04 12:07:28 UTC
(continued from comment 6)

3. Testing with RHEL 8.5

I tested the flows in comment 2 and comment 3 with qemu 6.0.0 on
RHEL 8.5, so there should be no issue on CentOS Stream running the
same version.

Comment 8 Evelina Shames 2021-07-11 07:34:37 UTC
(In reply to Nir Soffer from comment #2)
> To reproduce the issue we need to download a snapshot using the NBD
> backend. This flow is used by backup applications that use snapshot
> based backups.
> 
> 1. Install a host using RHEL 8.5 AV nightly
> 
> You should have this qemu version:
> $ rpm -q qemu-kvm
> qemu-kvm-6.0.0-17.module+el8.5.0+11173+c9fce0bb.x86_64
> 
> 2. Create a vm
> 3. Add a thin virtio-scsi disk to the vm
> 
> Inside the guest:
> 
> # lsblk
> NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
> sda      8:0    0    6G  0 disk 
> ├─sda1   8:1    0    1M  0 part 
> ├─sda2   8:2    0    1G  0 part /boot
> ├─sda3   8:3    0  615M  0 part [SWAP]
> └─sda4   8:4    0  4.4G  0 part /
> sdb      8:16   0   10G  0 disk 
> sr0     11:0    1 1024M  0 rom  
> 
> 4. Write data to the fist cluster of disk /dev/sdb
> 
>     # echo "data from base" > /dev/sdb
>     # sync
> 
> 5. Create snapshot including the second disk
> 
> 6. Zero the first cluster of disk sdb
> 
> In the guest run:
> 
>     # fallocate --punch-hole --length 64k /dev/sdb
> 
> In the guest we cannot see the "data from base" now:
> 
>     # dd if=/dev/sdb bs=512 count=1 status=none | hexdump
>     0000000 0000 0000 0000 0000 0000 0000 0000 0000
>     *
>     0000200
> 
> The first cluster contains zeroes now.
> 
> 7. Stop the vm
> 
> 8. List the second disk snapshots.
> 
> In this example the disk id is 2b720ee1-baad-48d8-bfdb-582946b82448
> 
> $ python3 list_disk_snapshots.py -c engine-dev
> 2b720ee1-baad-48d8-bfdb-582946b82448
> [
>   {
>     "actual_size": 1073741824,
>     "format": "cow",
>     "id": "dcb903e7-5567-4eb8-a240-85cbea1dce12",
>     "parent": null,
>     "status": "ok"
>   },
>   {
>     "actual_size": 1073741824,
>     "format": "cow",
>     "id": "fb14e6e4-00a7-48de-b7e0-d23996889cbd",
>     "parent": "dcb903e7-5567-4eb8-a240-85cbea1dce12",
>     "status": "ok"
>   }
> ]
> 
> 9. Download base image (id=dcb903e7-5567-4eb8-a240-85cbea1dce12)
> 
> In this example the disk is on the storage domain "iscsi-00".
> 
> $ python download_disk_snapshot.py -c engine-dev iscsi-00
> dcb903e7-5567-4eb8-a240-85cbea1dce12 base.qcow2
> 
> 10. Download top disk snapshot (id=fb14e6e4-00a7-48de-b7e0-d23996889cbd)
> 
> $ python3 download_disk_snapshot.py -c engine-dev iscsi-00
> fb14e6e4-00a7-48de-b7e0-d23996889cbd --backing-file base.qcow2 top.qcow2
> 
> 11. Create a checksum of the disk
> 
> $ python3 checksum_disk.py -c engine-dev 2b720ee1-baad-48d8-bfdb-582946b82448
> {
>     "algorithm": "blake2b",
>     "block_size": 4194304,
>     "checksum":
> "b6dcdb509ec27d672ab91ddf6289365469668fa0d6a5de5cbc594c0ea3102825"
> }
> 
> 12. Create a checksum of the downloaded image
> 
> $ python3 checksum_image.py top.qcow2
> {
>   "algorithm": "blake2b",
>   "block_size": 4194304,
>   "checksum":
> "fcb7d96381087e6a6d5b07421342d90dd32e6ad5a21b99f37372f29dcd491a7e"
> }
> 
> The checksums do not match, the downloaded image is does not contain the
> same data as the original disk.
> 

Verified on RHV-4.4.7-6:
The checksums match.


(In reply to Nir Soffer from comment #3)
> Real flow reproducing the issue.
> 
> 1. Create test disk 
> 
> $ virt-builder fedora-32 -o fedora-32.qcow2 \
>     --format=qcow2 \
>     --hostname=f32 \
>     --ssh-inject=root \
>     --root-password=password:root -\
>     --selinux-relabel \
>     --install=qemu-guest-agent
> 
> 2. Upload disk to iscsi/fc storage
> 
> $ upload_disk.py -c my_engine --sd-name my_sd --disk-sparse fedora-32.qcow2
> 
> 4. Create vm with disk using:
> 
> interface: virtio-scsi
> enable-discard: yes
> 
> Looks like enable discard cannot be enabled when attaching new vm
> to a disk. Edit the disk after creating the vm to enable discard.
> 
> 5. Start the vm
> 
> 6. Create a big file
> 
> In the guest run:
> 
>     # dd if=/dev/urandom bs=1M count=1024 of=big-file conv=fsync
> 
> 7. Create snapshot 1
> 
> 8. Delete the file and trim
> 
> In the guest run:
> 
>    # rm big-file
>    # fstrim -av
> 
> This creates a lot of zeroed clusters in the active vm disk snapshot.
> 
> 9. Shutdown the vm
> 
> 10. List disk snapshots
> 
> $ ./list_disk_snapshots.py -c my_engine 04e20159-443a-447e-bc2c-5620515137dc
> [
>   {
>     "actual_size": 4026531840,
>     "format": "cow",
>     "id": "f18550e7-d2b2-427a-bbd0-d39d5d93cdf1",
>     "parent": null,
>     "status": "ok"
>   },
>   {
>     "actual_size": 1073741824,
>     "format": "cow",
>     "id": "11e79858-d972-4e50-a4d3-501f759c09d7",
>     "parent": "f18550e7-d2b2-427a-bbd0-d39d5d93cdf1",
>     "status": "ok"
>   }
> ]
> 
> 11. Download base image
> 
> $ ./download_disk_snapshot.py -c my_engine my_sd
> f18550e7-d2b2-427a-bbd0-d39d5d93cdf1  base.qcow2
> 
> 12. Download top image rebasing on top of base.qcow2
> 
> $  ./download_disk_snapshot.py -c my_engine my_sd
> 11e79858-d972-4e50-a4d3-501f759c09d7 --backing-file base.qcow2 snap1.qcow2
> 
> 13. Create checksum for the original disk
> 
> $ ./checksuk_disk.py -c my_engine 04e20159-443a-447e-bc2c-5620515137dc
> {
>     "algorithm": "blake2b",
>     "block_size": 4194304,
>     "checksum":
> "73942588b8c2734598d9636499a1324392305056ddd293a04c207de0e56d39c4"
> }
> 
> 14. Create checksum for the downloaded image
> 
> $ ./checksum_image.py snap1.qcow2
> {
>   "algorithm": "blake2b",
>   "block_size": 4194304,
>   "checksum":
> "73942588b8c2734598d9636499a1324392305056ddd293a04c207de0e56d39c4"
> }
> 
> The checksums must match.
> 
> 15. Upload downloaded image to new disk
> 
> $ ./upload_disk.py -c my_engine --sd-name my_sd --disk-format qcow2
> --disk-sparse snap1.qcow2
> 
> 16. Create new vm from this disk and start the vm
> 
> The VM must boot normally.

Verified on RHV-4.4.7-6:
The checksums match and the VM boots normally.

Moving to 'Verified'.

Comment 9 Sandro Bonazzola 2021-07-28 14:16:50 UTC
This bugzilla is included in oVirt 4.4.7 release, published on July 6th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.