Bug 1971182 - [RFE] Use "qemu:allocation-depth" meta context to report holes
Summary: [RFE] Use "qemu:allocation-depth" meta context to report holes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-imageio
Classification: oVirt
Component: Common
Version: 2.1.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.7
Target Release: 2.2.0
Assignee: Nir Soffer
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-12 18:50 UTC by Nir Soffer
Modified: 2021-11-04 19:28 UTC
CC List: 7 users

Fixed In Version: ovirt-imageio-2.2.0-1
Clone Of:
Environment:
Last Closed: 2021-07-28 14:16:50 UTC
oVirt Team: Storage
Embargoed:
sbonazzo: ovirt-4.4?
michal.skrivanek: blocker-
pm-rhel: planning_ack?
pm-rhel: devel_ack+
pm-rhel: testing_ack+



Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 115193 0 master MERGED spec: Exclude broken libvirt version 2021-06-14 13:44:08 UTC
oVirt gerrit 115215 0 master MERGED nbd: Allow merging zero and dirty extents 2021-06-15 21:38:21 UTC
oVirt gerrit 115216 0 master MERGED nbd: Report zero status in dirty extents 2021-06-20 11:41:45 UTC
oVirt gerrit 115219 0 master MERGED nbd: Replace dirty flag with context type 2021-06-20 11:41:30 UTC
oVirt gerrit 115220 0 master MERGED nbd: Use qemu:allocation-depth meta context 2021-06-20 11:41:36 UTC
oVirt gerrit 115228 0 master MERGED spec: Require qemu-img providing allocation depth 2021-06-20 11:41:34 UTC
oVirt gerrit 115229 0 master MERGED spec: Require qemu-img providing allocation depth 2021-06-14 12:14:24 UTC
oVirt gerrit 115230 0 master MERGED nbd: Expose allocation depth in nbd server 2021-06-14 12:14:26 UTC
oVirt gerrit 115232 0 master MERGED tests: Minimize use of imageio internals 2021-06-14 12:14:21 UTC
oVirt gerrit 115233 0 master MERGED tests: Enable allocation-depth in qemu nbd server 2021-06-20 11:41:24 UTC
oVirt gerrit 115234 0 master MERGED nbd: Add constants for meta context names 2021-06-20 11:41:39 UTC
oVirt gerrit 115258 0 master MERGED nbd: Ignore unexpected bits in base:allocation 2021-06-20 11:41:27 UTC
oVirt gerrit 115266 0 master MERGED nbdutil: Support merging extents 2021-06-16 16:37:39 UTC
oVirt gerrit 115296 0 master MERGED tests: Rewrite nbdutil extents tests 2021-06-20 11:41:50 UTC
oVirt gerrit 115320 0 master MERGED spec: Require ovirt-imageio >= 2.2.0-1 2021-06-21 12:40:34 UTC
oVirt gerrit 115339 0 master MERGED Revert "spec: Exclude broken libvirt version" 2021-06-21 12:40:20 UTC

Description Nir Soffer 2021-06-12 18:50:05 UTC
Description of problem:

Since ovirt 4.4.4 ovirt-imageio supports transferring single snapshots:
- https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/list_disk_snapshots.py
- https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/download_disk_snapshot.py
- https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/upload_disk.py

This feature depends on detecting unallocated areas in single qcow2
images, and was implemented using the standard NBD "base:allocation" meta
context in qemu-nbd.

Using "base:allocation" is not entirely correct, since NBD server may omit
information about holes, but it works since ovirt-imageio uses only qemu-nbd.
Unfortunately qemu 6.0.0 change the way zeroed clusters in qcow2 images are
reported (bug 1968693). Previously they were reported as:

    NBD_STATE_ZERO

But with qemu 6.0.0 they are reported as:

    NBD_STATE_ZERO | NBD_STATE_HOLE

This change is considered a bug fix in qemu, so it will not be reverted in
upstream qemu.
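
To make the breakage concrete, here is a minimal sketch (not imageio code) of
a client that treats NBD_STATE_HOLE in "base:allocation" as "not allocated".
With qemu 6.0.0 a zeroed-but-allocated cluster carries both bits, so such a
client wrongly skips it:

    # Minimal sketch, not the actual imageio client code.
    NBD_STATE_HOLE = 0x1   # base:allocation: range may not be backed by real data
    NBD_STATE_ZERO = 0x2   # base:allocation: range reads as zeroes

    def is_unallocated(flags):
        # Old assumption: a hole means "nothing to copy from this image".
        return bool(flags & NBD_STATE_HOLE)

    # qemu 5.2.0: zeroed cluster in the top image reported as ZERO (2)
    print(is_unallocated(NBD_STATE_ZERO))                    # False - cluster is copied
    # qemu 6.0.0: same cluster reported as ZERO|HOLE (3)
    print(is_unallocated(NBD_STATE_ZERO | NBD_STATE_HOLE))   # True - cluster is wrongly skipped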

We can fix this issue using the new "qemu:allocation-depth" meta context
introduced in qemu 5.2.0. This meta context exposes reliable (not optional)
information about unallocated areas in a qcow2 image.
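
For example, the new context can be inspected directly with qemu-nbd and
nbdinfo (a sketch; assumes qemu-nbd >= 5.2 and libnbd's nbdinfo are installed,
and /tmp/nbd.sock and top.qcow2 are placeholder names):

$ qemu-nbd --read-only --persistent --allocation-depth \
      --socket=/tmp/nbd.sock --format=qcow2 top.qcow2 &
$ nbdinfo --map="qemu:allocation-depth" "nbd+unix:///?socket=/tmp/nbd.sock"

In the reported map, depth 0 means the range is unallocated (a real hole),
1 means it is allocated in the exported image, and larger values mean it
comes from an image deeper in the backing chain.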

Change the imageio NBD client to use "qemu:allocation-depth" and use it to
report holes.
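
A rough sketch of what the client side looks like, using the libnbd Python
bindings (this is an illustration, not the actual imageio patch; the socket
path is a placeholder and a real client iterates over the whole image):

    import nbd

    h = nbd.NBD()
    h.add_meta_context("base:allocation")        # still used for zero/data info
    h.add_meta_context("qemu:allocation-depth")  # reliable allocation info
    h.connect_unix("/tmp/nbd.sock")

    extents = {}

    def extent_cb(metacontext, offset, entries, err):
        # entries is a flat list: [length0, flags0, length1, flags1, ...]
        extents.setdefault(metacontext, []).extend(
            zip(entries[::2], entries[1::2]))

    # Query the first 1 GiB; a real client loops until the end of the image.
    h.block_status(2**30, 0, extent_cb)

    for length, depth in extents.get("qemu:allocation-depth", []):
        # depth 0 means unallocated in this image: report a hole.
        print("hole" if depth == 0 else "allocated", length)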

Version-Release number of selected component (if applicable):
2.1.1

With this change, uploading and downloading single snapshots should work with
both qemu 5.2.0 (RHEL 8.4) and qemu 6.0.0 (CentOS Stream).

Comment 2 Nir Soffer 2021-06-13 14:04:13 UTC
To reproduce the issue we need to download a snapshot using the NBD
backend. This flow is used by backup applications that use snapshot
based backups.

1. Install a host using RHEL 8.5 AV nightly

You should have this qemu version:
$ rpm -q qemu-kvm
qemu-kvm-6.0.0-17.module+el8.5.0+11173+c9fce0bb.x86_64

2. Create a vm
3. Add a thin virtio-scsi disk to the vm

Inside the guest:

# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0    6G  0 disk 
├─sda1   8:1    0    1M  0 part 
├─sda2   8:2    0    1G  0 part /boot
├─sda3   8:3    0  615M  0 part [SWAP]
└─sda4   8:4    0  4.4G  0 part /
sdb      8:16   0   10G  0 disk 
sr0     11:0    1 1024M  0 rom  

4. Write data to the first cluster of disk /dev/sdb

    # echo "data from base" > /dev/sdb
    # sync

5. Create snapshot including the second disk

6. Zero the first cluster of disk sdb

In the guest run:

    # fallocate --punch-hole --length 64k /dev/sdb

In the guest we cannot see the "data from base" now:

    # dd if=/dev/sdb bs=512 count=1 status=none | hexdump
    0000000 0000 0000 0000 0000 0000 0000 0000 0000
    *
    0000200

The first cluster contains zeroes now.

7. Stop the vm

8. List the second disk snapshots.

In this example the disk id is 2b720ee1-baad-48d8-bfdb-582946b82448

$ python3 list_disk_snapshots.py -c engine-dev 2b720ee1-baad-48d8-bfdb-582946b82448
[
  {
    "actual_size": 1073741824,
    "format": "cow",
    "id": "dcb903e7-5567-4eb8-a240-85cbea1dce12",
    "parent": null,
    "status": "ok"
  },
  {
    "actual_size": 1073741824,
    "format": "cow",
    "id": "fb14e6e4-00a7-48de-b7e0-d23996889cbd",
    "parent": "dcb903e7-5567-4eb8-a240-85cbea1dce12",
    "status": "ok"
  }
]

9. Download base image (id=dcb903e7-5567-4eb8-a240-85cbea1dce12)

In this example the disk is on the storage domain "iscsi-00".

$ python download_disk_snapshot.py -c engine-dev iscsi-00 dcb903e7-5567-4eb8-a240-85cbea1dce12 base.qcow2

10. Download top disk snapshot (id=fb14e6e4-00a7-48de-b7e0-d23996889cbd)

$ python3 download_disk_snapshot.py -c engine-dev iscsi-00 fb14e6e4-00a7-48de-b7e0-d23996889cbd --backing-file base.qcow2 top.qcow2

11. Create a checksum of the disk

$ python3 checksum_disk.py -c engine-dev 2b720ee1-baad-48d8-bfdb-582946b82448
{
    "algorithm": "blake2b",
    "block_size": 4194304,
    "checksum": "b6dcdb509ec27d672ab91ddf6289365469668fa0d6a5de5cbc594c0ea3102825"
}

12. Create a checksum of the downloaded image

$ python3 checksum_image.py top.qcow2
{
  "algorithm": "blake2b",
  "block_size": 4194304,
  "checksum": "fcb7d96381087e6a6d5b07421342d90dd32e6ad5a21b99f37372f29dcd491a7e"
}

The checksums do not match; the downloaded image does not contain the
same data as the original disk.

This flow should be tested like other image transfer flows.


The reason is that the disk snapshot dcb903e7-5567-4eb8-a240-85cbea1dce12
was reported as empty by qemu-nbd during the download, so download_disk_snapshot.py
created an empty qcow2 disk:

$ qemu-img map --output json top.qcow2 
[{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 327680},
{ "start": 65536, "length": 10737352704, "depth": 1, "zero": true, "data": false}]

The first cluster of the top image was zeroed in the guest, so we expect to
see it allocated in the top image (depth 0) instead of coming from the backing
file (depth 1):

$ qemu-img map --output json top.qcow2 
[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
{ "start": 65536, "length": 10737352704, "depth": 1, "zero": true, "data": false}]

So "data from base" is exposed to the guest.

To see the data that the guest would see, we can convert the disk to raw format:

$ qemu-img convert -f qcow2 -O raw top.qcow2 top.img

Looking at the first cluster shows the data from the base image:

$ dd if=top.img bs=512 count=1 status=none
data from base

If we had a file system on this disk, the file system would be corrupted 
in the downloaded image.

Comment 3 Nir Soffer 2021-06-13 23:58:36 UTC
Real flow reproducing the issue.

1. Create test disk 

$ virt-builder fedora-32 -o fedora-32.qcow2 \
    --format=qcow2 \
    --hostname=f32 \
    --ssh-inject=root \
    --root-password=password:root \
    --selinux-relabel \
    --install=qemu-guest-agent

2. Upload disk to iscsi/fc storage

$ upload_disk.py -c my_engine --sd-name my_sd --disk-sparse fedora-32.qcow2

4. Create vm with disk using:

interface: virtio-scsi
enable-discard: yes

It looks like discard cannot be enabled while attaching the disk when
creating the vm. Edit the disk after creating the vm to enable discard.

5. Start the vm

6. Create a big file

In the guest run:

    # dd if=/dev/urandom bs=1M count=1024 of=big-file conv=fsync

7. Create snapshot 1

8. Delete the file and trim

In the guest run:

   # rm big-file
   # fstrim -av

This creates a lot of zeroed clusters in the active vm disk snapshot.

9. Shutdown the vm

10. List disk snapshots

$ ./list_disk_snapshots.py -c my_engine 04e20159-443a-447e-bc2c-5620515137dc
[
  {
    "actual_size": 4026531840,
    "format": "cow",
    "id": "f18550e7-d2b2-427a-bbd0-d39d5d93cdf1",
    "parent": null,
    "status": "ok"
  },
  {
    "actual_size": 1073741824,
    "format": "cow",
    "id": "11e79858-d972-4e50-a4d3-501f759c09d7",
    "parent": "f18550e7-d2b2-427a-bbd0-d39d5d93cdf1",
    "status": "ok"
  }
]

11. Download base image

$ ./download_disk_snapshot.py -c my_engine my_sd f18550e7-d2b2-427a-bbd0-d39d5d93cdf1  base.qcow2

12. Download top image rebasing on top of base.qcow2

$  ./download_disk_snapshot.py -c my_engine my_sd 11e79858-d972-4e50-a4d3-501f759c09d7 --backing-file base.qcow2 snap1.qcow2

13. Create checksum for the original disk

$ ./checksum_disk.py -c my_engine 04e20159-443a-447e-bc2c-5620515137dc
{
    "algorithm": "blake2b",
    "block_size": 4194304,
    "checksum": "73942588b8c2734598d9636499a1324392305056ddd293a04c207de0e56d39c4"
}

14. Create checksum for the downloaded image

$ ./checksum_image.py snap1.qcow2
{
  "algorithm": "blake2b",
  "block_size": 4194304,
  "checksum": "73942588b8c2734598d9636499a1324392305056ddd293a04c207de0e56d39c4"
}

The checksums must match.

15. Upload downloaded image to new disk

$ ./upload_disk.py -c my_engine --sd-name my_sd --disk-format qcow2 --disk-sparse snap1.qcow2

16. Create new vm from this disk and start the vm

The VM must boot normally.

Comment 5 Nir Soffer 2021-06-20 12:10:19 UTC
Note that the fix requires vdsm >= 4.40.70.4.
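
(A quick way to verify this on the host, in the same style as the rpm query
in comment 2:)

$ rpm -q vdsm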

Comment 6 Nir Soffer 2021-07-04 12:04:03 UTC
Notes for testing:

1. Reproduce the issue with qemu 6.0.0

This is possible only with ovirt-imageio < 2.2.0-1. I reproduced this
on Fedora 32 and RHEL 8.5.

2. Testing with RHEL 8.4

RHEL 8.4 provides qemu 5.2.0. The flows described in comment 2 and
comment 3 can be tested with this version.

ovirt-imageio 2.2.0-1 changes the way we get zero extent information from
qemu. We want to make sure that the new way does not introduce regressions.
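
To confirm which combination is installed on a host before running the flows
above, something like this can be used (package names as shipped in oVirt 4.4;
adjust to the packages present on the host):

$ rpm -q ovirt-imageio-daemon ovirt-imageio-common qemu-kvm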

Comment 7 Nir Soffer 2021-07-04 12:07:28 UTC
(continued from comment 6)

3. Testing with RHEL 8.5

I tested the flows in comment 2 and comment 3 with qemu 6.0.0 on
RHEL 8.5, so there should be no issue on CentOS Stream running the
same version.

Comment 8 Evelina Shames 2021-07-11 07:34:37 UTC
(In reply to Nir Soffer from comment #2)
> [reproduction steps 1-12 quoted in full from comment 2]

Verified on RHV-4.4.7-6:
The checksums match.


(In reply to Nir Soffer from comment #3)
> [flow steps 1-16 quoted in full from comment 3]

Verified on RHV-4.4.7-6:
The checksums match and the VM boots normally.

Moving to 'Verified'.

Comment 9 Sandro Bonazzola 2021-07-28 14:16:50 UTC
This bugzilla is included in oVirt 4.4.7 release, published on July 6th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

