Bug 1971182 - [RFE] Use "qemu:allocation-depth" meta context to report holes
Summary: [RFE] Use "qemu:allocation-depth" meta context to report holes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-imageio
Classification: oVirt
Component: Common
Version: 2.1.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.4.7
Target Release: 2.2.0
Assignee: Nir Soffer
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-12 18:50 UTC by Nir Soffer
Modified: 2021-11-04 19:28 UTC
CC List: 7 users

Fixed In Version: ovirt-imageio-2.2.0-1
Clone Of:
Environment:
Last Closed: 2021-07-28 14:16:50 UTC
oVirt Team: Storage
Embargoed:
sbonazzo: ovirt-4.4?
michal.skrivanek: blocker-
pm-rhel: planning_ack?
pm-rhel: devel_ack+
pm-rhel: testing_ack+



Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 115193 0 master MERGED spec: Exclude broken libvirt version 2021-06-14 13:44:08 UTC
oVirt gerrit 115215 0 master MERGED nbd: Allow merging zero and dirty extents 2021-06-15 21:38:21 UTC
oVirt gerrit 115216 0 master MERGED nbd: Report zero status in dirty extents 2021-06-20 11:41:45 UTC
oVirt gerrit 115219 0 master MERGED nbd: Replace dirty flag with context type 2021-06-20 11:41:30 UTC
oVirt gerrit 115220 0 master MERGED nbd: Use qemu:allocation-depth meta context 2021-06-20 11:41:36 UTC
oVirt gerrit 115228 0 master MERGED spec: Require qemu-img providing allocation depth 2021-06-20 11:41:34 UTC
oVirt gerrit 115229 0 master MERGED spec: Require qemu-img providing allocation depth 2021-06-14 12:14:24 UTC
oVirt gerrit 115230 0 master MERGED nbd: Expose allocation depth in nbd server 2021-06-14 12:14:26 UTC
oVirt gerrit 115232 0 master MERGED tests: Minimize use of imageio internals 2021-06-14 12:14:21 UTC
oVirt gerrit 115233 0 master MERGED tests: Enable allocation-depth in qemu nbd server 2021-06-20 11:41:24 UTC
oVirt gerrit 115234 0 master MERGED nbd: Add constants for meta context names 2021-06-20 11:41:39 UTC
oVirt gerrit 115258 0 master MERGED nbd: Ignore unexpected bits in base:allocation 2021-06-20 11:41:27 UTC
oVirt gerrit 115266 0 master MERGED nbdutil: Support merging extents 2021-06-16 16:37:39 UTC
oVirt gerrit 115296 0 master MERGED tests: Rewrite nbdutil extents tests 2021-06-20 11:41:50 UTC
oVirt gerrit 115320 0 master MERGED spec: Require ovirt-imageio >= 2.2.0-1 2021-06-21 12:40:34 UTC
oVirt gerrit 115339 0 master MERGED Revert "spec: Exclude broken libvirt version" 2021-06-21 12:40:20 UTC

Description Nir Soffer 2021-06-12 18:50:05 UTC
Description of problem:

Since ovirt 4.4.4 ovirt-imageio supports transferring single snapshots:
- https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/list_disk_snapshots.py
- https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/download_disk_snapshot.py
- https://github.com/oVirt/ovirt-engine-sdk/blob/master/sdk/examples/upload_disk.py

This feature depends on detecting unallocated areas in single qcow2
images, and was implemented using the standard NBD "base:allocation" meta
context in qemu-nbd.

Using "base:allocation" is not entirely correct, since NBD server may omit
information about holes, but it works since ovirt-imageio uses only qemu-nbd.
Unfortunately qemu 6.0.0 change the way zeroed clusters in qcow2 images are
reported (bug 1968693). Previously they were reported as:

    NBD_STATE_ZERO

But with qemu 6.0.0 they are reported as:

    NBD_STATE_ZERO | NBD_STATE_HOLE

This change is considered a bug fix in qemu, so it will not be reverted in
upstream qemu.
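
To make the breakage concrete, here is a minimal sketch (not imageio code) of
a client that treats NBD_STATE_HOLE in "base:allocation" as "not allocated".
With qemu 6.0.0 a zeroed-but-allocated cluster carries both bits, so such a
client wrongly skips it:

    # Minimal sketch, not the actual imageio client code.
    NBD_STATE_HOLE = 0x1   # base:allocation: range may not be backed by real data
    NBD_STATE_ZERO = 0x2   # base:allocation: range reads as zeroes

    def is_unallocated(flags):
        # Old assumption: a hole means "nothing to copy from this image".
        return bool(flags & NBD_STATE_HOLE)

    # qemu 5.2.0: zeroed cluster in the top image reported as ZERO (2)
    print(is_unallocated(NBD_STATE_ZERO))                    # False - cluster is copied
    # qemu 6.0.0: same cluster reported as ZERO|HOLE (3)
    print(is_unallocated(NBD_STATE_ZERO | NBD_STATE_HOLE))   # True - cluster is wrongly skipped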

We can fix this issue using the new "qemu:allocation-depth" meta context
introduced in qemu 5.2.0. This meta context exposes reliable (not optional)
information about unallocated areas in a qcow2 image.
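
For example, the new context can be inspected directly with qemu-nbd and
nbdinfo (a sketch; assumes qemu-nbd >= 5.2 and libnbd's nbdinfo are installed,
and /tmp/nbd.sock and top.qcow2 are placeholder names):

$ qemu-nbd --read-only --persistent --allocation-depth \
      --socket=/tmp/nbd.sock --format=qcow2 top.qcow2 &
$ nbdinfo --map="qemu:allocation-depth" "nbd+unix:///?socket=/tmp/nbd.sock"

In the reported map, depth 0 means the range is unallocated (a real hole),
1 means it is allocated in the exported image, and larger values mean it
comes from an image deeper in the backing chain.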

Change the imageio NBD client to use "qemu:allocation-depth" and use it to
report holes.
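
A rough sketch of what the client side looks like, using the libnbd Python
bindings (this is an illustration, not the actual imageio patch; the socket
path is a placeholder and a real client iterates over the whole image):

    import nbd

    h = nbd.NBD()
    h.add_meta_context("base:allocation")        # still used for zero/data info
    h.add_meta_context("qemu:allocation-depth")  # reliable allocation info
    h.connect_unix("/tmp/nbd.sock")

    extents = {}

    def extent_cb(metacontext, offset, entries, err):
        # entries is a flat list: [length0, flags0, length1, flags1, ...]
        extents.setdefault(metacontext, []).extend(
            zip(entries[::2], entries[1::2]))

    # Query the first 1 GiB; a real client loops until the end of the image.
    h.block_status(2**30, 0, extent_cb)

    for length, depth in extents.get("qemu:allocation-depth", []):
        # depth 0 means unallocated in this image: report a hole.
        print("hole" if depth == 0 else "allocated", length)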

Version-Release number of selected component (if applicable):
2.1.1

With this change, uploading and downloading single snapshots should work with
both qemu 5.2.0 (RHEL 8.4) and qemu 6.0.0 (CentOS Stream).

Comment 2 Nir Soffer 2021-06-13 14:04:13 UTC
To reproduce the issue we need to download a snapshot using the NBD
backend. This flow is used by backup applications that use snapshot
based backups.

1. Install a host using RHEL 8.5 AV nightly

You should have this qemu version:
$ rpm -q qemu-kvm
qemu-kvm-6.0.0-17.module+el8.5.0+11173+c9fce0bb.x86_64

2. Create a vm
3. Add a thin virtio-scsi disk to the vm

Inside the guest:

# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0    6G  0 disk 
├─sda1   8:1    0    1M  0 part 
├─sda2   8:2    0    1G  0 part /boot
├─sda3   8:3    0  615M  0 part [SWAP]
└─sda4   8:4    0  4.4G  0 part /
sdb      8:16   0   10G  0 disk 
sr0     11:0    1 1024M  0 rom  

4. Write data to the first cluster of disk /dev/sdb

    # echo "data from base" > /dev/sdb
    # sync

5. Create snapshot including the second disk

6. Zero the first cluster of disk sdb

In the guest run:

    # fallocate --punch-hole --length 64k /dev/sdb

In the guest we cannot see the "data from base" now:

    # dd if=/dev/sdb bs=512 count=1 status=none | hexdump
    0000000 0000 0000 0000 0000 0000 0000 0000 0000
    *
    0000200

The first cluster contains zeroes now.

7. Stop the vm

8. List the second disk snapshots.

In this example the disk id is 2b720ee1-baad-48d8-bfdb-582946b82448

$ python3 list_disk_snapshots.py -c engine-dev 2b720ee1-baad-48d8-bfdb-582946b82448
[
  {
    "actual_size": 1073741824,
    "format": "cow",
    "id": "dcb903e7-5567-4eb8-a240-85cbea1dce12",
    "parent": null,
    "status": "ok"
  },
  {
    "actual_size": 1073741824,
    "format": "cow",
    "id": "fb14e6e4-00a7-48de-b7e0-d23996889cbd",
    "parent": "dcb903e7-5567-4eb8-a240-85cbea1dce12",
    "status": "ok"
  }
]

9. Download base image (id=dcb903e7-5567-4eb8-a240-85cbea1dce12)

In this example the disk is on the storage domain "iscsi-00".

$ python download_disk_snapshot.py -c engine-dev iscsi-00 dcb903e7-5567-4eb8-a240-85cbea1dce12 base.qcow2

10. Download top disk snapshot (id=fb14e6e4-00a7-48de-b7e0-d23996889cbd)

$ python3 download_disk_snapshot.py -c engine-dev iscsi-00 fb14e6e4-00a7-48de-b7e0-d23996889cbd --backing-file base.qcow2 top.qcow2

11. Create a checksum of the disk

$ python3 checksum_disk.py -c engine-dev 2b720ee1-baad-48d8-bfdb-582946b82448
{
    "algorithm": "blake2b",
    "block_size": 4194304,
    "checksum": "b6dcdb509ec27d672ab91ddf6289365469668fa0d6a5de5cbc594c0ea3102825"
}

12. Create a checksum of the downloaded image

$ python3 checksum_image.py top.qcow2
{
  "algorithm": "blake2b",
  "block_size": 4194304,
  "checksum": "fcb7d96381087e6a6d5b07421342d90dd32e6ad5a21b99f37372f29dcd491a7e"
}

The checksums do not match; the downloaded image does not contain the
same data as the original disk.

This flow should be tested like other image transfer flows.


The reason is that the disk snapshot dcb903e7-5567-4eb8-a240-85cbea1dce12
was reported as empty by qemu-nbd during the download, so download_disk_snapshot.py
created an empty qcow2 disk:

$ qemu-img map --output json top.qcow2 
[{ "start": 0, "length": 65536, "depth": 1, "zero": false, "data": true, "offset": 327680},
{ "start": 65536, "length": 10737352704, "depth": 1, "zero": true, "data": false}]

The first cluster of the top image was zeroed in the guest, so we expect to
see it allocated in the top image (depth 0) instead of coming from the backing
file (depth 1):

$ qemu-img map --output json top.qcow2 
[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false},
{ "start": 65536, "length": 10737352704, "depth": 1, "zero": true, "data": false}]

So "data from base" is exposed to the guest.

To see the data that the guest would see, we can convert the disk to raw format:

$ qemu-img convert -f qcow2 -O raw top.qcow2 top.img

Looking at the first cluster shows the data from the base image:

$ dd if=top.img bs=512 count=1 status=none
data from base

If we had a file system on this disk, the file system would be corrupted 
in the downloaded image.

Comment 3 Nir Soffer 2021-06-13 23:58:36 UTC
Real flow reproducing the issue.

1. Create test disk 

$ virt-builder fedora-32 -o fedora-32.qcow2 \
    --format=qcow2 \
    --hostname=f32 \
    --ssh-inject=root \
    --root-password=password:root \
    --selinux-relabel \
    --install=qemu-guest-agent

2. Upload disk to iscsi/fc storage

$ upload_disk.py -c my_engine --sd-name my_sd --disk-sparse fedora-32.qcow2

4. Create vm with disk using:

interface: virtio-scsi
enable-discard: yes

It looks like discard cannot be enabled while attaching the disk when
creating the vm. Edit the disk after creating the vm to enable discard.

5. Start the vm

6. Create a big file

In the guest run:

    # dd if=/dev/urandom bs=1M count=1024 of=big-file conv=fsync

7. Create snapshot 1

8. Delete the file and trim

In the guest run:

   # rm big-file
   # fstrim -av

This creates a lot of zeroed clusters in the active vm disk snapshot.

9. Shutdown the vm

10. List disk snapshots

$ ./list_disk_snapshots.py -c my_engine 04e20159-443a-447e-bc2c-5620515137dc
[
  {
    "actual_size": 4026531840,
    "format": "cow",
    "id": "f18550e7-d2b2-427a-bbd0-d39d5d93cdf1",
    "parent": null,
    "status": "ok"
  },
  {
    "actual_size": 1073741824,
    "format": "cow",
    "id": "11e79858-d972-4e50-a4d3-501f759c09d7",
    "parent": "f18550e7-d2b2-427a-bbd0-d39d5d93cdf1",
    "status": "ok"
  }
]

11. Download base image

$ ./download_disk_snapshot.py -c my_engine my_sd f18550e7-d2b2-427a-bbd0-d39d5d93cdf1  base.qcow2

12. Download top image rebasing on top of base.qcow2

$  ./download_disk_snapshot.py -c my_engine my_sd 11e79858-d972-4e50-a4d3-501f759c09d7 --backing-file base.qcow2 snap1.qcow2

13. Create checksum for the original disk

$ ./checksum_disk.py -c my_engine 04e20159-443a-447e-bc2c-5620515137dc
{
    "algorithm": "blake2b",
    "block_size": 4194304,
    "checksum": "73942588b8c2734598d9636499a1324392305056ddd293a04c207de0e56d39c4"
}

14. Create checksum for the downloaded image

$ ./checksum_image.py snap1.qcow2
{
  "algorithm": "blake2b",
  "block_size": 4194304,
  "checksum": "73942588b8c2734598d9636499a1324392305056ddd293a04c207de0e56d39c4"
}

The checksums must match.

15. Upload downloaded image to new disk

$ ./upload_disk.py -c my_engine --sd-name my_sd --disk-format qcow2 --disk-sparse snap1.qcow2

16. Create new vm from this disk and start the vm

The VM must boot normally.

Comment 5 Nir Soffer 2021-06-20 12:10:19 UTC
Note that the fix requires vdsm >= 4.40.70.4.
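
(A quick way to verify this on the host, in the same style as the rpm query
in comment 2:)

$ rpm -q vdsm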

Comment 6 Nir Soffer 2021-07-04 12:04:03 UTC
Notes for testing:

1. Reproduce the issue with qemu 6.0.0

This is possible only with ovirt-imageio < 2.2.0-1. I reproduced this
on Fedora 32 and RHEL 8.5.

2. Testing with RHEL 8.4

RHEL 8.4 provides qemu 5.2.0. The flows described in comment 2 and
comment 3 can be tested with this version.

ovirt-imageio 2.2.0-1 changes the way we get zero extent information from
qemu. We want to make sure that the new way does not introduce regressions.
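
To confirm which combination is installed on a host before running the flows
above, something like this can be used (package names as shipped in oVirt 4.4;
adjust to the packages present on the host):

$ rpm -q ovirt-imageio-daemon ovirt-imageio-common qemu-kvm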

Comment 7 Nir Soffer 2021-07-04 12:07:28 UTC
(continued from comment 6)

3. Testing with RHEL 8.5

I tested the flows in comment 2 and comment 3 with qemu 6.0.0 on
RHEL 8.5, so there should be no issue on CentOS Stream running the
same version.

Comment 8 Evelina Shames 2021-07-11 07:34:37 UTC
(In reply to Nir Soffer from comment #2)
> [reproduction steps 1-12 quoted in full from comment 2]

Verified on RHV-4.4.7-6:
The checksums match.


(In reply to Nir Soffer from comment #3)
> [flow steps 1-16 quoted in full from comment 3]

Verified on RHV-4.4.7-6:
The checksums match and the VM boots normally.

Moving to 'Verified'.

Comment 9 Sandro Bonazzola 2021-07-28 14:16:50 UTC
This bugzilla is included in oVirt 4.4.7 release, published on July 6th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.7 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

