1533975 – detect-zeroes=unmap/on does not produce a sparse file on NFS v4.1 when attempting blockdev/drive-mirror

Bug 1533975 - detect-zeroes=unmap/on does not produce a sparse file on NFS v4.1 when attempting blockdev/drive-mirror

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	qemu-kvm-rhev
Sub Component:
Version:	7.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Kevin Wolf
QA Contact:	aihua liang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1229843 1297255

Reported:	2018-01-12 17:01 UTC by Peter Krempa
Modified:	2022-03-13 14:38 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-12-13 20:47:27 UTC
Target Upstream Version:
Embargoed:

Description Peter Krempa 2018-01-12 17:01:20 UTC

Description of problem:
Even when detect-zeroes and unmap are enabled, writing zeroes to an image file stored on NFSv4.1-backed storage does not produce a sparse file.

Version-Release number of selected component (if applicable):
Observed on current upstream, thus all prior versions are expected to be impacted.


How reproducible:
100%


Steps to Reproduce:
1. Create a qemu VM with a disk that has unmapped regions on an NFSv4.1 server. The following example uses a file created by:

dd if=/dev/urandom of=/data/tmp/sparse.raw seek=10000 count=1 bs=1M

2. Pre-create and attach a new backend file via blockdev-add:

dd if=/dev/zero of=/data/tmp/copy seek=10000 count=1 bs=1M

'{"execute":"blockdev-add", "arguments":{"driver":"raw","node-name":"test","detect-zeroes":"unmap","discard":"unmap","file":{"driver":"file","filename":"/data/tmp/copy"}}}'

3. Copy over the data:
'{"execute":"blockdev-mirror","arguments":{"job-id":"testjob", "device": "#block194", "target": "test", "filter-node-name":"testfilternode","sync":"full"}}'

Notes:
1) The same results occur when 'on' is used instead of 'unmap' for 'detect-zeroes'.
2) "sync":"top" does not make sense for a raw file.
3) Same results are obtained when using 'drive-mirror' and using the following JSON for 'target':
    json:{"driver":"raw", "detect-zeroes":"unmap", "discard":"unmap", "file": { "filename":"%s", "driver":"file", "discard":"unmap", "detect-zeroes":"unmap" } }

with 'format' not set and 'mode' set to 'existing'.

4) The code attempts to write 'efficient' zeroes, but this is apparently not supported on NFS. There is no way (known to me) to skip zero writes.

Actual results:
# du -h * 
9.8G	copy
1.0M	sparse.raw

Expected results:
# du -h * 
1.0M	copy
1.0M	sparse.raw
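
The difference between the two results above is the gap between a file's apparent size and its allocated size. As a minimal sketch (not part of the original report; the helper names `sizes` and `make_sparse` are made up for illustration), the same check that `du` performs can be done from Python via `os.stat`, assuming a Linux host where `st_blocks` is counted in 512-byte units:

```python
import os
import tempfile


def sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    # On Linux, st_blocks is counted in 512-byte units (assumption: Linux host).
    return st.st_size, st.st_blocks * 512


def make_sparse(path, hole_mib=100, data_bytes=4096):
    """Create a file with a hole of hole_mib MiB followed by random data."""
    with open(path, "wb") as f:
        f.seek(hole_mib * 1024 * 1024)  # seeking past EOF leaves a hole
        f.write(os.urandom(data_bytes))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "sparse.raw")
        make_sparse(path)
        apparent, allocated = sizes(path)
        # On a filesystem with hole support, allocated is far below apparent.
        print(f"apparent={apparent} allocated={allocated}")
```

Whether the allocated size actually stays small depends on the filesystem backing the path, which is exactly what differs between NFS 4.1 and 4.2 in this report.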

Additional info:
Using 'detect-zeroes' to sparsify a file was suggested in 1297255 as a way to preserve the sparseness of storage files on NFS versions prior to 4.2; that is impossible to implement without this being fixed.

The hack of passing the JSON into drive-mirror is there so that this can be used before blockdev is fully supported (at least, that was the attempt).

Comment 3 Kevin Wolf 2018-01-16 17:16:34 UTC

(In reply to Peter Krempa from comment #0)
> Additional info:
> Using 'detect-zeroes' to sparsify a file was suggested to be used to keep
> sparseness of storage files on NFS prior to 4.2 in 1297255, which is
> impossible to implement without this fixed.

I can't see a comment suggesting this in bug 1297255, but in either case, this approach isn't going to work for NFS < 4.2. Old NFS versions just don't support thin provisioning at all, so qemu can neither punch holes nor get information about the allocation status of blocks.

The reason 'qemu-img convert' can be more efficient is that it knows it just created the image file, so any parts of the file that it doesn't write to will read as zero. The best we could do to allow the same in the mirror block job would be to add a flag to the QMP command that lets libvirt tell QEMU, essentially, "I promise this whole image reads as zeros; you can just skip any write_zeroes and discard commands without impacting the result".

Would you think that libvirt has enough information to pass this option, and would it solve what you need?

Preferably, of course, everyone would just switch to NFS 4.2, but I'm told that this is unlikely to be a viable solution...
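
To illustrate the 'qemu-img convert' point above: when the copier creates the target itself, it can simply skip all-zero blocks, because unwritten regions of a fresh file read as zero. The following is only a sketch of that idea in Python, not QEMU's actual implementation; the function name `sparse_copy` and the 64 KiB block size are assumptions made for the example.

```python
import os
import tempfile

BLOCK = 64 * 1024  # illustrative block size, not what QEMU uses


def sparse_copy(src_path, dst_path, block=BLOCK):
    """Copy src into a freshly created dst, skipping all-zero blocks.

    Returns the number of blocks that were actually written.
    """
    written = 0
    zero = bytes(block)
    # "wb" creates or truncates dst, so every unwritten byte reads as zero.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        offset = 0
        while True:
            buf = src.read(block)
            if not buf:
                break
            if buf != zero[:len(buf)]:
                dst.seek(offset)
                dst.write(buf)
                written += 1
            offset += len(buf)
        # Extend dst to the full source length even if the tail was skipped.
        dst.truncate(offset)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src.raw")
        dst = os.path.join(d, "dst.raw")
        with open(src, "wb") as f:
            f.write(bytes(3 * BLOCK) + b"\x01" * BLOCK + bytes(BLOCK))
        print("blocks written:", sparse_copy(src, dst))
```

This only works because nothing else writes to the target during the copy and the target is known to start out as zeroes, which is the promise the proposed QMP flag would have to carry.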

Comment 4 Peter Krempa 2018-01-22 15:02:30 UTC

(In reply to Kevin Wolf from comment #3)
> (In reply to Peter Krempa from comment #0)
> > Additional info:
> > Using 'detect-zeroes' to sparsify a file was suggested to be used to keep
> > sparseness of storage files on NFS prior to 4.2 in 1297255, which is
> > impossible to implement without this fixed.
> 
> I can't see a comment suggesting this in bug 1297255, but in either case, this
> approach isn't going to work for NFS < 4.2. Old NFS versions just don't
> support thin provisioning at all, so qemu can neither punch holes nor get
> information about the allocation status of blocks.
> 
> The reason why 'qemu-img convert' can be more efficient is because it knows
> that it just created the image file and that any parts of the file that it
> doesn't write to will read as zero. The best we could do to allow the same
> in the mirror block job would be adding a flag in the QMP command that
> allows libvirt to tell essentially "I promise this whole image reads as
> zeros, you can just skip any write_zeroes and discard commands without
> impacting the result".

Yes, that is exactly what I had in mind. In libvirt we can offload this knowledge to the management applications, as they usually pre-create the image and pass it in with the --reuse-external flag, which translates to mode: existing.

> 
> Would you think that libvirt has enough information to pass this option, and
> would it solve what you need?

We can require users to certify this the same way as we require passing an image which has enough space.

> 
> Preferably, of course, everyone would just switch to NFS 4.2, but I'm told
> that this is unlikely to be a viable solution...

Yes, obviously. It actually works great there.

Comment 12 Kevin Wolf 2018-06-27 09:29:02 UTC

So I looked some more into it and the mirror block job actually already tries quite hard to avoid unnecessary writes. The reason that it is defeated here is that the setup chosen simply doesn't support thin provisioning:

* Only allocated blocks are initially marked as dirty (and will therefore be copied). However, the source filesystem doesn't support querying whether a block is allocated or not. You have to read it and manually check whether it's zero.

* The target filesystem doesn't support efficient zero writes, so even if you know that you need to write zeros, it doesn't help you. Just skipping any zero writes is wrong because the guest keeps working on the image, and it could first write non-zero data to it and later overwrite it with zeroes. If we copied the non-zero data, we must zero it out again even though the target had zeros initially.

* Neither side uses a proper image format; both use raw files. This means that we only get the features that the file system already provides on its own. If qcow2 had been used for either source or target, the mirror block job would have a way to keep things sparse.
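
The second point can be shown with a toy model. This is a simplified simulation, not QEMU code: blocks are plain integers, the target starts out all zero, and the function name `mirror` and the list of `(block, value)` writes are invented for the example.

```python
NBLOCKS = 4


def mirror(ops, skip_zero_writes):
    """Replay mirrored writes onto a target that initially reads as zero."""
    target = [0] * NBLOCKS
    for block, value in ops:
        if value == 0 and skip_zero_writes:
            # Naive optimisation: "the target was zero anyway".
            continue
        target[block] = value
    return target


# The guest writes non-zero data to block 1, then overwrites it with zeroes.
guest_writes = [(1, 0xAB), (1, 0)]
source_after = [0, 0, 0, 0]

if __name__ == "__main__":
    print("faithful:", mirror(guest_writes, skip_zero_writes=False))
    print("skipping:", mirror(guest_writes, skip_zero_writes=True))
```

With zero writes skipped, block 1 of the target keeps the stale non-zero value, so the copy no longer matches the source.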

Reconstructing the allocation information while initially marking blocks as dirty isn't possible either: it would involve reading in the whole image (a second time) while the VM is blocked, which would take far too long to be reasonable.

Possible solutions for the problem, in descending order of preferability:

1. Use qcow2 for the target image. (No changes in QEMU/libvirt necessary)

2. Get the concurrent guest writes out of the way and use qemu-img. This could be achieved by an image fleecing setup (add a temporary overlay image, blockdev-backup from the active guest image to the temporary overlay and start the built-in NBD server for the overlay) and qemu-img convert from an NBD source. (Changes might be necessary in libvirt (not sure), but not QEMU)

3. Add an option to the mirror job that makes it manage another bitmap in memory that tracks for each block in the target image whether it has already been written to or is still in its original state. Zero writes could then be skipped if the block had never been written to. (Changes necessary to every layer in the virt stack)
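
Option 3 can be sketched as follows. This is a toy model under stated assumptions, not a proposal for the actual QEMU code: the class name `TrackedTarget` is hypothetical, blocks are integers, and the target is assumed to read as all zeroes before the job starts.

```python
class TrackedTarget:
    """Target image with an in-memory 'already written' bitmap."""

    def __init__(self, nblocks):
        self.data = [0] * nblocks         # assumed to read as zero initially
        self.touched = [False] * nblocks  # one flag per target block
        self.writes_issued = 0

    def write(self, block, value):
        if value == 0 and not self.touched[block]:
            # Block is still in its original zero state: skip the write.
            return
        self.data[block] = value
        self.touched[block] = True
        self.writes_issued += 1


if __name__ == "__main__":
    t = TrackedTarget(4)
    t.write(0, 0)     # skipped, block 0 was never written
    t.write(1, 0xAB)  # written, block 1 is now marked as touched
    t.write(1, 0)     # must be written to clear the stale data
    print(t.data, t.writes_issued)
```

Zero writes to untouched blocks cost nothing, while a zero write to a block that previously received data is still issued, which avoids the stale-data problem of skipping zero writes unconditionally.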

Comment 14 Ademar Reis 2018-12-13 20:47:27 UTC

(In reply to Kevin Wolf from comment #12)
> Possible solutions for the problem, in descending order of preferability:
> 
> 1. Use qcow2 for the target image. (No changes in QEMU/libvirt necessary)
> 
> 2. Get the concurrent guest writes out of the way and use qemu-img. This
> could be achieved by an image fleecing setup (add a temporary overlay image,
> blockdev-backup from the active guest image to the temporary overlay and
> start the built-in NBD server for the overlay) and qemu-img convert from an
> NBD source. (Changes might be necessary in libvirt (not sure), but not QEMU)
> 
> 3. Add an option to the mirror job that makes it manage another bitmap in
> memory that tracks for each block in the target image whether it has already
> been written to or is still in its original state. Zero writes could then be
> skipped if the block had never been written to. (Changes necessary to every
> layer in the virt stack)

I should add another option (#0): upgrade to NFS 4.2+, as this is a limitation of NFS < 4.2.

Anyway, option #3 (this BZ) is not feasible anytime soon. It has a high cost to implement and would take too long to be worth it (by then, the affected customer might be able to upgrade to NFS 4.2+). So I'm closing this BZ. Please consider the workarounds above.

For some extra technical details, please consult comment #12 in full.
