Bug 2094224 - Slow convert when using -n (no create) flag
Summary: Slow convert when using -n (no create) flag
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: CentOS Stream
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Virtualization Maintenance
QA Contact: Tingting Mao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-07 07:54 UTC by Nir Soffer
Modified: 2023-09-22 16:35 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-22 16:35:51 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker   RHEL-7387 0 None Migrated None 2023-09-22 16:35:49 UTC
Red Hat Issue Tracker RHELPLAN-124431 0 None None None 2022-06-07 08:05:52 UTC

Description Nir Soffer 2022-06-07 07:54:11 UTC
Description of problem:

qemu-img convert with "-n" is much slower than it could be, running many fdatasync
calls during the conversion.

Version-Release number of selected component (if applicable):
qemu-img-7.0.0-4.el9.x86_64

How reproducible:
Always

Steps to Reproduce:

1. Create source image

    $ qemu-img create -f raw empty.raw 8t
    Formatting 'empty.raw', fmt=raw size=8796093022208

2. Create target image

    $ qemu-img create -f qcow2 target.qcow2 8t
    Formatting 'target.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off 
    compression_type=zlib size=1099511627776 lazy_refcounts=off refcount_bits=16

3. Convert using -n (keeping the target image)

    $ time qemu-img convert -f raw -O qcow2 -t none -T none -n empty.raw target.qcow2

    real	1m26.608s
    user	0m10.918s
    sys	        0m11.469s

4. Convert without -n (recreating the target image):

    $ time qemu-img convert -f raw -O qcow2 -t none -T none empty.raw target.qcow2

    real	0m0.056s
    user	0m0.012s
    sys	        0m0.026s

Actual results:
Converting with -n is slow even when there is no data in the source image.

Expected results:
Faster conversion, paying only for data in the source image.

Additional info:

Running with strace reveals that the conversion writes a lot of data (zero clusters?)
and calls fdatasync many times during the conversion.

$ strace -f -tt -T -o convert.strace qemu-img convert -f raw -O qcow2 -t none -T none -n empty.raw target.qcow2

This is the typical pattern we see:

$ egrep 'pwrite64\(|fdatasync\(' convert.strace
...
359864 17:05:15.583796 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 458752 <unfinished ...>
359864 17:05:15.584030 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 462848 <unfinished ...>
359864 17:05:15.584211 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 466944 <unfinished ...>
359864 17:05:15.584383 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 471040 <unfinished ...>
359864 17:05:15.584548 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 475136 <unfinished ...>
359864 17:05:15.584723 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 479232 <unfinished ...>
359864 17:05:15.584891 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 483328 <unfinished ...>
359864 17:05:15.585062 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 487424 <unfinished ...>
359864 17:05:15.585225 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 491520 <unfinished ...>
359864 17:05:15.585391 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 495616 <unfinished ...>
359864 17:05:15.585553 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 499712 <unfinished ...>
359864 17:05:15.585723 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 503808 <unfinished ...>
359864 17:05:15.585890 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 507904 <unfinished ...>
359864 17:05:15.586055 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 512000 <unfinished ...>
359864 17:05:15.586219 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 516096 <unfinished ...>
359864 17:05:15.586382 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 520192 <unfinished ...>
359864 17:05:15.586544 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 524288 <unfinished ...>
359864 17:05:15.586721 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 528384 <unfinished ...>
359864 17:05:15.586885 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 532480 <unfinished ...>
359864 17:05:15.587016 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 536576 <unfinished ...>
359864 17:05:15.587185 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 540672 <unfinished ...>
359864 17:05:15.587348 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 544768 <unfinished ...>
359864 17:05:15.587510 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 548864 <unfinished ...>
359864 17:05:15.587675 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 552960 <unfinished ...>
359864 17:05:15.587857 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 557056 <unfinished ...>
359864 17:05:15.588033 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 561152 <unfinished ...>
359864 17:05:15.588200 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 565248 <unfinished ...>
359864 17:05:15.588362 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 569344 <unfinished ...>
359864 17:05:15.588529 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 573440 <unfinished ...>
359864 17:05:15.588702 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 577536 <unfinished ...>
359864 17:05:15.588866 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 581632 <unfinished ...>
359864 17:05:15.589032 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 585728 <unfinished ...>
359864 17:05:15.589164 fdatasync(8 <unfinished ...>
359864 17:05:15.591075 pwrite64(8, "\200\0\0\0\0\5\0\0\200\0\0\0\0\6\0\0\200\0\0\0\0\7\0\0\200\0\0\0\0\10\0\0"..., 512, 196608 <unfinished ...>
359864 17:05:15.591237 fdatasync(8 <unfinished ...>
359864 17:05:15.592075 pwrite64(8, "\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536, 131072 <unfinished ...>
359864 17:05:15.592263 fdatasync(8 <unfinished ...>

$ xzgrep pwrite64 convert-pwrite64-fdatasync.strace.xz | wc -l
557056

$ xzgrep fdatasync convert-pwrite64-fdatasync.strace.xz | wc -l
49153

When converting an image, we can skip fdatasync calls during the conversion
and sync once at the end.

Since we know the contents of the source during the conversion, can we allocate
clusters more efficiently (e.g. once for every 1 GiB) instead of incrementally?
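
(As a hedged illustration of up-front allocation from the command line - an
assumption about a possible mitigation, not something measured in this report -
the target's qcow2 metadata can be preallocated at create time, so a later
convert -n has fewer L2/refcount allocations to do incrementally:)

    $ qemu-img create -f qcow2 -o preallocation=metadata target.qcow2 8t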


Workaround:

Don't use -n. RHV was using it in many cases where it was not needed.

In cases where -n is required (e.g. when the target is NBD), use --target-is-zero:

    $ time qemu-img convert -f raw -O qcow2 -t none -T none -n --target-is-zero empty.raw target.qcow2 

    real	0m0.046s
    user	0m0.021s
    sys	        0m0.016s

Comment 1 Tingting Mao 2022-06-07 08:58:00 UTC
Tried in the latest RHEL 9.1; hit the issue.


Tested with:
qemu-kvm-7.0.0-5.el9
kernel-5.14.0-96.el9.x86_64


Steps:
For an empty source image:
# qemu-img create -f raw empty.raw 8T

# qemu-img create -f qcow2 target.qcow2 8T

# time qemu-img convert -f raw -O qcow2 -t none -T none -n empty.raw target.qcow2
real	0m26.299s       -----------------------------------------------------------------------------------> slower
user	0m9.062s
sys	0m6.057s

# time qemu-img convert -f raw -O qcow2 -t none -T none empty.raw target.qcow2
real	0m0.024s
user	0m0.016s
sys	0m0.009s


For a non-empty source image:
# qemu-img create -f raw empty.raw 8T

# qemu-io -c 'write -P 1 0 1.99G' empty.raw -f raw

# qemu-img create -f qcow2 target.qcow2 8T

# time qemu-img convert -f raw -O qcow2 -t none -T none -n empty.raw target.qcow2
real	1m21.624s          --------------------------------------------------------------------------------> much slower
user	0m30.717s
sys	0m34.378s

# time qemu-img convert -f raw -O qcow2 -t none -T none empty.raw target.qcow2

real	0m2.940s
user	0m0.269s
sys	0m0.478s

Comment 2 Kevin Wolf 2022-06-08 14:07:44 UTC
Nir, is there any specific scenario in practice that you're interested in improving? I imagine the use of -n without --target-is-zero is interesting mainly for converting to a block device without zeroing it first?

So looking at your data, I see several things that are possibly non-optimal:

- Writing lots of zeros. Without --target-is-zero, we have to do this to ensure that the target actually has the same data as the source because otherwise non-zero data that was previously stored on the target would stay around instead of being zeroed like it is in the source image. However, this is really supposed to use blk_co_pwrite_zeroes(), which would probably end up as fallocate() syscalls. Does this not work on your target device so that it falls back to explicit writes?
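
(A quick, hedged way to check this against the strace from the description -
assuming the same convert.strace output file - is to count fallocate calls next
to the explicit writes:)

    $ grep -c 'fallocate(' convert.strace
    $ grep -c 'pwrite64(' convert.strace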

- Lots of fdatasync() calls. I'm honestly surprised about these. With an 8 TB image, it's clear that the metadata cache won't be large enough to hold the whole image, but allocation in the target should be sequential, so we should be able to do better than this. Do we end up always evicting and writing back only one entry from the L2 and refcount block cache when it's full, so that both caches take turns in writing back very little metadata and you get an fdatasync() each time? Maybe this could be a bit cleverer. Note that allocating clusters before copying the data wouldn't change anything about this behaviour, if this is what we are seeing.

- More than one fdatasync() in the whole process. Even if we improve the cache behaviour (which would not only affect qemu-img convert, but also cluster allocation while running VMs), there is no reason to sync half-converted images because a copy is only useful when it's completed. Maybe qemu-img convert could unconditionally turn on cache.no-flush=on for the target and only switch to the actually requested cache mode before doing a final bdrv_flush() and closing the image.

Hanna, any additional thoughts on this?

Comment 3 Nir Soffer 2022-06-09 11:04:26 UTC
(In reply to Kevin Wolf from comment #2)
> Nir, is there any specific scenario in practice that you're interested in
> improving? I imagine the use of -n without --target-is-zero is interesting
> mainly for converting to a block device without zeroing it first?

RHV was using -n without --target-is-zero in 4.4.0-4.5.0 for most copies. We
changed this to use -n only for raw preallocated volumes, when we allocated the
volumes before the copy, bypassing the inefficiency and issues with NFS 3
when using -o preallocation=falloc.

So we are not affected by this issue in RHV. I think qemu-img gives the user
all the tools to do an efficient copy with --target-is-zero, so the only case left
is copying via NBD, or maybe other cases where you cannot create the image or
zero it before the copy.

> So looking at your data, I see several things that are possibly non-optimal:
> 
> - Writing lots of zeros. Without --target-is-zero, we have to do this to
> ensure that the target actually has the same data as the source because
> otherwise non-zero data that was previously stored on the target would stay
> around instead of being zeroed like it is in the source image. However, this
> is really supposed to use blk_co_pwrite_zeroes(), which would probably end
> up as fallocate() syscalls. Does this not work on your target device so that
> it falls back to explicit writes?

I tried this on a local XFS file system, but I think these writes are not all
zeroes:

    359864 17:05:15.586382 pwrite64(8, "\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\1"..., 4096, 520192

Some of them may be zeroes (I did not check):

    359864 17:05:15.586544 pwrite64(8, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 524288
 
> - More than one fdatasync() in the whole process. Even if we improve the
> cache behaviour (which would not only affect qemu-img convert, but also
> cluster allocation while running VMs), there is no reason to sync
> half-converted images because a copy is only useful when it's completed.
> Maybe qemu-img convert could unconditionally turn on cache.no-flush=on for
> the target and only switch to the actually requested cache mode before doing
> a final bdrv_flush() and closing the image.

This sounds like the right approach. I did not have time to test how it affects
the timing. Is it possible to set cache.no-flush=on using a json: filename?

I'm attaching an strace log for the same convert using a 100g image. In this log we see:

$ grep fdatasync convert.log | wc -l
1202

$ grep 'fdatasync resumed' convert.log | wc -l
601

The average wait for fdatasync was:

$ awk '/fdatasync resumed/ {gsub(/<|>/, "", $8); sum+=$8; count+=1} END {print sum/count}' convert.log
0.00119834

So for the 49153 fdatasync calls in the 8 TiB image case this could add up to
58.9 seconds, which is 68% of the convert time.
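
(For reference, the extrapolation from the numbers above - 49153 calls,
0.00119834 s average wait, 1m26.608s total convert time - works out as:)

    $ awk 'BEGIN { sync = 49153 * 0.00119834; total = 86.608;
                   printf "%.1f s, %.0f%% of %.1f s\n", sync, 100 * sync / total, total }'
    58.9 s, 68% of 86.6 s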

Comment 4 Nir Soffer 2022-06-09 11:05:39 UTC
Created attachment 1888313 [details]
convert log for 100g image showing time taken for pwrite64/fdatasync

Comment 5 Kevin Wolf 2022-06-09 12:18:38 UTC
(In reply to Nir Soffer from comment #3)
> So we are not affected by this issue in RHV. I think qemu-img gives the user
> all the tools to do efficient copy with --target-is-zero, so the only case is
> copying via NBD, or maybe other cases when you cannot create the image or
> zero it before the copy.

Ok, I see. So this isn't necessarily the scenario that directly affects users, but it might be a case that happens to reproduce some inefficiencies that we should have a closer look at.

> I tried this on local XFS file system, but I think these writes are not
> completely zeroes

Yes, I think you're right that they won't be all zero in the common case. What I missed at first is that what we seem to write out here is metadata from the cache, which shouldn't be all zero after being modified.

One thing I just noticed in the pattern you quoted in the description is that we get lots of sequential 4k writes. Maybe l2-cache-entry-size=65536 would already improve things a bit. But this looks like an inefficiency in the cache implementation. It should just do a single I/O request for multiple adjacent cache entries.
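
(Untested sketch of how one might try l2-cache-entry-size on the convert target
without patching qemu-img - assuming --target-image-opts, which requires -n, and
that passing the cache mode inline via cache.direct behaves as expected:)

    $ qemu-img convert -f raw -T none -n --target-image-opts empty.raw \
        'driver=qcow2,file.driver=file,file.filename=target.qcow2,cache.direct=on,l2-cache-entry-size=65536'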

> This sounds like the right approach. I did not have time to test how it
> affects timing. Is this possible to set cache.no-flush=on using json: filename?

Yes, this should work, and I seem to remember that you can either use nested JSON objects or just 'cache.no-flush' as a key. Though when we implement it in qemu-img, I would expect that we wouldn't literally change the options, but directly set the corresponding flag in the BlockDriverState.
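
(As an untested illustration of the json: form - whether the explicit key
combines cleanly with -t none is an assumption:)

    $ qemu-img convert -f raw -O qcow2 -t none -T none -n empty.raw \
        'json:{"file.driver": "file", "file.filename": "target.qcow2", "cache.no-flush": true}'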

The more I look at this, however, the more I feel that cache.no-flush=on would only hide the symptoms while there must be a more fundamental problem somewhere (this doesn't mean that cache.no-flush=on is a bad idea, just that it's probably not enough). We're not supposed to even try to flush the image that much, especially in the case you describe next:

> I'm attaching strace for same convert using 100g image. In this log we see:
> 
> $ grep fdatasync convert.log | wc -l
> 1202
> 
> $ grep 'fdatasync resumed' convert.log | wc -l
> 601

The really interesting part for me here is that you still see the problem for a 100 GB image. In this case, all of the metadata should fit in the cache and we shouldn't get the allocation vs. cache eviction pattern that I suspected in comment 2. So I think we need to look at what even caused qcow2 to flush the image.
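
(A possible way to find the remaining flush triggers - a debugging sketch only,
assuming a qemu-img build with debug symbols - is to break on bdrv_co_flush and
collect backtraces:)

    $ gdb -q --args qemu-img convert -f raw -O qcow2 -t none -T none -n empty.raw target.qcow2
    (gdb) break bdrv_co_flush
    (gdb) commands
    > silent
    > backtrace 8
    > continue
    > end
    (gdb) run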

Comment 6 RHEL Program Management 2023-09-22 16:35:27 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 7 RHEL Program Management 2023-09-22 16:35:51 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated. Be sure to add yourself to the Jira issue's "Watchers" field to continue receiving updates, and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

