Bug 1847192 - qemu-img convert uses possibly slow pre-zeroing on block storage
Summary: qemu-img convert uses possibly slow pre-zeroing on block storage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: rc
Target Release: 8.3
Assignee: Kevin Wolf
QA Contact: Tingting Mao
URL:
Whiteboard:
Depends On:
Blocks: 1855250 1861682
 
Reported: 2020-06-15 21:07 UTC by Nir Soffer
Modified: 2021-10-27 06:55 UTC
CC: 16 users

Fixed In Version: qemu-kvm-5.1.0-2.module+el8.3.0+7652+b30e6901
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1855250 1861682
Environment:
Last Closed: 2020-11-17 17:49:16 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Nir Soffer 2020-06-15 21:07:02 UTC
Description of problem:

qemu-img has an optimization to zero out the entire destination image before
copying data. This optimization is useful when the destination storage
supports very fast zeroing (e.g. a file), but with block storage the zeroing
throughput is not predictable. It can be as slow as writing actual zeroes
(e.g. LIO) or 100 times faster (high-end storage).

When converting mostly empty images, pre-zeroing does not hurt, since zeroing
is usually faster than writing zeroes. But when converting mostly full
images, pre-zeroing slows down the copy, in my tests by up to 45%, and it can
be more for fully preallocated images.

Here are a few tests showing the issue.

## Test 1 - VM-based test environment and poor laptop storage

$ ./qemu-img info test.img
image: test.img
file format: raw
virtual size: 10 GiB (10737418240 bytes)
disk size: 8 GiB

$ time ./qemu-img convert -f raw -O raw -t none -T none -W test.img /dev/test/lv1

With qemu-img master (9e3903136d9acde2fb2dd9e967ba928050a6cb4a)

real    1m20.483s
user    0m0.490s
sys     0m0.739s

With the patch [1] disabling pre-zero for block storage:

real    0m55.831s
user    0m0.610s
sys     0m0.956s


## Test 2 - real server and storage

Testing this LUN:

# multipath -ll
3600a098038304437415d4b6a59684a52 dm-3 NETAPP,LUN C-Mode
size=5.0T features='3 queue_if_no_path pg_init_retries 50'
hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 18:0:0:0 sdb     8:16  active ready running
| `- 19:0:0:0 sdc     8:32  active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 20:0:0:0 sdd     8:48  active ready running
  `- 21:0:0:0 sde     8:64  active ready running


The destination is a 100g logical volume on this LUN:

# qemu-img info test-lv
image: test-lv
file format: raw
virtual size: 100 GiB (107374182400 bytes)
disk size: 0 B


The source image is a 100g image with 48g of data:

# qemu-img info fedora-31-100g-50p.raw
image: fedora-31-100g-50p.raw
file format: raw
virtual size: 100 GiB (107374182400 bytes)
disk size: 48.4 GiB


We can zero at 2.3 GiB/s:

# time blkdiscard -z test-lv

real 0m43.902s
user 0m0.002s
sys 0m0.130s

(I should really test with fallocate instead of blkdiscard, as sketched after
the iostat output below, but the results look the same.)

# iostat -xdm dm-3 5

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s   %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

dm-3            20.80  301.40      0.98   2323.31     0.00     0.00    0.00   0.00   26.56  854.50 257.94    48.23  7893.41   0.73  23.58

dm-3            15.20  297.20      0.80   2321.67     0.00     0.00    0.00   0.00   26.43  836.06 248.72    53.80  7999.30   0.78  24.22
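
A rough fallocate-based equivalent of the blkdiscard test above (a sketch: it
assumes util-linux fallocate(1) with --punch-hole support, and that on block
devices it takes the same kernel zeroing path as qemu-img's pre-zero step,
per the analysis under Additional info):

# time fallocate --punch-hole --offset 0 --length 107374182400 test-lv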


We can write at 445 MB/s:

# dd if=/dev/zero bs=2M count=51200 of=test-lv oflag=direct conv=fsync
107374182400 bytes (107 GB, 100 GiB) copied, 241.257 s, 445 MB/s

# iostat -xdm dm-3 5

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s   %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

dm-3             6.60 6910.00      0.39    431.85     0.00     0.00    0.00   0.00    2.48    2.70  15.19    60.73    64.00   0.14  98.84

dm-3            40.80 6682.60      1.59    417.61     0.00     0.00    0.00   0.00    1.71    2.73  14.92    40.00    63.99   0.15  97.60

dm-3             6.60 6887.40      0.39    430.46     0.00     0.00    0.00   0.00    2.15    2.66  14.92    60.73    64.00   0.14  98.22


Testing latest qemu-img:

# rpm -q qemu-img
qemu-img-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64

# time qemu-img convert -p -f raw -O raw -t none -W fedora-31-100g-50p.raw test-lv
    (100.00/100%)

real 2m2.337s
user 0m2.708s
sys 0m17.326s

# iostat -xdm dm-3 5

Pre-zero phase:

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s   %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

dm-3            24.00  265.40      1.00   2123.20     0.00     0.00    0.00   0.00   36.81  543.52 144.99    42.48  8192.00   0.70  20.14

dm-3             9.60  283.60      0.59   2265.60     0.00     0.00    0.00   0.00   35.42  576.80 163.78    62.50  8180.44   0.70  20.58

dm-3            24.00  272.00      1.00   2176.00     0.00     0.00    0.00   0.00   22.89  512.40 139.77    42.48  8192.00   0.67  19.90

Copy phase:

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s    %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

dm-3            27.20 10671.20      1.19    655.84     0.00     0.00    0.00   0.00    2.70   10.99 111.98    44.83    62.93   0.09  96.74

dm-3             6.40 11537.00      0.39    712.33     0.00     0.00    0.00   0.00    3.00   11.90 131.52    62.50    63.23   0.08  97.82

dm-3            27.20 12400.20      1.19    765.47     0.00     0.00    0.00   0.00    3.60   11.16 132.31    44.83    63.21   0.08  95.50

dm-3             9.60 11312.60      0.59    698.20     0.00     0.20    0.00   0.00    3.73   11.69 126.64    63.00    63.20   0.09  97.70


Testing latest qemu-img with the patch [1] disabling pre-zero for block storage:

# rpm -q qemu-img
qemu-img-4.2.0-25.module+el8.2.1+6815+1c792dc8.nsoffer202006140516.x86_64

# time qemu-img convert -p -f raw -O raw -t none -W fedora-31-100g-50p.raw test-lv
    (100.00/100%)

real 1m42.083s
user 0m3.007s
sys 0m18.735s

# iostat -xdm dm-3 5

Device            r/s     w/s     rMB/s     wMB/s   rrqm/s   wrqm/s   %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

dm-3             6.60 7919.60      0.39   1136.67     0.00     0.00   0.00   0.00   14.70   15.32 117.43    60.73   146.97   0.10  77.84

dm-3            27.00 9065.00      1.19    571.38     0.00     0.20   0.00   0.00    2.52   14.64 128.21    45.13    64.54   0.11  97.46

dm-3             6.80 9467.40      0.40    814.75     0.00     0.00   0.00   0.00    2.74   12.15 110.25    60.82    88.12   0.10  90.46

dm-3            29.00 7713.20      1.32    996.48     0.00     0.40   0.00   0.01    5.40   14.48 107.98    46.60   132.29   0.11  83.76

dm-3            11.60 9661.60      0.70    703.54     0.00     0.40   0.00   0.00    2.26   11.22 103.56    61.72    74.57   0.10  97.98

dm-3            23.80 9639.20      0.99    696.82     0.00     0.00   0.00   0.00    1.98   11.54 106.49    42.80    74.03   0.10  93.68

dm-3            10.00 7184.60      0.60   1147.56     0.00     0.00   0.00   0.00   12.84   15.32 106.58    61.36   163.56   0.09  68.30

dm-3            35.00 6771.40      1.69   1293.37     0.00     0.00   0.00   0.00   17.44   18.06 119.48    49.58   195.59   0.10  66.46


Version-Release number of selected component (if applicable):
qemu-img-4.2.0-22.module+el8.2.1+6758+cb8d64c2.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Convert an image to raw format on a block device

Actual results:
The operation is slower because of the pre-zero step.

Expected results:
qemu-img skips the pre-zero step for block devices.

Additional info:

Qemu already disables pre-zeroing for block devices in
handle_aiocb_write_zeroes_block(), but since:

commit a3d6ae2299eaab1bced05551d0a0abfbcd9d08d0
Author: Nir Soffer <nirsof>
Date:   Sun Mar 24 02:20:12 2019 +0200

    qemu-img: Enable BDRV_REQ_MAY_UNMAP in convert

Qemu has been using handle_aiocb_write_zeroes_unmap(), assuming that
fallocate() is fast because it uses BLKDEV_ZERO_NOFALLBACK in the
kernel, so the kernel avoids slow manual zeroing.

On RHEL 7 this worked fine, since fallocate() did not support block devices
with kernel 3.10. The call would always fail, and qemu would fall back to
handle_aiocb_write_zeroes(), which calls handle_aiocb_write_zeroes_block(),
which returns -ENOTSUP when called with QEMU_AIO_NO_FALLBACK.

On RHEL 8.2 with kernel 4.18, fallocate() does work for block devices,
exposing the issue.

Even when BLKDEV_ZERO_NOFALLBACK is used and zeroing is faster than manually
writing zeroes, it is not fast enough to make pre-zeroing an optimization.
We have no way to predict the zeroing throughput of block devices.
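
To confirm which calls the pre-zero phase issues on a given system, the
conversion can be traced (a diagnostic sketch; strace availability and the
Test 1 paths are assumptions):

# strace -f -e trace=fallocate,ioctl -o convert.trace \
      qemu-img convert -f raw -O raw -t none -T none -W test.img /dev/test/lv1
# grep -E 'fallocate|BLKZEROUT' convert.trace | head

On a kernel where fallocate() supports block devices, the trace should show
FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE calls covering the whole device
before the data writes start.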

Marked as high severity since this may cause performance issues for RHV
customers when copying huge disks.

Comment 1 Nir Soffer 2020-06-15 21:08:49 UTC
Posted fix upstream:
https://lists.nongnu.org/archive/html/qemu-block/2020-06/msg00683.html

Comment 3 Xueqiang Wei 2020-06-16 16:35:43 UTC
I found a similar bug: Bug 1722734 - 'qemu-img convert' images over RBD is very very slow.

Could you please help check if they are the same issue? I'm not sure what the backend of your block device is. Many thanks.

Comment 4 Nir Soffer 2020-06-16 17:29:22 UTC
(In reply to Xueqiang Wei from comment #3)
> I found a similar bug: Bug 1722734 - 'qemu-img convert' images over RBD is
> very very slow.
> 
> Could you please help check if they are the same issue? I'm not sure what
> the backend of your block device is. Many thanks.

The backend in this case is iSCSI, not related to RBD.

The RBD issue looks very different (5 seconds with a file vs 6 minutes with
RBD), so I don't think it can be related to pre-zeroing.

Comment 6 Richard W.M. Jones 2020-06-17 11:14:25 UTC
For the case (which might not apply here) where you know that the target
device already contains zeroes, qemu-img convert recently gained a
--target-is-zero option.

Comment 7 Nir Soffer 2020-06-17 11:36:36 UTC
(In reply to Richard W.M. Jones from comment #6)
> For the case (which might not apply here) where you know that the target
> device already contains zeroes, qemu-img convert recently gained a
> --target-is-zero option.

With block storage we don't have a way to know whether a new logical volume
is zeroed. There is no kernel interface reporting that, since there is no
way for the storage to report such a thing.

The only case where we know the target is zeroed is a new file on file-based
storage. In that case zeroing is usually extremely fast (e.g. 500 GiB/s),
but qemu does not have to zero anything, because a sparse file is already
zeroed without extra work.

The only case where zeroing is slow is NFS < 4.2 with a preallocated volume.
In that case we have already zeroed the image when creating it using:

    qemu-img create -f raw -o preallocation=falloc

If we create this volume in a copy-volume flow, we then run:

    qemu-img convert ... -o preallocation=falloc /path/to/src-volume /path/to/dst-volume

So in this case --target-is-zero would be handy, to avoid the cost of
zeroing the image twice.

But our current code avoids the double zeroing by creating a sparse file in
the first step and letting qemu-img convert allocate it.

I think we can simplify this flow using --target-is-zero and maybe get
slightly better performance, but I'm not sure the difference would be
noticeable.
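
A sketch of the simplified flow (paths and size are illustrative; -n is
required because --target-is-zero only applies to an existing target):

    qemu-img create -f raw -o preallocation=falloc /path/to/dst-volume 100G
    qemu-img convert -p -f raw -O raw -t none -n --target-is-zero /path/to/src-volume /path/to/dst-volume

Since the falloc-preallocated file reads back as zeroes, convert can skip
the zeroing pass entirely.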

Comment 10 Xueqiang Wei 2020-06-22 18:30:56 UTC
Reproduced it on the 8.3.0 fast train.

Versions:
kernel-4.18.0-209.el8.x86_64
qemu-kvm-5.0.0-0.module+el8.3.0+6620+5d5e1420


1. Create a 2G image
# qemu-img create -f raw test.img 2G
Formatting 'test.img', fmt=raw size=2147483648

2. Convert it to full allocation
# qemu-img convert -p -f raw -O raw -S 0 test.img test.img 
    (100.00/100%)
# qemu-img info test.img 
image: test.img
file format: raw
virtual size: 2 GiB (2147483648 bytes)
disk size: 2 GiB

3. Create an LV on the block device
# pvcreate /dev/mapper/mpatha
# vgcreate vgtest /dev/mapper/mpatha
# lvcreate -l 100%FREE -n lvtest vgtest
# lvcreate -L 20G -n lvtest vgtest

4. Convert the image to the block device
# time qemu-img convert -f raw -O raw -t none -T none -W /home/bug/test.img /dev/vgtest/lvtest -p
    (100.00/100%)

real	15m8.734s
user	0m0.558s
sys	0m0.369s



Tested with --target-is-zero; it gets much faster.

# time qemu-img convert -f raw -O raw -t none -T none -W /home/bug/test.img /dev/vgtest/lvtest -p --target-is-zero -n
    (100.00/100%)

real	0m12.757s
user	0m0.558s
sys	0m0.384s

Comment 11 Kevin Wolf 2020-06-23 09:06:06 UTC
(In reply to Xueqiang Wei from comment #10)
> Tested with --target-is-zero, it get faster.

Just to make sure, this is not the expected time for the improved version. --target-is-zero gives additional guarantees which mean that qemu-img has less work to do. Without it, qemu-img has to zero out blocks without data because otherwise the result could be corrupted. We'll never get this as fast as --target-is-zero.

Comment 12 Xueqiang Wei 2020-06-23 16:38:59 UTC
(In reply to Kevin Wolf from comment #11)
> (In reply to Xueqiang Wei from comment #10)
> > Tested with --target-is-zero, it get faster.
> 
> Just to make sure, this is not the expected time for the improved version.
> --target-is-zero gives additional guarantees which mean that qemu-img has
> less work to do. Without it, qemu-img has to zero out blocks without data
> because otherwise the result could be corrupted. We'll never get this as
> fast as --target-is-zero.


Kevin,

Thanks for your explanation. I have tested with the improved version (8.2.1 scratch build); the result is in comment 9.
I just wanted to verify the parameter "--target-is-zero"; it's really fast. Thanks.

Comment 13 Kevin Wolf 2020-07-08 13:37:19 UTC
This is fixed in qemu.git master as of commit edafc70c0c, which will be contained in 5.1.0-rc0.

Comment 14 Nir Soffer 2020-07-08 13:54:11 UTC
(In reply to Kevin Wolf from comment #13)
Thanks, Kevin. Can we backport it to 8.2.1? (See comment 8.)

Comment 15 Kevin Wolf 2020-07-08 14:08:18 UTC
At this point, this means requesting a z-stream fix. The patch is simple, so if you explain why it's important for RHV to have this in 8.2.1 and request the z-stream flag, I assume we should be able to get it approved.

Comment 16 Nir Soffer 2020-07-08 14:25:17 UTC
(In reply to Kevin Wolf from comment #15)
From the RHV point of view, this is a performance regression compared to RHEL 7.

In RHEL 7, the fallocate(PUNCH_HOLE) used in the pre-zero step failed, and
qemu-img fell back to ioctl(BLKZEROUT), which was disabled when using
NOFALLBACK, so the entire pre-zero step was skipped.

In RHEL 8, fallocate(PUNCH_HOLE) works for block devices - slowly - so the
pre-zero step became a performance regression.

Since the difference can be significant (up to 45% in my tests), I think this
is worth a backport.
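
A quick way to check which behavior a given kernel has (an illustrative
probe, not from the bug report; note that it zeroes the first 4096 bytes of
the device):

# fallocate --punch-hole --offset 0 --length 4096 /dev/vgtest/lvtest

On RHEL 7 this fails, matching the skipped pre-zero step; on RHEL 8 it
succeeds, so the pre-zero step runs.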

Comment 17 Xueqiang Wei 2020-07-09 06:45:02 UTC
(In reply to Nir Soffer from comment #16)
> (In reply to Kevin Wolf from comment #15)
> From the RHV point of view, this is a performance regression compared to
> RHEL 7.
> 
> In RHEL 7, the fallocate(PUNCH_HOLE) used in the pre-zero step failed, and
> qemu-img fell back to ioctl(BLKZEROUT), which was disabled when using
> NOFALLBACK, so the entire pre-zero step was skipped.
> 
> In RHEL 8, fallocate(PUNCH_HOLE) works for block devices - slowly - so the
> pre-zero step became a performance regression.
> 
> Since the difference can be significant (up to 45% in my tests), I think
> this is worth a backport.


Hi Nir,

You mean backport the fix to 8.2.1.z, right?


Hi Kevin, Nir,

This bug is for the 8.3.0 fast train, and I also hit it on the 8.3.0 slow train. Do we need to clone it for the slow train?


Versions:
kernel-4.18.0-222.el8.x86_64
qemu-kvm-4.2.0-29.module+el8.3.0+7212+401047e6

1. Create a 2G image
# qemu-img create -f raw test.img 2G
Formatting 'test.img', fmt=raw size=2147483648

2. Convert it to full allocation
# qemu-img convert -p -f raw -O raw -S 0 test.img test.img 
    (100.00/100%)
# qemu-img info test.img 
image: test.img
file format: raw
virtual size: 2 GiB (2147483648 bytes)
disk size: 2 GiB

3. Create an LV on the block device
# pvcreate /dev/mapper/mpatha
# vgcreate vgtest /dev/mapper/mpatha
# lvcreate -L 20G -n lvtest vgtest

4. Convert the image to the block device
# time qemu-img convert -f raw -O raw -t none -T none -W /home/bug/test.img /dev/vgtest/lvtest -p
    (100.00/100%)

real	14m17.389s
user	0m0.570s
sys	0m0.984s

Comment 18 Kevin Wolf 2020-07-09 08:41:23 UTC
(In reply to Xueqiang Wei from comment #17)
> This bug is for the 8.3.0 fast train, and I also hit it on the 8.3.0 slow
> train. Do we need to clone it for the slow train?

I think it's not as important on the slow train. We can clone it, but it would probably be for 8.3.1.

Comment 19 Xueqiang Wei 2020-07-09 10:54:34 UTC
(In reply to Kevin Wolf from comment #18)
> (In reply to Xueqiang Wei from comment #17)
> > This bug is for the 8.3.0 fast train, and I also hit it on the 8.3.0 slow
> > train. Do we need to clone it for the slow train?
> 
> I think it's not as important on the slow train. We can clone it, but it
> would probably be for 8.3.1.


Kevin,

Thanks a lot. I will clone it to track the issue for the slow train.

Comment 20 Nir Soffer 2020-07-13 15:11:52 UTC
I think Kevin already answered.

Comment 27 Xueqiang Wei 2020-08-13 14:28:48 UTC
Tested with qemu-kvm-5.1.0-2.module+el8.3.0+7652+b30e6901 and did not hit this issue, so setting the status to VERIFIED.

Versions:
kernel-4.18.0-227.el8.x86_64
qemu-kvm-5.1.0-2.module+el8.3.0+7652+b30e6901


1. Create a 2G image
# qemu-img create -f raw test.img 2G
Formatting 'test.img', fmt=raw size=2147483648

2. Convert it to full allocation
# qemu-img convert -p -f raw -O raw -S 0 test.img test.img 
    (100.00/100%)
# qemu-img info test.img 
image: test.img
file format: raw
virtual size: 2 GiB (2147483648 bytes)
disk size: 2 GiB

3. Create an LV on the block device
# pvcreate /dev/mapper/mpatha
# vgcreate vgtest /dev/mapper/mpatha
# lvcreate -L 20G -n lvtest vgtest

4. Convert the image to the block device
# time qemu-img convert -f raw -O raw -t none -T none -W /home/test.img /dev/vgtest/lvtest -p
    (100.00/100%)

real	0m17.255s
user	0m0.235s
sys	0m0.633s

Comment 30 errata-xmlrpc 2020-11-17 17:49:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:8.3 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5137

