Bug 1648622

Summary:

[v2v] Migration performance regression

Product:

Red Hat Enterprise Linux 7

Reporter:

Ilan Zuckerman <izuckerm>

Component:

qemu-kvm-rhev

Assignee:

Maxim Levitsky <mlevitsk>

Status:

CLOSED ERRATA

QA Contact:

Tingting Mao <timao>

Severity:

high

Docs Contact:

Priority:

high

Version:

7.6

CC:

areis, bthurber, chayang, coli, dagur, dmetzger, eblake, fdupont, izuckerm, jferlan, jinzhao, jprause, juzhang, kwolf, mlevitsk, mrezanin, mtessun, mxie, nsoffer, rjones, tzheng, virt-maint, yuhuang, zili

Target Milestone:

Keywords:

Performance, Regression, ZStream

Target Release:

7.6

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

qemu-kvm-rhev-2.12.0-37.el7

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1743322 (view as bug list)

Environment:

Last Closed:

2020-03-31 14:34:48 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

V2V

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1743322

Attachments:

Description	Flags
Tests results with upstream fix	none

Description Ilan Zuckerman 2018-11-11 06:04:05 UTC

Description of problem:
The migration speed of regression case 14a (20x100GB - Busy - iSCSI to FC, 2 hosts - VDDK -  10 Parallel) degraded by 28% (~20 minutes).

Detailed statistics and logs can be found here. Both, base line and last regression test case are marked with gray to better compare.
https://docs.google.com/spreadsheets/d/11pNoWzTNiif6ClARij4L-85W75PyYusEzfFU82l78vk/edit?usp=sharing

Both of the test cases were executed on NetApp only as this is what we currently have.


Version-Release number of selected component (if applicable):
cfme-5.9.5.2-1.el7cf.x86_64
vdsm-4.30.0-623.gitb7238dd.el7.x86_64
rhev 4.2.7-3.0.1.el7ev
nbdkit-1.2.6-1.1.lp.el7ev.x86_64
virt-v2v-1.38.2-12.22.lp.el7ev.x86_64libguestfs-1.36.10-6.16.rhvpreview.el7ev.x86_64
VMware-vix-disklib-6.5.2-6195444

RHEV build 4.2.7-7


How reproducible:
100%

Steps to Reproduce:
1.Create test plan as describes in excel sheet.

Actual results:
Migration Speed (Megabytes/Sec) degraded from 474 to 363.


Expected results:
No performance drop should be noticed.

Additional info:
Two conversation email threads are relevant for this BZ, and are attached as text files in private message.

Comment 4 Nir Soffer 2018-11-11 16:31:37 UTC

Based on what we see in the logs, the issue is behavior change in qemu-img 2.12.

In 2.10 qemu-img was writing all data and zero parts of the image. In 2.12, qemu is
first zeroing the entire device, and then write the data parts.

In v2v scale tests, we test 100G images with 76G of data. So in this case qemu-img
is writing 100G of zeros, and 76G of data, compared with 24G of zero and 76G of data
in 2.10. Writing more zeros take much more time if writing zeros is not fast.

The pipeline used by virt-v2v is:

   vmware -> nbdkit -> qemu-img convert -> nbdkit -> imageio -> image

This can be tested in a simpler way using nbdkit (> 1.6), converting local
file to nbdkit:

    $ virt-builder fedora-28

    $ pvcreate /dev/mapper/xxxyyy

    $ vgcreate vg /dev/mapper/xxxyyy
 
    $ lvcreate -n lv -L 6G vg

    $ time /path/to/nbdkit file /dev/vg/lv  \
        --run 'qemu-img convert fedora-28.img -p -n $nbd' 

This mail thread give more details:
http://lists.nongnu.org/archive/html/qemu-block/2018-11/msg00225.html

Comment 5 Nir Soffer 2018-11-11 16:33:49 UTC

Ilan, can you reproduce the issue using nbdkit as explained in comment 4? We need
to compare the time with rhel 7.5 host (qemu-img 2.10), and rhel 7.6 (qemu-img
2.12), with same storage you used for v2v import.

Comment 9 Daniel Gur 2018-11-20 06:06:52 UTC

Nir updated me that he is in contact with qemu team so they can introduce the fix for what we suspect is the root cause for the issue explained in comment 4.

Comment 10 Nir Soffer 2018-11-20 08:33:53 UTC

Kevin, should we move this bug to qemu-kvm-rhev?

Comment 11 Kevin Wolf 2018-11-20 12:58:25 UTC

I think it's not entirely clear yet what the solution will be and which components it will affect, but QEMU will certainly be one of them. So reassigning it to qemu-kvm-rhev should be okay. As this is NBD, I suppose Eric would be the right assignee.

Comment 12 Nir Soffer 2018-11-20 13:42:20 UTC

Based on comment 11, moving to qemu-kvm-rhev.

Comment 14 Nir Soffer 2018-11-20 13:55:57 UTC

Eric, can you take a look?

Comment 15 Tingting Mao 2018-11-21 13:08:44 UTC

Hi, 

I am Tingting from Virt-qe.

According to the comment 4, I compared the covert time between 'qemu-kvm-rhev-2.10' and 'qemu-kvm-rhev-2.12'. However, the result is not matched to the bug.(i.e. The convert time seems longer on 'qemu-kvm-rhev-2.10'(23m28.830s) than on 'qemu-kvm-rhev-2.12'(22m19.212s).)

Also, I noticed that after 'convert', the image file which is based on rbd is written with the *full* data in source image on 'qemu-kvm-rhev-2.10'[1], while for 'qemu-kvm-rhev-2.12', it is just written with the *actual* data in source image file[2].

Please let me know if I missed something, thanks.



Reproduced steps:

For ‘qemu-kvm-rhev-2.12’:

1. Check the version of ‘qemu-kvm-rhev’
# qemu-img --version
qemu-img version 2.12.0 (qemu-kvm-rhev-2.12.0-14.el7)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

2. Create source image
# qemu-img create -f qcow2 base.qcow2 100G
Formatting 'base.qcow2', fmt=qcow2 size=107374182400 cluster_size=65536 lazy_refcounts=off refcount_bits=16

3. Write data to it.
# cat test.sh
#!/bin/bash
i=0
while [ $i -le 76 ]
do
    qemu-io -c "write -P 1 ${i}G 1G" base.qcow2
    i=$(expr $i + 1)
done

4. Check the info of the image
# qemu-img info base.qcow2
image: base.qcow2
file format: qcow2
virtual size: 100G (107374182400 bytes)
disk size: 77G
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

5. Prepare a image based on nbd
# qemu-img create -f raw target.img 105G
Formatting 'target.img', fmt=raw size=112742891520
# qemu-nbd -f raw target.img -p 8800 -t

6. Convert source image to target
# time qemu-img convert -f qcow2 -O raw base.qcow2 nbd:localhost:8800 -p -n
    (100.00/100%)

real    22m19.212s
user    0m7.424s
sys    1m9.403s

7. Check info of target file
# qemu-img info target.img
image: target.img
file format: raw
virtual size: 105G (112742891520 bytes)
disk size: 77G -----------------------------------> [2]


For ‘qemu-kvm-rhev-2.10’:

1. Downgrade the qemu version to 2.10
# qemu-img --version
qemu-img version 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.7)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

2. Prepare a nbd image file
# qemu-img create -f raw target.img 105G
Formatting 'target.img', fmt=raw size=112742891520
# qemu-nbd -f raw target.img -p 8900 -t

3. Convert the source image to another target
# time qemu-img convert -f qcow2 -O raw base.qcow2 nbd:localhost:9000 -p -n
    (100.00/100%)

real    23m28.830s
user    0m6.626s
sys    1m6.677s

4. Check info of target file
# qemu-img info target.img
image: target.img
file format: raw
virtual size: 105G (112742891520 bytes)
disk size: 100G  ----------------------------------------> [1]

Comment 16 Richard W.M. Jones 2018-11-21 14:25:41 UTC

The regression is very specific to the backing file.  It affects RHV
because there we are using a Linux block device with very slow
support for zeroing.  If you just use a local file (supports fast,
almost free zeroing) then you wouldn't notice any difference.

Comment 17 Tingting Mao 2018-11-22 11:32:22 UTC

Thanks for Richard's info. Reproduced this issue with the nbd's backing file based on block device like below, it indeed takes much more time on 'qemu-kvm-rhev-2.12'(50m34.078s) than 'qemu-kvm-rhev-2.10'(23m57.366s).


Prepare one block disk:
# pvcreate /dev/sdb
# vgcreate lvtest /dev/sdb
# lvcreate -L 105G -n target.img lvtest


For ‘qemu-kvm-rhev-2.10’:

# qemu-img --version
qemu-img version 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.7)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

# qemu-img info base.img
image: base.img
file format: raw
virtual size: 100G (107374182400 bytes)
disk size: 76G

# time qemu-img convert base.img nbd:localhost:9000 -p -n
    (100.00/100%)

real    23m57.366s
user    0m6.496s
sys    1m10.790s


For ‘qemu-kvm-rhev-2.12.0-14.el7’:

# qemu-img --version
qemu-img version 2.12.0 (qemu-kvm-rhev-2.12.0-14.el7)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

# time qemu-img convert base.img nbd:localhost:9000 -p -n
    (100.00/100%)

real    50m34.078s
user    0m6.760s
sys    1m12.567s

Comment 18 Nir Soffer 2018-11-22 23:57:28 UTC

(In reply to Tingting Mao from comment #15)
> Also, I noticed that after 'convert', the image file which is based on nbd
> is written with the *full* data in source image on 'qemu-kvm-rhev-2.10'[1],
> while for 'qemu-kvm-rhev-2.12', it is just written with the *actual* data in
> source image file[2].

This looks like another bug in qemu-nbd on 2.10, not related to this bug.

This bug is not related to qemu-nbd - we do not use it in the import process.
It is better to check with nbdkit for reproducing and testing this bug, as
explained in comment 4.

However since you reproduce the issue very clearly I don't know if it worth the
time to repeat the test, maybe only with small image (1g) just to see if we get
the same picture.

Comment 19 Tingting Mao 2018-11-23 10:30:25 UTC

Tested this issue with 'nbdkit'. It's hard to find the time difference for 1G image, so I used 100G again. It still hints the bug listed below.


Nbdkit for 1G image convert

For ‘qemu-kvm-rhev-2.12’:

# qemu-img info source.img
image: source.img
file format: raw
virtual size: 1.0G (1073741824 bytes)
disk size: 778M

# time nbdkit file file=/dev/vgtest/target1.img -p 9000 --run 'qemu-img convert source.img -p -n nbd:localhost:9000'
    (100.00/100%)

real    0m10.596s
user    0m0.094s
sys    0m0.524s


For ‘qemu-kvm-rhev-2.10’:

# qemu-img --version
qemu-img version 2.10.0(qemu-kvm-rhev-2.10.0-21.el7_5.7)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

# time nbdkit file file=/dev/vgtest/target2.img -p 9000 --run 'qemu-img convert source.img -p -n nbd:localhost:9000'
    (100.00/100%)

real    0m10.434s
user    0m0.110s
sys    0m0.493s




Nbdkit for 100G image convert

For ‘qemu-kvm-rhev-2.10’:

# qemu-img info base.img
image: base.img
file format: raw
virtual size: 100G (107374182400 bytes)
disk size: 76G

# time nbdkit file file=/dev/vgtest/target.img -p 9000 --run 'qemu-img convert base.img -p -n nbd:localhost:9000'
    (100.00/100%)


real    24m53.579s
user    0m7.019s
sys    1m12.853s


For ‘qemu-kvm-rhev-2.12’:

# time nbdkit file file=/dev/vgtest/target.img -p 9000 --run 'qemu-img convert base.img -p -n nbd:localhost:9000'
    (6.01/100%)
    (100.00/100%)

real    42m25.667s
user    0m7.594s
sys    1m11.913s

Comment 20 Richard W.M. Jones 2018-11-23 11:08:05 UTC

Which version of nbdkit was used in the test?  Although it probably doesn't
matter here since LVs cannot do efficient zeroing anyway, it's still
worth pointing out that nbdkit 1.2 (RHEL 7.6) does not have the
optimizations that Nir made for efficient zeroing, whereas nbdkit 1.8
(RHEL 7.7) does have them.  See also:
https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=66264

Comment 21 Tingting Mao 2018-11-24 02:13:08 UTC

(In reply to Richard W.M. Jones from comment #20)
> Which version of nbdkit was used in the test?  Although it probably doesn't
> matter here since LVs cannot do efficient zeroing anyway, it's still
> worth pointing out that nbdkit 1.2 (RHEL 7.6) does not have the
> optimizations that Nir made for efficient zeroing, whereas nbdkit 1.8
> (RHEL 7.7) does have them.  See also:
> https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=66264

It's nbdkit 1.2.

# rpm -qa | grep nbdkit
nbdkit-basic-plugins-1.2.7-2.el7.x86_64
nbdkit-1.2.7-2.el7.x86_64

# uname -r
3.10.0-944.el7.x86_64

Comment 29 Richard W.M. Jones 2019-03-04 12:34:21 UTC

This is surely the sort of thing we should be able to query through the NBD protocol.
The current NBD_FLAG_ROTATIONAL flag seems most similar since it indicates a
preference/optimization for the underlying block device.

Comment 31 Nir Soffer 2019-03-04 13:15:12 UTC

(In reply to Richard W.M. Jones from comment #29)
> This is surely the sort of thing we should be able to query through the NBD
> protocol.
> The current NBD_FLAG_ROTATIONAL flag seems most similar since it indicates a
> preference/optimization for the underlying block device.

This looks like abuse of unrelated flag. I think fast write zeros on block
based storage depends on the way the storage implements write zeroes. On XtremIO
this seems to be always fast (50G/s). I guess it always deacllocate blocks without
doing any I/O. On HPar we see 100G/s if the area was not allocated or 1G/s if 
it was allocated.

So I think we can do:
- Add FAST_WRITE_ZEROES flag to NBD protocol
- Report this flag only for file based storage supporting fallocate()
  (NFS > 4.2, GlusterFS, XFS, ext4...)

Comment 32 Kevin Wolf 2019-03-04 13:39:59 UTC

We don't have the information to set that flag. You have to try it and see whether it works. Which is exactly why I suggested extending the NBD protocol with a new operation instead of a capability flag.

Comment 33 Nir Soffer 2019-03-04 15:35:00 UTC

(In reply to Kevin Wolf from comment #32)
How can you test if storage has fast write zeroes? When you use ioctl(BLKZEROOUT)
or fallocate(ZERO_RANGE) they may succeed, but they may be fast or slow.

Comment 34 Eric Blake 2019-04-17 18:49:55 UTC

Sounds like it is related (if not duplicate) of bug 1647104

Comment 35 Eric Blake 2019-04-17 19:13:20 UTC

(In reply to Kevin Wolf from comment #32)
> We don't have the information to set that flag. You have to try it and see
> whether it works. Which is exactly why I suggested extending the NBD
> protocol with a new operation instead of a capability flag.

NBD protocol addition proposed:
https://lists.debian.org/nbd/2019/03/msg00004.html

I will be implementing a proof of concept for it during qemu 4.1 phase to evaluate how much it helps.

Comment 36 Nir Soffer 2019-04-17 19:31:21 UTC

Created attachment 1556002 [details]
Tests results with upstream fix

This bug is fixed upstream by:

commit 1bd2e35c2992c679ef8b223153d47ffce76e7dc5
Merge: 905870b53c c6e3f520c8
Author: Peter Maydell <peter.maydell>
Date:   Tue Mar 26 15:52:46 2019 +0000

    Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
    
    Block layer patches:
    
    - Fix slow pre-zeroing in qemu-img convert
    - Test case for block job pausing on I/O errors
    
    # gpg: Signature made Tue 26 Mar 2019 15:28:00 GMT
    # gpg:                using RSA key 7F09B272C88F2FD6
    # gpg: Good signature from "Kevin Wolf <kwolf>" [full]
    # Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6
    
    * remotes/kevin/tags/for-upstream:
      qemu-io: Add write -n for BDRV_REQ_NO_FALLBACK
      qemu-img: Use BDRV_REQ_NO_FALLBACK for pre-zeroing
      file-posix: Support BDRV_REQ_NO_FALLBACK for zero writes
      block: Advertise BDRV_REQ_NO_FALLBACK in filter drivers
      block: Add BDRV_REQ_NO_FALLBACK
      block: Remove error messages in bdrv_make_zero()
      iotests: add 248: test resume mirror after auto pause on ENOSPC
    
    Signed-off-by: Peter Maydell <peter.maydell>

I did benchmarks few weeks ago showing that with this change qemu-img is
up to 6.8 times faster when converting to qemu-nbd, and 1.75 faster when
converting to nbdkit (using file plugin).

I compared tested with:
- qemu-img-rhev-2.12.0-21
- qemu-4.0.0-rc0
- kevin slow zero fix on top of qemu-4.0.0-rc0
- kevin slow zero fix + my unmap fix

Here is summary of the results.

### Converting empty image

version                    server       time         change
===========================================================
Kevin slow zero fix        nbdkit       0m5.660s   +2399.0%
-----------------------------------------------------------
4.0.0-rc0                  nbdkit       0m14.123s   +951.5%
-----------------------------------------------------------
Kevin slow zero fix        qemu-nbd     0m32.229s   +415.0%
-----------------------------------------------------------
4.0.0-rc0                  qemu-nbd     2m20.002s     -4.4%
-----------------------------------------------------------
qemu-img-rhev-2.12.0-21    qemu-nbd     2m14.390s     +0.0%
-----------------------------------------------------------
qemu-img-rhev-2.12.0-21    nbdkit       error             -
-----------------------------------------------------------

### Converting 75% full image

version                    server       time         change
===========================================================
Kevin slow zero fix        nbdkit       0m15.510s  +1005.5%
-----------------------------------------------------------
Kevin slow zero fix        qemu-nbd     0m22.783s   +680.0%
-----------------------------------------------------------
qemu-img-rhev-2.12.0-21    nbdkit       0m27.213s   +573.1%
-----------------------------------------------------------
4.0.0-rc0                  nbdkit       0m28.037s   +556.2%
-----------------------------------------------------------
4.0.0-rc0                  qemu-nbd     2m35.085s     +0.0%
-----------------------------------------------------------
qemu-img-rhev-2.12.0-21    qemu-nbd     2m35.954s     +0.0%
-----------------------------------------------------------

See the attachment for full details on how it was tested.

Comment 54 Maxim Levitsky 2019-08-25 09:27:04 UTC

I posted V2 of the patches, with all the patches I send previosly (unmodified) and this commit backported (no conflicts thankfully)

Best regards,
    Maxim Levitsky

Comment 58 Tingting Mao 2019-11-06 08:30:56 UTC

Verified this bug as below, the write performance both improved with qemu-nbd and nbdkit. So set this bug as verified. Thanks.


Result:
qemu-nbd: 0m31.627s ----> 0m21.189s
nbdkit: 0m18.145s ---> 0m10.222s


Tested with:
1. Nbdkit info
# rpm -qa | grep nbdkit
nbdkit-devel-1.8.0-3.el7.x86_64
nbdkit-debuginfo-1.8.0-3.el7.x86_64
nbdkit-1.8.0-3.el7.x86_64
nbdkit-plugin-python-common-1.8.0-3.el7.x86_64
nbdkit-example-plugins-1.8.0-3.el7.x86_64
nbdkit-plugin-python2-1.8.0-3.el7.x86_64
nbdkit-plugin-vddk-1.8.0-3.el7.x86_64
nbdkit-basic-plugins-1.8.0-3.el7.x86_64

2. Nvme disk info
# lsblk | tail -n 3
nvme0n1                          259:0    0 745.2G  0 disk
├─nvme0n1p1                      259:1    0   105G  0 part
└─nvme0n1p2                      259:2    0     6G  0 part

3.The source image 
# dd if=/dev/urandom of=test.img bs=5M count=1024
# qemu-img info test.img
image: test.img
file format: raw
virtual size: 5.0G (5368709120 bytes)
disk size: 5.0G



Steps:

In ‘qemu-kvm-rhev-2.12.0-38.el7’
With qemu-nbd
1. Export the target nvme block over nbd
# qemu-nbd -f raw /dev/nvme0n1p2 -p 9000 -t

2. Convert the image
# time qemu-img convert test.img nbd:localhost:9000 -n -p
    (100.00/100%)

real    0m21.189s
user    0m1.003s
sys    0m4.716s

With nbdkit:
#  time nbdkit file file=/dev/nvme0n1p2 -p 9000 --run 'qemu-img convert test.img -p -n nbd:localhost:9000'
    (100.00/100%)

real    0m10.222s
user    0m0.438s
sys    0m3.592s



In ‘qemu-kvm-rhev-2.12.0-31.el7’
With qemu-nbd
1. Export the target nvme block over nbd
# qemu-nbd -f raw /dev/nvme0n1p2 -p 9000 -t

2. Convert the image
# time qemu-img convert test.img nbd:localhost:9000 -n -p    (100.00/100%)

real    0m31.627s
user    0m2.243s
sys    0m6.681s


With nbdkit:
#  time nbdkit file file=/dev/nvme0n1p2 -p 9000 --run 'qemu-img convert test.img -p -n nbd:localhost:9000'
    (100.00/100%)

real    0m18.145s
user    0m0.465s
sys    0m3.986s

Comment 60 errata-xmlrpc 2020-03-31 14:34:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1216