Bug 1647104

Summary: qemu-img convert/map much slower when using qemu-nbd compared with direct access to image
Product: Red Hat Enterprise Linux 7 Reporter: Nir Soffer <nsoffer>
Component: qemu-kvm-rhev    Assignee: Maxim Levitsky <mlevitsk>
Status: CLOSED DEFERRED QA Contact: Tingting Mao <timao>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.6CC: adbarbos, chayang, coli, eblake, juzhang, mlevitsk, tom.ty89, virt-maint
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1726528 (view as bug list) Environment:
Last Closed: 2019-07-22 20:29:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1726528    
Attachments:
Description Flags
output of qemu-img map accessing the image directly
none
output of qemu-img map accessing the image directly
none
output of qemu-img map accessing the image via qemu-nbd none

Description Nir Soffer 2018-11-06 16:48:57 UTC
Created attachment 1502545 [details]
output of qemu-img map accessing the image directly

Description of problem:

qemu-img convert and map are much slower when accessing an image using qemu-nbd 
compared with accessing an image directly.

Version-Release number of selected component (if applicable):
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64

How reproducible:
Always

Steps to Reproduce:
See below.

Actual results:
qemu-img map is 140 times slower
qemu-img convert is 2.5 times slower

Expected results:
A minor performance difference is expected

RHV will use the qemu NBD server for backup purposes. It should be efficient enough
to allow backing up an entire cluster during a backup window.

I did not test accessing the image via qemu's built-in NBD server yet; I assume that the
behavior will be similar, as the NBD server code is shared with qemu-nbd.

Details of the test are below.

## How test image was created

The test image is a fresh Fedora 29 Server installation on an RHV thin-provisioned image on block
storage.

After installing the OS, I ran these commands to populate the image with some
data, simulating a real image.

lvresize -L +30g fedora/root
yum update
mkdir data
cd data
for i in $(seq -w 100); do dd if=/dev/urandom of=$i bs=1M count=256; done

## Image info

$ ls -lh nbd-test-disk1
lrwxrwxrwx. 1 root root 78 Nov  6 14:42 nbd-test-disk1 -> /dev/8daa13f5-d6b9-479d-b637-50cd4f3207d8/61d13c96-641e-4748-bb16-2f3975f5bfc1


$ qemu-img info nbd-test-disk1 
image: nbd-test-disk1
file format: qcow2
virtual size: 50G (53687091200 bytes)
disk size: 0
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false


$ lvdisplay 8daa13f5-d6b9-479d-b637-50cd4f3207d8/61d13c96-641e-4748-bb16-2f3975f5bfc1
  --- Logical volume ---
  LV Path                /dev/8daa13f5-d6b9-479d-b637-50cd4f3207d8/61d13c96-641e-4748-bb16-2f3975f5bfc1
  LV Name                61d13c96-641e-4748-bb16-2f3975f5bfc1
  VG Name                8daa13f5-d6b9-479d-b637-50cd4f3207d8
  LV UUID                qfrwHx-SJHR-USw0-Bd31-aODb-keZD-Jf52Vs
  LV Write Access        read/write
  LV Creation host, time b02-h25-r620.rhev.openstack.engineering.redhat.com, 2018-11-05 13:05:59 +0000
  LV Status              available
  # open                 0
  LV Size                28.00 GiB
  Current LE             224
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:13


$ lsblk /dev/mapper/3600a098038304437415d4b6a59676d67
NAME                                                                                MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
3600a098038304437415d4b6a59676d67                                                   253:2    0  1.3T  0 mpath 
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-07e4a192--cb83--40c8--8180--7beee118d987 253:12   0   50G  0 lvm   
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-61d13c96--641e--4748--bb16--2f3975f5bfc1 253:13   0   28G  0 lvm   
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-metadata                                 253:133  0  512M  0 lvm   
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-outbox                                   253:134  0  128M  0 lvm   
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-xleases                                  253:135  0    1G  0 lvm   
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-leases                                   253:136  0    2G  0 lvm   
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-ids                                      253:137  0  128M  0 lvm   
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-inbox                                    253:138  0  128M  0 lvm   
└─8daa13f5--d6b9--479d--b637--50cd4f3207d8-master                                   253:139  0    1G  0 lvm   


$ multipath -ll /dev/mapper/3600a098038304437415d4b6a59676d67
3600a098038304437415d4b6a59676d67 dm-2 NETAPP  ,LUN C-Mode      
size=1.3T features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=30 status=active
  |- 8:0:1:1 sdh 8:112 active ready running
  |- 7:0:0:1 sdb 8:16  active ready running
  |- 8:0:0:1 sdd 8:48  active ready running
  `- 7:0:1:1 sde 8:64  active ready running


## Testing qemu-img map, accessing image directly

$ time qemu-img map -f qcow2 --output json nbd-test-disk1 > nbd-test-disk1-map.json 

real	0m0.153s
user	0m0.101s
sys	0m0.021s


$ wc -l nbd-test-disk1-map.json 
3541 nbd-test-disk1-map.json


## Copying using dd (conv=sparse)

This is for reference, to understand the capabilities of the storage.

$ dd if=nbd-test-disk1 of=/dev/shm/disk.qcow2 bs=8M iflag=direct conv=sparse,fsync
3584+0 records in
3584+0 records out
30064771072 bytes (30 GB) copied, 42.4591 s, 708 MB/s

$ ls -lhs /dev/shm/disk.qcow2 
28G -rw-r--r--. 1 root root 28G Nov  6 14:56 /dev/shm/disk.qcow2


## qemu-img convert, accessing image directly

The destination is in /dev/shm, since I want to test read throughput.
I also tested the -W option; the results are the same.

$ time qemu-img convert -p -f qcow2 -O raw -T none nbd-test-disk1 /dev/shm/disk.img
    (100.00/100%)

real	0m37.758s
user	0m6.141s
sys	0m31.317s


$ time qemu-img convert -p -f qcow2 -O raw -T none nbd-test-disk1 /dev/shm/disk.img
    (100.00/100%)

real	0m29.270s
user	0m6.214s
sys	0m31.570s


$ time qemu-img convert -p -f qcow2 -O raw -T none nbd-test-disk1 /dev/shm/disk.img
    (100.00/100%)

real	0m28.423s
user	0m6.373s
sys	0m31.348s


$ ls -lhs /dev/shm/disk.img 
28G -rw-r--r--. 1 root root 50G Nov  6 14:50 /dev/shm/disk.img


## Exposing the image using qemu-nbd

The image was exposed over a Unix socket using qemu-nbd like this:

$ qemu-nbd -k /tmp/nbd.sock -v -t -f qcow2 nbd-test-disk1 -x export --cache=none --aio=native

I also tried --detect-zeroes=on - it should not be needed for a qcow2 source image,
but I tested it to be sure. The results are the same.


## qemu-img map using qemu-nbd

$ time qemu-img map --output json  nbd:unix:/tmp/nbd.sock:exportname=export > nbd-test-disk1-map-via-nbd.json

real	0m22.591s
user	0m6.186s
sys	0m9.944s


## qemu-img convert using qemu-nbd

I also tested the -W option; the results are the same.


$ time qemu-img convert -p -f raw -O raw nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img
    (100.00/100%)

real	1m14.928s
user	0m21.143s
sys	0m52.625s


$ time qemu-img convert -p -f raw -O raw nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img
    (100.00/100%)

real	1m16.764s
user	0m21.745s
sys	0m53.919s


$ ls -lhs /dev/shm/disk.img 
28G -rw-r--r--. 1 root root 50G Nov  6 15:39 /dev/shm/disk.img

Comment 2 Nir Soffer 2018-11-06 16:51:06 UTC
Created attachment 1502547 [details]
output of qemu-img map accessing the image directly

Comment 3 Nir Soffer 2018-11-06 16:51:54 UTC
Created attachment 1502548 [details]
output of qemu-img map accessing the image via qemu-nbd

Comment 6 Eric Blake 2019-04-17 18:50:23 UTC
Sounds like it is related to (if not a duplicate of) bug 1648622.

Comment 10 Maxim Levitsky 2019-05-21 12:49:50 UTC
Reproduced upstream as well, although it seems that qemu-img convert is a bit faster now:

[root@virtlab415 test]# time qemu-img map -f qcow2 --output json nbd-test-disk1 > nbd-test-disk1-map.json 

real	0m0.053s
user	0m0.033s
sys	0m0.012s

[root@virtlab415 test]# time qemu-img convert -p -f qcow2 -O raw -T none nbd-test-disk1 /dev/shm/disk.img
    (100.00/100%)

real	0m51.823s
user	0m0.766s
sys	0m4.564s


[root@virtlab415 test]# time qemu-img map --output json  nbd:unix:/tmp/nbd.sock:exportname=export > nbd-test-disk1-map-via-nbd.json

real	0m6.389s
user	0m2.741s
sys	0m2.210s

[root@virtlab415 test]# time qemu-img convert -p -f raw -O raw nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img
    (100.00/100%)

real	1m9.238s
user	0m4.262s
sys	0m8.958s

Comment 17 Maxim Levitsky 2019-06-26 14:57:52 UTC
I found the root cause(s) of the bug.


1. First of all, as I already said, qemu-img iterates over the image in chunks of up to 1G, so the larger the sparse qcow2 image is, the more NBD traffic is needed.


2. QEMU's NBD server advertises the max transfer size (in NBD_INFO_BLOCK_SIZE) as bs->bl.max_transfer, capping it at NBD_MAX_BUFFER_SIZE (currently 32M).


3. The QEMU NBD client uses this max transfer size to cap the chunk size it uses when it queries for the block status (map) and for convert.


4. For files, file-posix doesn't set bs->bl.max_transfer, so 32M chunks are used, but for block devices it uses the BLKSECTGET ioctl to get the max transfer size of the underlying block device, which happens to be 128K on the NVMe drive I use.

IMHO this is wrong, as the kernel is able to split larger requests, so it can do transfers of any size. However, I don't know the kernel block driver well enough to be 100% sure that this will always work. At least for the limited testing I did, removing the BLKSECTGET code brings the performance back to the same level as for a file, fixing this bug.
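
For a rough sense of scale (a sketch, assuming a hypothetical /dev/nvme0n1 whose queue limit matches the 128K mentioned above, and that each block-status query is capped at that limit as described in point 3):

$ cat /sys/block/nvme0n1/queue/max_sectors_kb      # same per-request limit that BLKSECTGET reports, in KiB
128

$ echo $((50 * 1024 * 1024 / 128))                 # 50G image walked in 128K chunks
409600

So mapping the 50G image takes on the order of 400,000 NBD round trips at 128K, versus about 50 at 1G.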

Also note that when exporting the qcow image as raw over NBD (

Comment 18 Maxim Levitsky 2019-06-26 19:02:11 UTC
I hit submit too soon here and forgot to post the continuation of this comment.

So, also note that when exporting the qcow2 image as raw over NBD (qemu-nbd -k /tmp/nbd.sock -v -f raw -x export --cache=none --aio=native --persistent /dev/nvme0n1p3),
the qcow2 driver runs in the client, which avoids the issue altogether, since it knows, without reading the whole underlying device, which areas are allocated and which are not.
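
On the client side that simply means opening the export with the qcow2 driver, e.g. (a sketch, assuming the same socket and export name used earlier in this bug):

$ qemu-img map -f qcow2 --output json nbd:unix:/tmp/nbd.sock:exportname=export
$ qemu-img convert -p -f qcow2 -O raw nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img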

Also note that the code which added the max transfer size to file-posix is relatively recent; it comes from this commit:

commit 6f6071745bd0366221f5a0160ed7d18d0e38b9f7
Author: Fam Zheng <famz>
Date:   Fri Jun 3 10:07:02 2016 +0800

    raw-posix: Fetch max sectors for host block device
    
    This is sometimes a useful value we should count in.
    
    Signed-off-by: Fam Zheng <famz>
    Reviewed-by: Eric Blake <eblake>
    Signed-off-by: Kevin Wolf <kwolf>

This code might not be needed, _or_ we can tune the NBD driver to allow larger-than-max-transfer requests, which would then be split by the qemu block layer, thus avoiding the network overhead.
I need your advice on how to proceed here.

Comment 19 Eric Blake 2019-06-26 19:24:30 UTC
(In reply to Maxim Levitsky from comment #18)
> I hit submit too soon here and forgot to post the continuation of this
> comment.
> 
> So, Also note that when exporting the qcow image as raw over NBD (qemu-nbd
> -k /tmp/nbd.sock -v  -f raw  -x export --cache=none --aio=native
> --persistent /dev/nvme0n1p3),
> qcow2 driver runs in the client, which avoids the issue all together, since
> it know without reading the whole underlying device which areas are
> allocated and which are not.
> 
> Also note that the code which added max transfer size to the file-posix is
> somewhat recent, it is from commit 
> 
> commit 6f6071745bd0366221f5a0160ed7d18d0e38b9f7
> Author: Fam Zheng <famz>
> Date:   Fri Jun 3 10:07:02 2016 +0800
> 
>     raw-posix: Fetch max sectors for host block device
>     
>     This is sometimes a useful value we should count in.
>     
>     Signed-off-by: Fam Zheng <famz>
>     Reviewed-by: Eric Blake <eblake>
>     Signed-off-by: Kevin Wolf <kwolf>
> 
> This code might not be needed, _or_ we can tune the nbd driver to allow
> larger that max transfer requests which would be split by qemu block layer
> thus avoiding the network overhead.
> I need your advice here how to proceed on this.

If I could find time to work on NBD protocol extensions, I've got several things lined up to see what will help with performance (some may help more than others, but we may still want all of them):

- One proposal is to add NBD_CMD_FLAG_FAST_ZERO: https://lists.debian.org/nbd/2019/03/msg00004.html
This would map nicely to qemu's recent BDRV_REQ_NO_FALLBACK addition, and means that attempts to pre-zero the destination image do not accidentally slow it down if zeroing is not fast.
- There's also been a proposal to advertise whether an image is known to contain all zeroes at initial connection time, which bypasses the need to query block status or attempt a pre-zero. But it is also more limited in scope - it's a one-shot flag (as soon as you write to the image, the flag is no longer valid).
- Another proposal is about how to expand NBD_OPT_GO to provide additional information about maximum sizing for zero/trim requests which is larger than the maximum transfer reported in NBD_INFO_BLOCK_SIZE. With this, it would be possible for NBD to report support for zeroing ~4G of an image in one go, rather than having to do it in a loop 32M at a time.

You also mentioned running qemu-nbd -f raw and having the client use -f qcow2 (instead of our more typical qemu-nbd -f qcow2 and client -f raw); that's okay for read-only images, but until we implement resize support in the NBD protocol, it requires pre-allocation on the server side before the client connects, or you risk ENOSPC situations that you don't get with local files holding the qcow2 format (since local files can resize as needed).

The problem is that as all of the above points are not yet standardized in the NBD protocol, they need a proof of concept code implementation (obviously in qemu, but preferably also in nbdkit and/or libnbd to prove interoperability of the extension). At this point, the soonest any of these extensions will be finalized will be for qemu 4.2. If we find other tricks in qemu proper for avoiding the slow paths in the first place, those can be applied even without NBD extensions.

Comment 22 Maxim Levitsky 2019-06-27 15:11:38 UTC
I checked the kernel source, and basically I am pretty sure that the kernel can accept O_DIRECT reads/writes of any size on a block device.

First, in __blkdev_direct_IO the incoming iovec is split into bios, which have a limit of 256 pages each; then the underlying block layer splits the bios further according to the device limits.
This happens in blk_queue_split, which is called from blk_mq_make_request, the 'only' make_request function remaining these days for the hardware block layer (after removal of the non-mq block layer).

The blk_queue_split splits the requests according to all the hardware limitations.

I also tested that if I set the mdts of qemu's virtual NVMe drive to 1 (which corresponds to an 8K maximum transfer size, 4K << 1) and run
dd if=/dev/nvme0n1 bs=1M count=1 iflag=direct of=/dev/null
in the guest, I see in qemu that the virtual NVMe drive gets lots of nice 8K-sized requests, and the dd succeeds.
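
An equivalent check can be done on a real device without patching qemu (a sketch, assuming a hypothetical /dev/nvme0n1; max_sectors_kb can be lowered but never raised above max_hw_sectors_kb):

# echo 128 > /sys/block/nvme0n1/queue/max_sectors_kb
# dd if=/dev/nvme0n1 of=/dev/null bs=8M count=16 iflag=direct

The 8M O_DIRECT reads still succeed, because the block layer splits them into 128K requests before they reach the device.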

So I think that the max transfer size limit on raw block devices is wrong to have. But I would be more than happy for you to prove me wrong, because there might still be some corner cases.

Best regards,
     Maxim Levitsky

Comment 23 Maxim Levitsky 2019-06-27 15:33:20 UTC
And on top of that, for SCSI passthrough a char(!) device is used (/dev/sg*), which bypasses the whole block layer and indeed certainly carries the max transfer size limitation.
So I think that the right solution here is to drop the max transfer size code from all but these devices.

Comment 25 Tingting Mao 2019-07-01 05:54:31 UTC
Reproduced this issue in rhel8.1.0-av as below:

Tested with:
qemu-kvm-4.0.0-4.module+el8.1.0+3356+cda7f1ee
4.18.0-100.el8.x86_64


Steps:
=====================================================================================
Local:
# time qemu-img map --output=json /dev/nvme0n1p1 > test.json

real	0m0.017s
user	0m0.008s
sys	0m0.006s

# time qemu-img convert -f raw -O raw /dev/nvme0n1p1 /dev/shm/tgt.img 

real	0m2.805s
user	0m0.819s
sys	0m8.154s

======================================================================================
NBD:
# time qemu-img map --output=json nbd:unix:/home/test/my.socket:exportname=export > test_nbd.json 

real	0m0.704s
user	0m0.275s
sys	0m0.243s

# time qemu-img convert -f raw -O raw nbd:unix:/home/test/my.socket:exportname=export /dev/shm/tgt.img 

real	0m5.014s
user	0m0.797s
sys	0m2.602s
=======================================================================================


Additional info:
Source image info:
1. Partition info:
nvme0n1                          259:0    0 745.2G  0 disk 
└─nvme0n1p1                      259:1    0     5G  0 part 
2. Mount the partition and write 1.5G data to it
# mount /dev/nvme0n1p1 /mnt/
# cd /mnt/
# dd if=/dev/urandom of=f1 bs=1M count=512
# dd if=/dev/urandom of=f2 bs=1M count=1024
# umount /mnt

Export image via below CML:
# qemu-nbd -k /home/test/my.socket -v -f raw -x export --cache=none --aio=threads --persistent /dev/nvme0n1p1

Comment 26 Tingting Mao 2019-07-01 06:51:35 UTC
Reproduced this issue in rhel7.7 as below:

Tested with:
kernel-3.10.0-1058.el7.x86_64
qemu-kvm-rhev-2.12.0-33.el7


Steps:
================================================================================================
Local:
# time qemu-img map --output=json /dev/nvme0n1p1 > test.json

real	0m0.028s
user	0m0.014s
sys	0m0.013s

# time qemu-img convert -f raw -O raw /dev/nvme0n1p1 /dev/shm/tgt.img 

real	0m2.795s
user	0m0.837s
sys	0m6.545s
==================================================================================================
NBD:
# time qemu-img map --output=json nbd:unix:/home/test/my.socket:exportname=export > test_nbd.json 

real	0m0.652s
user	0m0.210s
sys	0m0.252s

# time qemu-img convert -f raw -O raw nbd:unix:/home/test/my.socket:exportname=export /dev/shm/tgt.img

real	0m3.102s
user	0m0.722s
sys	0m2.278s
===================================================================================================


Additional info:
The source image and export line is the same as the ones in Comment 25.

Comment 31 Tingting Mao 2019-07-03 03:20:47 UTC
After confirming the steps with Maxim via IRC, I re-tested this issue in rhel7 and reproduced it. Thanks.


Tested with:
qemu-kvm-rhev-2.12.0-33.el7
kernel-3.10.0-1058.el7.x86_64


Steps:
1. Create the block file with qcow2 format
# qemu-img create -f qcow2 /dev/nvme0n1p1 5G

2. Compare the map and convert time 
===========================================================================================================
Local:
# time qemu-img map ***-f qcow2*** --output=json /dev/nvme0n1p1> test.json

real	0m0.025s
user	0m0.013s
sys	0m0.012s

# time qemu-img convert ***-f qcow2*** -O raw /dev/nvme0n1p1 /dev/shm/tgt.img 

real	0m0.166s
user	0m0.016s
sys	0m0.150s
=============================================================================================================
NBD:
# time qemu-img map ***-f raw*** --output=json nbd:unix:/home/test/my.socket:exportname=export > test.json 

real	0m0.655s
user	0m0.200s
sys	0m0.265s

# time qemu-img convert ***-f raw*** -O raw nbd:unix:/home/test/my.socket:exportname=export /dev/shm/tgt.img

real	0m3.020s
user	0m0.708s
sys	0m2.435s
==============================================================================================================


Note:
In comment 26, I did not create a qcow2 image on the block device (i.e. the source file is a raw block device); I exported the file as raw and used it as raw in the client.

Comment 32 Maxim Levitsky 2019-07-04 12:46:08 UTC
Patch posted upstream:

https://www.mail-archive.com/qemu-devel@nongnu.org/msg627717.html

Comment 33 Maxim Levitsky 2019-07-04 12:54:36 UTC
V2 of the patch:

https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg01412.html

Comment 34 Maxim Levitsky 2019-07-16 08:26:07 UTC
Patch is accepted upstream:

https://git.qemu.org/?p=qemu.git;a=commit;h=867eccfed84f96b54f4a432c510a02c2ce03b430

Comment 36 Tom Yan 2020-09-06 10:59:33 UTC
You appear to have assumed that the only "SCSI passthrough" is `-device scsi-generic`, while in fact there's also `-device scsi-block` (passthrough without the sg driver). Unlike `-device scsi-hd`, getting max_sectors is necessary for it (more precisely, hw_max_sectors might be what matters, but BLKSECTGET reports max_sectors).

I'm unsure about how qemu-nbd works, but the commit clearly wasn't the right approach to fix the original issue it addresses. (It should, for example, ignore the max_transfer if it will never matter to it, or override it in certain cases; when I glanced over this, I didn't see how it could be a file-posix problem when file-posix is reporting the right thing, regardless of whether "removing" the code helps.)

I don't think we want to "mark" `-device scsi-block` as sg either. It would probably bring even more unexpected problems, because they are literally different sets of things behind the scenes / in the kernel.

Comment 38 Tom Yan 2020-09-06 11:05:06 UTC
Maybe you want to add some condition for this:
https://github.com/qemu/qemu/blob/v5.1.0/nbd/server.c#L659
Or not clamp it at all.