Bug 1647104
Summary: qemu-img convert/map much slower when using qemu-nbd compared with direct access to image

Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.6
Status: CLOSED DEFERRED
Severity: unspecified
Priority: unspecified
Reporter: Nir Soffer <nsoffer>
Assignee: Maxim Levitsky <mlevitsk>
QA Contact: Tingting Mao <timao>
CC: adbarbos, chayang, coli, eblake, juzhang, mlevitsk, tom.ty89, virt-maint
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-07-22 20:29:46 UTC
Bug Blocks: 1726528 (view as bug list)

Attachments:
Created attachment 1502547 [details]
output of qemu-img map accessing the image directly
Created attachment 1502548 [details]
output of qemu-img map accessing the image via qemu-nbd
Sounds like it is related to (if not a duplicate of) bug 1648622.

Reproduced upstream as well, although qemu-img convert is a bit faster now:

[root@virtlab415 test]# time qemu-img map -f qcow2 --output json nbd-test-disk1 > nbd-test-disk1-map.json
real 0m0.053s
user 0m0.033s
sys 0m0.012s

[root@virtlab415 test]# time qemu-img convert -p -f qcow2 -O raw -T none nbd-test-disk1 /dev/shm/disk.img
(100.00/100%)
real 0m51.823s
user 0m0.766s
sys 0m4.564s

[root@virtlab415 test]# time qemu-img map --output json nbd:unix:/tmp/nbd.sock:exportname=export > nbd-test-disk1-map-via-nbd.json
real 0m6.389s
user 0m2.741s
sys 0m2.210s

[root@virtlab415 test]# time qemu-img convert -p -f raw -O raw nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img
(100.00/100%)
real 1m9.238s
user 0m4.262s
sys 0m8.958s

I found the root cause(s) of the bug:

1. First of all, as I already said, qemu-img iterates over the image in chunks of up to 1G, so the larger the sparse qcow2 image is, the more NBD traffic is needed.

2. QEMU's NBD server advertises the maximum transfer size (in NBD_INFO_BLOCK_SIZE) as bs->bl.max_transfer, capping it at NBD_MAX_BUFFER_SIZE (currently 32M).

3. The QEMU NBD client uses this maximum transfer size to cap the chunk size it uses when it queries block status (map) and when it converts.

4. For files, file-posix doesn't set bs->bl.max_transfer, so 32M chunks are used; but for block devices it uses the BLKSECTGET ioctl to get the maximum transfer size of the underlying block device, which happens to be 128K on the NVMe drive I use.

IMHO this is wrong, as the kernel is able to split larger requests, so it can do transfers of any size. However, I don't know the kernel block driver well enough to be 100% sure that this will always work. At least for the limited testing I did, removing the BLKSECTGET code brings the performance back to the same level as on a file, fixing this bug.
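The impact of the two caps is easy to put a number on. A back-of-the-envelope sketch (my own arithmetic, not qemu code), using the 32M and 128K figures quoted above against the 50G virtual image from this report:

```python
# Rough illustration (not qemu code): how many max_transfer-sized requests
# it takes to walk a 50G virtual image. 32M is NBD_MAX_BUFFER_SIZE; 128K is
# what BLKSECTGET reported for the NVMe drive in this report.
IMAGE_SIZE = 50 * 1024**3  # 50G virtual size

for label, max_transfer in [("file-backed (32M cap)", 32 * 1024**2),
                            ("block-device-backed (128K cap)", 128 * 1024)]:
    requests = IMAGE_SIZE // max_transfer
    print(f"{label}: {requests} requests")
```

The 128K cap needs 256 times as many round trips as the 32M cap, which is the right order of magnitude for the slowdowns measured in this bug.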
Also note that when exporting the qcow image as raw over NBD (

I hit submit too soon here and forgot to post the continuation of this comment. So:

Also note that when exporting the qcow image as raw over NBD (qemu-nbd -k /tmp/nbd.sock -v -f raw -x export --cache=none --aio=native --persistent /dev/nvme0n1p3), the qcow2 driver runs in the client, which avoids the issue altogether, since it knows which areas are allocated and which are not without reading the whole underlying device.

Also note that the code which added the max transfer size to file-posix is fairly recent; it comes from this commit:

commit 6f6071745bd0366221f5a0160ed7d18d0e38b9f7
Author: Fam Zheng <famz>
Date: Fri Jun 3 10:07:02 2016 +0800

    raw-posix: Fetch max sectors for host block device

    This is sometimes a useful value we should count in.

    Signed-off-by: Fam Zheng <famz>
    Reviewed-by: Eric Blake <eblake>
    Signed-off-by: Kevin Wolf <kwolf>

This code might not be needed, _or_ we can tune the nbd driver to allow larger-than-max-transfer requests, which would then be split by the qemu block layer, thus avoiding the network overhead. I need your advice on how to proceed here.

(In reply to Maxim Levitsky from comment #18)

If I could find time to work on NBD protocol extensions, I've got several things lined up to see what will help with performance (some may help more than others, but we may still want all of them):

- One proposal is to add NBD_CMD_FLAG_FAST_ZERO: https://lists.debian.org/nbd/2019/03/msg00004.html This would map nicely to qemu's recent BDRV_REQ_NO_FALLBACK addition, and means that attempts to pre-zero the destination image do not accidentally slow it down if zeroing is not fast.

- There's also been a proposal to advertise, at initial connection time, whether an image is known to contain all zeroes, which bypasses the need to query block status or attempt a pre-zero. But it is also more limited in scope: it's a one-shot flag (as soon as you write to the image, the flag is no longer valid).

- Another proposal is to expand NBD_OPT_GO to provide additional information about a maximum size for zero/trim requests that is larger than the maximum transfer reported in NBD_INFO_BLOCK_SIZE. With this, it would be possible for NBD to report support for zeroing ~4G of an image in one go, rather than having to do it in a loop 32M at a time.

You also mentioned running qemu-nbd -f raw and having the client use -f qcow2 (instead of our more typical qemu-nbd -f qcow2 and client -f raw); that's okay for read-only images, but until we implement resize support in the NBD protocol, it requires pre-allocation on the server side before the client connects, or you risk ENOSPC situations that you don't get with local files holding the qcow2 format (since local files can resize as needed).
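The round-trip saving from that last proposal is easy to quantify. A rough sketch (my own arithmetic; the ~4G zero limit is the hypothetical value an extended NBD_OPT_GO could advertise) for pre-zeroing a 50G destination:

```python
# Rough illustration: write-zeroes round trips needed to pre-zero a 50G
# destination. 32M is today's per-request cap; ~4G is the hypothetical
# larger zero/trim limit an extended NBD_OPT_GO could advertise.
IMAGE_SIZE = 50 * 1024**3

def zero_requests(limit):
    # One request per limit-sized slice, rounding up for any tail.
    return -(-IMAGE_SIZE // limit)

print(zero_requests(32 * 1024**2))  # today's 32M cap
print(zero_requests(4 * 1024**3))   # hypothetical ~4G cap
```

That is 1600 round trips at 32M per request versus 13 at ~4G per request.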
The problem is that, as all of the above points are not yet standardized in the NBD protocol, they need a proof-of-concept implementation (obviously in qemu, but preferably also in nbdkit and/or libnbd, to prove interoperability of the extension). At this point, the soonest any of these extensions will be finalized is qemu 4.2. If we find other tricks in qemu proper for avoiding the slow paths in the first place, those can be applied even without NBD extensions.

I checked the kernel source, and I am fairly sure that the kernel can accept O_DIRECT reads/writes of any size on a block device. First, in __blkdev_direct_IO, the incoming iovec is split into bios, which are limited to 256 pages each; then the underlying block layer splits the bios further according to the device limits. This happens in blk_queue_split, which is called from blk_mq_make_request, the 'only' make_request function remaining these days for the hardware block layer (after removal of the non-mq block layer). blk_queue_split splits the requests according to all the hardware limitations.

I also tested this: if I set the MDTS of qemu's virtual NVMe drive to 1 (which corresponds to an 8K maximum transfer size, i.e. 4K << 1) and run dd if=/dev/nvme0n1 bs=1M count=1 iflag=direct of=/dev/null in the guest, I see in qemu that the virtual NVMe drive gets lots of nice 8K-sized requests, and the dd succeeds.

So I think that the max transfer size limit on raw block devices is wrong to have. But I would be more than happy for you to prove me wrong, because there might still be some corner cases.

Best regards,
Maxim Levitsky

On top of that, for SCSI passthrough a char(!) device is used (/dev/sg*), which bypasses the whole block layer and definitely does carry the max transfer size limitation. So I think that the right solution here is to drop the max transfer size code from all devices but these.
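The MDTS experiment above can be modeled in a few lines. This is only a conceptual sketch of what the block-layer splitting achieves, not kernel code: a 1M O_DIRECT read against an 8K device limit should come out the other end as many legal 8K requests.

```python
# Conceptual model (not kernel code) of splitting one large request into
# chunks that respect a device's max transfer size, as blk_queue_split does.
def split_request(size, max_transfer):
    chunks = []
    offset = 0
    while offset < size:
        chunk = min(max_transfer, size - offset)
        chunks.append(chunk)
        offset += chunk
    return chunks

# The experiment: dd bs=1M count=1 iflag=direct against an 8K device limit.
chunks = split_request(1024 * 1024, 8 * 1024)
print(len(chunks))  # 128 requests
print(set(chunks))  # every one exactly 8K
```

Because the split is always possible, the guest-visible dd succeeds regardless of the device limit, which is the argument for not letting that limit leak into bs->bl.max_transfer.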
Some more info here: https://developer.ibm.com/articles/enhancing-qemu-virtio-scsi-with-block-limits-vpd-emulation/#sec3

Reproduced this issue in rhel8.1.0-av as below.

Tested with:
qemu-kvm-4.0.0-4.module+el8.1.0+3356+cda7f1ee
4.18.0-100.el8.x86_64

Steps:

Local:
# time qemu-img map --output=json /dev/nvme0n1p1 > test.json
real 0m0.017s
user 0m0.008s
sys 0m0.006s

# time qemu-img convert -f raw -O raw /dev/nvme0n1p1 /dev/shm/tgt.img
real 0m2.805s
user 0m0.819s
sys 0m8.154s

NBD:
# time qemu-img map --output=json nbd:unix:/home/test/my.socket:exportname=export > test_nbd.json
real 0m0.704s
user 0m0.275s
sys 0m0.243s

# time qemu-img convert -f raw -O raw nbd:unix:/home/test/my.socket:exportname=export /dev/shm/tgt.img
real 0m5.014s
user 0m0.797s
sys 0m2.602s

Additional info, source image:

1. Partition info:
nvme0n1     259:0 0 745.2G 0 disk
└─nvme0n1p1 259:1 0     5G 0 part

2.
Mount the partition and write 1.5G of data to it:
# mount /dev/nvme0n1p1 /mnt/
# cd /mnt/
# dd if=/dev/urandom of=f1 bs=1M count=512
# dd if=/dev/urandom of=f2 bs=1M count=1024
# umount /mnt

Export the image via the command line below:
# qemu-nbd -k /home/test/my.socket -v -f raw -x export --cache=none --aio=threads --persistent /dev/nvme0n1p1

Reproduced this issue in rhel7.7 as below.

Tested with:
kernel-3.10.0-1058.el7.x86_64
qemu-kvm-rhev-2.12.0-33.el7

Steps:

Local:
# time qemu-img map --output=json /dev/nvme0n1p1 > test.json
real 0m0.028s
user 0m0.014s
sys 0m0.013s

# time qemu-img convert -f raw -O raw /dev/nvme0n1p1 /dev/shm/tgt.img
real 0m2.795s
user 0m0.837s
sys 0m6.545s

NBD:
# time qemu-img map --output=json nbd:unix:/home/test/my.socket:exportname=export > test_nbd.json
real 0m0.652s
user 0m0.210s
sys 0m0.252s

# time qemu-img convert -f raw -O raw nbd:unix:/home/test/my.socket:exportname=export /dev/shm/tgt.img
real 0m3.102s
user 0m0.722s
sys 0m2.278s

Additional info: the source image and export command are the same as the ones in Comment 25.

After confirming the steps with Maxim via IRC, re-tested this issue in rhel7 and reproduced it. Thanks.

Tested with:
qemu-kvm-rhev-2.12.0-33.el7
kernel-3.10.0-1058.el7.x86_64

Steps:

1. Create the block file with qcow2 format:
# qemu-img create -f qcow2 /dev/nvme0n1p1 5G

2.
Compare the map and convert times:

Local:
# time qemu-img map ***-f qcow2*** --output=json /dev/nvme0n1p1 > test.json
real 0m0.025s
user 0m0.013s
sys 0m0.012s

# time qemu-img convert ***-f qcow2*** -O raw /dev/nvme0n1p1 /dev/shm/tgt.img
real 0m0.166s
user 0m0.016s
sys 0m0.150s

NBD:
# time qemu-img map ***-f raw*** --output=json nbd:unix:/home/test/my.socket:exportname=export > test.json
real 0m0.655s
user 0m0.200s
sys 0m0.265s

# time qemu-img convert ***-f raw*** -O raw nbd:unix:/home/test/my.socket:exportname=export /dev/shm/tgt.img
real 0m3.020s
user 0m0.708s
sys 0m2.435s

Note: in comment 26 I did not create a qcow2 image on the block device (i.e. the source file is a raw block device), and I exported the file as raw and used it as raw in the client.

Patch posted upstream: https://www.mail-archive.com/qemu-devel@nongnu.org/msg627717.html

V2 of the patch: https://lists.gnu.org/archive/html/qemu-devel/2019-07/msg01412.html

Patch is accepted upstream: https://git.qemu.org/?p=qemu.git;a=commit;h=867eccfed84f96b54f4a432c510a02c2ce03b430

You appear to have assumed that the only "SCSI passthrough" is `-device scsi-generic`, while in fact there's also `-device scsi-block` (passthrough without the sg driver). Unlike `-device scsi-hd`, getting max_sectors is necessary for it (more precisely, hw_max_sectors might be what matters, but BLKSECTGET reports max_sectors, so). I'm unsure about how qemu-nbd works, but the commit clearly wasn't the right approach to fix the original issue it addresses.
(It should, for example, ignore max_transfer if it will never matter to it, or override it in certain cases; when I glanced over this, I didn't see how it could be a file-posix problem when file-posix is reporting the right thing, regardless of whether "removing" the code helps.)

I don't think we want to "mark" `-device scsi-block` as sg either. That will probably bring even more unexpected problems, because they are literally different sets of things behind the scenes / in the kernel.

https://lists.nongnu.org/archive/html/qemu-block/2020-09/msg00281.html
https://lists.nongnu.org/archive/html/qemu-block/2020-09/msg00282.html

Maybe you want to add some condition for this: https://github.com/qemu/qemu/blob/v5.1.0/nbd/server.c#L659

Or not clamp it at all.
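The choice being debated at that server.c line can be sketched abstractly. This is a model of the two policies, not qemu's actual code; the NBD_MAX_BUFFER_SIZE and 128K figures come from earlier comments in this bug:

```python
# Model (not qemu code) of two clamping policies for the NBD server's
# advertised max transfer size: trust the device-reported max_transfer,
# or ignore it on the grounds that the block layer can split oversized
# requests anyway.
NBD_MAX_BUFFER_SIZE = 32 * 1024 * 1024

def advertised_max(device_max_transfer, trust_device_limit):
    if trust_device_limit and device_max_transfer:
        return min(device_max_transfer, NBD_MAX_BUFFER_SIZE)
    return NBD_MAX_BUFFER_SIZE

print(advertised_max(128 * 1024, True))   # clamp to device limit: 131072
print(advertised_max(128 * 1024, False))  # ignore device limit: 33554432
```

Under the first policy, clients chunk their I/O at 128K; under the second, at 32M, which is the behavior the accepted patch restores for plain block devices.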
Created attachment 1502545 [details]
output of qemu-img map accessing the image directly

Description of problem:
qemu-img convert and map are much slower when accessing an image using qemu-nbd compared with accessing an image directly.

Version-Release number of selected component (if applicable):
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64

How reproducible:
Always

Steps to Reproduce:
See below.

Actual results:
qemu-img map is 140 times slower
qemu-img convert is 2.5 times slower

Expected results:
Minor performance difference expected

RHV will use the qemu NBD server for backup purposes. It should be efficient enough to allow backup of an entire cluster during a backup window.

I did not test accessing the image via qemu's builtin NBD server yet; I assume the behavior will be similar, as the NBD server code is shared with qemu-nbd.

Below are the details of the test.

## How the test image was created

The test image is a fresh Fedora 29 server installed on a RHV thin image on block storage. After installing the image, I ran these commands to populate the image with some data, simulating a real image:

lvresize -L +30g fedora/root
yum update
mkdir data
cd data
for i in $(seq -w 100); do dd if=/dev/urandom of=$i bs=1M count=256; done

## Image info

$ ls -lh nbd-test-disk1
lrwxrwxrwx.
1 root root 78 Nov 6 14:42 nbd-test-disk1 -> /dev/8daa13f5-d6b9-479d-b637-50cd4f3207d8/61d13c96-641e-4748-bb16-2f3975f5bfc1

$ qemu-img info nbd-test-disk1
image: nbd-test-disk1
file format: qcow2
virtual size: 50G (53687091200 bytes)
disk size: 0
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

$ lvdisplay 8daa13f5-d6b9-479d-b637-50cd4f3207d8/61d13c96-641e-4748-bb16-2f3975f5bfc1
  --- Logical volume ---
  LV Path                /dev/8daa13f5-d6b9-479d-b637-50cd4f3207d8/61d13c96-641e-4748-bb16-2f3975f5bfc1
  LV Name                61d13c96-641e-4748-bb16-2f3975f5bfc1
  VG Name                8daa13f5-d6b9-479d-b637-50cd4f3207d8
  LV UUID                qfrwHx-SJHR-USw0-Bd31-aODb-keZD-Jf52Vs
  LV Write Access        read/write
  LV Creation host, time b02-h25-r620.rhev.openstack.engineering.redhat.com, 2018-11-05 13:05:59 +0000
  LV Status              available
  # open                 0
  LV Size                28.00 GiB
  Current LE             224
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:13

$ lsblk /dev/mapper/3600a098038304437415d4b6a59676d67
NAME                                                                                MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
3600a098038304437415d4b6a59676d67                                                   253:2    0  1.3T  0 mpath
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-07e4a192--cb83--40c8--8180--7beee118d987 253:12   0   50G  0 lvm
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-61d13c96--641e--4748--bb16--2f3975f5bfc1 253:13   0   28G  0 lvm
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-metadata                                 253:133  0  512M  0 lvm
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-outbox                                   253:134  0  128M  0 lvm
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-xleases                                  253:135  0    1G  0 lvm
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-leases                                   253:136  0    2G  0 lvm
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-ids                                      253:137  0  128M  0 lvm
├─8daa13f5--d6b9--479d--b637--50cd4f3207d8-inbox                                    253:138  0  128M  0 lvm
└─8daa13f5--d6b9--479d--b637--50cd4f3207d8-master                                   253:139  0    1G  0 lvm

$ multipath -ll /dev/mapper/3600a098038304437415d4b6a59676d67
3600a098038304437415d4b6a59676d67 dm-2 NETAPP ,LUN C-Mode
size=1.3T features='4 queue_if_no_path
pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 alua' wp=rw
`-+- policy='round-robin 0' prio=30 status=active
  |- 8:0:1:1 sdh 8:112 active ready running
  |- 7:0:0:1 sdb 8:16  active ready running
  |- 8:0:0:1 sdd 8:48  active ready running
  `- 7:0:1:1 sde 8:64  active ready running

## Testing qemu-img map, accessing the image directly

$ time qemu-img map -f qcow2 --output json nbd-test-disk1 > nbd-test-disk1-map.json
real 0m0.153s
user 0m0.101s
sys 0m0.021s

$ wc -l nbd-test-disk1-map.json
3541 nbd-test-disk1-map.json

## Copying using dd (conv=sparse)

This is for reference, to understand the capabilities of the storage.

$ dd if=nbd-test-disk1 of=/dev/shm/disk.qcow2 bs=8M iflag=direct conv=sparse,fsync
3584+0 records in
3584+0 records out
30064771072 bytes (30 GB) copied, 42.4591 s, 708 MB/s

$ ls -lhs /dev/shm/disk.qcow2
28G -rw-r--r--. 1 root root 28G Nov 6 14:56 /dev/shm/disk.qcow2

## qemu-img convert, accessing the image directly

The destination is in /dev/shm, since I want to test read throughput. I also tested the -W option; the results are the same.

$ time qemu-img convert -p -f qcow2 -O raw -T none nbd-test-disk1 /dev/shm/disk.img
(100.00/100%)
real 0m37.758s
user 0m6.141s
sys 0m31.317s

$ time qemu-img convert -p -f qcow2 -O raw -T none nbd-test-disk1 /dev/shm/disk.img
(100.00/100%)
real 0m29.270s
user 0m6.214s
sys 0m31.570s

$ time qemu-img convert -p -f qcow2 -O raw -T none nbd-test-disk1 /dev/shm/disk.img
(100.00/100%)
real 0m28.423s
user 0m6.373s
sys 0m31.348s

$ ls -lhs /dev/shm/disk.img
28G -rw-r--r--. 1 root root 50G Nov 6 14:50 /dev/shm/disk.img

## Exposing the image using qemu-nbd

The image was exposed using an nbd socket like this:

$ qemu-nbd -k /tmp/nbd.sock -v -t -f qcow2 nbd-test-disk1 -x export --cache=none --aio=native

I also tried --detect-zeroes=on - it should not be needed for a qcow2 source image, but I tested it to be sure. The results are the same.
## qemu-img map using qemu-nbd

$ time qemu-img map --output json nbd:unix:/tmp/nbd.sock:exportname=export > nbd-test-disk1-map-via-nbd.json
real 0m22.591s
user 0m6.186s
sys 0m9.944s

## qemu-img convert using qemu-nbd

I also tested the -W option; the results are the same.

$ time qemu-img convert -p -f raw -O raw nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img
(100.00/100%)
real 1m14.928s
user 0m21.143s
sys 0m52.625s

$ time qemu-img convert -p -f raw -O raw nbd:unix:/tmp/nbd.sock:exportname=export /dev/shm/disk.img
(100.00/100%)
real 1m16.764s
user 0m21.745s
sys 0m53.919s

$ ls -lhs /dev/shm/disk.img
28G -rw-r--r--. 1 root root 50G Nov 6 15:39 /dev/shm/disk.img
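As a sanity check on the summary figures in the description ("map is 140 times slower, convert is 2.5 times slower"), the ratios can be recomputed directly from the wall-clock times measured in this report (best direct convert run versus first NBD convert run):

```python
# Ratios recomputed from the wall-clock times measured in this report.
map_direct, map_nbd = 0.153, 22.591            # seconds
convert_direct, convert_nbd = 28.423, 74.928   # seconds

print(round(map_nbd / map_direct))             # map slowdown over NBD
print(round(convert_nbd / convert_direct, 1))  # convert slowdown over NBD
```

This gives roughly a 148x slowdown for map and 2.6x for convert, matching the rounded figures in the description.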