Performance benchmarking shows that enabling multi-queue virtio-blk and virtio-scsi improves performance. There are several reasons:

1. Request completion IRQs are received on the vCPU that submitted the I/O. This avoids waking up the submitter vCPU with an interprocessor interrupt on each completion.
2. The mq-deadline I/O scheduler is the default for single-queue blk-mq devices, whereas the default I/O scheduler for multi-queue blk-mq devices is "none". The latency overhead of mq-deadline affects workloads that are latency-sensitive.
3. Request completion is performed in softirq context for single-queue blk-mq devices, whereas HIPRI polling I/O is completed in the context of the submitting task for multi-queue blk-mq devices.

I have measured a 25% performance improvement for 4 KB random read, iodepth=1, hipri=1 on NVMe just by enabling multi-queue. QEMU should enable multi-queue by default in new machine types.
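For reference, the 4 KB random read, iodepth=1, hipri=1 measurement above maps onto a fio job roughly like the following sketch. The device path and the pvsync2 engine choice are my assumptions (any fio engine supporting hipri polling would do); the original fio invocation is not shown in this report.

```ini
; 4 KB random reads, queue depth 1, with HIPRI polling (sketch, not the
; exact job behind the 25% figure; /dev/vdb is a placeholder device)
[global]
filename=/dev/vdb
direct=1
ioengine=pvsync2
hipri=1

[randread-4k-qd1]
rw=randread
bs=4k
iodepth=1
runtime=60
time_based=1
```

With multi-queue enabled, the polling completion path described in point 3 can complete this I/O in the submitting task's context.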
Please confirm the usage:

1. For virtio-scsi:
-device virtio-scsi-pci,id=virtio_scsi_pci0,num_queues=4,bus=pcie.0-root-port-4,addr=0x0 \
For virtio-blk:
-device virtio-blk-pci,num-queues=
(note the underscore in num_queues for virtio-scsi vs the hyphen in num-queues for virtio-blk)
2. The number should equal the number of vCPUs.

Am I right?
Hi qing.wang, Nothing changes when the number of queues is explicitly set with -device virtio-scsi-pci,num_queues= or -device virtio-blk-pci,num-queues=. This only affects the default behavior when a guest is started without these parameters. I am proposing setting the default number of queues to the number of vCPUs, but this is under discussion upstream and the final behavior may be different.
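As a rough sketch of the proposed default behavior (QEMU's real implementation is C code and its exact queue cap may differ; the cap of 128 below is purely an illustrative placeholder, not QEMU's actual limit):

```shell
# Hypothetical model of "num-queues defaults to the vCPU count, up to a cap".
default_num_queues() {
    vcpus=$1
    cap=${2:-128}   # placeholder cap, not QEMU's actual limit
    if [ "$vcpus" -lt "$cap" ]; then
        echo "$vcpus"
    else
        echo "$cap"
    fi
}

default_num_queues 8      # an -smp 8 guest would get 8 queues
```

An explicit num-queues=/num_queues= value on the command line would bypass this default entirely, as described above.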
I find it does not work for Windows guests. Please help me check the test steps:

1. Create the images:

qemu-img create -f qcow2 /home/images/data1.qcow2 11G
qemu-img create -f qcow2 /home/images/data2.qcow2 12G
qemu-img create -f qcow2 /home/images/data3.qcow2 13G
qemu-img create -f qcow2 /home/images/data4.qcow2 14G

2. Boot the VM:

/usr/libexec/qemu-kvm \
-name copy_read_vm1 \
-machine q35 \
-nodefaults \
-vga qxl \
-m 2048 \
-smp 8 \
-device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x2,chassis=1 \
-device pcie-root-port,id=pcie.0-root-port-1,port=0x1,addr=0x2.0x1,bus=pcie.0,chassis=2 \
-device pcie-root-port,id=pcie.0-root-port-2,port=0x2,addr=0x2.0x2,bus=pcie.0,chassis=3 \
-device pcie-root-port,id=pcie.0-root-port-3,port=0x3,addr=0x2.0x3,bus=pcie.0,chassis=4 \
-device pcie-root-port,id=pcie.0-root-port-4,port=0x4,addr=0x2.0x4,bus=pcie.0,chassis=5 \
-device pcie-root-port,id=pcie.0-root-port-5,port=0x5,addr=0x2.0x5,bus=pcie.0,chassis=6 \
-device pcie-root-port,id=pcie.0-root-port-6,port=0x6,addr=0x2.0x6,bus=pcie.0,chassis=7 \
-device pcie-root-port,id=pcie.0-root-port-7,port=0x7,addr=0x2.0x7,bus=pcie.0,chassis=8 \
-device qemu-xhci,id=usb1,bus=pcie.0-root-port-1,addr=0x0 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-object iothread,id=iothread0 \
-device virtio-scsi-pci,id=scsi0,bus=pcie.0-root-port-2,iothread=iothread0 \
-device virtio-scsi-pci,id=scsi1,bus=pcie.0-root-port-3,num_queues=8,iothread=iothread0 \
-blockdev driver=qcow2,file.driver=file,cache.direct=off,cache.no-flush=on,file.filename=/home/images/win2019-64-virtio-scsi.qcow2,node-name=drive_image1 \
-device scsi-hd,id=os1,drive=drive_image1,bootindex=0,bus=scsi0.0 \
-blockdev driver=qcow2,file.aio=threads,file.driver=file,cache.direct=off,cache.no-flush=on,file.filename=/home/images/data1.qcow2,node-name=node1 \
-device virtio-blk-pci,id=blk_data1,drive=node1,bus=pcie.0-root-port-4,addr=0x0,bootindex=1,iothread=iothread0 \
-blockdev driver=qcow2,file.aio=threads,file.driver=file,cache.direct=off,cache.no-flush=on,file.filename=/home/images/data2.qcow2,node-name=node2 \
-device virtio-blk-pci,id=blk_data2,drive=node2,bus=pcie.0-root-port-5,addr=0x0,bootindex=2,iothread=iothread0,num-queues=8 \
-blockdev driver=qcow2,file.aio=threads,file.driver=file,cache.direct=off,cache.no-flush=on,file.filename=/home/images/data3.qcow2,node-name=node3 \
-device scsi-hd,id=blk_data3,drive=node3,bus=scsi0.0,bootindex=3 \
-blockdev driver=qcow2,file.aio=threads,file.driver=file,cache.direct=off,cache.no-flush=on,file.filename=/home/images/data4.qcow2,node-name=node4 \
-device scsi-hd,id=blk_data4,drive=node4,bus=scsi1.0,bootindex=4 \
-vnc :5 \
-monitor stdio \
-device pcie-root-port,id=pcie.0-root-port-8,slot=8,chassis=8,addr=0x8,bus=pcie.0 \
-device virtio-net-pci,mac=9a:b5:b6:b1:b2:b5,id=idMmq1jH,vectors=4,netdev=idxgXAlm,bus=pcie.0-root-port-8,addr=0x0 \
-netdev tap,id=idxgXAlm \
-chardev file,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmp5.log,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-qmp tcp:0:5955,server,nowait \
-chardev file,path=/var/tmp/monitor-serial5.log,id=serial_id_serial0 \
-device isa-serial,chardev=serial_id_serial0 \
-drive id=drive_cd1,if=none,snapshot=off,aio=threads,cache=none,media=cdrom,file=/home/kvm_autotest_root/iso/windows/winutils.iso \
-device ide-cd,id=cd1,drive=drive_cd1,bus=ide.0,unit=0

3. Format the data disks as NTFS (quick format), then run:

d:\coreutils\DummyCMD.exe e:\1.dat 10240000000 1
d:\coreutils\DummyCMD.exe f:\1.dat 10240000000 1
d:\coreutils\DummyCMD.exe g:\1.dat 10240000000 1
d:\coreutils\DummyCMD.exe h:\1.dat 10240000000 1

Comparing the output, the time for the num_queues-enabled disks is longer. (DummyCMD.exe comes from https://www.mynikko.com/dummy/)
Copy & paste from a recent document created by Stefan, including testing instructions:

----
virtio-blk/scsi multi-queue by default

The -device virtio-blk,num-queues= and -device virtio-scsi,num_queues= parameters control how many virtqueues are available to the guest. Allocating one virtqueue per vCPU improves performance as follows:
- Interrupts are handled on the vCPU that submitted the request, avoiding IPIs
- The I/O scheduler is automatically set to "none" by the Linux block layer

The number of queues can still be set explicitly in libvirt domain XML or on the QEMU command line, but the latest QEMU machine type now defaults to num-queues=num-vcpus instead of num-queues=1.

Note that all virtqueues are still handled by a single thread in QEMU. This is not the same as QEMU block layer multi-queue support.

Documentation

This is a user-visible feature since it affects performance. Existing VMs are not affected because older machine types remain unchanged. Performance is expected to be better, especially for latency-sensitive workloads. If a customer is concerned about regressions they can compare with explicitly setting the number of queues to 1.

Testing

Launch an SMP guest without explicitly setting the number of queues for the virtio-blk device. The guest should now show that the Linux multiqueue block layer is active:

# ls /sys/block/vda/mq/
0 1 2 3
# cat /sys/block/vda/queue/scheduler
[none] mq-deadline kyber bfq

Compare this with a guest that explicitly has the number of queues set to 1:

# ls /sys/block/vda/mq/
0
# cat /sys/block/vda/queue/scheduler
[mq-deadline] kyber bfq none
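When scripting the checks above, the active scheduler is the bracketed entry in the scheduler file. A small helper (my own, for illustration; the sysfs paths exist only inside the guest) can extract it:

```shell
# Print the active I/O scheduler, i.e. the bracketed word in a Linux
# queue/scheduler file such as "[none] mq-deadline kyber bfq".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Inside a guest you would run: active_scheduler /sys/block/vda/queue/scheduler
# Demo with a sample file, since that sysfs path is only available in the guest:
printf '[none] mq-deadline kyber bfq\n' > /tmp/sched_demo
active_scheduler /tmp/sched_demo     # prints: none
```

A multi-queue guest disk should report "none" here, while a single-queue disk defaults to "mq-deadline".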
Tested on Red Hat Enterprise Linux release 8.4 Beta (Ootpa) 4.18.0-252.el8.dt4.x86_64 qemu-kvm-common-5.2.0-1.module+el8.4.0+9091+650b220a.x86_64 For virtio-scsi: /usr/libexec/qemu-kvm \ -S \ -name 'avocado-vt-vm1' \ -sandbox on \ -machine pc \ -nodefaults \ -device VGA,bus=pci.0,addr=0x2 \ -m 2048 \ -smp 8,maxcpus=8,cores=4,threads=1,dies=1,sockets=2 \ -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \ -chardev socket,nowait,id=qmp_id_qmpmonitor1,path=/tmp/avocado_y9jitn48/monitor-qmpmonitor1-20201215-031130-UGyrst1b,server \ -mon chardev=qmp_id_qmpmonitor1,mode=control \ -chardev socket,nowait,id=qmp_id_catch_monitor,path=/tmp/avocado_y9jitn48/monitor-catch_monitor-20201215-031130-UGyrst1b,server \ -mon chardev=qmp_id_catch_monitor,mode=control \ -device pvpanic,ioport=0x505,id=idEvx5RT \ -chardev socket,nowait,id=chardev_serial0,path=/tmp/avocado_y9jitn48/serial-serial0-20201215-031130-UGyrst1b,server \ -device isa-serial,id=serial0,chardev=chardev_serial0 \ -chardev socket,id=seabioslog_id_20201215-031130-UGyrst1b,path=/tmp/avocado_y9jitn48/seabios-20201215-031130-UGyrst1b,server,nowait \ -device isa-debugcon,chardev=seabioslog_id_20201215-031130-UGyrst1b,iobase=0x402 \ -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 \ -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \ -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \ -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel840-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \ -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \ -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \ -device virtio-scsi-pci,id=virtio_scsi_pci1,num_queues=1,bus=pci.0,addr=0x5 \ -blockdev 
node-name=file_stg0,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg0.qcow2,cache.direct=on,cache.no-flush=off \ -blockdev node-name=drive_stg0,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg0 \ -device scsi-hd,id=stg0,bus=virtio_scsi_pci1.0,drive=drive_stg0,write-cache=on \ -device virtio-scsi-pci,id=virtio_scsi_pci2,num_queues=8,bus=pci.0,addr=0x6 \ -blockdev node-name=file_stg1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg1.qcow2,cache.direct=on,cache.no-flush=off \ -blockdev node-name=drive_stg1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg1 \ -device scsi-hd,id=stg1,bus=virtio_scsi_pci2.0,drive=drive_stg1,write-cache=on \ -blockdev node-name=file_stg2,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg2.qcow2,cache.direct=on,cache.no-flush=off \ -blockdev node-name=drive_stg2,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg2 \ -device scsi-hd,id=stg2,drive=drive_stg2,write-cache=on \ -device virtio-net-pci,mac=9a:52:fd:51:01:1d,id=id5I3Uqq,netdev=idYmyF0E,bus=pci.0,addr=0x7 \ -netdev tap,id=idYmyF0E,vhost=on,vhostfd=20,fd=16 \ -vnc :0 \ -rtc base=utc,clock=host,driftfix=slew \ -boot menu=off,order=cdn,once=c,strict=off \ -enable-kvm For virtio-blk: /usr/libexec/qemu-kvm \ -S \ -name 'avocado-vt-vm1' \ -sandbox on \ -machine pc \ -nodefaults \ -device VGA,bus=pci.0,addr=0x2 \ -m 2048 \ -smp 8,maxcpus=8,cores=4,threads=1,dies=1,sockets=2 \ -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \ -chardev socket,nowait,id=qmp_id_qmpmonitor1,path=/tmp/avocado_y9jitn48/monitor-qmpmonitor1-20201215-030532-IscLJLhb,server \ -mon chardev=qmp_id_qmpmonitor1,mode=control \ -chardev 
socket,nowait,id=qmp_id_catch_monitor,path=/tmp/avocado_y9jitn48/monitor-catch_monitor-20201215-030532-IscLJLhb,server \ -mon chardev=qmp_id_catch_monitor,mode=control \ -device pvpanic,ioport=0x505,id=idR75jHN \ -chardev socket,nowait,id=chardev_serial0,path=/tmp/avocado_y9jitn48/serial-serial0-20201215-030532-IscLJLhb,server \ -device isa-serial,id=serial0,chardev=chardev_serial0 \ -chardev socket,id=seabioslog_id_20201215-030532-IscLJLhb,path=/tmp/avocado_y9jitn48/seabios-20201215-030532-IscLJLhb,server,nowait \ -device isa-debugcon,chardev=seabioslog_id_20201215-030532-IscLJLhb,iobase=0x402 \ -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 \ -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \ -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel840-64-virtio.qcow2,cache.direct=on,cache.no-flush=off \ -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \ -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,write-cache=on,bus=pci.0,addr=0x4 \ -blockdev node-name=file_stg0,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg0.qcow2,cache.direct=on,cache.no-flush=off \ -blockdev node-name=drive_stg0,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg0 \ -device virtio-blk-pci,id=stg0,drive=drive_stg0,bootindex=1,write-cache=on,num-queues=1,bus=pci.0,addr=0x5 \ -blockdev node-name=file_stg1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg1.qcow2,cache.direct=on,cache.no-flush=off \ -blockdev node-name=drive_stg1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg1 \ -device virtio-blk-pci,id=stg1,drive=drive_stg1,bootindex=2,write-cache=on,num-queues=8,bus=pci.0,addr=0x6 \ -blockdev 
node-name=file_stg2,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg2.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_stg2,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg2 \
-device virtio-blk-pci,id=stg2,drive=drive_stg2,bootindex=3,write-cache=on,bus=pci.0,addr=0x7 \
-device virtio-net-pci,mac=9a:52:fd:51:01:1d,id=idV6pVbJ,netdev=id5mFz6m,bus=pci.0,addr=0x8 \
-netdev tap,id=id5mFz6m,vhost=on,vhostfd=20,fd=16 \
-vnc :1 \
-rtc base=utc,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-enable-kvm

The third data disk, stg2.qcow2, indeed enables multi-queue with the same count as the SMP number:

[root@localhost ~]# ls /sys/block/vdd/mq/
0 1 2 3 4 5 6 7

But there is no performance improvement; sometimes it is worse. Does it only take effect for specific operations or devices? My test commands:

num-queues=1 disk: time dd if=/dev/zero of=/dev/sdb bs=256k count=10240 oflag=direct
num-queues=8 disk: time dd if=/dev/zero of=/dev/sdc bs=256k count=10240 oflag=direct

Also, each hot-plugged disk opens many fds, related to the SMP number:
https://bugzilla.redhat.com/show_bug.cgi?id=1902548#c17
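A single dd stream keeps only one request in flight, so it cannot exercise more than one virtqueue; multi-queue benefits show up with parallel submitters. A fio job along these lines (device path and parameters are illustrative assumptions, not from the original test) would be a fairer comparison:

```ini
; Parallel 4 KB random reads from several jobs, so requests can be
; submitted from multiple vCPUs and spread across virtqueues.
; /dev/vdc is a placeholder for the num-queues=8 data disk.
[global]
filename=/dev/vdc
direct=1
ioengine=libaio
bs=4k
rw=randread
runtime=60
time_based=1

[parallel-readers]
numjobs=4
iodepth=8
```

Sequential 256k writes at depth 1, as in the dd test, are unlikely to show a multi-queue difference.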
Hi yama, could you please help to provide a performance test for this feature?
Here are comparison results between 4 queues and 1 queue with qemu-kvm-5.2.0-1.module+el8.4.0+9091+650b220a.x86_64 and kernel-4.18.0-259.el8.dt3.x86_64:

raw+blk:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/raw.virtio_blk.*.x86_64.html
raw+scsi:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/raw.virtio_scsi.*.x86_64.html

For raw+blk, there is an improvement for 4k read and write, but a performance regression for 4k and 16k randread, randwrite and randrw. For raw+scsi, there is no obvious performance difference.

1 queue:
-m 4096 \
-smp 4,maxcpus=4,cores=2,threads=1,dies=1,sockets=2 \
-blockdev node-name=file_disk1,driver=host_device,auto-read-only=on,discard=unmap,aio=threads,filename=/dev/nvme0n1,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_disk1,driver=raw,read-only=off,cache.direct=on,cache.no-flush=off,file=file_disk1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-blk-pci,id=disk1,drive=drive_disk1,bootindex=1,write-cache=on,num-queues=1,bus=pcie-root-port-3,addr=0x0 \   ------> num-queues=1 specified explicitly

4 queues:
-m 4096 \
-smp 4,maxcpus=4,cores=2,threads=1,dies=1,sockets=2 \
-blockdev node-name=file_disk1,driver=host_device,auto-read-only=on,discard=unmap,aio=threads,filename=/dev/nvme0n1,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_disk1,driver=raw,read-only=off,cache.direct=on,cache.no-flush=off,file=file_disk1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-blk-pci,id=disk1,drive=drive_disk1,bootindex=1,write-cache=on,bus=pcie-root-port-3,addr=0x0 \   ------> num-queues not specified explicitly

Hello Stefan,

Could you please help check the above results?
Since this feature has been implemented (multi-queue enabled by default), could you please open a dedicated bug to track the performance issue?
The feature has been verified, per https://bugzilla.redhat.com/show_bug.cgi?id=1827722#c11
(In reply to qing.wang from comment #16)
> Since this feature has been implemented (multi-queue enabled default), could
> you please open a dedicated bug to track performance issue?

Yes. Before reporting a new bug: Stefan, could you please help check the results of comment 15?
(In reply to Yanhui Ma from comment #15)
> [comparison results and QEMU command lines quoted from comment 15 snipped]
> Hello Stefan,
> Could you please help check above results?

Hi,
Please try the following configuration changes:

1.
Disable the I/O scheduler inside the guest (echo none > /sys/block/vdb/queue/scheduler). This will make the num-queues=1 vs num-queues=4 comparison fairer because blk-mq has a different default I/O scheduler for num-queues=1.
2. Use -blockdev ...,aio=native. This avoids the thread pool on the host and is the recommended configuration.
3. Pin the 4 vCPU threads to dedicated host CPUs on the same NUMA node.
4. Add -object iothread,id=iothread0 and -device virtio-blk-pci,id=disk1,...,iothread=iothread0. Pin this iothread to a dedicated host CPU on the same NUMA node.

I suggest first running with #1, then adding #2 and running again, then adding #3 and running again, etc., so we can understand how each change affects the results. Thanks!
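For #3 and #4, pinning means calling taskset on the QEMU thread IDs, which can be obtained from QMP (query-cpus-fast and query-iothreads) or from /proc/&lt;qemu-pid&gt;/task. The helper below is only a sketch that prints the commands it would run; the thread IDs are placeholders, not real ones:

```shell
# Emit taskset commands pinning each given thread ID to consecutive
# dedicated host CPUs, starting from the given CPU (same NUMA node).
pin_cmds() {
    host_cpu=$1
    shift
    for tid in "$@"; do
        echo "taskset -pc $host_cpu $tid"
        host_cpu=$((host_cpu + 1))
    done
}

# 4 vCPU thread IDs (placeholders) pinned starting at host CPU 2,
# then the IOThread (placeholder TID) pinned to host CPU 6:
pin_cmds 2 12345 12346 12347 12348
pin_cmds 6 12400
```

Piping the output to sh would apply the pinning; in a real setup the chosen CPUs should all sit on the NUMA node that backs the guest's memory.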
I tried #1, #2 and #4; it looks like num-queues=1 has better performance sometimes.

/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox on \
-machine pc \
-nodefaults \
-device VGA,bus=pci.0,addr=0x2 \
-m 2048 \
-smp 8,maxcpus=8,cores=4,threads=1,dies=1,sockets=2 \
-cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
-device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel840-64-virtio.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
-device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,write-cache=on,bus=pci.0,addr=0x4 \
-object iothread,id=iothread0 \
-blockdev node-name=file_stg0,driver=file,auto-read-only=on,discard=unmap,aio=native,filename=/home/kvm_autotest_root/images/stg0.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_stg0,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg0 \
-device virtio-blk-pci,id=stg0,drive=drive_stg0,bootindex=1,write-cache=on,num-queues=1,bus=pci.0,addr=0x5,iothread=iothread0 \
-blockdev node-name=file_stg1,driver=file,auto-read-only=on,discard=unmap,aio=native,filename=/home/kvm_autotest_root/images/stg1.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_stg1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg1 \
-device virtio-blk-pci,id=stg1,drive=drive_stg1,bootindex=2,write-cache=on,num-queues=8,bus=pci.0,addr=0x6,iothread=iothread0 \
-blockdev node-name=file_stg2,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/stg2.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_stg2,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg2 \
-device virtio-blk-pci,id=stg2,drive=drive_stg2,bootindex=3,write-cache=on,bus=pci.0,addr=0x7 \
-device virtio-net-pci,mac=9a:52:fd:51:01:1d,id=idV6pVbJ,netdev=id5mFz6m,bus=pci.0,addr=0x8 \
-netdev tap,id=id5mFz6m,vhost=on \
-vnc :5 \
-qmp tcp:0:5955,server,nowait \
-rtc base=utc,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-enable-kvm \
-monitor stdio

[root@localhost ~]# ls /sys/block/vdb/mq/
0
[root@localhost ~]# ls /sys/block/vdc/mq/
0 1 2 3 4 5 6 7
[root@localhost ~]# cat /sys/block/vdb/queue/scheduler
[none] mq-deadline kyber bfq
[root@localhost ~]# cat /sys/block/vdc/queue/scheduler
[none] mq-deadline kyber bfq

======================================================
[root@localhost ~]# time dd if=/dev/zero of=/dev/vdc bs=256k count=10240 oflag=direct; time dd if=/dev/zero of=/dev/vdb bs=256k count=10240
2684354560 bytes (2.7 GB, 2.5 GiB) copied, 13.2766 s, 202 MB/s

real 0m13.280s
user 0m0.054s
sys 0m0.941s
10240+0 records in
10240+0 records out
2684354560 bytes (2.7 GB, 2.5 GiB) copied, 13.1277 s, 204 MB/s

real 0m13.131s
user 0m0.024s
sys 0m1.046s
======================================================

Hi yama, could you please help to double check?
Hi Stefan, how do I implement #3 that you mentioned in comment 19?

I tried adding numactl before qemu-kvm, but the guest cannot boot up:

numactl -C 2 /usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox on \
-machine pc \
......
(In reply to qing.wang from comment #21)
> Hi Stefan, how to implement you mentioned #3 in comment 19?
>
> I tried adding numactl before qemu-kvm, but the guest can not boot up.
>
> numactl -C 2 /usr/libexec/qemu-kvm \
> -name 'avocado-vt-vm1' \
> -sandbox on \
> -machine pc
> ......

Hi qing.wang,

numactl -m 1 /usr/libexec/qemu-kvm ...

then pin the vCPU threads to the corresponding physical CPUs.
(In reply to qing.wang from comment #20)
> I tried #1,2,4, it looks like num-queues=1 have better performance sometimes.
> [QEMU command line and dd results quoted from comment 20 snipped]
> Hi,yama,could you please help to double check.

I will re-test the performance according to Stefan's suggestions.
(In reply to Stefan Hajnoczi from comment #19)
> (In reply to Yanhui Ma from comment #15)
> > [comparison results and QEMU command lines quoted from comment 15 snipped]
> > Hello Stefan,
> > Could you please help check above results?
>
> Hi,
> Please try the following configuration changes:
> 1. Disable the I/O scheduler inside the guest (echo none >/sys/block/vdb/queue/scheduler). This will make the num-queues=1 vs num-queues=4 comparison fairer because blk-mq has a different default I/O scheduler for num-queues=1.

Changing the scheduler to none for num-queues=1, there is still some regression for virtio-blk:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/none-scheduler/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/none-scheduler/raw.virtio_scsi.*.x86_64.html

> 2. Use -blockdev ...,aio=native. This avoids the thread pool on the host and is the recommended configuration.

Setting the none scheduler and aio=native for both 1 queue and 4 queues, there is now almost no performance regression for virtio-blk:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/none-scheduler-native/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/none-scheduler-native/raw.virtio_scsi.*.x86_64.html

> 3. Pin the 4 vCPU threads to dedicated host CPUs on the same NUMA node.

The test runs always pin the 4 vCPU threads to dedicated host CPUs on the same NUMA node, so there is no need to re-test for this.

> 4. Add -object iothread,id=iothread0 and -device virtio-blk-pci,id=disk1,...,iothread=iothread0. Pin this iothread to a dedicated host CPU on the same NUMA node.

Setting the none scheduler and aio=native, adding an iothread and pinning the iothread to a host CPU for both 1 queue and 4 queues, there is a slight regression for both virtio-blk and virtio-scsi randrw:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/iothread-none-scheduler-native/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/iothread-none-scheduler-native/raw.virtio_scsi.*.x86_64.html

Hi Stefan,

I have got the above results with the configuration changes. Please help check them.

If we only compare none with mq-deadline for a single queue, there seems to be no big difference between the two schedulers:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/1queuenone-deadline/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/1queuenone-deadline/raw.virtio_scsi.*.x86_64.html

If we only compare aio=threads and aio=native, native is better than threads:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/4queuenative-thread/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/4queuenative-thread/raw.virtio_scsi.*.x86_64.html

If we only compare with iothread and without iothread, there are improvements for some block sizes and regressions for others:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/dataplane/4queues/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/dataplane/4queues/raw.virtio_scsi.*.x86_64.html

> I suggest first running with #1, then adding #2 and running again, then
> adding #3 and running again, etc so we can understand how these changes
> affected the results.
>
> Thanks!
Thank you for running all these comparisons! Unfortunately the results are mixed: some things improve, others get worse. However, I summed up all the numbers for #4 and the overall trend is a degradation in performance. I do not see a pattern explaining why there are regressions in some cases :(. I think detailed profiling of the biggest regression cases will be required.

Existing customer VMs will not regress because they use old QEMU machine types that set num-queues=1. If customers encounter regressions with new machine types they can select num-queues=1 manually.

You don't need to do anything; I or someone on my team will investigate further.

The no-iothread vs iothread comparison is also interesting because there are unexpected, significant degradations in some cases that seem worth investigating (a more than 30% bandwidth drop!). Please raise a separate BZ for the iothread regressions.
I have created "Bug 1930286 - randread and randrw regression with virtio-blk multi-queue" so Stefano Garzarella can investigate those regressions. Additionally, I have created "Bug 1930320 - virtio-blk with iothreads can be significantly slower than without" so I can investigate the IOThread regression.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2098
*** Bug 1791331 has been marked as a duplicate of this bug. ***