Bug 1827722 - virtio-blk and virtio-scsi multi-queue should be enabled by default
Summary: virtio-blk and virtio-scsi multi-queue should be enabled by default
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.2
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 8.3
Assignee: Stefan Hajnoczi
QA Contact: qing.wang
URL:
Whiteboard:
Duplicates: 1791331
Depends On: 1930320
Blocks: 1930286
 
Reported: 2020-04-24 15:57 UTC by Stefan Hajnoczi
Modified: 2021-07-26 09:51 UTC (History)
11 users

Fixed In Version: qemu-kvm-5.2.0-1.module+el8.4.0+9091+650b220a
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-25 06:42:08 UTC
Type: Feature Request
Target Upstream Version:
Embargoed:



Description Stefan Hajnoczi 2020-04-24 15:57:33 UTC
Performance benchmarking shows that enabling multi-queue virtio-blk and virtio-scsi increases performance.  There are several reasons:

1. Request completion IRQs are received on the vCPU that submitted I/O.  This avoids waking up the submitter vCPU with an interprocessor interrupt on each completion.

2. The mq-deadline I/O scheduler is the default for single-queue blk-mq devices, whereas the default I/O scheduler for multi-queue blk-mq devices is "none".  The latency overhead of mq-deadline affects workloads that are latency-sensitive.

3. Request completion is performed in softirq context for single-queue blk-mq devices.  HIPRI polling I/O is completed in the current task for multi-queue blk-mq devices.

I have measured a 25% performance improvement for 4 KB random read, iodepth=1, hipri=1 on NVMe just by enabling multi-queue.
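
For reference, a benchmark along these lines can be approximated inside the guest with an fio job like the following (the device path, I/O engine, and runtime are illustrative assumptions, not the exact job used):

# 4 KB random read, queue depth 1, RWF_HIPRI polling, against a virtio-blk disk backed by NVMe
fio --name=randread-hipri --filename=/dev/vdb --direct=1 --rw=randread --bs=4k \
    --iodepth=1 --ioengine=pvsync2 --hipri=1 --runtime=60 --time_based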

QEMU should enable multi-queue by default in new machine types.

Comment 1 qing.wang 2020-04-26 06:01:04 UTC
I'd like to confirm the usage:

1.
for virtio-scsi
-device virtio-scsi-pci,id=virtio_scsi_pci0,num_queues=4,bus=pcie.0-root-port-4,addr=0x0 \

for virtio-blk
-device virtio-blk-pci,num-queues=

2. The number of queues should equal the number of vCPUs.

Am I right?

Comment 3 Stefan Hajnoczi 2020-04-28 14:56:51 UTC
Hi qing.wang,
Nothing changes when the number of queues is explicitly set with -device virtio-scsi-pci,num_queues= or -device virtio-blk-pci,num-queues=.  This only affects the default behavior when a guest is started without these parameters.

I am proposing setting the default number of queues to the number of vCPUs, but this is under discussion upstream and the final behavior may be different.
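
For what it's worth, one way to see which default a given machine type applied is to count the queues the guest actually sees, or to inspect the device properties from the QEMU monitor (a sketch; the disk name vda and an HMP monitor are assumptions):

# inside the guest: one directory per queue
ls /sys/block/vda/mq/
# in the QEMU HMP monitor: look for the num-queues / num_queues property on the device
(qemu) info qtree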

Comment 5 qing.wang 2020-04-30 10:06:20 UTC
I find it does not work for Windows guests. Please help me check the test steps:
1. Create the images:
qemu-img create -f qcow2 /home/images/data1.qcow2 11G
qemu-img create -f qcow2 /home/images/data2.qcow2 12G
qemu-img create -f qcow2 /home/images/data3.qcow2 13G
qemu-img create -f qcow2 /home/images/data4.qcow2 14G


2. Boot the VM:
/usr/libexec/qemu-kvm \
  -name copy_read_vm1 \
  -machine q35 \
  -nodefaults \
  -vga qxl \
  -m 2048 \
  -smp 8 \
  -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x2,chassis=1 \
  -device pcie-root-port,id=pcie.0-root-port-1,port=0x1,addr=0x2.0x1,bus=pcie.0,chassis=2 \
  -device pcie-root-port,id=pcie.0-root-port-2,port=0x2,addr=0x2.0x2,bus=pcie.0,chassis=3 \
  -device pcie-root-port,id=pcie.0-root-port-3,port=0x3,addr=0x2.0x3,bus=pcie.0,chassis=4 \
  -device pcie-root-port,id=pcie.0-root-port-4,port=0x4,addr=0x2.0x4,bus=pcie.0,chassis=5 \
  -device pcie-root-port,id=pcie.0-root-port-5,port=0x5,addr=0x2.0x5,bus=pcie.0,chassis=6 \
  -device pcie-root-port,id=pcie.0-root-port-6,port=0x6,addr=0x2.0x6,bus=pcie.0,chassis=7 \
  -device pcie-root-port,id=pcie.0-root-port-7,port=0x7,addr=0x2.0x7,bus=pcie.0,chassis=8 \
  -device qemu-xhci,id=usb1,bus=pcie.0-root-port-1,addr=0x0 \
  -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
  -object iothread,id=iothread0 \
  -device virtio-scsi-pci,id=scsi0,bus=pcie.0-root-port-2,iothread=iothread0  \
  -device virtio-scsi-pci,id=scsi1,bus=pcie.0-root-port-3,num_queues=8,iothread=iothread0  \
  -blockdev driver=qcow2,file.driver=file,cache.direct=off,cache.no-flush=on,file.filename=/home/images/win2019-64-virtio-scsi.qcow2,node-name=drive_image1 \
  -device scsi-hd,id=os1,drive=drive_image1,bootindex=0,bus=scsi0.0 \
  \
  -blockdev driver=qcow2,file.aio=threads,file.driver=file,cache.direct=off,cache.no-flush=on,file.filename=/home/images/data1.qcow2,node-name=node1 \
  -device virtio-blk-pci,id=blk_data1,drive=node1,bus=pcie.0-root-port-4,addr=0x0,bootindex=1,iothread=iothread0  \
  -blockdev driver=qcow2,file.aio=threads,file.driver=file,cache.direct=off,cache.no-flush=on,file.filename=/home/images/data2.qcow2,node-name=node2 \
  -device virtio-blk-pci,id=blk_data2,drive=node2,bus=pcie.0-root-port-5,addr=0x0,bootindex=2,iothread=iothread0,num-queues=8 \
  -blockdev driver=qcow2,file.aio=threads,file.driver=file,cache.direct=off,cache.no-flush=on,file.filename=/home/images/data3.qcow2,node-name=node3 \
  -device scsi-hd,id=blk_data3,drive=node3,bus=scsi0.0,bootindex=3 \
  -blockdev driver=qcow2,file.aio=threads,file.driver=file,cache.direct=off,cache.no-flush=on,file.filename=/home/images/data4.qcow2,node-name=node4 \
  -device scsi-hd,id=blk_data4,drive=node4,bus=scsi1.0,bootindex=4 \
  \
  -vnc :5 \
  -monitor stdio \
  -device pcie-root-port,id=pcie.0-root-port-8,slot=8,chassis=8,addr=0x8,bus=pcie.0 \
  -device virtio-net-pci,mac=9a:b5:b6:b1:b2:b5,id=idMmq1jH,vectors=4,netdev=idxgXAlm,bus=pcie.0-root-port-8,addr=0x0 \
  -netdev tap,id=idxgXAlm \
  -chardev file,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmp5.log,server,nowait \
  -mon chardev=qmp_id_qmpmonitor1,mode=control \
  -qmp tcp:0:5955,server,nowait \
  -chardev file,path=/var/tmp/monitor-serial5.log,id=serial_id_serial0 \
  -device isa-serial,chardev=serial_id_serial0 \
  -drive id=drive_cd1,if=none,snapshot=off,aio=threads,cache=none,media=cdrom,file=/home/kvm_autotest_root/iso/windows/winutils.iso \
   -device ide-cd,id=cd1,drive=drive_cd1,bus=ide.0,unit=0 \

3. Format the data disks as NTFS (quick format).

d:\coreutils\DummyCMD.exe e:\1.dat 10240000000 1 
d:\coreutils\DummyCMD.exe f:\1.dat 10240000000 1
d:\coreutils\DummyCMD.exe g:\1.dat 10240000000 1
d:\coreutils\DummyCMD.exe h:\1.dat 10240000000 1

Comparing the output, the elapsed time on the disks with num_queues/num-queues set is longer.
(DummyCMD.exe comes from https://www.mynikko.com/dummy/)
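
As a side note, writing one large file per disk is mostly a single-submitter sequential workload, so it may not benefit from multi-queue; a threaded random-I/O benchmark such as Microsoft's diskspd is more likely to show a difference. A rough sketch (file path, sizes, and flags are assumptions, please check them against the diskspd documentation):

diskspd.exe -c10G -b4K -d60 -t8 -o8 -r -w0 -Sh -L e:\testfile.dat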

Comment 10 Ademar Reis 2020-11-20 22:37:02 UTC
Copy&paste from a recent document created by Stefan, including testing instructions:

----


virtio-blk/scsi multi-queue by default

The -device virtio-blk,num-queues= and -device virtio-scsi,num_queues= parameters control how many virtqueues are available to the guest. Allocating one virtqueue per vCPU improves performance as follows:
- Interrupts are handled on the vCPU that submitted the request, avoiding IPIs
- The I/O scheduler is automatically set to "none" by the Linux block layer

The number of queues can still be set explicitly in libvirt domain XML or on the QEMU command-line, but the latest QEMU machine type now defaults to num-queues=num-vcpus instead of num-queues=1.
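
For reference, setting the queue count explicitly looks roughly like this (IDs and values are placeholders; the libvirt attribute names are taken from the libvirt domain XML format):

# QEMU command line
-device virtio-blk-pci,drive=drive0,num-queues=4
-device virtio-scsi-pci,id=scsi0,num_queues=4
# libvirt domain XML
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' queues='4'/>
  ...
</disk>
<controller type='scsi' model='virtio-scsi'>
  <driver queues='4'/>
</controller>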

Note that all virtqueues are still handled by a single thread in QEMU. This is not the same as QEMU block layer multi-queue support.

Documentation

This is a user-visible feature since it affects performance. Existing VMs are not affected because older machine types remain unchanged.

Performance is expected to be better, especially for latency-sensitive workloads. If a customer is concerned about regressions they can compare with explicitly setting the number of queues to 1.

Testing

Launch an SMP guest without explicitly setting the number of queues for the virtio-blk device. The guest should now show that the Linux multiqueue block layer is active:

# ls /sys/block/vda/mq/
0 1 2 3
# cat /sys/block/vda/queue/scheduler
[none] mq-deadline kyber bfq

Compare this with a guest that explicitly has the number of queues set to 1:

# ls /sys/block/vda/mq/
0
# cat /sys/block/vda/queue/scheduler
[mq-deadline] kyber bfq none

Comment 11 qing.wang 2020-12-15 09:17:36 UTC
Tested on 

Red Hat Enterprise Linux release 8.4 Beta (Ootpa)
4.18.0-252.el8.dt4.x86_64
qemu-kvm-common-5.2.0-1.module+el8.4.0+9091+650b220a.x86_64

For virtio-scsi:
/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine pc  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -m 2048  \
    -smp 8,maxcpus=8,cores=4,threads=1,dies=1,sockets=2  \
    -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
    -chardev socket,nowait,id=qmp_id_qmpmonitor1,path=/tmp/avocado_y9jitn48/monitor-qmpmonitor1-20201215-031130-UGyrst1b,server  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,nowait,id=qmp_id_catch_monitor,path=/tmp/avocado_y9jitn48/monitor-catch_monitor-20201215-031130-UGyrst1b,server  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idEvx5RT \
    -chardev socket,nowait,id=chardev_serial0,path=/tmp/avocado_y9jitn48/serial-serial0-20201215-031130-UGyrst1b,server \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20201215-031130-UGyrst1b,path=/tmp/avocado_y9jitn48/seabios-20201215-031130-UGyrst1b,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20201215-031130-UGyrst1b,iobase=0x402 \
    -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel840-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
    -device virtio-scsi-pci,id=virtio_scsi_pci1,num_queues=1,bus=pci.0,addr=0x5 \
    -blockdev node-name=file_stg0,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg0.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_stg0,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg0 \
    -device scsi-hd,id=stg0,bus=virtio_scsi_pci1.0,drive=drive_stg0,write-cache=on \
    -device virtio-scsi-pci,id=virtio_scsi_pci2,num_queues=8,bus=pci.0,addr=0x6 \
    -blockdev node-name=file_stg1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg1.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_stg1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg1 \
    -device scsi-hd,id=stg1,bus=virtio_scsi_pci2.0,drive=drive_stg1,write-cache=on \
    -blockdev node-name=file_stg2,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg2.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_stg2,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg2 \
    -device scsi-hd,id=stg2,drive=drive_stg2,write-cache=on \
    -device virtio-net-pci,mac=9a:52:fd:51:01:1d,id=id5I3Uqq,netdev=idYmyF0E,bus=pci.0,addr=0x7  \
    -netdev tap,id=idYmyF0E,vhost=on,vhostfd=20,fd=16  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm

For virtio-blk:

/usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine pc  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -m 2048  \
    -smp 8,maxcpus=8,cores=4,threads=1,dies=1,sockets=2  \
    -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
    -chardev socket,nowait,id=qmp_id_qmpmonitor1,path=/tmp/avocado_y9jitn48/monitor-qmpmonitor1-20201215-030532-IscLJLhb,server  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,nowait,id=qmp_id_catch_monitor,path=/tmp/avocado_y9jitn48/monitor-catch_monitor-20201215-030532-IscLJLhb,server  \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idR75jHN \
    -chardev socket,nowait,id=chardev_serial0,path=/tmp/avocado_y9jitn48/serial-serial0-20201215-030532-IscLJLhb,server \
    -device isa-serial,id=serial0,chardev=chardev_serial0  \
    -chardev socket,id=seabioslog_id_20201215-030532-IscLJLhb,path=/tmp/avocado_y9jitn48/seabios-20201215-030532-IscLJLhb,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20201215-030532-IscLJLhb,iobase=0x402 \
    -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel840-64-virtio.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,write-cache=on,bus=pci.0,addr=0x4 \
    -blockdev node-name=file_stg0,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg0.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_stg0,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg0 \
    -device virtio-blk-pci,id=stg0,drive=drive_stg0,bootindex=1,write-cache=on,num-queues=1,bus=pci.0,addr=0x5 \
    -blockdev node-name=file_stg1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg1.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_stg1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg1 \
    -device virtio-blk-pci,id=stg1,drive=drive_stg1,bootindex=2,write-cache=on,num-queues=8,bus=pci.0,addr=0x6 \
    -blockdev node-name=file_stg2,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/workdir/kar/workspace/var/lib/avocado/data/avocado-vt/stg2.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_stg2,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg2 \
    -device virtio-blk-pci,id=stg2,drive=drive_stg2,bootindex=3,write-cache=on,bus=pci.0,addr=0x7 \
    -device virtio-net-pci,mac=9a:52:fd:51:01:1d,id=idV6pVbJ,netdev=id5mFz6m,bus=pci.0,addr=0x8  \
    -netdev tap,id=id5mFz6m,vhost=on,vhostfd=20,fd=16  \
    -vnc :1  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm

The third data disk stg2.qcow2 indeed enables multi-queue, with the number of queues equal to the SMP count:
[root@localhost ~]# ls /sys/block/vdd/mq/
0  1  2  3  4  5  6  7


But there is no performance improvement, and sometimes it is worse. Does it only take effect for specific operations or devices?
My test commands:
num-queues=1 disk:
time dd if=/dev/zero of=/dev/sdb bs=256k count=10240 oflag=direct
num-queues=8 disk:
time dd if=/dev/zero of=/dev/sdc bs=256k count=10240 oflag=direct
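
Side note: dd with oflag=direct issues one request at a time from a single thread, so it only ever exercises one queue. A multi-job random-I/O fio run along these lines (the device path and sizes are placeholders) is more likely to show a multi-queue difference:

fio --name=mq-test --filename=/dev/vdd --direct=1 --rw=randread --bs=4k \
    --iodepth=32 --numjobs=8 --ioengine=libaio --runtime=60 --time_based --group_reporting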

Also, each hotplugged disk opens many fds, proportional to the SMP count.
https://bugzilla.redhat.com/show_bug.cgi?id=1902548#c17

Comment 14 qing.wang 2020-12-16 02:50:58 UTC
Hi yama, could you please help provide performance test results for this feature?

Comment 15 Yanhui Ma 2020-12-16 05:01:21 UTC
Here are comparison results between 4 queues and 1 queue with qemu-kvm-5.2.0-1.module+el8.4.0+9091+650b220a.x86_64 and kernel-4.18.0-259.el8.dt3.x86_64:
raw+blk
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/raw.virtio_blk.*.x86_64.html
raw+scsi
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/raw.virtio_scsi.*.x86_64.html

For raw+blk, there is improvement for 4k read and write, but performance regression for 4k and 16k randread, randwrite and randrw.
For raw+scsi, no obvious performance difference.

1 queue: 
-m 4096  \
-smp 4,maxcpus=4,cores=2,threads=1,dies=1,sockets=2  \
-blockdev node-name=file_disk1,driver=host_device,auto-read-only=on,discard=unmap,aio=threads,filename=/dev/nvme0n1,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_disk1,driver=raw,read-only=off,cache.direct=on,cache.no-flush=off,file=file_disk1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-blk-pci,id=disk1,drive=drive_disk1,bootindex=1,write-cache=on,num-queues=1,bus=pcie-root-port-3,addr=0x0 \------>specify num-queues=1

4 queues:
-m 4096  \
-smp 4,maxcpus=4,cores=2,threads=1,dies=1,sockets=2  \
-blockdev node-name=file_disk1,driver=host_device,auto-read-only=on,discard=unmap,aio=threads,filename=/dev/nvme0n1,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_disk1,driver=raw,read-only=off,cache.direct=on,cache.no-flush=off,file=file_disk1 \
-device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
-device virtio-blk-pci,id=disk1,drive=drive_disk1,bootindex=1,write-cache=on,bus=pcie-root-port-3,addr=0x0 \-------->don't specify num-queues explicitly

Hello Stefan,

Could you please help check above results?

Comment 16 qing.wang 2020-12-24 09:37:27 UTC
Since this feature has been implemented (multi-queue enabled by default), could you please open a dedicated bug to track the performance issue?

Comment 17 qing.wang 2020-12-25 03:31:18 UTC
The feature has been verified based on https://bugzilla.redhat.com/show_bug.cgi?id=1827722#c11

Comment 18 Yanhui Ma 2020-12-28 03:02:29 UTC
(In reply to qing.wang from comment #16)
> Since this feature has been implemented (multi-queue enabled default), could
> you please open a dedicated bug to track performance issue?

Yes. Before reporting a new bug, Stefan, could you please help check the results in comment 15?

Comment 19 Stefan Hajnoczi 2021-01-05 14:50:03 UTC
(In reply to Yanhui Ma from comment #15)

Hi,
Please try the following configuration changes:
1. Disable the I/O scheduler inside the guest (echo none >/sys/block/vdb/queue/scheduler). This will make the num-queues=1 vs num-queues=4 comparison fairer because blk-mq has a different default I/O scheduler for num-queues=1.
2. Use -blockdev ...,aio=native. This avoids the thread pool on the host and is the recommended configuration.
3. Pin the 4 vCPU threads to dedicated host CPUs on the same NUMA node.
4. Add -object iothread,id=iothread0 and -device virtio-blk-pci,id=disk1,...,iothread=iothread0. Pin this iothread to a dedicated host CPU on the same NUMA node.

I suggest first running with #1, then adding #2 and running again, then adding #3 and running again, and so on, so we can understand how each change affects the results.

Thanks!

Comment 20 qing.wang 2021-01-06 09:18:25 UTC
I tried #1, #2, and #4; it looks like num-queues=1 sometimes has better performance.
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine pc  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -m 2048  \
    -smp 8,maxcpus=8,cores=4,threads=1,dies=1,sockets=2  \
    -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
    -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/rhel840-64-virtio.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,write-cache=on,bus=pci.0,addr=0x4 \
    -object iothread,id=iothread0 \
    -blockdev node-name=file_stg0,driver=file,auto-read-only=on,discard=unmap,aio=native,filename=/home/kvm_autotest_root/images/stg0.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_stg0,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg0 \
    -device virtio-blk-pci,id=stg0,drive=drive_stg0,bootindex=1,write-cache=on,num-queues=1,bus=pci.0,addr=0x5,iothread=iothread0 \
    \
    -blockdev node-name=file_stg1,driver=file,auto-read-only=on,discard=unmap,aio=native,filename=/home/kvm_autotest_root/images/stg1.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_stg1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg1 \
    -device virtio-blk-pci,id=stg1,drive=drive_stg1,bootindex=2,write-cache=on,num-queues=8,bus=pci.0,addr=0x6,iothread=iothread0 \
    -blockdev node-name=file_stg2,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/kvm_autotest_root/images/stg2.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_stg2,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_stg2 \
    -device virtio-blk-pci,id=stg2,drive=drive_stg2,bootindex=3,write-cache=on,bus=pci.0,addr=0x7 \
    -device virtio-net-pci,mac=9a:52:fd:51:01:1d,id=idV6pVbJ,netdev=id5mFz6m,bus=pci.0,addr=0x8  \
    -netdev tap,id=id5mFz6m,vhost=on  \
    -vnc :5  \
    -qmp tcp:0:5955,server,nowait \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm -monitor stdio \


[root@localhost ~]# ls /sys/block/vdb/mq/
0
[root@localhost ~]# ls /sys/block/vdc/mq/
0  1  2  3  4  5  6  7
[root@localhost ~]# cat /sys/block/vdb/queue/scheduler
[none] mq-deadline kyber bfq 
[root@localhost ~]# cat /sys/block/vdc/queue/scheduler
[none] mq-deadline kyber bfq 

======================================================
[root@localhost ~]# time dd if=/dev/zero of=/dev/vdc bs=256k count=10240 oflag=direct; time dd if=/dev/zero of=/dev/vdb bs=256k count=10240
2684354560 bytes (2.7 GB, 2.5 GiB) copied, 13.2766 s, 202 MB/s

real	0m13.280s
user	0m0.054s
sys	0m0.941s
10240+0 records in
10240+0 records out
2684354560 bytes (2.7 GB, 2.5 GiB) copied, 13.1277 s, 204 MB/s

real	0m13.131s
user	0m0.024s
sys	0m1.046s

======================================================

Hi yama, could you please help double check?

Comment 21 qing.wang 2021-01-06 09:21:30 UTC
Hi Stefan, how do I implement #3 that you mentioned in comment 19?

I tried adding numactl before qemu-kvm, but the guest cannot boot up.


 numactl -C 2 /usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1'  \
-sandbox on  \
-machine pc 
......

Comment 22 Yanhui Ma 2021-01-06 09:37:54 UTC
(In reply to qing.wang from comment #21)
> Hi Stefan, how to implement you mentioned #3 in comment 19?
> 
> I tried adding numactl before qemu-kvm ,but the guest can not boot up.
> 
> 
>  numactl -C 2 /usr/libexec/qemu-kvm \
> -name 'avocado-vt-vm1'  \
> -sandbox on  \
> -machine pc 
> ......

Hi qing.wang,

numactl \
    -m 1  /usr/libexec/qemu-kvm 
... 

Then pin the vCPU threads to the corresponding physical CPUs.
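
A rough sketch of the pinning step (CPU numbers are examples; read the actual thread IDs from the monitor first):

(qemu) info cpus        # note the thread_id of each vCPU
(qemu) info iothreads   # note the thread_id of iothread0
# on the host, pin each thread to a dedicated CPU on the same NUMA node, e.g.:
taskset -pc 2 <vcpu0 thread_id>
taskset -pc 3 <vcpu1 thread_id>
taskset -pc 6 <iothread0 thread_id>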

Comment 23 Yanhui Ma 2021-01-06 09:39:36 UTC
(In reply to qing.wang from comment #20)
> I tried #1, #2, and #4; it looks like num-queues=1 sometimes has better performance.

I will re-test the performance according to Stefan's suggestions.

Comment 25 Yanhui Ma 2021-01-21 04:18:57 UTC
(In reply to Stefan Hajnoczi from comment #19)
> Hi,
> Please try the following configuration changes:
> 1. Disable the I/O scheduler inside the guest (echo none
> >/sys/block/vdb/queue/scheduler). This will make the num-queues=1 vs
> num-queues=4 comparison fairer because blk-mq has a different default I/O
> scheduler for num-queues=1.

Changing the scheduler to none for num-queues=1, there is still some regression for virtio-blk:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/none-scheduler/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/none-scheduler/raw.virtio_scsi.*.x86_64.html

> 2. Use -blockdev ...,aio=native. This avoids the thread pool on the host and
> is the recommended configuration.

Setting the none scheduler and aio=native for both 1 queue and 4 queues, there is now almost no performance regression for virtio-blk:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/none-scheduler-native/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/none-scheduler-native/raw.virtio_scsi.*.x86_64.html

> 3. Pin the 4 vCPU threads to dedicated host CPUs on the same NUMA node.

The tests always pin the 4 vCPU threads to dedicated host CPUs on the same NUMA node, so there is no need to re-test this.

> 4. Add -object iothread,id=iothread0 and -device
> virtio-blk-pci,id=disk1,...,iothread=iothread0. Pin this iothread to a
> dedicated host CPU on the same NUMA node.
> 

Setting the none scheduler and aio=native, adding an iothread, and pinning the iothread to a host CPU for both 1 queue and 4 queues, there is a slight regression for both virtio-blk and virtio-scsi randrw.
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/iothread-none-scheduler-native/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/iothread-none-scheduler-native/raw.virtio_scsi.*.x86_64.html

Hi Stefan,

I have got the above results with the configuration changes. Please help check them.

If we only compare none with mq-deadline for a single queue, there seems to be no big difference between the two schedulers:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/1queuenone-deadline/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/1queuenone-deadline/raw.virtio_scsi.*.x86_64.html

If we only compare aio=threads and aio=native, native is better than threads:
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/4queuenative-thread/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/4queuenative-thread/raw.virtio_scsi.*.x86_64.html

If we only compare with and without an iothread, there are improvements for some block sizes and regressions for others.
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/dataplane/4queues/raw.virtio_blk.*.x86_64.html
http://kvm-perf.englab.nay.redhat.com/results/regression/multiqueue_rhel8.4.0/dataplane/4queues/raw.virtio_scsi.*.x86_64.html

> I suggest first running with #1, then adding #2 and running again, then
> adding #3 and running again, etc so we can understand how these changes
> affected the results.
> 
> Thanks!

Comment 26 Stefan Hajnoczi 2021-02-08 14:57:30 UTC
Thank you for running all these comparisons!

Unfortunately the results are mixed - some things improve, others become worse. However, I summed up all the numbers for #4 and the overall trend is a degradation in performance.

I do not see a pattern explaining why there are regressions in some cases :(. I think detailed profiling of the biggest regression cases will be required.

Existing customer VMs will not regress because they use old QEMU machine types that set num-queues=1. If customers encounter regressions with new machine types they can select num-queues=1 manually.

You don't need to do anything. I or someone in my team will investigate further.

The no iothread vs iothread comparison is also interesting because there are (unexpected) significant degradations in some cases that seem to be worth investigating (>-30% bandwidth!). Please raise a separate BZ for the iothread regressions.

Comment 27 Stefan Hajnoczi 2021-02-18 16:51:23 UTC
I have created "Bug 1930286 - randread and randrw regression with virtio-blk multi-queue" so Stefano Garzarella can investigate those regressions.

Additionally, I have created "Bug 1930320 - virtio-blk with iothreads can be significantly slower than without" so I can investigate the IOThread regression.

Comment 29 errata-xmlrpc 2021-05-25 06:42:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098

Comment 30 qing.wang 2021-07-26 09:51:49 UTC
*** Bug 1791331 has been marked as a duplicate of this bug. ***

