Description of problem:
Loading the scsi disk is too slow when using 384 vcpus.

Version-Release number of selected component (if applicable):
host:  kernel-4.18.0-280.el8.ppc64le
       qemu-kvm-5.2.0-5.module+el8.4.0+9775+0937c167.ppc64le
guest: kernel-4.18.0-280.el8.ppc64le

How reproducible:
100%

Steps to Reproduce:
1. Set the host open-file limit from 1024 to 8192
2. Boot up the guest with the command:
/usr/libexec/qemu-kvm \
 -smp 384 \
 -m 8192 \
 -nodefaults \
 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
 -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=rhel840-ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
 -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
 -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
 -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c4:e7:84 \
 -netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on \
 -chardev stdio,mux=on,id=serial_id_serial0,server,nowait,signal=off \
 -device spapr-vty,id=serial111,chardev=serial_id_serial0 \
 -mon chardev=serial_id_serial0,mode=readline \
3.

Actual results:
The guest stops at "Trying to load: from: /pci@800000020000000/scsi@4/disk@100000000000000 ... Successfully loaded" for about 3:18 minutes.

Expected results:
The disk loads and the guest boots up quickly.

Additional info:
qemu-kvm-5.1.0-17.module+el8.3.1+9213+7ace09c3 does not have this problem.
Is the bug reproducible using the "pseries-rhel8.3.0" machine type?
If not reproducible using "-machine pseries-rhel8.3.0", please check if reproducible using "-device virtio-scsi-pci,...,num_queues=1" (using the default machine type).
(In reply to Eduardo Habkost from comment #2)
> If not reproducible using "-machine pseries-rhel8.3.0", please check if
> reproducible using "-device virtio-scsi-pci,...,num_queues=1" (using the
> default machine type).

The issue cannot be reproduced with either configuration.
Paolo, Stefan, do either of you want to take this? It looks like a performance regression caused by the new default for virtio num_queues.
Assigning to Stefan to take a look, though it would be nice if the PPC people could figure out what the guest is doing when it hangs.
Greg is having a look at this BZ.
Hi Greg,

When I developed the num-queues sizing code upstream I fixed several bottlenecks. It's possible that there is a ppc-specific bottleneck that I didn't encounter when testing with x86.

Using "perf record -a" on the host was useful. "perf report" shows a profile of the hottest functions that the sampling profiler identified. This should point the way to CPU hogs like O(n^2) algorithms.

Please let me know if you want to discuss this bug more.
(In reply to Stefan Hajnoczi from comment #7)
> Hi Greg,

Hi Stefan,

> When I developed the num-queues sizing code upstream I fixed several
> bottlenecks. It's possible that there is a ppc-specific bottleneck that I
> didn't encounter when testing with x86.

Likely. Unrelated to your work, a pseries machine type with an "-smp 384" topology spends nearly 40 s creating the CPU nodes in the device tree. And, as a ppc-specific thing, this happens twice in the boot sequence: once during initial machine reset and once when the guest issues the client-architecture-support (CAS) call just before passing the baton to the guest kernel. These 40 s come from the fact that QEMU ends up parsing /proc/cpuinfo 384 times to extract the very same data. I already have a tentative fix for that.

> Using "perf record -a" on the host was useful. "perf report" shows a profile
> of the hottest functions that the sampling profiler identified. This should
> point the way to CPU hogs like O(n^2) algorithms.

I had already tried "gprof", and "perf" now seems to be confirming my previous findings.

Single-queue run:

 88.45%  swapper          [kernel.kallsyms]  [k] power_pmu_enable
  4.36%  qemu-kvm         [kernel.kallsyms]  [k] power_pmu_enable
  4.20%  qemu-kvm         [kernel.kallsyms]  [k] smp_call_function_single
  0.25%  kworker/8:0-eve  [kernel.kallsyms]  [k] smp_call_function_single
  0.17%  kworker/16:5-ev  [kernel.kallsyms]  [k] smp_call_function_single
  0.13%  kworker/48:2-ev  [kernel.kallsyms]  [k] smp_call_function_single

Multi-queue run:

    67.88%  swapper          [kernel.kallsyms]  [k] power_pmu_enable
     9.47%  qemu-kvm         [kernel.kallsyms]  [k] smp_call_function_single
     8.64%  qemu-kvm         [kernel.kallsyms]  [k] power_pmu_enable
=>   2.79%  qemu-kvm         qemu-kvm           [.] memory_region_ioeventfd_before
=>   2.12%  qemu-kvm         qemu-kvm           [.] address_space_update_ioeventfds
     0.56%  kworker/8:0-mm_  [kernel.kallsyms]  [k] smp_call_function_single

These are called under virtio_scsi_dataplane_start() and _stop(), once per vring. I'm observing nearly 10 s per invocation of virtio_scsi_dataplane_start(). And, another ppc-specific oddity, the SLOF firmware starts/stops the virtio-scsi device at least 3 or 4 times during early boot, so in the end we've spent 40 s _just_ to start the disk.

The first thing that comes to mind is that we're adding a bunch of eventfd memory regions, i.e. (384 I/O queues + 2 control queues) * (1 for modern + 1 for legacy) == 772, doing memory_region_transaction_{begin,commit}() each time. This ends up calling address_space_update_ioeventfds(), in which we have this nested loop:

    FOR_EACH_FLAT_RANGE(fr, view) {
        for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {
                             ^^^^^
                             346

Given this is called per-queue, this looks quadratic to me. Maybe it didn't bite on x86 because of fewer vCPUs?

> Please let me know if you want to discuss this bug more.

What about adding all eventfd regions in a single transaction?
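To illustrate what I mean, a very rough sketch of the direction (the start_notifiers_batched() helper and its arguments are placeholders of mine, not the actual dataplane code; only memory_region_transaction_begin()/commit() and virtio_bus_set_host_notifier() are existing QEMU APIs):

    #include "qemu/osdep.h"
    #include "exec/memory.h"
    #include "hw/virtio/virtio-bus.h"

    /*
     * Rough sketch: bracket the per-queue host notifier assignment with a
     * single memory transaction, so the flat-view / ioeventfd update is
     * done once for all queues instead of once per queue.
     */
    static int start_notifiers_batched(VirtioBusState *bus, int nvqs)
    {
        int i, r = 0;

        memory_region_transaction_begin();
        for (i = 0; i < nvqs; i++) {
            r = virtio_bus_set_host_notifier(bus, i, true);
            if (r < 0) {
                break;
            }
        }
        /* All eventfd regions are added to the address space in one commit. */
        memory_region_transaction_commit();
        return r;
    }

Untested, but with 386 virtqueues (modern + legacy) that should collapse the ~772 separate ioeventfd updates into a single one and take the quadratic walk out of virtio_scsi_dataplane_start().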
(In reply to Greg Kurz from comment #8)
> What about adding all eventfd regions in a single transaction?

I did some experiments in that direction with virtio-scsi and it sounds promising. I'm now trying with virtio-blk, which has the same issue, hoping to come up with a generic solution.
(In reply to Greg Kurz from comment #9)
> (In reply to Greg Kurz from comment #8)
> >
> > What about adding all eventfd regions in a single transaction?
>
> I did some experiments in that direction with virtio-scsi and it
> sounds promising. I'm now trying with virtio-blk, which has the same
> issue, hoping to come up with a generic solution.

Excellent! I remember batching g_realloc() in address_space_update_ioeventfds() to improve performance but didn't change the for loop you mentioned:

commit 920d557e5ae58671d335acbcfba3f9a97a02911c
Author: Stefan Hajnoczi <stefanha>
Date:   Tue Feb 18 18:22:26 2020 +0000

    memory: batch allocate ioeventfds[] in address_space_update_ioeventfds()
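For context, the idea behind that commit was to stop growing ioeventfds[] one element at a time. A simplified standalone sketch of the pattern (placeholder MemoryRegionIoeventfd type and made-up ioeventfd_append() helper, not the actual QEMU code):

    #include <glib.h>

    /* Simplified placeholder for QEMU's MemoryRegionIoeventfd. */
    typedef struct { int fd; } MemoryRegionIoeventfd;

    /*
     * Grow the array geometrically (capacity tracked in *max) so that n
     * appends cost O(n) reallocation work overall, instead of calling
     * g_realloc() once per appended element.
     */
    static MemoryRegionIoeventfd *
    ioeventfd_append(MemoryRegionIoeventfd *fds, unsigned *nb, unsigned *max,
                     MemoryRegionIoeventfd fd)
    {
        if (*nb == *max) {
            *max = MAX(4, *max * 2);
            fds = g_realloc(fds, *max * sizeof(*fds));
        }
        fds[(*nb)++] = fd;
        return fds;
    }

That only removes the per-element allocation cost, though; the nested FOR_EACH_FLAT_RANGE loop itself is still walked once per transaction commit, which is why batching the commits as you suggest should help.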
Greg, can you estimate when this might be ready and set a DTM accordingly? Or do we have to defer this until after 8.5?
Setting Verified:Tested,SanityOnly as the gating/tier1 tests pass.
The guest with 384 vcpus boots up smoothly when the host file limit is raised from 1024 to 8192; the bug has been fixed in this build.
Setting to VERIFIED according to comment 17.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4684