Bug 1927108 - It's too slow to load a SCSI disk when using 384 vCPUs
Summary: It's too slow to load a SCSI disk when using 384 vCPUs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.4
Hardware: ppc64le
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 8.5
Assignee: Greg Kurz
QA Contact: Xujun Ma
URL:
Whiteboard:
Depends On:
Blocks: 1957194
 
Reported: 2021-02-10 04:52 UTC by Xujun Ma
Modified: 2021-11-16 08:12 UTC
CC List: 10 users

Fixed In Version: qemu-kvm-6.0.0-18.module+el8.5.0+11243+5269aaa1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-16 07:51:42 UTC
Type: ---
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2021:4684 (last updated 2021-11-16 07:52:20 UTC)

Description Xujun Ma 2021-02-10 04:52:58 UTC
Description of problem:
It's too slow to load a SCSI disk when using 384 vCPUs.

Version-Release number of selected component (if applicable):
host:
kernel-4.18.0-280.el8.ppc64le
qemu-kvm-5.2.0-5.module+el8.4.0+9775+0937c167.ppc64le
guest:
kernel-4.18.0-280.el8.ppc64le

How reproducible:
100%

Steps to Reproduce:
1. Raise the host open file limit from 1024 to 8192 (see the example after the command below)
2. Boot the guest with this command:
/usr/libexec/qemu-kvm \
 -smp 384  \
 -m 8192 \
 -nodefaults \
 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
 -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=rhel840-ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
 -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
 -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
 -device virtio-net-pci,netdev=net0,id=nic0,mac=52:54:00:c4:e7:84 \
 -netdev tap,id=net0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on \
 -chardev stdio,mux=on,id=serial_id_serial0,server,nowait,signal=off \
 -device spapr-vty,id=serial111,chardev=serial_id_serial0 \
 -mon chardev=serial_id_serial0,mode=readline \
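Example for step 1 (one way to raise the open file limit, assuming the guest is launched from an interactive shell; adjust to however qemu-kvm is actually started):

    ulimit -n 8192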

Actual results:
The guest stalls at "Trying to load:  from: /pci@800000020000000/scsi@4/disk@100000000000000 ...   Successfully loaded" for about 3 minutes 18 seconds.
Expected results:
The disk loads and the guest boots up quickly.
Additional info:
qemu-kvm-5.1.0-17.module+el8.3.1+9213+7ace09c3 does not have this problem.

Comment 1 Eduardo Habkost 2021-02-10 20:42:09 UTC
Is the bug reproducible using the "pseries-rhel8.3.0" machine type?

Comment 2 Eduardo Habkost 2021-02-10 21:11:11 UTC
If not reproducible using "-machine pseries-rhel8.3.0", please check if reproducible using "-device virtio-scsi-pci,...,num_queues=1" (using the default machine type).
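Against the reproducer command in the description, the two checks would look roughly like this (only the changed options are shown; everything else stays the same):

    # Check from comment 1: pin the older machine type
    -machine pseries-rhel8.3.0 \

    # Check from comment 2: force a single I/O queue on the default machine type
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4,num_queues=1 \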

Comment 3 Xujun Ma 2021-02-18 07:34:35 UTC
(In reply to Eduardo Habkost from comment #2)
> If not reproducible using "-machine pseries-rhel8.3.0", please check if
> reproducible using "-device virtio-scsi-pci,...,num_queues=1" (using the
> default machine type).

The issue doesn't reproduce with either configuration.

Comment 4 Eduardo Habkost 2021-03-05 20:42:56 UTC
Paolo, Stefan, do either of you want to take this? It looks like a performance regression caused by the new default for virtio num_queues.

Comment 5 Paolo Bonzini 2021-03-12 13:44:15 UTC
Assigning to Stefan to take a look, though it would be nice if the PPC people could figure out what the guest is doing when it hangs.

Comment 6 Laurent Vivier 2021-03-15 07:29:57 UTC
Greg is having a look at this BZ.

Comment 7 Stefan Hajnoczi 2021-03-16 11:45:59 UTC
Hi Greg,
When I developed the num-queues sizing code upstream I fixed several bottlenecks. It's possible that there is a ppc-specific bottleneck that I didn't encounter when testing with x86.

Using "perf record -a" on the host was useful. "perf report" shows a profile of the hottest functions that the sampling profiler identified. This should point the way to CPU hogs like O(n^2) algorithms.

Please let me know if you want to discuss this bug more.

Comment 8 Greg Kurz 2021-03-16 16:05:57 UTC
(In reply to Stefan Hajnoczi from comment #7)
> Hi Greg,

Hi Stefan,

> When I developed the num-queues sizing code upstream I fixed several
> bottlenecks. It's possible that there is a ppc-specific bottleneck that I
> didn't encounter when testing with x86.
> 

Likely. Unrelated to your work, a pseries machine type with a "-smp 384"
topology spends nearly 40 s creating the CPU nodes in the device-tree.
And, as a ppc-specific thing, this happens twice in the boot sequence: once
during initial machine reset and once when the guest issues the
client-architecture-support (CAS) call just before passing the baton to
the guest kernel. These 40 s come from the fact that QEMU ends up parsing
/proc/cpuinfo 384 times to extract the very same data. I already have a
tentative fix for that.
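Roughly, the tentative fix memoizes the parsed value (an illustrative sketch, not the actual patch; parse_cpuinfo_value() stands in for whatever helper does the parsing):

    /* Cache the value parsed from /proc/cpuinfo so that building the
     * device-tree nodes for 384 vCPUs triggers one parse, not 384. */
    static uint64_t cached_value;

    static uint64_t get_cpuinfo_value(void)
    {
        if (!cached_value) {
            cached_value = parse_cpuinfo_value(); /* hypothetical helper */
        }
        return cached_value;
    }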

> Using "perf record -a" on the host was useful. "perf report" shows a profile
> of the hottest functions that the sampling profiler identified. This should
> point the way to CPU hogs like O(n^2) algorithms.
> 

I had already tried "gprof", and "perf" now seems to be confirming my previous
findings.

Single queue run:

    88.45%  swapper          [kernel.kallsyms]           [k] power_pmu_enable
     4.36%  qemu-kvm         [kernel.kallsyms]           [k] power_pmu_enable
     4.20%  qemu-kvm         [kernel.kallsyms]           [k] smp_call_function_single
     0.25%  kworker/8:0-eve  [kernel.kallsyms]           [k] smp_call_function_single
     0.17%  kworker/16:5-ev  [kernel.kallsyms]           [k] smp_call_function_single
     0.13%  kworker/48:2-ev  [kernel.kallsyms]           [k] smp_call_function_single

Multi-queue run:

    67.88%  swapper          [kernel.kallsyms]           [k] power_pmu_enable
     9.47%  qemu-kvm         [kernel.kallsyms]           [k] smp_call_function_single
     8.64%  qemu-kvm         [kernel.kallsyms]           [k] power_pmu_enable
=>   2.79%  qemu-kvm         qemu-kvm                    [.] memory_region_ioeventfd_before
=>   2.12%  qemu-kvm         qemu-kvm                    [.] address_space_update_ioeventfds
     0.56%  kworker/8:0-mm_  [kernel.kallsyms]           [k] smp_call_function_single

These are called under virtio_scsi_dataplane_start() and _stop(), once per vring. I'm
observing nearly 10 s per invocation of virtio_scsi_dataplane_start(). And, another
ppc-specific oddity: the SLOF firmware starts/stops the virtio-scsi device at least
3 or 4 times during early boot, so in the end we've spent 40 s _just_ to start the
disk.

First thing that comes to mind is that we're adding a bunch of eventfd memory regions,
i.e. (384 I/O queues + 2 control queues) * (1 for modern + 1 for legacy) == 772, doing
memory_region_transaction_{begin,commit}() each time. This ends up calling
address_space_update_ioeventfds() in which we have this nested loop:

    FOR_EACH_FLAT_RANGE(fr, view) {
        for (i = 0; i < fr->mr->ioeventfd_nb; ++i) {
                                  ^^^^^
                                   346

Given this is called per-queue, this looks quadratic to me.
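Back-of-the-envelope, assuming one transaction commit per notifier: adding the k-th of the 772 eventfds rescans the ones already present, so the total inner-loop work is roughly 1 + 2 + ... + 772 ≈ 772^2 / 2 ≈ 300,000 iterations per flat range, paid again on every dataplane start/stop cycle.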

Maybe it didn't bite on x86 because of fewer vCPUs?

> Please let me know if you want to discuss this bug more.

What about adding all eventfd regions in a single transaction?
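Something along these lines (a sketch only, error handling elided; virtio_bus_set_host_notifier() is the existing helper):

    /* Set up all host notifiers inside one memory-region transaction so
     * the ioeventfd list is rebuilt once instead of once per vring. */
    memory_region_transaction_begin();
    for (i = 0; i < total_queues; i++) {    /* 384 I/O + 2 control queues */
        virtio_bus_set_host_notifier(qbus, i, true);
    }
    memory_region_transaction_commit();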

Comment 9 Greg Kurz 2021-03-19 19:09:43 UTC
(In reply to Greg Kurz from comment #8)
> 
> What about adding all eventfd regions in a single transaction?

I did some experiments in that direction with virtio-scsi and it
sounds promising. I'm now trying with virtio-blk, which has the same
issue, hoping to come up with a generic solution.

Comment 10 Stefan Hajnoczi 2021-03-22 14:07:08 UTC
(In reply to Greg Kurz from comment #9)
> (In reply to Greg Kurz from comment #8)
> > 
> > What about adding all eventfd regions in a single transaction?
> 
> I did some experiments in that direction with virtio-scsi and it
> sounds promising. I'm now trying with virtio-blk, which has the same
> issue, hoping to come up with a generic solution.

Excellent!

I remember batching g_realloc() in address_space_update_ioeventfds() to improve performance but didn't change the for loop you mentioned:

commit 920d557e5ae58671d335acbcfba3f9a97a02911c
Author: Stefan Hajnoczi <stefanha>
Date:   Tue Feb 18 18:22:26 2020 +0000

    memory: batch allocate ioeventfds[] in address_space_update_ioeventfds()

Comment 11 David Gibson 2021-05-26 05:50:14 UTC
Greg, can you estimate when this might be ready and set a DTM accordingly?  Or do we have to defer this until after 8.5?

Comment 16 Zhenyu Zhang 2021-06-07 01:49:33 UTC
Set Verified:Tested,SanityOnly as the gating/tier1 tests pass.

Comment 17 Xujun Ma 2021-06-09 05:38:30 UTC
The guest boots up smoothly with 384 vCPUs when the host file limit is raised from 1024 to 8192; the bug has been fixed in this build.

Comment 18 Qunfang Zhang 2021-06-09 07:43:20 UTC
Setting to VERIFIED according to comment 17.

Comment 21 errata-xmlrpc 2021-11-16 07:51:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684

