Bug 2222603 - [upstream][kvm] Patch "vhost: use vhost_tasks for worker threads" introduces 30% performance degradation
Summary: [upstream][kvm] Patch "vhost: use vhost_tasks for worker threads" introduces 30% performance degradation
Keywords:
Status: CLOSED MIGRATED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: kernel
Version: 9.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Michael S. Tsirkin
QA Contact: Quan Wenli
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-13 08:56 UTC by Quan Wenli
Modified: 2023-09-22 13:59 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-22 13:59:03 UTC
Type: ---
Target Upstream Version:
Embargoed:


Attachments
the performance results with TCP STREAM and TCP RR (13.91 KB, application/zip), 2023-07-20 02:55 UTC, Quan Wenli
working kernel config (251.62 KB, text/plain), 2023-07-24 17:22 UTC, Mike Christie
result for comment#14: bad-commit-vs-6.5.0-rc2 (62.63 KB, text/html), 2023-07-25 08:31 UTC, Quan Wenli
reply for comment#10 (94.15 KB, application/zip), 2023-07-25 10:26 UTC, Quan Wenli
reply for comment#19, results with 6.3.0 vs 6.4.0 (30.61 KB, application/zip), 2023-07-27 06:15 UTC, Quan Wenli
Don't inherit parent's sched settings (773 bytes, patch), 2023-08-14 01:08 UTC, Mike Christie
trace file for comment #29 (8.22 MB, application/octet-stream), 2023-08-22 08:21 UTC, Quan Wenli
trace file for comment #29 (4.38 MB, application/octet-stream), 2023-08-22 08:31 UTC, Quan Wenli


Links
Red Hat Issue Tracker RHEL-7167 (Migrated), last updated 2023-09-22 13:59:00 UTC
Red Hat Issue Tracker RHELPLAN-162251, last updated 2023-07-13 08:58:00 UTC

Description Quan Wenli 2023-07-13 08:56:24 UTC
Description of problem:

I found an approximately 30% performance regression in KVM on both TCP_STREAM and TCP_RR tests between 6.3.0-rc6+ and 6.4.0+. I bisected the root cause to commit 6e890c5d5021ca7e69bbe203fde42447874d9a82 ("vhost: use vhost_tasks for worker threads").

Detailed results:

TCP_STREAM: http://kvm-perf.hosts.qa.psi.pek2.redhat.com//results/regression/2023-7-3-network-upstream/bad-2/xl710.bridge_test.1q.*netperf.with_jumbo.host_guest.html

TCP_RR: http://kvm-perf.hosts.qa.psi.pek2.redhat.com//results/regression/2023-7-3-network-upstream/bad-2/xl710.bridge_test.1q.*netperf.default.host_guest.html

# git bisect bad
6e890c5d5021ca7e69bbe203fde42447874d9a82 is the first bad commit
commit 6e890c5d5021ca7e69bbe203fde42447874d9a82
Author: Mike Christie <michael.christie>
Date:   Fri Mar 10 16:03:32 2023 -0600

    vhost: use vhost_tasks for worker threads

    For vhost workers we use the kthread API which inherit's its values from
    and checks against the kthreadd thread. This results in the wrong RLIMITs
    being checked, so while tools like libvirt try to control the number of
    threads based on the nproc rlimit setting we can end up creating more
    threads than the user wanted.

    This patch has us use the vhost_task helpers which will inherit its
    values/checks from the thread that owns the device similar to if we did
    a clone in userspace. The vhost threads will now be counted in the nproc
    rlimits. And we get features like cgroups and mm sharing automatically,
    so we can remove those calls.

    Signed-off-by: Mike Christie <michael.christie>
    Acked-by: Michael S. Tsirkin <mst>
    Signed-off-by: Christian Brauner (Microsoft) <brauner>
    Signed-off-by: Christian Brauner <brauner>

 drivers/vhost/vhost.c | 60 +++++++++++----------------------------------------
 drivers/vhost/vhost.h |  4 ++--
Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. Boot a VM with vhost
2. Run "netserver" in the guest
3. Run the "netperf" client on an external host, e.g.:
numactl --cpunodebind=0 --membind=0 `command -v python python3 | head -1 ` /tmp/netperf_agent.py 1 /tmp/netperf-2.7.1/src/netperf -D 1 -H 192.168.58.112 -l 15.0 -C -c -t TCP_STREAM -- -m 64 

Actual results:

Expected results:

Additional info:
#cat /tmp/netperf_agent.py

#!/usr/bin/python

import os
import sys

# Usage: netperf_agent.py <session_number> <netperf_path> <netperf_parameters...>
if len(sys.argv) < 4:
    print(""" netperf agent usage:
    %s [session_number] [netperf_path] [netperf_parameters_str]

    $session_number: number of client sessions
    $netperf_path: client path
    $netperf_parameter_str: netperf parameters string""" % sys.argv[0])
    sys.exit()

n = int(sys.argv[1])             # number of concurrent netperf sessions
path = sys.argv[2]               # path to the netperf binary
params = " ".join(sys.argv[3:])  # remaining arguments are passed through to netperf

# Start n-1 sessions in the background, then run the last one in the foreground
# so the script blocks until that session finishes.
for i in range(n - 1):
    os.system("%s %s &" % (path, params))
os.system("%s %s" % (path, params))

Comment 1 Quan Wenli 2023-07-18 03:28:33 UTC
hi mst, jason 

could you help review this bug?

Thanks, wenli

Comment 2 Michael S. Tsirkin 2023-07-18 06:58:52 UTC
thanks for the report. can we open this bug publicly to have the patch author help debug it? nothing confidential here is there? thanks!

Comment 3 Michael S. Tsirkin 2023-07-18 07:00:37 UTC
if you do, please zip up the reports and attach them to the bug so all readers can see them.

Comment 4 Quan Wenli 2023-07-18 07:49:08 UTC
(In reply to Michael S. Tsirkin from comment #3)
> if you do, please zip up the reports and attach them to the bug so all
> readers can see them.

hi chayang, could I share the report in public ?

Comment 5 Michael S. Tsirkin 2023-07-18 10:01:24 UTC
and just making sure the kernel 6.4.0 you tested does include
f9010dbdce911ee1f1af1398a24b1f9f992e0080
correct?

Comment 6 Quan Wenli 2023-07-18 13:34:27 UTC
(In reply to Michael S. Tsirkin from comment #5)
> and just making sure the kernel 6.4.0 you tested does include
> f9010dbdce911ee1f1af1398a24b1f9f992e0080
> correct?

yes, it includes f9010dbdce911ee1f1af1398a24b1f9f992e0080(fork, vhost: Use CLONE_THREAD to fix freezer/ps regression)

The master I tested with a901a3568fd26ca9c4a82d8bc5ed5b3ed844d451 (Merge tag 'iomap-6.5-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux)

Comment 8 Quan Wenli 2023-07-18 13:37:38 UTC
(In reply to Michael S. Tsirkin from comment #2)
> thanks for the report. can we open this bug publicly to have the patch author
> help debug it? nothing confidential here is there? thanks!

I removed the private flag and uploaded the zipped results.

Comment 9 Quan Wenli 2023-07-20 02:55:01 UTC
Created attachment 1976628 [details]
the performance results with TCP STREAM and TCP RR

Comment 10 Michael S. Tsirkin 2023-07-23 09:29:11 UTC
info request from patch author:

I think I can replicate the problem. I just need some extra info from Quan:

1. Just double check that they are using RHEL 9 on the host running the VMs.
2. The kernel config
3. Any tuning that was done. Is tuned running in guest and/or host running the
VMs and what profile is being used in each.
4. Number of vCPUs and virtqueues being used.
5. Can they dump the contents of:

/sys/kernel/debug/sched

and

sysctl  -a

on the host running the VMs.

6. With the 6.4 kernel, can they also run a quick test and tell me if they set
the scheduler to batch:

ps -T -o comm,pid,tid $QEMU_THREAD

then for each vhost thread do:

chrt -b -p 0 $VHOST_THREAD

Does that end up increasing perf? When I do this I see throughput go up by
around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs
and virtqueues per net device in the VM). Note that I'm not saying that is a fix.
It's just a difference I noticed when running some other tests.
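
For reference, a minimal sketch of step 6 as a small script, assuming $QEMU_PID holds the qemu-kvm PID and that the vhost worker threads show up in ps with a "vhost-" comm prefix (the variable name is a placeholder):

# switch every vhost worker thread of one QEMU process to SCHED_BATCH
for tid in $(ps -T -o comm,tid --no-headers -p "$QEMU_PID" | awk '/^vhost/ {print $2}'); do
    chrt -b -p 0 "$tid"    # set SCHED_BATCH, priority 0
    chrt -p "$tid"         # print the policy back to verify
done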

Comment 11 Mike Christie 2023-07-24 17:20:14 UTC
Hi Quan,

I'm the author of that vhost patch. Besides the requested info in the last comment, could you also tell me how you build your kernel config for the upstream kernel?

I was able to replicate the problem over the weekend but it was a complete fluke.

I had taken an OL7 uek6 (I work for Oracle and OL is the name of our distro based on RHEL, so OL7 == RHEL7, and uek is the name of our kernel, where uek6 is 5.4 based) kernel config, copied that to the upstream kernel, then did a make oldconfig and accepted the defaults. When doing this, I can replicate your issue.

So I did:

cd upstream-kernel-dir
cp /boot/config-$OUR_5.4_BASED_KERNEL_CONFIG .config
make oldconfig
make
make modules_install
make install
reboot


When I took the OL9 (RHEL9) uek7 (uek7 is based on upstream 5.15) kernel config as the base:

cp /boot/config-$OUR_5.14_BASED_KERNEL_CONFIG .config
make oldconfig
make
make modules_install
make install
reboot

and did the same thing, my perf with the vhost patch was actually better in 6.4 than it was in 6.3. With your test, I'm seeing up to a 50% improvement in throughput.

The kernel configs that got made when I did make oldconfig were very different, so I'm thinking the vhost patch is having an issue with a specific kernel option, and I am trying to narrow that down now.

Comment 12 Mike Christie 2023-07-24 17:22:33 UTC
Created attachment 1977344 [details]
working kernel config

Here is the kernel config where I'm seeing a perf improvement vs 6.3. Note that it's based on 6.5-rc2 instead of 6.4. 6.5-rc2 also has the regression, and I wanted to make sure there were no missing fixes in the current upstream kernel.

Comment 13 Mike Christie 2023-07-24 17:24:15 UTC
I meant to write 5.15 not 5.14:

cp /boot/config-$OUR_5.15_BASED_KERNEL_CONFIG .config

Comment 14 Mike Christie 2023-07-24 20:37:02 UTC
I found the issue in my setup.

It's CONFIG_RT_GROUP_SCHED. When that is set:

CONFIG_RT_GROUP_SCHED=y

in the kernel config, then perf hits the regression. When that is not set, then in 6.4 and newer the throughput is up to 50% higher than 6.3.

The really weird thing is that as far as I can tell we have no RT threads in qemu. So enabling that must be causing me to go down an unintentional code path causing the regression.

Quan, could you check your CONFIG_RT_GROUP_SCHED setting? If it's set in your kernel config try disabling it and/or try the kernel config from comment 12 to check that it's not that setting mixing with another setting.
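
For reference, a quick way to check and, if needed, clear that option before rebuilding (a sketch; it assumes you build from the kernel source tree, where scripts/config is shipped):

# from the top of the kernel source tree
grep CONFIG_RT_GROUP_SCHED .config          # "=y" means the option is enabled
./scripts/config --disable RT_GROUP_SCHED   # clear it in .config
make olddefconfig                           # re-resolve dependent options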

Comment 15 Quan Wenli 2023-07-25 08:25:47 UTC
(In reply to Mike Christie from comment #14)
> I found the issue in my setup.
> 
> It's CONFIG_RT_GROUP_SCHED. When that is set:
> 
> CONFIG_RT_GROUP_SCHED=y
> 
> in the kernel config, then perf hits the regression. When that is not set,
> then in 6.4 and newer the throughput is up to 50% higher than 6.3.
> 
> The really weird thing is that as far as I can tell we have no RT threads in
> qemu. So enabling that must be causing me to go down an unintentional code
> path causing the regression.
> 
> Quan, could you check your CONFIG_RT_GROUP_SCHED setting?

CONFIG_RT_GROUP_SCHED is not set in the comment #0 test. 

> If it's set in
> your kernel config try disabling it and/or try the kernel config from
> comment 12 to check that it's not that setting mixing with another setting.

I tried again with the config from comment #12 and compared the result with the bad commit. There is no difference; the throughput is still bad.

Compiled the kernel with: 

# git checkout v6.5-rc2
Previous HEAD position was 6eaae1980760 Linux 6.5-rc3

# cp /boot/comment12.config .config
cp: overwrite '.config'? y

# make oldconfig
# make
# make modules_install
# make install
# reboot
# diff -ur .config /boot/comment12.config
--- .config	2023-07-25 02:44:01.220332899 -0400
+++ /boot/comment12.config	2023-07-25 02:13:29.534665603 -0400
@@ -2,18 +2,20 @@
 # Automatically generated file; DO NOT EDIT.
 # Linux/x86 6.5.0-rc2 Kernel Configuration
 #
-CONFIG_CC_VERSION_TEXT="gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)"
+CONFIG_CC_VERSION_TEXT="gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0"
 CONFIG_CC_IS_GCC=y
-CONFIG_GCC_VERSION=110401
+CONFIG_GCC_VERSION=110300
 CONFIG_CLANG_VERSION=0
 CONFIG_AS_IS_GNU=y
-CONFIG_AS_VERSION=23502
+CONFIG_AS_VERSION=23800
 CONFIG_LD_IS_BFD=y
-CONFIG_LD_VERSION=23502
+CONFIG_LD_VERSION=23800
 CONFIG_LLD_VERSION=0
 CONFIG_CC_CAN_LINK=y
+CONFIG_CC_CAN_LINK_STATIC=y
 CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
 CONFIG_CC_HAS_ASM_GOTO_TIED_OUTPUT=y
+CONFIG_TOOLS_SUPPORT_RELR=y
 CONFIG_CC_HAS_ASM_INLINE=y
 CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
 CONFIG_PAHOLE_VERSION=0

Comment 16 Quan Wenli 2023-07-25 08:31:41 UTC
Created attachment 1977437 [details]
result for comment#14: bad-commit-vs-6.5.0-rc2

Comment 17 Quan Wenli 2023-07-25 10:24:15 UTC
(In reply to Michael S. Tsirkin from comment #10)
> info request from patch author:
> 
> I think I can replicate the problem. I just need some extra info from Quan:
> 
> 1. Just double check that they are using RHEL 9 on the host running the VMs.

yes, rhel9.3 was running on the host. 

> 2. The kernel config

please check the sched.zip

> 3. Any tuning that was done. Is tuned running in guest and/or host running
> the
> VMs and what profile is being used in each.

yes. 

on host: 
1. stop irqbalance.service
2. set the xl710's IRQ affinity to the local NUMA node one by one (details are in sched.zip; the script was run as "set_irq_affinity.sh -x local $dev")
3. pin 4 vCPUs and 1 vhost thread to NUMA node 0 one by one
4. virtual-host profile on host
5. disable firewall 

on guest:
1. disable firewall 
2. virtual-guest profile on guest

> 4. Number of vCPUs and virtqueues being used.

Four vCPUs and only one virtqueue are being used. 

 -smp 4,maxcpus=4,cores=2,threads=1,dies=1,sockets=2  \
 -device '{"id": "pcie-root-port-3", "port": 3, "driver": "pcie-root-port", "addr": "0x1.0x3", "bus": "pcie.0", "chassis": 4}' \
 -device '{"driver": "virtio-net-pci", "mac": "9a:bf:a8:29:ec:22", "id": "idgpaa3N", "netdev": "id2QyKwC", "bus": "pcie-root-port-3", "addr": "0x0"}'  \
 -netdev tap,id=id2QyKwC,vhost=on,vhostfd=16 \

> 5. Can they dump the contents of:
> 
> /sys/kernel/debug/sched
> 
> and
> 
> sysctl  -a
> 
> on the host running the VMs.

please check the sched.zip

> 
> 6. With the 6.4 kernel, can they also run a quick test and tell me if they
> set
> the scheduler to batch:
> 
> ps -T -o comm,pid,tid $QEMU_THREAD
> 
> then for each vhost thread do:
> 
> chrt -b -p 0 $VHOST_THREAD

# ps -T -o comm,pid,tid 38157
COMMAND             PID     TID
qemu-kvm          38157   38157
qemu-kvm          38157   38158
qemu-kvm          38157   38159
vhost-38157       38157   38162
qemu-kvm          38157   38163
qemu-kvm          38157   38164
qemu-kvm          38157   38165
qemu-kvm          38157   38166
qemu-kvm          38157   38167
qemu-kvm          38157   38169
qemu-kvm          38157   38218

# chrt -b -p 0 38162

# chrt -p 38162
pid 38162's current scheduling policy: SCHED_BATCH
pid 38162's current scheduling priority: 0


> 
> Does that end up increasing perf?

No, the throughput is still bad with 6.5.0-rc2.

Category:TCP_MAERTS (TX)
        size|    sessions|  throughput|

       16384|           1|    19998.71| ---> 29977.35(good)
       16384|           2|    24543.59| ---> 39595.60(good)   
       16384|           4|    22531.03| ---> 36445.16(good)
 

> When I do this I see throughput go up by
> around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs
> and virtqueues per net device in the VM). Note that I'm not saying that is a
> fix.
> It's just a difference I noticed when running some other tests.

Comment 18 Quan Wenli 2023-07-25 10:26:44 UTC
Created attachment 1977447 [details]
reply for comment#10

Comment 19 Mike Christie 2023-07-26 00:10:53 UTC
Ah ok, I thought you were using multiple virtqueues/vhost-threads. The CONFIG_RT_GROUP_SCHED setting and/or the scheduler change should not make a difference.

For tests like these:

xl710.bridge_test.1q.*netperf.default.host_guest.html

you also need commit:

commit 223baf9d17f25e2608dbdff7232c095c1e612268
Author: Mathieu Desnoyers <mathieu.desnoyers>
Date:   Thu Apr 20 10:55:48 2023 -0400

    sched: Fix performance regression introduced by mm_cid

It's in 6.4 but was merged after the commit listed in the test:
6e890c5d5021ca7e69bbe203fde42447874d9a82

Did you still have the perf numbers when comparing 6.3 and 6.4?

For the tests in:

https://bugzilla.redhat.com/show_bug.cgi?id=2222603#c9

were you using just 4 vCPUs and 1 virtqueue?

For test xl710.bridge_test.1q.*netperf.default.host_guest Category TCP_RR and the second run where size=64 and sessions=25 it shows a 21% drop in throughput right?

For that test, does the trans.rate of 210343.28 mean for 25 sessions we got a total of 210343 MB/s? What does the CPU value of 24.7120 mean and how can I get that value?

What physical nic are you using and what's its network speed/bandwidth? I can't replicate the regression, but if that trans.rate is in MB/s my network is a lot slower.

Comment 20 Mike Christie 2023-07-26 00:36:30 UTC
Ignore the question about the trans.rate and CPU. I figured it out. The line about the thr_per_CPU being in MB/sec threw me.

Comment 21 Quan Wenli 2023-07-27 06:13:43 UTC
(In reply to Mike Christie from comment #19)
> Ah ok, I thought you were using multiple virtqueues/vhost-threads. The
> CONFIG_RT_GROUP_SCHED setting and/or the scheduler change should not make a
> difference.
> 
> For tests like these:
> 
> xl710.bridge_test.1q.*netperf.default.host_guest.html
> 
> you also need commit:
> 
> commit 223baf9d17f25e2608dbdff7232c095c1e612268
> Author: Mathieu Desnoyers <mathieu.desnoyers>
> Date:   Thu Apr 20 10:55:48 2023 -0400
> 
>     sched: Fix performance regression introduced by mm_cid
> 
> It's in 6.4 but was merged after the commit listed in the test:
> 6e890c5d5021ca7e69bbe203fde42447874d9a82
> 
> Did you still have the perf numbers when comparing 6.3 and 6.4?

I tested again with 6.4.0 (Merge tag 'iomap-6.5-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux, a901a3568fd26ca9c4a82d8bc5ed5b3ed844d451).

There is no performance difference compared with the bad commit (vhost: use vhost_tasks for worker threads); I will post results in comment #22.

> 
> For the tests in:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=2222603#c9
> 
> were you using just 4 vCPUs and 1 virtqueue?

Yes.

> 
> For test xl710.bridge_test.1q.*netperf.default.host_guest Category TCP_RR
> and the second run where size=64 and sessions=25 it shows a 21% drop in
> throughput right?

Yes, the bad commit introduces significant degradation on the TCP_STREAM tests and slight degradation on the TCP_RR tests. 

> 
> For that test, does the trans.rate of 210343.28 mean for 25 sessions we got
> a total of 210343 MB/s? 

yes, it's the total trans.rate for 25 sessions. 
> What does the CPU value of 24.7120 mean and how can
> I get that value?

It was captured as 100 - idle% from mpstat on the host while the test was running. 

> 
> What physical nic are you using and what's its network speed/bandwidth? 

XL710 for 40Gb

> I can't replicate the regression, but if that trans.rate is in MB/s my network
> is a lot slower.

Comment 22 Quan Wenli 2023-07-27 06:15:33 UTC
Created attachment 1980223 [details]
reply for comment#19, results with 6.3.0 vs 6.4.0

Comment 23 Mike Christie 2023-07-27 17:34:31 UTC
Thanks for the info and testing. I just can't seem to replicate the problem. I think I'm just not setting it up the same as you. For your comment:

3. pin 4vcpus and 1 vhost thread to numa node 0 one by one

can you give me some more details? Like give me the taskset -cp info for the vcpu and vhost threads?

For example I wasn't sure if numa node0 has cpus 0-7, do you do:

vcpu0 -> pinned to cpu0
vcpu1 -> pinned to cpu1
vcpu2 -> pinned to cpu2
vcpu3 -> pinned to cpu3
vhost -> pinned to cpu4

?

Or would it be:

vcpu0 -> pinned to cpu0-cpu3
vcpu1 -> pinned to cpu0-cpu3
vcpu2 -> pinned to cpu0-cpu3
vcpu3 -> pinned to cpu0-cpu3
vhost -> pinned to cpu0-cpu3

?

Or do the VM's threads get to use all the cpus on node0 like:

vcpu0 -> pinned to cpu0-cpu7
vcpu1 -> pinned to cpu0-cpu7
vcpu2 -> pinned to cpu0-cpu7
vcpu3 -> pinned to cpu0-cpu7
vhost -> pinned to cpu0-cpu7

?

Comment 24 Michael S. Tsirkin 2023-07-27 17:38:40 UTC
to comment 14: so while we don't know if it's a separate or the same problem,
I guess it has to be fixed anyway? And maybe if we are lucky the investigation will reveal
the source of this one.

Comment 25 Mike Christie 2023-07-27 18:24:20 UTC
Yeah, I'm on both issues. My gut says they are different, but for https://bugzilla.redhat.com/show_bug.cgi?id=2222603#c14 I have to dig into the scheduler code, so while doing that I'm doing things like checking the diffs in how a kthread and vhost_task task/thread might be handled (either explicitly with KTHREAD checks or maybe due to a difference in scheduler settings).

Comment 27 Quan Wenli 2023-08-08 02:12:43 UTC
(In reply to Mike Christie from comment #23)
> Thanks for the info and testing. I just can't seem to replicate the problem.
> I think I'm just not setting it up the same as you. For your comment:
> 
> 3. pin 4vcpus and 1 vhost thread to numa node 0 one by one
> 
> can you give me some more details? Like give me the taskset -cp info for the
> vcpu and vhost threads?
> 
> For example I wasn't sure if numa node0 has cpus 0-7, do you do:

I thought I had replied to this several days ago, but I'm not sure why the response went missing.

On my machine there are two NUMA nodes, with 10 CPUs each. 

> 
> vcpu0 -> pinned to cpu0
> vcpu1 -> pinned to cpu1
> vcpu2 -> pinned to cpu2
> vcpu3 -> pinned to cpu3
> vhost -> pinned to cpu4


Yes, pinned vcpu/vhost as above with taskset (a sketch follows at the end of this comment).  

> 
> ?
> 
> Or would it be:
> 
> vcpu0 -> pinned to cpu0-cpu3
> vcpu1 -> pinned to cpu0-cpu3
> vcpu2 -> pinned to cpu0-cpu3
> vcpu3 -> pinned to cpu0-cpu3
> vhost -> pinned to cpu0-cpu3
> 
> ?
> 
> Or do the VM's threads get to use all the cpus on node0 like:
> 
> vcpu0 -> pinned to cpu0-cpu7
> vcpu1 -> pinned to cpu0-cpu7
> vcpu2 -> pinned to cpu0-cpu7
> vcpu3 -> pinned to cpu0-cpu7
> vhost -> pinned to cpu0-cpu7
> 
> ?
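
For reference, a rough sketch of the pinning layout confirmed above, assuming NUMA node 0 exposes CPUs 0-9 and the vCPU/vhost TIDs have already been looked up (for example from "ps -T" output); the TID variables are placeholders:

taskset -cp 0 "$VCPU0_TID"   # vCPU 0 -> CPU 0
taskset -cp 1 "$VCPU1_TID"   # vCPU 1 -> CPU 1
taskset -cp 2 "$VCPU2_TID"   # vCPU 2 -> CPU 2
taskset -cp 3 "$VCPU3_TID"   # vCPU 3 -> CPU 3
taskset -cp 4 "$VHOST_TID"   # vhost worker -> CPU 4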

Comment 28 Mike Christie 2023-08-14 01:08:39 UTC
Created attachment 1983225 [details]
Don't inherit parent's sched settings

Hey, Could you test this patch? With kthreads we would always reset the vhost task's sched settings. In 6.4, we will inherit the parent thread's settings.

Comment 29 Mike Christie 2023-08-14 01:12:31 UTC
Quan,

If the patch does not help, then for 6.4 or the current 6.5-rc kernel, could you run perf and get me a trace? I'm thinking that we are just going to be hitting schedule() and switching more often so it might not be helpful, but just in case you are hitting some new locking or something else that I'm missing in the code.
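
One possible way to capture such a trace, assuming the netperf run lasts about 15 seconds (recorded system-wide here; it could also be narrowed to the vhost worker with "-t $VHOST_TID"):

# record with call graphs for the duration of one netperf run
perf record -a -g -o perf.data -- sleep 15
# summarize the hot paths afterwards
perf report -i perf.data --stdio | head -n 60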

Comment 30 Quan Wenli 2023-08-14 03:26:37 UTC
(In reply to Mike Christie from comment #28)
> Created attachment 1983225 [details]
> Don't inherit parent's sched settings
> 
> Hey, Could you test this patch?

Should this patch be applied to 6.4.0 or on the bad commit?

> With kthreads we would always reset the
> vhost task's sched settings. In 6.4, we will inherit the parent thread's
> settings.

Comment 31 Mike Christie 2023-08-16 16:29:15 UTC
Sorry for the late reply. Instead of taking one long vacation, I've been taking 2 or 3 days off every week in August, so I just got back, and I will be gone part of next week and the week after.

It should be applied to 6.4 or 6.5-rc.

We can't just run off the bad commit, because the SCHED_MM_CID code had a nasty perf regression that affects all threading code which was fixed in one of the 6.4-rcs.

Comment 32 Mike Christie 2023-08-21 16:40:50 UTC
Hey Quan, 3 more questions.

1. When you launch qemu, are you just doing it from the default cgroup? You have not created a cgroup like how libvirtd would right?
2. Can you send me the entire qemu command you are running?
3. Can you tell me your qemu version?

Comment 33 Quan Wenli 2023-08-22 08:19:36 UTC
(In reply to Mike Christie from comment #29)
> Quan,
> 
> If the patch does not help, then for 6.4 or the current 6.5-rc kernel, 

Yes, the tcp stream performance are still bad with the applied patch on 6.5.0-rc5+. 

>could
> you run perf and get me a trace? 

Please check the attached file named "perf.data" 


>I'm thinking that we are just going to be
> hitting schedule() and switching more often so it might not be helpful, but
> just in case you are hitting some new locking or something else that I'm
> missing in the code.

Comment 35 Quan Wenli 2023-08-22 08:31:21 UTC
Created attachment 1984522 [details]
trace file for comment #29

Comment 36 Quan Wenli 2023-08-22 08:48:23 UTC
(In reply to Mike Christie from comment #32)
> Hey Quan, 3 more questions.
> 
> 1. When you launch qemu, are you just doing it from the default cgroup? You
> have not created a cgroup like how libvirtd would right?

Yes, I have not created a cgroup. I use "numactl -m 1" to bind the qemu-kvm process to NUMA node 1, and the xl710 network card is also bound to node 1.

> 2. Can you send me the entire qemu command you are running?

[stdlog] MALLOC_PERTURB_=1 numactl \
[stdlog]     -m 1  /usr/libexec/qemu-kvm \
[stdlog]     -S  \
[stdlog]     -name 'avocado-vt-vm1'  \
[stdlog]     -sandbox on  \
[stdlog]     -machine q35,memory-backend=mem-machine_mem \
[stdlog]     -device '{"id": "pcie-root-port-0", "driver": "pcie-root-port", "multifunction": true, "bus": "pcie.0", "addr": "0x1", "chassis": 1}' \
[stdlog]     -device '{"id": "pcie-pci-bridge-0", "driver": "pcie-pci-bridge", "addr": "0x0", "bus": "pcie-root-port-0"}'  \
[stdlog]     -nodefaults \
[stdlog]     -device '{"driver": "VGA", "bus": "pcie.0", "addr": "0x2"}' \
[stdlog]     -m 4096 \
[stdlog]     -object '{"size": 4294967296, "id": "mem-machine_mem", "qom-type": "memory-backend-ram"}'  \
[stdlog]     -smp 4,maxcpus=4,cores=2,threads=1,dies=1,sockets=2  \
[stdlog]     -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
[stdlog]     -chardev socket,path=/var/tmp/avocado_e72h6tuf/monitor-qmpmonitor1-20230822-042450-CObkjfDx,wait=off,server=on,id=qmp_id_qmpmonitor1  \
[stdlog]     -mon chardev=qmp_id_qmpmonitor1,mode=control \
[stdlog]     -chardev socket,path=/var/tmp/avocado_e72h6tuf/monitor-catch_monitor-20230822-042450-CObkjfDx,wait=off,server=on,id=qmp_id_catch_monitor  \
[stdlog]     -mon chardev=qmp_id_catch_monitor,mode=control \
[stdlog]     -device '{"ioport": 1285, "driver": "pvpanic", "id": "idGb5vwB"}' \
[stdlog]     -chardev socket,path=/var/tmp/avocado_e72h6tuf/serial-serial0-20230822-042450-CObkjfDx,wait=off,server=on,id=chardev_serial0 \
[stdlog]     -device '{"id": "serial0", "driver": "isa-serial", "chardev": "chardev_serial0"}'  \
[stdlog]     -chardev socket,id=seabioslog_id_20230822-042450-CObkjfDx,path=/var/tmp/avocado_e72h6tuf/seabios-20230822-042450-CObkjfDx,server=on,wait=off \
[stdlog]     -device isa-debugcon,chardev=seabioslog_id_20230822-042450-CObkjfDx,iobase=0x402 \
[stdlog]     -device '{"id": "pcie-root-port-1", "port": 1, "driver": "pcie-root-port", "addr": "0x1.0x1", "bus": "pcie.0", "chassis": 2}' \
[stdlog]     -device '{"driver": "qemu-xhci", "id": "usb1", "bus": "pcie-root-port-1", "addr": "0x0"}' \
[stdlog]     -device '{"driver": "usb-tablet", "id": "usb-tablet1", "bus": "usb1.0", "port": "1"}' \
[stdlog]     -blockdev '{"node-name": "file_image1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/root/avocado/data/avocado-vt/vl_avocado-vt-vm1_image1.qcow2", "cache": {"direct": true, "no-flush": false}}' \
[stdlog]     -blockdev '{"node-name": "drive_image1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_image1"}' \
[stdlog]     -device '{"id": "pcie-root-port-2", "port": 2, "driver": "pcie-root-port", "addr": "0x1.0x2", "bus": "pcie.0", "chassis": 3}' \
[stdlog]     -device '{"driver": "virtio-blk-pci", "id": "image1", "drive": "drive_image1", "bootindex": 0, "write-cache": "on", "bus": "pcie-root-port-2", "addr": "0x0"}' \
[stdlog]     -device '{"id": "pcie-root-port-3", "port": 3, "driver": "pcie-root-port", "addr": "0x1.0x3", "bus": "pcie.0", "chassis": 4}' \
[stdlog]     -device '{"driver": "virtio-net-pci", "mac": "9a:be:ea:f9:50:95", "id": "idLlti5H", "netdev": "idOtPBXx", "bus": "pcie-root-port-3", "addr": "0x0"}'  \
[stdlog]     -netdev tap,id=idOtPBXx,vhost=on,vhostfd=16,fd=12 \
[stdlog]     -device '{"driver": "rtl8139", "mac": "9a:37:37:37:37:6e", "id": "idnkjOSz", "netdev": "id7AwY5C", "bus": "pcie-pci-bridge-0", "addr": "0x1"}'  \
[stdlog]     -netdev tap,id=id7AwY5C,fd=15  \
[stdlog]     -vnc :0  \
[stdlog]     -rtc base=utc,clock=host,driftfix=slew  \
[stdlog]     -boot menu=off,order=cdn,once=c,strict=off \
[stdlog]     -enable-kvm \
[stdlog]     -device '{"id": "pcie_extra_root_port_0", "driver": "pcie-root-port", "multifunction": true, "bus": "pcie.0", "addr": "0x3", "chassis": 5}'

> 3. Can you tell me your qemu version?

qemu-kvm-8.0.0-11.el9.x86_64

Comment 37 Mike Christie 2023-08-22 15:44:32 UTC
(In reply to Quan Wenli from comment #33)
> (In reply to Mike Christie from comment #29)
> > Quan,
> > 
> > If the patch does not help, then for 6.4 or the current 6.5-rc kernel, 
> 
> Yes, the tcp stream performance are still bad with the applied patch on
> 6.5.0-rc5+. 

Just to confirm, do the other tests (TCP RR and TCP MAERTS) work OK now? Were there any other tests?

How bad is perf for the tcp stream test?

For the tcp stream test you mean the one marked as "TCP_STREAM (RX)" right? For that, you run netserver in the VM, then run your netperf command from a host right?

Are all the session and size test cases hitting bad perf or just certain combos?

Comment 38 Mike Christie 2023-08-22 16:03:24 UTC
Oh yeah, if the patch did help on the RR and MAERTS tests, then I think I must have some settings that are different from yours. That patch just resets the vhost thread's settings to the defaults instead of inheriting the values from the process that created it. The kthread-based approach we used for vhost did that before my patch.

I checked the sched settings you sent before and they matched what I'm using but I must be missing something.

Could you send me the package that you guys get from the customer when they do a support ticket:

https://access.redhat.com/solutions/3592#command

It has the entire system's details in there, so I can check that against my system.
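
On a RHEL 9 host that archive can be generated with the sos tool (a sketch; see the linked article for the exact options support asks for):

# run as root on the host while the test workload is active
sos report --batch    # non-interactive; writes a tarball under /var/tmp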

Comment 39 Mike Christie 2023-09-05 16:42:30 UTC
Hey Quan,

Wasn't sure if you've been busy with releases or vacations. Just wanted to make sure you saw the last 2 questions/comments.

Comment 40 Quan Wenli 2023-09-07 07:22:29 UTC
(In reply to Mike Christie from comment #37)
> (In reply to Quan Wenli from comment #33)
> > (In reply to Mike Christie from comment #29)
> > > Quan,
> > > 
> > > If the patch does not help, then for 6.4 or the current 6.5-rc kernel, 
> > 
> > Yes, the tcp stream performance are still bad with the applied patch on
> > 6.5.0-rc5+. 
> 
> Just to confirm, that the other tests (TCP RR and TCP MAERT) work ok now?
> Were there any other tests?

There is a pktgen test which runs on the VM; it shows a 50% regression on the RX side. 

> 
> How bad is perf for the tcp stream test?

From the comment #22 results, the regression ranges from 8% to 19% on TCP_RR, and about a 30% regression is observed on TCP_STREAM (including both TX and RX).


> 
> For the tcp stream test you mean the one marked as "TCP_STREAM (RX)" right?

No, I mean both TCP_STREAM (RX) and TCP_MAERTS (TX). I run netserver in the VM and then conduct separate tests with netperf from another host, marked as TCP_STREAM (RX) and TCP_MAERTS (TX) respectively.

> For that, you run netserver in the VM, then run your netperf command from a
> host right?
> 
> Are all the session and size test cases hitting bad perf or just certain
> combos?

From the comment #22 results, most of the tests show the regression.

Comment 41 Quan Wenli 2023-09-07 08:45:34 UTC
Created attachment 1987504 [details]
Replied for comment #38, "sos report" while test is running

Comment 42 Quan Wenli 2023-09-07 08:46:20 UTC
(In reply to Mike Christie from comment #38)
> Oh yeah, if the patch did help on the RR and MAERT tests, then I think I
> must have some settings that are different than you. That patch just resets
> the vhost thread's settings to the defaults instead of inheriting the values
> from the process that created it. The kthread based approach we used for
> vhost did that before my patch.
> 
> I checked the sched settings you sent before and they matched what I'm using
> but I must be missing something.
> 
> Could you send me the package that you guys get from the customer when they
> do a support ticket:
> 
> https://access.redhat.com/solutions/3592#command
> 
> It has the entire system's details in there, so I can check that against my
> system.

Yes, please check the attachment in comment #41.

Comment 43 Mike Christie 2023-09-12 16:50:41 UTC
(In reply to Quan Wenli from comment #42)
> Yes, please check the attachment in comment #41.

I don't see a comment #41. Is it marked private?

Comment 44 Quan Wenli 2023-09-13 02:40:54 UTC
(In reply to Mike Christie from comment #43)
> (In reply to Quan Wenli from comment #42)
> > Yes, please check the attachment in comment #41.
> 
> I don't see a comment #41. Is it marked private?

Please check again, I removed the private flag.

Comment 45 RHEL Program Management 2023-09-22 13:56:11 UTC
Issue migration from Bugzilla to Jira is in process at this time. This will be the last message in Jira copied from the Bugzilla bug.

Comment 46 RHEL Program Management 2023-09-22 13:59:03 UTC
This BZ has been automatically migrated to the issues.redhat.com Red Hat Issue Tracker. All future work related to this report will be managed there.

Due to differences in account names between systems, some fields were not replicated.  Be sure to add yourself to Jira issue's "Watchers" field to continue receiving updates and add others to the "Need Info From" field to continue requesting information.

To find the migrated issue, look in the "Links" section for a direct link to the new issue location. The issue key will have an icon of 2 footprints next to it, and begin with "RHEL-" followed by an integer.  You can also find this issue by visiting https://issues.redhat.com/issues/?jql= and searching the "Bugzilla Bug" field for this BZ's number, e.g. a search like:

"Bugzilla Bug" = 1234567

In the event you have trouble locating or viewing this issue, you can file an issue by sending mail to rh-issues. You can also visit https://access.redhat.com/articles/7032570 for general account information.

