Bug 2188899
| Summary: | [nfv virt][pvp][cross numa] The vm's vhostuser interface throughput drops significantly after adding emulatorpin cfg | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Yanghang Liu <yanghliu> |
| Component: | qemu-kvm | Assignee: | Virtualization Maintenance <virt-maint> |
| qemu-kvm sub component: | Networking | QA Contact: | Yanghang Liu <yanghliu> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | chayang, coli, jinzhao, juzhang, lvivier, maxime.coquelin, mprivozn, virt-maint, yama, yanghliu |
| Version: | 9.3 | Keywords: | Triaged |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-09-14 06:41:31 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Michal, as you worked on a related bug: is the configuration used in this BZ valid? Is the performance drop expected?

I don't think it is expected. The linked bug(s) are about ThreadContext, i.e. how QEMU allocates the memory. The emulator thread isn't affected.

Yanghang, can you please share the QEMU cmd line in both cases? Also, what is the CPU topology? I'm wondering whether those CPU ids from <emulatorpin/> aren't just sibling hyperthreads of those in <vcpupin/>, in which case the emulator thread can't really run while a vCPU is running. And maybe without <emulatorpin/> the kernel is free to schedule the emulator thread onto a different core.

(In reply to Michal Privoznik from comment #2)
> I don't think it is expected. The linked bug(s) are about ThreadContext,
> i.e. how QEMU allocates the memory. The emulator thread isn't affected.
>
> Yanghang, can you please share the QEMU cmd line in both cases? Also, what
> is the CPU topology? I'm wondering whether those CPU ids from <emulatorpin/>
> aren't just sibling hyperthreads of those in <vcpupin/>, in which case the
> emulator thread can't really run while a vCPU is running. And maybe without
> <emulatorpin/> the kernel is free to schedule the emulator thread onto a
> different core.

Hi Michal,

Thanks for the confirmation. I have listed the related info in Comment 0; please let me know if I need to provide more info.
The detailed test logs as well as the full domain XML are available from:

(1) The detailed test log with the emulatorpin cfg:
http://10.73.72.41/log/2023-04-22_23:53/nfv_pvp_2q_cross_numa_with_emulatorpin

(2) The detailed test log without the emulatorpin cfg:
http://10.73.72.41/log/2023-04-22_23:53/nfv_pvp_2q_cross_numa_without_emulatorpin

The domain's CPU tuning looks like this:

<cputune>
  <vcpupin vcpu='0' cpuset='30'/>
  <vcpupin vcpu='1' cpuset='28'/>
  <vcpupin vcpu='2' cpuset='26'/>
  <vcpupin vcpu='3' cpuset='24'/>
  <vcpupin vcpu='4' cpuset='22'/>
  <vcpupin vcpu='5' cpuset='20'/>
  <emulatorpin cpuset='25,27,29,31'/>   <!-- I run my tests with/without this cfg -->
</cputune>

The host cores that dpdk-testpmd runs on are 15,31,29,27,25,23,21,19,17. The related command line is:

# dpdk-testpmd -l 15,31,29,27,25,23,21,19,17 --socket-mem 1024,1024 -n 4 --vdev 'net_vhost0,iface=/tmp/vhost-user1,queues=2,client=1,iommu-support=1' --vdev 'net_vhost1,iface=/tmp/vhost-user2,queues=2,client=1,iommu-support=1' -b 0000:3b:00.0 -b 0000:3b:00.1 -d /usr/lib64/librte_net_vhost.so -- --portmask=f -i --rxd=512 --txd=512 --rxq=2 --txq=2 --nb-cores=8 --forward-mode=io

This issue can still be reproduced in:
qemu-kvm-8.0.0-4.el9.x86_64
libvirt-9.3.0-2.el9.x86_64
5.14.0-319.el9.x86_64
seabios-bin-1.16.1-1.el9.noarch
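One detail worth flagging in the setup above: the <emulatorpin/> cpuset (25,27,29,31) is entirely contained in the core list handed to the host dpdk-testpmd (-l 15,31,29,27,25,23,21,19,17), so the emulator thread is pinned onto CPUs that already run busy PMD polling loops. A small POSIX-shell sketch of that check, with the core lists copied from this comment:

```shell
# CPUs given to <emulatorpin/> and to the host dpdk-testpmd (-l ...):
emulatorpin="25 27 29 31"
testpmd_cores="15 31 29 27 25 23 21 19 17"

# Collect every emulatorpin CPU that is also a testpmd PMD core.
overlap=""
for c in $emulatorpin; do
    case " $testpmd_cores " in
        *" $c "*) overlap="$overlap $c" ;;
    esac
done
echo "emulatorpin CPUs shared with testpmd:$overlap"
```

Every CPU in the emulatorpin set collides with a PMD core, which by itself could explain a large throughput hit; the cross-NUMA placement discussed later in the bug would compound it.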
Check point:

Test *with* <emulatorpin cpuset='25,27,29,31'/> cfg:
Throughput (Mpps): 3.132936

Test *without* <emulatorpin cpuset='25,27,29,31'/> cfg:
Throughput (Mpps): 21.127461
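For scale, the two checkpoint figures above work out to roughly an 85% throughput drop; a quick awk sketch of the arithmetic:

```shell
# Relative throughput drop between the two runs quoted above.
awk 'BEGIN {
    with_pin    = 3.132936   # Mpps, with <emulatorpin/>
    without_pin = 21.127461  # Mpps, without <emulatorpin/>
    printf "drop: %.1f%%\n", 100 * (1 - with_pin / without_pin)
}'
```

This matches the "drops around 85%" figure quoted in the bug description.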
This issue can still be reproduced in:
host:
qemu-kvm-8.0.0-9.el9.x86_64
tuned-2.20.0-1.el9.noarch
libvirt-9.5.0-5.el9.x86_64
openvswitch3.1-3.1.0-42.el9fdp.x86_64
dpdk-22.11-3.el9_2.x86_64
edk2-ovmf-20230524-2.el9.noarch
guest:
5.14.0-346.el9.x86_64
Test log: http://10.73.72.41/log/2023-08-07_20:17/nfv_pvp_1q_cross_numa
Check point:
[1] The statistics of dpdk-testpmd in the VM:
+++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
RX-packets: 12822137264 RX-dropped: 494648264 RX-total: 13316785528
TX-packets: 12548106232 TX-dropped: 274031032 TX-total: 12822137264
[2] The VM Throughput (Mpps): 2.240211
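The accumulated counters above also imply non-trivial drop rates inside the VM; a sketch of deriving them from the quoted testpmd statistics:

```shell
# Drop rates implied by the accumulated testpmd counters above.
awk 'BEGIN {
    rx_dropped = 494648264; rx_total = 13316785528
    tx_dropped = 274031032; tx_total = 12822137264
    printf "RX drop rate: %.2f%%\n", 100 * rx_dropped / rx_total
    printf "TX drop rate: %.2f%%\n", 100 * tx_dropped / tx_total
}'
```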
Hi Laurent,

May I ask if you could help cc some developers to look at this bug? From a QE point of view, we expect this BZ to be handled with priority, because it has customer impact.

Yanghang, what is the purpose of using emulatorpin if it drops the performance? Why does the customer want to use it?

Could you provide the QEMU command line to reproduce the problem without libvirt?

Could you also provide the result of numactl -H on the host?

(In reply to Laurent Vivier from comment #8)

Hi Laurent,

> what is the purpose of using emulatorpin if it drops the performance?
> Why does the customer want to use it?

As far as I know, <emulatorpin> is a CPU tuning element that pins the qemu-kvm emulator thread to physical CPUs. Generally <emulatorpin> should optimize the VM's performance. <emulatorpin> is also a common cfg used by the OSP NFV QE.

> Could you provide the QEMU command line to reproduce the problem without libvirt?

I have never tried to set up cpupin and emulatorpin at the QEMU layer before, as we always test NFV virt via libvirt.

Currently, I wonder whether this issue is due to the VM's CPUs being too busy, and I will retry with an updated test cfg (e.g. increasing the VM CPU count, pinning the emulator to the housekeeping CPUs, ...).
The related qemu-kvm cmdline generated by libvirt when I reproduce this issue:

/usr/libexec/qemu-kvm \
-name guest=rhel9.3,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-1-rhel9.3/master-key.aes"}' \
-machine pc-q35-rhel9.2.0,usb=off,vmport=off,kernel_irqchip=split,dump-guest-core=off \
-accel kvm \
-cpu Skylake-Server-IBRS,ss=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,clflushopt=on,umip=on,pku=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rsba=on,skip-l1dfl-vmentry=on,pschange-mc-no=on,tsc-deadline=on,pmu=off \
-m 8192 \
-overcommit mem-lock=on \
-smp 6,sockets=3,dies=1,cores=1,threads=2 \
-object '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"/dev/hugepages/libvirt/qemu/1-rhel9.3","share":true,"prealloc":true,"size":8589934592,"host-nodes":[0],"policy":"bind"}' \
-numa node,nodeid=0,cpus=0-5,memdev=ram-node0 \
-uuid 3a17e48a-e155-11ed-bfa4-20040fec000c \
-display none \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=22,server=on,wait=off \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-boot strict=on \
-device '{"driver":"intel-iommu","id":"iommu0","intremap":"on","caching-mode":true,"device-iotlb":true}' \
-device '{"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x2"}' \
-device '{"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x2.0x1"}' \
-device '{"driver":"pcie-root-port","port":18,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x2.0x2"}' \
-device '{"driver":"pcie-root-port","port":19,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x2.0x3"}' \
-device '{"driver":"pcie-root-port","port":20,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x2.0x4"}' \
-device '{"driver":"pcie-root-port","port":21,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x2.0x5"}' \
-device '{"driver":"pcie-root-port","port":22,"chassis":7,"id":"pci.7","bus":"pcie.0","addr":"0x2.0x6"}' \
-blockdev '{"driver":"file","filename":"/home/images_nfv-virt-rt-kvm/rhel9.3.qcow2","aio":"threads","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device '{"driver":"virtio-blk-pci","iommu_platform":true,"ats":true,"bus":"pci.2","addr":"0x0","drive":"libvirt-1-format","id":"virtio-disk0","bootindex":1,"write-cache":"on"}' \
-netdev '{"type":"tap","fd":"23","vhost":true,"vhostfd":"25","id":"hostnet0"}' \
-device '{"driver":"virtio-net-pci","iommu_platform":true,"ats":true,"netdev":"hostnet0","id":"net0","mac":"88:66:da:5f:dd:11","bus":"pci.1","addr":"0x0"}' \
-chardev socket,id=charnet1,path=/tmp/vhost-user1,server=on \
-netdev '{"type":"vhost-user","chardev":"charnet1","queues":2,"id":"hostnet1"}' \
-device '{"driver":"virtio-net-pci","iommu_platform":true,"ats":true,"mq":true,"vectors":6,"rx_queue_size":1024,"netdev":"hostnet1","id":"net1","mac":"88:66:da:5f:dd:12","bus":"pci.6","addr":"0x0"}' \
-chardev socket,id=charnet2,path=/tmp/vhost-user2,server=on \
-netdev '{"type":"vhost-user","chardev":"charnet2","queues":2,"id":"hostnet2"}' \
-device '{"driver":"virtio-net-pci","iommu_platform":true,"ats":true,"mq":true,"vectors":6,"rx_queue_size":1024,"netdev":"hostnet2","id":"net2","mac":"88:66:da:5f:dd:13","bus":"pci.7","addr":"0x0"}' \
-chardev pty,id=charserial0 \
-device '{"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0}' \
-audiodev '{"id":"audio1","driver":"none"}' \
-global ICH9-LPC.noreboot=off \
-watchdog-action reset \
-device '{"driver":"virtio-balloon-pci","iommu_platform":true,"ats":true,"id":"balloon0","bus":"pci.4","addr":"0x0"}' \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on

> Could you also provide the result of numactl -H on the host?

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
node 0 size: 31616 MB
node 0 free: 9008 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63
node 1 size: 32191 MB
node 1 free: 11144 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

(In reply to Yanghang Liu from comment #9)
...
> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
> node 0 size: 31616 MB
> node 0 free: 9008 MB
> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63
> node 1 size: 32191 MB
> node 1 free: 11144 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10

I think you should at least put the emulator cpuset on the same node as the vCPU ones. Something like:

<cputune>
  <vcpupin vcpu='0' cpuset='30'/>
  <vcpupin vcpu='1' cpuset='28'/>
  <vcpupin vcpu='2' cpuset='26'/>
  <vcpupin vcpu='3' cpuset='24'/>
  <vcpupin vcpu='4' cpuset='22'/>
  <vcpupin vcpu='5' cpuset='20'/>
  <emulatorpin cpuset='12,14,16,18'/>
</cputune>

Could you try?

Interesting read:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_tuning_and_optimization_guide/sect-virtualization_tuning_optimization_guide-numa-numa_and_libvirt#sect-Virtualization_Tuning_Optimization_Guide-NUMA-NUMA_and_libvirt-Using_emulatorpin

"In Red Hat Enterprise Linux 7, automatic NUMA balancing is enabled by default. Automatic NUMA balancing reduces the need for manually tuning <emulatorpin>, since the vhost-net emulator thread follows the vCPU tasks more reliably.
For more information about automatic NUMA balancing, see Section 9.2, "Automatic NUMA Balancing"."

For RHEL-9, we also have:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/configuring_and_managing_virtualization/optimizing-virtual-machine-performance-in-rhel_configuring-and-managing-virtualization#configuring-numa-in-a-virtual-machine_optimizing-virtual-machine-cpu-performance

Could you provide the result of "numastat -c qemu-kvm"? See:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/configuring_and_managing_virtualization/optimizing-virtual-machine-performance-in-rhel_configuring-and-managing-virtualization#virtual-machine-performance-monitoring-tools_optimizing-virtual-machine-performance-in-rhel

Hi Laurent,

Thanks for the info :)

I have applied CPU tuning to all my NFV virt cases now, and the test result looks good to me. I can get the expected VM throughput: 20.833937 Mpps.

Test env:
5.14.0-362.1.1.el9_3.x86_64
qemu-kvm-8.0.0-13.el9.x86_64
tuned-2.20.0-1.el9.noarch
libvirt-9.5.0-6.el9.x86_64
python3-libvirt-9.3.0-1.el9.x86_64
openvswitch3.1-3.1.0-52.el9fdp.x86_64
dpdk-22.11-4.el9.x86_64
edk2-ovmf-20230524-3.el9.noarch
seabios-bin-1.16.1-1.el9.noarch

The host CPU number: 64 (0-63)
The host NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63
The host isolated CPU list: 2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
The host core-numa CPU list used for dpdk-testpmd: 39,41,43,45,47,49,51,53,57,59,61,63
The host CPU list pinned to the VM's vCPUs: 22,32,30,28,26,24,34,36
The host CPU list used for emulatorpin: 3,5,7,9
The VM CPU number: 8 (I increased the vCPU count from 6 to 8, in case the VM vCPUs are too busy)
The VM CPU number used for the VM dpdk-testpmd: 5

Let us wait for my automated regression test results; if all pass, I will close this bug as NOTABUG. (Currently I need to wait for the verification of the test blocker Bug 2234390, and a round of my automated tests takes around 15 hours.)

Closing this bug as NOTABUG, as the performance is normal after tuning the CPU.
****************************************************
Packets_loss Frame_Size(Byte) Run_No Throughput(Mpps)
0 64 0 20.892077
****************************************************
The regression test result : PASS
Related log:
http://10.73.72.41/log/2023-09-12_20:09/
http://10.73.72.41/log/2023-09-06_16:46/
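Laurent's same-node suggestion is easy to audit on this host: per the numactl -H output earlier in the bug, node 0 owns the even CPU ids and node 1 the odd ones, so CPU-id parity alone identifies the node. A sketch using the original pinning from this BZ (the parity rule is specific to this box's topology):

```shell
# On this host (see numactl -H earlier): even CPU id -> node 0, odd -> node 1.
node_of() { echo $(( $1 % 2 )); }

for c in 30 28 26 24 22 20; do
    echo "vcpupin cpu$c -> node $(node_of "$c")"
done
for c in 25 27 29 31; do
    echo "emulatorpin cpu$c -> node $(node_of "$c")"
done
```

All six vCPU pins land on node 0 while the original emulatorpin set sits entirely on node 1, so the emulator thread was running cross-NUMA relative to the vCPUs and guest memory, consistent with the throughput recovering once the pinning was retuned.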
Description of problem:

The vm's vhostuser interface throughput drops significantly after adding emulatorpin cfg.

Version-Release number of selected component (if applicable):
5.14.0-301.el9.x86_64
qemu-kvm-7.2.0-14.el9_2.x86_64
libvirt-9.2.0-1.el9.x86_64

How reproducible:
100%

Steps to Reproduce:

1. Set up the host kernel options, such as CPU isolation, huge pages, and IOMMU:

# grubby --args="iommu=pt intel_iommu=on default_hugepagesz=1G" --update-kernel=`grubby --default-kernel`
# echo "isolated_cores=2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,31,29,27,25,23,21,19,17,15,13,11" >> /etc/tuned/cpu-partitioning-variables.conf
# tuned-adm profile cpu-partitioning
# reboot

2. Start a dpdk-testpmd on the host:

# echo 20 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# echo 20 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
# modprobe vfio
# modprobe vfio-pci
# dpdk-devbind.py --bind=vfio-pci 0000:5e:00.0
# dpdk-devbind.py --bind=vfio-pci 0000:5e:00.1
# dpdk-devbind.py --bind=vfio-pci 0000:60:00.0
# dpdk-testpmd -l 15,31,29,27,25,23,21,19,17 --socket-mem 1024,1024 -n 4 --vdev 'net_vhost0,iface=/tmp/vhost-user1,queues=2,client=1,iommu-support=1' --vdev 'net_vhost1,iface=/tmp/vhost-user2,queues=2,client=1,iommu-support=1' -b 0000:3b:00.0 -b 0000:3b:00.1 -d /usr/lib64/librte_net_vhost.so -- --portmask=f -i --rxd=512 --txd=512 --rxq=2 --txq=2 --nb-cores=8 --forward-mode=io
testpmd> set portlist 0,2,1,3
testpmd> start

3. Start a domain with vhost-user interfaces and <emulatorpin cpuset='25,27,29,31'/>:

<cputune>
  <vcpupin vcpu='0' cpuset='30'/>
  <vcpupin vcpu='1' cpuset='28'/>
  <vcpupin vcpu='2' cpuset='26'/>
  <vcpupin vcpu='3' cpuset='24'/>
  <vcpupin vcpu='4' cpuset='22'/>
  <vcpupin vcpu='5' cpuset='20'/>
  <emulatorpin cpuset='25,27,29,31'/>
</cputune>
...
<interface type='vhostuser'>
  <mac address='88:66:da:5f:dd:12'/>
  <source type='unix' path='/tmp/vhost-user1' mode='server'/>
  <model type='virtio'/>
  <driver name='vhost' queues='2' rx_queue_size='1024' iommu='on' ats='on'/>
</interface>
<interface type='vhostuser'>
  <mac address='88:66:da:5f:dd:13'/>
  <source type='unix' path='/tmp/vhost-user2' mode='server'/>
  <model type='virtio'/>
  <driver name='vhost' queues='2' rx_queue_size='1024' iommu='on' ats='on'/>
</interface>

Note: the full domain xml is in the test log.

4. Set up the kernel options in the domain:

# grubby --args="iommu=pt intel_iommu=on default_hugepagesz=1G" --update-kernel=`grubby --default-kernel`
# echo "isolated_cores=1,2,3,4,5" >> /etc/tuned/cpu-partitioning-variables.conf
# tuned-adm profile cpu-partitioning
# reboot

5. Start a dpdk-testpmd in the domain:

# echo 2 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
# modprobe vfio
# modprobe vfio-pci
# dpdk-devbind.py --bind=vfio-pci 0000:06:00.0
# dpdk-devbind.py --bind=vfio-pci 0000:07:00.0
# dpdk-testpmd -l 1,2,3,4,5 -n 4 -d /usr/lib64/librte_net_virtio.so -- --nb-cores=4 -i --disable-rss --rxd=512 --txd=512 --rxq=2 --txq=2
testpmd> start

6. Run the MoonGen tests:

# ./build/MoonGen examples/opnfv-vsperf.lua > /tmp/throughput.log

7. Check the throughput:

****************************************************
Packets_loss Frame_Size(Byte) Run_No Throughput(Mpps)
0            64               0      3.034078
****************************************************

8. Repeat the above step 1 - step 7, but without <emulatorpin cpuset='25,27,29,31'/>:

****************************************************
Packets_loss Frame_Size(Byte) Run_No Throughput(Mpps)
0            64               0      21.127439
****************************************************

Actual results:
The vm's vhostuser interface throughput drops by around 85% after adding the emulatorpin cfg.

Expected results:
No significant throughput drop.

Additional info:

(1) The detailed test log with the emulatorpin cfg:
http://10.73.72.41/log/2023-04-22_23:53/nfv_pvp_2q_cross_numa_with_emulatorpin

(2) The detailed test log without the emulatorpin cfg:
http://10.73.72.41/log/2023-04-22_23:53/nfv_pvp_2q_cross_numa_without_emulatorpin

(3) Related bugs about the emulatorpin xml:
Bug 2154750 - [numatune][cputune] qemu-kvm: Setting CPU affinity failed: Invalid argument
Bug 2185039 - [numatune][cputune] qemu-kvm: Setting CPU affinity failed: Invalid argument [rhel-9.2.0.z]
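Incidentally, the isolated_cores value in step 1 is long and easy to mistype. A small self-contained shell sketch for sanity-checking such a list before handing it to tuned (the checks shown are illustrative, not part of the original procedure):

```shell
# Sanity-check an isolated_cores list: count entries, detect duplicates,
# and flag any CPU id outside this host's 0-63 range.
cores="2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,31,29,27,25,23,21,19,17,15,13,11"

total=$(echo "$cores" | tr ',' '\n' | wc -l)
unique=$(echo "$cores" | tr ',' '\n' | sort -n | uniq | wc -l)
out_of_range=$(echo "$cores" | tr ',' '\n' | awk '$1 < 0 || $1 > 63' | wc -l)

echo "total=$total unique=$unique out_of_range=$out_of_range"
```

If total and unique differ, a CPU id was listed twice; a nonzero out_of_range means an id does not exist on this 64-CPU host.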