Description of problem:
OpenStack guest-to-guest netperf/TCP_STREAM throughput over neutron-openvswitch/bnx2x dropped below 4 Gb/s.

Experiment: netperf/TCP_STREAM 180-second test, repeated five times. See below for min/ave/max stats:

*** guest to guest over neutron-openvswitch/bnx2x --- VLAN ***
min/ave/max = 3820.33 / 3943.95 / 3999.42 (Mbps)

For reference, see the following test case:

*** VM to VM over OVS/bnx2x --- VLAN ***
min/ave/max = 9149.72 / 9298.19 / 9372.23 (Mbps)

Version-Release number of selected component (if applicable):
Openstack: 2015-01-23.1/RH7-RHOS-6.0
RHEL7.1 kernel: 3.10.0-227.el7.x86_64
ovs_version: "2.1.3"

How reproducible:
reproducible

Steps to Reproduce:
Two hosts are needed for the experiment.
1. Configure one host as Controller/Network/Compute, the other as Compute only
2. Configure a VLAN provider network using the 10 Gb bnx2x NIC to bridge the neutron-openvswitch network
3. Configure one guest per host
4. Run netperf/TCP_STREAM between the two guests

Actual results:
The average throughput rate was under 4 Gb/s, as reported above.

Expected results:
Should be around 9 Gb/s.

Additional info:
NOTE: netperf/TCP_MAERTS testing showed the same issue.
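The experiment above can be sketched as a small driver; the guest IP, run count, and per-run throughput numbers below are placeholders (parsing real netperf output is left out), and `stream_cmd`/`summarize` are illustrative names, not part of any tool used in this bug:

```python
import statistics

def stream_cmd(server_ip, seconds=180):
    """Build the netperf TCP_STREAM command line used in this experiment.
    (For the TCP_MAERTS variant, swap the test name.) netserver must be
    running on the peer guest."""
    return ["netperf", "-H", server_ip, "-t", "TCP_STREAM", "-l", str(seconds)]

def summarize(rates_mbps):
    """Reduce per-run throughput numbers to the min/ave/max triple
    reported in this bug."""
    return (min(rates_mbps),
            round(statistics.mean(rates_mbps), 2),
            max(rates_mbps))

print(stream_cmd("192.0.2.10"))  # placeholder guest IP
print(summarize([3820.33, 3900.0, 3950.0, 3999.42, 3980.0]))  # hypothetical runs
```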
After rebooting both hosts and re-running the same experiment, netperf/TCP_STREAM got 5+ Gbps:
5344.24 / 5540.34 / 5828.51 (Mbps)
5371.65 / 5567.38 / 5696.73 (Mbps)
netperf/TCP_MAERTS got:
3004.93 / 3333.56 / 3529.27 (Mbps)
2960.47 / 3300.44 / 3681.04 (Mbps)
Created attachment 987198 [details]
Spreadsheet of netperf/TCP_STREAM throughput rates over bnx2x and ixgbe

The spreadsheet includes netperf/TCP_STREAM throughput rates over the following data paths:
NIC to NIC
OVS/NIC to OVS/NIC
VM/OVS/NIC to VM/OVS/NIC
Guest/Neutron/OVS/NIC to Guest/Neutron/OVS/NIC
where NIC = bnx2x or ixgbe.

Experiment: 10 netperf/TCP_STREAM tests, each lasting 180 seconds.
Found the root cause:

The vhost and qemu-kvm processes associated with the Neutron/OVS guest have their affinity set to the even-numbered CPUs only --- 55555555. These should be changed to the default value FFFFFFFF. However, the affinity of either process cannot be changed:

[root@qe-dell-ovs4 jhsiao]# taskset -p ffffffff 29783
pid 29783's current affinity mask: 55555555
pid 29783's new affinity mask: 55555555
[root@qe-dell-ovs4 jhsiao]# taskset -p ffffffff 29832
pid 29832's current affinity mask: 55555555
pid 29832's new affinity mask: 55555555
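To see why 55555555 means "even CPUs only", a taskset-style mask can be decoded bit by bit; `mask_to_cpus` is an illustrative helper, assuming the 32-CPU host from this bug:

```python
def mask_to_cpus(mask, ncpus=32):
    """Expand a taskset-style hex affinity mask into the CPU ids it allows."""
    return [cpu for cpu in range(ncpus) if mask & (1 << cpu)]

# 0x55555555 sets every other bit, i.e. only the even-numbered CPUs 0,2,...,30
print(mask_to_cpus(0x55555555))
# 0xFFFFFFFF (the default) allows all 32 CPUs
print(mask_to_cpus(0xFFFFFFFF) == list(range(32)))
```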
If we look at the 'virsh capabilities' output for the compute host, we can see the topology of the host:

  <topology>
    <cells num='2'>
      <cell id='0'>
        <memory unit='KiB'>67062188</memory>
        <pages unit='KiB' size='4'>16765547</pages>
        <pages unit='KiB' size='2048'>0</pages>
        <distances>
          <sibling id='0' value='10'/>
          <sibling id='1' value='20'/>
        </distances>
        <cpus num='16'>
          <cpu id='0' socket_id='0' core_id='0' siblings='0,16'/>
          <cpu id='2' socket_id='0' core_id='1' siblings='2,18'/>
          <cpu id='4' socket_id='0' core_id='2' siblings='4,20'/>
          <cpu id='6' socket_id='0' core_id='3' siblings='6,22'/>
          <cpu id='8' socket_id='0' core_id='4' siblings='8,24'/>
          <cpu id='10' socket_id='0' core_id='5' siblings='10,26'/>
          <cpu id='12' socket_id='0' core_id='6' siblings='12,28'/>
          <cpu id='14' socket_id='0' core_id='7' siblings='14,30'/>
          <cpu id='16' socket_id='0' core_id='0' siblings='0,16'/>
          <cpu id='18' socket_id='0' core_id='1' siblings='2,18'/>
          <cpu id='20' socket_id='0' core_id='2' siblings='4,20'/>
          <cpu id='22' socket_id='0' core_id='3' siblings='6,22'/>
          <cpu id='24' socket_id='0' core_id='4' siblings='8,24'/>
          <cpu id='26' socket_id='0' core_id='5' siblings='10,26'/>
          <cpu id='28' socket_id='0' core_id='6' siblings='12,28'/>
          <cpu id='30' socket_id='0' core_id='7' siblings='14,30'/>
        </cpus>
      </cell>
      <cell id='1'>
        <memory unit='KiB'>67108864</memory>
        <pages unit='KiB' size='4'>16777216</pages>
        <pages unit='KiB' size='2048'>0</pages>
        <distances>
          <sibling id='0' value='20'/>
          <sibling id='1' value='10'/>
        </distances>
        <cpus num='16'>
          <cpu id='1' socket_id='1' core_id='0' siblings='1,17'/>
          <cpu id='3' socket_id='1' core_id='1' siblings='3,19'/>
          <cpu id='5' socket_id='1' core_id='2' siblings='5,21'/>
          <cpu id='7' socket_id='1' core_id='3' siblings='7,23'/>
          <cpu id='9' socket_id='1' core_id='4' siblings='9,25'/>
          <cpu id='11' socket_id='1' core_id='5' siblings='11,27'/>
          <cpu id='13' socket_id='1' core_id='6' siblings='13,29'/>
          <cpu id='15' socket_id='1' core_id='7' siblings='15,31'/>
          <cpu id='17' socket_id='1' core_id='0' siblings='1,17'/>
          <cpu id='19' socket_id='1' core_id='1' siblings='3,19'/>
          <cpu id='21' socket_id='1' core_id='2' siblings='5,21'/>
          <cpu id='23' socket_id='1' core_id='3' siblings='7,23'/>
          <cpu id='25' socket_id='1' core_id='4' siblings='9,25'/>
          <cpu id='27' socket_id='1' core_id='5' siblings='11,27'/>
          <cpu id='29' socket_id='1' core_id='6' siblings='13,29'/>
          <cpu id='31' socket_id='1' core_id='7' siblings='15,31'/>
        </cpus>
      </cell>
    </cells>
  </topology>

Meanwhile, if we look at the guest in question, we see it has CPU placement set:

# virsh dumpxml instance-00000005 | grep placement
<vcpu placement='static' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'>1</vcpu>

So this guest is basically being set to run on the first NUMA node. This is good, because otherwise it would randomly float across NUMA nodes and suffer degraded performance due to cross-NUMA memory access.
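The link between the topology and the observed affinity mask can be made explicit. The per-node CPU lists below are transcribed from the 'virsh capabilities' output above; `cpuset_string` and `affinity_mask` are illustrative helper names:

```python
# Per-node CPU ids transcribed from the topology above:
# cell 0 holds the even-numbered CPUs, cell 1 the odd-numbered ones.
NODE_CPUS = {
    0: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30],
    1: [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31],
}

def cpuset_string(node):
    """Render a node's CPUs the way libvirt's cpuset attribute does."""
    return ",".join(str(c) for c in NODE_CPUS[node])

def affinity_mask(node):
    """Fold a node's CPUs into a taskset-style mask."""
    mask = 0
    for cpu in NODE_CPUS[node]:
        mask |= 1 << cpu
    return mask

# Node 0 reproduces both the guest's cpuset and the observed 55555555 mask
print(cpuset_string(0))
print(f"{affinity_mask(0):08x}")
```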
Also, there is nothing else running on this host that is competing for CPU resources with the single guest vCPU, so I'm sceptical that this CPU pinning has an impact on the TCP network performance.
(In reply to Daniel Berrange from comment #8)
> Also, there is nothing else running on this host that is competing for CPU
> resources with the single guest vCPU, so I'm sceptical that this CPu pinning
> has an impact on the TCP network performance

Let me explain the performance impact a little bit. It's all about locality. There are two banks of CPUs --- even and odd --- and bnx2x belongs to the odd bank, based on the HW configuration.

For an OVS VM, the affinity of its vhost and qemu-kvm tasks is set to FFFFFFFF, and when netperf/TCP_STREAM is running, the odd CPUs get utilized. That yields near line rate performance.

For a Nova guest, the affinity of both vhost and qemu-kvm is instead set to even CPUs only. When netperf/TCP_STREAM is running, some four even CPUs and one odd CPU get heavily used, and that degrades the netperf/TCP_STREAM throughput rate to under 6 Gb/s.

The key question is: what causes only even CPUs to be used, according to the XML file? Never seen this before.
So, IIUC, the network interface that the guest is connected to is attached to NUMA node 1, while the guest is placed on NUMA node 0.

There's not really anything we can do about this in general. While Nova will soon make use of NUMA locality info for assigned PCI devices, there is no equivalent work being done to take locality into account when just connecting to openvswitch NICs. It'll just be pot luck whether any guest is local to the NIC in question or not.

If the testing requires that the guest be running on a specific host NUMA node in order to reach the throughput, it was only ever working by luck --- eg there were sufficiently few guests on the host that the kernel happened to schedule the guest on the right NUMA node to achieve the performance expected. As you run more guests on the host, inevitably some are going to be on different NUMA nodes and so not meet the performance figures. The new Nova NUMA placement logic just means we see this effect upfront during testing, instead of only once customers load their hosts up with many guests.

IMHO the only real way to solve this is to make the Neutron/Nova integration smarter, so the guest can be connected to a physical NIC that is local to the NUMA node that Nova placed the guest on. AFAIK no one is working on that.
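For what it's worth, the locality mismatch can be detected on the host without any Nova involvement, by reading the NIC's NUMA node from sysfs and comparing it against the node the guest is pinned to. This is only a diagnostic sketch; the interface name is a placeholder, and `nic_numa_node`/`is_local` are illustrative names:

```python
from pathlib import Path

def nic_numa_node(ifname):
    """Read the NUMA node a NIC's PCI device is attached to.
    The kernel reports -1 on single-node systems or when it doesn't know."""
    path = Path(f"/sys/class/net/{ifname}/device/numa_node")
    return int(path.read_text()) if path.exists() else -1

def is_local(nic_node, guest_node):
    """True when the guest's NUMA node matches the NIC's, i.e. no
    cross-node hop on the data path."""
    return nic_node < 0 or nic_node == guest_node

# In this bug: bnx2x sits on node 1, while the guest is pinned to node 0
print(is_local(1, 0))
```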
As I mentioned above, for an OVS VM the affinity of its vhost and qemu-kvm is set to the default --- all CPUs (FFFFFFFF in this case) --- and the kernel/driver drives the utilization of CPUs. For ixgbe traffic, only even CPUs are involved; for bnx2x, only odd CPUs are involved. I still don't know why the even bank was allocated for bnx2x with the current algorithm.
It's not entirely clear to me what we want to do with this initially. Dan, do you have any idea how/where the vhost and qemu-kvm affinities are set right now? Based on the comments, this seems to be Nova-specific.
I don't really know enough about the neutron integration to suggest how this should be fixed. I'm just saying that conceptually if we have two physical NICs, both plugged into the same physical network, then when booting a guest we should try to prefer using the physical NIC that has good NUMA affinity. I've no idea how neutron/nova integration would look to achieve this, as neutron is not my area of expertise.
Moving to RHOSP 11/Ocata, we need to define a plan for this. Requires further discussion.
(In reply to Daniel Berrange from comment #15)
> I don't really know enough about the neutron integration to suggest how this
> should be fixed. I'm just saying that conceptually if we have two physical
> NICs, both plugged into the same physical network, then when booting a guest
> we should try to prefer using the physical NIC that has good NUMA affinity.
> I've no idea how neutron/nova integration would look to achieve this, as
> neutron is not my area of expertise.

I'm still not really clear on how we would action this. The logic itself isn't significantly different from the handling for passthrough, but the problem would seem to be ensuring we have the right information available at the right point in scheduling for these cases. Punting to 13+ and resetting assignee. Also needinfo'ing Franck, because I believe we will need further input if we are to move this forward.
*** Bug 1459543 has been marked as a duplicate of this bug. ***
*** Bug 1467442 has been marked as a duplicate of this bug. ***
*** Bug 1500138 has been marked as a duplicate of this bug. ***
Hi,
Another objective is to check whether we can consider it predictable behavior that the first instance spawned on a DPDK dual-socket compute is always spawned on NUMA node 0 (if the flavor requests all vCPUs on the same NUMA node with 'hw:numa_nodes': '1').

This information could be a key point in mitigating the fact that there is currently no way to specify NUMA placement in the scheduler.
(In reply to Aviv Guetta from comment #36)
> Hi,
> Another objective is to check if we can consider a predictable behavior the
> fact that the first instance spawned on a DPDK dual-socket compute is always
> spawned on numa node 0 (if the flavor requests all vCPUs on the same numa
> node with 'hw:numa_nodes' : '1').
>
> This information could be a key point in order to mitigate the fact that
> actually there isn't a way to specify NUMA placement in the schedule.

This was discussed on the NFV-DFG mailing list recently. As discussed there, the use of NUMA node 0 is an implementation detail and not something one should/can rely on. Workarounds such as modifying the 'vcpu_pin_set' configuration option are probably more viable while we wait on this option.
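For reference, the 'vcpu_pin_set' workaround could look roughly like the fragment below on a compute node like the one in this bug: restricting guest vCPUs to the NUMA node that holds the NIC. The CPU list is the odd bank (node 1, where bnx2x lives) from the topology quoted earlier; adjust it to your own host, and note this is an illustrative sketch, not a configuration taken from this bug:

```ini
# /etc/nova/nova.conf on the compute node (illustrative)
[DEFAULT]
# Pin all guest vCPUs to NUMA node 1, which is local to the bnx2x NIC
vcpu_pin_set = 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
```

The nova-compute service must be restarted after changing this for it to take effect, and it applies host-wide, so it only helps when all guests on the host should land on the same node.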
This is now completed with commit 45662d77a2da77714f8e792e86ebd64a52270ef5 in upstream.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045
Hi,
Do you know if it's planned to implement predictable NUMA node selection using flavor attributes? As you suggested in your comment, setting vcpu_pin_set (as I understand it, by limiting CPUs to the subset belonging to a specific NUMA node) is currently the only way to make the selection of a NUMA node predictable. Selecting the NUMA node based on the node the NIC belongs to seems to be solved by the errata, but I still don't see a possibility to choose the NUMA node in any other situation where we are not bound to a NUMA choice based on the NIC.