Description of problem:
OpenStack guest-to-guest netperf/TCP_STREAM throughput over neutron-openvswitch/bnx2x dropped below 4 Gb/s.

Experiment: netperf/TCP_STREAM 180-second test, repeated five times. See below for min/ave/max stats:

*** guest to guest over neutron-openvswitch/bnx2x --- VLAN ***
min/ave/max = 3820.33 / 3943.95 / 3999.42 (Mbps)

For reference, see the following test case:

*** VM to VM over OVS/bnx2x --- VLAN ***
min/ave/max = 9149.72 / 9298.19 / 9372.23 (Mbps)

Version-Release number of selected component (if applicable):
Openstack: 2015-01-23.1/RH7-RHOS-6.0
RHEL7.1 kernel: 3.10.0-227.el7.x86_64
ovs_version: "2.1.3"

How reproducible:
reproducible

Steps to Reproduce:
Two hosts are needed for the experiment.
1. Configure one host as Controller/Network/Compute, the other as Compute only
2. Configure a VLAN provider network using the 10 Gb bnx2x NIC to bridge the neutron-openvswitch network
3. Configure one guest per host
4. Run netperf/TCP_STREAM between the two guests

Actual results:
The average throughput rate was under 4 Gb/s, as reported above.

Expected results:
Should be around 9 Gb/s.

Additional info:
NOTE: netperf/TCP_MAERTS testing showed the same issue.
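The experiment above can be sketched as a small driver; the guest IP, run count, and per-run throughput numbers below are placeholders (parsing real netperf output is left out), and `stream_cmd`/`summarize` are illustrative names, not part of any tool used in this bug:

```python
import statistics

def stream_cmd(server_ip, seconds=180):
    """Build the netperf TCP_STREAM command line used in this experiment.
    (For the TCP_MAERTS variant, swap the test name.) netserver must be
    running on the peer guest."""
    return ["netperf", "-H", server_ip, "-t", "TCP_STREAM", "-l", str(seconds)]

def summarize(rates_mbps):
    """Reduce per-run throughput numbers to the min/ave/max triple
    reported in this bug."""
    return (min(rates_mbps),
            round(statistics.mean(rates_mbps), 2),
            max(rates_mbps))

print(stream_cmd("192.0.2.10"))  # placeholder guest IP
print(summarize([3820.33, 3900.0, 3950.0, 3999.42, 3980.0]))  # hypothetical runs
```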
After rebooting both hosts and re-running the same experiment, netperf/TCP_STREAM got 5+ Gbps:
5344.24 / 5540.34 / 5828.51 (Mbps)
5371.65 / 5567.38 / 5696.73 (Mbps)
netperf/TCP_MAERTS got:
3004.93 / 3333.56 / 3529.27 (Mbps)
2960.47 / 3300.44 / 3681.04 (Mbps)
Created attachment 987198 [details]
Spreadsheet of netperf/TCP_STREAM throughput rates over bnx2x and ixgbe

The spreadsheet includes netperf/TCP_STREAM throughput rates over the following data paths:
NIC to NIC
OVS/NIC to OVS/NIC
VM/OVS/NIC to VM/OVS/NIC
Guest/Neutron/OVS/NIC to Guest/Neutron/OVS/NIC
where NIC = bnx2x or ixgbe.

Experiment: 10 netperf/TCP_STREAM tests, each lasting 180 seconds.
Found the root cause:

The vhost and qemu-kvm processes associated with the Neutron/OVS guest have their affinity set to the even-numbered CPUs only --- 55555555. These should be changed to the default value FFFFFFFF. However, the affinity of either process cannot be changed:

[root@qe-dell-ovs4 jhsiao]# taskset -p ffffffff 29783
pid 29783's current affinity mask: 55555555
pid 29783's new affinity mask: 55555555
[root@qe-dell-ovs4 jhsiao]# taskset -p ffffffff 29832
pid 29832's current affinity mask: 55555555
pid 29832's new affinity mask: 55555555
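To see why 55555555 means "even CPUs only", a taskset-style mask can be decoded bit by bit; `mask_to_cpus` is an illustrative helper, assuming the 32-CPU host from this bug:

```python
def mask_to_cpus(mask, ncpus=32):
    """Expand a taskset-style hex affinity mask into the CPU ids it allows."""
    return [cpu for cpu in range(ncpus) if mask & (1 << cpu)]

# 0x55555555 sets every other bit, i.e. only the even-numbered CPUs 0,2,...,30
print(mask_to_cpus(0x55555555))
# 0xFFFFFFFF (the default) allows all 32 CPUs
print(mask_to_cpus(0xFFFFFFFF) == list(range(32)))
```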
If we look at the 'virsh capabilities' output for the compute host, we can see the topology of the host:

  <topology>
    <cells num='2'>
      <cell id='0'>
        <memory unit='KiB'>67062188</memory>
        <pages unit='KiB' size='4'>16765547</pages>
        <pages unit='KiB' size='2048'>0</pages>
        <distances>
          <sibling id='0' value='10'/>
          <sibling id='1' value='20'/>
        </distances>
        <cpus num='16'>
          <cpu id='0' socket_id='0' core_id='0' siblings='0,16'/>
          <cpu id='2' socket_id='0' core_id='1' siblings='2,18'/>
          <cpu id='4' socket_id='0' core_id='2' siblings='4,20'/>
          <cpu id='6' socket_id='0' core_id='3' siblings='6,22'/>
          <cpu id='8' socket_id='0' core_id='4' siblings='8,24'/>
          <cpu id='10' socket_id='0' core_id='5' siblings='10,26'/>
          <cpu id='12' socket_id='0' core_id='6' siblings='12,28'/>
          <cpu id='14' socket_id='0' core_id='7' siblings='14,30'/>
          <cpu id='16' socket_id='0' core_id='0' siblings='0,16'/>
          <cpu id='18' socket_id='0' core_id='1' siblings='2,18'/>
          <cpu id='20' socket_id='0' core_id='2' siblings='4,20'/>
          <cpu id='22' socket_id='0' core_id='3' siblings='6,22'/>
          <cpu id='24' socket_id='0' core_id='4' siblings='8,24'/>
          <cpu id='26' socket_id='0' core_id='5' siblings='10,26'/>
          <cpu id='28' socket_id='0' core_id='6' siblings='12,28'/>
          <cpu id='30' socket_id='0' core_id='7' siblings='14,30'/>
        </cpus>
      </cell>
      <cell id='1'>
        <memory unit='KiB'>67108864</memory>
        <pages unit='KiB' size='4'>16777216</pages>
        <pages unit='KiB' size='2048'>0</pages>
        <distances>
          <sibling id='0' value='20'/>
          <sibling id='1' value='10'/>
        </distances>
        <cpus num='16'>
          <cpu id='1' socket_id='1' core_id='0' siblings='1,17'/>
          <cpu id='3' socket_id='1' core_id='1' siblings='3,19'/>
          <cpu id='5' socket_id='1' core_id='2' siblings='5,21'/>
          <cpu id='7' socket_id='1' core_id='3' siblings='7,23'/>
          <cpu id='9' socket_id='1' core_id='4' siblings='9,25'/>
          <cpu id='11' socket_id='1' core_id='5' siblings='11,27'/>
          <cpu id='13' socket_id='1' core_id='6' siblings='13,29'/>
          <cpu id='15' socket_id='1' core_id='7' siblings='15,31'/>
          <cpu id='17' socket_id='1' core_id='0' siblings='1,17'/>
          <cpu id='19' socket_id='1' core_id='1' siblings='3,19'/>
          <cpu id='21' socket_id='1' core_id='2' siblings='5,21'/>
          <cpu id='23' socket_id='1' core_id='3' siblings='7,23'/>
          <cpu id='25' socket_id='1' core_id='4' siblings='9,25'/>
          <cpu id='27' socket_id='1' core_id='5' siblings='11,27'/>
          <cpu id='29' socket_id='1' core_id='6' siblings='13,29'/>
          <cpu id='31' socket_id='1' core_id='7' siblings='15,31'/>
        </cpus>
      </cell>
    </cells>
  </topology>

Meanwhile, if we look at the guest in question, we see it has CPU placement set:

# virsh dumpxml instance-00000005 | grep placement
<vcpu placement='static' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30'>1</vcpu>

So this guest is basically being set to run on the first NUMA node. This is good, because otherwise it would randomly float across NUMA nodes and suffer degraded performance due to cross-NUMA memory access.
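The link between the topology and the observed affinity mask can be made explicit. The per-node CPU lists below are transcribed from the 'virsh capabilities' output above; `cpuset_string` and `affinity_mask` are illustrative helper names:

```python
# Per-node CPU ids transcribed from the topology above:
# cell 0 holds the even-numbered CPUs, cell 1 the odd-numbered ones.
NODE_CPUS = {
    0: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30],
    1: [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31],
}

def cpuset_string(node):
    """Render a node's CPUs the way libvirt's cpuset attribute does."""
    return ",".join(str(c) for c in NODE_CPUS[node])

def affinity_mask(node):
    """Fold a node's CPUs into a taskset-style mask."""
    mask = 0
    for cpu in NODE_CPUS[node]:
        mask |= 1 << cpu
    return mask

# Node 0 reproduces both the guest's cpuset and the observed 55555555 mask
print(cpuset_string(0))
print(f"{affinity_mask(0):08x}")
```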
Also, there is nothing else running on this host that is competing for CPU resources with the single guest vCPU, so I'm sceptical that this CPU pinning has an impact on the TCP network performance.
(In reply to Daniel Berrange from comment #8)
> Also, there is nothing else running on this host that is competing for CPU
> resources with the single guest vCPU, so I'm sceptical that this CPu pinning
> has an impact on the TCP network performance

Let me explain the performance impact a little bit. It's all about locality. There are two banks of CPUs --- even and odd --- and bnx2x belongs to the odd bank, based on the HW configuration.

For an OVS VM, the affinity of its vhost and qemu-kvm tasks is set to FFFFFFFF, and when netperf/TCP_STREAM is running, the odd CPUs get utilized. That yields near line rate performance.

For a Nova guest, the affinity of both vhost and qemu-kvm is instead set to even CPUs only. When netperf/TCP_STREAM is running, some four even CPUs and one odd CPU get heavily used, and that degrades the netperf/TCP_STREAM throughput rate to under 6 Gb/s.

The key question is: what causes only even CPUs to be used, according to the XML file? Never seen this before.
So, IIUC, the network interface that the guest is connected to is attached to NUMA node 1, while the guest is placed on NUMA node 0.

There's not really anything we can do about this in general. While Nova will soon make use of NUMA locality info for assigned PCI devices, there is no equivalent work being done to take locality into account when just connecting to openvswitch NICs. It'll just be pot luck whether any guest is local to the NIC in question or not.

If the testing requires that the guest be running on a specific host NUMA node in order to reach the throughput, it was only ever working by luck --- eg there were sufficiently few guests on the host that the kernel happened to schedule the guest on the right NUMA node to achieve the performance expected. As you run more guests on the host, inevitably some are going to be on different NUMA nodes and so not meet the performance figures. The new Nova NUMA placement logic just means we see this effect upfront during testing, instead of only once customers load their hosts up with many guests.

IMHO the only real way to solve this is to make the Neutron/Nova integration smarter, so the guest can be connected to a physical NIC that is local to the NUMA node that Nova placed the guest on. AFAIK no one is working on that.
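For what it's worth, the locality mismatch can be detected on the host without any Nova involvement, by reading the NIC's NUMA node from sysfs and comparing it against the node the guest is pinned to. This is only a diagnostic sketch; the interface name is a placeholder, and `nic_numa_node`/`is_local` are illustrative names:

```python
from pathlib import Path

def nic_numa_node(ifname):
    """Read the NUMA node a NIC's PCI device is attached to.
    The kernel reports -1 on single-node systems or when it doesn't know."""
    path = Path(f"/sys/class/net/{ifname}/device/numa_node")
    return int(path.read_text()) if path.exists() else -1

def is_local(nic_node, guest_node):
    """True when the guest's NUMA node matches the NIC's, i.e. no
    cross-node hop on the data path."""
    return nic_node < 0 or nic_node == guest_node

# In this bug: bnx2x sits on node 1, while the guest is pinned to node 0
print(is_local(1, 0))
```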
As I mentioned above, for an OVS VM the affinity of its vhost and qemu-kvm is set to the default --- all CPUs (FFFFFFFF in this case) --- and the kernel/driver drives the utilization of CPUs. For ixgbe traffic, only even CPUs are involved; for bnx2x, only odd CPUs are involved. I still don't know why the even bank was allocated for bnx2x with the current algorithm.
It's not entirely clear to me what we want to do with this initially. Dan, do you have any idea how/where the vhost and qemu-kvm affinities are set right now? Based on the comments, this seems to be Nova-specific.
I don't really know enough about the neutron integration to suggest how this should be fixed. I'm just saying that conceptually if we have two physical NICs, both plugged into the same physical network, then when booting a guest we should try to prefer using the physical NIC that has good NUMA affinity. I've no idea how neutron/nova integration would look to achieve this, as neutron is not my area of expertise.
Moving to RHOSP 11/Ocata, we need to define a plan for this. Requires further discussion.
(In reply to Daniel Berrange from comment #15)
> I don't really know enough about the neutron integration to suggest how this
> should be fixed. I'm just saying that conceptually if we have two physical
> NICs, both plugged into the same physical network, then when booting a guest
> we should try to prefer using the physical NIC that has good NUMA affinity.
> I've no idea how neutron/nova integration would look to achieve this, as
> neutron is not my area of expertise.

I'm still not really clear on how we would action this. The logic itself isn't significantly different from the handling for passthrough, but the problem would seem to be ensuring we have the right information available at the right point in scheduling for these cases. Punting to 13+ and resetting assignee. Also needinfo'ing Franck, because I believe we will need further input if we are to move this forward.
*** Bug 1459543 has been marked as a duplicate of this bug. ***
*** Bug 1467442 has been marked as a duplicate of this bug. ***
*** Bug 1500138 has been marked as a duplicate of this bug. ***
Hi,
Another objective is to check whether we can consider it predictable behavior that the first instance spawned on a DPDK dual-socket compute is always spawned on NUMA node 0 (if the flavor requests all vCPUs on the same NUMA node with 'hw:numa_nodes': '1').

This information could be a key point in mitigating the fact that there is currently no way to specify NUMA placement in the scheduler.
(In reply to Aviv Guetta from comment #36)
> Hi,
> Another objective is to check if we can consider a predictable behavior the
> fact that the first instance spawned on a DPDK dual-socket compute is always
> spawned on numa node 0 (if the flavor requests all vCPUs on the same numa
> node with 'hw:numa_nodes' : '1').
>
> This information could be a key point in order to mitigate the fact that
> actually there isn't a way to specify NUMA placement in the schedule.

This was discussed on the NFV-DFG mailing list recently. As discussed there, the use of NUMA node 0 is an implementation detail and not something one should/can rely on. Workarounds such as modifying the 'vcpu_pin_set' configuration option are probably more viable while we wait on this option.
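For reference, the 'vcpu_pin_set' workaround could look roughly like the fragment below on a compute node like the one in this bug: restricting guest vCPUs to the NUMA node that holds the NIC. The CPU list is the odd bank (node 1, where bnx2x lives) from the topology quoted earlier; adjust it to your own host, and note this is an illustrative sketch, not a configuration taken from this bug:

```ini
# /etc/nova/nova.conf on the compute node (illustrative)
[DEFAULT]
# Pin all guest vCPUs to NUMA node 1, which is local to the bnx2x NIC
vcpu_pin_set = 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
```

The nova-compute service must be restarted after changing this for it to take effect, and it applies host-wide, so it only helps when all guests on the host should land on the same node.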
This is now completed with commit 45662d77a2da77714f8e792e86ebd64a52270ef5 in upstream.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0045
Hi,
Do you know if it's planned to implement predictable NUMA node selection using flavor attributes? As you suggested in your comment, setting vcpu_pin_set (as I understand it, by limiting CPUs to the subset belonging to a specific NUMA node) is currently the only way to make the selection of a NUMA node predictable. Selecting the NUMA node based on the node the NIC belongs to seems to be solved by the errata, but I still don't see a possibility to choose the NUMA node in any other situation where we are not bound to a NUMA choice based on the NIC.