Bug 1397550

Summary: Unequal pairing of "CPU cores / PMD threads / network interfaces" when involving several NUMA nodes
Product: Red Hat Enterprise Linux 7
Reporter: bmichalo
Component: openvswitch-dpdk
Assignee: Kevin Traynor <ktraynor>
Status: CLOSED CURRENTRELEASE
QA Contact: Christian Trautman <ctrautma>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.3
CC: akaris, atheurer, atragler, bmichalo, ctrautma, fleitner, ktraynor, mleitner, osabart, rkhan, sukulkar, tredaelli
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: openvswitch-2.6.1-3.git20161206.el7fdb openvswitch-2.6.1-10.git20161206.el7fdp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-26 15:10:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1411862

Description bmichalo 2016-11-22 19:11:21 UTC
Description of problem:

There is unequal pairing of "CPU cores / PMD threads / network interfaces" when OVS starts and more than one NUMA node is involved.  When only a single NUMA node is involved, the pairing appears fair (1 CPU core : 1 PMD thread : 1 network interface) when OVS starts.  When two NUMA nodes are involved, some PMD threads/cores may have more than one interface associated with them, while other PMD threads/cores may have no interfaces associated with them at all.  For example:

/usr/local/bin/ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 6:
	isolated : false
	port: vhost-user2	queue-id: 0
	port: vhost-user3	queue-id: 0
pmd thread numa_id 1 core_id 5:
	isolated : false
	port: dpdk2	queue-id: 0
pmd thread numa_id 0 core_id 2:
	isolated : false
	port: dpdk1	queue-id: 0
	port: vhost-user4	queue-id: 0
pmd thread numa_id 0 core_id 8:
	isolated : false
	port: dpdk0	queue-id: 0
pmd thread numa_id 1 core_id 1:
	isolated : false
	port: dpdk3	queue-id: 0
pmd thread numa_id 1 core_id 3:
	isolated : false

pmd thread numa_id 0 core_id 4:
	isolated : false
	port: vhost-user1	queue-id: 0
pmd thread numa_id 1 core_id 7:
	isolated : false

With OVS 2.6, this can be corrected by setting the CPU core/PMD thread/interface affinities with the command:

ovs-vsctl set interface <iface> other_config:pmd-rxq-affinity="<queue>:<core>"

That said, the expected behavior should be the same 1:1:1 assignment as with a single NUMA node.

OVS 2.5 does not have the same workaround (the command is accepted but has no effect).


Version-Release number of selected component (if applicable):
RHEL 7.3
DPDK 16.07
OVS 2.6 (and 2.5)


How reproducible:
This issue is easily reproducible.

Steps to Reproduce:
1.  Have network adapters associated with two different NUMA nodes.

2.  Start OVS (+DPDK) and assign interfaces to both NUMA nodes.  Note that in the above case there are two virtual machines with vhost-user interfaces, as well as physical interfaces (dpdk<n>).

3.  List the CPU core/PMD thread/interface relationships:
ovs-appctl dpif-netdev/pmd-rxq-show


Actual results:
Unequal pairing of CPU core/PMD thread/network interface

Expected results:
A pairing of 1 CPU core per 1 PMD thread per network interface


Additional info:

Comment 1 Kevin Traynor 2016-11-22 21:33:40 UTC
The interfaces are round-robined to the different pmds/cores automatically by OVS. However, it does not associate interfaces with pmds/cores on different NUMA nodes as this would cause performance issues. In OVS2.5/DPDK2.2 vhost interfaces are associated with NUMA 0.

In the example above, there are 5x interfaces associated with NUMA 0 round-robined over 4x NUMA 0 pmd/cores, as well as 2x NUMA 1 associated interfaces round-robined over 3x NUMA 1 pmd/cores.

Later versions of OVS/DPDK are more flexible but I don't see the above behavior as a bug, as it's done to protect against inter-NUMA performance drops.

Comment 3 bmichalo 2016-11-23 15:26:50 UTC
(In reply to Kevin Traynor from comment #1)
> The interfaces are round-robined to the different pmds/cores automatically
> by OVS. However, it does not associate interfaces with pmds/cores on
> different NUMA nodes as this would cause performance issues. In
> OVS2.5/DPDK2.2 vhost interfaces are associated with NUMA 0.
> 
> In the example above, there are 5x interfaces associated with NUMA 0
> round-robined over 4x NUMA 0 pmd/cores, as well as 2x NUMA 1 associated
> interfaces round-robined over 3x NUMA 1 pmd/cores.
> 
> Later versions of OVS/DPDK are more flexible but I don't see the above
> behavior as a bug, as it's done to protect against inter-NUMA performance
> drops.

In this case, there should be 4 interfaces / cores associated with NUMA node 0 and 4 interfaces / cores associated with NUMA node 1.  I would then think that each of these 8 cores gets a single PMD thread created upon them.  Is there a way to make the mechanism more NUMA aware without (or lessening) the need for manual pinning via "ovs-vsctl set interface <iface> other_config:pmd-rxq-affinity="<queue>:<core>"?

Comment 4 Kevin Traynor 2016-11-25 15:56:35 UTC
(In reply to bmichalo from comment #3)
> In this case, there should be 4 interfaces / cores associated with NUMA node
> 0 and 4 interfaces / cores associated with NUMA node 1.  I would then think
> that each of these 8 cores gets a single PMD thread created upon them.  Is
> there a way to make the mechanism more NUMA aware without (or lessening) the
> need for manual pinning via "ovs-vsctl set interface <iface>
> other_config:pmd-rxq-affinity="<queue>:<core>"?

In OVS2.5:
cores/pmds can be set by the user on any NUMA node using the pmd-cpu-mask field.

Physical NICs are stuck on a NUMA node by virtue of where they are physically placed.

For vhost devices, OVS 2.5 uses DPDK 2.2 and the feature to be able to get the NUMA information about a vhost device was not introduced until DPDK 16.07 (2 releases later).

So for OVS 2.5, the parameter you can tune to help is the pmd-cpu-mask. For example, if you know that all or most of your interfaces are on NUMA 0, then it is better to have more cores there and fewer on NUMA 1.
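
As a rough sketch (core numbering taken from the pmd-rxq-show output above, i.e. cores 2,4,6,8 on NUMA 0 and cores 3,5 on NUMA 1):

~~~
# Six PMD cores weighted toward NUMA 0: cores 2,4,6,8 (NUMA 0) plus 3,5 (NUMA 1)
# bits 2,3,4,5,6,8 -> mask 0x17c
ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x17c
~~~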

Comment 5 Andrew Theurer 2016-11-28 15:59:12 UTC
(In reply to Kevin Traynor from comment #1)
> OVS2.5/DPDK2.2 vhost interfaces are associated with NUMA 0.

So this is obviously incorrect where the VM's memory is not in NUMA node0, right?  Can we just set a milestone for this BZ for the OVS version that will adopt DPDK 16.07 (OVS 2.6?).  I'd like to keep the BZ open so we can track and test when the target milestone version is available.

Comment 6 bmichalo 2016-11-28 19:34:08 UTC
With OVS 2.6, if I compile DPDK 16.07 with the flag CONFIG_RTE_LIBRTE_VHOST_NUMA=y (instead of the default value CONFIG_RTE_LIBRTE_VHOST_NUMA=n), I see proper core/PMD thread/network interface pairings with proper NUMA affinities across NUMA nodes 0 and 1:

/usr/local/bin/ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 1:
	isolated : false
	port: vhost-user4	queue-id: 0
pmd thread numa_id 1 core_id 7:
	isolated : false
	port: dpdk2	queue-id: 0
pmd thread numa_id 0 core_id 4:
	isolated : false
	port: vhost-user2	queue-id: 0
pmd thread numa_id 1 core_id 5:
	isolated : false
	port: vhost-user3	queue-id: 0
pmd thread numa_id 0 core_id 6:
	isolated : false
	port: dpdk1	queue-id: 0
pmd thread numa_id 1 core_id 3:
	isolated : false
	port: dpdk3	queue-id: 0
pmd thread numa_id 0 core_id 8:
	isolated : false
	port: dpdk0	queue-id: 0
pmd thread numa_id 0 core_id 2:
	isolated : false
	port: vhost-user1	queue-id: 0
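
One way to flip that default before building (a sketch; the config file path and build target are assumed from the standard DPDK 16.07 layout):

~~~
# In the DPDK source tree, enable vhost NUMA support and rebuild:
sed -i 's/^CONFIG_RTE_LIBRTE_VHOST_NUMA=n/CONFIG_RTE_LIBRTE_VHOST_NUMA=y/' config/common_base
make install T=x86_64-native-linuxapp-gcc
~~~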

Comment 7 bmichalo 2016-11-28 20:47:18 UTC
Given this data, can we change the default value of CONFIG_RTE_LIBRTE_VHOST_NUMA from:

CONFIG_RTE_LIBRTE_VHOST_NUMA=n

to 

CONFIG_RTE_LIBRTE_VHOST_NUMA=y

Comment 8 Flavio Leitner 2016-11-28 22:40:10 UTC
(In reply to bmichalo from comment #7)
unfortunately not in dpdk 2.2 because it's broken.
It causes segfaults when enabled.

Most probably yes with OVS 2.6 + dpdk >= 16.07, but that still needs testing and verification.

Comment 9 Kevin Traynor 2016-11-29 09:24:09 UTC
(In reply to bmichalo from comment #7)
> Given this data, can we change the default value of
> CONFIG_RTE_LIBRTE_VHOST_NUMA from:
> 
> CONFIG_RTE_LIBRTE_VHOST_NUMA=n
> 
> to 
> 
> CONFIG_RTE_LIBRTE_VHOST_NUMA=y

Also, a point to note is that in dpdk 2.2/16.04 this flag just allows the dpdk vhost library to move some memory based on NUMA nodes. OVS cannot get the NUMA information even with this flag set to y, as the API was not introduced until dpdk 16.07.

Comment 10 Kevin Traynor 2017-04-07 13:31:50 UTC
(In reply to Andrew Theurer from comment #5)
> (In reply to Kevin Traynor from comment #1)
> > OVS2.5/DPDK2.2 vhost interfaces are associated with NUMA 0.
> 
> So this is obviously incorrect where the VM's memory is not in NUMA node0,
> right?  Can we just set a milestone for this BZ for the OVS version that
> will adopt DPDK 16.07 (OVS 2.6?).  I'd like to keep the BZ open so we can
> track and test when the target milestone version is available.

Hi Bill/Andrew, I think you guys have tested OVS-DPDK and NUMA with OVS 2.6.1. Do you have enough testing complete that this bug can be closed?

Comment 11 Andreas Karis 2017-04-17 20:08:35 UTC
From the source RPM:

https://access.redhat.com/downloads/content/openvswitch/2.6.1-10.git20161206.el7fdp/x86_64/fd431d51/package

INSTALL.DPDK-ADVANCED.md
~~~
(...)
### 3.6 NUMA/Cluster on Die

  Ideally inter NUMA datapaths should be avoided where possible as packets
  will go across QPI and there may be a slight performance penalty when
  compared with intra NUMA datapaths. On Intel Xeon Processor E5 v3,
  Cluster On Die is introduced on models that have 10 cores or more.
  This makes it possible to logically split a socket into two NUMA regions
  and again it is preferred where possible to keep critical datapaths
  within the one cluster.

  It is good practice to ensure that threads that are in the datapath are
  pinned to cores in the same NUMA area. e.g. pmd threads and QEMU vCPUs
  responsible for forwarding. If DPDK is built with
  CONFIG_RTE_LIBRTE_VHOST_NUMA=y, vHost User ports automatically
  detect the NUMA socket of the QEMU vCPUs and will be serviced by a PMD
  from the same node provided a core on this node is enabled in the
  pmd-cpu-mask. libnuma packages are required for this feature.
(...)
~~~

This still seems to be disabled by default in the latest versions of DPDK, but OVS 2.6.1 overrides it:
~~~
[akaris@wks-akaris openvswitch-2.6.1-10.git20161206.el7fdp.src]$ grep CONFIG_RTE_LIBRTE_VHOST_NUMA * -R
dpdk-16.11/mk/rte.app.mk:ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
dpdk-16.11/config/common_base:CONFIG_RTE_LIBRTE_VHOST_NUMA=n
dpdk-16.11/lib/librte_vhost/Makefile:ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
openvswitch-2.6.1/INSTALL.DPDK-ADVANCED.md:  CONFIG_RTE_LIBRTE_VHOST_NUMA=y, vHost User ports automatically
openvswitch-2.6.1/NEWS:       node that device memory is located on if CONFIG_RTE_LIBRTE_VHOST_NUMA
openvswitch-2.6.1/debian/changelog:       node that device memory is located on if CONFIG_RTE_LIBRTE_VHOST_NUMA
openvswitch.spec:setconf CONFIG_RTE_LIBRTE_VHOST_NUMA y
~~~

~~~
[akaris@wks-akaris openvswitch-2.6.1-3.git20161206.el7fdb.src]$ grep CONFIG_RTE_LIBRTE_VHOST_NUMA * -R
dpdk-16.11/mk/rte.app.mk:ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
dpdk-16.11/config/common_base:CONFIG_RTE_LIBRTE_VHOST_NUMA=n
dpdk-16.11/lib/librte_vhost/Makefile:ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
openvswitch-2.6.1/INSTALL.DPDK-ADVANCED.md:  CONFIG_RTE_LIBRTE_VHOST_NUMA=y, vHost User ports automatically
openvswitch-2.6.1/NEWS:       node that device memory is located on if CONFIG_RTE_LIBRTE_VHOST_NUMA
openvswitch-2.6.1/debian/changelog:       node that device memory is located on if CONFIG_RTE_LIBRTE_VHOST_NUMA
openvswitch.spec:setconf CONFIG_RTE_LIBRTE_VHOST_NUMA y
~~~

~~~
# Enable DPDK libraries needed by OVS
setconf CONFIG_RTE_LIBRTE_VHOST_NUMA y
setconf CONFIG_RTE_LIBRTE_PMD_PCAP y
~~~

If we compare this to OVS 2.5 with DPDK 2.2:
~~~
[akaris@wks-akaris openvswitch-2.5.0-14.git20160727.el7fdp.src]$ grep CONFIG_RTE_LIBRTE_VHOST_NUMA * -R
dpdk-2.2.0/mk/rte.app.mk:ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
dpdk-2.2.0/config/common_linuxapp:CONFIG_RTE_LIBRTE_VHOST_NUMA=n
dpdk-2.2.0/lib/librte_vhost/Makefile:ifeq ($(CONFIG_RTE_LIBRTE_VHOST_NUMA),y)
openvswitch.spec:setconf CONFIG_RTE_LIBRTE_VHOST_NUMA n
~~~

So the question now is:
is dpdk-16.11/config/common_base:CONFIG_RTE_LIBRTE_VHOST_NUMA=n  a default which is overwritten by openvswitch.spec:setconf CONFIG_RTE_LIBRTE_VHOST_NUMA y
Or do both values need to be "y"?

Comment 12 Kevin Traynor 2017-04-18 12:40:51 UTC
(In reply to Andreas Karis from comment #11)

> 
> So the question now is:
> is dpdk-16.11/config/common_base:CONFIG_RTE_LIBRTE_VHOST_NUMA=n  a default
> which is overwritten by openvswitch.spec:setconf
> CONFIG_RTE_LIBRTE_VHOST_NUMA y

Yes, that's correct. This is the relevant step: 

+ grep -q CONFIG_RTE_LIBRTE_VHOST_NUMA x86_64-native-linuxapp-gcc/.config
+ sed -i 's:^CONFIG_RTE_LIBRTE_VHOST_NUMA=.*$:CONFIG_RTE_LIBRTE_VHOST_NUMA=y:g' x86_64-native-linuxapp-gcc/.config

> Or do both values need to be "y"?

No, openvswitch.spec overwrites it prior to the build.
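
For what it's worth, the effective value ends up in the generated per-target config rather than in common_base, so after a build it can be read back from there (the path is taken from the build step above):

~~~
grep CONFIG_RTE_LIBRTE_VHOST_NUMA x86_64-native-linuxapp-gcc/.config
# expected after the spec's sed step: CONFIG_RTE_LIBRTE_VHOST_NUMA=y
~~~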

Comment 13 Andreas Karis 2017-04-18 13:59:29 UTC
Hi,

Perfect, then I guess that this confirms that we enable this in the latest versions of OVS-DPDK.

Thanks!

- Andreas

Comment 14 Andreas Karis 2017-04-18 19:06:04 UTC
Hi,

* On an already compiled DPDK / OVS binary, how can we verify that CONFIG_RTE_LIBRTE_VHOST_NUMA was set? Can I run 'strings' or anything else against a binary? (looking for a safer / better confirmation than just looking at the binary)

* Is this https://software.intel.com/en-us/articles/vhost-user-numa-awareness-in-open-vswitch-with-dpdk a good way to test dpdk awareness?
If not, what's Red Hat's way to test dpdk awareness?

* as of now, when I tested the performance:
if the instance and pmd are on the same numa node, the iperf3 performance is around 7 Gb/s,
if the instance and pmd are on different numa nodes, the iperf3 performance is around 5.5 Gb/s.

==> is that something that VHOST_NUMA awareness would fix?

* https://www.youtube.com/watch?v=NXhyN79019U 
VHOST_NUMA and this video are related, or not? as far as I understand, this video shows some experimental feature which is not yet in upstream OVS, correct?

Comment 15 Kevin Traynor 2017-04-19 17:48:32 UTC
(In reply to Andreas Karis from comment #14)
> Hi,
> 
> * On an already compiled DPDK / OVS binary, how can we verify that
> CONFIG_RTE_LIBRTE_VHOST_NUMA was set? Can I run 'strings' or anything else
> against a binary? (looking for a safer / better confirmation than just
> looking at the binary)
> 

commented in https://bugzilla.redhat.com/show_bug.cgi?id=1437163#c24

> * Is this
> https://software.intel.com/en-us/articles/vhost-user-numa-awareness-in-open-
> vswitch-with-dpdk a good way to test dpdk awareness?
> If not, what's Red Hat's way to test dpdk awareness?
> 

I think it's a good clear test for the functionality. Ultimately some intra-numa perf test on both numa nodes can confirm that the rates are the same, but that test can be affected by something else using the cores etc.
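
A minimal sketch of that kind of check (the port name is a placeholder; it assumes the guest behind vhost-user1 was started with its vCPUs and memory bound to NUMA node 1, e.g. via numactl or a libvirt numatune element, and that pmd-cpu-mask enables cores on both nodes):

~~~
# On the host, once the guest has brought up its virtio interface:
ovs-appctl dpif-netdev/pmd-rxq-show
# With CONFIG_RTE_LIBRTE_VHOST_NUMA=y the vhost-user1 rxq should be listed
# under a "pmd thread numa_id 1" entry instead of a NUMA 0 one.
~~~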

> * as of now, when I tested the performance:
> if the instance and pmd are on the same numa node, the iperf3 performance is
> around 7 Gb/s,
> if the instance and pmd are on different numa nodes, the iperf3 performance
> is around 5.5 Gb/s.
> 
> ==> is that something that VHOST_NUMA awareness would fix?

No, because that actually has cross-NUMA traffic. Commented further in https://bugzilla.redhat.com/show_bug.cgi?id=1437163#c24

> 
> * https://www.youtube.com/watch?v=NXhyN79019U 
> VHOST_NUMA and this video are related, or not? as far as I understand, this
> video shows some experimental feature which is not yet in upstream OVS,
> correct?

Yes, that's correct. It was just research; they never released any code for it.

Comment 16 Andreas Karis 2017-04-19 18:19:38 UTC
Hi Kevin - good, that clarifies the part about the presentation on youtube :-)

Comment 17 Kevin Traynor 2017-05-12 13:07:05 UTC
(In reply to Andrew Theurer from comment #5)
> (In reply to Kevin Traynor from comment #1)
> > OVS2.5/DPDK2.2 vhost interfaces are associated with NUMA 0.
> 
> So this is obviously incorrect where the VM's memory is not in NUMA node0,
> right?  Can we just set a milestone for this BZ for the OVS version that
> will adopt DPDK 16.07 (OVS 2.6?).  I'd like to keep the BZ open so we can
> track and test when the target milestone version is available.

(In reply to bmichalo from comment #7)
> Given this data, can we change the default value of
> CONFIG_RTE_LIBRTE_VHOST_NUMA from:
> 
> CONFIG_RTE_LIBRTE_VHOST_NUMA=n
> 
> to 
> 
> CONFIG_RTE_LIBRTE_VHOST_NUMA=y

These are both done and enabled as part of OVS 2.6.1 with DPDK 16.11.