Bug 1411606

Summary: ovs-dpdk over 10G links give low performance as 540Kpps@512Byte
Product: Red Hat OpenStack
Component: openvswitch
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: high
Keywords: Reopened, Unconfirmed
Reporter: Jaison Raju <jraju>
Assignee: Eyal Dannon <edannon>
QA Contact: Ofer Blaut <oblaut>
CC: aloughla, apevec, atelang, atheurer, atragler, bbdemirel, cchen, chrisw, edannon, fbaudin, fherrman, fleitner, jraju, nyechiel, rhos-maint, rkhan, srevivo, vaggarwa, yrachman
Flags: vaggarwa: needinfo-, vaggarwa: needinfo-, yrachman: needinfo-
Type: Bug
Last Closed: 2017-06-01 12:59:31 UTC
Attachments:
- huawei's performance test architecture with a TestCenter
- compute node's /proc/cpuinfo

Description Jaison Raju 2017-01-10 05:26:01 UTC
Description of problem:
An OVS-DPDK environment deployed with director gives low performance:

The connection:

    br-link0                                            br-int
   /      \                                            /      \
  dpdk0    phy-br-link0 <-----------------> int-br-link0       VM_Port
           type: patch                      type: patch


Version-Release number of selected component (if applicable):
dpdk-2.2.0-3.el7.x86_64
erlang-kernel-18.3.4.4-1.el7ost.x86_64
kernel-3.10.0-514.2.2.el7.x86_64
kernel-devel-3.10.0-514.2.2.el7.x86_64
kernel-headers-3.10.0-514.2.2.el7.x86_64
kernel-tools-3.10.0-514.2.2.el7.x86_64
kernel-tools-libs-3.10.0-514.2.2.el7.x86_64
openstack-neutron-openvswitch-9.1.0-8.el7ost.noarch
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
python-openvswitch-2.5.0-14.git20160727.el7fdp.noarch


How reproducible:
Always, on the customer's end.

Steps to Reproduce:
1. Deploy a DPDK setup using the director documentation.

Actual results:


Expected results:


Additional info:
When testing at 250Kpps, %sys shoots up to 75% on the 2 cores OVS is pinned to.
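
One way to observe this during a test, as a hedged illustration (assuming the sysstat package is installed; the cores to watch are whichever ones the OVS PMD threads are pinned to), is to run the following on the compute node and check the %sys column for those cores:

# mpstat -P ALL 1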

Comment 16 Chen 2017-02-27 07:02:33 UTC
Sorry for reopening the bugzilla, as the issue hasn't been resolved on Huawei's side.

I have a question regarding comment #10.

"> So just to confirm the cores selected for PMD are the same ones that we set
> in /etc/sysconfig/openvswitch in dpdk options , right ?
Please use different cores, the PMDs has to be clean from any interrupts."

My understanding is that we set the cores which the PMD uses in /etc/sysconfig/openvswitch like the following:

DPDK_OPTIONS = "-l 2,4 -n 4 --socket-mem 1024,0 -w 0000:01:00.1”

Core #2 and core #4 should be selected for PMD scheduling, and they should be part of isolated_cores. But how can we make the PMDs use cores other than the ones specified in /etc/sysconfig/openvswitch?

Best Regards.
Chen

Comment 17 Yariv 2017-02-28 06:56:36 UTC
(In reply to Chen from comment #16)
> Sorry for reopening the bugzilla as the issue hasn't been resolved by huawei.
> 
> I got a question regarding comment #10.
> 
> "> So just to confirm the cores selected for PMD are the same ones that we
> set
> > in /etc/sysconfig/openvswitch in dpdk options , right ?
> Please use different cores, the PMDs has to be clean from any interrupts."
> 
> My understanding is, we set the cores which PMD uses in the
> /etc/sysconfig/openvswitch like followings:
> 
> DPDK_OPTIONS = "-l 2,4 -n 4 --socket-mem 1024,0 -w 0000:01:00.1"
> 
> the core #2 and core #4 should be selected to be used for PMD scheduling and
> they should be part of isolated_cores. But how can we make PMD use different
> cores other than ones specified in /etc/sysconfig/openvswitch ?
> 
> Best Regards.
> Chen

Hi Chen

Did you use a tuned profile? Which profile version of tuned-profiles-cpu-partitioning?

Comment 18 Chen 2017-02-28 08:03:19 UTC
Hi Yariv,

Sorry that was my mistake. I was intending to say "the core #2 and core #4 should be selected to be used for PMD scheduling and they should be part of isolcpus."

They didn't install tuned-profiles-cpu-partitioning at all.

However, on the compute node,

/proc/cmdline

BOOT_IMAGE=/boot/vmlinuz-3.10.0-514.2.2.el7.x86_64 root=UUID=a69bf0c7-8d41-42c5-b1f0-e64719aa7ffb ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=200 iommu=pt intel_iommu=on isolcpus=2,3,26,27

But /etc/sysconfig/openvswitch shows they are not using cores #2,3,26,27.

DPDK_OPTIONS = "-l 0 -n 4 --socket-mem 1024,0 -w 0000:05:00.0 -w 0000:05:00.1"

I'm confirming this with the customer now.

Best Regards,
Chen

Comment 19 Chen 2017-02-28 09:54:42 UTC
Hi team,

I also asked the customer about their performance testing method. They are using "Spirent TestCenter" to test the performance. I will attach their test architecture later.

What is the recommended method to test ovs-dpdk performance, from the Red Hat side?

Best Regards,
Chen

Comment 20 Chen 2017-02-28 09:56:00 UTC
Created attachment 1258311 [details]
huawei's performance test architecture with a TestCenter

Comment 25 Chen 2017-03-14 15:01:33 UTC
Created attachment 1262995 [details]
compute node's /proc/cpuinfo

Comment 27 Chen 2017-03-27 08:41:21 UTC
Hi,

The customer tried to use CPUAffinity but the performance didn't improve.

Here are the details shared by the customer.


1. tuned.
cat cpu-partitioning-variables.conf
isolated_cores=2-23,26-47 (DPDK uses cores 2,3,26,27)

2. linux_bridge + ovs_user_bridge
The new doc says that ovs_bridge + ovs_user_bridge may cause performance problems. We have changed it to linux_bridge + ovs_user_bridge.

3. isolcpus 
[root@compute-0 ~]# cat /etc/systemd/system.conf
...
CPUAffinity=0 1 24 25
[root@compute-0 ~]# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-3.10.0-514.6.2.el7.x86_64 root=UUID=c2bb0683-ebe8-4541-9c5c-a811d0326ae5 ro console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=128 iommu=pt intel_iommu=on nohz=on nohz_full=2-23,26-47 rcu_nocbs=2-23,26-47 intel_pstate=disable nosoftlockup

4. ovs options
[root@compute-0 sysconfig]# cat /etc/sysconfig/openvswitch 
DPDK_OPTIONS = "-l 2,3,26,27 -n 4 --socket-mem 1024,0 -w 0000:05:00.0 -w 0000:05:00.1"

5. iperf3
Two VMs on different hosts measure 5.28 Gbits/s.

However, the customer found the following important information as well.

"We found a method can improve to 1.2Mpps.

ovs-vsctl set port br-int tag=4000
ovs-vsctl set port br-link0 tag=4001

This can prevent the host DPDK CPUs from being used by the kernel."

Do you have any idea about this finding? AFAIK, the customer is using VLAN 2504 and VLAN 2505 when testing with testpmd.

Best Regards,
Chen

Comment 30 Andrew Theurer 2017-03-27 12:46:32 UTC
A couple things:

1) In DPDK_OPTIONS, the "-l 2,3,26,27" is not the PMD thread assignment.  It is for a different set of OVS threads and should not be set to the same CPUs as the PMD threads.  Please switch to "-l 0".

2) If there is CPU time spent in kernel for a PMD thread, then most likely OVS is flooding packets to all ports, including ports that are not of type netdev.  This is typical only when the destination of a packet is not known.  However, that behavior should be isolated to the first couple of packets.  Once the destination is known, the packet should be forwarded to its destination, which should always be a netdev port.

The default OVS bridge configuration and the default flow rules should be maintained.  There is no reason to override the default rules; that will only cover up the underlying problem.

If the time in kernel persists in the PMD threads, then there must be a reason why OVS cannot learn the destination for certain packets.  This can happen if the packet generator is not configured properly.  When using testpmd in the VM, the packet generator must be configured to send packets with the destination MAC matching the VM interface, and testpmd must use --forward-mode=macswap and --port-topology=chained; this applies when testpmd is using a single virtio interface for the test.  This type of test should complete successfully before a 2-virtio-interface test is attempted.
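
For illustration only, a minimal sketch of such a single-interface testpmd invocation inside the guest; the core list, memory size, and channel count are placeholders, and the virtio device is assumed to already be bound to a DPDK-compatible driver:

# testpmd -l 0,1,2 -n 4 --socket-mem 1024 -- \
      --forward-mode=macswap --port-topology=chained --nb-cores=2 --auto-start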

In the single interface test, when the packet generator sends the first packet, OVS will forward the packet to all ports, which will cause some time spent in kernel.  However, once testpmd receives and then sends the packet back (while swapping the src and dst MACs), the packet comes back from the VM, and OVS now learns where the destination for that MAC is.  Once this happens for the first packet, OVS no longer needs to forward any packets with this dst MAC to all ports.

Another potential problem is a hardware switch between the packet generator and the compute node.  A hardware switch has the same issue: if it does not know the destination for packets, it may forward them to all ports, which can cause duplicate packets being sent to the compute node (via different ports).

So, in summary:
-use default OVS config and flows, don't add tags
-use testpmd with macswap and chained options
-packet generator connected directly to compute node
-packet generator must configure dst MAC with -only- MAC belonging to VM interface used by testpmd

Comment 31 Chen 2017-03-27 12:58:35 UTC
Hi Andrew,

Thank you very much for your help. 

Sorry, I have a question about point 1: where should I configure the cores which the PMDs use, if it is not the -l option in DPDK_OPTIONS? My understanding is that the customer wants to use cores 2,3,26,27 for DPDK.

@Yariv, comment #24's needinfo is invalid now. Let's see whether Andrew's findings could help the issue.

Best Regards,
Chen

Comment 32 Eyal Dannon 2017-03-27 14:20:59 UTC
(In reply to Chen from comment #31)
> Hi Andrew,
> 
> Thank you very much for your help. 
> 
> Sorry I got a question about point 1, where should I configure the cores
> which PMD use if it is not -l option in DPDK_OPTIONS ? My understanding is
> that the customer wants to use core 2,3,26,27 for DPDK. 
> 
> @Yariv, comment #24's needinfo is invalid now. let's see whether Andrew's
> findings could help the issue.
> 
> Best Regards,
> Chen

First, calculate the hex mask of your lcores using Python:
#python
>>> print "%x" % ((1 << 2) | (1<< 3) | (1 << 26) | (1<< 27))
c00000c

Use the output to set them as the PMD cores for OVS:
# ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=c00000c

Then check which vhu/dpdk port is assigned to each lcore:
# ovs-appctl dpif-netdev/pmd-rxq-show

Let me know if that worked for you,
Eyal

Comment 33 Chen 2017-03-29 06:16:53 UTC
Hi Andrew,

This is the feedback from the customer.

"The second describe is right.
I use the single interface test and set testpmd forward-mode to macswap, There is no kernel usage on PMD thread(keep default setting).
The forward performance is 1.2Mpps at 512B. (intel 82599 10GE, VM and DPDK cpu isolate, VM and DPDK on same NUMA, VM cpu punning)"

And the customer was asking:

"How is your test? And your forward performance?"

I understand that Franck has shared [1] with me, but there are no numbers for 512B. Do we have such an official number? Is 1.2Mpps too slow for a 10G NIC?

[1] https://docs.google.com/presentation/d/1ObBGuwG-Bx0z1_k-6RUvjbGMOURDio_zcchaQXAz_Ek/edit

Best Regards,
Chen

Comment 34 Andrew Theurer 2017-03-29 19:10:19 UTC
OK, the results you have referenced use 2 x 10Gb interfaces and 2 x virtio interfaces.  Those results will generally be 2x the throughput compared to a single-interface test.  If we look at the 1024B test, it was 2.38Mpps, and 1/2 of that would be 1.19Mpps.  However, since the customer's test uses 512B packets, it should be able to do roughly 2x the packet rate of 1024B, so we would be back at about 2.38Mpps for this test.

To achieve that rate, the host needs to be using 2 PMD threads to process the packets: 1 for polling the 10Gb interface, and 1 for polling the virtio interface.

Can you ask the customer to run the test again, and during the test, on the compute node as root, run:

ovs-appctl dpif-netdev/pmd-rxq-show

ovs-appctl dpif-netdev/pmd-stats-clear
-wait 10 seconds after the above command, then run:
ovs-appctl dpif-netdev/pmd-stats-show

I will need the output from the first and last command.

Comment 35 Chen 2017-03-30 06:42:38 UTC
Hi Andrew,

Thank you very much for your guide.

This is the feedback from the customer; it seems they used another port and the rate reached 2.2Mpps.

------- begin -------
[root@compute-0 ~]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 2:
	port: dpdk0	queue-id: 0
	port: vhubcee2161-7d	queue-id: 0
pmd thread numa_id 0 core_id 26:
	port: vhua5633965-d5	queue-id: 0
	port: vhu22c3a609-3d	queue-id: 0

Last time, the 1.2Mpps was measured on port vhubcee2161-7d. As we can see, the host interface and the VM testpmd port are forwarded by core 2.

This time I used port vhu22c3a609-3d for testing, with the host interface on core 2 and the VM testpmd port on core 26; the rate went up to 2.2Mpps@512B.

I'm glad to see the performance can reach 2.2Mpps.

[root@compute-0 ~]# ovs-appctl dpif-netdev/pmd-stats-show
main thread:
	emc hits:0
	megaflow hits:0
	miss:0
	lost:0
	polling cycles:2953352 (100.00%)
	processing cycles:0 (0.00%)
pmd thread numa_id 0 core_id 2:
	emc hits:68469538
	megaflow hits:254
	miss:768
	lost:0
	polling cycles:3412856844 (5.01%)
	processing cycles:64677912608 (94.99%)
	avg cycles per packet: 994.45 (68090769452/68470560)
	avg processing cycles per packet: 944.61 (64677912608/68470560)
pmd thread numa_id 0 core_id 26:
	emc hits:68465405
	megaflow hits:0
	miss:0
	lost:0
	polling cycles:26789424116 (41.13%)
	processing cycles:38340654836 (58.87%)
	avg cycles per packet: 951.28 (65130078952/68465405)
	avg processing cycles per packet: 560.00 (38340654836/68465405)

It's clear that OVS+DPDK on Red Hat can achieve good performance, but there are a lot of usage constraints.

------- end -------

I'm just wondering whether the "ovs-appctl dpif-netdev/pmd-rxq-show" output is expected. To get the best performance, should we avoid vhubcee2161-7d because it shares a core with dpdk0? Do we have such limitations, as the customer pointed out?

Best Regards,
Chen

Comment 36 Chen 2017-03-31 03:22:22 UTC
Okay, I understand that the PMD assignment is round-robin and that we cannot manually specify the assignment in 2.5.1. I'm assuming the customer only used two PMD threads but has 4 ports. The customer should increase the PMD thread count so that every port can have a dedicated PMD thread.

Another concern: the customer found that if the instance uses CPUs in NUMA node1, the performance drops a lot. We both understand that all the CPUs involved should be in the same NUMA node, but the customer said this kind of situation (VM in NUMA1 while PMD threads are in NUMA0) might happen in an NFV environment.

So my questions are:

1. We should ensure co-location, and we don't support such a CPU partition, am I right?

2. Should we only set NUMA 0's cores in vcpu_pin_set of nova.conf? (In that way, other NUMA nodes' resources will never be used for DPDK?)

3. I noticed the following comments in [1]

"For v2.5.1, vhostuser ports are assumed to be on NUMA node0, and therefore will only run on node0.  If possible, plan for physical ports to also be on node 0, so co-location is obtained."

Does that mean that, for the current OVS version, we should only use NUMA 0's resources for OVS-DPDK? What would happen in future versions?

[1] https://docs.google.com/presentation/d/1ObBGuwG-Bx0z1_k-6RUvjbGMOURDio_zcchaQXAz_Ek/edit#slide=id.g13fb1e0270_0_17

Best Regards,
Chen

Comment 37 Chen 2017-04-07 02:38:39 UTC
Hi Andrew, Eyal,

Could you please help me with the questions in comment #36? The Huawei team is eager to know the answers.

Thank you in advance !

Best Regards,
Chen

Comment 38 Andrew Theurer 2017-04-10 12:28:15 UTC
1. We should ensure co-location, and we don't support such a CPU partition, am I right?

It is possible to use PMD threads on a different NUMA node, but the packet rate can be up to 40% lower.  Sometimes there is no other option if you do not have enough cores in a single NUMA node.

2. Should we only set NUMA 0's cores in vcpu_pin_set of nova.conf? (In that way, other NUMA nodes' resources will never be used for DPDK?)

If you have enough CPUs to do so, that is the recommendation.
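
For illustration only, a minimal nova.conf sketch of that recommendation. The ranges below are placeholders: they assume NUMA node0 holds logical CPUs 0-11 and 24-35, with 0,1,24,25 reserved for the host and 2,3,26,27 for the PMD threads; the real values depend on the host's NUMA layout.

[DEFAULT]
# guest vCPUs may only be pinned to these host CPUs (NUMA node0, minus host/PMD cores)
vcpu_pin_set = 4-11,28-35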

3. I noticed the following comments in [1]

  "For v2.5.1, vhostuser ports are assumed to be on NUMA node0, and therefore will only run on node0.  If possible, plan for physical ports to also be on node 0, so co-location is obtained."

  Does that mean that, for the current OVS version, we should only use NUMA 0's resources for OVS-DPDK? What would happen in future versions?

You can still use node1 for the physical-network PMD as long as the network adapter is located in NUMA node 1.  OVS 2.6 will allow other NUMA nodes to be used for the VM ports.

For OVS 2.5, the ideal is to have everything in NUMA node0.

For OVS 2.6, you have many more options, including isolating PMD threads to specific ports and not isolating others.  This way you do not need 1 PMD thread per port (which would use a lot of cores), but only 1 PMD thread per port that has a high-speed requirement.
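
As an illustrative sketch of that OVS 2.6 capability (the interface name and core ID are placeholders, not values from this setup), a port's rx queue can be pinned to a specific PMD core with pmd-rxq-affinity; ports without an affinity setting keep being distributed across the remaining, non-isolated PMD cores:

# ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:2"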

Comment 40 Franck Baudin 2017-06-01 12:59:31 UTC
Fixed with RHOSP 11, which ships OVS 2.6.1.