Bug 1378586

Summary: RFE: Inconsistent 64-byte Xena to testpmd/vhostuser throughput caused by mixed queues in a pmd core
Product: Red Hat Enterprise Linux 7 Reporter: Jean-Tsung Hsiao <jhsiao>
Component: openvswitch    Assignee: Flavio Leitner <fleitner>
Status: CLOSED CURRENTRELEASE QA Contact: Jean-Tsung Hsiao <jhsiao>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.3    CC: aloughla, atelang, atheurer, atragler, bmichalo, ctrautma, dshaks, fbaudin, fleitner, jhsiao, kzhang, mleitner, osabart, rcain, sukulkar, tli, ttracy, vchundur
Target Milestone: rc    Keywords: FutureFeature
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: openvswitch-2.6.1-10.git20161206.el7fdp Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-06-13 21:48:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Jean-Tsung Hsiao 2016-09-22 19:41:56 UTC
Description of problem: Inconsistent 64-byte Xena to testpmd/vhostuser throughput caused by mixed queues in a pmd core

Two different throughput rates have been observed --- 14.7 Mpps versus 13.6 Mpps.

*** test case ***

Xena to testpmd/vhostuser, 4Q, single direction
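
For reference, a 4-queue OVS-DPDK setup of this kind is typically configured roughly as below. This is only a sketch: the PMD mask targets cores 17, 19, 21 and 23 shown in the dumps that follow, and the per-interface n_rxq knob is the OVS 2.6-style syntax, so treat the exact commands as assumptions about this testbed.

  # 4 PMD threads on cores 17, 19, 21, 23 (NUMA node 1)
  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xaa0000
  # 4 RX queues on each physical DPDK port (vhost-user queue counts
  # are negotiated with the guest's virtio-net multiqueue setting)
  ovs-vsctl set Interface dpdk0 options:n_rxq=4
  ovs-vsctl set Interface dpdk1 options:n_rxq=4
  # dump the resulting queue-to-PMD mapping
  ovs-appctl dpif-netdev/pmd-rxq-show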

*** Throughput rate = 14.7 Mpps ***

This rate has been observed when there are no mixed queues in any of the four pmd cores, like:

[root@netqe5 dpdk-multique-scripts]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 17:
	port: dpdk0	queue-id: 0
	port: dpdk1	queue-id: 0
	port: vhost0	queue-id: 0
	port: vhost1	queue-id: 0
pmd thread numa_id 1 core_id 19:
	port: dpdk0	queue-id: 1
	port: dpdk1	queue-id: 1
	port: vhost0	queue-id: 1
	port: vhost1	queue-id: 1
pmd thread numa_id 1 core_id 23:
	port: dpdk0	queue-id: 3
	port: dpdk1	queue-id: 2
	port: vhost0	queue-id: 2
	port: vhost1	queue-id: 2
pmd thread numa_id 1 core_id 21:
	port: dpdk0	queue-id: 2
	port: dpdk1	queue-id: 3
	port: vhost0	queue-id: 3
	port: vhost1	queue-id: 3


*** Throughput rate = 13.6 Mpps ***

This rate has been observed when there are mixed queues in some of the four pmd cores, like:

[root@netqe5 dpdk-multique-scripts]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 17:
	port: dpdk0	queue-id: 0
	port: dpdk1	queue-id: 0
	port: vhost0	queue-id: 0
	port: vhost1	queue-id: 0
pmd thread numa_id 1 core_id 19:
	port: dpdk0	queue-id: 1
	port: dpdk1	queue-id: 1
	port: vhost0	queue-id: 1
	port: vhost1	queue-id: 1
pmd thread numa_id 1 core_id 23:
	port: dpdk0	queue-id: 3
	port: dpdk1	queue-id: 2
	port: vhost0	queue-id: 2
	port: vhost1	queue-id: 2
pmd thread numa_id 1 core_id 21:
	port: dpdk0	queue-id: 2
	port: dpdk1	queue-id: 3
	port: vhost0	queue-id: 3
	port: vhost1	queue-id: 3

Version-Release number of selected component (if applicable):
[root@netqe5 XenaScripts]# rpm -qa | grep openvswitch
openvswitch-2.5.0-5.git20160628.el7fdb.x86_64
[root@netqe5 XenaScripts]# rpm -qa | grep dpdk
kernel-kernel-networking-ovs-dpdk-vhostuser-1.0-5.noarch
dpdk-tools-16.04-4.el7fdb.x86_64
dpdk-16.04-4.el7fdb.x86_64
[root@netqe5 XenaScripts]# uname -a
Linux netqe5.knqe.lab.eng.bos.redhat.com 3.10.0-506.el7.x86_64 #1 SMP Mon Sep 12 23:31:02 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux


How reproducible: reproducible


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Jean-Tsung Hsiao 2016-09-22 19:55:34 UTC
A correction:


*** Throughput rate = 14.7 Mpps ***

This rate has been observed when there are no mixed queues in any of the four pmd cores, like:

[root@netqe5 dpdk-multique-scripts]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 17:
	port: dpdk0	queue-id: 0
	port: dpdk1	queue-id: 0
	port: vhost0	queue-id: 0
	port: vhost1	queue-id: 0
pmd thread numa_id 1 core_id 19:
	port: dpdk0	queue-id: 1
	port: dpdk1	queue-id: 1
	port: vhost0	queue-id: 1
	port: vhost1	queue-id: 1
pmd thread numa_id 1 core_id 21:
	port: dpdk0	queue-id: 2
	port: dpdk1	queue-id: 2
	port: vhost0	queue-id: 2
	port: vhost1	queue-id: 2
pmd thread numa_id 1 core_id 23:
	port: dpdk0	queue-id: 3
	port: dpdk1	queue-id: 3
	port: vhost0	queue-id: 3
	port: vhost1	queue-id: 3

Comment 3 Jean-Tsung Hsiao 2016-09-22 23:49:07 UTC

(In reply to Jean-Tsung Hsiao from comment #1)
> A correction:
> 
> 
> *** Throughput rate = 14.7 Mpps ***
> 
> This rate has been observed when there are no mixed queues in any of the
> four pmd cores, like:
> 
> [root@netqe5 dpdk-multique-scripts]# ovs-appctl dpif-netdev/pmd-rxq-show
> pmd thread numa_id 1 core_id 17:
> 	port: dpdk0	queue-id: 0
> 	port: dpdk1	queue-id: 0
> 	port: vhost0	queue-id: 0
> 	port: vhost1	queue-id: 0
> pmd thread numa_id 1 core_id 19:
> 	port: dpdk0	queue-id: 1
> 	port: dpdk1	queue-id: 1
> 	port: vhost0	queue-id: 1
> 	port: vhost1	queue-id: 1
> pmd thread numa_id 1 core_id 21:
> 	port: dpdk0	queue-id: 2
> 	port: dpdk1	queue-id: 2
> 	port: vhost0	queue-id: 2
> 	port: vhost1	queue-id: 2
> pmd thread numa_id 1 core_id 23:
> 	port: dpdk0	queue-id: 3
> 	port: dpdk1	queue-id: 3
> 	port: vhost0	queue-id: 3
> 	port: vhost1	queue-id: 3

With openvswitch-2.5.0-10.git20160727.el7fdb.x86_64 I never got such a queue-to-core alignment as the one above after 5 tries.

That explains why I never got 14.7 Mpps with openvswitch-2.5.0-10.git20160727.el7fdb.x86_64.

Comment 4 Jean-Tsung Hsiao 2016-09-23 02:24:14 UTC
Below are three sets of 4-queue to 8-core mappings --- most of the cores host more than one queue.

[root@netqe5 dpdk-multique-scripts]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 11:
    port: dpdk0    queue-id: 1
    port: vhost0    queue-id: 0
pmd thread numa_id 1 core_id 15:
    port: dpdk0    queue-id: 3
    port: vhost0    queue-id: 1
pmd thread numa_id 1 core_id 9:
    port: dpdk0    queue-id: 0
    port: vhost0    queue-id: 2
pmd thread numa_id 1 core_id 17:
    port: dpdk1    queue-id: 0
    port: vhost0    queue-id: 3
pmd thread numa_id 1 core_id 21:
    port: dpdk1    queue-id: 1
    port: vhost1    queue-id: 0
pmd thread numa_id 1 core_id 13:
    port: dpdk0    queue-id: 2
    port: vhost1    queue-id: 1
pmd thread numa_id 1 core_id 19:
    port: dpdk1    queue-id: 2
    port: vhost1    queue-id: 2
pmd thread numa_id 1 core_id 23:
    port: dpdk1    queue-id: 3
    port: vhost1    queue-id: 3

[root@netqe5 dpdk-multique-scripts]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 11:
    port: dpdk0    queue-id: 1
    port: vhost0    queue-id: 0
pmd thread numa_id 1 core_id 9:
    port: dpdk0    queue-id: 0
    port: vhost0    queue-id: 1
pmd thread numa_id 1 core_id 13:
    port: dpdk0    queue-id: 2
    port: vhost0    queue-id: 2
pmd thread numa_id 1 core_id 17:
    port: dpdk1    queue-id: 0
    port: vhost0    queue-id: 3
pmd thread numa_id 1 core_id 19:
    port: dpdk1    queue-id: 1
    port: vhost1    queue-id: 0
pmd thread numa_id 1 core_id 15:
    port: dpdk0    queue-id: 3
    port: vhost1    queue-id: 1
pmd thread numa_id 1 core_id 21:
    port: dpdk1    queue-id: 2
    port: vhost1    queue-id: 2
pmd thread numa_id 1 core_id 23:
    port: dpdk1    queue-id: 3
    port: vhost1    queue-id: 3

[root@netqe5 dpdk-multique-scripts]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 19:
    port: dpdk1    queue-id: 0
    port: vhost0    queue-id: 0
pmd thread numa_id 1 core_id 13:
    port: dpdk0    queue-id: 2
    port: vhost0    queue-id: 1
pmd thread numa_id 1 core_id 17:
    port: dpdk1    queue-id: 1
    port: vhost0    queue-id: 2
pmd thread numa_id 1 core_id 15:
    port: dpdk0    queue-id: 3
    port: vhost0    queue-id: 3
pmd thread numa_id 1 core_id 9:
    port: dpdk0    queue-id: 0
    port: vhost1    queue-id: 0
pmd thread numa_id 1 core_id 11:
    port: dpdk0    queue-id: 1
    port: vhost1    queue-id: 1
pmd thread numa_id 1 core_id 21:
    port: dpdk1    queue-id: 2
    port: vhost1    queue-id: 2
pmd thread numa_id 1 core_id 23:
    port: dpdk1    queue-id: 3
    port: vhost1    queue-id: 3

Comment 5 Christian Trautman 2016-10-17 20:24:52 UTC
This is very easy to reproduce with 4 PMDs using 2 queues with uni-directional traffic.

With the following mapping I can get close to 9 Mpps.

pmd thread numa_id 0 core_id 18:
	port: dpdk0	queue-id: 0
	port: dpdkvhostuser0	queue-id: 0
pmd thread numa_id 0 core_id 22:
	port: dpdk0	queue-id: 1
	port: dpdkvhostuser0	queue-id: 1
pmd thread numa_id 0 core_id 42:
	port: dpdk1	queue-id: 0
	port: dpdkvhostuser1	queue-id: 0
pmd thread numa_id 0 core_id 46:
	port: dpdk1	queue-id: 1
	port: dpdkvhostuser1	queue-id: 1

throughput_rx_fps, Value: 8737506

If the queues cross hyper-threads such that a port's queues end up crossed between PMDs, as seen on PMDs 18 and 22 below, then the performance drops.

pmd thread numa_id 0 core_id 46:
	port: dpdk1	queue-id: 0
	port: dpdkvhostuser0	queue-id: 0
pmd thread numa_id 0 core_id 18:
	port: dpdk0	queue-id: 0
	port: dpdkvhostuser0	queue-id: 1
pmd thread numa_id 0 core_id 22:
	port: dpdk0	queue-id: 1
	port: dpdkvhostuser1	queue-id: 0
pmd thread numa_id 0 core_id 42:
	port: dpdk1	queue-id: 1
	port: dpdkvhostuser1	queue-id: 1

throughput_rx_fps, Value: 4910132

We have been chasing this one for a while, as our MQ testing can produce inconsistent results depending on the mapping.

If I move up to 8 PMDs with 2 queues there is less chance of an inconsistent result, but it still can occur.

The following mapping produced 13407173 fps with bi-directional traffic.

[DEBUG]  2016-10-17 16:20:41,286 : (ovs_dpdk_vhost) - cmd : ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 14:
	port: dpdk0	queue-id: 1
pmd thread numa_id 0 core_id 36:
	port: dpdk1	queue-id: 0
pmd thread numa_id 0 core_id 18:
	port: dpdk1	queue-id: 1
pmd thread numa_id 0 core_id 22:
	port: dpdkvhostuser0	queue-id: 0
pmd thread numa_id 0 core_id 12:
	port: dpdk0	queue-id: 0
pmd thread numa_id 0 core_id 46:
	port: dpdkvhostuser0	queue-id: 1
pmd thread numa_id 0 core_id 42:
	port: dpdkvhostuser1	queue-id: 0
pmd thread numa_id 0 core_id 38:
	port: dpdkvhostuser1	queue-id: 1

This performance result occurs infrequently (maybe one in 6 tries). 

I usually get performance around 17530658 fps with bi-directional traffic. The following mapping reproduces that number.

pmd thread numa_id 0 core_id 12:
	port: dpdk0	queue-id: 0
pmd thread numa_id 0 core_id 22:
	port: dpdk1	queue-id: 0
pmd thread numa_id 0 core_id 36:
	port: dpdk1	queue-id: 1
pmd thread numa_id 0 core_id 38:
	port: dpdkvhostuser0	queue-id: 0
pmd thread numa_id 0 core_id 42:
	port: dpdkvhostuser0	queue-id: 1
pmd thread numa_id 0 core_id 14:
	port: dpdk0	queue-id: 1
pmd thread numa_id 0 core_id 18:
	port: dpdkvhostuser1	queue-id: 0
pmd thread numa_id 0 core_id 46:
	port: dpdkvhostuser1	queue-id: 1

Comment 6 liting 2016-11-29 07:34:49 UTC
In the gating CI of OVS 2.5 git 22 on RHEL 7.3, the following cases produced results below the baseline; this may be caused by this bug.

without vlan traffic:
pvp_cont 4queue 8pmd 10.05 (baseline 18)
pvp_tput 2queue 8pmd 64, 0.002 9.55 (baseline 14)
pvp_tput 1 queue 4pmd testpmd 64, 0.00 6.64 (baseline 7)

with vlan traffic:
pvp_tput 2queue 4pmd 64, 0.002 9.94(baseline 16)
pvp_tput 2queue 8pmd 64, 0.002 9.66(baseline 16)
pvp_tput 1 queue 4pmd testpmd 64, 0.00 6.45(baseline 8)
pvp_tput 2 queue 4pmd 64, 0.00 7.19(baseline 8)

Comment 7 Flavio Leitner 2016-11-30 13:31:48 UTC
Looking at the package changelog between -5 (working version according to the summary) and -10 (bad version according to comment #3):

* Fri Aug 26 2016 Panu Matilainen <pmatilai> - 2.5.0-10.git20160727
- Fix adding ukeys for same flow by different pmds (#1364898)

* Thu Jul 28 2016 Flavio Leitner <fbl> - 2.5.0-9.git20160727
- Fixed ifup-ovs to support DPDK Bond (#1360426)

* Thu Jul 28 2016 Flavio Leitner <fbl> - 2.5.0-8.git20160727
- Fixed ifup-ovs to delete the ports first (#1359890)

* Wed Jul 27 2016 Flavio Leitner <fbl> - 2.5.0-7.git20160727
- pull bugfixes from upstream 2.5 branch (#1360431)

* Tue Jul 26 2016 Flavio Leitner <fbl> - 2.5.0-6.git20160628
- Removed redundant provides for openvswitch
- Added epoch to the provides for -static package

* Thu Jul 21 2016 Flavio Leitner <fbl> - 2.5.0-5.git20160628
- Renamed to openvswitch (dpdk enabled)
- Enabled sub-packages
- Removed conflicts to openvswitch
- Increased epoch to give this package preference over stable

Comment 8 Flavio Leitner 2016-11-30 13:48:40 UTC
These are the possible related changes:
* Fri Aug 26 2016 Panu Matilainen <pmatilai> - 2.5.0-10.git20160727
- Fix adding ukeys for same flow by different pmds (#1364898)

* Wed Jul 27 2016 Flavio Leitner <fbl> - 2.5.0-7.git20160727
- pull bugfixes from upstream 2.5 branch (#1360431)


Most probably something in the -7 update; these are the suspicious ones:


Author: Kevin Traynor <kevin.traynor>
Date:   Fri Jun 10 17:49:38 2016 +0100

    netdev-dpdk: Remove vhost send retries when no packets have been sent.
    
    If the guest is connected but not servicing the virt queue, this leads
    to vhost send retries until timeout. This is fine in isolation but if
    there are other high rate queues also being serviced by the same PMD
    it can lead to a performance hit on those queues. Change to only retry
    when at least some packets have been successfully sent on the previous
    attempt.
    
    Also, limit retries to avoid a similar delays if packets are being sent
    at a very low rate due to few available descriptors.

commit 4338c1d35faee97e2a0f4f83736286d3fdfc2c9a
Author: Zoltán Balogh <zoltan.balogh>
Date:   Fri Jul 15 10:28:33 2016 +0000

    netdev-dpdk: vhost-user port link state fix
    
    OVS reports that link state of a vhost-user port (type=dpdkvhostuser) is
    DOWN, even when traffic is running through the port between a Virtual
    Machine and the vSwitch. Changing admin state with the
    "ovs-ofctl mod-port <BR> <PORT> up/down" command over OpenFlow does
    affect neither the reported link state nor the traffic.
    
    The patch below does the flowing:
     - Triggers link state change by altering netdev's change_seq member.
     - Controls sending/receiving of packets through vhost-user port
       according to the port's current admin state.
     - Sets admin state of newly created vhost-user port to UP.

commit 2e6a1eae96615cab458757d62c95ce9993df7202
Author: Flavio Leitner <fbl>
Date:   Tue Jul 5 10:33:38 2016 -0300

    dpif-netdev: Remove PMD latency on seq_mutex
    
    The PMD thread needs to keep processing RX queues in order
    to achieve maximum throughput. It also needs to sweep emc
    cache and quiesce which use seq_mutex. That mutex can
    eventually block the PMD thread causing latency spikes and
    affecting the throughput.
    
    Since there is no requirement for running those tasks at a
    specific time, this patch extend seq API to allow tentative
    locking instead.


Those should not change the way queues are mapped, so I think -5 worked by chance?

Comment 9 Jean-Tsung Hsiao 2016-11-30 14:07:12 UTC
(In reply to Flavio Leitner from comment #8)
> These are the possible related changes:
> * Fri Aug 26 2016 Panu Matilainen <pmatilai> -
> 2.5.0-10.git20160727
> - Fix adding ukeys for same flow by different pmds (#1364898)
> 
> * Wed Jul 27 2016 Flavio Leitner <fbl> - 2.5.0-7.git20160727
> - pull bugfixes from upstream 2.5 branch (#1360431)
> 
> 
> Most probably something in the -7 update, these are the suspicious ones:
> 
> 
> Author: Kevin Traynor <kevin.traynor>
> Date:   Fri Jun 10 17:49:38 2016 +0100
> 
>     netdev-dpdk: Remove vhost send retries when no packets have been sent.
>     
>     If the guest is connected but not servicing the virt queue, this leads
>     to vhost send retries until timeout. This is fine in isolation but if
>     there are other high rate queues also being serviced by the same PMD
>     it can lead to a performance hit on those queues. Change to only retry
>     when at least some packets have been successfully sent on the previous
>     attempt.
>     
>     Also, limit retries to avoid a similar delays if packets are being sent
>     at a very low rate due to few available descriptors.
> 
> commit 4338c1d35faee97e2a0f4f83736286d3fdfc2c9a
> Author: Zoltán Balogh <zoltan.balogh>
> Date:   Fri Jul 15 10:28:33 2016 +0000
> 
>     netdev-dpdk: vhost-user port link state fix
>     
>     OVS reports that link state of a vhost-user port (type=dpdkvhostuser) is
>     DOWN, even when traffic is running through the port between a Virtual
>     Machine and the vSwitch. Changing admin state with the
>     "ovs-ofctl mod-port <BR> <PORT> up/down" command over OpenFlow does
>     affect neither the reported link state nor the traffic.
>     
>     The patch below does the flowing:
>      - Triggers link state change by altering netdev's change_seq member.
>      - Controls sending/receiving of packets through vhost-user port
>        according to the port's current admin state.
>      - Sets admin state of newly created vhost-user port to UP.
> 
> commit 2e6a1eae96615cab458757d62c95ce9993df7202
> Author: Flavio Leitner <fbl>
> Date:   Tue Jul 5 10:33:38 2016 -0300
> 
>     dpif-netdev: Remove PMD latency on seq_mutex
>     
>     The PMD thread needs to keep processing RX queues in order
>     to achieve maximum throughput. It also needs to sweep emc
>     cache and quiesce which use seq_mutex. That mutex can
>     eventually block the PMD thread causing latency spikes and
>     affecting the throughput.
>     
>     Since there is no requirement for running those tasks at a
>     specific time, this patch extend seq API to allow tentative
>     locking instead.
> 
> 
> Those should not change the way queues are mapped, so I think -5 worked by
> chance?

Hi Flavio,

Yes, it is by chance. That's been my conclusion after so many tests with various OVS packages including -22.

Comment 10 Flavio Leitner 2016-12-01 05:10:44 UTC
Hi,

What happens is that when the first port is added, the PMD threads are created and the queues are distributed using one algorithm.

Then when the following ports are added, ovs distributes each queue to the least loaded PMD thread, so it might happen to map to different PMD threads.
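
This can be seen by adding the ports one at a time and dumping the mapping after each add; a minimal sketch (the bridge name here is illustrative):

  ovs-vsctl add-port ovsbr0 dpdk0 -- set Interface dpdk0 type=dpdk
  ovs-appctl dpif-netdev/pmd-rxq-show    # first port: queues spread sequentially over the PMDs
  ovs-vsctl add-port ovsbr0 vhost0 -- set Interface vhost0 type=dpdkvhostuser
  ovs-appctl dpif-netdev/pmd-rxq-show    # later ports: each queue goes to the least loaded PMD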

I have a patch to fix it.

However, it's not clear to me why this would cause a performance issue, because each PMD has to poll at least one RX queue anyway and then send to the egress device queue, which is given by the PMD thread itself (pmd->tx_qid). Unless the traffic is unbalanced, in which case one PMD thread could be too busy while others are lightly loaded.

Perhaps you can explain why it happens, otherwise could you please get some outputs while in good state and while in bad state for me to compare?

Capture perf: 'perf record -g -C <pmd_cpu1>,<pmd_cpu2>,.. sleep 60'
for each pmd_cpu:
  perf report -g --no-children -C <pmd_cpu> --stdio 

and the stats:
  // clear the stats after the reproducer is stable
  ovs-appctl dpif-netdev/pmd-stats-clear
  // wait one minute
  sleep 60
  // capture the PMD stats
  ovs-appctl dpif-netdev/pmd-stats-show

This should help to identify the root cause for the perf issue.
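
A scripted version of the capture above might look like the following (a sketch; the PMD core list is an assumption, substitute the cores from your pmd-cpu-mask):

  PMD_CPUS="17,19,21,23"
  perf record -g -C "$PMD_CPUS" sleep 60
  for cpu in ${PMD_CPUS//,/ }; do
      perf report -g --no-children -C "$cpu" --stdio > perf-pmd-$cpu.txt
  done
  # clear the stats once the reproducer is stable, wait a minute, then capture
  ovs-appctl dpif-netdev/pmd-stats-clear
  sleep 60
  ovs-appctl dpif-netdev/pmd-stats-show > pmd-stats.txt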

Thanks!

Comment 11 Flavio Leitner 2016-12-01 05:21:07 UTC
As a workaround, you can set pmd-cpu-mask to a single PMD thread and then back to the value you need.  That should redistribute all queues properly.
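
For example, with the 4-PMD layout from the description (the masks below are illustrative; use the cores from your own pmd-cpu-mask):

  # shrink to a single PMD, then restore the full mask to force a clean redistribution
  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x20000
  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xaa0000
  ovs-appctl dpif-netdev/pmd-rxq-show    # verify that no PMD ended up with mixed queue IDs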

Comment 12 Flavio Leitner 2016-12-01 05:33:24 UTC
Brew build with the test patch applied to fix the queue ordering:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12180883

Comment 13 Jean-Tsung Hsiao 2016-12-01 17:05:07 UTC
(In reply to Flavio Leitner from comment #11)
> As a work around, you can set pmd-cpu-mask to one PMD thread and then back
> to the value you need.  That should re-distribute all queues properly.

This workaround is good. I have tried it with -14 fdP and -22 fdP.

Comment 14 Jean-Tsung Hsiao 2016-12-01 17:11:23 UTC
(In reply to Flavio Leitner from comment #10)
> Hi,
> 
> What happens is that when the first port is added, the PMD threads are
> created and the queues are distributed using one algorithm.
> 
> Then when the following ports are added, ovs distributes each queue to the
> least loaded PMD thread, so it might happen to map to different PMD threads.
> 
> I have a patch to fix it.
> 
> However, it's not clear to me why this would cause a performance issue
> because each PMD would have to poll at least one RX queue anyways, then send
> to the egress device queue which is given by the PMD thread itself
> (pmd->tx_qid). Unless the traffic is unbalanced, then one PMD thread could
> be too busy while others are not so loaded.
> 
> Perhaps you can explain why it happens, otherwise could you please get some

My thinking is that when queues cross cores, there could be contention issues --- cache, memory, datapath, ...

It's probably hard to measure using perf, though.

Anyway, the reality is that when no queues cross cores, we get the best Mpps rate. We have proven it again and again.

> outputs while in good state and while in bad state for me to compare?
> 
> Capture perf: 'perf record -g -C <pmd_cpu1>,<pmd_cpu2>,.. sleep 60'
> for each pmd_cpu:
>   perf report -g --no-children -C <pmd_cpu> --stdio 
> 
> and the stats:
>   // clear the stats after the reproducer is stable
>   ovs-appctl dpif-netdev/pmd-stats-clear
>   // wait one minute
>   sleep 60
>   // capture the PMD stats
>   ovs-appctl dpif-netdev/pmd-stats-show
> 
> This should help to identify the root cause for the perf issue.
> 
> Thanks!

Comment 15 Flavio Leitner 2016-12-02 18:38:55 UTC
(In reply to Christian Trautman from comment #5)
> This is very easy to reproduce with 4PMDs using 2 queues with
> uni-directional traffic.
> 
> With the following mapping I can get close to 9mpps.
> 
> pmd thread numa_id 0 core_id 18:
> 	port: dpdk0	queue-id: 0
> 	port: dpdkvhostuser0	queue-id: 0
> pmd thread numa_id 0 core_id 22:
> 	port: dpdk0	queue-id: 1
> 	port: dpdkvhostuser0	queue-id: 1
> pmd thread numa_id 0 core_id 42:
> 	port: dpdk1	queue-id: 0
> 	port: dpdkvhostuser1	queue-id: 0
> pmd thread numa_id 0 core_id 46:
> 	port: dpdk1	queue-id: 1
> 	port: dpdkvhostuser1	queue-id: 1
> 
> throughput_rx_fps, Value: 8737506
> 
> If the queues cross hyper-threads such that a queue port ends up crossed as
> seen on PMDs 18 and 22 then the performance drops.
> 
> pmd thread numa_id 0 core_id 46:
> 	port: dpdk1	queue-id: 0
> 	port: dpdkvhostuser0	queue-id: 0
> pmd thread numa_id 0 core_id 18:
> 	port: dpdk0	queue-id: 0
> 	port: dpdkvhostuser0	queue-id: 1
> pmd thread numa_id 0 core_id 22:
> 	port: dpdk0	queue-id: 1
> 	port: dpdkvhostuser1	queue-id: 0
> pmd thread numa_id 0 core_id 42:
> 	port: dpdk1	queue-id: 1
> 	port: dpdkvhostuser1	queue-id: 1
> 
> throughput_rx_fps, Value: 4910132

The second scenario gives bad throughput because the workload is not balanced.
If you look at core#22, it is polling dpdk0, which is most probably receiving traffic from the network, and also polling vhostuser1, which is most probably receiving traffic from the guest, so it is a fully loaded PMD.

On the other hand, core#46 is most probably idling, because dpdk1 does not receive traffic and neither does dpdkvhostuser0 (TX only).

By comparison, in the first scenario above each PMD would be polling only one loaded device.


However, if you change the test scenario to be:
 dpdk0 -> vhostuser1 --------\
                              testpmd
 dpdk1 <- vhostuser0 --------/

Then with the first mapping you would have core#18 and core#22 fully loaded while core#42 and core#46 are mostly idle, which is not good either.

So, this is another issue and the only way to fix it is to load balance the workload at run time, which is another bug and 2.7 material (or even newer).

Comment 16 Flavio Leitner 2016-12-02 20:07:08 UTC
Regarding the scenario described in the summary, it seems that when the queues with the same ID are not on the same PMD, we double the number of entries in the EMC, and that additional cost would explain the ~1 Mpps drop.

I still have to try this on my testbed, but based on a code review, the PMD tx_qids are sequential starting from zero and the queues are also distributed sequentially starting from zero.

Then in the ideal situation we have:
core#0 (tx_qid=0)
 dpdk0q0  -> vhu0q0  -----------\
                               testpmd
 dpdk1q0  <- vhu1q0  -----------/

So, the PMD EMC sees the same stream twice, saving space/reducing costs.

However, if the core polls on another queue:

core#0 (tx_qid=2)
  dpdk0q2 -> vhu0q2 -----------\
  dpdk1q2 <- vhu1q3 ----------------------\
                              testpmd   testpmd
core#1 (tx_qid=3)              /           /
  dpdk1q3 <- vhu1q2 -----------           /
  dpdk0q3 -> vhu0q3 ----------------------
  
Then each core sees twice as many streams, which leads to a bigger EMC per PMD and, most probably, an additional cost.

I will try to confirm this on my testbed as a next step. So there is no need for perf or stats outputs anymore; they won't help for this particular scenario.
Thanks

Comment 17 Flavio Leitner 2016-12-21 15:24:12 UTC
*** Bug 1358010 has been marked as a duplicate of this bug. ***

Comment 18 Flavio Leitner 2017-06-07 17:34:06 UTC
Now OVS provides queue affinity to make sure that the queues are always initialized as configured.  Does that solve the problem here?
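
For reference, with rxq affinity the pinning can be expressed per interface roughly like this (a sketch reusing the 4-PMD layout from the description; the core IDs are otherwise an assumption):

  ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:17,1:19,2:21,3:23"
  ovs-vsctl set Interface dpdk1 other_config:pmd-rxq-affinity="0:17,1:19,2:21,3:23"
  ovs-appctl dpif-netdev/pmd-rxq-show    # confirm each PMD polls matching queue IDs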

Comment 19 Jean-Tsung Hsiao 2017-06-09 02:23:06 UTC
(In reply to Flavio Leitner from comment #18)
> Now OVS provides queue affinity to make sure that the queues are always
> initialized as configured.  Does that solve the problem here?

Yes, that should work.