| Summary: | RFE: Inconsistent 64 bytes Xena to testpmd/vhostuser throughput caused by mixed queues in a pmd core | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Jean-Tsung Hsiao <jhsiao> |
| Component: | openvswitch | Assignee: | Flavio Leitner <fleitner> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Jean-Tsung Hsiao <jhsiao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 7.3 | CC: | aloughla, atelang, atheurer, atragler, bmichalo, ctrautma, dshaks, fbaudin, fleitner, jhsiao, kzhang, mleitner, osabart, rcain, sukulkar, tli, ttracy, vchundur |
| Target Milestone: | rc | Keywords: | FutureFeature |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | openvswitch-2.6.1-10.git20161206.el7fdp | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-06-13 21:48:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Jean-Tsung Hsiao
2016-09-22 19:41:56 UTC
A correction:

*** Throughput rate = 14.7 Mpps ***

This rate has been observed when there are no mixed queues in any of the four pmd cores, like:

[root@netqe5 dpdk-multique-scripts]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 17:
port: dpdk0 queue-id: 0
port: dpdk1 queue-id: 0
port: vhost0 queue-id: 0
port: vhost1 queue-id: 0
pmd thread numa_id 1 core_id 19:
port: dpdk0 queue-id: 1
port: dpdk1 queue-id: 1
port: vhost0 queue-id: 1
port: vhost1 queue-id: 1
pmd thread numa_id 1 core_id 21:
port: dpdk0 queue-id: 2
port: dpdk1 queue-id: 2
port: vhost0 queue-id: 2
port: vhost1 queue-id: 2
pmd thread numa_id 1 core_id 23:
port: dpdk0 queue-id: 3
port: dpdk1 queue-id: 3
port: vhost0 queue-id: 3
port: vhost1 queue-id: 3

(In reply to Jean-Tsung Hsiao from comment #1)
> *** Throughput rate = 14.7 Mpps ***
> This rate has been observed when there are no mixed queues in any of the
> four pmd cores.

With openvswitch-2.5.0-10.git20160727.el7fdb.x86_64 I never got such a queue-to-core alignment like the one above after 5 tries. That explains why I never got 14.7 Mpps with openvswitch-2.5.0-10.git20160727.el7fdb.x86_64.

Below are three sets of 4-queue to 8-core mappings --- most of the cores host more than one queue (a short parsing sketch follows the three dumps).
[root@netqe5 dpdk-multique-scripts]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 11:
port: dpdk0 queue-id: 1
port: vhost0 queue-id: 0
pmd thread numa_id 1 core_id 15:
port: dpdk0 queue-id: 3
port: vhost0 queue-id: 1
pmd thread numa_id 1 core_id 9:
port: dpdk0 queue-id: 0
port: vhost0 queue-id: 2
pmd thread numa_id 1 core_id 17:
port: dpdk1 queue-id: 0
port: vhost0 queue-id: 3
pmd thread numa_id 1 core_id 21:
port: dpdk1 queue-id: 1
port: vhost1 queue-id: 0
pmd thread numa_id 1 core_id 13:
port: dpdk0 queue-id: 2
port: vhost1 queue-id: 1
pmd thread numa_id 1 core_id 19:
port: dpdk1 queue-id: 2
port: vhost1 queue-id: 2
pmd thread numa_id 1 core_id 23:
port: dpdk1 queue-id: 3
port: vhost1 queue-id: 3
[root@netqe5 dpdk-multique-scripts]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 11:
port: dpdk0 queue-id: 1
port: vhost0 queue-id: 0
pmd thread numa_id 1 core_id 9:
port: dpdk0 queue-id: 0
port: vhost0 queue-id: 1
pmd thread numa_id 1 core_id 13:
port: dpdk0 queue-id: 2
port: vhost0 queue-id: 2
pmd thread numa_id 1 core_id 17:
port: dpdk1 queue-id: 0
port: vhost0 queue-id: 3
pmd thread numa_id 1 core_id 19:
port: dpdk1 queue-id: 1
port: vhost1 queue-id: 0
pmd thread numa_id 1 core_id 15:
port: dpdk0 queue-id: 3
port: vhost1 queue-id: 1
pmd thread numa_id 1 core_id 21:
port: dpdk1 queue-id: 2
port: vhost1 queue-id: 2
pmd thread numa_id 1 core_id 23:
port: dpdk1 queue-id: 3
port: vhost1 queue-id: 3
[root@netqe5 dpdk-multique-scripts]# ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 1 core_id 19:
port: dpdk1 queue-id: 0
port: vhost0 queue-id: 0
pmd thread numa_id 1 core_id 13:
port: dpdk0 queue-id: 2
port: vhost0 queue-id: 1
pmd thread numa_id 1 core_id 17:
port: dpdk1 queue-id: 1
port: vhost0 queue-id: 2
pmd thread numa_id 1 core_id 15:
port: dpdk0 queue-id: 3
port: vhost0 queue-id: 3
pmd thread numa_id 1 core_id 9:
port: dpdk0 queue-id: 0
port: vhost1 queue-id: 0
pmd thread numa_id 1 core_id 11:
port: dpdk0 queue-id: 1
port: vhost1 queue-id: 1
pmd thread numa_id 1 core_id 21:
port: dpdk1 queue-id: 2
port: vhost1 queue-id: 2
pmd thread numa_id 1 core_id 23:
port: dpdk1 queue-id: 3
port: vhost1 queue-id: 3
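
A quick way to tabulate dumps like the three above is sketched below; it only assumes the pmd-rxq-show text format shown here and prints core/port/queue triples so mixed cores stand out:

    ovs-appctl dpif-netdev/pmd-rxq-show | awk '
        /^pmd thread/ { core = $NF }            # e.g. "17:"
        /queue-id/    { print core, $2, $4 }    # core, port, queue-id
    ' | sort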
This is very easy to reproduce with 4 PMDs using 2 queues with uni-directional traffic.

With the following mapping I can get close to 9 Mpps.

pmd thread numa_id 0 core_id 18:
port: dpdk0 queue-id: 0
port: dpdkvhostuser0 queue-id: 0
pmd thread numa_id 0 core_id 22:
port: dpdk0 queue-id: 1
port: dpdkvhostuser0 queue-id: 1
pmd thread numa_id 0 core_id 42:
port: dpdk1 queue-id: 0
port: dpdkvhostuser1 queue-id: 0
pmd thread numa_id 0 core_id 46:
port: dpdk1 queue-id: 1
port: dpdkvhostuser1 queue-id: 1

throughput_rx_fps, Value: 8737506

If the queues cross hyper-threads such that a queue port ends up crossed, as seen on PMDs 18 and 22 below, then the performance drops.

pmd thread numa_id 0 core_id 46:
port: dpdk1 queue-id: 0
port: dpdkvhostuser0 queue-id: 0
pmd thread numa_id 0 core_id 18:
port: dpdk0 queue-id: 0
port: dpdkvhostuser0 queue-id: 1
pmd thread numa_id 0 core_id 22:
port: dpdk0 queue-id: 1
port: dpdkvhostuser1 queue-id: 0
pmd thread numa_id 0 core_id 42:
port: dpdk1 queue-id: 1
port: dpdkvhostuser1 queue-id: 1

throughput_rx_fps, Value: 4910132

We have been chasing this one for a while, as our MQ testing can produce inconsistent results depending on the mapping.

If I move up to 8 PMDs with 2 queues there is less chance of an inconsistent result, but it still can occur. The following mapping produced 13407173 fps with bi-directional traffic.

[DEBUG] 2016-10-17 16:20:41,286 : (ovs_dpdk_vhost) - cmd : ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 14:
port: dpdk0 queue-id: 1
pmd thread numa_id 0 core_id 36:
port: dpdk1 queue-id: 0
pmd thread numa_id 0 core_id 18:
port: dpdk1 queue-id: 1
pmd thread numa_id 0 core_id 22:
port: dpdkvhostuser0 queue-id: 0
pmd thread numa_id 0 core_id 12:
port: dpdk0 queue-id: 0
pmd thread numa_id 0 core_id 46:
port: dpdkvhostuser0 queue-id: 1
pmd thread numa_id 0 core_id 42:
port: dpdkvhostuser1 queue-id: 0
pmd thread numa_id 0 core_id 38:
port: dpdkvhostuser1 queue-id: 1

This performance result occurs infrequently (maybe one in 6 tries). I usually get performance similar to 17530658 fps with bi-directional traffic. This mapping would reproduce that number.

pmd thread numa_id 0 core_id 12:
port: dpdk0 queue-id: 0
pmd thread numa_id 0 core_id 22:
port: dpdk1 queue-id: 0
pmd thread numa_id 0 core_id 36:
port: dpdk1 queue-id: 1
pmd thread numa_id 0 core_id 38:
port: dpdkvhostuser0 queue-id: 0
pmd thread numa_id 0 core_id 42:
port: dpdkvhostuser0 queue-id: 1
pmd thread numa_id 0 core_id 14:
port: dpdk0 queue-id: 1
pmd thread numa_id 0 core_id 18:
port: dpdkvhostuser1 queue-id: 0
pmd thread numa_id 0 core_id 46:
port: dpdkvhostuser1 queue-id: 1

In the gating CI of ovs 2.5 git 22 on RHEL 7.3, the following cases came in below the baseline; this may be caused by this bug.
without vlan traffic:
pvp_cont 4queue 8pmd 10.05 (baseline 18)
pvp_tput 2queue 8pmd 64, 0.002 9.55 (baseline 14)
pvp_tput 1 queue 4pmd testpmd 64, 0.00 6.64 (baseline 7)

with vlan traffic:
pvp_tput 2queue 4pmd 64, 0.002 9.94 (baseline 16)
pvp_tput 2queue 8pmd 64, 0.002 9.66 (baseline 16)
pvp_tput 1 queue 4pmd testpmd 64, 0.00 6.45 (baseline 8)
pvp_tput 2 queue 4pmd 64, 0.00 7.19 (baseline 8)

Looking at the package changelog between -5 (the working version according to the summary) and -10 (the bad version according to comment #3):

* Fri Aug 26 2016 Panu Matilainen <pmatilai> - 2.5.0-10.git20160727
- Fix adding ukeys for same flow by different pmds (#1364898)

* Thu Jul 28 2016 Flavio Leitner <fbl> - 2.5.0-9.git20160727
- Fixed ifup-ovs to support DPDK Bond (#1360426)

* Thu Jul 28 2016 Flavio Leitner <fbl> - 2.5.0-8.git20160727
- Fixed ifup-ovs to delete the ports first (#1359890)

* Wed Jul 27 2016 Flavio Leitner <fbl> - 2.5.0-7.git20160727
- pull bugfixes from upstream 2.5 branch (#1360431)

* Tue Jul 26 2016 Flavio Leitner <fbl> - 2.5.0-6.git20160628
- Removed redundant provides for openvswitch
- Added epoch to the provides for -static package

* Thu Jul 21 2016 Flavio Leitner <fbl> - 2.5.0-5.git20160628
- Renamed to openvswitch (dpdk enabled)
- Enabled sub-packages
- Removed conflicts to openvswitch
- Increased epoch to give this package preference over stable

These are the possible related changes:
* Fri Aug 26 2016 Panu Matilainen <pmatilai> - 2.5.0-10.git20160727
- Fix adding ukeys for same flow by different pmds (#1364898)
* Wed Jul 27 2016 Flavio Leitner <fbl> - 2.5.0-7.git20160727
- pull bugfixes from upstream 2.5 branch (#1360431)
Most probably it is something in the -7 update; these are the suspicious ones:
Author: Kevin Traynor <kevin.traynor>
Date: Fri Jun 10 17:49:38 2016 +0100
netdev-dpdk: Remove vhost send retries when no packets have been sent.
If the guest is connected but not servicing the virt queue, this leads
to vhost send retries until timeout. This is fine in isolation but if
there are other high rate queues also being serviced by the same PMD
it can lead to a performance hit on those queues. Change to only retry
when at least some packets have been successfully sent on the previous
attempt.
Also, limit retries to avoid similar delays if packets are being sent
at a very low rate due to few available descriptors.
commit 4338c1d35faee97e2a0f4f83736286d3fdfc2c9a
Author: Zoltán Balogh <zoltan.balogh>
Date: Fri Jul 15 10:28:33 2016 +0000
netdev-dpdk: vhost-user port link state fix
OVS reports that link state of a vhost-user port (type=dpdkvhostuser) is
DOWN, even when traffic is running through the port between a Virtual
Machine and the vSwitch. Changing admin state with the
"ovs-ofctl mod-port <BR> <PORT> up/down" command over OpenFlow does
affect neither the reported link state nor the traffic.
The patch below does the following:
- Triggers link state change by altering netdev's change_seq member.
- Controls sending/receiving of packets through vhost-user port
according to the port's current admin state.
- Sets admin state of newly created vhost-user port to UP.
commit 2e6a1eae96615cab458757d62c95ce9993df7202
Author: Flavio Leitner <fbl>
Date: Tue Jul 5 10:33:38 2016 -0300
dpif-netdev: Remove PMD latency on seq_mutex
The PMD thread needs to keep processing RX queues in order
to achieve maximum throughput. It also needs to sweep emc
cache and quiesce which use seq_mutex. That mutex can
eventually block the PMD thread causing latency spikes and
affecting the throughput.
Since there is no requirement for running those tasks at a
specific time, this patch extends the seq API to allow tentative
locking instead.
Those should not change the way queues are mapped, so I think -5 worked by chance?
(In reply to Flavio Leitner from comment #8)
> Those should not change the way queues are mapped, so I think -5 worked by
> chance?

Hi Flavio,

Yes, it is by chance. That's been my conclusion after so many tests with various OVS packages, including -22.

Hi,

What happens is that when the first port is added, the PMD threads are created and the queues are distributed using one algorithm.

Then when the following ports are added, ovs distributes each queue to the least loaded PMD thread, so it might happen to map to different PMD threads.

I have a patch to fix it.

However, it's not clear to me why this would cause a performance issue, because each PMD would have to poll at least one RX queue anyway, then send to the egress device queue, which is given by the PMD thread itself (pmd->tx_qid). Unless the traffic is unbalanced, then one PMD thread could be too busy while others are not so loaded.

Perhaps you can explain why it happens; otherwise could you please get some outputs while in good state and while in bad state for me to compare?

Capture perf:
    perf record -g -C <pmd_cpu1>,<pmd_cpu2>,.. sleep 60
and for each pmd_cpu:
    perf report -g --no-children -C <pmd_cpu> --stdio

and the stats:
    // clear the stats after the reproducer is stable
    ovs-appctl dpif-netdev/pmd-stats-clear
    // wait one minute
    sleep 60
    // capture the PMD stats
    ovs-appctl dpif-netdev/pmd-stats-show

This should help to identify the root cause for the perf issue.

Thanks!
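
As a concrete sketch of the capture above (the core IDs 17,19,21,23 and the report file names are only examples taken from the pmd-rxq-show output earlier in this bug; substitute the PMD cores actually in use):

    # record 60s of samples on the PMD cores
    perf record -g -C 17,19,21,23 sleep 60
    # one report per PMD core
    for c in 17 19 21 23; do
        perf report -g --no-children -C $c --stdio > perf-pmd-$c.txt
    done
    # PMD stats over the same kind of window
    ovs-appctl dpif-netdev/pmd-stats-clear
    sleep 60
    ovs-appctl dpif-netdev/pmd-stats-show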
As a workaround, you can set pmd-cpu-mask to one PMD thread and then back to the value you need. That should re-distribute all queues properly.

Brew build with the test patch applied to fix the queue ordering:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=12180883

(In reply to Flavio Leitner from comment #11)
> As a workaround, you can set pmd-cpu-mask to one PMD thread and then back
> to the value you need. That should re-distribute all queues properly.

This workaround is good. I have tried it with -14 fdP and -22 fdP.
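
A minimal sketch of that workaround, assuming the PMD cores 17,19,21,23 used earlier in this bug (the masks are illustrative; derive them from your own core list):

    # collapse to a single PMD thread (core 17 only), then restore the full mask;
    # the second step forces OVS to re-distribute all rx queues
    ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x20000
    ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xaa0000
    ovs-appctl dpif-netdev/pmd-rxq-show    # verify the new queue-to-core mapping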
(In reply to Flavio Leitner from comment #10)
> However, it's not clear to me why this would cause a performance issue,
> because each PMD would have to poll at least one RX queue anyway, then send
> to the egress device queue, which is given by the PMD thread itself
> (pmd->tx_qid). Unless the traffic is unbalanced, then one PMD thread could
> be too busy while others are not so loaded.
>
> Perhaps you can explain why it happens; otherwise could you please get some
> outputs while in good state and while in bad state for me to compare?

My thinking is that when a queue crosses more than one core, there could be contention issues --- cache, memory, datapath, ... Probably it's hard to measure using perf, I think. Anyway, the reality is that when no queue crosses cores, we get the best Mpps rate. We have proved it again and again.

(In reply to Christian Trautman from comment #5)
> If the queues cross hyper-threads such that a queue port ends up crossed as
> seen on PMDs 18 and 22 then the performance drops.
>
> pmd thread numa_id 0 core_id 46:
> port: dpdk1 queue-id: 0
> port: dpdkvhostuser0 queue-id: 0
> pmd thread numa_id 0 core_id 18:
> port: dpdk0 queue-id: 0
> port: dpdkvhostuser0 queue-id: 1
> pmd thread numa_id 0 core_id 22:
> port: dpdk0 queue-id: 1
> port: dpdkvhostuser1 queue-id: 0
> pmd thread numa_id 0 core_id 42:
> port: dpdk1 queue-id: 1
> port: dpdkvhostuser1 queue-id: 1
>
> throughput_rx_fps, Value: 4910132

The second scenario provides bad throughput because the workload is not balanced. If you look at core#22, it will be polling dpdk0, which is most probably receiving traffic from the network, and also polling vhostuser1, which is most probably receiving traffic from the guest, so that PMD is fully loaded. On the other hand, core#46 is most probably idling, because dpdk1 does not receive traffic and neither does dpdkvhostuser0 (TX only).

Comparing with the first scenario above, each PMD would be polling only one device. However, if you change the test scenario to be:

dpdk0 -> vhostuser1 --------\
                          testpmd
dpdk1 <- vhostuser0 --------/

then on the first mapping you would have core#18 and core#22 fully loaded while core#42 and core#46 idle, which is not good either.

So, this is another issue, and the only way to fix it is to load-balance the workload at run time, which is another bug and 2.7 material (or even newer).

Regarding the scenario described in the summary, it seems that when the queues with the same ID are not on the same PMD, we double the amount of entries in the EMC cache, and the additional cost would explain the ~1 Mpps drop.
I still have to try this on my testbed, but based on a code review, the PMD tx_qids are sequential starting from zero, and OVS also distributes the queues sequentially starting from zero.
Then in the ideal situation we have:
core#0 (tx_qid=0)
dpdk0q0 -> vhu0q0 -----------\
                           testpmd
dpdk1q0 <- vhu1q0 -----------/
So, the PMD EMC sees the same stream twice, saving space/reducing costs.
However, if the core polls on another queue:
core#0 (tx_qid=2)
dpdk0q2 -> vhu0q2 -----------\
dpdk1q2 <- vhu1q3 ------------\---------------------\
                            testpmd                testpmd
core#1 (tx_qid=3)              /                      /
dpdk1q3 <- vhu1q2 ------------/                      /
dpdk0q3 -> vhu0q3 -----------------------------------/
Then each core sees twice as many streams, which leads to a bigger EMC per PMD and, most probably, an additional cost.
I will try to confirm in my testbed as a next step. So, no need for perf or stats outputs anymore and they won't help for this particular scenario.
Thanks
*** Bug 1358010 has been marked as a duplicate of this bug. ***

Now OVS provides queue affinity to make sure that the queues are always initialized as configured. Does that solve the problem here?

(In reply to Flavio Leitner from comment #18)
> Now OVS provides queue affinity to make sure that the queues are always
> initialized as configured. Does that solve the problem here?

Yes, that should work.
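
For reference, queue affinity is configured per interface; a minimal sketch, assuming the port names and the PMD cores 17,19,21,23 from the earlier output (values are illustrative, not prescriptive):

    # pin each rx queue of dpdk0/dpdk1 to a fixed PMD core so queues with the
    # same id always land on the same core
    ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:17,1:19,2:21,3:23"
    ovs-vsctl set Interface dpdk1 other_config:pmd-rxq-affinity="0:17,1:19,2:21,3:23"
    ovs-appctl dpif-netdev/pmd-rxq-show    # confirm the resulting mapping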