Bug 1344787

Summary: [FdBeta] latency regression
Product: Red Hat Enterprise Linux 7
Reporter: Karl Rister <krister>
Component: openvswitch-dpdk
Assignee: Flavio Leitner <fleitner>
Status: CLOSED CURRENTRELEASE
QA Contact: ovs-qe
Severity: high
Docs Contact:
Priority: high
Version: 7.4
CC: aloughla, atheurer, atragler, fleitner, jean-mickael.guerin, krister, kzhang, vincent.jardin
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: openvswitch-2.5.0-21.git20160727.el7fdb
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1397481 (view as bug list)
Environment:
Last Closed: 2017-01-12 15:51:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1397481    
Attachments:
netdev-dpdk: Use instant sending instead of queueing

Description Karl Rister 2016-06-10 19:44:56 UTC
Description of problem:

An upstream patch:

https://github.com/openvswitch/ovs/commit/347ba9bb9b7a8e053ced54016f903e749ebecb7f

Breaks an assumption made in a different part of the code:

https://github.com/openvswitch/ovs/blob/master/lib/netdev-dpdk.c#L713

In lib/netdev-dpdk.c:netdev_dpdk_alloc_txq an assumption is made that each queue is mapped to a PMD thread running on the core that shares its index value.  The above commit makes this assumption no longer valid, which can result in flushing being disabled on the queue.

In my case, with my device on the machine's second socket and the PMD threads assigned to cores on that socket (14,16,18,20), the new code assigns queue ids (0,1,2,3) whose index values would indicate that they are located on the first socket, which results in flushing being disabled.  When my device was on the first socket this problem was not apparent, since the PMD threads were assigned to cores whose indexes were less than 11 and therefore on the same socket as the 0-indexed queue values -- if that is what was meant.

My system is a 24 core, 12 core/socket system with hyperthreading disabled so cores 0-11 are on socket 0 and cores 12-23 are on socket 1.
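To make the failure mode concrete, here is a minimal, self-contained sketch of the flush_tx decision (simplified for illustration; the numa_id_of_core() stub and the hard-coded layout are assumptions that mirror this machine, while the real code compares netdev->socket_id against ovs_numa_get_numa_id(i) inside netdev_dpdk_alloc_txq):

/* Standalone illustration only -- not OVS code. */
#include <stdbool.h>
#include <stdio.h>

/* Stub mimicking ovs_numa_get_numa_id() for this machine:
 * cores 0-11 on socket 0, cores 12-23 on socket 1, no HT. */
static int
numa_id_of_core(int core_id)
{
    return core_id < 12 ? 0 : 1;
}

int
main(void)
{
    const int dev_socket_id = 1;  /* NIC sits on the second socket. */
    const int n_txqs = 4;         /* PMDs actually run on cores 14,16,18,20. */

    for (int qid = 0; qid < n_txqs; qid++) {
        /* The old code treats the queue index itself as a core id. */
        bool flush_tx = dev_socket_id == numa_id_of_core(qid);
        printf("txq %d: flush_tx=%s\n", qid, flush_tx ? "true" : "false");
    }
    return 0;
}

All four queues come out with flush_tx=false, i.e. flushing disabled, even though the PMD threads servicing them run on socket-1 cores; that mismatch is what produces the latency numbers below.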

The disablement of flushing due to this problem increases the average packet latency from ~5.7usec to ~54usec and maximum latency from ~26usec to ~110usec when running a unidirectional test at 2Mpps.

Version-Release number of selected component (if applicable):

Any commit newer than the one linked above as of yesterday.

How reproducible:

Requires a very short test of a few minutes and happens every time.

Steps to Reproduce:
1. Configure the KVM host for realtime, reserving 4 cpus for openvswitch PMD threads and 2 cpus for KVM rt-vcpus
2. Set up the same KVM host with openvswitch-dpdk and two bridges, each with 1 10Gb dpdk port and 1 vhostuser dpdk port.  Create 4 dpdk PMD threads and set them to fifo:95
3. Create a VM with 2 extra virtio-net interfaces, each assigned to 1 of the ovs bridges
4. On an external system, run a packet generator which can measure packet latency and record minimum, average, and maximum latency over a 12 hour period.


Actual results:
Average latency of ~57usec.


Expected results:
Average latency of <8usec.


Additional info:
The 10Gb dpdk ports must be located on a PCI adapter that is connected to a socket other than socket 0 to observe this problem.
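As an aside, one quick way to confirm which socket a given NIC hangs off of is to read the device's numa_node attribute in sysfs (/sys/bus/pci/devices/<pci-address>/numa_node).  A minimal sketch, assuming you know the PCI address of the dpdk port (a value of 1 or higher means a non-zero socket, which is what is needed to hit this bug; -1 means the platform did not report a node):

#include <stdio.h>

int
main(int argc, char **argv)
{
    char path[256];
    FILE *f;
    int node;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pci-address, e.g. 0000:81:00.0>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/numa_node", argv[1]);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    if (fscanf(f, "%d", &node) != 1) {
        fprintf(stderr, "could not parse %s\n", path);
        fclose(f);
        return 1;
    }
    fclose(f);
    printf("%s is on NUMA node %d\n", argv[1], node);
    return 0;
}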

Comment 2 Karl Rister 2016-07-14 15:04:28 UTC
It looks like the problematic code that I previously identified has been removed in the following commit:

commit b59cc14e032da370021794bfd1bdd3d67e88a9a3
Author: Ilya Maximets <i.maximets>
Date:   Mon Jun 27 16:28:16 2016 +0300

    netdev-dpdk: Use instant sending instead of queueing of packets.
    
    Current implementarion of TX packet's queueing is broken in several ways:
    
        * TX queue flushing implemented on receive assumes that all
          core_id-s are sequential and starts from zero. This may lead
          to situation when packets will stuck in queue forever and,
          also, this influences on latency.
    
        * For a long time flushing logic depends on uninitialized
          'txq_needs_locking', because it usually calculated after
          'netdev_dpdk_alloc_txq' but used inside of this function
          for initialization of 'flush_tx'.
    
    Testing shows no performance difference with and without queueing.
    Lets remove queueing at all because it doesn't work properly now and
    also does not increase performance.
    
    Signed-off-by: Ilya Maximets <i.maximets>
    Acked-by: Daniele Di Proietto <diproiettod>

I don't see the particular latency issue that I had previously identified on upstream, but something is still wrong, as bidirectional latency is currently quite uneven:

Current upstream:

Device 0 -> 1
6.567602 Mpps
17.8454 usec average latency

Device 1 -> 0
6.567602 Mpps
203.0460 usec average latency

openvswitch-dpdk-2.4.0-0.10346.git97bab959.1.el7:

Device 0 -> 1
6.388642 Mpps
19.2421 usec average latency

Device 1 -> 0
6.388642 Mpps
18.8085 usec average latency

Comment 5 Karl Rister 2016-09-06 20:03:35 UTC
I am seeing the same problem in openvswitch-2.5.0-10.git20160727.el7fdb.x86_64.rpm.

We are running tests with 0.002% and 0% loss requirements.  With that RPM I can only pass at 0.5506 Mpps in each direction for the 0.002% requirement, and I can't pass a single run for 0%, even with a very short test (60 second search, 120 second validation).

If I apply my hacked fix I instantly get much better behavior: the 0.002% loss case reaches 6.7937 Mpps each way and the 0% loss case reaches 4.9410 Mpps in each direction.

This is my hacked fix:

--- lib/netdev-dpdk.c.orig      2016-07-19 11:59:58.644824282 -0400
+++ lib/netdev-dpdk.c   2016-07-19 12:01:27.031887289 -0400
@@ -571,16 +571,7 @@
     for (i = 0; i < n_txqs; i++) {
         int numa_id = ovs_numa_get_numa_id(i);
 
-        if (!netdev->txq_needs_locking) {
-            /* Each index is considered as a cpu core id, since there should
-             * be one tx queue for each cpu core.  If the corresponding core
-             * is not on the same numa node as 'netdev', flags the
-             * 'flush_tx'. */
-            netdev->tx_q[i].flush_tx = netdev->socket_id == numa_id;
-        } else {
-            /* Queues are shared among CPUs. Always flush */
-            netdev->tx_q[i].flush_tx = true;
-        }
+        netdev->tx_q[i].flush_tx = true;
 
         /* Initialize map for vhost devices. */
         netdev->tx_q[i].map = OVS_VHOST_QUEUE_MAP_UNKNOWN;

Comment 6 Thadeu Lima de Souza Cascardo 2016-09-19 16:27:08 UTC
Hi, Karl.

Would you be able to provide me a system where I can reproduce and experiment with this?

Thanks.
Cascardo.

Comment 7 Karl Rister 2016-09-19 17:51:25 UTC
Hi Cascardo

I can give you some time on my test environment, but I only have one such environment so it can't be open-ended.  I'll drop you an email that you can reply to when you are ready to use it.

Karl

Comment 8 Thadeu Lima de Souza Cascardo 2016-09-20 19:57:05 UTC
Hi, Karl.

Does the latest version on branch 2.6 solve the issue for you? I noticed the following commit.

Hope that helps.
Cascardo.

commit b59cc14e032da370021794bfd1bdd3d67e88a9a3
Author: Ilya Maximets <i.maximets>
Date:   Mon Jun 27 16:28:16 2016 +0300

    netdev-dpdk: Use instant sending instead of queueing of packets.
    
    Current implementarion of TX packet's queueing is broken in several ways:
    
        * TX queue flushing implemented on receive assumes that all
          core_id-s are sequential and starts from zero. This may lead
          to situation when packets will stuck in queue forever and,
          also, this influences on latency.
    
        * For a long time flushing logic depends on uninitialized
          'txq_needs_locking', because it usually calculated after
          'netdev_dpdk_alloc_txq' but used inside of this function
          for initialization of 'flush_tx'.
    
    Testing shows no performance difference with and without queueing.
    Lets remove queueing at all because it doesn't work properly now and
    also does not increase performance.
    
    Signed-off-by: Ilya Maximets <i.maximets>
    Acked-by: Daniele Di Proietto <diproiettod>

Comment 9 Karl Rister 2016-09-20 20:02:51 UTC
Hi Cascardo,

I identified that commit back in comment 2 as being the upstream change that fixed this particular problem.  My recollection is that the change was quite a bit more pervasive than my simple hack.

I haven't been able to test upstream lately since I'm backlogged on RH build testing and other runs.

Karl

Comment 10 Thadeu Lima de Souza Cascardo 2016-09-21 12:39:56 UTC
Sorry I missed that. For some reason, I thought that commit referred to something else. But you also mentioned uneven bidirectional latency. Was that with branch 2.6 with that commit applied?

If so, we have two problems here:

1) The latency problem on NUMA systems on branch 2.5.
2) Uneven bidirectional latency on branch 2.6.

Is that correct?

Cascardo.

Comment 11 Karl Rister 2016-09-21 12:55:52 UTC
Yes, although I have only seen the uneven bidirectional latency that one time.  Before looking into that I think we would need to reproduce it on current 2.6 branch/master (and there are a lot of higher priority tasks piling up at the moment).  The NUMA latency issue has been replicated on multiple source trees/branches including the RH built fast data path RPMs.  I think that this is the most pressing issue that I have seen at the moment.

Karl

Comment 12 Thadeu Lima de Souza Cascardo 2016-09-21 17:43:31 UTC
Created attachment 1203436 [details]
netdev-dpdk: Use instant sending instead of queueing

Hi, Karl.

This is a backport of that commit to 2.5. Can you try it? If it passes your tests, we can try to push this backport upstream, as it is closer to what is in master. Otherwise, we can try to argue that your patch is worth applying.

Thanks.
Cascardo.

Comment 13 Karl Rister 2016-10-12 17:39:47 UTC
Hi Cascardo

Sorry for the delay.  I had some long running tests that needed to complete and then had some issues with the test bed that had to be resolved before I could test your patch.

Your patch looks good in my testing; it is actually a little better than mine, with lower latency.  There is a slight degradation in the achieved throughput, but that can be a noisy statistic for this workload (a binary search for the highest throughput with <= 0.002% packet loss over a 5 minute run, followed by a 15 minute validation).
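For reference, a hypothetical sketch of the search logic described above (this illustrates the method only, it is not the actual test harness; the loss curve in run_trial() is made up):

#include <stdio.h>

/* Placeholder stand-in for driving the traffic generator at
 * 'rate_mpps' for 'seconds' and returning the measured loss ratio.
 * Here it just pretends the device starts dropping above 6.5 Mpps. */
static double
run_trial(double rate_mpps, int seconds)
{
    (void) seconds;
    return rate_mpps > 6.5 ? 0.1 : 0.0;
}

/* Binary-search the offered rate for the highest throughput whose
 * loss stays at or below 'max_loss', then validate with a longer run. */
static double
find_max_rate(double line_rate_mpps, double max_loss, double precision_mpps)
{
    double lo = 0.0, hi = line_rate_mpps, best = 0.0;

    while (hi - lo > precision_mpps) {
        double rate = (lo + hi) / 2.0;

        if (run_trial(rate, 300) <= max_loss) {  /* 5 minute trial */
            best = rate;
            lo = rate;                           /* passed: try higher */
        } else {
            hi = rate;                           /* failed: try lower */
        }
    }
    return run_trial(best, 900) <= max_loss ? best : 0.0;  /* 15 minute validation */
}

int
main(void)
{
    /* 14.88 Mpps is 10GbE line rate at 64-byte frames; 0.002% loss. */
    printf("max passing rate: %.4f Mpps\n", find_max_rate(14.88, 0.00002, 0.01));
    return 0;
}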

#####################################################
Baseline (openvswitch-2.5.0-10.git20160727.el7fdb):

Device 0->1:
  Throughput: 6.8683 Mpps
  Min. Latency: 39.9780 usec
  Avg. Latency: 61.1226 usec
  Max. Latency: 89.1110 usec

Device 1->0:
  Throughput: 6.8683 Mpps
  Min. Latency: 41.0660 usec
  Avg. Latency: 58.7778 usec
  Max. Latency: 89.4720 usec

#####################################################
My hack fix:

Device 0->1:
  Throughput: 6.5106 Mpps
  Min. Latency: 11.0850 usec 
  Avg. Latency: 15.1282 usec
  Max. Latency: 31.4210 usec

Device 1->0:
  Throughput: 6.5106 Mpps
  Min. Latency: 13.2320 usec
  Avg. Latency: 22.9893 usec
  Max. Latency: 37.9040 usec

#####################################################
Your fix:

Device 0->1:
  Throughput: 6.3941 Mpps
  Min. Latency: 10.5410 usec
  Avg. Latency: 14.1309 usec
  Max. Latency: 28.9880 usec

Device 1->0:
  Throughput: 6.3941 Mpps
  Min. Latency: 11.9780 usec
  Avg. Latency: 18.0692 usec
  Max. Latency: 29.5200 usec

Comment 14 Thadeu Lima de Souza Cascardo 2016-10-26 17:22:55 UTC
Thanks, Karl, for your testing.

Sent backport upstream, waiting for feedback or to be applied to branch 2.5.

http://openvswitch.org/pipermail/dev/2016-October/080954.html

Cascardo.

Comment 15 Thadeu Lima de Souza Cascardo 2016-11-29 20:22:21 UTC
Flavio, just a reminder that Daniele should be pinged about applying this upstream.

Cascardo.

Comment 16 Flavio Leitner 2016-12-29 15:16:55 UTC
Patch is committed upstream:
https://github.com/openvswitch/ovs/commit/a1b2fc05e6a260c6580fa7a40d02cd7dbf2989ea