Bug 1616001
Summary: [RFE] Add option to OVS 2.9 for legacy RXQ assignment to cores
Product: Red Hat Enterprise Linux 7
Component: openvswitch
Version: 7.5
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Andreas Karis <akaris>
Assignee: Kevin Traynor <ktraynor>
QA Contact: liting <tli>
CC: akaris, atelang, atragler, cfields, ctrautma, echaudro, fbaudin, gmuthukr, ktraynor, lmanasko, marjones, ovs-qe, psahoo, qding, rkhan, tredaelli
Target Milestone: rc
Keywords: FutureFeature
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openvswitch-2.9.0-56.el7fdp.1
Doc Type: Enhancement
Doc Text: With this update, the pmd-rxq-assign configuration option has been added for assigning Rx queues to Poll Mode Driver (PMD) cores. This allows users to select a round-robin assignment.
Clones: 1631797 (view as bug list)
Bug Blocks: 1628669
Last Closed: 2018-11-05 14:59:03 UTC
Type: Bug
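For context on the Doc Text above: on builds that include this enhancement, the assignment mode is selected through other_config in the Open_vSwitch table, and the resulting queue placement can be inspected with ovs-appctl. A minimal usage sketch (the cycle-based default and exact behaviour may differ by build):

~~~
# Select round-robin Rx queue to PMD assignment (legacy-style ordering):
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=roundrobin

# Switch back to the cycle-based assignment:
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles

# Inspect which PMD core polls which Rx queue:
ovs-appctl dpif-netdev/pmd-rxq-show
~~~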
Description
Andreas Karis
2018-08-14 17:13:42 UTC
Created attachment 1475918 [details]
SRPM with patch 1
Created attachment 1475919 [details]
SRPM with patch 2
Test results without the test build (OVS 2.9 standard behavior):

First VM:
~~~
Link State     :       <UP-10000-FD>     ----TotalRate----
Pkts/s Max/Rx  :   12608444/12582488   12608444/12582488
       Max/Tx  :     6402208/6398400     6402208/6398400
MBits/s Rx/Tx  :           8455/4299           8455/4299
~~~

Second VM:
~~~
Link State     :       <UP-10000-FD>     ----TotalRate----
Pkts/s Max/Rx  :     6402635/6401132     6402635/6401132
       Max/Tx  :   14308416/14305600   14308416/14305600
MBits/s Rx/Tx  :           4301/9613           4301/9613
~~~

Test results with the test build (OVS 2.9 with OVS 2.6.1 behavior), openvswitch-2.9.0-19.el7.1.02143196.x86_64.tar.gz:

First VM:
~~~
Flags:Port     :     P-----R--------:0
Link State     :       <UP-10000-FD>     ----TotalRate----
Pkts/s Max/Rx  :   12199750/12189416   12199750/12189416
       Max/Tx  :   14315520/12945376   14315520/12945376
MBits/s Rx/Tx  :           8191/8699           8191/8699
~~~

Second VM:
~~~
Flags:Port     :     P-----R--------:0
Link State     :       <UP-10000-FD>     ----TotalRate----
Pkts/s Max/Rx  :   14203632/12227512   14203632/12227512
       Max/Tx  :   12965600/12958912   12965600/12958912
~~~

Notice the increased RX performance.

Created an RPM build with:
~~~
rpmbuild -ba ../SPECS/openvswitch.spec --without check 2>&1 | tee /root/buildlog.txt
~~~
and updated the spec file to include the patch.

And patch:
~~~
[root@undercloud-r430 ~]# cat rpmbuild/SOURCES/0012-revert-rxq-ordering.patch
--- openvswitch-2.9.0.orig/lib/dpif-netdev.c	2018-02-19 17:49:44.867277147 -0500
+++ openvswitch-2.9.0.modif/lib/dpif-netdev.c	2018-08-02 16:44:57.916173757 -0400
@@ -3453,25 +3453,35 @@
 /* Returns the next pmd from the numa node in
  * incrementing or decrementing order. */
 static struct dp_netdev_pmd_thread *
-rr_numa_get_pmd(struct rr_numa *numa)
+rr_numa_get_pmd(struct rr_numa *numa,bool always_increment)
 {
     int numa_idx = numa->cur_index;
 
-    if (numa->idx_inc == true) {
+    if (always_increment) {
         /* Incrementing through list of pmds. */
         if (numa->cur_index == numa->n_pmds-1) {
             /* Reached the last pmd. */
-            numa->idx_inc = false;
+            numa->cur_index = 0;
         } else {
             numa->cur_index++;
         }
     } else {
-        /* Decrementing through list of pmds. */
-        if (numa->cur_index == 0) {
-            /* Reached the first pmd. */
-            numa->idx_inc = true;
+        if (numa->idx_inc == true) {
+            /* Incrementing through list of pmds. */
+            if (numa->cur_index == numa->n_pmds-1) {
+                /* Reached the last pmd. */
+                numa->idx_inc = false;
+            } else {
+                numa->cur_index++;
+            }
         } else {
-            numa->cur_index--;
+            /* Decrementing through list of pmds. */
+            if (numa->cur_index == 0) {
+                /* Reached the first pmd. */
+                numa->idx_inc = true;
+            } else {
+                numa->cur_index--;
+            }
         }
     }
     return numa->pmds[numa_idx];
@@ -3581,11 +3591,11 @@
         }
     }
 
-    if (n_rxqs > 1) {
+    //if (n_rxqs > 1) {
         /* Sort the queues in order of the processing cycles
          * they consumed during their last pmd interval. */
-        qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
-    }
+        // qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
+    //}
 
     rr_numa_list_populate(dp, &rr);
     /* Assign the sorted queues to pmds in round robin. */
@@ -3605,7 +3615,7 @@
                         netdev_rxq_get_queue_id(rxqs[i]->rx));
                 continue;
             }
-            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
+            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa,true);
             VLOG_WARN("There's no available (non-isolated) pmd thread "
                       "on numa node %d. Queue %d on port \'%s\' will "
                       "be assigned to the pmd on core %d "
@@ -3614,7 +3624,7 @@
                       netdev_rxq_get_name(rxqs[i]->rx),
                       rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
         } else {
-            rxqs[i]->pmd = rr_numa_get_pmd(numa);
+            rxqs[i]->pmd = rr_numa_get_pmd(numa,true);
             VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
                       "rx queue %d (measured processing cycles %"PRIu64").",
                       rxqs[i]->pmd->core_id, numa_id,
~~~

In my lab with the standard build:
~~~
[root@overcloud-compute-0 ~]# grep 'on numa node 0 assigned port' /var/log/openvswitch/ovs-vswitchd.log | tail -n 9
2018-08-02T22:26:44.776Z|00299|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 1 (measured processing cycles 518187).
2018-08-02T22:26:44.776Z|00300|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'dpdk0' rx queue 0 (measured processing cycles 223997).
2018-08-02T22:26:44.776Z|00301|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 0 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00302|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 2 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00303|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 3 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00304|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 4 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00305|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 5 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00306|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 6 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00307|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 7 (measured processing cycles 0).
~~~

In my lab with the test build openvswitch-2.9.0-19.el7.1.02143196.x86_64:
~~~
[root@overcloud-compute-0 ~]# grep 'on numa node 0 assigned port' /var/log/openvswitch/ovs-vswitchd.log | tail -n 9
2018-08-02T22:30:57.750Z|00091|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'dpdk0' rx queue 0 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00092|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 0 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00093|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 1 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00094|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 2 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00095|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 3 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00096|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 4 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00097|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 5 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00098|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 6 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00099|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 7 (measured processing cycles 0).
~~~

Looks like I can easily revert to the old behavior (at least more or less).
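To make the difference in the two assignment orders concrete, here is a small standalone model (illustration only, not OVS code): with three PMD cores, the OVS 2.9 forward-then-backward walk visits cores as 0,1,2,2,1,0,..., while the legacy always-increment walk visits 0,1,2,0,1,2,..., which is what the patch above restores. The three-PMD, nine-queue setup mirrors the lab above (cores 2, 6, 8).

~~~
#include <stdio.h>
#include <stdbool.h>

/* Toy model of rr_numa_get_pmd(): returns the index of the PMD to use.
 * With always_increment the walk wraps 0,1,2,0,1,2,... (legacy 2.6 style);
 * otherwise it bounces 0,1,2,2,1,0,... (OVS 2.9 style). */
static int
next_pmd(int n_pmds, int *cur, bool *inc, bool always_increment)
{
    int idx = *cur;

    if (always_increment) {
        *cur = (*cur == n_pmds - 1) ? 0 : *cur + 1;
    } else if (*inc) {
        if (*cur == n_pmds - 1) {
            *inc = false;               /* reached the last pmd, turn around */
        } else {
            (*cur)++;
        }
    } else {
        if (*cur == 0) {
            *inc = true;                /* reached the first pmd, turn around */
        } else {
            (*cur)--;
        }
    }
    return idx;
}

int
main(void)
{
    const int n_pmds = 3, n_rxqs = 9;

    for (int legacy = 0; legacy <= 1; legacy++) {
        int cur = 0;
        bool inc = true;

        printf("%s:", legacy ? "legacy round robin" : "OVS 2.9 up/down");
        for (int q = 0; q < n_rxqs; q++) {
            printf(" rxq%d->pmd%d", q, next_pmd(n_pmds, &cur, &inc, legacy));
        }
        printf("\n");
    }
    return 0;
}
~~~

Combined with the cycle-based sort being skipped, the always-increment order roughly reproduces the 2.6-style placement seen in the test-build log above.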
Now, if I can figure out a way to pass the correct struct smap into the method, I could even make this configurable, e.g.:

~~~
[root@undercloud-r430 SOURCES]# cat 0012-revert-rxq-ordering.patch
--- openvswitch-2.9.0.orig/lib/dpif-netdev.c	2018-02-19 17:49:44.867277147 -0500
+++ openvswitch-2.9.0.modif/lib/dpif-netdev.c	2018-08-02 17:30:14.868626933 -0400
@@ -3453,25 +3453,35 @@
 /* Returns the next pmd from the numa node in
  * incrementing or decrementing order. */
 static struct dp_netdev_pmd_thread *
-rr_numa_get_pmd(struct rr_numa *numa)
+rr_numa_get_pmd(struct rr_numa *numa,bool always_increment)
 {
     int numa_idx = numa->cur_index;
 
-    if (numa->idx_inc == true) {
+    if (always_increment) {
         /* Incrementing through list of pmds. */
         if (numa->cur_index == numa->n_pmds-1) {
             /* Reached the last pmd. */
-            numa->idx_inc = false;
+            numa->cur_index = 0;
         } else {
             numa->cur_index++;
         }
     } else {
-        /* Decrementing through list of pmds. */
-        if (numa->cur_index == 0) {
-            /* Reached the first pmd. */
-            numa->idx_inc = true;
+        if (numa->idx_inc == true) {
+            /* Incrementing through list of pmds. */
+            if (numa->cur_index == numa->n_pmds-1) {
+                /* Reached the last pmd. */
+                numa->idx_inc = false;
+            } else {
+                numa->cur_index++;
+            }
         } else {
-            numa->cur_index--;
+            /* Decrementing through list of pmds. */
+            if (numa->cur_index == 0) {
+                /* Reached the first pmd. */
+                numa->idx_inc = true;
+            } else {
+                numa->cur_index--;
+            }
         }
     }
     return numa->pmds[numa_idx];
@@ -3581,7 +3591,9 @@
         }
     }
 
-    if (n_rxqs > 1) {
+    bool legacy_rxq_sort = smap_get_bool(const struct smap *smap, "legacy_rxq_sort", false)
+
+    if (!(legacy_rxq_sort) && n_rxqs > 1) {
         /* Sort the queues in order of the processing cycles
          * they consumed during their last pmd interval. */
         qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
@@ -3605,7 +3617,7 @@
                         netdev_rxq_get_queue_id(rxqs[i]->rx));
                 continue;
             }
-            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
+            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa,legacy_rxq_sort);
             VLOG_WARN("There's no available (non-isolated) pmd thread "
                       "on numa node %d. Queue %d on port \'%s\' will "
                       "be assigned to the pmd on core %d "
@@ -3614,7 +3626,7 @@
                       netdev_rxq_get_name(rxqs[i]->rx),
                       rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
         } else {
-            rxqs[i]->pmd = rr_numa_get_pmd(numa);
+            rxqs[i]->pmd = rr_numa_get_pmd(numa,legacy_rxq_sort);
             VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
                       "rx queue %d (measured processing cycles %"PRIu64").",
                       rxqs[i]->pmd->core_id, numa_id,
~~~

But I don't know how to get `const struct smap *smap`.

Is this about the rebalancing added to 2.9, or only the initial assignment? What about the other_config:pmd-rxq-affinity option — can't they use this to manually mimic the 2.6 behaviour? I'll find some time next week to dig through the code changes to better understand your request. In the meantime, can you add some details on why you see the performance differences, and how you test this? I assume it has to do with how queues and traffic are distributed.

The feature cannot be backported to older LTS releases, so I'm not sure what we can do.
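On the open question above about how to reach the smap: the following is only a sketch of one possible approach, not the patch that eventually shipped (the delivered fix exposes the choice as other_config:pmd-rxq-assign instead). In OVS, other_config from the Open_vSwitch table reaches the userspace datapath through dpif_netdev_set_config(), so a boolean could be parsed there with smap_get_bool() and stored on struct dp_netdev for the Rx queue scheduling code to read later. The legacy-rxq-sort key and the legacy_rxq_sort field are hypothetical names.

~~~
/* Hypothetical sketch only -- field and key names are invented.
 * 1) Add a flag to struct dp_netdev in lib/dpif-netdev.c:
 *        bool legacy_rxq_sort;
 * 2) Parse it where other_config already arrives as an smap: */
static int
dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
{
    struct dp_netdev *dp = get_dp_netdev(dpif);

    /* ... existing parsing of pmd-cpu-mask and friends ... */

    /* Read the (hypothetical) knob; defaults to the new 2.9 behaviour. */
    dp->legacy_rxq_sort = smap_get_bool(other_config,
                                        "legacy-rxq-sort", false);
    return 0;
}

/* 3) In the scheduling function (rxq_scheduling() in this version), which
 *    already has access to dp, consult the flag instead of hard-coding it:
 *        bool legacy = dp->legacy_rxq_sort;
 *        if (!legacy && n_rxqs > 1) {
 *            qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
 *        }
 *        ...
 *        rxqs[i]->pmd = rr_numa_get_pmd(numa, legacy);
 */
~~~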
Ran the vsperf CI performance testing on openvswitch-2.9.0-70.el7fdp with and without pmd-rxq-assign=roundrobin; the case with pmd-rxq-assign=roundrobin works well. The details are as follows.

Job link (without pmd-rxq-assign=roundrobin): https://beaker.engineering.redhat.com/jobs/2874931
Result link: https://docs.google.com/spreadsheets/d/1FR6Bs6BMANmDC_eEaqDetV-A_ZWd_SV7-J2P1UaoSNQ/edit#gid=1295184293

Job link (with pmd-rxq-assign=roundrobin): https://beaker.engineering.redhat.com/jobs/2973629
Result link: https://docs.google.com/spreadsheets/d/1cxLlI5fmtwp-qfCD6ep2h07rWI7ThqTosU2MhhDWCz8/edit#gid=1319507937

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3500