Bug 1616001 - [RFE] Add option to OVS 2.9 for legacy RXQ assignment to cores
Summary: [RFE] Add option to OVS 2.9 for legacy RXQ assignment to cores
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: openvswitch
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Kevin Traynor
QA Contact: liting
URL:
Whiteboard:
Depends On:
Blocks: 1628669
 
Reported: 2018-08-14 17:13 UTC by Andreas Karis
Modified: 2021-12-10 16:59 UTC
CC List: 16 users

Fixed In Version: openvswitch-2.9.0-56.el7fdp.1
Doc Type: Enhancement
Doc Text:
With this update, the pmd-rxq-assign configuration option has been added to control how RX queues are assigned to Poll Mode Driver (PMD) cores. This allows users to select a round-robin assignment.
Clone Of:
Clones: 1631797
Environment:
Last Closed: 2018-11-05 14:59:03 UTC
Target Upstream Version:
Embargoed:


Attachments
SRPM with patch 1 (15.00 MB, application/x-rpm)
2018-08-14 17:17 UTC, Andreas Karis
SRPM with patch 2 (8.77 MB, application/octet-stream)
2018-08-14 17:18 UTC, Andreas Karis

Description Andreas Karis 2018-08-14 17:13:42 UTC
Description of problem:
This is an RFE to add an option to OVS 2.9 for legacy RXQ assignment to cores. A customer noticed in their performance tests that the RXQ assignment behavior of OVS 2.6.1 provides significantly better performance than that of OVS 2.9. I'm attaching an SRPM with a patch that reverts to the legacy behavior. However, I unfortunately don't know how to read other_config from the OVS configuration, so that part of the work is still missing.

Comment 3 Andreas Karis 2018-08-14 17:17:26 UTC
Created attachment 1475918 [details]
SRPM with patch 1

Comment 4 Andreas Karis 2018-08-14 17:18:28 UTC
Created attachment 1475919 [details]
SRPM with patch 2

Comment 5 Andreas Karis 2018-08-14 17:22:54 UTC
Test results without the test build (OVS 2.9 standard behavior):

First VM:
Link State        :       <UP-10000-FD>     ----TotalRate----
Pkts/s Max/Rx     :   12608444/12582488     12608444/12582488
       Max/Tx     :     6402208/6398400       6402208/6398400
MBits/s Rx/Tx     :           8455/4299             8455/4299

Second VM:
Link State        :       <UP-10000-FD>     ----TotalRate----
Pkts/s Max/Rx     :     6402635/6401132       6402635/6401132
       Max/Tx     :   14308416/14305600     14308416/14305600
MBits/s Rx/Tx     :           4301/9613             4301/9613

Test results with the test build (OVS 2.9 with OVS 2.6.1 behavior) openvswitch-2.9.0-19.el7.1.02143196.x86_64.tar.gz:

First VM:
  Flags:Port      :   P-----R--------:0
Link State        :       <UP-10000-FD>     ----TotalRate----
Pkts/s Max/Rx     :   12199750/12189416     12199750/12189416
       Max/Tx     :   14315520/12945376     14315520/12945376
MBits/s Rx/Tx     :           8191/8699             8191/8699

Second VM:
  Flags:Port      :   P-----R--------:0
Link State        :       <UP-10000-FD>     ----TotalRate----
Pkts/s Max/Rx     :   14203632/12227512     14203632/12227512
       Max/Tx     :   12965600/12958912     12965600/12958912

Notice the increased RX performance; the second VM's Rx rate, for example, nearly doubles.

Comment 6 Andreas Karis 2018-08-14 17:53:14 UTC
Created an RPM build with:

~~~
rpmbuild -ba ../SPECS/openvswitch.spec --without check 2>&1 | tee /root/buildlog.txt
~~~

and updated the spec file to include the patch.
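
A hypothetical sketch of such a spec change (the actual patch number, and whether the spec applies patches via %autosetup, depend on the SRPM):

~~~
# Hypothetical: declare the patch alongside the existing PatchN entries ...
Patch12: 0012-revert-rxq-ordering.patch

# ... and apply it in %prep if the spec does not use %autosetup:
%patch12 -p1
~~~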

And the patch:
~~~
[root@undercloud-r430 ~]# cat rpmbuild/SOURCES/0012-revert-rxq-ordering.patch
--- openvswitch-2.9.0.orig/lib/dpif-netdev.c 2018-02-19 17:49:44.867277147 -0500
+++ openvswitch-2.9.0.modif/lib/dpif-netdev.c 2018-08-02 16:44:57.916173757 -0400
@@ -3453,25 +3453,35 @@
 /* Returns the next pmd from the numa node in
  * incrementing or decrementing order. */
 static struct dp_netdev_pmd_thread *
-rr_numa_get_pmd(struct rr_numa *numa)
+rr_numa_get_pmd(struct rr_numa *numa,bool always_increment)
 {
     int numa_idx = numa->cur_index;

-    if (numa->idx_inc == true) {
+    if (always_increment) {
         /* Incrementing through list of pmds. */
         if (numa->cur_index == numa->n_pmds-1) {
             /* Reached the last pmd. */
-            numa->idx_inc = false;
+            numa->cur_index = 0;
         } else {
             numa->cur_index++;
         }
     } else {
-        /* Decrementing through list of pmds. */
-        if (numa->cur_index == 0) {
-            /* Reached the first pmd. */
-            numa->idx_inc = true;
+        if (numa->idx_inc == true) {
+            /* Incrementing through list of pmds. */
+            if (numa->cur_index == numa->n_pmds-1) {
+                /* Reached the last pmd. */
+                numa->idx_inc = false;
+            } else {
+                numa->cur_index++;
+            }
         } else {
-            numa->cur_index--;
+            /* Decrementing through list of pmds. */
+            if (numa->cur_index == 0) {
+                /* Reached the first pmd. */
+                numa->idx_inc = true;
+            } else {
+                numa->cur_index--;
+            }
         }
     }
     return numa->pmds[numa_idx];
@@ -3581,11 +3591,11 @@
         }
     }

-    if (n_rxqs > 1) {
+    //if (n_rxqs > 1) {
         /* Sort the queues in order of the processing cycles
          * they consumed during their last pmd interval. */
-        qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
-    }
+        // qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
+    //}

     rr_numa_list_populate(dp, &rr);
     /* Assign the sorted queues to pmds in round robin. */
@@ -3605,7 +3615,7 @@
                          netdev_rxq_get_queue_id(rxqs[i]->rx));
                 continue;
             }
-            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
+            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa,true);
             VLOG_WARN("There's no available (non-isolated) pmd thread "
                       "on numa node %d. Queue %d on port \'%s\' will "
                       "be assigned to the pmd on core %d "
@@ -3614,7 +3624,7 @@
                       netdev_rxq_get_name(rxqs[i]->rx),
                       rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
         } else {
-        rxqs[i]->pmd = rr_numa_get_pmd(numa);
+        rxqs[i]->pmd = rr_numa_get_pmd(numa,true);
         VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
                   "rx queue %d (measured processing cycles %"PRIu64").",
                   rxqs[i]->pmd->core_id, numa_id,
~~~

In my lab with standard build:
~~~
[root@overcloud-compute-0 ~]# grep 'on numa node 0 assigned port' /var/log/openvswitch/ovs-vswitchd.log | tail -n 9
2018-08-02T22:26:44.776Z|00299|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 1 (measured processing cycles 518187).
2018-08-02T22:26:44.776Z|00300|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'dpdk0' rx queue 0 (measured processing cycles 223997).
2018-08-02T22:26:44.776Z|00301|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 0 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00302|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 2 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00303|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 3 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00304|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 4 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00305|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 5 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00306|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 6 (measured processing cycles 0).
2018-08-02T22:26:44.776Z|00307|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 7 (measured processing cycles 0).
~~~

In my lab with test build openvswitch-2.9.0-19.el7.1.02143196.x86_64:
~~~
[root@overcloud-compute-0 ~]# grep 'on numa node 0 assigned port' /var/log/openvswitch/ovs-vswitchd.log | tail -n 9
2018-08-02T22:30:57.750Z|00091|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'dpdk0' rx queue 0 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00092|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 0 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00093|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 1 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00094|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 2 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00095|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 3 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00096|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 4 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00097|dpif_netdev|INFO|Core 6 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 5 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00098|dpif_netdev|INFO|Core 8 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 6 (measured processing cycles 0).
2018-08-02T22:30:57.750Z|00099|dpif_netdev|INFO|Core 2 on numa node 0 assigned port 'vhu6e4076aa-c2' rx queue 7 (measured processing cycles 0).
~~~

It looks like I can revert to the old behavior fairly easily (at least approximately).

Now, if I can figure out a way to pass the correct struct smap into
the method, I could even make this configurable, e.g.:
~~~
[root@undercloud-r430 SOURCES]# cat 0012-revert-rxq-ordering.patch
--- openvswitch-2.9.0.orig/lib/dpif-netdev.c    2018-02-19 17:49:44.867277147 -0500
+++ openvswitch-2.9.0.modif/lib/dpif-netdev.c    2018-08-02 17:30:14.868626933 -0400
@@ -3453,25 +3453,35 @@
 /* Returns the next pmd from the numa node in
  * incrementing or decrementing order. */
 static struct dp_netdev_pmd_thread *
-rr_numa_get_pmd(struct rr_numa *numa)
+rr_numa_get_pmd(struct rr_numa *numa,bool always_increment)
 {
     int numa_idx = numa->cur_index;

-    if (numa->idx_inc == true) {
+    if (always_increment) {
         /* Incrementing through list of pmds. */
         if (numa->cur_index == numa->n_pmds-1) {
             /* Reached the last pmd. */
-            numa->idx_inc = false;
+            numa->cur_index = 0;
         } else {
             numa->cur_index++;
         }
     } else {
-        /* Decrementing through list of pmds. */
-        if (numa->cur_index == 0) {
-            /* Reached the first pmd. */
-            numa->idx_inc = true;
+        if (numa->idx_inc == true) {
+            /* Incrementing through list of pmds. */
+            if (numa->cur_index == numa->n_pmds-1) {
+                /* Reached the last pmd. */
+                numa->idx_inc = false;
+            } else {
+                numa->cur_index++;
+            }
         } else {
-            numa->cur_index--;
+            /* Decrementing through list of pmds. */
+            if (numa->cur_index == 0) {
+                /* Reached the first pmd. */
+                numa->idx_inc = true;
+            } else {
+                numa->cur_index--;
+            }
         }
     }
     return numa->pmds[numa_idx];
@@ -3581,7 +3591,9 @@
         }
     }

-    if (n_rxqs > 1) {
+    bool legacy_rxq_sort = smap_get_bool(const struct smap *smap, "legacy_rxq_sort", false)
+
+    if (!(legacy_rxq_sort) && n_rxqs > 1) {
         /* Sort the queues in order of the processing cycles
          * they consumed during their last pmd interval. */
         qsort(rxqs, n_rxqs, sizeof *rxqs, compare_rxq_cycles);
@@ -3605,7 +3617,7 @@
                          netdev_rxq_get_queue_id(rxqs[i]->rx));
                 continue;
             }
-            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
+            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa,legacy_rxq_sort);
             VLOG_WARN("There's no available (non-isolated) pmd thread "
                       "on numa node %d. Queue %d on port \'%s\' will "
                       "be assigned to the pmd on core %d "
@@ -3614,7 +3626,7 @@
                       netdev_rxq_get_name(rxqs[i]->rx),
                       rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
         } else {
-        rxqs[i]->pmd = rr_numa_get_pmd(numa);
+        rxqs[i]->pmd = rr_numa_get_pmd(numa,legacy_rxq_sort);
         VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
                   "rx queue %d (measured processing cycles %"PRIu64").",
                   rxqs[i]->pmd->core_id, numa_id,
~~~


But I don't know how to get `const struct smap *smap`.
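
One way the smap could be plumbed through, sketched from how dpif-netdev already consumes other_config keys such as pmd-cpu-mask. This is a hypothetical sketch, not the shipped interface: the pmd_rxq_assign_rr field and the "legacy_rxq_sort" key are illustrative, and the existing body of dpif_netdev_set_config() (elided here) would have to be preserved:

~~~
/* Sketch only: store the flag on struct dp_netdev when the database
 * configuration changes, then consult it at rxq-assignment time. */

/* 1. Add a field to struct dp_netdev in lib/dpif-netdev.c: */
    bool pmd_rxq_assign_rr;    /* Plain round-robin rxq assignment. */

/* 2. dpif_netdev_set_config() already receives the Open_vSwitch
 *    other_config column as a struct smap, so the key can be read
 *    there (existing key handling elided): */
static int
dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
{
    struct dp_netdev *dp = get_dp_netdev(dpif);
    bool rr = smap_get_bool(other_config, "legacy_rxq_sort", false);

    if (dp->pmd_rxq_assign_rr != rr) {
        dp->pmd_rxq_assign_rr = rr;
        /* Trigger a datapath reconfiguration so the rxq scheduling
         * re-runs with the new setting. */
        dp_netdev_request_reconfigure(dp);
    }
    return 0;
}

/* 3. rxq_scheduling() can then test dp->pmd_rxq_assign_rr instead of
 *    needing a struct smap passed down into it. */
~~~

(The interface that eventually shipped for this RFE is spelled other_config:pmd-rxq-assign, see the Doc Text above, rather than a "legacy_rxq_sort" boolean.)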

Comment 7 Eelco Chaudron 2018-08-15 09:50:35 UTC
Is this about the rebalancing added in 2.9, or only the initial assignment?

What about the other_config:pmd-rxq-affinity option; can't they use that to manually mimic the 2.6 behaviour?

I'll find some time next week to dig through the code changes to better understand your request.

In the meantime, can you add some details on why you see the performance differences, and how you test this? I assume it has to do with how queues and traffic are distributed.
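
For context on the pmd-rxq-affinity suggestion: that option pins specific rx queues of an Interface to specific PMD cores, which could approximate a fixed assignment manually. A usage sketch; the port name and core IDs here are illustrative:

~~~
# Pin dpdk0 rx queue 0 to the PMD on core 3 and rx queue 1 to core 7.
# Cores with pinned queues are treated as isolated and are not
# assigned any other (non-pinned) rx queues.
ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:3,1:7"
~~~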

Comment 26 Rashid Khan 2018-09-04 19:19:37 UTC
The feature cannot be backported to older LTS releases, so I am not sure what we can do.

Comment 48 liting 2018-10-22 02:53:27 UTC
I ran the vsperf CI performance testing on openvswitch-2.9.0-70.el7fdp, both with and without pmd-rxq-assign=roundrobin. The case with pmd-rxq-assign=roundrobin works well. The details are as follows.

job link (without pmd-rxq-assign=roundrobin configured):
https://beaker.engineering.redhat.com/jobs/2874931
result link:
https://docs.google.com/spreadsheets/d/1FR6Bs6BMANmDC_eEaqDetV-A_ZWd_SV7-J2P1UaoSNQ/edit#gid=1295184293
job link (with pmd-rxq-assign=roundrobin configured):
https://beaker.engineering.redhat.com/jobs/2973629
result link:
https://docs.google.com/spreadsheets/d/1cxLlI5fmtwp-qfCD6ep2h07rWI7ThqTosU2MhhDWCz8/edit#gid=1319507937
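
For reference, the setting exercised in these runs is enabled globally via other_config; a minimal sketch, assuming a running ovs-vswitchd:

~~~
# Select plain round-robin rxq-to-PMD assignment instead of the
# default cycle-based ("cycles") assignment.
ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=roundrobin
~~~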

Comment 50 errata-xmlrpc 2018-11-05 14:59:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3500

