Description of problem:

openvswitch2.16 has a much lower flow insertion rate with RHEL8.5 than RHEL8.4.

Avg insert rate:
             RHEL-8.5   RHEL-8.4
  pvp:       34601      55478
  vxlan:     15643      114232
  geneve:    15053      95197
  vf_lag:    16242      107803

RHEL-8.5.0-20210902.5, kernel-4.18.0-339.el8.x86_64
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.5.0/fl_change-rate_pvp.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.5.0/fl_change-rate_vxlan.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.5.0/fl_change-rate_geneve.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.5.0/fl_change-rate_vf_lag.png

RHEL-8.4.0-updates-20210803.2, kernel-4.18.0-305.12.1.el8_4.x86_64
beaker job: https://beaker.engineering.redhat.com/jobs/5815225
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.4.0/fl_change-rate_pvp.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.4.0/fl_change-rate_vxlan.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.4.0/fl_change-rate_geneve.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.4.0/fl_change-rate_vf_lag.png
My only suspicion is on commit 464b5b13e6d251c65b3158af5df16057243f1619

    Author: Eelco Chaudron <echaudro>
    Date:   Mon May 17 09:20:28 2021 -0400

        netdev-offload-tc: Verify the flower rule installed.

but it's a really light suspicion, because all this commit does is introduce parsing and comparing. I can't see how it could cause a ~90% drop like in the vxlan case. Unless it's failing to parse the flows, logging errors (in ovs-vswitchd.log), and falling back to dp:ovs. Can you please check that this is not the case? I still want to try 2.16 on my test system and see how it goes.
A theory I can't reproduce..

[root@wsfd-netdev89 ~]# ovs-appctl dpctl/dump-flows -m
ufid:fa10b525-4821-4086-ab23-79c756fb01f0, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp130s0f1_0),packet_type(ns=0/0,id=0/0),eth(src=0e:60:c5:26:b6:fe,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=0.0.0.0/0.0.0.0,tip=0.0.0.0/0.0.0.0,op=0/0,sha=00:00:00:00:00:00/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00), packets:38, bytes:2280, used:0.770s, offloaded:yes, dp:tc, actions:bond9
ufid:91622fc2-8667-4d53-bc2d-69515ad08563, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(bond9),packet_type(ns=0/0,id=0/0),eth(src=c0:03:80:fd:b6:83,dst=01:80:c2:00:00:0e),eth_type(0x88cc), packets:0, bytes:0, used:6.900s, offloaded:yes, dp:tc, actions:enp130s0f0_0

[root@wsfd-netdev89 ~]# rpm -q openvswitch2.16
openvswitch2.16-2.16.0-8.el8fdp.x86_64

It got offloaded just fine. Perhaps something in the flows you're using leads to a different result.
(In reply to qding from comment #0)
> Description of problem:
>
> openvswitch2.16 has a much lower flow insertion rate with RHEL8.5 than
> RHEL8.4
>
> Avg insert rate:
>              RHEL-8.5   RHEL-8.4
>   pvp:       34601      55478
>   vxlan:     15643      114232
>   geneve:    15053      95197
>   vf_lag:    16242      107803

I noticed that the ovs 2.16 tests are doing 600k flows while 2.15 are doing 200k. So for 2.15 the test ends in 1.6s, way before revalidator threads start to kick in. So that we compare apples to apples more closely, can you please try 2.16 with 200k flows too?
I don't use 600k flows. I used 200k flows for all the tests. PVP has bi-directional traffic, so 2*200k flows are offloaded in total. Other tests have uni-directional traffic, so 200k flows are offloaded in total.

I ran a test with flows=100k and it's better (that run has the report for PVP), but the rate is still low. I found lots of traces like the ones below in console.log, and no such traces for ovs2.15. Does it mean something?

[ 1150.723188] openvswitch: cpu_id mismatch with handler threads
[ 1185.857167] openvswitch: cpu_id mismatch with handler threads
[ 1191.188495] openvswitch: cpu_id mismatch with handler threads
[ 1191.194265] openvswitch: cpu_id mismatch with handler threads

beaker job: https://beaker.engineering.redhat.com/jobs/5912846
console log: https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912846/10831669/console.log

beaker job: https://beaker.engineering.redhat.com/jobs/5912849
console log: https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912849/10831674/console.log
(In reply to qding from comment #9)
> I don't use 600k flows. I used 200k flows for all the tests. PVP has
> bi-directional traffic, so 2*200k flows are offloaded in total. Other tests
> have uni-directional traffic, so 200k flows are offloaded in total.
>
> I ran a test with flows=100k and it's better (that run has the report for PVP),

Hmmm. Then it seems the flows are expiring during the test and getting reinstalled, leading to a higher total count of flows installed.

> but the rate is still low. I found lots of traces like the ones below in
> console.log, and no such traces for ovs2.15. Does it mean something?
>
> [ 1150.723188] openvswitch: cpu_id mismatch with handler threads
...
> console log:
> https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912849/10831674/console.log

This is new to me. Seems quite impactful:
[ 1597.803339] openvswitch: cpu_id mismatch with handler threads
[ 1671.169115] runtest.sh (2445): drop_caches: 3
[ 1673.323111] ovs_dp_get_upcall_portid: 471825 callbacks suppressed
[ 1673.329211] openvswitch: cpu_id mismatch with handler threads
(In reply to Marcelo Ricardo Leitner from comment #10)
> > but the rate is still low. I found lots of traces like the ones below in
> > console.log, and no such traces for ovs2.15. Does it mean something?
> >
> > [ 1150.723188] openvswitch: cpu_id mismatch with handler threads
> ...
> > console log:
> > https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912849/10831674/console.log
>
> This is new to me. Seems quite impactful:
> [ 1597.803339] openvswitch: cpu_id mismatch with handler threads
> [ 1671.169115] runtest.sh (2445): drop_caches: 3
> [ 1673.323111] ovs_dp_get_upcall_portid: 471825 callbacks suppressed
> [ 1673.329211] openvswitch: cpu_id mismatch with handler threads

These traces were introduced by b83d23a2a38b ("openvswitch: Introduce per-cpu upcall dispatch").

Hi Mark,

Is there a way we can tell ovs to use the previous model, just for trying? Also, we didn't configure n-handler-threads, so I'm not sure why it is complaining about a mismatch there. Maybe it didn't account for CPU pinning/isolation?
Hi Marcelo,

The only way would be to use an older kernel or update the userspace code.

Almost certainly what is happening here is that all the upcalls are getting sent to one handler thread, which is limiting your performance. The way this is supposed to work is that there should be a handler thread for every kernel thread. This message indicates that this is not the case. Any idea why that would be?

Mark
Thanks Mark. I'm not sure what could cause the mismatch between kernel threads and handler threads, though; maybe the traffic pattern is affecting it too, then. AFAIR the test is using a varying MAC address and constant IPs. Then it may be lighting up only 1 NIC queue, which now maps to 1 ovs handler thread.

qding, can you please confirm if that's still the case? And also the output of "ethtool -l <representors>".

ovs is logging:
https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/09/58152/5815224/10668264/132146404/ovs-vswitchd_pvp.log
2021-09-20T01:16:43.337Z|00001|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
2021-09-20T01:16:43.629Z|00001|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
2021-09-20T01:16:43.650Z|00002|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
2021-09-20T01:18:04.421Z|00002|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
2021-09-20T01:18:04.703Z|00003|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
2021-09-20T01:18:04.950Z|00003|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
2021-09-20T01:18:05.003Z|00004|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
2021-09-20T01:18:05.194Z|00004|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
2021-09-20T01:18:05.331Z|00005|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
2021-09-20T01:18:05.451Z|00005|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
which could be one handler for each representor.

But still, on the thread counts:
https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/09/58152/5815224/10668264/132146404/ovs-vswitchd_pvp.log
says:
2021-09-20T01:15:32.751Z|00032|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 24, setting n-revalidator-threads to 7
2021-09-20T01:15:32.751Z|00033|ofproto_dpif_upcall|INFO|Starting 31 threads

While the kernel has been configured with:
https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912849/10831674/console.log
isolcpus=1,25,3,27,5,29,7,31,9,33,11,35,13,37,15,39,17,41,19,43,21,45,23,47
...
[    0.000000] smpboot: Allowing 48 CPUs, 0 hotplug CPUs

That's
$ echo 1,25,3,27,5,29,7,31,9,33,11,35,13,37,15,39,17,41,19,43,21,45,23,47 | sed 's/,/\n/g' | wc -l
24
out of a 48 CPU system. Seems ovs is starting more threads than it should?
(In reply to Marcelo Ricardo Leitner from comment #13)
> Thanks Mark. I'm not sure what could cause the mismatch between kernel
> threads and handler threads, though; maybe the traffic pattern is affecting
> it too, then.

This could also be an issue. FYI: if there is a mismatch between the number of kernel threads (K) and the number of user threads (U), where K > U, the kernel threads will distribute upcalls to the userspace threads in the way shown in [4].

> AFAIR the test is using a varying MAC address and constant IPs. Then it may
> be lighting up only 1 NIC queue, which now maps to 1 ovs handler thread.

Yes, and this is now expected behavior. Previously, OVS would (kind of) randomly distribute these upcalls to handler threads, which was causing a thundering herd issue and packet ordering issues. Now, like in regular Linux networking, you can assign flows to queues, and OVS will maintain that flow->queue mapping throughout the upcall path. Therefore, if your flows are not distributed well across NIC queues, it could cause the issue you have noted.

> qding, can you please confirm if that's still the case? And also the output
> of "ethtool -l <representors>".
>
> ovs is logging:
> https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/09/58152/5815224/10668264/132146404/ovs-vswitchd_pvp.log
> 2021-09-20T01:16:43.337Z|00001|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
> 2021-09-20T01:16:43.629Z|00001|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
> 2021-09-20T01:16:43.650Z|00002|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
> 2021-09-20T01:18:04.421Z|00002|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
> 2021-09-20T01:18:04.703Z|00003|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
> 2021-09-20T01:18:04.950Z|00003|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
> 2021-09-20T01:18:05.003Z|00004|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
> 2021-09-20T01:18:05.194Z|00004|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
> 2021-09-20T01:18:05.331Z|00005|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
> 2021-09-20T01:18:05.451Z|00005|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
> which could be one handler for each representor.
>
> But still, on the thread counts:
> https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/09/58152/5815224/10668264/132146404/ovs-vswitchd_pvp.log
> says:
> 2021-09-20T01:15:32.751Z|00032|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 24, setting n-revalidator-threads to 7
> 2021-09-20T01:15:32.751Z|00033|ofproto_dpif_upcall|INFO|Starting 31 threads
>
> While the kernel has been configured with:
> https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912849/10831674/console.log
> isolcpus=1,25,3,27,5,29,7,31,9,33,11,35,13,37,15,39,17,41,19,43,21,45,23,47
> ...
> [    0.000000] smpboot: Allowing 48 CPUs, 0 hotplug CPUs
>
> That's
> $ echo 1,25,3,27,5,29,7,31,9,33,11,35,13,37,15,39,17,41,19,43,21,45,23,47 | sed 's/,/\n/g' | wc -l
> 24
> out of a 48 CPU system. Seems ovs is starting more threads than it should?

When OVS starts, it will ask the dpif_provider if it needs a certain number of handler threads [1]. The dpif provider can reply that it does, and it can return that number of threads.
In the 'dpif_netlink' case, it returns the number of CPU cores seen by userspace [2]. Using this information, OVS will spawn an appropriate number of handler threads and a correspondingly appropriate number of revalidator threads (following a heuristic that has been there for a long time; I don't know its origin). In your case, it detects 24 non-isolated cores and therefore creates 24 handler threads and (24 / 4 + 1) revalidator threads, which equates to 24 + 7 = 31 threads. Looking at the output you show above, that looks to be the case.

We don't allow the user to set the number of handler threads anymore, as we don't believe it adds any value now. For example, if you set the number of threads X higher than the number of cores, then once every core is busy running a thread, at least X threads will be idle.

[1] https://github.com/openvswitch/ovs/blob/a621ac5eafe38809116d65397618d1ce8559be53/ofproto/ofproto-dpif-upcall.c#L643
[2] https://github.com/openvswitch/ovs/blob/3486d81d17dadea72f0a580fdf00c92298d0f349/lib/dpif-netlink.c#L2728
[3] https://github.com/openvswitch/ovs/blob/a621ac5eafe38809116d65397618d1ce8559be53/ofproto/ofproto-dpif-upcall.c#L646
[4] https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/net/openvswitch/datapath.c#L1648
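To make the dispatch model above concrete, here is a minimal C sketch of the per-CPU upcall lookup, loosely modeled on the kernel logic referenced in [4]. It is an illustration under assumptions, not the actual kernel or OVS source; the names (get_upcall_portid, handler_portids) and the fake portid values are made up for the example.

#include <stdint.h>
#include <stdio.h>

/* Simplified sketch of per-cpu upcall dispatch: userspace registers one
 * Netlink socket (portid) per handler thread, indexed by CPU id.  When an
 * upcall is raised on cpu_id, the kernel looks up the socket for that CPU. */
static uint32_t get_upcall_portid(const uint32_t *handler_portids,
                                  uint32_t n_handlers, uint32_t cpu_id)
{
    if (n_handlers == 0) {
        return 0;                        /* no handler sockets registered */
    }
    if (cpu_id < n_handlers) {
        return handler_portids[cpu_id];  /* expected 1:1 CPU -> handler map */
    }
    /* Fewer handler sockets than CPU ids (e.g. with isolcpus): fall back to
     * a modulo pick instead of dropping, and log the mismatch (rate-limited
     * in the real kernel). */
    fprintf(stderr, "openvswitch: cpu_id mismatch with handler threads\n");
    return handler_portids[cpu_id % n_handlers];
}

int main(void)
{
    /* 24 handler sockets (one per non-isolated core in this report), while
     * upcalls can still be raised on any of the 48 CPU ids. */
    uint32_t portids[24];
    for (uint32_t i = 0; i < 24; i++) {
        portids[i] = 1000 + i;           /* fake Netlink portids */
    }
    printf("cpu 2  -> portid %u\n", get_upcall_portid(portids, 24, 2));
    printf("cpu 46 -> portid %u\n", get_upcall_portid(portids, 24, 46));
    return 0;
}

With 24 handler sockets but 48 possible CPU ids, any upcall raised on a CPU id >= 24 takes the fallback path, which matches the rate-limited "cpu_id mismatch" messages seen in the console logs above.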
The patch introduced in comment #11 has side effects that are quite visible. It changes the system load balance (comment #14), AFAIK beyond what is configurable, or at least it would require sysadmin intervention to adjust it.

OVS HWOL will not configure any special thread handling. It will rely on the NIC IRQs and vswitchd threads just as non-HWOL setups will. In comment #18 we tried to match the number of IRQs to the number of kernel threads identified in comment #14, to no good result.

This change was backported to rhel8 via https://bugzilla.redhat.com/show_bug.cgi?id=1992773 , yet AFAIK the userspace counterpart is not there, so no harm for now. But considering the load balance change above, this change is quite risky and I'm not sure how much exposure/testing it got.

I need help from the OVS team here to double check this change and, if possible, suggest the proper tunings that now need to be done if they decide to proceed with it. Thanks!
Assigning to Michael as he is already working on this issue.
Also added Aaron on CC as he knows more details, but from what I remember, these changes were required to avoid out-of-order upcall packet handling.
I suppose there are more questions here about the actual test setup and the NIC driver/HW than about the per-cpu dispatch. It looks like, regardless of the number of queues configured on the representors, packets are still received from a single queue. Could you confirm that interrupts from a VF representor are actually received on a single CPU core (1 core per representor)?

If that's the case, then the OVS behavior seems correct to me, because if we used the old way of upcall distribution, a lot of these packets would be processed out of order, causing significant TCP performance issues and the thundering herd problem by waking up too many handler threads.

For the test/driver/HW I have a few questions:

1. What is the traffic flow in this setup? Is it coming from the outside, going to a VF (inside a VM?), then going back?
2. Assuming the traffic enters the setup from the PF, how many queues are configured on the PF?
3. When the packet is sent to the actual VF (not a representor) from the VM side, how many threads are doing that? If XPS in the kernel is working, a single thread may only utilize a single Tx queue.
4. When the packet goes between VF and VF-rep, does the NIC perform RSS hashing and packet distribution across the available Rx queues on the destination port, or does it directly map the Tx queue of the source port to the Rx queue of the destination? This is important if XPS in the kernel is working.
There is definitely a bug in the way the userspace and kernelspace are programmed.

In the isolcpus case, we have the following CPUs active:
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46

That means we build an array 24 elements long and populate it with handler sockets:
[sock0,sock1,sock2,....]

Here's the problem: we reference everything by processor ID, so we will use 0,2,4... BUT that means we will be skipping elements. Example:

ksoftirqd cpu = 0, indexes to [sock0]
ksoftirqd cpu = 2 (which should index to sock1), indexes to [sock2]
...
ksoftirqd cpu = 24 (which should index to sock12), indexes to [sock0] and warns

This means we effectively overload all the even-numbered slots and completely skip the odd-numbered ones.

To fix this, userspace needs to provide a full map of all CPUs and populate it appropriately:

step 1: create an array that is sizeof(all cpus) rather than sizeof(active cpus)
step 2: fill in all the CPUs that are active, appropriately
step 3: use some mechanism to fill in the inactive CPUs (in case they go active)
step 4: send this to the kernel

I guess this will explain the performance drop, and the messages. WDYT?
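A minimal sketch of the indexing problem described above, assuming the isolcpus layout from comment #14. Both the "buggy" and "fixed" mappings are illustrative assumptions for this example, not the actual OVS code or the eventual patch.

#include <stdio.h>

#define NR_ALL_CPUS     48   /* CPUs the kernel knows about */
#define NR_ACTIVE_CPUS  24   /* non-isolated CPUs: 0,2,4,...,46 */

int main(void)
{
    /* Buggy layout: one slot per *active* CPU, filled in active-CPU order,
     * but the kernel indexes it by raw processor id. */
    int small_map[NR_ACTIVE_CPUS];
    for (int i = 0; i < NR_ACTIVE_CPUS; i++) {
        small_map[i] = i;                /* slot i holds "sock i" */
    }

    /* An upcall on CPU 2 should use sock1 (the 2nd active CPU), but the
     * raw-id index picks sock2; CPU 24 is out of range entirely, wraps to
     * sock0, and triggers the kernel warning. */
    int cpu = 24;
    int idx = (cpu < NR_ACTIVE_CPUS) ? cpu : cpu % NR_ACTIVE_CPUS;
    printf("buggy: cpu %d -> sock%d\n", cpu, small_map[idx]);

    /* Proposed fix: one slot per *possible* CPU id.  Active CPUs get their
     * own handler socket; inactive slots are filled by reusing the active
     * handlers so no CPU id is ever out of range. */
    int full_map[NR_ALL_CPUS];
    int next = 0;
    for (int c = 0; c < NR_ALL_CPUS; c++) {
        /* even CPU ids are the active ones in this isolcpus layout */
        full_map[c] = (c % 2 == 0) ? c / 2 : (next++ % NR_ACTIVE_CPUS);
    }
    printf("fixed: cpu %d -> sock%d\n", cpu, full_map[cpu]);
    return 0;
}

Running this with the layout above prints "buggy: cpu 24 -> sock0" versus "fixed: cpu 24 -> sock12", matching the example in the comment.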
(In reply to Aaron Conole from comment #23)
> I guess this will explain the performance drop, and the messages. WDYT?

Good point. I think that explains the kernel message 'openvswitch: cpu_id mismatch with handler threads', and the distribution definitely needs to be fixed. However, I'm not sure it explains the performance drop, because the results in comment #18 say that performance is low even when CPUs are not isolated.
Raising priority as this will become important with OCP 4.11 with OVS 2.16.
Nice! Then we're almost good here now.

Summary (since the comments above are private): the test was using a fixed udp src port for the tunnel header, which caused the NIC to deliver all packets through the same queue. Previously, OVS had an internal, unwanted, RPS-like packet distribution system, which got removed, and then the fixed udp src port became very visible in tests. By fixing the src port handling so it better mimics real world usage, the NIC can spread the traffic over multiple queues, and the numbers are much better.

The only remaining low number is with VF LAG. Apparently the NIC is not distributing it well in this case. Can you please confirm it through IRQ counts/CPU usage?
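For context on why varying the tunnel udp src port helps, here is a small sketch of the common technique: derive the outer UDP source port from a hash of the inner flow, so NIC RSS can spread tunnel traffic across queues while keeping per-flow ordering. The struct layout, hash function, and port range below are illustrative assumptions, not the exact behavior of the kernel, the NIC, or the test tool.

#include <stdint.h>
#include <stdio.h>

/* Illustrative inner-flow 5-tuple; field layout is an assumption. */
struct inner_flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Toy FNV-1a-style hash over the inner flow fields; real stacks use their
 * own flow-dissector hash, this just stands in for "hash of inner headers". */
static uint32_t flow_hash(const struct inner_flow *f)
{
    uint32_t fields[3] = {
        f->src_ip ^ f->dst_ip,
        ((uint32_t)f->src_port << 16) | f->dst_port,
        f->proto
    };
    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (int i = 0; i < 3; i++) {
        for (int b = 0; b < 4; b++) {
            h = (h ^ ((fields[i] >> (8 * b)) & 0xffu)) * 16777619u;
        }
    }
    return h;
}

/* Map the hash into [min, max): stable per inner flow (keeps ordering), but
 * different flows land on different outer source ports and RSS queues. */
static uint16_t tunnel_src_port(const struct inner_flow *f,
                                uint16_t min, uint16_t max)
{
    return (uint16_t)(min + flow_hash(f) % (uint32_t)(max - min));
}

int main(void)
{
    struct inner_flow a = { 0x0a000001u, 0x0a000002u, 40000, 80, 6 };
    struct inner_flow b = { 0x0a000001u, 0x0a000002u, 40001, 80, 6 };

    printf("flow a -> outer udp src port %u\n",
           tunnel_src_port(&a, 49152, 65535));
    printf("flow b -> outer udp src port %u\n",
           tunnel_src_port(&b, 49152, 65535));
    return 0;
}

With a fixed source port instead, every tunneled packet hashes to the same queue, which is why a single handler thread ended up seeing all the upcalls in the earlier runs.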
Hi all,

Thank you @i.maximets @mleitner @qding for solving this issue!

To summarize once again so we can close this BZ:
* "openvswitch: cpu_id mismatch with handler threads" error message has been resolved [Bug 2102449]
* Fixing the test scripts to use multiqueue significantly improved the test results [Comments 44, 46]

@qding, can we close this BZ if there are no further actions necessary for this?