Description of problem:

openvswitch2.16 has a much lower flow insertion rate with RHEL8.5 than RHEL8.4.

Avg insert rate:
             RHEL-8.5   RHEL-8.4
  pvp:       34601      55478
  vxlan:     15643      114232
  geneve:    15053      95197
  vf_lag:    16242      107803

RHEL-8.5.0-20210902.5, kernel-4.18.0-339.el8.x86_64
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.5.0/fl_change-rate_pvp.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.5.0/fl_change-rate_vxlan.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.5.0/fl_change-rate_geneve.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.5.0/fl_change-rate_vf_lag.png

RHEL-8.4.0-updates-20210803.2, kernel-4.18.0-305.12.1.el8_4.x86_64
beaker job: https://beaker.engineering.redhat.com/jobs/5815225
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.4.0/fl_change-rate_pvp.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.4.0/fl_change-rate_vxlan.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.4.0/fl_change-rate_geneve.png
http://netqe-bj.usersys.redhat.com/share/qding/flow_insertion_rate/RHEL-8.4.0/fl_change-rate_vf_lag.png
My only suspicion is on commit 464b5b13e6d251c65b3158af5df16057243f1619

    Author: Eelco Chaudron <echaudro>
    Date:   Mon May 17 09:20:28 2021 -0400

        netdev-offload-tc: Verify the flower rule installed.

but it's a really light suspicion, because all this commit does is introduce parsing and comparing. I can't see how it could cause a ~90% drop like in the vxlan case. Unless it's failing to parse the flows, logging errors (in ovs-vswitchd.log), and falling back to dp:ovs. Can you please check that this is not the case? I still want to try 2.16 on my test system and see how it goes.
A theory I can't reproduce..

[root@wsfd-netdev89 ~]# ovs-appctl dpctl/dump-flows -m
ufid:fa10b525-4821-4086-ab23-79c756fb01f0, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(enp130s0f1_0),packet_type(ns=0/0,id=0/0),eth(src=0e:60:c5:26:b6:fe,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=0.0.0.0/0.0.0.0,tip=0.0.0.0/0.0.0.0,op=0/0,sha=00:00:00:00:00:00/00:00:00:00:00:00,tha=00:00:00:00:00:00/00:00:00:00:00:00), packets:38, bytes:2280, used:0.770s, offloaded:yes, dp:tc, actions:bond9
ufid:91622fc2-8667-4d53-bc2d-69515ad08563, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(bond9),packet_type(ns=0/0,id=0/0),eth(src=c0:03:80:fd:b6:83,dst=01:80:c2:00:00:0e),eth_type(0x88cc), packets:0, bytes:0, used:6.900s, offloaded:yes, dp:tc, actions:enp130s0f0_0

[root@wsfd-netdev89 ~]# rpm -q openvswitch2.16
openvswitch2.16-2.16.0-8.el8fdp.x86_64

It got offloaded just fine. Perhaps something in the flows you're using leads to a different result.
(In reply to qding from comment #0)
> Description of problem:
>
> openvswitch2.16 has a much lower flow insertion rate with RHEL8.5 than
> RHEL8.4
>
> Avg insert rate:
>              RHEL-8.5   RHEL-8.4
>   pvp:       34601      55478
>   vxlan:     15643      114232
>   geneve:    15053      95197
>   vf_lag:    16242      107803

I noticed that the ovs 2.16 tests are doing 600k flows while 2.15 are doing 200k. So for 2.15 the test ends in 1.6s, way before revalidator threads start to kick in. So that we compare apples to apples more closely, can you please try 2.16 with 200k flows too?
I don't use 600k flows. I used 200k flows for all the tests. PVP has bi-directional traffic, so 2*200k flows are offloaded in total. Other tests have uni-directional traffic, so 200k flows are offloaded in total.

I ran a test with flows=100k and it's better (that run has the report for PVP), but the rate is still low. I found lots of traces like the ones below in console.log, and no such traces for ovs2.15. Does it mean something?

[ 1150.723188] openvswitch: cpu_id mismatch with handler threads
[ 1185.857167] openvswitch: cpu_id mismatch with handler threads
[ 1191.188495] openvswitch: cpu_id mismatch with handler threads
[ 1191.194265] openvswitch: cpu_id mismatch with handler threads

beaker job: https://beaker.engineering.redhat.com/jobs/5912846
console log: https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912846/10831669/console.log

beaker job: https://beaker.engineering.redhat.com/jobs/5912849
console log: https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912849/10831674/console.log
(In reply to qding from comment #9)
> I don't use 600k flows. I used 200k flows for all the tests. PVP has
> bi-directional traffic, so 2*200k flows are offloaded in total. Other tests
> have uni-directional traffic, so 200k flows are offloaded in total.
>
> I ran a test with flows=100k and it's better (that run has the report for PVP),

Hmmm. Then it seems the flows are expiring during the test and getting reinstalled, leading to a higher total count of flows installed.

> but the rate is still low. I found lots of traces like the ones below in
> console.log, and no such traces for ovs2.15. Does it mean something?
>
> [ 1150.723188] openvswitch: cpu_id mismatch with handler threads
...
> console log:
> https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912849/10831674/console.log

This is new to me. Seems quite impactful:
[ 1597.803339] openvswitch: cpu_id mismatch with handler threads
[ 1671.169115] runtest.sh (2445): drop_caches: 3
[ 1673.323111] ovs_dp_get_upcall_portid: 471825 callbacks suppressed
[ 1673.329211] openvswitch: cpu_id mismatch with handler threads
(In reply to Marcelo Ricardo Leitner from comment #10)
> > but the rate is still low. I found lots of traces like the ones below in
> > console.log, and no such traces for ovs2.15. Does it mean something?
> >
> > [ 1150.723188] openvswitch: cpu_id mismatch with handler threads
> ...
> > console log:
> > https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912849/10831674/console.log
>
> This is new to me. Seems quite impactful:
> [ 1597.803339] openvswitch: cpu_id mismatch with handler threads
> [ 1671.169115] runtest.sh (2445): drop_caches: 3
> [ 1673.323111] ovs_dp_get_upcall_portid: 471825 callbacks suppressed
> [ 1673.329211] openvswitch: cpu_id mismatch with handler threads

These traces were introduced by b83d23a2a38b ("openvswitch: Introduce per-cpu upcall dispatch").

Hi Mark,

Is there a way we can tell ovs to use the previous model, just for trying? Also, we didn't configure n-handler-threads, so I'm not sure why it is complaining about a mismatch there. Maybe it didn't account for CPU pinning/isolation?
Hi Marcelo,

The only way would be to use an older kernel or update the userspace code.

Almost certainly what is happening here is that all the upcalls are getting sent to one handler thread, which is limiting your performance. The way this is supposed to work is that there should be a handler thread for every kernel thread. This message indicates that this is not the case. Any idea why that would be?

Mark
Thanks Mark. I'm not sure what could cause the mismatch between kernel threads and handler threads, though; maybe the traffic pattern is affecting it too, then. AFAIR the test is using a varying MAC address and constant IPs. Then it may be lighting up only 1 NIC queue, which now maps to 1 ovs handler thread.

qding, can you please confirm if that's still the case? And also the output of "ethtool -l <representors>".

ovs is logging:
https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/09/58152/5815224/10668264/132146404/ovs-vswitchd_pvp.log
2021-09-20T01:16:43.337Z|00001|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
2021-09-20T01:16:43.629Z|00001|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
2021-09-20T01:16:43.650Z|00002|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
2021-09-20T01:18:04.421Z|00002|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
2021-09-20T01:18:04.703Z|00003|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
2021-09-20T01:18:04.950Z|00003|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
2021-09-20T01:18:05.003Z|00004|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
2021-09-20T01:18:05.194Z|00004|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
2021-09-20T01:18:05.331Z|00005|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
2021-09-20T01:18:05.451Z|00005|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
which could be one handler for each representor.

But still, on the thread counts:
https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/09/58152/5815224/10668264/132146404/ovs-vswitchd_pvp.log
says:
2021-09-20T01:15:32.751Z|00032|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 24, setting n-revalidator-threads to 7
2021-09-20T01:15:32.751Z|00033|ofproto_dpif_upcall|INFO|Starting 31 threads

While the kernel has been configured with:
https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912849/10831674/console.log
isolcpus=1,25,3,27,5,29,7,31,9,33,11,35,13,37,15,39,17,41,19,43,21,45,23,47
...
[    0.000000] smpboot: Allowing 48 CPUs, 0 hotplug CPUs

That's
$ echo 1,25,3,27,5,29,7,31,9,33,11,35,13,37,15,39,17,41,19,43,21,45,23,47 | sed 's/,/\n/g' | wc -l
24
out of a 48 CPU system. Seems ovs is starting more threads than it should?
(In reply to Marcelo Ricardo Leitner from comment #13)
> Thanks Mark. I'm not sure what could cause the mismatch between kernel
> threads and handler threads, though; maybe the traffic pattern is affecting
> it too, then.

This could also be an issue. FYI: if there is a mismatch between the number of kernel threads (K) and the number of user threads (U), where K > U, the kernel threads will distribute upcalls to the userspace threads in the way shown in [4].

> AFAIR the test is using a varying MAC address and constant IPs. Then it may
> be lighting up only 1 NIC queue, which now maps to 1 ovs handler thread.

Yes, and this is now expected behavior. Previously, OVS would (kind of) randomly distribute these upcalls to handler threads, which was causing a thundering herd issue and packet ordering issues. Now, like in regular Linux networking, you can assign flows to queues, and OVS will maintain that flow->queue mapping throughout the upcall path. Therefore, if your flows are not distributed well across NIC queues, it could cause the issue you have noted.

> qding, can you please confirm if that's still the case? And also the output
> of "ethtool -l <representors>".
>
> ovs is logging:
> https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/09/58152/5815224/10668264/132146404/ovs-vswitchd_pvp.log
> 2021-09-20T01:16:43.337Z|00001|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
> 2021-09-20T01:16:43.629Z|00001|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
> 2021-09-20T01:16:43.650Z|00002|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
> 2021-09-20T01:18:04.421Z|00002|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
> 2021-09-20T01:18:04.703Z|00003|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
> 2021-09-20T01:18:04.950Z|00003|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
> 2021-09-20T01:18:05.003Z|00004|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
> 2021-09-20T01:18:05.194Z|00004|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
> 2021-09-20T01:18:05.331Z|00005|dpif_netlink(handler14)|WARN|system@ovs-system: lost packet on handler 12
> 2021-09-20T01:18:05.451Z|00005|dpif_netlink(handler8)|WARN|system@ovs-system: lost packet on handler 6
> which could be one handler for each representor.
>
> But still, on the thread counts:
> https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/09/58152/5815224/10668264/132146404/ovs-vswitchd_pvp.log
> says:
> 2021-09-20T01:15:32.751Z|00032|ofproto_dpif_upcall|INFO|Overriding n-handler-threads to 24, setting n-revalidator-threads to 7
> 2021-09-20T01:15:32.751Z|00033|ofproto_dpif_upcall|INFO|Starting 31 threads
>
> While the kernel has been configured with:
> https://beaker-archive.host.prod.eng.bos.redhat.com/beaker-logs/2021/10/59128/5912849/10831674/console.log
> isolcpus=1,25,3,27,5,29,7,31,9,33,11,35,13,37,15,39,17,41,19,43,21,45,23,47
> ...
> [    0.000000] smpboot: Allowing 48 CPUs, 0 hotplug CPUs
>
> That's
> $ echo 1,25,3,27,5,29,7,31,9,33,11,35,13,37,15,39,17,41,19,43,21,45,23,47 | sed 's/,/\n/g' | wc -l
> 24
> out of a 48 CPU system. Seems ovs is starting more threads than it should?

When OVS starts, it will ask the dpif_provider if it needs a certain number of handler threads [1]. The dpif provider can reply that it does, and it can return that number of threads.
In the 'dpif_netlink' case, it returns the number of CPU cores seen by userspace [2]. Using this information, OVS will spawn an appropriate number of handler threads and a correspondingly appropriate number of revalidator threads (following a heuristic that has been there for a long time; I don't know its origin). In your case, it detects 24 non-isolated cores and therefore creates 24 handler threads and (24 / 4 + 1) revalidator threads, which equates to 24 + 7 = 31 threads. Looking at the output you show above, that looks to be the case.

We don't allow the user to set the number of handler threads anymore, as we don't believe it adds any value now. For example, if you set the number of threads X higher than the number of cores, then once every core is busy running a thread, at least X threads will be idle.

[1] https://github.com/openvswitch/ovs/blob/a621ac5eafe38809116d65397618d1ce8559be53/ofproto/ofproto-dpif-upcall.c#L643
[2] https://github.com/openvswitch/ovs/blob/3486d81d17dadea72f0a580fdf00c92298d0f349/lib/dpif-netlink.c#L2728
[3] https://github.com/openvswitch/ovs/blob/a621ac5eafe38809116d65397618d1ce8559be53/ofproto/ofproto-dpif-upcall.c#L646
[4] https://github.com/torvalds/linux/blob/2f111a6fd5b5297b4e92f53798ca086f7c7d33a4/net/openvswitch/datapath.c#L1648
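To make the dispatch model above concrete, here is a minimal C sketch of the per-CPU upcall lookup, loosely modeled on the kernel logic referenced in [4]. It is an illustration under assumptions, not the actual kernel or OVS source; the names (get_upcall_portid, handler_portids) and the fake portid values are made up for the example.

#include <stdint.h>
#include <stdio.h>

/* Simplified sketch of per-cpu upcall dispatch: userspace registers one
 * Netlink socket (portid) per handler thread, indexed by CPU id.  When an
 * upcall is raised on cpu_id, the kernel looks up the socket for that CPU. */
static uint32_t get_upcall_portid(const uint32_t *handler_portids,
                                  uint32_t n_handlers, uint32_t cpu_id)
{
    if (n_handlers == 0) {
        return 0;                        /* no handler sockets registered */
    }
    if (cpu_id < n_handlers) {
        return handler_portids[cpu_id];  /* expected 1:1 CPU -> handler map */
    }
    /* Fewer handler sockets than CPU ids (e.g. with isolcpus): fall back to
     * a modulo pick instead of dropping, and log the mismatch (rate-limited
     * in the real kernel). */
    fprintf(stderr, "openvswitch: cpu_id mismatch with handler threads\n");
    return handler_portids[cpu_id % n_handlers];
}

int main(void)
{
    /* 24 handler sockets (one per non-isolated core in this report), while
     * upcalls can still be raised on any of the 48 CPU ids. */
    uint32_t portids[24];
    for (uint32_t i = 0; i < 24; i++) {
        portids[i] = 1000 + i;           /* fake Netlink portids */
    }
    printf("cpu 2  -> portid %u\n", get_upcall_portid(portids, 24, 2));
    printf("cpu 46 -> portid %u\n", get_upcall_portid(portids, 24, 46));
    return 0;
}

With 24 handler sockets but 48 possible CPU ids, any upcall raised on a CPU id >= 24 takes the fallback path, which matches the rate-limited "cpu_id mismatch" messages seen in the console logs above.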
The patch introduced in comment #11 has side effects that are quite visible. It changes the system load balance (comment #14), AFAIK beyond what is configurable, or at least it would require sysadmin intervention to adjust it.

OVS HWOL will not configure any special thread handling. It will rely on the NIC IRQs and vswitchd threads just as non-HWOL setups will. In comment #18 we tried to match the number of IRQs to the number of kernel threads identified in comment #14, to no good result.

This change was backported to rhel8 via https://bugzilla.redhat.com/show_bug.cgi?id=1992773 , yet AFAIK the userspace counterpart is not there, so no harm for now. But considering the load balance change above, this change is quite risky and I'm not sure how much exposure/testing it got.

I need help from the OVS team here to double check this change and, if possible, suggest the proper tunings that now need to be done if they decide to proceed with it. Thanks!
Assigning to Michael as he is already working on this issue.
Also added Aaron on CC as he knows more details, but from what I remember, these changes were required to avoid out-of-order upcall packet handling.
I suppose there are more questions here about the actual test setup and the NIC driver/HW than about the per-cpu dispatch. It looks like, regardless of the number of queues configured on the representors, packets are still received from a single queue. Could you confirm that interrupts from a VF representor are actually received on a single CPU core (1 core per representor)?

If that's the case, then the OVS behavior seems correct to me, because if we used the old way of upcall distribution, a lot of these packets would be processed out of order, causing significant TCP performance issues and the thundering herd problem by waking up too many handler threads.

For the test/driver/HW I have a few questions:

1. What is the traffic flow in this setup? Is it coming from the outside, going to a VF (inside a VM?), then going back?
2. Assuming the traffic enters the setup from the PF, how many queues are configured on the PF?
3. When the packet is sent to the actual VF (not a representor) from the VM side, how many threads are doing that? If XPS in the kernel is working, a single thread may only utilize a single Tx queue.
4. When the packet goes between VF and VF-rep, does the NIC perform RSS hashing and packet distribution across the available Rx queues on the destination port, or does it directly map the Tx queue of the source port to the Rx queue of the destination? This is important if XPS in the kernel is working.
There is definitely a bug in the way the userspace and kernelspace are programmed.

In the isolcpus case, we have the following CPUs active:
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46

That means we build an array 24 elements long and populate it with handler sockets:
[sock0,sock1,sock2,....]

Here's the problem: we reference everything by processor ID, so we will use 0,2,4... BUT that means we will be skipping elements. Example:

ksoftirqd cpu = 0, indexes to [sock0]
ksoftirqd cpu = 2 (which should index to sock1), indexes to [sock2]
...
ksoftirqd cpu = 24 (which should index to sock12), indexes to [sock0] and warns

This means we effectively overload all the even-numbered slots and completely skip the odd-numbered ones.

To fix this, userspace needs to provide a full map of all CPUs and populate it appropriately:

step 1: create an array that is sizeof(all cpus) rather than sizeof(active cpus)
step 2: fill in all the CPUs that are active, appropriately
step 3: use some mechanism to fill in the inactive CPUs (in case they go active)
step 4: send this to the kernel

I guess this will explain the performance drop, and the messages. WDYT?
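A minimal sketch of the indexing problem described above, assuming the isolcpus layout from comment #14. Both the "buggy" and "fixed" mappings are illustrative assumptions for this example, not the actual OVS code or the eventual patch.

#include <stdio.h>

#define NR_ALL_CPUS     48   /* CPUs the kernel knows about */
#define NR_ACTIVE_CPUS  24   /* non-isolated CPUs: 0,2,4,...,46 */

int main(void)
{
    /* Buggy layout: one slot per *active* CPU, filled in active-CPU order,
     * but the kernel indexes it by raw processor id. */
    int small_map[NR_ACTIVE_CPUS];
    for (int i = 0; i < NR_ACTIVE_CPUS; i++) {
        small_map[i] = i;                /* slot i holds "sock i" */
    }

    /* An upcall on CPU 2 should use sock1 (the 2nd active CPU), but the
     * raw-id index picks sock2; CPU 24 is out of range entirely, wraps to
     * sock0, and triggers the kernel warning. */
    int cpu = 24;
    int idx = (cpu < NR_ACTIVE_CPUS) ? cpu : cpu % NR_ACTIVE_CPUS;
    printf("buggy: cpu %d -> sock%d\n", cpu, small_map[idx]);

    /* Proposed fix: one slot per *possible* CPU id.  Active CPUs get their
     * own handler socket; inactive slots are filled by reusing the active
     * handlers so no CPU id is ever out of range. */
    int full_map[NR_ALL_CPUS];
    int next = 0;
    for (int c = 0; c < NR_ALL_CPUS; c++) {
        /* even CPU ids are the active ones in this isolcpus layout */
        full_map[c] = (c % 2 == 0) ? c / 2 : (next++ % NR_ACTIVE_CPUS);
    }
    printf("fixed: cpu %d -> sock%d\n", cpu, full_map[cpu]);
    return 0;
}

Running this with the layout above prints "buggy: cpu 24 -> sock0" versus "fixed: cpu 24 -> sock12", matching the example in the comment.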
(In reply to Aaron Conole from comment #23)
> I guess this will explain the performance drop, and the messages. WDYT?

Good point. I think that explains the kernel message 'openvswitch: cpu_id mismatch with handler threads', and the distribution definitely needs to be fixed. However, I'm not sure it explains the performance drop, because the results in comment #18 say that performance is low even when CPUs are not isolated.
Raising priority as this will become important with OCP 4.11 with OVS 2.16.
Nice! Then we're almost good here now.

Summary (since the comments above are private): the test was using a fixed udp src port for the tunnel header, which caused the NIC to deliver all packets through the same queue. Previously, OVS had an internal, unwanted, RPS-like packet distribution system, which got removed, and then the fixed udp src port became very visible in tests. By fixing the src port handling so it better mimics real world usage, the NIC can spread the traffic over multiple queues, and the numbers are much better.

The only remaining low number is with VF LAG. Apparently the NIC is not distributing it well in this case. Can you please confirm it through IRQ counts/CPU usage?
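For context on why varying the tunnel udp src port helps, here is a small sketch of the common technique: derive the outer UDP source port from a hash of the inner flow, so NIC RSS can spread tunnel traffic across queues while keeping per-flow ordering. The struct layout, hash function, and port range below are illustrative assumptions, not the exact behavior of the kernel, the NIC, or the test tool.

#include <stdint.h>
#include <stdio.h>

/* Illustrative inner-flow 5-tuple; field layout is an assumption. */
struct inner_flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Toy FNV-1a-style hash over the inner flow fields; real stacks use their
 * own flow-dissector hash, this just stands in for "hash of inner headers". */
static uint32_t flow_hash(const struct inner_flow *f)
{
    uint32_t fields[3] = {
        f->src_ip ^ f->dst_ip,
        ((uint32_t)f->src_port << 16) | f->dst_port,
        f->proto
    };
    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (int i = 0; i < 3; i++) {
        for (int b = 0; b < 4; b++) {
            h = (h ^ ((fields[i] >> (8 * b)) & 0xffu)) * 16777619u;
        }
    }
    return h;
}

/* Map the hash into [min, max): stable per inner flow (keeps ordering), but
 * different flows land on different outer source ports and RSS queues. */
static uint16_t tunnel_src_port(const struct inner_flow *f,
                                uint16_t min, uint16_t max)
{
    return (uint16_t)(min + flow_hash(f) % (uint32_t)(max - min));
}

int main(void)
{
    struct inner_flow a = { 0x0a000001u, 0x0a000002u, 40000, 80, 6 };
    struct inner_flow b = { 0x0a000001u, 0x0a000002u, 40001, 80, 6 };

    printf("flow a -> outer udp src port %u\n",
           tunnel_src_port(&a, 49152, 65535));
    printf("flow b -> outer udp src port %u\n",
           tunnel_src_port(&b, 49152, 65535));
    return 0;
}

With a fixed source port instead, every tunneled packet hashes to the same queue, which is why a single handler thread ended up seeing all the upcalls in the earlier runs.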
Hi all,

Thank you @i.maximets @mleitner @qding for solving this issue!

To summarize once again so we can close this BZ:
* "openvswitch: cpu_id mismatch with handler threads" error message has been resolved [Bug 2102449]
* Fixing the test scripts to use multiqueue significantly improved the test results [Comments 44, 46]

@qding, can we close this BZ if there are no further actions necessary for this?