Bug 2097864

Summary: openvswitch is consuming more CPU post upgrade to 4.10.14
Product: Red Hat Enterprise Linux Fast Datapath
Component: ovsdb
Version: FDP 22.L
Status: ASSIGNED
Severity: high
Priority: medium
Reporter: siva kanakala <skanakal>
Assignee: Adrián Moreno <amorenoz>
QA Contact: Hekai Wang <hewang>
CC: amorenoz, anbhat, ctrautma, eglottma, fleitner, fromani, i.turturica, jhsiao, jpradhan, miabbott, mmarkand, mmichels, mnguyen, msivak, pvalerio, ralongi, rravaiol, travier
Keywords: Reopened
Flags: amorenoz: needinfo-, mmarkand: needinfo? (amorenoz)
Hardware: x86_64
OS: Linux
Last Closed: 2022-09-27 17:27:31 UTC
Type: Bug
Attachments: inventory

Comment 3 Flavio Leitner 2022-06-23 20:33:02 UTC
Hi,

The issue started with these messages below:
Jun 10 20:45:56 osp4-worker-1 ovs-vswitchd[3735]: ovs|00010|ofproto_dpif_upcall(handler94)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 10 20:45:56 osp4-worker-1 ovs-vswitchd[3735]: ovs|00009|ofproto_dpif_upcall(handler96)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 10 20:45:56 osp4-worker-1 ovs-vswitchd[3735]: ovs|00001|ofproto_dpif_upcall(handler108)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 10 20:45:56 osp4-worker-1 ovs-vswitchd[3735]: ovs|00010|ofproto_dpif_upcall(handler95)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 10 20:45:56 osp4-worker-1 ovs-vswitchd[3735]: ovs|00042|ofproto_dpif_upcall(handler93)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.

These messages indicate that the OVS datapath reached the maximum number of installed flows,
which means additional flows have to be handled in user space, causing higher CPU usage
there, added latency, and so on.
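
For reference, that ceiling should correspond to the datapath flow limit (other_config:flow-limit, which defaults to 200000). A rough sketch of how to inspect the effective limit and, if really needed, raise the configured ceiling:

# current/avg/max flows and the dynamic limit computed by ovs-vswitchd
ovs-appctl upcall/show
# example only: raise the configured ceiling above the 200000 default
ovs-vsctl set Open_vSwitch . other_config:flow-limit=400000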

The coverage counters confirm the situation:
upcall_flow_limit_hit    27249.0/sec 24552.400/sec     3554.3997/sec   total: 22104478


system@ovs-system:
  flows         : (current 89359) (avg 82644) (max 207997) (limit 93500)
  dump duration : 366ms
  ufid enabled : true
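
For reference, those numbers come from the standard OVS tooling; on the worker node something like the following should reproduce them, assuming ovs-appctl can reach the running ovs-vswitchd:

# how often the datapath flow limit was hit (the counter shown above)
ovs-appctl coverage/show | grep upcall_flow_limit_hit
# datapath flow statistics, including the dynamic limit
ovs-appctl upcall/show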


I have some questions/requests:
1) Are you sure the test is still the same? Perhaps it is generating more networking flows?
2) Do you see similar messages in the previous version?
3) Can you provide the sosreport after running the test in the previous version?
   It would be helpful to compare before and after.
4) Are you limiting CPU power, tuning the OS somehow, reserving CPU cores, or doing anything
   else that could reduce processing power?
5) Are you tuning the workload in a different way?
6) Is this test running on the same hardware as before?

Thanks,
fbl

Comment 5 i.turturica@f5.com 2022-06-23 21:53:58 UTC
Hi Flavio,

Thanks for looking into this. I'll try to answer the questions with what I know so far.

1) I'm not the same person who ran this CPS test on 4.7.x, but I trust the numbers we got last time.
2) Not sure, we were able to run the CPS test in the previous version without issues, so there was no reason to look for errors in the logs.
3) No. We don't have 4.7.x installed on the performance cluster anymore.
4) No
5) Not that I know of
6) Yes, same cluster, same hardware. Same traffic generator machine.

Can you run a similar CPS test in 4.7.x and 4.10.x and see if you notice any difference?

Comment 6 Flavio Leitner 2022-06-24 14:43:44 UTC
Hi,

Thanks for all the answers.

Can you provide more details on the CPS test? I am not familiar with that.

Are you testing IPv4 only, IPv6 only, or both?

Perhaps there is some extra or different workload running on the CPUs that also run the OVS threads (handler and revalidator).
It would be very helpful if you could identify those tasks, because if other tasks are consuming enough CPU to prevent
OVS from completing its processing in time, OVS will reduce the maximum number of flows allowed in the data path.

The command 'ovs-appctl upcall/show' should show the current number of flows, the maximum, and the limit. It
would be great to see snapshots of that output every 5 seconds, from the start of the test until the issue reproduces.
That will tell us how the load changed over time from the OVS perspective.
For example:
# ovs-appctl upcall/show
system@ovs-system:
  flows         : (current 89359) (avg 82644) (max 207997) (limit 93500)
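
A simple loop like this (just a sketch; assuming ovs-appctl is available on the worker node) would be enough to collect those snapshots:

# append a timestamped snapshot every 5 seconds for the whole test run
while true; do date; ovs-appctl upcall/show; sleep 5; done >> upcall-show.log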


Can you tell if the test looks okay until that message "upcall: datapath reached the dynamic limit of 200000 flows." shows up?

Thanks,
fbl

Comment 7 i.turturica@f5.com 2022-06-24 19:16:26 UTC
The goal of a CPS test is to find the maximum number of connections per second. The load generator tool has a number of virtual users (threads); each opens a new TCP connection, sends an HTTP/1.1 request for a 128-byte file, waits for a 200 OK response, and closes the connection. Rinse and repeat.

This is IPv4 only.

While the test is running, nothing else on the cluster is running. There's only one namespace for router (TMM), and one for app pods.

Where are you running the "ovs-appctl upcall/show" command? And where do you see the 200000 flows message?

Comment 8 Flavio Leitner 2022-06-27 17:15:45 UTC
So, the CPS test runs outside of the cluster and it connects to service IP addresses?

Is the CPS test tool open source? If not, would it be possible to provide the tool for us to run in our labs?
It is easier to try to reproduce the issue if we use the same tool.

The command is "ovs-appctl upcall/show" and it should be run on the worker nodes, where you should eventually also see the error message in the journal log.
e.g., I found one sample in the osp4-worker-1 sosreport.
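
On an OCP node, something along these lines should work (a sketch; it assumes the usual debug-pod access to the host):

oc debug node/osp4-worker-1 -- chroot /host ovs-appctl upcall/show
oc debug node/osp4-worker-1 -- chroot /host sh -c "journalctl -t ovs-vswitchd | grep 'dynamic limit'"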

Comment 9 Flavio Leitner 2022-06-27 17:51:02 UTC
Another thing is that we have optimized the components between 4.7 and 4.10 to reduce the number of flows and increase efficiency,
so I wonder if there could be a different deployment configuration or different features enabled, like ACLs.
If ACLs were not enabled before but are enabled now, I think they could make such a difference.

Comment 10 i.turturica@f5.com 2022-06-27 18:11:48 UTC
I just ran the test again and I can see the warnings in the journalctl:

Jun 27 17:49:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00158|ofproto_dpif_upcall(handler92)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 27 17:49:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00335|ofproto_dpif_upcall(handler89)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 27 17:49:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00140|ofproto_dpif_upcall(handler93)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 27 17:49:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00261|ofproto_dpif_upcall(handler90)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 27 17:49:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00011|ofproto_dpif_upcall(handler109)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 27 17:50:14 osp4-worker-1 ovs-vswitchd[3735]: ovs|01188|ofproto_dpif_upcall(revalidator110)|INFO|Spent an unreasonably long 2229ms dumping flows
Jun 27 17:50:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00021|ofproto_dpif_upcall(handler103)|WARN|upcall: datapath reached the dynamic limit of 115000 flows.
Jun 27 17:51:25 osp4-worker-1 ovs-vswitchd[3735]: ovs|00353|ofproto_dpif_upcall(handler89)|WARN|upcall: datapath reached the dynamic limit of 157500 flows.
Jun 27 17:51:45 osp4-worker-1 ovs-vswitchd[3735]: ovs|01197|ofproto_dpif_upcall(revalidator110)|INFO|Spent an unreasonably long 2786ms dumping flows
Jun 27 17:52:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00054|ofproto_dpif_upcall(handler95)|WARN|upcall: datapath reached the dynamic limit of 138812 flows.
Jun 27 17:52:31 osp4-worker-1 ovs-vswitchd[3735]: ovs|01208|ofproto_dpif_upcall(revalidator110)|INFO|Spent an unreasonably long 2877ms dumping flows
Jun 27 17:53:25 osp4-worker-1 ovs-vswitchd[3735]: ovs|00273|ofproto_dpif_upcall(handler90)|WARN|upcall: datapath reached the dynamic limit of 138179 flows.


This is a CPS ingress test, so yes, the connections are initiated from a machine outside the cluster.
No, the test tool is developed internally by an F5 team. I can ask to see if it can be shared.
Have you tested CPS with any open source tools? I'm thinking of TRex and Tsung; I know a couple of teammates who used them for other projects, but I haven't tried them personally.

I'm not sure about the ACLs. I'm not aware of any ACLs that we enable in the deployment.

Comment 11 Flavio Leitner 2022-06-28 14:17:29 UTC
OK, it seems we have enough information to test and check the results.
However, we may not succeed in reproducing the issue due to differences in how it is reproduced (tools, deployment setup).
Let us know if you can share the tool so that we can check the same thing.

Comment 12 Adrián Moreno 2022-07-04 17:47:26 UTC
Hello,

I'd like to ask a question. Looking at the sos report, I see a rather sub-optimal CPU configuration:

From OVS's perspective most of the packets go
ens3f1 -> conntrack -> load-balanced (d3bfcbd63b66f8b and  bb917c41bf6517d) and back

According to udev, ens3f1 is /devices/pci0000:d7/0000:d7:00.0/0000:d8:00.1/net/ens3f1 which is:

d8:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
[...]
    NUMA node: 1

Looking at lscpu:
[...]
CPU(s):              56
On-line CPU(s) list: 0-55
Thread(s) per core:  1
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
[...]
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55

So, most of the processing will happen in NUMA node 1.

However:
$ grep CPUAffinity etc/systemd/system.conf
#CPUAffinity=1 2
CPUAffinity=0 1 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54

OVS is set to an affinity mask that (mostly) selects CPUs in NUMA 0. This will cause lots of cross-NUMA packet copying.
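
To double-check this on the live node, something like the following should show the NIC locality and the effective OVS affinity (just a sketch; it assumes the same interface name):

# NUMA node the NIC is attached to
cat /sys/class/net/ens3f1/device/numa_node
# current CPU affinity of ovs-vswitchd
taskset -pc $(pidof ovs-vswitchd)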

Is this setup intentional?

Comment 14 Adrián Moreno 2022-07-05 12:51:03 UTC
Also, would you kindly confirm the exact 4.7.X version that was tested originally?

Comment 17 Adrián Moreno 2022-07-06 13:08:49 UTC
(In reply to i.turturica from comment #15)
> Hi Adrián,
> 
> The version was 4.7.24 according to the notes.
> 

Thanks for confirming.

> We've been using the same PerformanceProfile since early versions of 4.7.x:
> 
> apiVersion: performance.openshift.io/v2
> kind: PerformanceProfile
> metadata:
>   name: perf-profile-2m-dell-100g
> spec:
>   cpu:
>     isolated:
> 3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
>     reserved:
> 0,1,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,
> 52,54
>   hugepages:
>     defaultHugepagesSize: 2M
>     pages:
>     - count: 49152
>       node: 0
>       size: 2M
>     - count: 49152
>       node: 1
>       size: 2M
>   nodeSelector:
>     node-role.kubernetes.io/dell-workers-100g: ""
>   numa:
>     topologyPolicy: single-numa-node
>   realTimeKernel:
>     enabled: false
> 
> 
> Has any of this NUMA selection changed in 4.10.x, compared to 4.7?

We are going to verify if there is any significant functional difference between OCP versions from the PerformanceProfile point of view.
However, the NUMA selection seems pretty clear to me: OVS and all the kernel networking run _only_ on NUMA 0, while all the traffic comes in (and eventually goes out) through NUMA 1.

Was this test originally (4.7) run on the exact same node and using the same network card?

Comment 18 i.turturica@f5.com 2022-07-06 18:01:37 UTC
I'm sure we used the same network card, but there's no way to tell now what nodes the pods were running on.

Comment 22 Francesco Romani 2022-07-11 10:14:31 UTC
(In reply to i.turturica from comment #15)
> [... same PerformanceProfile as quoted in comment #17 above ...]

From the performance-profile perspective, two things are quite surprising:
1. Half of the CPUs are reserved, which is quite a lot. We usually recommend reserving fewer CPUs. Please check whether you really need to reserve that many.
2. Per https://bugzilla.redhat.com/show_bug.cgi?id=2097864#c12, all the reserved CPUs are on NUMA node 0. You probably reserved the whole of NUMA node 0, but this conflicts with the processing done by OVS using a NIC attached to NUMA node 1. Maybe you can reserve CPUs on both NUMA nodes to be more NUMA-friendly (see the sketch below)?
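
For illustration only (the CPU lists here are hypothetical; size the reserved set to what you actually need), a more NUMA-friendly split could look like this in the PerformanceProfile:

  cpu:
    # a few cores from each NUMA node for systemd services (kubelet, crio, OVS)
    reserved: 0,2,4,6,1,3,5,7
    # everything else remains available to the CPU manager for guaranteed pods
    isolated: 8-55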

Comment 24 Martin Sivák 2022-07-18 07:39:13 UTC
> The "reserved" CPUs are for infra, 0, 1 and all the NUMA-0 CPUs.
> The "isolated" CPUs are reserved for the guaranteed pods. 

This is not how the pinning works.

The reserved pool is used for the operating system services (anything managed by systemd: kubelet, crio, and OVS!).

The isolated pool is used for ALL pods, including OCP infrastructure, best-effort and burstable containers, etc. The isolated CPUs are merely eligible for isolation; the k8s CPU manager takes the CPUs it assigns to guaranteed pods from this pool.


Please see both:

https://docs.openshift.com/container-platform/4.10/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#cnf-cpu-infra-container_cnf-master

and

https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#explicitly-reserved-cpu-list




Interrupt pinning is handled depending on the value of globallyDisableIrqLoadBalancing:
- true (the default in 4.6): all interrupt handling is moved to the reserved pool, which can exhaust the interrupt table
- false (preferred for 4.7+): interrupts are dynamically removed from CPUs that are assigned to a guaranteed container, at container startup

https://docs.openshift.com/container-platform/4.10/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#managing-device-interrupt-processing-for-guaranteed-pod-isolated-cpus_cnf-master
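
For reference, in the v2 PerformanceProfile this is a top-level spec field; a minimal sketch with the 4.7+ behavior described above:

spec:
  # false: IRQs are excluded per guaranteed container instead of all moving to the reserved pool
  globallyDisableIrqLoadBalancing: false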

Comment 25 Adrián Moreno 2022-07-25 10:42:20 UTC
> The "reserved" CPUs are for infra, 0, 1 and all the NUMA-0 CPUs.
> The "isolated" CPUs are reserved for the guaranteed pods (in our case, the TMM pods). All of them are on NUMA-1 and so are the NICs and therefore all the processing should happen on NUMA-1, which is correct.

I don't think that is what is happening here. All the packets coming from the NIC (NUMA-1) are handled by OVS (NUMA-0) and then sent to the pods (which may or may not run on NUMA-1). So we could have two cross-NUMA copies per packet that hits a pod.

Comment 26 Adrián Moreno 2022-07-26 10:53:40 UTC
Is it possible to get all the available information of the 4.7 deployment?

Comment 27 i.turturica@f5.com 2022-07-26 16:51:35 UTC
We don't have any 4.7 deployments in the lab, since everyone has already migrated to 4.10. I can tell you that 4.7 and 4.10 were installed similarly, using the baremetal-deploy ansible playbook (https://github.com/openshift-kni/baremetal-deploy/tree/master/ansible-ipi-install). You should already have all the manifests from 4.10; the only major difference would be the Tuned patches, which are not needed in 4.10 anymore. Example: https://gist.github.com/williamcaban/761ebb5b2dc6d9bcfa4978b83292ea01#file-tuned-patch-ran-du-ocp4-6-16-yaml

Comment 28 Adrián Moreno 2022-07-27 12:46:16 UTC
Do you, by any chance, have the configuration that was used to deploy the 4.7 cluster (if you used ansible-ipi-install, this would be the inventory files, variables, etc.)?
Specifically, I'm looking for a configuration option that would make a _huge_ difference here: were you using the OpenShiftSDN or the OVNKubernetes CNI in your 4.7 cluster?
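
On the current cluster, something like this should print the CNI type in use (a sketch; it assumes the usual Network operator object):

oc get network.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.type}'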

Comment 29 i.turturica@f5.com 2022-07-27 17:03:08 UTC
Created attachment 1899737 [details]
inventory

Attaching the ansible inventory file. Nothing in there changed between 4.7 and 4.10 (aside from the version variable).

This is the section where we're configuring the jumbo MTU, post-install:

oc annotate networks.operator.openshift.io cluster  'networkoperator.openshift.io/network-migration'=""
oc patch Network.operator.openshift.io cluster --type='merge' --patch '{ "spec": { "migration": {"networkType": "OVNKubernetes" } } }'
echo Waiting 1 minute...
sleep 60
oc patch networks.operator.openshift.io cluster --type merge  --patch '{"spec": {"defaultNetwork": {"ovnKubernetesConfig": {"mtu": 9000}}}}'
oc annotate networks.operator.openshift.io cluster  'networkoperator.openshift.io/network-migration'-
oc patch Network.operator.openshift.io cluster --type='merge' --patch '{ "spec": { "migration": null } }'

Comment 35 Martin Sivák 2022-09-15 12:12:38 UTC
Hmm

> the only
> major difference would be the Tuned patches that are not needed in 4.10
> anymore. Example:
> https://gist.github.com/williamcaban/761ebb5b2dc6d9bcfa4978b83292ea01#file-
> tuned-patch-ran-du-ocp4-6-16-yaml

That is a major difference:

- I am pretty certain we still do not set nohz_full by default in 4.10, nor do we disable tickless, IIRC (nohz=off).
- Also, these kernel arguments are not defaults and can affect latency: intel_idle.max_cstate=0 processor.max_cstate=1
- On the other hand, the PerformanceProfile does not seem to use the real-time kernel, so low latency might not be what is needed.
- The Tuned override also used the group.rcub=0:f:11:*:rcub.* priority setting, which was not fixed by default until very recently.


Those differences combined could indeed affect the node behavior, but I can't tell for sure without seeing the two deployments side by side (well, a must-gather from both would be enough).
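
A quick way to compare at least the kernel arguments between the two deployments would be something like this on a worker node of each (sketch only; substitute the node name):

oc debug node/osp4-worker-1 -- chroot /host cat /proc/cmdline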

Comment 37 Srujan Reddy 2022-10-12 15:15:35 UTC
Hi, We are working on creating a new setup with 4.7 version with the same config that we used previously. As suggested https://access.redhat.com/support/cases/#/case/03239349/discussion?commentId=a0a6R00000Ta3M5QAJ, we are simultaneously working on adding the dpdk support to the application. Hopefully, we will have some data in next couple of weeks.