Bug 2097864

| Summary: | openvswitch is consuming more CPU post upgrade to 4.10.14 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | siva kanakala <skanakal> |
| Component: | ovsdb | Assignee: | Adrián Moreno <amorenoz> |
| Status: | ASSIGNED | QA Contact: | Hekai Wang <hewang> |
| Severity: | high | Priority: | medium |
| Version: | FDP 22.L | CC: | amorenoz, anbhat, ctrautma, eglottma, fleitner, fromani, i.turturica, jhsiao, jpradhan, miabbott, mmarkand, mmichels, mnguyen, msivak, pvalerio, ralongi, rravaiol, travier |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | Flags: | amorenoz: needinfo-; mmarkand: needinfo? (amorenoz) |
| Hardware: | x86_64 | OS: | Linux |
| Last Closed: | 2022-09-27 17:27:31 UTC | Type: | Bug |
Comment 3
Flavio Leitner
2022-06-23 20:33:02 UTC
Hi Flavio,

Thanks for looking into this. I'll try to answer the questions with what I know so far.

1) I'm not the same person who ran this CPS test on 4.7.x, but I trust the numbers we got last time.
2) Not sure; we were able to run the CPS test in the previous version without issues, so there was no reason to look for errors in the logs.
3) No. We don't have 4.7.x installed on the performance cluster anymore.
4) No.
5) Not that I know of.
6) Yes, same cluster, same hardware, same traffic generator machine.

Can you run a similar CPS test in 4.7.x and 4.10.x and see if you notice any difference?

Hi,

Thanks for all the answers. Can you provide more details on the CPS test? I am not familiar with it. Are you testing IPv4 only, IPv6 only, or both?

Perhaps there is some extra or different workload running on the CPUs that also run the OVS threads (handler and revalidator). It would be very helpful if you could identify those tasks, because if other tasks consume enough CPU to prevent OVS from completing its processing in time, the maximum number of allowed flows in the datapath will be reduced.

The command 'ovs-appctl upcall/show' shows the current number of flows, the maximum, and the limit. It would be great to see snapshots of that output every 5 seconds, from the start of the test until the issue reproduces; that will tell us how the load changed over time from the OVS perspective. For example:

# ovs-appctl upcall/show
system@ovs-system:
  flows         : (current 89359) (avg 82644) (max 207997) (limit 93500)

Can you tell if the test looks okay until that message "upcall: datapath reached the dynamic limit of 200000 flows." shows up?

Thanks,
fbl
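A minimal way to capture those 5-second snapshots on a worker node could look like the sketch below; this is not a tool referenced in the bug, just an assumed plain shell loop, and the output path is arbitrary.

```bash
#!/bin/bash
# Capture 'ovs-appctl upcall/show' every 5 seconds with a timestamp,
# from the start of the test until the problem reproduces (stop with Ctrl-C).
OUT=/tmp/upcall-show-$(hostname)-$(date +%Y%m%d-%H%M%S).log

while true; do
    echo "=== $(date -u +%FT%TZ) ===" >> "$OUT"
    ovs-appctl upcall/show >> "$OUT" 2>&1
    sleep 5
done
```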
The goal of a CPS test is to find the maximum number of connections per second. The load generator tool has a number of virtual users (threads); each opens a new TCP connection, sends an HTTP/1.1 request for a 128-byte file, waits for a 200 OK response, and closes the connection. Rinse and repeat (a rough approximation of this per-user loop is sketched at the end of this exchange). This is IPv4 only. While the test is running, nothing else on the cluster is running. There's only one namespace for the router (TMM), and one for app pods.

Where are you running the "ovs-appctl upcall/show" command? And where do you see the 200000 flows message?

So, the CPS test runs outside of the cluster and connects to service IP addresses? Is the CPS test tool open source? Or perhaps is it possible to provide the tool for us to run in our labs? It is easier to try to reproduce the issue if we use the same tool.

The command is "ovs-appctl upcall/show" and it should run on the worker nodes, where you should also eventually see the error message in the journal log; e.g., I found one sample in the osp4-worker-1 sosreport.

Another thing is that we have optimized the components to reduce the number of flows and increase efficiency from 4.7 to 4.10, so I wonder if there could be a different deployment configuration or different features enabled, like ACLs. If ACLs were not enabled before but are now, I think they could make such a difference.

I just ran the test again and I can see the warnings in journalctl:

Jun 27 17:49:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00158|ofproto_dpif_upcall(handler92)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 27 17:49:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00335|ofproto_dpif_upcall(handler89)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 27 17:49:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00140|ofproto_dpif_upcall(handler93)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 27 17:49:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00261|ofproto_dpif_upcall(handler90)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 27 17:49:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00011|ofproto_dpif_upcall(handler109)|WARN|upcall: datapath reached the dynamic limit of 200000 flows.
Jun 27 17:50:14 osp4-worker-1 ovs-vswitchd[3735]: ovs|01188|ofproto_dpif_upcall(revalidator110)|INFO|Spent an unreasonably long 2229ms dumping flows
Jun 27 17:50:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00021|ofproto_dpif_upcall(handler103)|WARN|upcall: datapath reached the dynamic limit of 115000 flows.
Jun 27 17:51:25 osp4-worker-1 ovs-vswitchd[3735]: ovs|00353|ofproto_dpif_upcall(handler89)|WARN|upcall: datapath reached the dynamic limit of 157500 flows.
Jun 27 17:51:45 osp4-worker-1 ovs-vswitchd[3735]: ovs|01197|ofproto_dpif_upcall(revalidator110)|INFO|Spent an unreasonably long 2786ms dumping flows
Jun 27 17:52:22 osp4-worker-1 ovs-vswitchd[3735]: ovs|00054|ofproto_dpif_upcall(handler95)|WARN|upcall: datapath reached the dynamic limit of 138812 flows.
Jun 27 17:52:31 osp4-worker-1 ovs-vswitchd[3735]: ovs|01208|ofproto_dpif_upcall(revalidator110)|INFO|Spent an unreasonably long 2877ms dumping flows
Jun 27 17:53:25 osp4-worker-1 ovs-vswitchd[3735]: ovs|00273|ofproto_dpif_upcall(handler90)|WARN|upcall: datapath reached the dynamic limit of 138179 flows.

This is a CPS ingress test, so yes, the connections are initiated from a machine outside the cluster. No, the test tool is developed internally by an F5 team; I can ask whether it can be shared. Have you tested CPS with any open-source tools? I'm thinking of TRex and Tsung; I know a couple of teammates who have used them for other projects, but I haven't tried them personally.

I'm not sure about the ACLs. I'm not aware of any ACLs that we enable in the deployment.

OK, it seems we have enough information to test and check the results. However, it may happen that we don't succeed in reproducing the issue due to differences in how it is reproduced (tools, deployment setup). Let us know if you can share the tool so that we can check the same thing.
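Since the F5 tool may not be shareable, here is a very rough bash-and-curl approximation of the per-user loop described above (new connection, HTTP/1.1 GET of a small object, close, repeat). The service IP, object path, and user count are placeholders I made up, and this is nowhere near a real CPS generator such as TRex or Tsung; it only illustrates the traffic pattern.

```bash
#!/bin/bash
# Rough approximation of the CPS workload: each curl invocation opens a new
# TCP connection, fetches a small object over HTTP/1.1, and closes the socket.
VIP="203.0.113.10"                 # placeholder service IP
URL="http://${VIP}/file-128b"      # placeholder ~128-byte object
USERS=64                           # number of parallel "virtual users"

worker() {
    while true; do
        curl --silent --output /dev/null --http1.1 "$URL"
    done
}

for _ in $(seq "$USERS"); do
    worker &
done
wait
```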
Hello,

I'd like to ask a question. Looking at the sos report, I see a rather sub-optimal CPU configuration:
From OVS's perspective, most of the packets go:
ens3f1 -> conntrack -> load-balanced (d3bfcbd63b66f8b and bb917c41bf6517d) and back
According to udev, ens3f1 is /devices/pci0000:d7/0000:d7:00.0/0000:d8:00.1/net/ens3f1 which is:
d8:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
[...]
NUMA node: 1
Looking at lscpu:
[...]
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 1
Core(s) per socket: 28
Socket(s): 2
NUMA node(s): 2
[...]
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
So, most of the processing will happen in NUMA node 1.
However:
$ grep CPUAffinity etc/systemd/system.conf
#CPUAffinity=1 2
CPUAffinity=0 1 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54
OVS is set to an affinity mask that (mostly) selects CPUs in NUMA 0. This will cause lots of cross-NUMA packet copying.
Is this setup intentional?
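For anyone who wants to reproduce this check on a node, a small sketch of the commands involved is below; the interface name ens3f1 is taken from this bug, so adjust it for other systems.

```bash
#!/bin/bash
# Which NUMA node is the NIC attached to?
cat /sys/class/net/ens3f1/device/numa_node

# Which CPUs belong to which NUMA node?
lscpu | grep -E '^NUMA node[0-9]+ CPU'

# Which CPUs is systemd (and everything it starts, including ovs-vswitchd)
# pinned to?
grep '^CPUAffinity' /etc/systemd/system.conf
```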
Also, would you kindly confirm the exact 4.7.x version that was tested originally?

(In reply to i.turturica from comment #15)
> Hi Adrián,
>
> The version was 4.7.24 according to the notes.

Thanks for confirming.

> We've been using the same PerformanceProfile since early versions of 4.7.x:
>
> apiVersion: performance.openshift.io/v2
> kind: PerformanceProfile
> metadata:
>   name: perf-profile-2m-dell-100g
> spec:
>   cpu:
>     isolated: 3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
>     reserved: 0,1,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
>   hugepages:
>     defaultHugepagesSize: 2M
>     pages:
>     - count: 49152
>       node: 0
>       size: 2M
>     - count: 49152
>       node: 1
>       size: 2M
>   nodeSelector:
>     node-role.kubernetes.io/dell-workers-100g: ""
>   numa:
>     topologyPolicy: single-numa-node
>   realTimeKernel:
>     enabled: false
>
> Has any of this NUMA selection changed in 4.10.x, compared to 4.7?

We are going to verify whether there is any significant functional difference between OCP versions from the PerformanceProfile point of view. However, the NUMA selection seems pretty clear to me: OVS and all the kernel networking run _only_ on NUMA 0, while all the traffic comes in (and eventually goes out) through NUMA 1.

Was this test originally (4.7) run on the exact same node and using the same network card?

I'm sure we used the same network card, but there's no way to tell now what nodes the pods were running on.

(In reply to i.turturica from comment #15)
> Hi Adrián,
>
> The version was 4.7.24 according to the notes.
>
> We've been using the same PerformanceProfile since early versions of 4.7.x:
> [...]
> Has any of this NUMA selection changed in 4.10.x, compared to 4.7?

From the performance-profile perspective, two things are quite surprising:

1. Half of the CPUs are reserved, which is quite a lot. We usually recommend reserving fewer CPUs; please check whether you really need to reserve so many.
2. Per https://bugzilla.redhat.com/show_bug.cgi?id=2097864#c12, all the reserved CPUs are on NUMA node 0 - you probably reserved the whole NUMA node 0, but this conflicts with the processing done by OVS using a NIC attached to NUMA node 1. Maybe you can reserve CPUs on both NUMA nodes to be more NUMA friendly?

> The "reserved" CPUs are for infra, 0, 1 and all the NUMA-0 CPUs.
> The "isolated" CPUs are reserved for the guaranteed pods.

This is not how the pinning works. The reserved pool is used for the operating system services (anything managed by systemd - kubelet, crio, and OVS!). The isolated pool is used for ALL pods, including OCP infrastructure, best-effort and burstable containers, etc. The isolated CPUs are merely ready for isolation, and the k8s CPU manager takes the CPUs it assigns to guaranteed pods from this pool.
Please see both:
https://docs.openshift.com/container-platform/4.10/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#cnf-cpu-infra-container_cnf-master
and
https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#explicitly-reserved-cpu-list

Interrupt pinning is handled depending on the value of globallyDisableIRQBalancing:
- true (default for 4.6): all interrupt handling is moved to the reserved pool, which can exhaust the interrupt table
- false (preferred for 4.7+): interrupts are dynamically removed from CPUs that are assigned to a guaranteed container, at container startup

https://docs.openshift.com/container-platform/4.10/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#managing-device-interrupt-processing-for-guaranteed-pod-isolated-cpus_cnf-master

> The "reserved" CPUs are for infra, 0, 1 and all the NUMA-0 CPUs.
> The "isolated" CPUs are reserved for the guaranteed pods (in our case, the TMM pods). All of them are on NUMA-1 and so are the NICs and therefore all the processing should happen on NUMA-1, which is correct.
I don't think that is what is happening here. All the packets coming from the NIC (NUMA-1) are being handled by OVS (NUMA-0) and then sent to the pods (which may or may not run on NUMA-1). So we could have 2 cross-NUMA copies per packet that hits a pod.
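To see the practical effect of this split on a node, one could check the CPU affinity actually applied to the systemd-managed services that live in the reserved pool; a sketch, assuming the standard process names and that OVS runs on the host (as it does with OVN-Kubernetes):

```bash
#!/bin/bash
# Print the CPU list each reserved-pool service is currently allowed to run on.
for p in ovs-vswitchd ovsdb-server crio kubelet; do
    pid=$(pidof "$p" | awk '{print $1}')
    [ -n "$pid" ] && printf '%-12s %s\n' "$p" \
        "$(grep Cpus_allowed_list /proc/"$pid"/status)"
done
```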
Is it possible to get all the available information about the 4.7 deployment?

We don't have any 4.7 deployments in the lab, since everyone has already migrated to 4.10. I can tell you that 4.7 and 4.10 were installed similarly, using the baremetal-deploy ansible playbook (https://github.com/openshift-kni/baremetal-deploy/tree/master/ansible-ipi-install). You should already have all the manifests from 4.10; the only major difference would be the Tuned patches that are not needed in 4.10 anymore. Example: https://gist.github.com/williamcaban/761ebb5b2dc6d9bcfa4978b83292ea01#file-tuned-patch-ran-du-ocp4-6-16-yaml

Do you, by any chance, have the configuration that was used to deploy the 4.7 cluster (if you used ansible-ipi-install, this would be the inventory files, variables, etc.)? Specifically, I'm looking for a configuration option that would make a _huge_ difference here: were you using the OpenshiftSDN or the OVNKubernetes CNI in your 4.7 cluster?

Created attachment 1899737 [details]
inventory
Attaching the ansible inventory file. Nothing in there changed between 4.7 and 4.10 (aside from the version variable).
This is the section where we're configuring the jumbo MTU, post-install:
oc annotate networks.operator.openshift.io cluster 'networkoperator.openshift.io/network-migration'=""
oc patch Network.operator.openshift.io cluster --type='merge' --patch '{ "spec": { "migration": {"networkType": "OVNKubernetes" } } }'
echo Waiting 1 minute...
sleep 60
oc patch networks.operator.openshift.io cluster --type merge --patch '{"spec": {"defaultNetwork": {"ovnKubernetesConfig": {"mtu": 9000}}}}'
oc annotate networks.operator.openshift.io cluster 'networkoperator.openshift.io/network-migration'-
oc patch Network.operator.openshift.io cluster --type='merge' --patch '{ "spec": { "migration": null } }'
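For the record, a quick way to confirm what the running cluster reports for CNI type and MTU; a sketch using the cluster-scoped network objects (the MTU field path is the same one patched above, and the status field name is assumed to match current OpenShift releases):

```bash
#!/bin/bash
# Which CNI plugin is the cluster actually running?
oc get network.config/cluster -o jsonpath='{.status.networkType}{"\n"}'

# What MTU does the network operator configuration carry?
oc get network.operator/cluster \
    -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.mtu}{"\n"}'
```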
Hmm
> the only
> major difference would be the Tuned patches that are not needed in 4.10
> anymore. Example:
> https://gist.github.com/williamcaban/761ebb5b2dc6d9bcfa4978b83292ea01#file-
> tuned-patch-ran-du-ocp4-6-16-yaml
That is a major difference:
- I am pretty certain we still do not set nohz_full by default in 4.10, nor do we disable tickless operation, iirc (nohz=off).
- Also, these kernel arguments are not default and can affect latency: intel_idle.max_cstate=0 processor.max_cstate=1
- On the other hand, the PerformanceProfile does not seem to use the real-time kernel, so low latency might not be what is needed.
- The Tuned override also used the group.rcub=0:f:11:*:rcub.* priority setting, which was not fixed by default until very recently.
Those differences combined could indeed affect the node behavior, but I can't tell for sure without seeing the two deployments side by side (well, a must-gather from both would be enough).
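A side-by-side comparison of the boot-time tuning could be as simple as collecting the kernel command line and the active tuned profile on a worker from each cluster and diffing the results; a sketch (the active_profile path is the one tuned uses on RHEL/RHCOS):

```bash
#!/bin/bash
# Kernel arguments relevant to the differences discussed above.
tr ' ' '\n' < /proc/cmdline |
    grep -E 'nohz|isolcpus|rcu|intel_idle|max_cstate|tuned'

# Active tuned profile on the node.
cat /etc/tuned/active_profile
```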
Hi,

We are working on creating a new setup with the 4.7 version and the same config that we used previously. As suggested in https://access.redhat.com/support/cases/#/case/03239349/discussion?commentId=a0a6R00000Ta3M5QAJ, we are simultaneously working on adding DPDK support to the application. Hopefully, we will have some data in the next couple of weeks.