Created attachment 1752834 [details]
nftables ruleset from infrastructure node

In OVNKubernetes-based clusters, there is a large HTTP ingress performance drop when the cluster has a relatively high number of NodePort services (this performance drop is not present in OpenShiftSDN scenarios). Our first findings show that this performance degradation grows proportionally with the number of NodePort services in the cluster.

It's important to note that this ingress performance degradation happens even when these NodePort services are not backing the ingress route we use for the test, which is the case for the examples in this message.

The examples below come from a 3 worker node (m5.2xlarge) cluster with one ingress controller running on top of an infra node (m5.12xlarge). The client pod (mb) runs on a different node (m5.4xlarge) using HostNetwork. The test runs the mb client against 1 HTTP route backed by 1 ClusterIP service backed by 100 nginx pods serving static content.

This is the client configuration:

sh-4.2$ cat conf.json
[
  {
    "scheme": "http",
    "tls-session-reuse": true,
    "host": "nginx-http-perf.apps.rsevilla-ovn-4.7.perfscale.devcluster.openshift.com",
    "port": 80,
    "method": "GET",
    "path": "/1024.html",
    "delay": {
      "min": 0,
      "max": 0
    },
    "keep-alive-requests": 100,
    "clients": 200
  }
]

---- Baseline performance, no NodePort services present in the cluster ----

$ oc get svc -A | grep -c NodePort
0

# Number of lflows and OpenFlow flows from the OVN sbdb container
sh-4.4# ovn-sbctl --no-leader-only dump-flows | wc -l
10617
sh-4.4# ovs-ofctl dump-flows br-int | wc -l
25660

# Run the test
sh-4.2$ mb -i conf.json -d 30
Time: 30.10s
Sent: 369.80MiB, 12.29MiB/s
Recv: 3.19GiB, 108.37MiB/s
Hits: 2514855, 83552.04/s

As you can see above, we reached an average of 83552 reqs/sec.

The following perf trace was taken while running the workload and shows the top 10 kernel functions from the node running the ingress controller:

sh-5.0# perf report --stdio | head -14
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ...............................................
#
     6.39%  swapper  [kernel.kallsyms]  [k] masked_flow_lookup
     3.81%  haproxy  [kernel.kallsyms]  [k] masked_flow_lookup
     2.68%  swapper  [kernel.kallsyms]  [k] intel_idle
     2.04%  swapper  [kernel.kallsyms]  [k] nft_do_chain
     1.96%  haproxy  [kernel.kallsyms]  [k] do_syscall_64
     1.28%  haproxy  [kernel.kallsyms]  [k] nft_do_chain
     1.26%  swapper  [kernel.kallsyms]  [k] __nf_conntrack_find_get
     1.25%  haproxy  [kernel.kallsyms]  [k] entry_SYSCALL_64
     1.04%  haproxy  [kernel.kallsyms]  [k] syscall_return_via_sysret
     0.83%  swapper  [kernel.kallsyms]  [k] _raw_spin_lock

---- Degraded performance, 100 NodePort services present in the cluster ----

$ oc get svc -A | grep -c NodePort
100

# Number of lflows and OpenFlow flows from the OVN sbdb container
sh-4.4# ovs-ofctl dump-flows br-int | wc -l
29664
sh-4.4# ovn-sbctl --no-leader-only dump-flows | wc -l
12617

# Run the test
sh-4.2$ mb -i conf.json -d 30
Time: 30.10s
Sent: 266.91MiB, 8.87MiB/s
Recv: 2.30GiB, 78.22MiB/s
Hits: 1815166, 60305.49/s

Performance has dropped to an average of 60305 requests/sec.

Now perf report shows a remarkable increase in the time spent in nftables-related functions:

sh-5.0# perf report --stdio | head -14
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ...............................................
#
     6.29%  swapper  [kernel.kallsyms]  [k] nft_do_chain
     5.26%  haproxy  [kernel.kallsyms]  [k] nft_do_chain
     4.10%  swapper  [kernel.kallsyms]  [k] masked_flow_lookup
     2.47%  swapper  [kernel.kallsyms]  [k] __nft_match_eval.isra.4
     2.39%  swapper  [kernel.kallsyms]  [k] nft_meta_get_eval
     2.32%  haproxy  [kernel.kallsyms]  [k] nft_meta_get_eval
     2.29%  haproxy  [kernel.kallsyms]  [k] masked_flow_lookup
     2.23%  haproxy  [kernel.kallsyms]  [k] __nft_match_eval.isra.4
     2.12%  haproxy  [kernel.kallsyms]  [k] tcp_mt
     2.09%  swapper  [kernel.kallsyms]  [k] tcp_mt

---- Higher degraded performance, 500 NodePort services present in the cluster ----

If we repeat the same test after creating 500 NodePort services, the client reports an even lower performance.

$ oc get svc -A | grep -c NodePort
500

# Number of lflows and OpenFlow flows from the OVN sbdb container
sh-4.4# ovn-sbctl --no-leader-only dump-flows | wc -l
33817
sh-4.4# ovs-ofctl dump-flows br-int | wc -l
75829

# Run the test
sh-4.2$ mb -i conf.json -d 30
Time: 30.05s
Sent: 119.55MiB, 3.98MiB/s
Recv: 1.03GiB, 35.09MiB/s
Hits: 813047, 27055.88/s

The throughput is now 27055.88 reqs/sec, which is ~32% of the baseline throughput, and the time spent in nftables has grown considerably:

sh-5.0# perf report --stdio | head -14
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ...............................................
#
    12.26%  swapper  [kernel.kallsyms]  [k] nft_do_chain
    11.66%  haproxy  [kernel.kallsyms]  [k] nft_do_chain
     5.38%  swapper  [kernel.kallsyms]  [k] __nft_match_eval.isra.4
     4.99%  haproxy  [kernel.kallsyms]  [k] __nft_match_eval.isra.4
     4.84%  swapper  [kernel.kallsyms]  [k] tcp_mt
     4.84%  swapper  [kernel.kallsyms]  [k] nft_meta_get_eval
     4.54%  haproxy  [kernel.kallsyms]  [k] nft_meta_get_eval
     4.51%  haproxy  [kernel.kallsyms]  [k] tcp_mt
     2.38%  swapper  [kernel.kallsyms]  [k] masked_flow_lookup
     2.00%  haproxy  [kernel.kallsyms]  [k] nft_match_eval
     1.96%  swapper  [kernel.kallsyms]  [k] nft_match_eval
     1.24%  haproxy  [kernel.kallsyms]  [k] masked_flow_lookup

Attached to this BZ you can find the nftables ruleset from the infrastructure node (running the haproxy router) and the latest perf profile (with 500 NodePort services).
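For reference, a minimal way to stand up a comparable number of NodePort services for reproduction could look like the loop below. This is a hypothetical helper, not necessarily how the services in this cluster were created; the namespace and service names are examples, and the services do not need to back the route under test.

# Hypothetical reproduction helper; namespace and service names are examples only.
# Node ports are allocated automatically by the API server.
$ oc new-project nodeport-scale
$ for i in $(seq 1 100); do
    oc create service nodeport nodeport-svc-${i} --tcp=8080:8080 -n nodeport-scale
  done
$ oc get svc -A | grep -c NodePort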
Created attachment 1752841 [details]
Perf profile from the infrastructure node while running the test in the 500 NodePort services scenario
Working with Raul we've managed to figure out what the problem is. The cluster used for this scale assessment has a lot of NodePort services, and even though they're not targeted by the perf tests (which in fact target an OCP route, DNAT-ed to the endpoints directly via HAProxy), ovnkube-node's iptables rule setup in "filter FORWARD" causes a lot of unnecessary lookups and packet matching in nftables, leading to the spike in nft_do_chain.

I will submit a patch with a fix soon, hence setting the target release to 4.7. If the patch takes time to integrate, I'll move it out to 4.8.
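As an illustrative check on the node (the exact chain and rule layout depend on the ovn-kubernetes version, so treat these commands as an assumption rather than the exact rules involved), the per-NodePort matching cost shows up as a FORWARD rule count that grows with the number of NodePort services:

# Illustrative node-side check; rule layout varies by ovn-kubernetes version.
# With many NodePort services present, both counts grow accordingly.
$ nft list chain ip filter FORWARD | wc -l
$ iptables -S FORWARD | grep -c -- '--dport'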
We are past Code Freeze on 4.7.0. I'm going to set this with a target of 4.8.0 to avoid confusion when the PR lands. We absolutely need a clone/backport of this, though. How far back does it need to go? Is 4.6 impacted?
Yes, this will most likely need to be backported to 4.6 once done. Excuse me for forgetting to set the target release to 4.8; I was hoping the PR would go in before code freeze.
This made it in with the latest downstream merge: https://github.com/openshift/ovn-kubernetes/pull/440, so setting to MODIFIED.
Starting perfscale testing from my side. Will update with the results I get
The patch works fine and improves the performance in this scenario. In a similar cluster I got the following results.

# Created 100 NodePort services targeting one pod each
[root@ip-172-31-33-151 workloads]# oc get svc -A | grep -c NodePort
100

# mb config targeting an HTTP route backed by 100 nginx pods
sh-4.4$ cat conf.json
[
  {
    "scheme": "http",
    "tls-session-reuse": true,
    "host": "nginx-http-scale-http.apps.rsevilla-ovn-48.perfscale.devcluster.openshift.com",
    "port": 80,
    "method": "GET",
    "path": "/1024.html",
    "delay": {
      "min": 0,
      "max": 0
    },
    "keep-alive-requests": 100,
    "clients": 200
  }
]

# Run the test for 30 seconds
sh-4.4$ mb -i conf.json -d 30
Time: 30.05s
Sent: 387.46MiB, 12.90MiB/s
Recv: 3.23GiB, 110.18MiB/s
Hits: 2552171, 84944.37/s

# Run the test for 2 minutes
sh-4.4$ mb -i conf.json -d 120
Time: 120.03s
Sent: 1.50GiB, 12.76MiB/s
Recv: 12.78GiB, 109.03MiB/s
Hits: 10089151, 84053.10/s
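As a complementary node-side sanity check (commands are illustrative and assume iptables-based host rules; they are not part of the verification run above), with the fix in place the FORWARD rule count should stay roughly flat as NodePort services are added:

# Illustrative check; with the fix, this count should no longer grow
# proportionally with the number of NodePort services in the cluster.
$ iptables -S FORWARD | wc -l
$ oc get svc -A | grep -c NodePort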
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438