Bug 1923157 - Ingress traffic performance drop due to NodePort services
Summary: Ingress traffic performance drop due to NodePort services
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Alexander Constantinescu
QA Contact: Kedar Kulkarni
URL:
Whiteboard:
Depends On:
Blocks: 1937238
Reported: 2021-02-01 13:37 UTC by Raul Sevilla
Modified: 2021-07-27 22:38 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1937238
Environment:
Last Closed: 2021-07-27 22:37:35 UTC
Target Upstream Version:


Attachments
nftables ruleset from infrastructure node (160.50 KB, text/plain)
2021-02-01 13:37 UTC, Raul Sevilla
Perf profile from the infrastructure node while running test in 500 NodePort services scenarios (1.83 MB, application/octet-stream)
2021-02-01 13:39 UTC, Raul Sevilla


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:38:06 UTC

Description Raul Sevilla 2021-02-01 13:37:17 UTC
Created attachment 1752834 [details]
nftables ruleset from infrastructure node

In OVNKubernetes-based clusters, HTTP ingress performance drops sharply when the cluster has a relatively high number of NodePort services. (This drop is not present in OpenShiftSDN scenarios.) Our first findings show that the degradation grows proportionally with the number of NodePort services in the cluster. Note that it occurs even when these NodePort services do not back the ingress route used for the test, which is the case for all the examples in this report.

The examples below come from a cluster with 3 worker nodes (m5.2xlarge) and one ingress controller running on an infra node (m5.12xlarge).
The client pod (mb) runs on a different node (m5.4xlarge) using HostNetwork. The test runs the mb client against 1 HTTP route backed by 1 ClusterIP service, which is in turn backed by 100 nginx pods serving static content.

This is the client configuration:

sh-4.2$ cat conf.json 
[  
  {
    "scheme": "http",
    "tls-session-reuse": true,
    "host": "nginx-http-perf.apps.rsevilla-ovn-4.7.perfscale.devcluster.openshift.com",
    "port": 80,
    "method": "GET",
    "path": "/1024.html",
    "delay": {
      "min": 0,
      "max":0 
    },
    "keep-alive-requests": 100,
    "clients": 200
  }
]



---- Baseline performance, no NodePort services present in the cluster ----


$ oc get svc -A | grep -c NodePort
0

# Number of lflows and OFFlows from OVN sbdb container
sh-4.4# ovn-sbctl --no-leader-only dump-flows  | wc -l 
10617
sh-4.4
sh-4.4# ovs-ofctl dump-flows br-int | wc -l
25660

# Run the test
sh-4.2$ mb -i conf.json -d 30
Time: 30.10s
Sent: 369.80MiB, 12.29MiB/s
Recv: 3.19GiB, 108.37MiB/s
Hits: 2514855, 83552.04/s


As shown above, the baseline run averaged 83552 requests/sec.
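
For anyone comparing several runs, the requests/sec figure can be pulled out of the mb summary with a small script. This is only a sketch: the regular expression is inferred from the output pasted above, not from any mb format specification, and `parse_hits` is a hypothetical helper name.

```python
import re

# Summary text copied from the baseline run above.
MB_OUTPUT = """\
Time: 30.10s
Sent: 369.80MiB, 12.29MiB/s
Recv: 3.19GiB, 108.37MiB/s
Hits: 2514855, 83552.04/s
"""

# Pattern inferred from the pasted output (assumption, not an official format).
HITS_RE = re.compile(r"^Hits:\s*(\d+),\s*([\d.]+)/s\s*$", re.MULTILINE)


def parse_hits(text):
    """Return (total_hits, hits_per_second) from an mb summary, or None."""
    match = HITS_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


total, rate = parse_hits(MB_OUTPUT)
print(f"{total} hits, {rate:.2f} req/s")
```

The same helper can be applied unchanged to the summaries of the other runs in this report, since they share the same four-line layout.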

The following perf report was captured while the workload was running and shows the top 10 symbols on the node running the ingress controller.


sh-5.0# perf report --stdio | head -14
# Overhead  Command          Shared Object                 Symbol                                                                         
# ........  ...............  ............................  ...............................................................................
#
     6.39%  swapper          [kernel.kallsyms]             [k] masked_flow_lookup
     3.81%  haproxy          [kernel.kallsyms]             [k] masked_flow_lookup
     2.68%  swapper          [kernel.kallsyms]             [k] intel_idle
     2.04%  swapper          [kernel.kallsyms]             [k] nft_do_chain
     1.96%  haproxy          [kernel.kallsyms]             [k] do_syscall_64
     1.28%  haproxy          [kernel.kallsyms]             [k] nft_do_chain
     1.26%  swapper          [kernel.kallsyms]             [k] __nf_conntrack_find_get
     1.25%  haproxy          [kernel.kallsyms]             [k] entry_SYSCALL_64
     1.04%  haproxy          [kernel.kallsyms]             [k] syscall_return_via_sysret
     0.83%  swapper          [kernel.kallsyms]             [k] _raw_spin_lock



---- Degraded performance, 100 NodePort services present in the cluster ----

$ oc get svc -A | grep -c NodePort
100

# Number of lflows and OFFlows from OVN sbdb container
sh-4.4# ovs-ofctl dump-flows br-int | wc -l
29664
sh-4.4# ovn-sbctl --no-leader-only dump-flows  | wc -l 
12617

# Run the test
sh-4.2$ mb -i conf.json -d 30
Time: 30.10s
Sent: 266.91MiB, 8.87MiB/s
Recv: 2.30GiB, 78.22MiB/s
Hits: 1815166, 60305.49/s

Performance has dropped to an average of 60305 requests/sec.


The perf report now shows a marked increase in the time spent in nftables-related functions:

sh-5.0# perf report --stdio | head -14
# Overhead  Command          Shared Object                 Symbol                                                                         
# ........  ...............  ............................  ...............................................................................
#
Overhead  Command          Shared Object                 Symbol                                                                                                                                                                               
   6.29%  swapper          [kernel.kallsyms]             [k] nft_do_chain
   5.26%  haproxy          [kernel.kallsyms]             [k] nft_do_chain
   4.10%  swapper          [kernel.kallsyms]             [k] masked_flow_lookup
   2.47%  swapper          [kernel.kallsyms]             [k] __nft_match_eval.isra.4
   2.39%  swapper          [kernel.kallsyms]             [k] nft_meta_get_eval
   2.32%  haproxy          [kernel.kallsyms]             [k] nft_meta_get_eval
   2.29%  haproxy          [kernel.kallsyms]             [k] masked_flow_lookup
   2.23%  haproxy          [kernel.kallsyms]             [k] __nft_match_eval.isra.4
   2.12%  haproxy          [kernel.kallsyms]             [k] tcp_mt
   2.09%  swapper          [kernel.kallsyms]             [k] tcp_mt


---- Higher degraded performance, 500 NodePort services present in the cluster ----

If we repeat the same test after creating 500 NodePort services, the client reports even lower performance.

$ oc get svc -A | grep -c NodePort
500


# Number of lflows and OFFlows from OVN sbdb container
sh-4.4# ovn-sbctl --no-leader-only dump-flows  | wc -l 
33817
sh-4.4# ovs-ofctl dump-flows br-int | wc -l
75829


# Run the test
sh-4.2$ mb -i conf.json -d 30
Time: 30.05s
Sent: 119.55MiB, 3.98MiB/s
Recv: 1.03GiB, 35.09MiB/s
Hits: 813047, 27055.88/s

Throughput is now 27055.88 requests/sec, which is ~32% of the baseline throughput.
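
To make the comparison explicit, the percentages can be recomputed from the three mb results reported above. This is a minimal sketch; `relative_to_baseline` is a hypothetical helper, and the only inputs are the requests/sec figures already quoted in this report.

```python
# Requests per second reported by mb, keyed by the number of NodePort services.
RESULTS = {0: 83552.04, 100: 60305.49, 500: 27055.88}


def relative_to_baseline(results, baseline_key=0):
    """Express each throughput as a percentage of the baseline run."""
    base = results[baseline_key]
    return {count: 100.0 * rate / base for count, rate in results.items()}


for count, pct in relative_to_baseline(RESULTS).items():
    print(f"{count:>3} NodePort services: {pct:.1f}% of baseline")
```

With these inputs the 100-service run comes out at about 72% of the baseline and the 500-service run at about 32%.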


The time spent in nftables has also grown considerably:

sh-5.0# perf report --stdio | head -14
# Overhead  Command          Shared Object                 Symbol                                                                         
# ........  ...............  ............................  ...............................................................................
#
Overhead  Command          Shared Object                 Symbol                       
  12.26%  swapper          [kernel.kallsyms]             [k] nft_do_chain
  11.66%  haproxy          [kernel.kallsyms]             [k] nft_do_chain
   5.38%  swapper          [kernel.kallsyms]             [k] __nft_match_eval.isra.4
   4.99%  haproxy          [kernel.kallsyms]             [k] __nft_match_eval.isra.4
   4.84%  swapper          [kernel.kallsyms]             [k] tcp_mt
   4.84%  swapper          [kernel.kallsyms]             [k] nft_meta_get_eval
   4.54%  haproxy          [kernel.kallsyms]             [k] nft_meta_get_eval
   4.51%  haproxy          [kernel.kallsyms]             [k] tcp_mt
   2.38%  swapper          [kernel.kallsyms]             [k] masked_flow_lookup
   2.00%  haproxy          [kernel.kallsyms]             [k] nft_match_eval
   1.96%  swapper          [kernel.kallsyms]             [k] nft_match_eval
   1.24%  haproxy          [kernel.kallsyms]             [k] masked_flow_lookup
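
The rows above can be aggregated per symbol to see how much of the visible profile goes to packet filtering. The sketch below makes two assumptions: it only covers the rows pasted in this report (not the full profile), and it treats `tcp_mt` plus every `nft_*` / `__nft_*` symbol as nftables-related, which is my own grouping.

```python
# Rows copied from the 500-service perf report above.
PERF_ROWS = """\
  12.26%  swapper  [kernel.kallsyms]  [k] nft_do_chain
  11.66%  haproxy  [kernel.kallsyms]  [k] nft_do_chain
   5.38%  swapper  [kernel.kallsyms]  [k] __nft_match_eval.isra.4
   4.99%  haproxy  [kernel.kallsyms]  [k] __nft_match_eval.isra.4
   4.84%  swapper  [kernel.kallsyms]  [k] tcp_mt
   4.84%  swapper  [kernel.kallsyms]  [k] nft_meta_get_eval
   4.54%  haproxy  [kernel.kallsyms]  [k] nft_meta_get_eval
   4.51%  haproxy  [kernel.kallsyms]  [k] tcp_mt
   2.38%  swapper  [kernel.kallsyms]  [k] masked_flow_lookup
   2.00%  haproxy  [kernel.kallsyms]  [k] nft_match_eval
   1.96%  swapper  [kernel.kallsyms]  [k] nft_match_eval
   1.24%  haproxy  [kernel.kallsyms]  [k] masked_flow_lookup
"""


def overhead_by_symbol(report):
    """Sum the overhead percentage of each symbol across all commands."""
    totals = {}
    for line in report.splitlines():
        fields = line.split()
        # Skip headers and anything that is not an "NN.NN% cmd dso [k] sym" row.
        if len(fields) < 5 or not fields[0].endswith("%"):
            continue
        symbol = fields[-1]
        totals[symbol] = totals.get(symbol, 0.0) + float(fields[0].rstrip("%"))
    return totals


def is_nft_related(symbol):
    # Assumption: this grouping is illustrative, not an official classification.
    return symbol.startswith(("nft_", "__nft_")) or symbol == "tcp_mt"


totals = overhead_by_symbol(PERF_ROWS)
nft_total = sum(pct for sym, pct in totals.items() if is_nft_related(sym))
print(f"nftables-related: {nft_total:.2f}%")
print(f"masked_flow_lookup: {totals['masked_flow_lookup']:.2f}%")
```

Under that grouping, the listed nftables-related symbols add up to roughly 57% of samples, against about 3.6% for `masked_flow_lookup`.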


Attached to this BZ are the nftables ruleset from the infrastructure node (running the haproxy router) and the latest perf profile (taken with 500 NodePort services).

Comment 1 Raul Sevilla 2021-02-01 13:39:23 UTC
Created attachment 1752841 [details]
Perf profile from the infrastructure node while running test in 500 NodePort services scenarios

Comment 2 Alexander Constantinescu 2021-02-02 18:21:50 UTC
Working with Raul we've managed to figure out what the problem is.

The cluster used for this scale assessment has a lot of NodePort services. Even though the perf tests do not target them (the tests target an OCP route, which is DNAT-ed directly to the endpoints by HAProxy), the iptables rules that ovnkube-node sets up in "filter FORWARD" cause a lot of unnecessary lookups and packet matching in nftables, which leads to the spike in nft_do_chain.
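
To illustrate why unrelated traffic pays for these rules, here is a toy model of a chain that is evaluated rule by rule. It is not the ovnkube-node code or the real ruleset (see the attachment for that). It assumes one rule per NodePort service matching on protocol and destination port, and the `Rule`, `Packet`, `evaluate_chain` and `nodeport_chain` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    proto: str
    dport: int


@dataclass(frozen=True)
class Packet:
    proto: str
    dport: int


def evaluate_chain(rules, packet):
    """Walk the rules in order; return (matched, rules_evaluated)."""
    evaluated = 0
    for rule in rules:
        evaluated += 1
        if rule.proto == packet.proto and rule.dport == packet.dport:
            return True, evaluated
    return False, evaluated


def nodeport_chain(count, first_port=30000):
    # Assumption: one rule per NodePort service, on consecutive ports.
    return [Rule("tcp", first_port + i) for i in range(count)]


# A packet for the route on port 80 matches none of the NodePort rules,
# so it is compared against every rule before falling through.
route_packet = Packet("tcp", 80)
for count in (0, 100, 500):
    matched, evaluated = evaluate_chain(nodeport_chain(count), route_packet)
    print(f"{count:>3} NodePort rules -> {evaluated} evaluations, matched={matched}")
```

In this model the per-packet cost for route traffic grows linearly with the number of NodePort rules, which is consistent with the growth of nft_do_chain in the profiles.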

I will submit a patch with a fix soon, hence setting the target release to 4.7. If the patch takes time to integrate, I'll move it out to 4.8.

Comment 3 Stephen Cuppett 2021-02-08 14:22:57 UTC
We are past Code Freeze on 4.7.0. I'm going to set this with a target of 4.8.0 to avoid confusion when the PR lands. 

We absolutely need a clone/backport of this, though. How far back does it need to go, i.e. which releases are impacted? 4.6?

Comment 5 Alexander Constantinescu 2021-02-09 09:27:21 UTC
Yes, this will most likely need to be back-ported to 4.6 once done. Excuse me for forgetting to set the target release to 4.8, I was hoping the PR would go in before code freeze.

Comment 6 Alexander Constantinescu 2021-03-05 09:09:02 UTC
This made it in with the latest downstream merge: https://github.com/openshift/ovn-kubernetes/pull/440 so setting to MODIFIED

Comment 8 Raul Sevilla 2021-03-05 15:06:23 UTC
Starting perfscale testing from my side. Will update with the results I get

Comment 9 Raul Sevilla 2021-03-10 00:09:55 UTC
The patch works fine and improves the performance in this scenario.

In a similar cluster I got the following results.

# Created 100 NodePort services targeting one pod each
[root@ip-172-31-33-151 workloads]# oc get svc -A | grep  -c NodePort
100


# mb config targeting an HTTP route backed by 100 nginx pods
sh-4.4$ cat conf.json 
[  
  {
    "scheme": "http",
    "tls-session-reuse": true,
    "host": "nginx-http-scale-http.apps.rsevilla-ovn-48.perfscale.devcluster.openshift.com",
    "port": 80,
    "method": "GET",
    "path": "/1024.html",
    "delay": {
      "min": 0,
      "max":0 
    },
    "keep-alive-requests": 100,
    "clients": 200
  }
]


# Run test for 30 seconds
sh-4.4$ mb -i conf.json -d 30
Time: 30.05s
Sent: 387.46MiB, 12.90MiB/s
Recv: 3.23GiB, 110.18MiB/s
Hits: 2552171, 84944.37/s


# Run for 2 minutes
sh-4.4$ mb -i conf.json -d 120
Time: 120.03s
Sent: 1.50GiB, 12.76MiB/s
Recv: 12.78GiB, 109.03MiB/s
Hits: 10089151, 84053.10/s

Comment 12 errata-xmlrpc 2021-07-27 22:37:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

