Bug 1923157

Summary: Ingress traffic performance drop due to NodePort services
Product: OpenShift Container Platform
Reporter: Raul Sevilla <rsevilla>
Component: Networking
Assignee: Alexander Constantinescu <aconstan>
Networking sub component: ovn-kubernetes
QA Contact: Kedar Kulkarni <kkulkarn>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aconstan, anbhat, jnordell, mark.d.gray, rbrattai, rsevilla, scuppett, sreber, vrutkovs
Version: 4.7
Keywords: Performance, TestBlocker
Target Milestone: ---
Target Release: 4.8.0
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1937238 (view as bug list)
Environment:
Last Closed: 2021-07-27 22:37:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1937238

Attachments:
  nftables ruleset from infrastructure node (flags: none)
  Perf profile from the infrastructure node while running test in 500 NodePort services scenarios (flags: none)

Description Raul Sevilla 2021-02-01 13:37:17 UTC
Created attachment 1752834 [details]
nftables ruleset from infrastructure node

In OVNKubernetes-based clusters, there is a big HTTP ingress performance drop when the cluster has a relatively high number of NodePort services (this performance drop is not present in OpenShiftSDN scenarios). Our first findings show that the degradation grows proportionally with the number of NodePort services available in the cluster. It's important to note that this ingress performance degradation happens even when these NodePort services are not backing the ingress route we use for the test, which is the case for the examples throughout this message.

The examples below come from a 3 worker node (m5.2xlarge) cluster with one ingress controller running on an infra node (m5.12xlarge).
The client pod (mb) runs on a different node (m5.4xlarge) using HostNetwork. The test runs the mb client against 1 HTTP route backed by 1 ClusterIP service, which is in turn backed by 100 nginx pods serving static content.
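
(For reference, a backend and route along these lines could be stood up as follows; this is only a sketch to illustrate the topology, and the project name, deployment name and image are placeholders rather than the exact manifests used for this test:)

$ oc new-project http-perf
$ oc create deployment nginx --image=nginxinc/nginx-unprivileged -n http-perf
$ oc scale deployment nginx --replicas=100 -n http-perf
$ oc expose deployment nginx --port=8080 -n http-perf     # ClusterIP service in front of the pods
$ oc expose service nginx -n http-perf                    # HTTP route in front of the service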

This is the client configuration:

sh-4.2$ cat conf.json 
[  
  {
    "scheme": "http",
    "tls-session-reuse": true,
    "host": "nginx-http-perf.apps.rsevilla-ovn-4.7.perfscale.devcluster.openshift.com",
    "port": 80,
    "method": "GET",
    "path": "/1024.html",
    "delay": {
      "min": 0,
      "max":0 
    },
    "keep-alive-requests": 100,
    "clients": 200
  }
]



---- Baseline performance, no NodePort services present in the cluster ----


$ oc get svc -A | grep -c NodePort
0

# Number of lflows and OpenFlow flows from the OVN sbdb container
sh-4.4# ovn-sbctl --no-leader-only dump-flows  | wc -l 
10617
sh-4.4
sh-4.4# ovs-ofctl dump-flows br-int | wc -l
25660

# Run the test
sh-4.2$ mb -i conf.json -d 30
Time: 30.10s
Sent: 369.80MiB, 12.29MiB/s
Recv: 3.19GiB, 108.37MiB/s
Hits: 2514855, 83552.04/s


As you can see above, we reached an average of 83552 reqs/sec.

The following perf trace was taken while running the workload and shows the top 10 kernel functions from the node running the ingress controller.
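
(For context, a profile like this can be captured on the node with standard perf tooling, e.g. a system-wide sample with call graphs for the duration of the run; the exact invocation is not recorded in this BZ, but something along these lines works:)

sh-5.0# perf record -a -g -- sleep 30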


sh-5.0# perf report --stdio | head -14
# Overhead  Command          Shared Object                 Symbol                                                                         
# ........  ...............  ............................  ...............................................................................
#
     6.39%  swapper          [kernel.kallsyms]             [k] masked_flow_lookup
     3.81%  haproxy          [kernel.kallsyms]             [k] masked_flow_lookup
     2.68%  swapper          [kernel.kallsyms]             [k] intel_idle
     2.04%  swapper          [kernel.kallsyms]             [k] nft_do_chain
     1.96%  haproxy          [kernel.kallsyms]             [k] do_syscall_64
     1.28%  haproxy          [kernel.kallsyms]             [k] nft_do_chain
     1.26%  swapper          [kernel.kallsyms]             [k] __nf_conntrack_find_get
     1.25%  haproxy          [kernel.kallsyms]             [k] entry_SYSCALL_64
     1.04%  haproxy          [kernel.kallsyms]             [k] syscall_return_via_sysret
     0.83%  swapper          [kernel.kallsyms]             [k] _raw_spin_lock



---- Degraded performance, 100 NodePort services present in the cluster ----

$ oc get svc -A | grep -c NodePort
100
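
(The NodePort services themselves can be created with a simple loop along these lines; a sketch only, where the namespace, names and image are placeholders rather than the exact objects used in this test:)

$ for i in $(seq 1 100); do
    oc create deployment np-dummy-${i} --image=nginxinc/nginx-unprivileged -n nodeport-test
    oc expose deployment np-dummy-${i} --type=NodePort --port=8080 -n nodeport-test
  done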

# Number of lflows and OpenFlow flows from the OVN sbdb container
sh-4.4# ovs-ofctl dump-flows br-int | wc -l
29664
sh-4.4# ovn-sbctl --no-leader-only dump-flows  | wc -l 
12617

# Run the test
sh-4.2$ mb -i conf.json -d 30
Time: 30.10s
Sent: 266.91MiB, 8.87MiB/s
Recv: 2.30GiB, 78.22MiB/s
Hits: 1815166, 60305.49/s

Performance has dropped to an average of 60305 requests/sec, roughly 72% of the baseline.


Now the perf report shows a remarkable increase in the time spent in nftables-related functions:

sh-5.0# perf report --stdio | head -14
# Overhead  Command          Shared Object                 Symbol                                                                         
# ........  ...............  ............................  ...............................................................................
#
   6.29%  swapper          [kernel.kallsyms]             [k] nft_do_chain
   5.26%  haproxy          [kernel.kallsyms]             [k] nft_do_chain
   4.10%  swapper          [kernel.kallsyms]             [k] masked_flow_lookup
   2.47%  swapper          [kernel.kallsyms]             [k] __nft_match_eval.isra.4
   2.39%  swapper          [kernel.kallsyms]             [k] nft_meta_get_eval
   2.32%  haproxy          [kernel.kallsyms]             [k] nft_meta_get_eval
   2.29%  haproxy          [kernel.kallsyms]             [k] masked_flow_lookup
   2.23%  haproxy          [kernel.kallsyms]             [k] __nft_match_eval.isra.4
   2.12%  haproxy          [kernel.kallsyms]             [k] tcp_mt
   2.09%  swapper          [kernel.kallsyms]             [k] tcp_mt


---- Further degraded performance, 500 NodePort services present in the cluster ----

If we repeat the same test after creating 500 NodePort services, the client reports even lower performance.

$ oc get svc -A | grep -c NodePort
500


# Number of lflows and OpenFlow flows from the OVN sbdb container
sh-4.4# ovn-sbctl --no-leader-only dump-flows  | wc -l 
33817
sh-4.4# ovs-ofctl dump-flows br-int | wc -l
75829


# Run the test
sh-4.2$ mb -i conf.json -d 30
Time: 30.05s
Sent: 119.55MiB, 3.98MiB/s
Recv: 1.03GiB, 35.09MiB/s
Hits: 813047, 27055.88/s

The throughput is now 27055.88 reqs/sec, which is roughly 32% of the baseline throughput (27055.88 / 83552.04 ≈ 0.32).


And the time spent in nftables functions has grown considerably:

sh-5.0# perf report --stdio | head -14
# Overhead  Command          Shared Object                 Symbol                                                                         
# ........  ...............  ............................  ...............................................................................
#
  12.26%  swapper          [kernel.kallsyms]             [k] nft_do_chain
  11.66%  haproxy          [kernel.kallsyms]             [k] nft_do_chain
   5.38%  swapper          [kernel.kallsyms]             [k] __nft_match_eval.isra.4
   4.99%  haproxy          [kernel.kallsyms]             [k] __nft_match_eval.isra.4
   4.84%  swapper          [kernel.kallsyms]             [k] tcp_mt
   4.84%  swapper          [kernel.kallsyms]             [k] nft_meta_get_eval
   4.54%  haproxy          [kernel.kallsyms]             [k] nft_meta_get_eval
   4.51%  haproxy          [kernel.kallsyms]             [k] tcp_mt
   2.38%  swapper          [kernel.kallsyms]             [k] masked_flow_lookup
   2.00%  haproxy          [kernel.kallsyms]             [k] nft_match_eval
   1.96%  swapper          [kernel.kallsyms]             [k] nft_match_eval
   1.24%  haproxy          [kernel.kallsyms]             [k] masked_flow_lookup


Attached to this BZ you can find the nftables ruleset from the infrastructure node (running the HAProxy router) and the latest perf profile (with 500 NodePort services).

Comment 1 Raul Sevilla 2021-02-01 13:39:23 UTC
Created attachment 1752841 [details]
Perf profile from the infrastructure node while running test in 500 NodePort services scenarios

Comment 2 Alexander Constantinescu 2021-02-02 18:21:50 UTC
Working with Raul, we've managed to figure out what the problem is.

The cluster used for this scale assessment has a lot of NodePort services, and even though they're not targeted by the perf tests (which in fact target an OCP route, whose traffic HAProxy sends directly to the endpoints), ovnkube-node's iptables rules in the "filter FORWARD" chain create a lot of unnecessary lookups and packet matching in nftables, leading to the spike in nft_do_chain.
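
(For anyone who wants to confirm this on a node: the per-NodePort rule growth is directly visible in the FORWARD chain; generic inspection commands, not taken from this BZ, would be something like:)

  nft list chain ip filter FORWARD
  iptables -t filter -L FORWARD -n -v | wc -l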

I will soon submit a patch with a fix, hence setting the target release to 4.7. In case the patch takes time to integrate, I'll move it out to 4.8.

Comment 3 Stephen Cuppett 2021-02-08 14:22:57 UTC
We are past Code Freeze on 4.7.0. I'm going to set this with a target of 4.8.0 to avoid confusion when the PR lands. 

We absolutely need a clone/backport of this, though. How far back does it need to go, i.e. how far back is it impacted? 4.6?

Comment 5 Alexander Constantinescu 2021-02-09 09:27:21 UTC
Yes, this will most likely need to be backported to 4.6 once done. Excuse me for forgetting to set the target release to 4.8; I was hoping the PR would go in before code freeze.

Comment 6 Alexander Constantinescu 2021-03-05 09:09:02 UTC
This made it in with the latest downstream merge (https://github.com/openshift/ovn-kubernetes/pull/440), so setting to MODIFIED.

Comment 8 Raul Sevilla 2021-03-05 15:06:23 UTC
Starting perfscale testing from my side. Will update with the results I get

Comment 9 Raul Sevilla 2021-03-10 00:09:55 UTC
The patch works fine and improves the performance in this scenario.

In a similar cluster, I got the following results:

# Created 100 NodePort services targeting one pod each
[root@ip-172-31-33-151 workloads]# oc get svc -A | grep  -c NodePort
100


# mb config targeting an HTTP route backed by 100 nginx pods
sh-4.4$ cat conf.json 
[  
  {
    "scheme": "http",
    "tls-session-reuse": true,
    "host": "nginx-http-scale-http.apps.rsevilla-ovn-48.perfscale.devcluster.openshift.com",
    "port": 80,
    "method": "GET",
    "path": "/1024.html",
    "delay": {
      "min": 0,
      "max":0 
    },
    "keep-alive-requests": 100,
    "clients": 200
  }
]


# Run test for 30 seconds
sh-4.4$ mb -i conf.json -d 30
Time: 30.05s
Sent: 387.46MiB, 12.90MiB/s
Recv: 3.23GiB, 110.18MiB/s
Hits: 2552171, 84944.37/s


# Run for 2 minutes
sh-4.4$ mb -i conf.json -d 120
Time: 120.03s
Sent: 1.50GiB, 12.76MiB/s
Recv: 12.78GiB, 109.03MiB/s
Hits: 10089151, 84053.10/s

Comment 12 errata-xmlrpc 2021-07-27 22:37:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438