Bug 1950283 - [SCALE] network policies at scale have a long delay until enforcement
Summary: [SCALE] network policies at scale have a long delay until enforcement
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Andrew Stoycos
QA Contact: Mike Fiedler
URL:
Whiteboard:
Duplicates: 1932781
Depends On: 1959058 1959406 1961852 1968569
Blocks:
 
Reported: 2021-04-16 10:11 UTC by Andy Bartlett
Modified: 2021-09-21 23:43 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-21 23:43:11 UTC
Target Upstream Version:
Embargoed:


Attachments
7 NPs together used for scale up test (5.71 KB, text/plain)
2021-04-28 17:36 UTC, Mike Fiedler


Links:
- Github openshift/ovn-kubernetes pull 529 (open): Bug 1950283: policy: release LSP lock before adding pods to default-deny portgroups. Last updated 2021-05-10 16:17:46 UTC
- Github ovn-org/ovn-kubernetes pull 2205 (closed): policy: release LSP lock before adding pods to default-deny portgroups. Last updated 2021-05-10 13:34:15 UTC
- Red Hat Bugzilla 1945725 (urgent, CLOSED): ovn-controller not ready due to error "ovs_list_is_empty(&f->list_node) failed in flood_remove_flows_for_sb_uuid". Last updated 2021-04-29 07:34:08 UTC

Internal Links: 1953680

Description Andy Bartlett 2021-04-16 10:11:12 UTC
Description of problem:

My customer has done extensive testing on OVN-Kubernetes and discovered the following:

After creating around 1,300 pods (with around 60,000 OVS flows) using the most optimized deployment, the ovn-controller containers report long poll intervals and heavy context switching in their logs:
~~~
oc logs ovnkube-node-cnfl6 -n openshift-ovn-kubernetes  -c ovn-controller
2021-04-16T05:47:33Z|05655|timeval|WARN|Unreasonably long 1413ms poll interval (3ms user, 33ms system)
2021-04-16T05:47:33Z|05656|timeval|WARN|faults: 60 minor, 46 major
2021-04-16T05:47:33Z|05657|timeval|WARN|disk: 4672 reads, 0 writes
2021-04-16T05:47:33Z|05658|timeval|WARN|context switches: 49 voluntary, 83 involuntary
2021-04-16T05:47:33Z|05659|coverage|INFO|Skipping details of duplicate event coverage for hash=f64fcb5e
2021-04-16T05:47:44Z|05660|timeval|WARN|Unreasonably long 11145ms poll interval (774ms user, 89ms system)
2021-04-16T05:47:44Z|05661|timeval|WARN|faults: 86 minor, 149 major
2021-04-16T05:47:44Z|05662|timeval|WARN|disk: 23448 reads, 0 writes
2021-04-16T05:47:44Z|05663|timeval|WARN|context switches: 153 voluntary, 1326 involuntary
2021-04-16T05:47:45Z|05664|timeval|WARN|Unreasonably long 1117ms poll interval (2ms user, 24ms system)
2021-04-16T05:47:45Z|05665|timeval|WARN|faults: 29 minor, 66 major
2021-04-16T05:47:45Z|05666|timeval|WARN|disk: 10560 reads, 0 writes
2021-04-16T05:47:45Z|05667|timeval|WARN|context switches: 69 voluntary, 62 involuntary
2021-04-16T05:47:47Z|05668|timeval|WARN|Unreasonably long 1642ms poll interval (1ms user, 34ms system)
2021-04-16T05:47:47Z|05669|timeval|WARN|faults: 47 minor, 79 major
2021-04-16T05:47:47Z|05670|timeval|WARN|disk: 14256 reads, 0 writes
2021-04-16T05:47:47Z|05671|timeval|WARN|context switches: 82 voluntary, 75 involuntary
2021-04-16T05:48:14Z|05672|timeval|WARN|Unreasonably long 1134ms poll interval (0ms user, 22ms system)
2021-04-16T05:48:14Z|05673|timeval|WARN|faults: 61 minor, 56 major
2021-04-16T05:48:14Z|05674|timeval|WARN|disk: 10312 reads, 0 writes
2021-04-16T05:48:14Z|05675|timeval|WARN|context switches: 63 voluntary, 35 involuntary
2021-04-16T05:48:14Z|05676|coverage|INFO|Dropped 3 log messages in last 30 seconds (most recently, 27 seconds ago) due to excessive rate
2021-04-16T05:48:14Z|05677|coverage|INFO|Skipping details of duplicate event coverage for hash=f64fcb5e
2021-04-16T05:48:42Z|05678|timeval|WARN|Unreasonably long 2446ms poll interval (677ms user, 23ms system)
2021-04-16T05:48:42Z|05679|timeval|WARN|faults: 64 minor, 75 major
2021-04-16T05:48:42Z|05680|timeval|WARN|disk: 10888 reads, 0 writes
2021-04-16T05:48:42Z|05681|timeval|WARN|context switches: 80 voluntary, 595 involuntary
~~~
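
For reference, a quick way to quantify how often ovn-controller is stalling is to count these warnings across the nodes. A minimal sketch, assuming the standard app=ovnkube-node pod label (labels can differ between releases):
~~~
# Count the long-poll-interval warnings per ovn-controller container.
for pod in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node -o name); do
  echo -n "$pod: "
  oc -n openshift-ovn-kubernetes logs "$pod" -c ovn-controller | grep -c 'Unreasonably long'
done
~~~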

We could optimize the NetworkPolicy by changing:
~~~
spec:
  ingress:
  - from:
    - ipBlock:
        cidr: 10.3.0.2/23
  - from:
    - ipBlock:
        cidr: 10.1.2.2/23
  - from:
    - ipBlock:
        cidr: 10.0.2.2/23
  podSelector:
    matchLabels:
      example.networkpolicy.allow-ingress: "true"
  policyTypes:
  - Ingress
~~~

to:

~~~ 
spec:
  ingress:
  - from:
    - ipBlock:
        cidr: 10.3.0.2/23
    - ipBlock:
        cidr: 10.1.2.2/23
    - ipBlock:
        cidr: 10.0.2.2/23
  podSelector:
    matchLabels:
      example.networkpolicy.allow-ingress: "true"
  policyTypes:
  - Ingress
~~~
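
Both forms select the same traffic under the Kubernetes NetworkPolicy API (separate ingress rules and separate from entries within one rule are both ORed), so the rewrite changes how ovn-kubernetes processes the policy rather than the intended behavior; as the later comments note, ovn-kubernetes did not handle the first form correctly at this version. A minimal sketch for comparing the two variants side by side (file and policy names below are placeholders):
~~~
# Apply each variant under its own name and compare how the rules are rendered.
oc apply -f np-allow-ingress-split.yaml
oc apply -f np-allow-ingress-merged.yaml
oc describe networkpolicy allow-ingress-split allow-ingress-merged
~~~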

Version-Release number of selected component (if applicable):

OCP 4.7.3

How reproducible:

100%

Steps to Reproduce:
1.
2.
3.

Actual results:

Errors appear at over 50k OVS flows with 1,400 pods on 9 worker nodes.


Expected results:

A much higher pod limit.


Additional info:

Comment 8 Tim Rozet 2021-04-21 20:23:52 UTC
Took an initial pass through the code, looking at the difference between the unoptimized and optimized policies listed. The main difference between these policies is the reduced number of To/From blocks. In ovn-kubernetes, having multiple To/From clauses may result in multiple handlers (threads) being added when a podSelector/namespaceSelector is used in the ingress or egress clause. In this case the policies only use ipBlock. With ipBlock only, it shouldn't matter how many To/From clauses there are: we just iterate through all of them, batch them together, and create a single ACL in OVN that applies to a port group.

From there, whenever a pod is created it gets added to the OVN port group. From the ovn-kube side it doesn't look like there is a bottleneck in this scenario. I think it is more likely a bottleneck in OVN, but I am going to try to reproduce this locally while I wait for the must-gather.
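
One way to check that batching on a live cluster is to look at the policy's ACL in the OVN northbound database. A rough sketch, assuming the 4.7-era pod layout (ovnkube-master pods with an nbdb container); pod labels and container names may differ in other releases:
~~~
# With ipBlock-only rules, all CIDRs from the policy should appear in a single
# allow ACL that is applied to the policy's port group.
# ('10.3.0' is one of the CIDRs from the example policy in the description.)
MASTER=$(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o name | head -n 1)
oc -n openshift-ovn-kubernetes exec "$MASTER" -c nbdb -- \
  ovn-nbctl --no-leader-only list ACL | grep -B2 -A4 '10.3.0'
oc -n openshift-ovn-kubernetes exec "$MASTER" -c nbdb -- \
  ovn-nbctl --no-leader-only list Port_Group
~~~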

Comment 10 Mike Fiedler 2021-04-23 18:20:47 UTC
I can reproduce the WARN messages about unreasonably long poll intervals and involuntary context switches, but I am not seeing any functional issues.   I can easily scale a 9 compute cluster up to 1500 deployments/pods/svc and a 12 compute cluster to 2000 pods.   All pods go to Running, pod startup latency is fairly constant.  Pod annotation latency creeps up slightly while all pods are starting but goes back to normal.   Do we know any more about functional issues for the original reporter?

My cluster setup:
- 9 and then 12 compute nodes on AWS
- 1 namespace with 5 NetworkPolicies similar to the one in the description
- Create 1500 and 2000 pods and watch the ovn-controller logs. Around 1000 pods I start seeing messages like the ones below, but pods start OK and ingress for the services works as expected
- OVS flows go from 8.5K -> 90K after creation of the deployments/pods
- No leader changes

Ideas on what I can do differently to see the customer's functional issue?

2021-04-23T16:59:46Z|02435|timeval|WARN|Unreasonably long 1078ms poll interval (1066ms user, 1ms system)
2021-04-23T16:59:46Z|02436|timeval|WARN|context switches: 0 voluntary, 28 involuntary
2021-04-23T16:59:47Z|02437|timeval|WARN|Unreasonably long 1019ms poll interval (1009ms user, 0ms system)
2021-04-23T16:59:47Z|02438|timeval|WARN|context switches: 0 voluntary, 11 involuntary
2021-04-23T16:59:47Z|02439|poll_loop|INFO|Dropped 9 log messages in last 5 seconds (most recently, 1 seconds ago) due to excessive rate
2021-04-23T16:59:47Z|02440|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (95% CPU usage)
2021-04-23T16:59:49Z|02441|timeval|WARN|Unreasonably long 1148ms poll interval (1136ms user, 0ms system)
2021-04-23T16:59:49Z|02442|timeval|WARN|context switches: 0 voluntary, 21 involuntary


I will continue to add namespaces and policies to try to trigger a functional issue.   

network_logs and network_ovn_trace must-gather location will be added to a private comment.
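
For anyone repeating this, the flow counts above can be sampled directly on a node. A minimal sketch, assuming debug-node access (br-int is the default OVN integration bridge; the node name is a placeholder):
~~~
# Count OpenFlow flows on the integration bridge of one worker node; repeat
# before and after the pod scale-up to see the 8.5K -> 90K growth.
oc debug node/<worker-node> -- chroot /host sh -c 'ovs-ofctl dump-flows br-int | wc -l'
~~~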

Comment 12 Mike Fiedler 2021-04-23 21:05:37 UTC
Second test: 6 namespaces, each with 4 NetworkPolicies, and 2400 pods (400/namespace). 110K flows, and again I saw the WARN messages but no functional issues with the pods.

Comment 13 Dan Winship 2021-04-26 17:20:10 UTC
So... it seems like they originally created the policy in a form that ovn-kubernetes mistakenly doesn't support, saw that it wasn't working, then saw a bunch of OVN warnings and assumed the warnings were related to the NetworkPolicy failure. They then "optimized" the NetworkPolicy, rewriting it into a form that ovn-kubernetes does correctly support, and then things worked.

So is there actually any bug here other than the ipblock-parsing bug? It seems like they just misinterpreted the original failure as being scale-related when it wasn't.

Comment 18 Mike Fiedler 2021-04-28 01:36:56 UTC
I see the context switch and long polling messages in the ovn-controller logs just from starting up 2K pods in a namespace with no NetworkPolicy active.

In my cluster, creating or deleting the policy takes 8-10 seconds to go into or out of effect.

Scaling up from 500 pods to 1000 pods took 2x longer with the policy in place: 1m 6s vs 33s with no policy.

Comment 19 Mike Fiedler 2021-04-28 13:34:47 UTC
I ran one additional test today and will pause there until I get a request to do something more.

Compared scale-up from 500 pods to 2000 pods with and without the customer network policy:
- Without policy: 1m 51s for all pods to go Running. Average CNI ADD request latency across nodes was ~40s (eyeballed from Grafana).
- With policy: 4m 28s for all pods to go Running. Average CNI ADD request latency across nodes was ~1m (eyeballed from Grafana).

With 2000 pods running:
- Applying the customer policy (the 1st policy in the description of this bz): 8-10 s for the policy to go into effect.
- Deleting the customer policy: 2m for the policy to go out of effect.

I will add network_logs must-gather in a follow up comment.
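
The into-effect / out-of-effect times above can be approximated with a simple probe loop. A rough sketch, assuming a client pod whose source IP is outside the allowed ipBlocks and a reachable target in $TARGET (the policy file, pod name, and target are placeholders):
~~~
# Apply the policy, then poll from the non-allowed client until connections are
# refused; the elapsed time approximates the enforcement delay. Reverse the test
# (poll until connections succeed) after deleting the policy.
oc apply -f customer-policy.yaml
start=$(date +%s)
while oc exec client-pod -- curl -s --max-time 1 "$TARGET" >/dev/null; do
  sleep 1
done
echo "policy took $(( $(date +%s) - start ))s to take effect"
~~~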

Comment 22 Mike Fiedler 2021-04-28 17:36:43 UTC
Created attachment 1776830 [details]
7  NPs together used for scale up test

Tested the 500 pod -> 2000 pod scale-up one more time with the 7 NetworkPolicies from Andy's zip (attached).

Scale-up time was 14m 58s vs 1m 51s with no NetworkPolicies. CNI ADD request latency for the run was ~1m 50s.

Comment 24 Andrew Stoycos 2021-05-07 14:10:42 UTC
*** Bug 1932781 has been marked as a duplicate of this bug. ***

Comment 30 Casey Callendrello 2021-05-18 20:45:55 UTC
Not directly related, but I found another bit of lock contention when scaling up many pods with an already-existing NP. Filed 1961852 to fix it.

Comment 31 Casey Callendrello 2021-06-07 15:09:04 UTC
One last fix for this, then things should look a lot better: https://github.com/ovn-org/ovn-kubernetes/pull/2249

