Description of problem: My customer has done extensive testing on OVN-Kubernetes and discovered the following: after creating around 1300 pods (with around 60,000 flows) with the most optimized deployment, we see that the ovn pods get context switched, as shown in their logs:
~~~
oc logs ovnkube-node-cnfl6 -n openshift-ovn-kubernetes -c ovn-controller
2021-04-16T05:47:33Z|05655|timeval|WARN|Unreasonably long 1413ms poll interval (3ms user, 33ms system)
2021-04-16T05:47:33Z|05656|timeval|WARN|faults: 60 minor, 46 major
2021-04-16T05:47:33Z|05657|timeval|WARN|disk: 4672 reads, 0 writes
2021-04-16T05:47:33Z|05658|timeval|WARN|context switches: 49 voluntary, 83 involuntary
2021-04-16T05:47:33Z|05659|coverage|INFO|Skipping details of duplicate event coverage for hash=f64fcb5e
2021-04-16T05:47:44Z|05660|timeval|WARN|Unreasonably long 11145ms poll interval (774ms user, 89ms system)
2021-04-16T05:47:44Z|05661|timeval|WARN|faults: 86 minor, 149 major
2021-04-16T05:47:44Z|05662|timeval|WARN|disk: 23448 reads, 0 writes
2021-04-16T05:47:44Z|05663|timeval|WARN|context switches: 153 voluntary, 1326 involuntary
2021-04-16T05:47:45Z|05664|timeval|WARN|Unreasonably long 1117ms poll interval (2ms user, 24ms system)
2021-04-16T05:47:45Z|05665|timeval|WARN|faults: 29 minor, 66 major
2021-04-16T05:47:45Z|05666|timeval|WARN|disk: 10560 reads, 0 writes
2021-04-16T05:47:45Z|05667|timeval|WARN|context switches: 69 voluntary, 62 involuntary
2021-04-16T05:47:47Z|05668|timeval|WARN|Unreasonably long 1642ms poll interval (1ms user, 34ms system)
2021-04-16T05:47:47Z|05669|timeval|WARN|faults: 47 minor, 79 major
2021-04-16T05:47:47Z|05670|timeval|WARN|disk: 14256 reads, 0 writes
2021-04-16T05:47:47Z|05671|timeval|WARN|context switches: 82 voluntary, 75 involuntary
2021-04-16T05:48:14Z|05672|timeval|WARN|Unreasonably long 1134ms poll interval (0ms user, 22ms system)
2021-04-16T05:48:14Z|05673|timeval|WARN|faults: 61 minor, 56 major
2021-04-16T05:48:14Z|05674|timeval|WARN|disk: 10312 reads, 0 writes
2021-04-16T05:48:14Z|05675|timeval|WARN|context switches: 63 voluntary, 35 involuntary
2021-04-16T05:48:14Z|05676|coverage|INFO|Dropped 3 log messages in last 30 seconds (most recently, 27 seconds ago) due to excessive rate
2021-04-16T05:48:14Z|05677|coverage|INFO|Skipping details of duplicate event coverage for hash=f64fcb5e
2021-04-16T05:48:42Z|05678|timeval|WARN|Unreasonably long 2446ms poll interval (677ms user, 23ms system)
2021-04-16T05:48:42Z|05679|timeval|WARN|faults: 64 minor, 75 major
2021-04-16T05:48:42Z|05680|timeval|WARN|disk: 10888 reads, 0 writes
2021-04-16T05:48:42Z|05681|timeval|WARN|context switches: 80 voluntary, 595 involuntary
~~~

We could optimize the NetworkPolicy by changing:
~~~
spec:
  ingress:
  - from:
    - ipBlock:
        cidr: 10.3.0.2/23
  - from:
    - ipBlock:
        cidr: 10.1.2.2/23
  - from:
    - ipBlock:
        cidr: 10.0.2.2/23
  podSelector:
    matchLabels:
      example.networkpolicy.allow-ingress: "true"
  policyTypes:
  - Ingress
~~~

to:
~~~
spec:
  ingress:
  - from:
    - ipBlock:
        cidr: 10.3.0.2/23
    - ipBlock:
        cidr: 10.1.2.2/23
    - ipBlock:
        cidr: 10.0.2.2/23
  podSelector:
    matchLabels:
      example.networkpolicy.allow-ingress: "true"
  policyTypes:
  - Ingress
~~~

Version-Release number of selected component (if applicable): OCP 4.7.3

How reproducible: 100%

Steps to Reproduce:
1.
2.
3.

Actual results: Errors at over 50k OVS flows with 1400 pods; this is with 9 worker nodes.

Expected results: A much higher pod limit.

Additional info:
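For reference, the two specs above should be semantically equivalent per the Kubernetes NetworkPolicy API: multiple `ingress` rules are OR'd together, and multiple peers within a single `from` list are also OR'd. A minimal sketch using Python's standard `ipaddress` module shows the set of allowed sources either way (the test IPs are made up for illustration):

```python
import ipaddress

# The three CIDRs from the policies above; whether they appear as
# three separate from-clauses or one from-clause with three peers,
# the allowed source set is the union of all of them.
cidrs = ["10.3.0.2/23", "10.1.2.2/23", "10.0.2.2/23"]
networks = [ipaddress.ip_network(c, strict=False) for c in cidrs]

def allowed(src_ip):
    """True if src_ip matches any ipBlock peer (rules and peers are OR'd)."""
    ip = ipaddress.ip_address(src_ip)
    return any(ip in net for net in networks)

print(allowed("10.3.1.200"))  # inside 10.3.0.0/23 -> True
print(allowed("10.2.0.1"))    # covered by no block -> False
```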
Took an initial pass through the code looking at the difference between the unoptimized and optimized policies listed. The main difference between these policies is the reduction in the number of To/From blocks. In ovn-kubernetes, having multiple To/From clauses may result in multiple handlers (threads) being added if a podSelector/namespaceSelector is used in the ingress or egress clause. In this case the policies are only using ipBlock. With ipBlock only, it shouldn't matter how many To/From clauses there are: we just iterate through all of them, batch them together, and create a single ACL in OVN which applies to a port group. From there, whenever a pod is created it gets added to the OVN port group. From the ovn-kube side it doesn't look like there is a bottleneck in this scenario. I think it is more likely a bottleneck in OVN, but I'm going to try to reproduce this locally while I wait for must-gather.
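The batching described above can be sketched roughly as follows. This is a simplified, hypothetical stand-in for what ovn-kubernetes actually does, not its real code; `build_acl_match` is an invented helper, though the `ip4.src == {...}` set-match style mirrors OVN's logical match syntax:

```python
# Hypothetical sketch: collapse all ipBlock peers from every from-clause
# into a single OVN-style ACL match, regardless of how many clauses
# the CIDRs were spread across.
def build_acl_match(ingress_rules):
    cidrs = []
    for rule in ingress_rules:             # each "- from:" clause
        for peer in rule.get("from", []):  # each peer within the clause
            if "ipBlock" in peer:
                cidrs.append(peer["ipBlock"]["cidr"])
    return "ip4.src == {" + ", ".join(cidrs) + "}"

# Three CIDRs split across two from-clauses still yield one ACL match.
rules = [
    {"from": [{"ipBlock": {"cidr": "10.3.0.2/23"}}]},
    {"from": [{"ipBlock": {"cidr": "10.1.2.2/23"}},
              {"ipBlock": {"cidr": "10.0.2.2/23"}}]},
]
print(build_acl_match(rules))
# -> ip4.src == {10.3.0.2/23, 10.1.2.2/23, 10.0.2.2/23}
```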
I can reproduce the WARN messages about unreasonably long poll intervals and involuntary context switches, but I am not seeing any functional issues. I can easily scale a 9-compute cluster up to 1500 deployments/pods/svc and a 12-compute cluster to 2000 pods. All pods go to Running, and pod startup latency is fairly constant. Pod annotation latency creeps up slightly while all pods are starting but goes back to normal. Do we know any more about functional issues for the original reporter?

My setup:
- a cluster with 9 and then 12 computes on AWS
- 1 namespace with 5 NetworkPolicy objects similar to the one in the description
- create 1500 and 2000 pods and watch the ovn-controller logs

Around 1000 pods I start seeing messages like those below, but pods start ok and ingress for the services works as expected. OVS flows go from 8.5K -> 90K after the creation of the deployments/pods; no leader changes. Ideas on what I can do differently to see the customer's functional issue?

~~~
2021-04-23T16:59:46Z|02435|timeval|WARN|Unreasonably long 1078ms poll interval (1066ms user, 1ms system)
2021-04-23T16:59:46Z|02436|timeval|WARN|context switches: 0 voluntary, 28 involuntary
2021-04-23T16:59:47Z|02437|timeval|WARN|Unreasonably long 1019ms poll interval (1009ms user, 0ms system)
2021-04-23T16:59:47Z|02438|timeval|WARN|context switches: 0 voluntary, 11 involuntary
2021-04-23T16:59:47Z|02439|poll_loop|INFO|Dropped 9 log messages in last 5 seconds (most recently, 1 seconds ago) due to excessive rate
2021-04-23T16:59:47Z|02440|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (95% CPU usage)
2021-04-23T16:59:49Z|02441|timeval|WARN|Unreasonably long 1148ms poll interval (1136ms user, 0ms system)
2021-04-23T16:59:49Z|02442|timeval|WARN|context switches: 0 voluntary, 21 involuntary
~~~

I will continue to add namespaces and policies to try to trigger a functional issue. The network_logs and network_ovn_trace must-gather location will be added to a private comment.
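For what it's worth, the poll-interval figures in those messages can be pulled out mechanically when eyeballing a long log. A small sketch; the regex is an assumption based on the `timeval` WARN line format shown above, and the sample log is the one from this comment:

```python
import re

# Sample timeval WARN lines from the ovn-controller log above.
log = """\
2021-04-23T16:59:46Z|02435|timeval|WARN|Unreasonably long 1078ms poll interval (1066ms user, 1ms system)
2021-04-23T16:59:47Z|02437|timeval|WARN|Unreasonably long 1019ms poll interval (1009ms user, 0ms system)
2021-04-23T16:59:49Z|02441|timeval|WARN|Unreasonably long 1148ms poll interval (1136ms user, 0ms system)
"""

# Extract each reported poll interval in milliseconds.
intervals = [int(m.group(1))
             for m in re.finditer(r"Unreasonably long (\d+)ms poll interval", log)]

print(max(intervals))                       # worst poll interval, ms -> 1148
print(sum(intervals) // len(intervals))     # mean poll interval, ms  -> 1081
```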
Second test: 6 namespaces, each with 4 NetworkPolicy objects, and 2400 pods (400/namespace). 110K flows, and again I saw the WARN messages, but no functional issues with the pods.
So... it seems like they originally created the policy in the form that ovn-kubernetes mistakenly doesn't support, and saw that it wasn't working, and then saw a bunch of ovn warnings, and assumed that the warnings were related to the NetworkPolicy failure. Then they "optimized" the NetworkPolicy, rewriting it into a form that ovn-kubernetes does correctly support, and then things worked. So is there actually any bug here other than the ipblock-parsing bug? It seems like they just misinterpreted the original failure as being scale-related when it wasn't.
I see the context switch and long-polling messages in the ovn-controller logs just from starting up 2K pods in a namespace with no NetworkPolicy active. In my cluster, I see 8-10 second delays when creating and deleting the policy for it to go into and out of effect. Scaling up from 500 pods to 1000 pods took 2x longer with the policy in place: 1m 6s vs 33s with no policy.
I ran one additional test today and will pause there until I get a request to do something more. Compared scale-up from 500 pods to 2000 pods with and without the customer network policy:

- without policy: 1m 51s for all pods to go Running. Average CNI request ADD latency across nodes was ~40s (eyeballed from Grafana)
- with policy: 4m 28s for all pods to go Running. Average CNI request ADD latency across nodes was ~1m (eyeballed from Grafana)

With 2000 pods running, apply the customer policy (1st policy in the description of this bz):

- apply the policy: 8-10s for the policy to go into effect
- delete the policy: 2m for the policy to go out of effect

I will add the network_logs must-gather in a follow-up comment.
Created attachment 1776830 [details] 7 NPs together used for scale up test

Tested the 500 pod -> 2000 pod scale-up one more time with the 7 NetworkPolicies from Andy's zip (attached). Scale-up time was 14m 58s vs 1m 51s with no NetworkPolicies. CNI request ADD latency for the run was ~1m 50s.
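Putting the wall-clock times reported across these tests side by side as slowdown factors (simple arithmetic on the figures above):

```python
# Scale-up times (500 -> 2000 pods) reported in the comments above, in seconds.
no_policy      = 1 * 60 + 51    # 1m 51s, no NetworkPolicies
one_policy     = 4 * 60 + 28    # 4m 28s, single customer policy
seven_policies = 14 * 60 + 58   # 14m 58s, 7 NPs from the attachment

print(round(one_policy / no_policy, 1))       # ~2.4x slower with one policy
print(round(seven_policies / no_policy, 1))   # ~8.1x slower with seven policies
```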
*** Bug 1932781 has been marked as a duplicate of this bug. ***
Not directly related, but I found another bit of lock contention when scaling up many pods with an already-existing NP. Filed 1961852 to fix it.
One last fix for this, then things should look a lot better: https://github.com/ovn-org/ovn-kubernetes/pull/2249