Description of problem:

When scaling the cluster up to higher node counts (100 -> 250 nodes in this case), the scaleup takes 4420.27s with OVNKubernetes as the network plugin, while OpenShiftSDN took only 582s. Cloud throttling is not the cause here: all of the nodes were provisioned successfully, and the majority of the time was spent waiting for them to transition from NotReady to Ready state.

Snapshot of the metrics: https://snapshot.raintank.io/dashboard/snapshot/y0xshWE8J42Fh1YaytIPheOT1nSLjRId

Checking the logs on the node, we see that the readiness probe for OVN keeps failing for a while:

Apr 08 19:54:43 ip-10-0-244-51 hyperkube[1718]: I0408 19:54:43.344306 1718 prober.go:117] Readiness probe for "ovnkube-node-slzpg_openshift-ovn-kubernetes(65cc039a-2be0-447f-88a1-0aad090ed8e6):ovnkube-node" failed (failure):
Apr 08 19:54:43 ip-10-0-244-51 hyperkube[1718]: E0408 19:54:43.360663 1718 pod_workers.go:191] Error syncing pod 657783b0-4f3b-4c81-88d5-b3399db70184 ("network-check-target-hq276_openshift-network-diagnostics(657783b0-4f3b-4c81-88d5-b>
Apr 08 19:54:44 ip-10-0-244-51 hyperkube[1718]: E0408 19:54:44.360911 1718 pod_workers.go:191] Error syncing pod 0e4f50d8-8d64-415e-9418-2cf4aef0dffd ("network-metrics-daemon-zvmzq_openshift-multus(0e4f50d8-8d64-415e-9418-2cf4aef0dffd>
Apr 08 19:54:45 ip-10-0-244-51 hyperkube[1718]: E0408 19:54:45.361429 1718 pod_workers.go:191] Error syncing pod 657783b0-4f3b-4c81-88d5-b3399db70184 ("network-check-target-hq276_openshift-network-diagnostics(657783b0-4f3b-4c81-88d5-b>
Apr 08 19:54:45 ip-10-0-244-51 hyperkube[1718]: E0408 19:54:45.733135 1718 kubelet.go:2190] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuratio>

We also observed that ovnkube-master CPU usage spiked up to 4 cores during the scaleup, with no objects running on the cluster other than the system pods.
We suspect that ovn-controller is taking a long time to create all the flows, and this will get worse as the number of objects/nodes increases. Logs, including the journal on the nodes, and the must-gather can be found here: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.8-ovn/scaleup-taking-long-time/

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-08-005413

How reproducible:
Reproduced twice on the same cluster

Steps to Reproduce:
1. Install a cluster using the 4.8.0-0.nightly-2021-04-08-005413 payload with OVN as the network plugin.
2. Scale the cluster up to higher node counts (100 -> 250).
3. Measure the time taken by the nodes to get initialized/transition from NotReady to Ready state.

Actual results:
Cluster node scaleup from 100 to 250 nodes took a long time, with ovnkube-master CPU usage reaching up to 4 cores.

Expected results:
Cluster node scaleup time is comparable to OpenShiftSDN and is not heavy on resource usage.
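The NotReady -> Ready transition time in step 3 can be derived from each Node object's metadata.creationTimestamp and the lastTransitionTime of its Ready condition (e.g. pulled via `oc get nodes -o json`). A minimal sketch of the calculation; the node names and timestamps below are illustrative, not taken from this cluster:

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used in Node objects

def ready_duration(created, ready_transition):
    """Seconds from node creation until its Ready condition last became True."""
    return (datetime.strptime(ready_transition, FMT)
            - datetime.strptime(created, FMT)).total_seconds()

# Illustrative (creationTimestamp, Ready lastTransitionTime) pairs.
nodes = {
    "node-a": ("2021-04-08T19:40:00Z", "2021-04-08T19:54:50Z"),
    "node-b": ("2021-04-08T19:40:00Z", "2021-04-08T19:42:10Z"),
}
for name, (created, ready) in sorted(nodes.items()):
    print(f"{name}: {ready_duration(created, ready):.0f}s to Ready")
```

Comparing the distribution of these durations between an OVN cluster and an SDN cluster at the same node count makes the regression easy to quantify per node rather than only in aggregate.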
For that specific node, ovnkube-node cannot start (and thus cannot write the CNI config file that kubelet needs) because it's waiting for the master to set up the management port and gateway:

http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.8-ovn/scaleup-taking-long-time/must-gather.local.8035632782890018539/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f5fe9f3e165e286809f178f81314e10c35910e4f8783e8835c9b282310a1bf19/namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-slzpg/ovnkube-node/ovnkube-node/logs/previous.log

The master isn't doing that, presumably because it's throttling requests due to scale:

http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.8-ovn/scaleup-taking-long-time/must-gather.local.8035632782890018539/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f5fe9f3e165e286809f178f81314e10c35910e4f8783e8835c9b282310a1bf19/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-8xl2r/ovnkube-master/ovnkube-master/logs/current.log

2021-04-08T21:19:18.385077428Z I0408 21:19:18.385030 1 request.go:591] Throttling request took 554.939041ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-221-113.us-west-2.compute.internal
2021-04-08T21:19:18.424603453Z I0408 21:19:18.424565 1 request.go:591] Throttling request took 554.658583ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-197-40.us-west-2.compute.internal
2021-04-08T21:19:18.465571559Z I0408 21:19:18.465527 1 request.go:591] Throttling request took 555.503692ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-158-116.us-west-2.compute.internal
2021-04-08T21:19:18.504773754Z I0408 21:19:18.504713 1 request.go:591] Throttling request took 554.606836ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-186-203.us-west-2.compute.internal
2021-04-08T21:19:18.544961052Z I0408 21:19:18.544915 1 request.go:591] Throttling request took 555.593164ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-141-102.us-west-2.compute.internal
2021-04-08T21:19:18.585033263Z I0408 21:19:18.584989 1 request.go:591] Throttling request took 555.770779ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-210-38.us-west-2.compute.internal

Unfortunately the master logged so much that the time window in question from the node isn't reflected in the master logs. So perhaps fixing the throttling issue would help.
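The "Throttling request took ..." messages come from client-go's client-side token-bucket rate limiter: once the burst of tokens is spent, each queued request waits for the next token, so under sustained load waits pile up in the client before the request ever reaches the apiserver. A minimal model of that queueing behavior (this is an illustrative simulation, not client-go's actual implementation; the default QPS=5/burst=10 is an assumption about the client configuration in this build):

```python
def token_bucket_waits(arrival_times, qps, burst):
    """Per-request wait under a client-side token-bucket limiter.

    Tokens refill at `qps` per second up to `burst`; each request consumes
    one token, waiting (FIFO) if none is available. `arrival_times` must be
    sorted ascending. Returns the wait incurred by each request.
    """
    waits, tokens, last = [], float(burst), 0.0
    for t in arrival_times:
        # Refill for elapsed time; a negative delta encodes backlog from
        # earlier queued requests, pushing this request further out.
        tokens = min(float(burst), tokens + (t - last) * qps)
        if tokens >= 1.0:
            wait = 0.0
            tokens -= 1.0
            last = t
        else:
            wait = (1.0 - tokens) / qps  # time until the next token is free
            tokens = 0.0
            last = t + wait
        waits.append(wait)
    return waits

# 15 requests fired at once against QPS=5/burst=10: the first 10 are free,
# then each subsequent request queues 200ms behind the previous one.
print(token_bucket_waits([0.0] * 15, qps=5, burst=10))
```

This is consistent with the flat ~555ms waits in the log: the master is issuing per-node GETs faster than the limiter refills, so every request sits behind a roughly constant backlog. Raising the client's QPS/burst, or replacing the per-node GETs with informer-cache lookups, would shrink or remove those waits.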
*** This bug has been marked as a duplicate of bug 1958972 ***