Bug 1947861 - Cluster node scaleup at higher node counts with OVNKubernetes is taking ~9x more time when compared to OpenShiftSDN
Summary: Cluster node scaleup at higher node counts with OVNKubernetes is taking ~9x more time when compared to OpenShiftSDN
Keywords:
Status: CLOSED DUPLICATE of bug 1958972
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Mohamed Mahmoud
QA Contact: Anurag saxena
URL:
Whiteboard: aos-scalability-48
Depends On:
Blocks:
 
Reported: 2021-04-09 12:40 UTC by Naga Ravi Chaitanya Elluri
Modified: 2023-09-15 01:04 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-12 19:06:24 UTC
Target Upstream Version:
Embargoed:



Description Naga Ravi Chaitanya Elluri 2021-04-09 12:40:26 UTC
Description of problem:
When scaling up the cluster to higher node counts (100 to 250 nodes in this case), the scaleup took 4420.27s with OVNKubernetes as the network plugin, while OpenShiftSDN took only 582s. Cloud throttling is not the cause here, since all of the nodes were provisioned successfully and the majority of the time was spent initializing them, i.e. waiting for them to transition from NotReady to Ready state. Snapshot of the metrics: https://snapshot.raintank.io/dashboard/snapshot/y0xshWE8J42Fh1YaytIPheOT1nSLjRId.

Checking the logs on the node, we see that the ovnkube-node readiness probe keeps failing for a long time:

Apr 08 19:54:43 ip-10-0-244-51 hyperkube[1718]: I0408 19:54:43.344306    1718 prober.go:117] Readiness probe for "ovnkube-node-slzpg_openshift-ovn-kubernetes(65cc039a-2be0-447f-88a1-0aad090ed8e6):ovnkube-node" failed (failure):
Apr 08 19:54:43 ip-10-0-244-51 hyperkube[1718]: E0408 19:54:43.360663    1718 pod_workers.go:191] Error syncing pod 657783b0-4f3b-4c81-88d5-b3399db70184 ("network-check-target-hq276_openshift-network-diagnostics(657783b0-4f3b-4c81-88d5-b>
Apr 08 19:54:44 ip-10-0-244-51 hyperkube[1718]: E0408 19:54:44.360911    1718 pod_workers.go:191] Error syncing pod 0e4f50d8-8d64-415e-9418-2cf4aef0dffd ("network-metrics-daemon-zvmzq_openshift-multus(0e4f50d8-8d64-415e-9418-2cf4aef0dffd>
Apr 08 19:54:45 ip-10-0-244-51 hyperkube[1718]: E0408 19:54:45.361429    1718 pod_workers.go:191] Error syncing pod 657783b0-4f3b-4c81-88d5-b3399db70184 ("network-check-target-hq276_openshift-network-diagnostics(657783b0-4f3b-4c81-88d5-b>
Apr 08 19:54:45 ip-10-0-244-51 hyperkube[1718]: E0408 19:54:45.733135    1718 kubelet.go:2190] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuratio> 

We also observed that ovnkube-master CPU usage spiked to 4 cores during the scaleup, with no objects running on the cluster other than the system pods. We suspect that ovn-controller is taking a long time to create all the flows, and that this will get worse as the number of objects/nodes increases.

Logs including the journal on the nodes, must-gather can be found here: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.8-ovn/scaleup-taking-long-time/

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-08-005413

How reproducible:
Reproduced it twice on the same cluster

Steps to Reproduce:
1. Install a cluster using the 4.8.0-0.nightly-2021-04-08-005413 payload with OVNKubernetes as the network plugin.
2. Scale up the cluster to higher node counts (100 to 250).
3. Measure the time the nodes take to get initialized and transition from NotReady to Ready state (see the measurement sketch after this list).
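
A minimal sketch of how step 3 can be approximated with client-go (my illustration, not the tooling used to obtain the numbers above): for each node, take the difference between the Node object's creation timestamp and the last transition time of its Ready condition. This assumes the Ready condition did not flap after the node first became Ready.

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				// Approximate the NotReady->Ready duration as the gap between
				// node creation and the Ready condition's last transition to True.
				ready := cond.LastTransitionTime.Sub(node.CreationTimestamp.Time)
				fmt.Printf("%s became Ready %v after creation\n", node.Name, ready)
			}
		}
	}
}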

Actual results:
Cluster node scaleup from 100 to 250 nodes took a long time, with ovnkube-master CPU usage reaching up to 4 cores.


Expected results:
Cluster node scaleup time is comparable to OpenShiftSDN, and resource usage is not excessive.

Comment 1 Dan Williams 2021-04-30 14:03:28 UTC
For that specific node, ovnkube-node cannot start (and thus can't write the CNI config file that kubelet needs) because it's waiting for the master to set up the management port and gateway.

http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.8-ovn/scaleup-taking-long-time/must-gather.local.8035632782890018539/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f5fe9f3e165e286809f178f81314e10c35910e4f8783e8835c9b282310a1bf19/namespaces/openshift-ovn-kubernetes/pods/ovnkube-node-slzpg/ovnkube-node/ovnkube-node/logs/previous.log
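
The startup ordering in play here, as a simplified sketch (the annotation key and timeout below are illustrative assumptions for the example, not the actual ovnkube-node code): the node process polls its own Node object until the master-written configuration appears, and only then can it finish bringing up networking and write the CNI config that kubelet is waiting for.

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForMasterSetup illustrates the "node waits for the master" ordering by
// polling the Node object until a master-written annotation shows up. The
// annotation key and timeout are assumptions for this example only.
func waitForMasterSetup(client kubernetes.Interface, nodeName string) error {
	return wait.PollImmediate(500*time.Millisecond, 5*time.Minute, func() (bool, error) {
		node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
		if err != nil {
			return false, nil // keep retrying on transient API errors
		}
		// Until the master advertises per-node setup (e.g. the allocated
		// subnet) via an annotation, the node cannot proceed.
		_, ok := node.Annotations["k8s.ovn.org/node-subnets"]
		return ok, nil
	})
}

func main() {} // the helper above is the illustrative part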

The master isn't doing that, presumably because it is throttling its API requests due to scale:

http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/large-scale/4.8-ovn/scaleup-taking-long-time/must-gather.local.8035632782890018539/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f5fe9f3e165e286809f178f81314e10c35910e4f8783e8835c9b282310a1bf19/namespaces/openshift-ovn-kubernetes/pods/ovnkube-master-8xl2r/ovnkube-master/ovnkube-master/logs/current.log

2021-04-08T21:19:18.385077428Z I0408 21:19:18.385030       1 request.go:591] Throttling request took 554.939041ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-221-113.us-west-2.compute.internal
2021-04-08T21:19:18.424603453Z I0408 21:19:18.424565       1 request.go:591] Throttling request took 554.658583ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-197-40.us-west-2.compute.internal
2021-04-08T21:19:18.465571559Z I0408 21:19:18.465527       1 request.go:591] Throttling request took 555.503692ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-158-116.us-west-2.compute.internal
2021-04-08T21:19:18.504773754Z I0408 21:19:18.504713       1 request.go:591] Throttling request took 554.606836ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-186-203.us-west-2.compute.internal
2021-04-08T21:19:18.544961052Z I0408 21:19:18.544915       1 request.go:591] Throttling request took 555.593164ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-141-102.us-west-2.compute.internal
2021-04-08T21:19:18.585033263Z I0408 21:19:18.584989       1 request.go:591] Throttling request took 555.770779ms, request: GET:https://api-int.perf-ovn.perfscale.devcluster.openshift.com:6443/api/v1/nodes/ip-10-0-210-38.us-west-2.compute.internal

Unfortunately the master logged so much that the time window in question on the node isn't reflected in the master logs. So perhaps fixing the throttling issue would help.
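
The "Throttling request took ..." lines above come from client-go's client-side rate limiter, so the throttling in question is a client configuration knob rather than an apiserver-imposed limit. A minimal sketch of where that knob lives (the QPS/Burst values are purely illustrative, not what ovnkube-master uses or what any eventual fix chose):

package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// client-go defaults to roughly QPS=5, Burst=10; requests beyond that
	// budget are delayed, which is what produces the "Throttling request took"
	// log lines above. The values here are illustrative only.
	config.QPS = 50
	config.Burst = 100

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	_ = client
	fmt.Printf("client-side rate limit: QPS=%v Burst=%d\n", config.QPS, config.Burst)
}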

Comment 4 Mohamed Mahmoud 2021-05-12 19:06:24 UTC

*** This bug has been marked as a duplicate of bug 1958972 ***

Comment 5 Red Hat Bugzilla 2023-09-15 01:04:53 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

