When openshift-sdn is configured with hostPrefix: 23, there can be at most 512 nodes. If the cluster is scaled beyond this limit, the sdn pods crash-loop with:

I1126 19:36:56.745779 9085 node.go:385] Starting openshift-sdn network plugin
W1126 19:36:56.748191 9085 subnets.go:159] Could not find an allocated subnet for node: ip-10-0-134-38.ec2.internal, Waiting...
W1126 19:36:57.750734 9085 subnets.go:159] Could not find an allocated subnet for node: ip-10-0-134-38.ec2.internal, Waiting...
W1126 19:36:59.253661 9085 subnets.go:159] Could not find an allocated subnet for node: ip-10-0-134-38.ec2.internal, Waiting...
W1126 19:37:01.506704 9085 subnets.go:159] Could not find an allocated subnet for node: ip-10-0-134-38.ec2.internal, Waiting...
W1126 19:37:04.884562 9085 subnets.go:159] Could not find an allocated subnet for node: ip-10-0-134-38.ec2.internal, Waiting...
W1126 19:37:09.949598 9085 subnets.go:159] Could not find an allocated subnet for node: ip-10-0-134-38.ec2.internal, Waiting...
W1126 19:37:17.545562 9085 subnets.go:159] Could not find an allocated subnet for node: ip-10-0-134-38.ec2.internal, Waiting...
W1126 19:37:28.938639 9085 subnets.go:159] Could not find an allocated subnet for node: ip-10-0-134-38.ec2.internal, Waiting...
W1126 19:37:46.027031 9085 subnets.go:159] Could not find an allocated subnet for node: ip-10-0-134-38.ec2.internal, Waiting...
W1126 19:38:11.659115 9085 subnets.go:159] Could not find an allocated subnet for node: ip-10-0-134-38.ec2.internal, Waiting...
W1126 19:38:50.105344 9085 subnets.go:159] Could not find an allocated subnet for node: ip-10-0-134-38.ec2.internal, Waiting...
F1126 19:38:50.105373 9085 cmd.go:111] Failed to start sdn: failed to get subnet for this host: ip-10-0-134-38.ec2.internal, error: timed out waiting for the condition

No alert or other top-level signal tells the admin that the subnet capacity has been exhausted and no more nodes can be added.
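The 512-node limit above follows from simple subnet arithmetic: each node is assigned one /hostPrefix subnet carved out of the cluster network CIDR, so the node capacity is 2 ** (hostPrefix - clusterNetworkPrefix). A minimal sketch, assuming the default cluster network of 10.128.0.0/14:

```python
import ipaddress

def max_nodes(cluster_network: str, host_prefix: int) -> int:
    """Number of per-node subnets that fit in the cluster network.

    Each node gets one /host_prefix subnet out of the cluster network,
    so capacity is 2 ** (host_prefix - cluster_network_prefix).
    """
    network = ipaddress.ip_network(cluster_network)
    if host_prefix < network.prefixlen:
        raise ValueError("hostPrefix must be at least the cluster network prefix")
    return 2 ** (host_prefix - network.prefixlen)

# Assumed default cluster network 10.128.0.0/14 with hostPrefix: 23
# gives 2 ** (23 - 14) = 512 node subnets.
print(max_nodes("10.128.0.0/14", 23))  # -> 512
```

The 513th node has no subnet to claim, which is why its sdn pod waits and eventually dies with the fatal timeout shown in the log.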
Both openshift-sdn and ovn-kube need to report a "remaining capacity" metric and raise an alert when a threshold is crossed. Nodes report NotReady due to "CNI default network not found", which is an unhelpful error; multus needs to improve how it communicates such failures.
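The proposed check could be as simple as comparing the two metrics; a minimal sketch, where allocated/available stand in for whatever metric names the implementation chooses (hypothetical, not the shipped names):

```python
def subnet_alert(allocated: int, available: int, threshold: float = 0.8) -> bool:
    """Fire an alert once the fraction of allocated node subnets
    reaches the threshold (hypothetical check, names are illustrative)."""
    if available == 0:
        return True  # no capacity at all is always alert-worthy
    return allocated / available >= threshold

print(subnet_alert(410, 512))  # 410/512 is just over 80% -> True
print(subnet_alert(256, 512))  # 50% allocated -> False
```

In practice this would be expressed as a Prometheus alerting rule over the exported metrics rather than in-process code.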
Hi Aniket, I saw the fix PR is only on the OVN side. Is no change needed for openshift-sdn?
Verified the V4 version of the metrics and alerts on 4.7.0-0.nightly-2021-01-05-055003 on an OVN cluster on AWS:

- available-subnet metric correct for hostPrefix = 26|23|20
- allocated-subnet metric correct as the number of machines increased and decreased
- alerts fired when 80% of the subnets were allocated
- exceeding the allocatable number of subnets results in an alert at 100% allocated, and the surplus nodes are stuck NotReady with this reason:

Ready False Tue, 05 Jan 2021 20:34:18 +0000 Tue, 05 Jan 2021 20:33:07 +0000 KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
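The expected metric values for the tested hostPrefix settings can be worked out directly; a sketch, assuming the default 10.128.0.0/14 cluster network (the actual clusterNetwork of the test cluster is not stated in this comment):

```python
import math

CLUSTER_PREFIX = 14  # assumed default cluster network 10.128.0.0/14

def capacity(host_prefix: int) -> int:
    # One /host_prefix subnet per node, carved out of the /14 cluster network.
    return 2 ** (host_prefix - CLUSTER_PREFIX)

for hp in (26, 23, 20):
    cap = capacity(hp)
    # The alert fires once 80% of the subnets are allocated.
    print(f"hostPrefix={hp}: {cap} subnets, alert from {math.ceil(cap * 0.8)} allocated")
```

So the available-subnet metric should read 4096, 512, and 64 for hostPrefix 26, 23, and 20 respectively under that assumption.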
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633