+++ This bug was initially created as a clone of Bug #1777040 +++
In the linked bug we pushed a cluster past what openshift-sdn could support, but multus only displayed a generic error in the Kubelet status ("CNI default network not found"). This is a pretty critical message to get right for an admin to know where to look, so there are a few improvements that need to be made:
1. The message should more clearly indicate what the heck a default CNI network means - I suggest something like "Default networking provider for this node has not been registered, unable to assign IPs to pods"
2. It would be good to consider whether multus has enough info to tell us where we should be looking next (the pod or provider) - if not, that's ok, but this is a usability chokepoint. Whether that's changes to CNI (so that partial failure could be reported) or whether it's something the network plugin should be telling multus about at a kube level, or whether it's some other mechanism, it doesn't matter so long as you don't have to infer *where* the default network provider is.
3. There should be a metric and alert for this sort of scenario. In the linked bug the networking provider should be reporting a capacity error (can't allocate any more subnets). But multus should probably be reporting a metric on whether the default network is available, and if more detail is available, why.
Setting to urgent because the message is not understandable to a normal admin, and a normal admin could hit it. It should be crystal clear that the admin needs to go debug the cluster networking provider, and where that provider is.
--- Additional comment from Tomofumi Hayashi on 2019-12-01 23:29:22 EST ---
Is the error message actually "Missing CNI default network", instead of "CNI default network not found"?
--- Additional comment from Tomofumi Hayashi on 2019-12-04 09:09:46 EST ---