In the linked bug we pushed a cluster past what openshift-sdn could support, but multus only displayed a generic error in the Kubelet status ("CNI default network not found"). This is a pretty critical message to get right for an admin to know where to look, so there are a few improvements that need to be made:

1. The message should more clearly indicate what the heck a default CNI network means - I suggest something like "Default networking provider for this node has not been registered, unable to assign IPs to pods"

2. It would be good to consider whether multus has enough info to tell us where we should be looking next (the pod or provider) - if not, that's ok, but this is a usability chokepoint. Whether that's changes to CNI (so that partial failure could be reported) or whether it's something the network plugin should be telling multus about at a kube level, or whether it's some other mechanism, it doesn't matter so long as you don't have to infer *where* the default network provider is.

3. There should be a metric and alert for this sort of scenario. In the linked bug the networking provider should be reporting a capacity error (can't allocate any more subnets). But multus should probably be reporting a metric on whether the default network is available, and if more detail is available, why.

Setting to urgent because the message is not explicable to a normal admin, and a normal admin could hit it. It should be crystal clear that the admin needs to go debug the cluster networking provider and where that is.
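For illustration only, points 1 and 3 could be combined roughly as below. This is a minimal Python sketch, not multus or ocicni code: `default_network_status`, `gauge_value`, and `UNREGISTERED_MESSAGE` are hypothetical names, and the list of file extensions is an assumption about what counts as a CNI configuration file. It shows one way to return both a readable reason and a 0/1 value that a metric and alert could be built on.

```python
import os

# Assumption: file extensions treated as CNI network configuration.
CNI_CONF_EXTENSIONS = (".conf", ".conflist", ".json")

# Wording suggested in this report; hypothetical constant name.
UNREGISTERED_MESSAGE = (
    "Default networking provider for this node has not been registered, "
    "unable to assign IPs to pods"
)


def default_network_status(conf_dir):
    """Return (available, reason) for the default CNI network in conf_dir."""
    try:
        names = os.listdir(conf_dir)
    except OSError as err:
        # The directory is missing or unreadable: say so explicitly.
        return False, f"{UNREGISTERED_MESSAGE} (cannot read {conf_dir}: {err.strerror})"
    confs = sorted(n for n in names if n.endswith(CNI_CONF_EXTENSIONS))
    if not confs:
        # The directory exists but the provider has written nothing yet.
        return False, f"{UNREGISTERED_MESSAGE} (no CNI configuration file in {conf_dir})"
    # By convention the lexically first file defines the default network.
    return True, f"default network configured by {confs[0]}"


def gauge_value(available):
    """Map availability to a 0/1 sample that an alert rule could watch."""
    return 1 if available else 0
```

An alert could then fire when the gauge stays at 0 for some period, with the reason string pointing the admin at the configuration directory rather than at the pod.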
Is the error message actually "Missing CNI default network", instead of "CNI default network not found"?
https://github.com/cri-o/ocicni/pull/66
@Weibin, could you please check whether this bug can be verified?
ocicni PR 66 is merged, but cri-o has not updated its vendored copy of the package yet, so the fix is not released yet.
Tested on 4.4.0-0.nightly-2020-01-06-072200. Still see the old message: "NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network". Should see the new message, as in https://github.com/cri-o/ocicni/pull/66/files#diff-8619c46869a984d0a789a8642c1b7654R64. Waiting for cri-o to update its vendored package.
Still waiting for https://github.com/cri-o/cri-o/pull/3160
@weibin The above PR has been merged. Could you help check whether this bug can be verified? Thanks.
https://github.com/cri-o/cri-o/pull/3160 is merged, but verification failed on 4.4.0-0.nightly-2020-02-04-101225; the old error still appears in the log:

Feb 04 16:30:55 ip-10-0-141-246 hyperkube[1671]: I0204 16:30:55.294828 1671 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-dns", Name:"dns-default-5fgzj", UID:"55da9c88-b51d-4545-8483-d76bfe450be7", APIVersion:"v1", ResourceVersion:"12947", FieldPath:""}): type: 'Warning' reason: 'NetworkNotReady' network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network
@Aniket, I looked at OCP's cri-o repo [1]. The repo uses cri-o 1.17 [1], and cri-o 1.17 does not contain the fix [2]. The PR [3] brings the fix into upstream cri-o, so it should be fixed in OCP 4.5.0, which introduces cri-o 1.18 [4].

[1]: http://pkgs.devel.redhat.com/cgit/rpms/cri-o/tree/cri-o.spec?h=rhaos-4.4-rhel-8#n30
[2]: https://github.com/cri-o/cri-o/blob/release-1.17/vendor/github.com/cri-o/ocicni/pkg/ocicni/ocicni.go#L57
[3]: https://github.com/cri-o/cri-o/commit/437fb7356bd3667794238d3f26a073da75a0a391
[4]: https://github.com/cri-o/cri-o/blob/release-1.18/vendor/github.com/cri-o/ocicni/pkg/ocicni/ocicni.go#L64
@tomo -- Is this in 4.5? Can we move it to ON_QE yet?
*** Bug 1809327 has been marked as a duplicate of this bug. ***
@ben, rhaos-4.5-rhel-8 and rhaos-4.5-rhel-7 have the fix now, so we can move it to ON_QE.
Tested and verified in 4.5.0-0.nightly-2020-05-11-114800. The logs now show the fixed warning message:

May 11 14:40:48 welian-ch6sv-w-b-77g8l.c.openshift-qe.internal hyperkube[1503]: I0511 14:40:48.865749 1503 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-monitoring", Name:"prometheus-k8s-1", UID:"7846227b-30cd-4e41-ae90-927c2a4a48ae", APIVersion:"v1", ResourceVersion:"20128", FieldPath:""}): type: 'Warning' reason: 'NetworkNotReady' network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
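For anyone repeating this verification on another build, the check can be scripted. The following is a rough Python sketch, assuming the kubelet event line ends with the `reason:NetworkPluginNotReady message:...` field exactly as in the logs quoted in this bug; `plugin_message` and `has_new_message` are hypothetical helper names, not part of any OpenShift tooling.

```python
import re

# Captures everything after "message:" in a NetworkPluginNotReady event line.
_MESSAGE_RE = re.compile(r"reason:NetworkPluginNotReady message:(?P<msg>.*)$")

# The pre-fix wording reported earlier in this bug.
OLD_MESSAGE = "Missing CNI default network"


def plugin_message(log_line):
    """Return the network plugin message from a kubelet log line, or None."""
    match = _MESSAGE_RE.search(log_line.strip())
    return match.group("msg") if match else None


def has_new_message(log_line):
    """True if the line carries a plugin message that is not the old wording."""
    msg = plugin_message(log_line)
    return msg is not None and OLD_MESSAGE not in msg
```

Feeding it the journal output for hyperkube on a node (one line at a time) gives a quick pass/fail signal: any line for which `has_new_message` is false but `plugin_message` is not `None` still carries the old text.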
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409